QPS Limit Exceeded on EKS Start-up: The Image Pull Thundering Herd

From the dev.to devops feed: a short note so I don't lose it.

I scaled our dev EKS cluster down to zero overnight to save cost. The next morning it didn’t come back up cleanly: pods got stuck and the events were full of “QPS limit exceeded”. The cause wasn’t the automation. It was every pod trying to pull its image at the same second. Here’s the thundering herd, and how I fixed it.

Why I started stopping the dev cluster at night

A dev cluster doesn’t need to run 24/7. There are 168 hours in a week, but a dev environment realistically only needs ~50 (10 hours a day, 5 days a week). So I set up a schedule: scale the node groups to zero at night, bring them back at 8 AM. The control plane stays up; the expensive worker nodes go to zero. Savings: roughly 60–70% on dev worker-node compute.

Then the cluster woke up angry

The automation worked perfectly go
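
The schedule arithmetic quoted in the excerpt (168 hours in a week against roughly 50 hours of actual use) can be checked with a short sketch. The function name `weekly_savings_fraction` is hypothetical and not from the article. It assumes worker-node cost scales linearly with running hours and ignores start-up time and any cost that keeps accruing while the nodes are at zero, so it gives an upper bound rather than a measured figure.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours, as stated in the excerpt


def weekly_savings_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of weekly worker-node hours avoided by running only on schedule.

    Assumption (not from the article): cost is proportional to running hours.
    """
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


if __name__ == "__main__":
    # 10 hours a day, 5 days a week, the schedule described in the excerpt
    print(f"{weekly_savings_fraction(10, 5):.1%}")
```

For the 10-hours-by-5-days schedule this comes to about 70% of worker-node hours, which sits at the top of the 60–70% range the author reports.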

Full text and context at the original source: https://dev.to/srinu_nuthi_5ff587c586662/qps-limit-exceeded-on-eks-start-up-the-image-pull-thundering-herd-14ge