Lessons From The Trenches: Auto Scaling Groups

On the surface, the problem and solution for auto scaling seem simple. When you get a burst of work, scale up resources to handle the load. When the burst dies off, scale down resources to an appropriate level. Find a cold drink and a hammock; your job just became a lot easier.

The reality is that defining when to scale up and down is actually the difficult part. There are metrics, time intervals, cooldown periods, instance spin-up times, and many other variables to consider. So I thought I would provide a few key lessons learned from the trenches of Auto Scaling Groups (ASG).

The Cooldown Period Is For the Entire Auto Scaling Group
In each scale up or down policy you specify a cooldown period: the time to wait before taking another scaling action. The documentation is actually pretty clear about how this works, but because the period is defined on the policy, it's easy to jump to the conclusion that the cooldown applies only to that policy.

So take a policy like: scale down one instance and cool down for an hour. You might think the ASG could scale up 20 minutes after that scale down happens. But this isn't the case. The cooldown period applies to the entire group, not the policy. So while you are in the cooldown period, none of your policies will cause auto scaling events. I've generally found that cooldown periods below a half hour keep my ASG responsive to potential spikes, and also keep it from thrashing by scaling up and down.
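A minimal sketch of that group-wide behavior (plain Python, not an AWS API; the times and cooldown value are just the example above): once any policy fires, the shared cooldown blocks every policy, not only the one that triggered.

```python
from datetime import datetime, timedelta

# Hypothetical group-wide cooldown check: the cooldown is shared by
# the whole ASG, so any recent scaling action blocks *all* policies.
def can_scale(last_action: datetime, cooldown: timedelta, now: datetime) -> bool:
    return now - last_action >= cooldown

scale_down_at = datetime(2015, 6, 1, 12, 0)  # the scale-down policy fired here
cooldown = timedelta(hours=1)                # cooldown from that policy

# 20 minutes later a scale-UP policy wants to fire -- it cannot,
# because the cooldown applies to the entire group.
print(can_scale(scale_down_at, cooldown, scale_down_at + timedelta(minutes=20)))  # False
print(can_scale(scale_down_at, cooldown, scale_down_at + timedelta(minutes=65)))  # True
```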

How Do I Scale Up and Down?
There are a lot of options here, and what you choose will vary based on the type of workload you are dealing with. For web applications I've found it is best to pick the first resource that will heat up under load. If you have a memory-intensive application, it is probably memory you want to watch. But if your app is compute-intensive, then you probably want to watch CPU. I've done a few applications where I used CPU as the scaling metric: above 80% utilization for a period is a scale up event, and below 60% utilization for a period is a scale down event. It seems to work quite well.
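That decision can be sketched like this (the 80%/60% numbers mirror the thresholds above; the function and sample format are illustrative, not an AWS API): act only when every sample in the evaluation period is past a threshold.

```python
# Illustrative scaling decision from CPU samples over one evaluation
# period -- just the logic behind the alarms, not an AWS API.
def scaling_decision(cpu_samples, up_threshold=80.0, down_threshold=60.0):
    """Return 'up', 'down', or None for one evaluation period of samples."""
    if all(s > up_threshold for s in cpu_samples):
        return "up"
    if all(s < down_threshold for s in cpu_samples):
        return "down"
    return None

print(scaling_decision([85, 92, 88]))  # 'up'   -- sustained high CPU
print(scaling_decision([40, 55, 30]))  # 'down' -- sustained low CPU
print(scaling_decision([85, 50, 90]))  # None   -- mixed, no action
```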

But you can also scale your resources based on metrics that are external to your instances. One that is easy and effective is Simple Queue Service queue depth. If the queue gets beyond a certain depth, scale up. When the queue depth decreases to a comfortable level, scale down.
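A sketch of that trigger with separate high and low watermarks (the watermark values are made up for the example) so the group doesn't flap around a single threshold:

```python
# Illustrative queue-depth trigger with a high and a low watermark.
# In practice the depth would come from the SQS queue's
# ApproximateNumberOfMessages attribute; here it's just a number.
def queue_scaling_decision(depth, high_watermark=1000, low_watermark=100):
    if depth > high_watermark:
        return "up"
    if depth < low_watermark:
        return "down"
    return None

print(queue_scaling_decision(2500))  # 'up'   -- queue is backing up
print(queue_scaling_decision(500))   # None   -- comfortable level
print(queue_scaling_decision(20))    # 'down' -- nearly drained
```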

You can also specify multiple metrics to watch, like CPU and memory together. But my recommendation is to keep it as simple as possible.

Scale Up Fast, Scale Down Slow
This phrase is tossed around a lot, but it's not very helpful. What is fast, and what is slow? In one of my first takes at auto scaling I thought it was probably better to scale up more capacity than I needed, since the cost was low.

I had a policy that added multiple instances when a CPU threshold was exceeded for 5 minutes, then scaled down once a low CPU threshold was met for 30 minutes, with a cooldown of an hour. In my mind I was scaling up in minutes (fast) and scaling down over hours (slow).

We scaled successfully a few times with this policy, and I didn't realize that fast was actually too fast until the day I had a double-spike traffic pattern. The first wave of traffic came in and triggered my scale up rule. The group added more capacity than I needed to handle that load. But I was OK with that. More power is better, right?

Well, the traffic plateaued, and because I had scaled up too much capacity, my low CPU threshold was met about a half hour later and an instance was scaled off. The ASG was still in its one-hour cooldown period when the second wave hit, pushing us to even higher traffic numbers than the first. To get enough instances online, we had to manually fire scale up events through the AWS console.

The takeaway was that fast can be too fast. But the bigger takeaway was that the way our slow scale down policy was set up made the ASG slow to respond to changes. The incident caused us to reevaluate our auto scaling policies for that app. We slowed down the scale up policy so it wouldn't add too much capacity at once.

We also slowed the scale down policy further by setting its evaluation period to hours instead of half an hour. This allowed us to lower the cooldown period to under a half hour. The short cooldown keeps the ASG responsive, while the longer evaluation period helps us make sure the load has really passed before we start scaling down.
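A sketch of why the longer evaluation window helps (again plain Python with hypothetical numbers, one sample per 5 minutes): a half-hour lull between two spikes no longer satisfies a multi-hour scale-down window.

```python
# Scale down only when CPU stays below the threshold for the whole
# evaluation window -- hypothetical numbers, one sample every 5 minutes.
def should_scale_down(samples, threshold=60.0, window=24):  # 24 x 5 min = 2 hours
    recent = samples[-window:]
    return len(recent) == window and all(s < threshold for s in recent)

# Half an hour of low CPU between two traffic spikes (6 samples):
# a 30-minute window would scale down; a 2-hour window holds steady.
lull = [45, 40, 42, 38, 41, 44]
print(should_scale_down(lull, window=6))   # True  -- old 30-minute window
print(should_scale_down(lull, window=24))  # False -- new 2-hour window
```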

Hope it helps.