It's the super simple stuff, like scaling down staging on the weekend or even scaling all feature deployments to 0 when you know nobody will be working on them, that ends up saving you big bucks on your cloud budget.
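The decision logic is trivial, which is the point. A minimal sketch (the function name and replica counts are my own illustration, and in practice you'd run this from a CronJob that patches each Deployment's scale subresource):

```python
from datetime import datetime

def target_replicas(now: datetime, weekday_replicas: int) -> int:
    """Decide how many replicas a staging/feature deployment should run.

    Scale to 0 on weekends (weekday() is 5 for Saturday, 6 for Sunday),
    since nobody is working on those environments then.
    """
    return 0 if now.weekday() >= 5 else weekday_replicas
```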
If you pair the HPA with a decent node autoscaler, THAT, in my opinion, is the game changer of cloud-managed Kubernetes over the bare-metal deployments I have done.
I do see why more integration would be useful, though, including disruption budgets. Mostly for consolidating the incremental cluster autoscaling results onto one node from time to time, without waiting for the workload to naturally disappear or decrease in scale. Also, it would be nice to say "hey if ARM spot nodes are cheaper than AMD64, just reschedule these workloads onto ARM". Basically, it's still the very early days of optimizing cost, latency, and throughput.
Node autoscaling works best for me with buffer nodes depending on resources and having "one more than you need" is super easy in the cloud.
Don't get me wrong, there is still plenty of room for improvement, but the hard part is definitely figuring out how much in the way of resources your app really needs.
And of course, the application needs to be able to handle scaling to begin with.
The cluster autoscaler already has fairly complex logic just in its own control loop. It uses predicate logic and a simulated scheduler to determine, based on node selectors, affinity, anti-affinity, taints, tolerations, QoS, and priority, whether expanding a node group would make a pending pod schedulable.
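To make that concrete, here is a heavily simplified sketch of the kind of feasibility check the autoscaler's simulation performs for each pending pod against a node group's template. The data shapes and function names are my own; the real implementation runs the actual scheduler predicates:

```python
def tolerates(taint, tolerations):
    # Simplified toleration match: key must match, and the effect must
    # either match or be unspecified (which tolerates any effect).
    return any(t.get("key") == taint["key"] and
               t.get("effect", taint["effect"]) == taint["effect"]
               for t in tolerations)

def would_fit(pod, node_template):
    """Would a new node stamped from this group's template accept the pod?"""
    # Node-selector predicate: every requested label must be present.
    for k, v in pod.get("nodeSelector", {}).items():
        if node_template["labels"].get(k) != v:
            return False
    # Taint predicate: every NoSchedule taint must be tolerated.
    for taint in node_template.get("taints", []):
        if taint["effect"] == "NoSchedule" and \
                not tolerates(taint, pod.get("tolerations", [])):
            return False
    # Resource predicate: requests must fit the node's allocatable.
    for res, req in pod.get("requests", {}).items():
        if req > node_template["allocatable"].get(res, 0):
            return False
    return True
```

The real autoscaler also simulates affinity/anti-affinity against the pods already placed, which is where most of the complexity lives.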
So it's actually easier (at least for me) to reason out what might happen, with two control loops that work independently in adjacent dimensions than a single one that tries to cover everything. I would not want HPA, VPA, and cluster-autoscaler to be one thing.
I have never used VPA; in our use case, we do a different kind of vertical scaling. (Different deployments that target different node groups with different numbers of cores on the base machine.)
Go and write it, there are a bajillion open source controllers for Kubernetes that add a ton of value.
It's ill-suited for anyone unless P=NP with a nice solution.
That plus a steep learning curve leads to Stockholm syndrome.
We broke it at least once but now it’s fixed.
Strategies I have used in the past for saving money are:
1) Set requests very low for your pods. Look at the minimum CPU/Memory that your pods need to start and set it to that. Limits can be whatever.
2) Set min replicas to 1. This is a non-production environment, nobody cares if an idle pod goes away in the middle of the night.
3) Use spot instances for your cluster nodes. 80% savings is nice!
4) Increase the number of allowed pods per node. GKE sets the default to 110 pods per node but it can be increased.
5) Evaluate your nodes and determine if it makes more sense to have `fewer large sized nodes` or `several smaller nodes`. If you have a lot of daemonsets then maybe it makes sense to have fewer large nodes.
6) Look at the CPU and Memory utilization of your nodes. Are you using a lot of CPU but not much memory? Maybe you need to change the machine type you are using so that you get close(r) to 100% CPU and Memory utilization. You are just wasting money if you are only using 50% of the available memory of your nodes.
7) Use something like Knative or KEDA for 'intelligent autoscaling'. I've used both extensively and I found KEDA to be considerably simpler to use. Being able to scale services down to 0 pods is extremely nice!

Wouldn't this lead to node over-provisioning?
I ask because my company's workload is very spiky and usage is very minimal until it isn't. We are looking into ways to optimize it.
My apologies in advance, as this advice can be terrible depending on your environment and services. What follows is not an exact science, since you are dealing with requests and limits while trying to find optimal performance.
For production you need to calculate the minimum, average, and maximum CPU/Memory for a pod.
1) Set your replicas to 1
2) Determine what your true maximum CPU/Memory is for a pod.
Set your limits very high and performance test against your pod. If your response time slows to a crawl, then your limit is too high and your code may not be able to handle the load. If your response time is good while hitting the limit, increase the limit until performance goes down.
3) Get your minimum CPU/Memory for your pod to start.
4) Get your average CPU/Memory DURING THE SPIKES. You should be able to get this from past metrics. This can also be difficult to get because your load might be spread over several pods in your metrics.
5) I use the following formulas:
requests = (min + average)/2
limits = (average + max)/2
6) You now have a baseline for the future so that you can tweak the values.
7) Set your autoscaler to something high like 80% CPU. You want this value to stay constant. I think GKE sets it to 60% but I found that to be far too low and wasteful.
8) Observe and tweak the values to see if you can get things 'better' depending on your needs.
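The midpoint formulas in the steps above are easy to wire into a helper so you can recompute sizing whenever new metrics come in (the function name is mine; units are whatever your metrics use, e.g. millicores or MiB):

```python
def sizing(min_usage: float, avg_spike_usage: float, max_usage: float):
    """Turn observed min/average-during-spikes/max usage into a
    (requests, limits) pair using the midpoint formulas:
      requests = (min + average) / 2
      limits   = (average + max) / 2
    """
    requests = (min_usage + avg_spike_usage) / 2
    limits = (avg_spike_usage + max_usage) / 2
    return requests, limits

# e.g. 100m floor, 300m average during spikes, 700m observed max
# gives requests of 200m and a limit of 500m.
```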
There are two other things I always do in production that help with stability and reliability.
- Set the autoscaler behavior to scale up quickly and scale down slowly. It stops the chaotic cycles of add 3 pods, remove 1 pod, add 2 pods, remove 3 pods in short periods of time during spikes. The behavior field was added to the HPA resource a couple of releases ago.
- Set your minimum replicas to 2 for redundancy. I always do this in production.
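The effect of the asymmetric behavior policy can be illustrated with a toy step function (the step limits here are hypothetical; the real HPA behavior field expresses these as pods-or-percent policies over a period):

```python
def next_replicas(current: int, desired: int,
                  max_up_per_step: int = 4,
                  max_down_per_step: int = 1,
                  min_replicas: int = 2) -> int:
    """One scaling step: jump up aggressively, bleed down slowly,
    never below min_replicas. This is the shape of behavior that
    the HPA's scaleUp/scaleDown policies let you configure."""
    if desired > current:
        return min(desired, current + max_up_per_step)
    if desired < current:
        return max(desired, current - max_down_per_step, min_replicas)
    return current
```

A spike from 2 to 10 desired replicas is reached in a couple of steps, while the return trip sheds one pod at a time, which damps the add/remove oscillation.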
I hope this helps, and I apologize once again for the hand-waviness of things.

If you need to scale based on some internal data like database records, Redis queues, Kafka topics, etc., KEDA scalers are incredibly easy to hook up to do that. You could even write your own custom scaler if there is no existing one for your type of event data source.
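The core math a queue-backed scaler like this applies is just a ceiling division clamped to bounds, which is why scale-to-zero falls out naturally when the queue is empty. A simplified sketch (KEDA's actual semantics, including activation thresholds, differ in the details):

```python
import math

def desired_replicas(queue_length: int, target_per_pod: int,
                     min_replicas: int = 0, max_replicas: int = 30) -> int:
    """Queue-length scaling rule: one pod per `target_per_pod`
    pending items, clamped to [min_replicas, max_replicas].
    With min_replicas=0, an empty queue scales the service to zero."""
    want = math.ceil(queue_length / target_per_pod)
    return max(min_replicas, min(max_replicas, want))
```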