It's the super simple stuff, like scaling down staging on the weekend or even scaling all feature deployments to 0 when you know nobody will be working on them, that ends up saving you big bucks on your cloud budget.
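The decision logic is trivial, which is the point. A minimal sketch (the function name and replica counts are my own illustration, and in practice you'd run this from a CronJob that patches each Deployment's scale subresource):

```python
from datetime import datetime

def target_replicas(now: datetime, weekday_replicas: int) -> int:
    """Decide how many replicas a staging/feature deployment should run.

    Scale to 0 on weekends (weekday() is 5 for Saturday, 6 for Sunday),
    since nobody is working on those environments then.
    """
    return 0 if now.weekday() >= 5 else weekday_replicas
```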
If you pair the HPA with a decent node autoscaler, THAT, in my opinion, is the game changer of cloud-managed Kubernetes over the bare-metal deployments I have done.
I do see why more integration would be useful, though, including disruption budgets. Mostly for consolidating the incremental cluster autoscaling results onto one node from time to time, without waiting for the workload to naturally disappear or decrease in scale. Also, it would be nice to say "hey if ARM spot nodes are cheaper than AMD64, just reschedule these workloads onto ARM". Basically, it's still the very early days of optimizing cost, latency, and throughput.
Node autoscaling works best for me with buffer nodes depending on resources and having "one more than you need" is super easy in the cloud.
Don't get me wrong, there is still plenty of room for improvement, but the hard part is definitely figuring out how much in the way of resources your app really needs.
And of course, the application needs to be able to handle scaling to begin with.
The cluster autoscaler already has fairly complex logic just in its own control loop. It uses predicate logic and a simulated scheduler to determine, based on node selectors, affinity, anti-affinity, taints, tolerations, QoS, and priority, whether expanding a node group would make a pending pod schedulable.
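To make that concrete, here is a heavily simplified sketch of the kind of feasibility check the autoscaler's simulation performs for each pending pod against a node group's template. The data shapes and function names are my own; the real implementation runs the actual scheduler predicates:

```python
def tolerates(taint, tolerations):
    # Simplified toleration match: key must match, and the effect must
    # either match or be unspecified (which tolerates any effect).
    return any(t.get("key") == taint["key"] and
               t.get("effect", taint["effect"]) == taint["effect"]
               for t in tolerations)

def would_fit(pod, node_template):
    """Would a new node stamped from this group's template accept the pod?"""
    # Node-selector predicate: every requested label must be present.
    for k, v in pod.get("nodeSelector", {}).items():
        if node_template["labels"].get(k) != v:
            return False
    # Taint predicate: every NoSchedule taint must be tolerated.
    for taint in node_template.get("taints", []):
        if taint["effect"] == "NoSchedule" and \
                not tolerates(taint, pod.get("tolerations", [])):
            return False
    # Resource predicate: requests must fit the node's allocatable.
    for res, req in pod.get("requests", {}).items():
        if req > node_template["allocatable"].get(res, 0):
            return False
    return True
```

The real autoscaler also simulates affinity/anti-affinity against the pods already placed, which is where most of the complexity lives.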
So it's actually easier (at least for me) to reason out what might happen, with two control loops that work independently in adjacent dimensions than a single one that tries to cover everything. I would not want HPA, VPA, and cluster-autoscaler to be one thing.
I have never used VPA; in our use case, we do a different kind of vertical scaling. (Different deployments that target different node groups with different numbers of cores on the base machine.)
Go and write it, there are a bajillion open source controllers for Kubernetes that add a ton of value.
It's ill-suited for anyone unless P=NP with a nice solution.
That plus a steep learning curve leads to Stockholm syndrome.
We broke it at least once but now it’s fixed.
Strategies I have used in the past for saving money are:
1) Set requests very low for your pods. Look at the minimum CPU/Memory that your pods need to start and set it to that. Limits can be whatever.
2) Set min replicas to 1. This is a non-production environment, nobody cares if an idle pod goes away in the middle of the night.
3) Use spot instances for your cluster nodes. 80% savings is nice!
4) Increase the number of allowed pods per node. GKE sets the default to 110 pods per node but it can be increased.
5) Evaluate your nodes and determine if it makes more sense to have `fewer large sized nodes` or `several smaller nodes`. If you have a lot of daemonsets then maybe it makes sense to have fewer large nodes.
6) Look at the CPU and Memory utilization of your nodes. Are you using a lot of CPU but not much memory? Maybe you need to change the machine type you are using so that you get close(r) to 100% CPU and Memory utilization. You are just wasting money if you are only using 50% of the available memory of your nodes.
7) Use something like Knative or KEDA for 'intelligent autoscaling'. I've used both extensively and I found KEDA to be considerably simpler to use. Being able to scale services down to 0 pods is extremely nice!

Wouldn't this lead to node over-provisioning?
I ask because my company's workload is very spiky and usage is very minimal until it isn't. We are looking into ways to optimize it.
My apologies in advance, as this advice can be terrible depending on your environment and services. What follows is not an exact science, since you are dealing with requests and limits while trying to find optimal performance.
For production you need to calculate the minimum, average, and maximum CPU/Memory for a pod.
1) Set your replicas to 1
2) Determine what your true maximum CPU/Memory is for a pod.
Set your limits very high and performance test against your pod. If your response time slows to a crawl, then your limit is too high and your code may not be able to handle the load. If your response time is good while hitting the limit, increase the limit until performance goes down.
3) Get your minimum CPU/Memory for your pod to start.
4) Get your average CPU/Memory DURING THE SPIKES. You should be able to get this from past metrics. This can also be difficult to get because your load might be spread over several pods in your metrics.
5) I use the following formulas:
requests = (min + average)/2
limits = (average + max)/2
6) You now have a baseline for the future so that you can tweak the values.
7) Set your autoscaler to something high like 80% CPU. You want this value to stay constant. I think GKE sets it to 60% but I found that to be far too low and wasteful.
8) Observe and tweak the values to see if you can get things 'better' depending on your needs.
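The midpoint formulas in the steps above are easy to wire into a helper so you can recompute sizing whenever new metrics come in (the function name is mine; units are whatever your metrics use, e.g. millicores or MiB):

```python
def sizing(min_usage: float, avg_spike_usage: float, max_usage: float):
    """Turn observed min/average-during-spikes/max usage into a
    (requests, limits) pair using the midpoint formulas:
      requests = (min + average) / 2
      limits   = (average + max) / 2
    """
    requests = (min_usage + avg_spike_usage) / 2
    limits = (avg_spike_usage + max_usage) / 2
    return requests, limits

# e.g. 100m floor, 300m average during spikes, 700m observed max
# gives requests of 200m and a limit of 500m.
```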
There are two other things I always do in production that help with stability and reliability.
- Set the autoscaler behavior to scale up quickly and scale down slowly. It stops the chaotic cycles of add 3 pods, remove 1 pod, add 2 pods, remove 3 pods in short periods of time during spikes. The behavior field was added to the HPA resource a couple of releases ago.
- Set your minimum replicas to 2 for redundancy. I always do this in production.
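The effect of the asymmetric behavior policy can be illustrated with a toy step function (the step limits here are hypothetical; the real HPA behavior field expresses these as pods-or-percent policies over a period):

```python
def next_replicas(current: int, desired: int,
                  max_up_per_step: int = 4,
                  max_down_per_step: int = 1,
                  min_replicas: int = 2) -> int:
    """One scaling step: jump up aggressively, bleed down slowly,
    never below min_replicas. This is the shape of behavior that
    the HPA's scaleUp/scaleDown policies let you configure."""
    if desired > current:
        return min(desired, current + max_up_per_step)
    if desired < current:
        return max(desired, current - max_down_per_step, min_replicas)
    return current
```

A spike from 2 to 10 desired replicas is reached in a couple of steps, while the return trip sheds one pod at a time, which damps the add/remove oscillation.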
I hope this helps, and I apologize once again for the hand-waviness of things.

If you need to scale based on some internal data like database records, Redis queues, Kafka topics, etc., KEDA scalers are incredibly easy to hook up to do that. You could even write your own custom scaler if there is no existing one for your type of event data source.
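The core math a queue-backed scaler like this applies is just a ceiling division clamped to bounds, which is why scale-to-zero falls out naturally when the queue is empty. A simplified sketch (KEDA's actual semantics, including activation thresholds, differ in the details):

```python
import math

def desired_replicas(queue_length: int, target_per_pod: int,
                     min_replicas: int = 0, max_replicas: int = 30) -> int:
    """Queue-length scaling rule: one pod per `target_per_pod`
    pending items, clamped to [min_replicas, max_replicas].
    With min_replicas=0, an empty queue scales the service to zero."""
    want = math.ceil(queue_length / target_per_pod)
    return max(min_replicas, min(max_replicas, want))
```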