Automatic K8s pod placement to match external service zones

(github.com)

82 points | toredash | 7mo ago | 46 comments

Hi HN,

I wanted to share something I've spent some time solving in Kubernetes: its scheduler has no awareness of the network topology for external services that workloads communicate with. If a pod talks to a database (e.g. AWS RDS), K8s does not know it should schedule it in the same AZ as the database. If the pod is placed in the wrong AZ, it leads to unnecessary cross-AZ network traffic, adding latency (and costs $).

I've made a tool I've called "Automatic Zone Placement", which automatically aligns Pod placements with their external dependencies.

Testing shows that placing the pod in the same AZ resulted in a ~175-375% performance increase, measured with small, frequent SQL requests. That's not really strange: same-AZ latency is much lower than cross-AZ latency. Lower latency = increased performance.

The tool has two components:

1) A lightweight lookup service: A dependency-free Python service that takes a domain name (e.g., your RDS endpoint) and resolves its IP to a specific AZ.

2) A Kyverno mutating webhook: This policy intercepts pod creation requests. If a pod has a specific annotation, the webhook calls the lookup service and injects the required nodeAffinity to schedule the pod onto a node in the correct AZ.
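To make the first component concrete, here is a minimal sketch of a hostname-to-zone lookup, assuming a static subnet-to-zone table. The CIDR ranges, zone names, and function names are hypothetical placeholders, not the project's actual code.

```python
import ipaddress
import socket

# Hypothetical static subnet-to-zone table; replace with your own VPC layout.
SUBNET_ZONES = {
    "10.0.0.0/24": "eu-west-1a",
    "10.0.1.0/24": "eu-west-1b",
    "10.0.2.0/24": "eu-west-1c",
}


def zone_for_ip(ip, subnet_zones=SUBNET_ZONES):
    """Return the zone whose subnet contains ip, or None if no subnet matches."""
    address = ipaddress.ip_address(ip)
    for cidr, zone in subnet_zones.items():
        if address in ipaddress.ip_network(cidr):
            return zone
    return None


def zone_for_host(hostname):
    """Resolve hostname to an IPv4 address and map that address to a zone."""
    ip = socket.gethostbyname(hostname)
    return zone_for_ip(ip)
```

Returning None for an unknown subnet lets the caller skip the affinity injection instead of guessing a zone.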

The goal is to make this an automatic process; the alternative is to manually add a nodeAffinity spec to your workloads. But resources move between AZs, e.g. during maintenance events for RDS instances. I built this with AWS services in mind, but the concept is generic enough to be used for on-premise clusters to make scheduling decisions based on rack, row, or data center properties.

I'd love some feedback on this, happy to answer questions :)

stackskipton 7mo ago

How do you handle RDS failovers? The mutating webhook only fires when Pods are created, so if the AZ itself does not fail, there are no pods to be created and no affinity rules to be changed.

toredash (OP) 7mo ago

As it stands now, it doesn't, unless you modify the Kyverno policy to use background scanning.

I would create a similar policy where Kyverno at intervals would check the Deployment spec to see if the endpoint is changed, and alter the affinity rules. It would then be a traditional update of the Deployment spec to reflect the desire to run in another AZ, if that made sense?
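The periodic check described here could be sketched roughly as follows, assuming the pod template is available as a plain dict and that the zone is expressed through the standard `topology.kubernetes.io/zone` node label. The function names are hypothetical, and this is not how Kyverno itself evaluates policies.

```python
ZONE_LABEL = "topology.kubernetes.io/zone"


def preferred_zones(pod_spec):
    """Collect zone values from preferred node affinity terms in a pod spec dict."""
    node_affinity = pod_spec.get("affinity", {}).get("nodeAffinity", {})
    terms = node_affinity.get("preferredDuringSchedulingIgnoredDuringExecution", [])
    zones = set()
    for term in terms:
        for expr in term.get("preference", {}).get("matchExpressions", []):
            if expr.get("key") == ZONE_LABEL and expr.get("operator") == "In":
                zones.update(expr.get("values", []))
    return zones


def needs_affinity_update(pod_spec, current_zone):
    """True when the spec does not already prefer exactly the endpoint's current zone."""
    return preferred_zones(pod_spec) != {current_zone}
```

A background job would call `needs_affinity_update` with the zone the endpoint resolves to right now, and only patch the Deployment when it returns True.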

darkwater 7mo ago

Interesting project! Kudos for the release. One question: how are failure scenarios managed, i.e. AZP fails for whatever reason and it's in a crash loop? Just "no hints" to the scheduler, and that's it?

toredash (OP) 7mo ago

If the AZP deployment fails, yes, you're correct, there are no hints anywhere. If the lookup to AZP fails for whatever reason, it would be noted in the Kyverno logs. Depending on whether you -require- this policy to take effect or not, you have to decide if you want pods to fail in the scheduling step. In most cases, you don't want to stop scheduling :)

mathverse 7mo ago

Typically you have a multi-AZ setup for app deployment for HA. How would you solve this without traffic management control?

toredash (OP) 7mo ago

I'm not sure I follow. Are you talking about the AZP service, or ... ?

dserodio 7mo ago

It's a best practice to have a Deployment run multiple Pods in separate AZs to increase availability

stackskipton 7mo ago

This is one of those ideas that sounds great and appears simple at first but can grow into a mad monster. Here are my potential thoughts after 5 minutes.

Kyverno requirement makes it limited.

There is no "automatic-zone-placement-disabled" function in case someone wants to temporarily disable zone placement but not remove the label.

How do we handle RDS Zone changing after workload scheduling?

No automatic look up of IPs and Zones.

What if we only have one node in specific zone? Are we willing to handle EC2 failure or should we trigger scale out?

toredash (OP) 7mo ago

> Kyverno requirement makes it limited.

You don't have to use Kyverno. You could use a standard mutating webhook, but you would have to generate your own certificate and mutate on every Pod CREATE operation. Not really a problem, but it depends.

> There is no "automatic-zone-placement-disabled"

True. That's why I chose to use preferredDuringSchedulingIgnoredDuringExecution instead of requiredDuringSchedulingIgnoredDuringExecution. In my case, where this solution originated, Kubernetes was already a multi-AZ setup where there was always at least one node in each AZ. It was nice if the Pod could be scheduled into the same AZ, but it was not a hard requirement.
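For illustration, a soft zone preference of this kind could be built as below. This is a sketch only: the helper name and the default weight are assumptions, and the Kyverno policy in the repository may generate a different structure. The field names follow the Kubernetes Pod spec, and `topology.kubernetes.io/zone` is the well-known zone label on nodes.

```python
import json


def zone_affinity(zone, weight=100):
    """Build a soft (preferred) node affinity for one zone; weight must be 1-100."""
    return {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "preference": {
                        "matchExpressions": [
                            {
                                "key": "topology.kubernetes.io/zone",
                                "operator": "In",
                                "values": [zone],
                            }
                        ]
                    },
                }
            ]
        }
    }


if __name__ == "__main__":
    # The zone name is a placeholder.
    print(json.dumps(zone_affinity("eu-west-1a"), indent=2))
```

Because the term is preferred rather than required, the scheduler can still place the pod in another zone when the preferred one has no room.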

> No automatic look up of IPs and Zones.

Yup, it would generate a lot of extra "stuff" to mess with: IAM Roles, and how to look up IP/subnet information in a multi-account AWS setup with VPC peerings. In our case a static approach was "good enough". The subnet/network topology didn't change frequently enough to justify another layer of complexity.

> What if we only have one node in specific zone?

That's why we defaulted to preferredDuringSchedulingIgnoredDuringExecution and not required.

solatic 7mo ago

I don't really understand why you think this tool is needed and what exact problem/risk it's trying to solve.

Most people should start with a single-zone setup and just accept that there's a risk associated with zone failure. If you have a single-zone setup, you have a node group in that one zone, you have the managed database in the same zone, and you're done. Zone-wide failure is extremely rare in practice and you would be surprised at the number (and size of) companies that run single-zone production setups to save on cloud bills. Just write the zone label selector into the node affinity section by hand, you don't need a fancy admission webhook if you want to reduce chance's factor.

If you decide that you want to handle the additional complexity of supporting failover in case of zone failure, the easiest approach is to just setup another node group in the secondary zone. If the primary zone fails, manually scale up the node pool in the secondary zone. Kubernetes will automatically schedule all the pods on the scaled up node pool (remember: primary zone failure, no healthy nodes in the primary zone), and you're done.

If you want to handle zone failover completely automatically, this tool represents additional cost, because it forces you to have nodes running in the secondary zone during normal usage. Hopefully you are not running a completely empty, redundant set of service VMs in normal operation, because that would be a colossal waste of money. So you are presuming that, when RDS automatically fails over to zone b to account for zone a failure, that you will certainly be able to scale up a full scale production environment in zone b as well, in spite of nearly every other AWS customer attempting more or less the same strategy; roughly half of zone a traffic will spill over to zone b, roughly half to zone c, minus all the traffic that is zone-locked to a (e.g. single-zone databases without failover mechanisms). That is a big assumption to make and you run a serious risk of not getting sufficient capacity in what was basically an arbitrarily chosen zone (chosen without context on whether there is sufficient capacity for the rest of your workloads) and being caught with zonal mismatches and not knowing what to do. You very well might need to failover to another region entirely to get sufficient capacity to handle your full workload.

If you are both cost- and latency-sensitive to stick to a single zone, you're likely much better off coming up with a migration plan, writing an automated runbook/script to handle it, and testing it on gamedays.

stronglikedan 7mo ago

> I don't really understand why you think this tool is needed and what exact problem/risk it's trying to solve.

They lay out the problem and solution pretty well in the link. If you still don't understand after reading it, then that's okay! It just means you're not having this problem and you're not in need of this tool, so go you! But at least you'll come away with the understanding that someone was having this problem and someone needed this tool to solve it, so win win win!

toredash (OP) 7mo ago

> Most people should start with a single-zone setup and just accept that there's a risk associated with zone failure. If you have a single-zone setup, you have a node group in that one zone, you have the managed database in the same zone, and you're done.

I don't disagree, but there is one issue with this approach: RDS is a multi-AZ service by itself. That means that when a maintenance event occurs on your instance, AWS will start a new instance in a new zone, and fail over to that one.

You could of course manually failover RDS afterwards to your primary zone. Not sure if that is better than manually scaling up a node pool if a zone fails.

> So you are presuming that, when RDS automatically fails over to zone b to account for zone a failure, that you will certainly be able to scale up a full scale production environment in zone b as well, in spite of nearly every other AWS customer attempting more or less the same strategy;

That's up to the user to decide via the Kyverno policy. We used the preferredDuringSchedulingIgnoredDuringExecution affinity setting to instruct the scheduler to attempt to schedule the pods in the optimal zone.

I believe the only way to be 100% sure that you have compute capacity available in your AWS account is to use EC2 On-Demand Capacity Reservations (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capa...). If your current zone is at full capacity, and for some reason the nodes your VMs are running on die, that capacity is lost, and you won't get it back either.

solatic 7mo ago

> That means that when a maintenance event occurs on your instance, AWS will start a new instance in a new zone, and fail over to that one.

Not true for single-AZ deployments. There is downtime during the maintenance event, but this is also true in multi-AZ deployments when the instance in the second AZ is promoted; a multi-AZ maintenance window has slightly less downtime, but not much; downtime is downtime, but generally not enough to affect a 99.9% SLA anyway.

> EC2 On-Demand Capacity Reservations

Also quite expensive to maintain just for outage recovery events.

The point I'm trying to make is that formal risk analysis forces you to think about actual sources of risk, and SRE/FinOps principles force you to think about how much budget you are willing to spend to address those risks. And I don't understand how a tool like this fits into formal risk analysis and where it presents an optimum solution for those risks.

toredash (OP) 7mo ago

> And I don't understand how a tool like this fits into formal risk analysis and where it presents an optimum solution for those risks.

Seems it does not fit your risk analysis?

pikdum 7mo ago

Wasn't aware that there was noticeably higher latency between availability zones in the same AWS region. Kinda thought the whole point was to run replicas of your application in multiple to achieve higher availability.

dilyevsky 7mo ago

They also charge you like 1c/GB for traffic egress between the zones. To top it off, there are issues with AWS load balancers in multi-zone setups. Ultimately I've come to the conclusion that large multi-zonal clusters are a mistake. Do several single-zone disposable clusters if you want zone redundancy.

frenchtoast8 7mo ago

At $WORK traffic between zones ($REGION-DataTransfer-Regional-Bytes) is our second largest cost on our AWS bill, more than our EC2/EKS cost. It adds up to mid six figures each year. We try to minimize this where it is easy to do so. For example, our EKS pods perform reads against RDS read replicas in the same AZ only, but you're out of luck for writes to the primary instance. To reduce this in any significant way can eat up a lot of time, and for us, the cost is enough to be painful but not enough to dedicate an engineer to fixing.

This is precisely how Amazon's bread is buttered. An outage affecting an entire AZ is rare enough that I would feel pretty happy making all our clusters single-AZ, but it would be a fool's errand for me to convince management to go against Amazon's official recommendations.

toredash (OP) 7mo ago

I would LOVE to pitch something else I'm working on that is solving this problem in EKS, cross zone data transfer.

It's a plugin that enables traffic redirection for any service that is using an IP in any given VPC. If you have, say, multiple RDS Reader instances, it will attempt to use local AZ instances first, but the other instances are available if local instances are non-functional. So you do not lose HA or failover features.
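The "local AZ first, others as fallback" idea can be sketched as a simple ordering over candidate endpoints. This only illustrates the selection logic described above and is not the plugin's implementation; the endpoint dict shape (`ip`, `zone`, `healthy`) and the function name are assumptions.

```python
def order_endpoints(endpoints, local_zone):
    """Return healthy endpoints with those in local_zone first.

    Each endpoint is a dict with 'ip', 'zone', and 'healthy' keys.
    """
    healthy = [e for e in endpoints if e["healthy"]]
    # sorted() is stable: False (same zone) sorts before True (other zones),
    # and the original order is kept within each group.
    return sorted(healthy, key=lambda e: e["zone"] != local_zone)
```

If no healthy endpoint exists in the local zone, the list simply starts with the remote ones, so failover still works.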

The plugin does not require any reconfiguration of your apps. It works similarly to Topology Aware Routing (https://kubernetes.io/docs/concepts/services-networking/topo...) in Kubernetes, but it works for services outside of Kubernetes. The plugin even works for non-Kubernetes setups.

This AZP solution is fine for services that have one IP or primary instance, like an RDS Writer instance. It does not work for anything that is "stateless" and multi-AZ, like RDS read-only instances or ALBs.

dilyevsky 7mo ago

I assume with this much traffic you’re running multiple clusters? In that case what is there to gain by running each cluster as multi-zone?

stackskipton 7mo ago

It's generally sub-2 ms. Most people accept a slight latency increase for higher availability, but I guess in this case, that was not acceptable.

danpalmer 7mo ago

2ms per RPC is pretty high if you need to make dozens of RPCs to serve a request.

toredash (OP) 7mo ago

That was the origin for this solution. A client app had to issue millions of small SQL queries where the first query had to complete before the second query could be made. Millions of ms add up.

Lowest possible latency would of course be running the client code on the same physical box as the SQL server, but that's hard to do.
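A rough back-of-the-envelope version of "it adds up", taking the roughly 0.7 ms cross-AZ average mentioned elsewhere in the thread. The same-AZ figure and the query count are hypothetical values chosen only to show the arithmetic, not measurements.

```python
def total_wait_seconds(query_count, round_trip_us):
    """Total network wait for strictly sequential queries, in seconds."""
    return query_count * round_trip_us / 1_000_000


QUERIES = 1_000_000   # hypothetical number of sequential queries
CROSS_AZ_US = 700     # about 0.7 ms, the cross-AZ average quoted in this thread
SAME_AZ_US = 200      # hypothetical same-AZ round trip, for illustration only

if __name__ == "__main__":
    cross = total_wait_seconds(QUERIES, CROSS_AZ_US)
    same = total_wait_seconds(QUERIES, SAME_AZ_US)
    print(f"cross-AZ: {cross:.0f} s, same-AZ: {same:.0f} s, saved: {cross - same:.0f} s")
```

Because each query has to wait for the previous one, the per-query round trip multiplies directly by the query count; nothing overlaps to hide it.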

stackskipton 7mo ago

It’s generally below that. On average it seems to be about 0.7 ms.

toredash (OP) 7mo ago

I was surprised too. Of course it makes sense when you look at it hard enough: two separate DCs won't have the same latency as internal DC communication. It might have the same physical wire-speed, but physical distance matters.

mystifyingpoi 7mo ago

Cool idea, I like that. Though I'm curious about the lookup service. You say:

> To gather zone information, use this command ...

Why couldn't most of this information be gathered by the lookup service itself? A point could be made about excessive IAM, but a simple case of an RDS reader residing in a given AZ could be easily handled by simply listing the subnets and finding where a given IP belongs.

toredash (OP) 7mo ago

Totally agree!

This service is published more as a concept to be built on top of, than a complete solution.

You wouldn't even need IAM rights to read RDS information; you need subnet information. As subnets are zonal, it does not matter if the service is RDS or Redis/ElastiCache. The IP returned from the hostname lookup, at the time your pod is scheduled, determines which AZ that Pod should (optimally) be deployed to.

This solution was created in a multi-account AWS environment. Doing describe-subnets API calls across multiple accounts is a hassle. It was "good enough" to have a static mapping of subnets, as they didn't change frequently.

westurner 7mo ago

Couldn't something like this make CI builds faster by running builds near already-cached container images?

toredash (OP) 7mo ago

Are you thinking about already-cached container images on the host level? Not sure how AZP fits in here?

Since you mentioned it, what I've done before when it comes to improving CI builds is to use karpenter + local SSD mounts with very large instance types and an idle timeout of ~1h. This allowed us to have very performant build machines at a low cost. The first build of the day took a while to get going, but from a price-benefit perspective it was great.

westurner 7mo ago

Are the container image repositories and the container images also "external resources" that could make CI build pod placement more efficient?

Thanks; that sounds faster than most self-hosted CI services.

toredash (OP) 7mo ago

If the image repositories were AZ-bound resources, that would make the CI build process more efficient.

Or, if the resources that the CI build is utilizing within the image (after the image is pulled and started) are AZ bound, then yes, the build process would be improved since the CI build would fetch AZ-local resources rather than crossing the AZ boundary.

ruuda 7mo ago

> Have you considered alternative solutions?

How about, don't use Kubernetes? The lack of control over where the workload runs is a problem caused by Kubernetes. If you deploy an application as e.g. systemd services, you can pick the optimal host for the workload, and it will not suddenly jump around.

indemnity 7mo ago

> The lack of control over where the workload runs is a problem caused by Kubernetes.

Fine grained control over workload scheduling is one of the K8s core features?

Affinity, anti-affinity, priority classes, node selectors, scheduling gates - all of which affect scheduling for different use cases, and all under the operator's control.

glennpratt 7mo ago

Comparing systemd and Kubernetes for this scenario is like comparing an apple tree to a citrus grove.

You can specify just about anything, including exact nodes, for Kubernetes workloads.

This is just injecting some of that automatically.

I'm not knocking systemd, it's just not relevant.

mystifyingpoi 7mo ago

> The lack of control

This project literally sets the affinity. That's precisely the control you seem to negate.

arccy 7mo ago

k8s doesn't lack control, you can select individual nodes, AZs, regions, etc with the standard affinity settings.

Spivak 7mo ago

You need it to jump around because your RDS database might fail over to a different AZ.

Being able to move workloads around is kinda the point. The need exists irrespective of what you use to deploy your app.

toredash (OP) 7mo ago

The nice thing about this solution is that it's not limited to RDS. I used RDS as an example as many are familiar with it and are aware that it will change AZ during maintenance events.

Any hostname for a service in AWS that can relocate to another AZ (for whatever reason), can use this.

aduwah 7mo ago

Mind you, that you are facing the same problem with any Autoscaling group that lives in multiple AZs. You don't need kubernetes for this

toredash (OP) 7mo ago

Agree, Kubernetes isn't for everyone. This solution came from a specific issue with a client that had ad hoc performance problems when a Pod was placed in the "incorrect" AZ. So this solution was created to place the Pods in the most optimal zone when they were created.

kentm 7mo ago

Sure, but there are scenarios and architectures where you do want the workload to jump around, but just to a subset of hosts matching certain criteria. Kubernetes does solve that problem.
