Intelligent Kubernetes Load Balancing at Databricks

(databricks.com)

130 points · ayf · 7mo ago · 25 comments

shizcakes 7mo ago

Less featureful than this, but we’ve been doing gRPC client-side load balancing with kuberesolver[1] since 2018. It lets gRPC handle the balancer implementations. It’s been rock solid for more than half a decade now.

[1] https://github.com/sercand/kuberesolver

azaras 7mo ago

What is the difference between Kuberesolver and using a headless Service?

In the README.md file, they compare it with a ClusterIP service, but not with a headless one ("clusterIP: None").

The advantage of Kuberesolver is that you do not need to change DNS refresh and cache settings. However, I think tuning those settings is preferable to having the application call the Kubernetes API.

euank 7mo ago

I can give an n=1 anecdote here: the gRPC DNS resolver used to have hard-coded caching, which meant it was unresponsive to pod updates and caused mini 30s outages.

The code in question was: https://github.com/grpc/grpc-go/blob/b597a8e1d0ce3f63ef8a7b6...

That meant that deploying a service which drained in less than 30s would have a little mini-outage for that service until the in-process DNS cache expired, with of course no way to configure it.

Kuberesolver streams updates, and thus lets clients talk to new pods almost immediately.

I think things are a little better now, but based on my reading of https://github.com/grpc/grpc/issues/12295, it looks like the dns resolver still might not resolve new pod names quickly in some cases.

gaurav324 7mo ago

kuberesolver is an interesting take as well. Directly watching the K8s API from each client could raise concerns at very large scale, but it does open the door to using richer Kubernetes metadata for smarter load-balancing decisions. Thanks for sharing!

debarshri 7mo ago

I think with some rate limiting, it can scale. But it might be a security issue: ideally you don't want clients to be aware of Kubernetes. Also, it would be difficult to scope the access.

arccy 7mo ago

if you don't want to expose k8s then there's the generic xds protocol

hanikesn 7mo ago

I've been using a standardized xds resolver[1]. The benefit here is that you don't have to patch grpc clients.

[1] https://github.com/wongnai/xds

atombender 7mo ago

Do you know how this compares to the Nginx ingress controller, which has a native gRPC mode?

darkstar_16 7mo ago

We use a headless service and client-side load balancing for this. What's the difference?

arccy 7mo ago

instead of polling for endpoint updates, they're pushed to the client through k8s watches

jedberg 7mo ago

I wonder why they didn't use rendezvous hashing (aka HRW)[0]?

It feels like it would solve all the requirements that they laid out, is fully client-side, and doesn't require real-time updates for the host list via discovery.

[0] https://en.wikipedia.org/wiki/Rendezvous_hashing
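
For readers who haven't met HRW, here is a minimal sketch of the idea, not anything taken from the Databricks post. Every client scores each host against the request key and picks the highest score, so clients agree on a host without coordinating. The host addresses, the key format, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def hrw_pick(key: str, hosts: list[str]) -> str:
    """Return the host with the highest hash score for this key."""
    def score(host: str) -> int:
        digest = hashlib.sha256(f"{host}|{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(hosts, key=score)


# Hypothetical pod IPs and request keys, for illustration only.
hosts = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
before = {f"req-{i}": hrw_pick(f"req-{i}", hosts) for i in range(1000)}

# Drop one host: only the keys that were mapped to it should move.
remaining = [h for h in hosts if h != "10.0.0.2"]
after = {k: hrw_pick(k, remaining) for k in before}
moved = [k for k in before if before[k] != after[k]]
```

The point of the last few lines is the stability property: keys that were not on the removed host keep their assignment, because their highest-scoring host is still in the list. Note that the pick depends only on the key and the host list, not on current load.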

deviation 7mo ago

HRW would cover the simple case, but they needed way more, e.g. per-request balancing, zone affinity, live health checks, spillover, ramp-ups, etc. Once you need all that dynamic behavior, plain hashing just doesn’t cut it IMO. A custom client-side + discovery setup makes more sense.

dastbe 7mo ago

the problem is that they want to apply a number of stateful/lookaside load balancing strategies, which become more difficult to do in a fully decentralized system. it’s generally easier to asynchronously aggregate information and either decide routing updates centrally or redistribute that aggregate to inform local decisions.

bbkane 7mo ago

Thanks for writing - I found the Power of Two Choices algorithm particularly interesting (I haven't seen it before).

From the recent gRPConf ( https://www.youtube.com/playlist?list=PLj6h78yzYM2On4kCcnWjl... ) it seems gRPC as a standard is also moving toward this "proxyless" model: gRPC will read xDS itself.
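
A rough sketch of Power of Two Choices, for anyone else who hasn't seen it: sample two backends at random and send the request to the one with fewer requests in flight. This is a toy model under stated assumptions (the pod names are made up, the load signal is a local in-flight counter, and requests never complete), not the implementation described in the article.

```python
import random


def pick_two_choices(in_flight: dict[str, int], rng: random.Random) -> str:
    """Sample two distinct backends at random and return the less loaded one."""
    a, b = rng.sample(sorted(in_flight), 2)
    return a if in_flight[a] <= in_flight[b] else b


rng = random.Random(0)
in_flight = {f"pod-{i}": 0 for i in range(10)}  # hypothetical backends

for _ in range(10_000):
    # Toy model: requests are never marked finished, so counters only grow.
    in_flight[pick_two_choices(in_flight, rng)] += 1

spread = max(in_flight.values()) - min(in_flight.values())
```

Here `spread` stays small compared with what purely random selection would give, which is why the technique is attractive: it needs only two load lookups per request instead of a scan over every backend.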

walth 7mo ago

You might be interested in nginx's implementation

https://nginx.org/en/docs/http/ngx_http_upstream_module.html...

thewisenerd 7mo ago

> kube-proxy supports only basic algorithms like round-robin or random selection

this is "partially" true.

if you're using ipvs, you can configure the scheduler to just about anything ipvs supports (including wrr). they removed the validation for the scheduler name quite a while back.

kubernetes itself though doesn't "understand" (i.e., can NOT represent) the nuances (e.g., weights per endpoint with wrr), which is the problem.
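
To make "weights per endpoint" concrete, here is a small sketch of weighted round-robin. It is the "smooth" variant rather than IPVS's own wrr scheduler, and the pod names and weights are invented for the example. The weights dict is exactly the piece of information the comment says Kubernetes has no way to represent per endpoint.

```python
def smooth_wrr(weights: dict[str, int], n: int) -> list[str]:
    """Return n picks using smooth weighted round-robin."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        # Every endpoint earns its weight each round...
        for name, weight in weights.items():
            current[name] += weight
        # ...and the current leader is picked and pays back the total.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


# Hypothetical endpoints: pod-a should receive 5 of every 7 requests.
picks = smooth_wrr({"pod-a": 5, "pod-b": 1, "pod-c": 1}, 7)
```

Over one full cycle of 7 picks, each endpoint is chosen in proportion to its weight, and the heavier endpoint's turns are interleaved with the others instead of being sent in one burst.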

kouzant 7mo ago

Is this something that could be solved with Consul? Consul will return the IP address of the Pod(s) and it already hooks into Kubernetes liveness for failure detection. It also supports weighted results for more complex routing.

thewisenerd 7mo ago

we have the same issue with HTTP as well, due to HTTP keepalive, which many clients have out of the box.

the "impact" can be reduced by configuring an overall connection-ttl, so it takes some time when new pods come up but it works out over time.

that said, i'm not surprised that even a company as large as databricks feels that adding a service mesh is going to add operational complexity.

looks like they've taken the best parts (endpoint watch, sync to clients with xDS) and moved it client-side. compared to the failure mode of a service mesh, this seems better.
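
The connection-ttl idea mentioned above can be sketched roughly like this. It assumes a client-side pool that checks each connection's age before reuse; `PooledConnection`, the 300s base TTL, and the 10% jitter are hypothetical values chosen for illustration, not settings from any particular HTTP or gRPC library.

```python
import random


def jittered_ttl(base_ttl: float, jitter: float, rng: random.Random) -> float:
    """Return base_ttl scaled by a random factor in [1 - jitter, 1 + jitter]."""
    return base_ttl * (1.0 + rng.uniform(-jitter, jitter))


class PooledConnection:
    """Hypothetical stand-in for a keepalive HTTP or gRPC connection."""

    def __init__(self, opened_at: float, ttl: float) -> None:
        self.opened_at = opened_at
        self.ttl = ttl

    def expired(self, now: float) -> bool:
        # Once expired, the caller closes the connection and dials again,
        # which gives newly started pods a chance to be selected.
        return now - self.opened_at >= self.ttl


rng = random.Random(42)
conns = [
    PooledConnection(opened_at=0.0, ttl=jittered_ttl(300.0, 0.1, rng))
    for _ in range(5)
]
```

The jitter is there so that connections opened at the same moment do not all expire and reconnect at once. As the comment says, this only reduces the impact: new pods still wait up to one TTL before they see their share of traffic.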

pm90 7mo ago

I haven’t been keeping up, but is there still hype over a full mesh like istio/linkerd? I've seen it tried in a couple of places but it didn’t work super well; the last place couldn’t because datadog apparently bills sidecar containers as additional hosts so using sidecar proxy would have doubled our datadog bill.

dastbe 7mo ago

> the last place couldn’t because datadog apparently bills sidecar containers as additional hosts so using sidecar proxy would have doubled our datadog bill.

that seems like the tail wagging the dog

gaurav324 7mo ago

Yes, we’ve leaned toward minimizing operational overhead. Taking the useful parts of a mesh (xDS endpoint and routing updates) into the client has worked extremely well in practice and has been very reliable, without the extra moving parts of a full mesh.

agrawroh 7mo ago

When we started, we had a lot of pieces like certificate management in-house, and adding a full-blown service mesh was a big operational overhead. We started by building only the parts we needed and integrating things like xDS natively in the rest of our clients.

dilyevsky 7mo ago

Curious why cross-cluster load balancing would be necessary in a setup where you operate “thousands of clusters”? I assume these are per-customer isolated environments?

closeparen 7mo ago

A lot of people seem to run a cluster per microservice.

barryvand 7mo ago

Aah, all the cool things you can do when you control the client! Great write-up. Also:

> thousands of Kubernetes clusters

Impressive!
