My first thought is that the spikes are most likely the result of requests being sent to pods that no longer exist, or that are still starting up and not yet ready to process requests. If so, this might just reflect how each of the three underlying proxies was configured and say nothing about how well they actually perform as load balancers.
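For instance (a hypothetical sketch, not from the article), Kubernetes only routes Service traffic to a pod once its readiness probe passes, so a missing or too-lenient probe would produce exactly this kind of spike during rollouts:

```yaml
# Hypothetical Deployment snippet: without a readinessProbe, Kubernetes
# marks the pod ready as soon as the container starts, so the Service may
# route requests to it before the app can actually serve them.
containers:
  - name: app
    image: example/app:latest    # placeholder image
    readinessProbe:
      httpGet:
        path: /healthz           # assumed health endpoint
        port: 8080
      initialDelaySeconds: 2
      periodSeconds: 5
```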
If someone came to me with this at work, I would say it is the starting point for a series of troubleshooting steps to answer why there are such outlying requests with our load balancer of choice, not an analysis of which software to pick.
Edit: Even worse, this appears to be from a company that sells... an API gateway built on top of Envoy.
Thanks for the feedback.
Regarding your hypothesis that the spikes come from requests being sent to pods that no longer exist or are still starting: 1) on Kubernetes, it is the ingress controller's responsibility to handle that situation properly, 2) it would be highly unlikely for people to implement their own custom ingress controller around a given proxy (it's actually fairly complicated), and 3) the pod theory wouldn't explain the latency spikes seen on reconfiguration.
But you're right that there should probably be some explanation of why we think this is happening (I just didn't want to speculate too much; I suspect the issue is with the hitless-reload implementation in the proxies, which is tricky to do well).
> We measure latency for 10% of the requests, and plot each of these latencies individually on the graphs.
So, for what it's worth, these spikes may well be single requests that aren't representative and were only triggered by the way the Kubernetes cluster was being manipulated for the test.
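A toy sketch (hypothetical numbers, not from the benchmark) of why plotting individually sampled latencies makes a single slow request show up as a dramatic spike, while aggregate percentiles barely move:

```python
import random

random.seed(0)
# Simulate 10,000 request latencies in ms: mostly fast, one slow outlier
# (e.g. a request that hit a proxy mid-reload).
latencies = [random.uniform(1.0, 5.0) for _ in range(10_000)]
latencies[1230] = 950.0

# "We measure latency for 10% of the requests" -- sample every 10th request.
sample = latencies[::10]

def p99(xs):
    """99th-percentile latency of a list of samples."""
    xs = sorted(xs)
    return xs[int(0.99 * len(xs)) - 1]

# On a scatter plot of individual samples, the 950 ms point is a lone spike,
# yet the aggregate p99 stays in the single-digit-millisecond range.
print(f"max sampled latency: {max(sample):.1f} ms, p99: {p99(sample):.2f} ms")
```

The design point: a scatter of raw sampled latencies surfaces every captured outlier, whereas percentile summaries hide a single slow request entirely, so the visible "spikes" can each be just one request.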