Rainbow Deployments with Kubernetes | Better HN


38 comments

andrewaylett · 8y ago

It feels like it should be possible to fix the reconnect experience, especially for planned termination of the underlying container: if you ask the client to reconnect, rather than abruptly disconnecting it, the client could even wait until its new session is fully established before dropping the old one.

That doesn't take away from my appreciation of the pattern, though: I'm very much in favour of rolling releases forwards, rather than being limited to two colours.
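A minimal Python sketch of that make-before-break handoff, assuming a hypothetical `Session` object that stands in for a client's websocket plus upstream session. The replacement is opened first, and the old session is closed only once the new one reports ready; if the new one fails, the user simply stays on the old one.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Session:
    """Hypothetical stand-in for a client's websocket + upstream session."""
    backend: str
    ready: bool = False
    closed: bool = False

    def close(self) -> None:
        self.closed = True


def make_before_break(old: Session, connect: Callable[[], Session], log: List[str]) -> Session:
    """Open a replacement session and drop `old` only after the new one is ready."""
    try:
        new = connect()
    except ConnectionError:
        # Could not reach the new backend: keep serving on the old session.
        log.append("new session failed; keeping " + old.backend)
        return old
    if not new.ready:
        new.close()
        log.append("new session not ready; keeping " + old.backend)
        return old
    log.append("new session ready on " + new.backend)
    old.close()  # only now is the old connection released
    log.append("closed old session on " + old.backend)
    return new
```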

bdimcheff (OP) · 8y ago

We actually have this functionality as well: we can send a signal to the process which will cause it to display a message to the user asking them to reload to upgrade. I've seen a similar feature in Slack and Riot as well.

Making this handoff automatic is definitely possible as well, though we do want people to reload occasionally to get new client-side code.
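A minimal sketch of that signal-triggered reload prompt, assuming a POSIX process, SIGUSR1 as the signal, and a hypothetical JSON message shape (the thread doesn't say which signal or client protocol is actually used):

```python
import json
import signal
from typing import Callable, List

# Hypothetical message shape; the real client protocol isn't described here.
RELOAD_NOTICE = json.dumps({"type": "reload_requested", "reason": "new version deployed"})


class ReloadNotifier:
    """Keeps one send() callback per connected client and broadcasts a reload notice."""

    def __init__(self) -> None:
        self._senders: List[Callable[[str], None]] = []

    def register(self, send: Callable[[str], None]) -> None:
        self._senders.append(send)

    def broadcast(self) -> int:
        """Send the reload notice to every client; return how many were notified."""
        for send in self._senders:
            send(RELOAD_NOTICE)
        return len(self._senders)

    def install(self, signum: int = signal.SIGUSR1) -> None:
        """Broadcast whenever the process receives `signum` (SIGUSR1 is POSIX-only)."""
        signal.signal(signum, lambda _signum, _frame: self.broadcast())
```

With `install()` called at startup, running `kill -USR1 <pid>` against the process would push the notice to every registered client, and each client decides when to actually reload.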

specialist · 8y ago

I worked on an Autobahn-like thingie. The server side could tell clients to reconnect, which was useful for draining servers, etc.

I haven't checked to see if autobahn has this functionality.

https://crossbar.io/autobahn/

I agree. I can think of reasons for urgent updates - kernel security, software bugs, etc. - and this feels like it would lead their engineers to possibly support weeks-old versions of their systems. And they're running a multiple of the needed servers.

Question for the OP-

I haven't ever worked on chat services, so this may not be reasonable. Would it be possible to put some other termination endpoint in front of the service that maintains the persistent connections to the clients but makes swapping backend services more transparent?

So, for example, could you use nginx or HAProxy as the "termination" point for the chat connection, proxying back to the Kubernetes service endpoint, which in turn proxies to the real backend service? When you swap out the backend service, nginx/HAProxy would start forwarding to the new service transparently while still maintaining the long-lived connection with the client.

If this were doable, you'd only have to drain when you needed to swap out the proxy layer, which is likely a less frequent task, and that gives you more agility with your backend services.

bdimcheff (OP) · 8y ago

Yes, as lotyrin pointed out below, the backend is already effectively an XMPP <-> Websockets bridge. Every time a user logs in, one way or the other we need to establish a connection between the websockets backend and the XMPP backend. The backend does do other things besides simply proxy, and we certainly could separate out the part that needs to keep the state. This deployment strategy is effectively a way to avoid having to do that work, at least for now.

jkarneges · 8y ago

This is essentially how Pushpin (http://pushpin.org) works. It can hold a raw WebSocket connection open with a client, but it speaks HTTP to the backend server, and the backend can be restarted without the client noticing.

codegladiator · 8y ago

There is an nginx module for this functionality:

https://nchan.io/

I have used it before. Super easy to set up, even with Kubernetes.

It sounds like the product is already a proxy (Websocket -> XMPP) so I'm not sure what exactly they're deploying multiple times per day.

It also seems like they could have done simple blue/green and extended their websocket protocol and client to support a hand-off message ("hey, there's a new version of the proxy, reconnect in x seconds"), with idempotent messages. That would allow a fairly smooth schedule in which each client connects to the new version and then disconnects from the old one, spread gradually over minutes instead of hours, with neither sudden spikes nor any interruption for clients.
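The "gradually over some period of minutes" part is mostly arithmetic: spreading N reconnects evenly over a window of W seconds keeps the rate near N / W per second. For example, 10,000 clients over a 10-minute window is about 16.7 reconnects per second instead of 10,000 at once. A minimal sketch, assuming the server hands each client its delay in the reconnect message (the slot-plus-jitter scheme is one possible choice, not something described in the thread):

```python
import random
from typing import List


def reconnect_delays(n_clients: int, window_s: float, seed: int = 0) -> List[float]:
    """Spread n_clients reconnects evenly over window_s seconds, with jitter.

    Client i gets a random delay inside its own slot [i*slot, (i+1)*slot),
    so the reconnect rate stays near n_clients / window_s instead of spiking.
    """
    rng = random.Random(seed)
    slot = window_s / n_clients
    return [i * slot + rng.uniform(0, slot) for i in range(n_clients)]
```

Giving every client its own slot, rather than a fully random delay over the whole window, avoids accidental clumping; the jitter just keeps clients from reconnecting in lockstep.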

ah- · 8y ago

Or just a generic reconnect in case the websocket connection breaks for any reason really.

In practice you would probably want a pretty thin proxy layer, as you said, which then forwards requests to other services as needed... but you would still need to redeploy this proxy layer, and would therefore need a solution similar to the one proposed in the article.

Seems like the kind of thing that a Deployment should be able to manage on its own... some kind of DrainPolicy object maybe?

Also, if the previous ReplicaSet a Deployment is rolling past has several pods, maybe only some of them need to stay alive (maybe some drain sooner than others.)

Perhaps the whole endeavor should just be to make Pod drainage a bit more explicit than just terminationGracePeriodSeconds... perhaps letting a pod signal with a positive confirmation that it's shutting down (letting connections drain) and the rest of the k8s controllers can just leave it alone until it terminates itself.

Although really, I think a combination of setting terminationGracePeriodSeconds to unlimited and having a health check that ensures the pod doesn't get wedged and miss the termination signal (by checking that a pod status of "shutting down" corresponds to some property of the container, like a health endpoint reporting that shutdown is in progress) would be enough, and nothing else needs to be done. Basically, color me skeptical when they say:

"We used service-loadbalancer to stick sessions to backends and we turned up the terminationGracePeriodSeconds to several hours. This appeared to work at first, but it turned out that we lost a lot of connections before the client closed the connection. We decided that we were probably relying on behavior that wasn’t guaranteed anyways, so we scrapped this plan."

(This also depends on the container obeying the standard SIGTERM contract to properly drain connections but not accept new ones, which is pretty standard in most web servers nowadays.)
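A minimal Python sketch of that SIGTERM contract plus the "shutting down" health signal, assuming the server calls these hooks from its own connect/disconnect handling (the class and method names are hypothetical, not a Kubernetes or web-server API):

```python
import signal
import threading
import time


class Drainer:
    """Stop accepting new connections on SIGTERM, but keep serving existing ones."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active = 0
        self._draining = threading.Event()

    def try_accept(self) -> bool:
        """Call when a client connects; False means 'refuse, we are draining'."""
        with self._lock:
            if self._draining.is_set():
                return False
            self._active += 1
            return True

    def release(self) -> None:
        """Call when a client disconnects."""
        with self._lock:
            self._active -= 1

    def active(self) -> int:
        with self._lock:
            return self._active

    def start_draining(self, *_signal_args) -> None:
        self._draining.set()

    def health(self) -> str:
        """What a health endpoint could report, so 'shutting down' is observable."""
        return "draining" if self._draining.is_set() else "serving"

    def wait_drained(self, timeout_s: float, poll_s: float = 0.05) -> bool:
        """Block until no connections remain or timeout_s elapses."""
        deadline = time.monotonic() + timeout_s
        while self.active() > 0:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_s)
        return True

    def install(self) -> None:
        """Route SIGTERM to start_draining (must be called from the main thread)."""
        signal.signal(signal.SIGTERM, self.start_draining)
```

After SIGTERM the process would call `wait_drained` with a timeout below terminationGracePeriodSeconds and exit once it returns.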

bdimcheff (OP) · 8y ago

Yeah, I don't know why the terminationGracePeriodSeconds hacks didn't work. It could have been a different, unrelated factor that we didn't discover. It certainly could have been service-loadbalancer/HAProxy's fault rather than the termination grace period itself. I'm certainly happy to be proven wrong there.

smarterclayton · 8y ago

Not 100% sure about your scenario, but if you set a preStop hook that uses an exec handler, you can arbitrarily delay shutdown inside the grace period, because the kubelet won't terminate the container until preStop returns.

So if you set a 5 hour grace period, and a preStop hook that invokes a script that doesn’t return until all connections are closed (but which tells the container process not to accept new ones) you can control the drain rate.

There are some app-level smarts required, so that new connections are rejected and any proxies rebalance away from you. HAProxy does this in most cases, but the service proxy won't (in iptables mode).

If that’s not the behavior you’re seeing, please open a bug on Kube and assign me (this is something I maintain)
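A rough sketch of the kind of preStop drain script described above, written in Python and assuming Linux. It counts ESTABLISHED sockets on the app's port by parsing `/proc/net/tcp` and `/proc/net/tcp6` (these are per network namespace, so inside the pod they show the pod's own sockets) and returns only when the count reaches zero or a deadline passes. Telling the app to refuse new connections is assumed to happen separately by some app-specific means; `main` is a hypothetical entry point for the exec command, and its deadline should stay below the pod's grace period.

```python
import time
from typing import List

ESTABLISHED = "01"  # TCP state code used in /proc/net/tcp and /proc/net/tcp6


def count_established(port: int, proc_net_tcp: str) -> int:
    """Count ESTABLISHED sockets whose *local* port is `port`.

    In the /proc/net/tcp text format the second column is the local
    "ADDR:PORT" in hex and the fourth column is the state code.
    """
    count = 0
    for line in proc_net_tcp.splitlines()[1:]:  # skip the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port and fields[3] == ESTABLISHED:
            count += 1
    return count


def established_on_port(port: int) -> int:
    total = 0
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                total += count_established(port, f.read())
        except FileNotFoundError:  # e.g. IPv6 disabled
            pass
    return total


def wait_for_drain(port: int, deadline_s: float, poll_s: float = 10.0) -> bool:
    """Return True once no connections remain, False if the deadline passes first."""
    end = time.monotonic() + deadline_s
    while established_on_port(port) > 0:
        if time.monotonic() >= end:
            return False
        time.sleep(poll_s)
    return True


def main(argv: List[str]) -> int:
    """Hypothetical preStop entry point: main([PORT, DEADLINE_SECONDS])."""
    port, deadline_s = int(argv[0]), float(argv[1])
    return 0 if wait_for_drain(port, deadline_s) else 1
```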

I'm not sure what problem the author is solving. I might be misunderstanding something.

The author points out that the issue with blue/green (or any-number-of-colors) deploys is that they need 16 pods per color at all times (which in their case would end up being 128 pods) and 24 to 48 hours for each connection pool to drain.

But how is using a SHA instead of a COLOR any different? Unless I am missing something, if running 128 pods and waiting 24 to 48 hours for draining are the issues, then using SHAs instead of colors doesn't solve either of them.

You'll still need 16 pods and 24 to 48 hours per SHA deploy, and you're actually making it worse by not using fixed colors, since you have far more SHAs at your disposal.

thomaslangston · 8y ago

It appears the issue was running pods for deployment colors that aren't in use if they only deploy once a week, and this solves it because they clean them up regularly. It does nothing for the overhead of needing lots of pods to support a high number of deploys per week.

You could do the same with $Color; it just seems clearer, since people often think of $Color as static deployment infrastructure, whereas people are used to SHAs pointing to branches that are naturally cleaned up.

bdimcheff (OP) · 8y ago

The way this Rainbow Deploy works means that if I only deploy once a month, I only have a single "color" running for that whole month, plus a couple of days of overlap where there are 2. If I have blue/green, I have 2 colors running all month. If I have more fixed colors, even more. The SHA thing is just a convenient way of creating "colors" dynamically when we need to do a new deploy, without having to use a meaningless representation like "blue", "green", "taupe", "chartreuse", etc.
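A minimal sketch of creating those "colors" dynamically, assuming a hypothetical app name, a short-SHA image tag, and `app`/`version` labels (none of which are given in the thread). It builds a standard `apps/v1` Deployment object as a Python dict that could be serialized and applied; each deploy creates a new Deployment instead of updating an existing one.

```python
from typing import Any, Dict

APP = "chat-backend"  # hypothetical app name


def rainbow_deployment(sha: str, image_repo: str, replicas: int = 16) -> Dict[str, Any]:
    """Build an apps/v1 Deployment whose name and labels carry the git SHA.

    Older SHAs keep running (and keep their connections) until their
    Deployment is deleted, because nothing here touches them.
    """
    short = sha[:7]  # assumed: short SHA used for names, labels and image tag
    labels = {"app": APP, "version": short}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{APP}-{short}", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {"name": APP, "image": f"{image_repo}:{short}"}
                    ]
                },
            },
        },
    }
```

The `version` label is what makes each SHA's pods distinguishable, so a Service selector can point at exactly one of them while the others drain.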

rockostrich · 8y ago

I think their problem is that they need to keep old deploys alive for a long time because they have stateful long-running connections. So a blue/green deploy wouldn't work because their blue deploy has to stay around for at least a month after green is deployed. There's no difference between the SHA and COLOR, I think the choice to use a git hash is because it's the logical choice (instead of randomly choosing a color from a list).

erikrothoff · 8y ago

This was really interesting. I'm thinking about moving to Kubernetes and have wondered how to gracefully deal with websocket connections.

I'm curious, though: if the rollout was spread over a couple of hours, for example, why would reconnections be that big of a problem? We host 10,000+ websockets on a $20 VPS, and the Go server hosting them crashes from time to time. The surge of 10,000 reconnections right afterwards has never lasted more than a minute or so, so why is it so bad? Moments of peak load aren't that big of a deal, are they?

bdimcheff (OP) · 8y ago

In our case it's not the websockets that are the problem, it's the XMPP connection that each websocket connection creates. Logging in thousands of users takes several minutes. While a user reconnects, any conversations that the users are having with their website visitors are disrupted.

(work with OP on the same team) Basically there are a lot of other things that happen when a websocket connection is established and we don't necessarily have the capacity to handle that volume in a complete reconnect scenario, especially if the system is already near the daily load peak. We have hopes that autoscaling some things in the future will make it possible to handle peaks like this more gracefully.

derekperkins · 8y ago

This is a great use case for kube-metacontroller, which was introduced in the Day 2 keynote at KubeCon. With minimal work, you can replicate a Deployment or StatefulSet, but with custom update strategies.

Live demo: https://youtu.be/1kjgwXP_N7A?t=10m46s Code: https://github.com/kstmp/metacontroller

discordianfish · 8y ago

Nice pattern! You could throw in a HPA to automatically scale deployments to zero that aren't in use.

> So far we haven’t found a good way of detecting a lightly used deployment, so we’ve been cleaning them up manually every once in a while.

Am I missing something, or wouldn’t it be as “simple” as connecting to the running container and running netstat and conditionally killing the pod based on the number of connections? I bet you thought of that, so I’m curious why it didn’t work for you.

bdimcheff (OP) · 8y ago

One thing I didn't put in here that's also turned out to be useful: We can prerelease things relatively easily this way too. Each deployment has a git sha, and we can have a canary/beta/dogfood version that points at an entirely different sha.

_asummers · 8y ago

> We still have one unsolved issue with this deployment strategy: how to clean up the old deployments when they’re no longer serving (much) traffic.

Could probably solve this with a readiness probe / health check of sorts that is smart enough to know what low usage means.
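Whatever the measurement (netstat, `/proc/net/tcp`, or app metrics), the cleanup decision itself is small. A minimal sketch, assuming connection counts per deployment SHA have already been collected and that "low usage" means a fixed threshold; both the threshold and the function name are hypothetical policy choices, not something from the article:

```python
from typing import Dict, List


def deployments_to_delete(
    connections: Dict[str, int], live_sha: str, min_connections: int
) -> List[str]:
    """Pick old rainbow deployments that are no longer worth keeping.

    `connections` maps a deployment's SHA to its current open-connection
    count. The deployment the Service points at is never selected, even if
    it happens to be idle.
    """
    return sorted(
        sha
        for sha, count in connections.items()
        if sha != live_sha and count < min_connections
    )
```

A non-zero threshold accepts that the few remaining long-tail clients get a forced reconnect, which keeps the resulting spike small.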

bdimcheff (OP) · 8y ago

Yeah I think if restartPolicy were changeable at runtime, we could simply have the pods exit once their connections are drained enough. If we were to exit using the current strategy, they'd just be restarted by kube.

SEJeff · 8y ago

Or a horizontal pod autoscaler that can scale down to 0 replicas, perhaps?

Curious about the 24h-48h burndown...could it potentially be longer for you guys or is there some mechanism in place to force disconnection (and thus risk a spike) after some TTL?

bdimcheff (OP) · 8y ago

We force reconnects eventually, there just aren't that many people affected at that point. There's a very long tail of people keeping their browsers open for days, but it's only a handful of people.

drdrey · 8y ago

So... just a blue/green deployment with a 24h delay before deleting the old cluster?

sulam · 8y ago

I don't think so actually, it seems like they are having a series of old SHAs hanging out, not just one new and one old. I did have the reaction you did initially though and had to do some reading between the lines to come to the conclusion that this is not what they're doing, so you could be right!

45h34jh53k4j · 8y ago

Using the 6 hex digits of the git commit hash for color is genius. I really like this pattern!

Dang, now I want to figure out how to print my git-log commit hashes in colors based on the hashes themselves...
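One way to do that, as a small sketch: read the first 6 hex digits of the hash as an RGB triple and wrap the text in a 24-bit ANSI foreground escape. This assumes a terminal that supports 24-bit color; the function names are just for illustration.

```python
from typing import Tuple


def sha_color(sha: str) -> Tuple[int, int, int]:
    """Interpret the first 6 hex digits of a commit hash as an RGB color."""
    return int(sha[0:2], 16), int(sha[2:4], 16), int(sha[4:6], 16)


def colorize(sha: str, text: str = "") -> str:
    """Wrap text (default: the SHA itself) in a 24-bit ANSI foreground color."""
    r, g, b = sha_color(sha)
    return f"\x1b[38;2;{r};{g};{b}m{text or sha}\x1b[0m"
```

Feeding each line of `git log --format=%h` through `colorize` and printing the result would show every commit in "its" color; very dark hashes will be hard to read on a dark background.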

xir78 · 8y ago

TL;DR

You can drain stuff by changing a Service's selector but leaving the Deployment alone. Instead of changing a Deployment and doing a rolling update, create a new deployment and repoint the Service. Existing connections will remain until you delete the underlying Deployment.
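A minimal sketch of the "repoint the Service" step, assuming the Deployments carry hypothetical `app` and `version` labels (the thread doesn't give the real ones). It only builds the patch body; the resulting JSON string is the kind of argument `kubectl patch service <name> -p '<json>'` accepts. Per the TL;DR, the previous Deployment is left alone, so its pods keep their established connections until it is deleted.

```python
import json
from typing import Any, Dict


def service_selector_patch(app: str, sha: str) -> Dict[str, Any]:
    """Patch body that points a Service's selector at one SHA's pods.

    Only the selector changes; no Deployment is modified or deleted here.
    """
    return {"spec": {"selector": {"app": app, "version": sha[:7]}}}


def as_kubectl_arg(patch: Dict[str, Any]) -> str:
    """Compact JSON suitable for passing as a patch string."""
    return json.dumps(patch, separators=(",", ":"))
```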
