If you don't believe me, take it from someone who should know what they're talking about: https://twitter.com/kelseyhightower/status/96341350830081229...
What a lot of naysayers leave out, or choose to ignore, is that the challenges of running stateful apps on Kubernetes mirror those of running stateful apps anywhere. If you run Postgres on a VM, for example, you're completely reliant on that VM staying up -- this is no different on Kubernetes. Some will also point out the dangers of co-locating software like Postgres on the same machine as many other containers, since they compete for CPU and I/O; but this risk exists on any shared machine, and Kubernetes provides plenty of tools (affinities/anti-affinities, node selectors) to isolate containers onto dedicated machines. And so on. Containers bring some new challenges, but Kubernetes meets them quite well.
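To make the isolation point concrete, here is a minimal sketch of what those tools look like: a hypothetical Postgres pod that pins itself to dedicated database nodes via a nodeSelector and refuses to schedule next to another Postgres pod (the `workload-class` label is an assumption, not a built-in):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  nodeSelector:
    workload-class: database      # hypothetical node label for dedicated DB machines
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: postgres
          topologyKey: kubernetes.io/hostname   # never two postgres pods on one node
  containers:
    - name: postgres
      image: postgres:10
```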
What specific issues do you have? I'm not sure I understand the point about routing. I also don't understand what the "pain" of stateful sets refers to.
1. While "we already rely on the VM staying up", with k8s we rely on both the VM staying up and the Kubernetes infra on top of that VM staying up.
2. Maintaining a complex stateful system on k8s _requires_ having and maintaining an operator for that system.
3. You reduce your options when it comes to tweaking systems. E.g. local SSDs on GCP are available in SCSI and NVMe flavors, while GKE supports only SCSI; fine-tuning and other tasks on the underlying VMs that would be trivial with Chef or similar become harder.
4. Enterprise systems like Splunk explicitly state that their support does not cover Splunk clusters running on Kubernetes.
5. As mentioned, you can't even resize a disk without going through a dance of operations that can take days or weeks when you're working with something like Kafka at scale.
6. Some stateful services, like Zookeeper, require stable identities, and this is far from perfect on Kubernetes.
7. More complex traffic routing that involves additional fees, because to achieve (6) you sometimes need to expose things publicly.
That's just off the top of my head.
Disclaimer: We run 10+ stateful services on Kubernetes.
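On the disk-resize point (5): newer Kubernetes versions can expand a PVC in place if the StorageClass opts in, though the feature was alpha/beta for a long time and some volume types still need a pod restart for the filesystem to grow -- which is exactly the multi-day dance at Kafka scale. A sketch (class and claim names are assumptions):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: resizable-ssd            # hypothetical class name
provisioner: kubernetes.io/gce-pd
allowVolumeExpansion: true       # without this flag, PVCs of this class cannot grow
---
# Later, grow an existing claim by bumping the request; the controller
# resizes the underlying disk, and the filesystem is expanded online
# or on the next mount, depending on the volume type.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kafka-data-0             # hypothetical claim name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: resizable-ssd
  resources:
    requests:
      storage: 2Ti               # previously 1Ti
```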
https://twitter.com/kelseyhightower/status/96347131657256140...
He said "You still need to worry about database backups and restores. You need to consider downtime during cluster upgrades."
These things are totally true. K8s doesn't automate backups by default (though it can), and if you need to take K8s down for upgrades, then everything is down. For its part, though, CockroachDB supports rolling upgrades with no downtime on Kubernetes.
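The rolling-upgrade mechanics here are plain StatefulSet features: a RollingUpdate strategy plus an optional partition, so you can canary the new version on one pod before rolling the rest. A sketch, with names and image tags assumed:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cockroachdb              # name assumed
spec:
  serviceName: cockroachdb
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2               # only pods with ordinal >= 2 get the new image;
                                 # lower the partition step by step to roll the rest
  selector:
    matchLabels:
      app: cockroachdb
  template:
    metadata:
      labels:
        app: cockroachdb
    spec:
      containers:
        - name: cockroachdb
          image: cockroachdb/cockroach:v2.0.0   # bump this tag to upgrade
```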
As for routing, that is a tough problem if you want to run K8s across multiple regions, though we have some folks who've done it.
And if one finds setting up StatefulSets challenging, we have a tutorial on how to do it written by a former Kubernetes engineer: https://www.cockroachlabs.com/docs/stable/orchestrate-cockro...
There are projects that help you run databases on Kubernetes and also back up much of what's hosted there:
- Automatic CephFS for your cluster -> https://rook.io/docs/rook/master/
- Backups for cluster resources and volumes -> https://github.com/heptio/ark
- Spin up dynamic postgres clusters -> https://github.com/zalando-incubator/postgres-operator
Databases are just applications with different resource needs. Please stop pushing the notion that they can't be run in containers or container orchestration systems. Databases are just programs. If the substrate running your containers doesn't reliably support flock or fsync or something else your database needs, then pick a better substrate that does -- container runtimes and Kubernetes don't stand in your way these days.
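If you want to sanity-check a substrate before trusting it with a database, the probes are cheap. A quick (Linux-only) Python sketch that verifies flock and fsync actually succeed on a given directory -- the function name is mine, not from any library:

```python
import fcntl
import os
import tempfile

def substrate_supports_locking_and_sync(directory: str) -> bool:
    """Probe whether files in `directory` support flock() and fsync().

    Some network/overlay filesystems return errors for one or both,
    which is a red flag for hosting a database there.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # exclusive advisory lock
        os.write(fd, b"probe")
        os.fsync(fd)                     # force the write to stable storage
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
        os.unlink(path)

print(substrate_supports_locking_and_sync(tempfile.gettempdir()))
```

Point it at the mount backing your data volume; a `False` there is worth knowing before the database finds out for you.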
They just don't act like other services, and require more care. That's about it. I think that's what Kelsey is referring to, you can't just treat them the same as other pods.
5% seems like a surprisingly large overhead. What is k8s doing in this situation that would have that kind of impact?
We haven't yet evolved Kubernetes services to prefer specific cores and avoid app workloads (although CPU management is getting closer).
Docker is also somewhat hefty memory-wise, and you may contend on disk if you're not careful.
5% seems pretty reasonable to me in general, just as a consequence of having something heavier weight on the same node managing workloads.
I'll note, though, that the 5% number is when using host networking for both Cockroach and the client load generator. Using GKE's default cluster networking through the Docker bridge is closer to 15% worse than running directly on equivalent non-Kubernetes VMs.
For example, if you ran CockroachDB on a bare-metal cluster of 3 nodes with 30TB of raw capacity, 15TB is lost to RAID10 and another 10TB to database-level replication, leaving you with 5TB of effective capacity -- a 1/6 dilution of your initial capacity.
If you ran CockroachDB on a replicated network volume with a replication factor of three, it gets worse. Of 30TB of disks, you'd lose 20TB to volume replication and ~6.67TB to CockroachDB replication, leaving you with ~3.33TB of effective capacity, or a 1/9 dilution. If those disks were also configured with RAID10, your effective capacity would drop to a 1/18 dilution.
You could achieve a 1/3 dilution -- the effective minimum for a database with replication factor three -- by skipping RAID and volume replication entirely, but you increase the impact of a disk failure, in that it would take much, much longer to recover the cluster.
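The arithmetic above, spelled out (all numbers from the comment: 30TB raw, RAID10 halves capacity, replication factor 3 at both the volume and database layer; the helper function is just for illustration):

```python
RAW_TB = 30.0

def effective(raw_tb, raid10=False, volume_rf=1, db_rf=3):
    """Effective capacity after each layer of redundancy."""
    usable = raw_tb
    if raid10:
        usable /= 2          # RAID10 mirrors every disk
    usable /= volume_rf      # replicated network volume (e.g. RF=3)
    usable /= db_rf          # database-level replication (CockroachDB default RF=3)
    return usable

# Bare metal + RAID10 + CockroachDB: 30 -> 15 -> 5 TB, a 1/6 dilution
print(effective(RAW_TB, raid10=True))                 # 5.0

# Replicated network volume (RF=3) + CockroachDB: ~3.33 TB, a 1/9 dilution
print(effective(RAW_TB, volume_rf=3))

# Both RAID10 and a replicated volume: ~1.67 TB, a 1/18 dilution
print(effective(RAW_TB, raid10=True, volume_rf=3))

# No RAID, no volume replication: 10 TB, the 1/3 floor for RF=3
print(effective(RAW_TB))                              # 10.0
```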
I understood that a team at google developed k8s but google doesn't actually run it for their "google-scale" workloads. Am I misinformed?
> [kubernetes is] a simplified clone of Google’s internal borg system
https://medium.com/@steve.yegge/honestly-i-cant-stand-k8s-48...
The "original" Service Fabric is a high-level framework which requires invasive source code changes (you can't just drop an existing app on top of it), but gives you lots of benefits (scale, reliability etc) if you make the effort.
Recently container-based platforms - Docker, Kubernetes, etc - have come along with a different tradeoff: better compatibility with existing applications in exchange for less magical benefits. That approach is getting much more traction, and I think internally at Microsoft there is some infighting between the "Service Fabric camp" and the "Containers camp". One consequence of the infighting is that Service Fabric is extending its scope to include features like "container support". It's not clear to what extent that is done in collaboration with the "container people", or as a way to bypass them. I think they are still trying to decide whether to embrace Kubernetes or replicate the functionality in-house. My prediction is that the container-based approach will win, but it will take time for the politics to fully play out. In the meantime things will continue to be confusing.
Bottom line: when evaluating Service Fabric, watch out for confusing and inconsistent use of the brand. It's a common pattern with large vendors - for example IBM with "Bluemix", SAP with "Hana", etc.
As zapita said, Service Fabric now handles containers but I think it is just because containers became trendy and FOMO kicked in.
Where Service Fabric is decades ahead of the container orchestration solutions is as a framework for building truly stateful services, meaning the state is entirely managed by your code through SF, not externalized to a remote disk, Redis, some DB, etc.
It offers high-level primitives like reliable collections [0], as well as very low-level primitives like a replicated log for implementing custom replication between replicas [1]. I feel this is not advertised enough publicly, which is unfortunate because it is a key differentiator for Service Fabric that competitors won't have for a while, if ever, because it takes a completely opposite approach: containers are all about isolation, being self-contained and platform-independent, while SF stateful services are deeply integrated with Service Fabric.
[0] https://docs.microsoft.com/en-us/azure/service-fabric/servic...
[1] https://docs.microsoft.com/en-us/dotnet/api/system.fabric.fa...