Containers and Distributed Systems: Where They Came from and Where They’re Going (opens in new tab)

(mesosphere.com)

288 pointsflorianleibert8y ago41 comments

41 comments

This was a lot of fun, one of the things that doesn't get much air time is that back in the early 2000's when "clusters" and "NUMA SMP machines" were competing with each other the big argument for large SMP iron was ACID compliant SQL databases like Oracle. Now that Google has implemented an ACID compliant SQL database across clusters it puts the final nail in the argument (for me at least) that "Some things only work on SMP machines"

0xbear8y ago

I actually thing "NUMA SMP machines" are about due for a renaissance with the advent of extreme multicore CPUs. Next year we will see CPUs that run 32 cores (64 hyperthreads) per socket, and don't cost an arm and a leg. Current max is 28 cores per socket, at a staggering $13K per chip.

With relatively economical 64 core / 128 thread options with nearly unlimited RAM capacity appearing, a lot more workloads will "fit". Needless to say, single-node systems are much easier/cheaper/faster to get right.

ChuckMcM8y ago

The irony of clusters of many core NUMA SMP machines is not lost on me :-)

If things follow the previous patterns that will open up the market to a single/dual core, large memory, machine with a high speed networking and storage ports that is lower cost and lower power.

peterwwillis8y ago

Some things do only work on SMP, because for some workloads no network fabric (that I know of) is as fast as the system bus. It's the classic Beowulf vs Mainframe problem: some things can be split up into discrete parts, and some things have to share so much that bandwidth is the bottleneck. Then you go from SMP to NUMA because your bus is too slow. But both are still faster than most (all?) networks.

A lot of this also centers around how apps are designed. If you can rewrite it you can pick your architecture. But it's hard to take a pre-existing one and slap it on a new architecture and make it just scale.

Mosix attempted to have a magical SSI that just scaled single pre-existing apps (even threaded shared-memory ones) across machines with no extra work, but it's got such a hard set of problems it never caught on. Now we just add intermediate abstractions until we can jam an app into whatever architecture we have.

ChuckMcM8y ago

It is exactly the Beowulf vs Mainframe problem. And it is this: "Some things do only work on SMP, because for some workloads no network fabric (that I know of) is as fast as the system bus." is under siege.

From a systems architecture point of view it is a really interesting exploration of Amdahl's law. So many things that people said "You'll never do that on a networked cluster of machines." have fallen (data bases being one of the larger ones). And while it used to be mainframes won on I/O channel capacity, Google and others have shown that when you can parallelize the I/O channels effectively that advantage goes away as well.

kuschku8y ago

Indeed, the trend is clear, but for now, nothing has changed.

Google's implementation is not very helpful to all those companies, individuals, NGOs, and governments that have to follow privacy laws, HIPAA, etc though, because Google's implementation isn't open source, and these entities can't use Google Cloud. Or don't want to.

Until we get an open source solution for this, SMP machines will be useful.

And even then, you save money by having less and larger machines in your cluster than just having tiny ones. Larger machines means your overhead is reduced.

numbsafari8y ago

Uhhhh... most orgs that run HIPAA workloads do so slavishly on Windows and EPIC, neither of which are famously what most would consider open source.

1 more reply

jabl8y ago

There's cockroachdb which is roughly open source spanner you can deploy on your own systems.

1 more reply

Pyxl1018y ago

Distributed ACID transactions across multiple DBs or clusters has been possible for a while if you have been willing to pay the latency penalty for the transaction to complete, hasn't it? As far as I know, Google systems have to pay this latency cost too. From an analysis of Spanner and Calvin [1]:

> The cost of two-phase commit is particularly high in Spanner because the protocol involves three forced writes to a log that cannot be overlapped with each other. In Spanner, every write to a log involves a cross-region Paxos agreement, so the latency of two-phase commit in Spanner is at least equal to three times the latency of cross-region Paxos.

I suspect that Google's innovation in making these cross-region or cross-cluster transactions viable is partly in their network: having an extremely reliable network with consistent low latency, a lot of which allegedly uses first-party network fiber between disparate locations. This is touched on in a paper that Google published [4]:

> Many assume that Spanner somehow gets around CAP via its use of TrueTime, which is a service that enables the use of globally synchronized clocks. Although remarkable, TrueTime does not significantly help achieve CA; its actual value is covered below. To the extent there is anything special, it is really Google’s wide-area network, plus many years of operational improvements, that greatly limit partitions in practice, and thus enable high availability.

I'm also not sure that technologies like Spanner necessarily address the same use-case as SMP machines. Aren't those use-cases concerned with achieving the highest possible transaction throughput for transactions that are potentially highly contentious and require low latency? Systems that solve that problem today still largely look like Oracle DBs from 2000 -- or a fleet of replicas reading executing the same state machine on a transaction log.

Transactions against Spanner likely have high latency compared to traditional DBs[2]:

> On the downside to Spanner, Zweben noted, that the price of Spanner’s high availability is latency at levels unacceptable in some applications. If you’re executing banking transactions, like in airline reservations, and you can afford a minute of delay while you make sure they’re committed in multiple data centers, Spanner’s probably a good database for that, he said.

> Zweben says, however, if you’ve got a customer-service app that requires you to be extraordinarily responsive over millions of transactions and at the same time analyze profitability of a customer and maybe train a machine learning model, those simultaneous transactional and analytical requirements are better for a database [...]

One benchmark I found suggests that read and update latencies can be quite high [3]. The graphs are missing units, but it seems to suggest update latency is higher than with traditional DBs. I've had difficulty finding other benchmarks just now, but the one other I found reported the same finding [5]:

> An interesting pattern above is that queries 14-18, which are all updates, perform with higher latency on Spanner than the easy selects and non-bulk inserts.

It's definitely innovative technology in that it allows you to scale read-oriented SQL workloads to a far greater degree, with better availability, but from what I've read it does not tackle all of the same problems and use-cases that traditional relational DBs are best at solving: high transaction throughput and low latency for contentious data sets.

[1] https://fauna.com/blog/spanner-vs-calvin-distributed-consist...

[2] https://thenewstack.io/google-cloud-spanner-view-field/

[3] https://www.nuodb.com/techblog/benchmarking-google-cloud-spa...

[4] https://research.google.com/pubs/pub45855.html

[5] https://quizlet.com/blog/quizlet-cloud-spanner

peterwwillis8y ago

> I suspect that Google's innovation in making these cross-region or cross-cluster transactions viable is partly in their network

I think this understates it. Production distributed systems suck balls with unreliable infrastructure, and anyone who has ever tried to do realtime replication of all of their data (and lots of it) across the internet knows how completely unrealistic it is. It's like running a power cord across a highway to power a refrigerator.

1 more reply

bogomipz8y ago

>"back in the early 2000's when "clusters" and "NUMA SMP machines" were competing with each other"

Did you mean Numa vs SMP? Or something else maybe? How can a machine be a NUMA SMP? NUMA and SMP are fundamentally different architectures.

ChuckMcM8y ago

No, I was thinking non-uniform memory architecture, which is to say an SMP machine where the "speed" at which you can access RAM is dependent on the core or 'thread' from which you accessed it. Lack of memory uniformity was the compromise to achieve larger effective address spaces and "simple" programming.

Today on a 24/48 core dual socket server you'll see the same sorts of thing, having a core using memory on the 'other' physical chip's memory bus will impact the overall performance significantly.

In the 2000's, before multi-core chips were a thing, there were two major camps, the 'super computer' camp, and the 'cluster' camp.

The 'super computer' camp insisted on cache coherent memory between all of the cores or threads. You got these very expensive fabrics from people like cray that would snoop access to memory from the cores and send coherency messages around to insure that if someone wrote something in to memory somewhere, everyone else's L1 or L2 cache got the message to invalidate what they were holding (shoot downs). These machines are very expensive and take months to build.

The 'cluster' camp said, "We can use a network fabric and just parameterize shared memory usage." So they put together independent machines connected by a network fabric and no cache coherency protocol. If you wanted to use shared memory you could build something like memcached and wrap your access with network calls. With that architecture even if it took twice as many cores to do what you wanted to do, the price of the machine was one tenth what it was for the big SMP machine.

For something that was trivially parallellizable like internet search or serving up web sites, that was a much more cost effective way to go. When people started doing stuff that they previously used 'super computers' and big SMP machines for on these Linux clusters it became a sort of race to pull apart these problems into "shared nothing" clusters.

3 more replies

zurn8y ago

Is this clear cut in the terminology? A plausible definition would also be that it's still symmetric MP if remote memory access has non-uniform performance - since the nodes and their memories are symmetrical. After all, you get that just with caches and 2 sockets plugged to the same DRAM.

The historical opposite of SMP used to be asymmetric multiprocessors in the heterogenous sense - different kinds of processors, for example scalar/vector/io processors. Or for a modern day take, ARM SoCs with little low-power cores and faster & more power hungry cores, and GPUs thrown in for good measure.

1 more reply

mathattack8y ago

Good to see you outside the context of HN. You're an immense value to the community.

godelmachine8y ago

Hi Chuck, by SMP do you mean Symmetric Multi Processing? Sincerely,

ChuckMcM8y ago

Yes. All threads have equal computation capability and cache coherent (although not uniform) access to main memory. This was the 'big iron' of the day like the Sun, SGI, and Cray "big iron" boxes.

chubot8y ago

Just to double-click on container technology, why do you think it took so long for this technology that was built into Solaris to become mainstream with Docker?

I think VMWare deserves a mention here? And the terms OS Virtualization vs Hardware virtualization do as well (Ctrl-F doesn't find them.)

For awhile hardware virtualization (VMWare) was more prevalent, but it's more complex and has more overhead than OS virtualization (containers). That is how people solved the problem of having powerful machines and small workloads (or workloads with a lot of variance).

Although historically it might have been that hardware virtualization actually came first, in IBM mainframes. In the Unix world I guess OS virtualization came first.

kakwa_8y ago

I think docker succeeded not because it was a better container or virtualization solution, or because it was lighter than VMs. Jails or Solaris zones have existed for years or decades, and on Linux, we had openvz and vserver long before Docker. And I'm not even mentioning plain old chroot (with indeed some security issues).

I didn't follow the latest development, but for a while, Docker was not even that great in term of robustness and stability. I never was a big fan of the userland proxy for example. And from what I've read, upgrading from one docker version to another could be quite painful (disclaimer: I toyed with docker a little, I haven't used it in production yet).

IMHO, Docker succeeded because it brought a complete ecosystem around containers:

* Ways to generate the container image through Dockerfile

* Ways to compose an image on top of another (itself on top of another, etc...)

* Ways to share and distribute these images with Docker Registry

cat1998y ago

"You basically came up with Docker before Docker was around, or at least with things like Solaris zones and C groups."

Except jails already existed on FreeBSD, and CP/CMS on mainfraimes in the 70s existed long before this...

bpicolo8y ago

Docker won the UX game, which mattered the most here. :)

havetocharge8y ago

It only mattered the most because it was thought of last. If it was UX that increased impact by improving access, then it indeed mattered a great deal.

msarrel8y ago

The article says " It turned out that about twenty years earlier IBM did sort of solve that problem with partition isolation, but they did that on custom mainframe hardware and architecture."

msla8y ago

Right. I remember looking at technologies for virtualization on x86 chips before actual virtualization hardware existed on those chips, and they were pretty heavyweight, including rewriting the binary to replace privileged opcodes which would fail silently or behave differently in ring 3 compared to ring 0 with sequences of opcodes which would work to maintain the isolation. It was closer to emulation than virtualization given how much work the host needed to do to maintain the illusion that the guest OSes were running on bare hardware, and it really pointed up how the x86 architecture of the time just wasn't designed for that kind of thing.

For the past decade or so, mainframe-level virtualization on x86 has been possible, at least, starting on high-end chips and working its way downwards to the commodity world. And, yes, it exists in ARM as well.

https://genode.org/documentation/articles/arm_virtualization

LinuXY8y ago

Minus CP/CMS and OpenVZ, here's an article on the concepts and tech Docker is built on. https://www.kentik.com/the-evolution-of-docker/

naasking8y ago

Containerization long predates Solaris, even on commodity hardware. Capability operating systems dating back to the 60s and 70s support even more extreme isolation by default.

swizydo8y ago

yea indeed

j / k navigate · click thread line to collapse

41 comments

ChuckMcM8y ago

0xbear8y ago

ChuckMcM8y ago

The irony of clusters of many core NUMA SMP machines is not lost on me :-)

If things follow the previous patterns that will open up the market to a single/dual core, large memory, machine with a high speed networking and storage ports that is lower cost and lower power.

peterwwillis8y ago

ChuckMcM8y ago

kuschku8y ago

Indeed, the trend is clear, but for now, nothing has changed.

Until we get an open source solution for this, SMP machines will be useful.

And even then, you save money by having less and larger machines in your cluster than just having tiny ones. Larger machines means your overhead is reduced.

numbsafari8y ago

Uhhhh... most orgs that run HIPAA workloads do so slavishly on Windows and EPIC, neither of which are famously what most would consider open source.

1 more reply

jabl8y ago

There's cockroachdb which is roughly open source spanner you can deploy on your own systems.

1 more reply

Pyxl1018y ago

Transactions against Spanner likely have high latency compared to traditional DBs[2]:

> An interesting pattern above is that queries 14-18, which are all updates, perform with higher latency on Spanner than the easy selects and non-bulk inserts.

[1] https://fauna.com/blog/spanner-vs-calvin-distributed-consist...

[2] https://thenewstack.io/google-cloud-spanner-view-field/

[3] https://www.nuodb.com/techblog/benchmarking-google-cloud-spa...

[4] https://research.google.com/pubs/pub45855.html

[5] https://quizlet.com/blog/quizlet-cloud-spanner

peterwwillis8y ago

> I suspect that Google's innovation in making these cross-region or cross-cluster transactions viable is partly in their network

1 more reply

bogomipz8y ago

>"back in the early 2000's when "clusters" and "NUMA SMP machines" were competing with each other"

Did you mean Numa vs SMP? Or something else maybe? How can a machine be a NUMA SMP? NUMA and SMP are fundamentally different architectures.

ChuckMcM8y ago

Today on a 24/48 core dual socket server you'll see the same sorts of thing, having a core using memory on the 'other' physical chip's memory bus will impact the overall performance significantly.

In the 2000's, before multi-core chips were a thing, there were two major camps, the 'super computer' camp, and the 'cluster' camp.

3 more replies

zurn8y ago

1 more reply

mathattack8y ago

Good to see you outside the context of HN. You're an immense value to the community.

godelmachine8y ago

Hi Chuck, by SMP do you mean Symmetric Multi Processing? Sincerely,

ChuckMcM8y ago

Yes. All threads have equal computation capability and cache coherent (although not uniform) access to main memory. This was the 'big iron' of the day like the Sun, SGI, and Cray "big iron" boxes.

chubot8y ago

Just to double-click on container technology, why do you think it took so long for this technology that was built into Solaris to become mainstream with Docker?

I think VMWare deserves a mention here? And the terms OS Virtualization vs Hardware virtualization do as well (Ctrl-F doesn't find them.)

Although historically it might have been that hardware virtualization actually came first, in IBM mainframes. In the Unix world I guess OS virtualization came first.

kakwa_8y ago

IMHO, Docker succeeded because it brought a complete ecosystem around containers:

* Ways to generate the container image through Dockerfile

* Ways to compose an image on top of another (itself on top of another, etc...)

* Ways to share and distribute these images with Docker Registry

cat1998y ago

"You basically came up with Docker before Docker was around, or at least with things like Solaris zones and C groups."

Except jails already existed on FreeBSD, and CP/CMS on mainfraimes in the 70s existed long before this...

bpicolo8y ago

Docker won the UX game, which mattered the most here. :)

havetocharge8y ago

It only mattered the most because it was thought of last. If it was UX that increased impact by improving access, then it indeed mattered a great deal.

msarrel8y ago

The article says " It turned out that about twenty years earlier IBM did sort of solve that problem with partition isolation, but they did that on custom mainframe hardware and architecture."

msla8y ago

https://genode.org/documentation/articles/arm_virtualization

LinuXY8y ago

Minus CP/CMS and OpenVZ, here's an article on the concepts and tech Docker is built on. https://www.kentik.com/the-evolution-of-docker/

naasking8y ago

Containerization long predates Solaris, even on commodity hardware. Capability operating systems dating back to the 60s and 70s support even more extreme isolation by default.

swizydo8y ago

yea indeed

j / k navigate · click thread line to collapse