Several years ago, a few people noticed that if you attack the problem of scalable distribution within a single server the same way you would in a large distributed system (e.g. with a shared-nothing architecture), you can realize huge performance gains on a single machine. The caveat is that the resulting software architectures look unorthodox.
The general model looks like this:
- one process per core, each locked to a single core
- use pinned, node-local RAM only (avoiding cross-NUMA memory traffic)
- a dedicated per-core network queue, accessed directly from user space (kernel bypass)
- direct storage I/O (kernel bypass)
If you do it right, you minimize the amount of silicon shared between processes, which has surprisingly large performance benefits. Linux has facilities (CPU affinity, NUMA memory policies, direct I/O) that make this relatively straightforward too.
As a consequence, adjacent cores on the same CPU have only marginally more interaction with each other than cores on different machines entirely. Treating a single server as a distributed cluster of one-core machines, and writing the software so that operating system behavior reflects that model as far as possible, is a great architecture for extreme performance, but you rarely see it outside of closed-source software.
As a corollary, garbage-collected languages do not work for this at all: a collector introduces shared state and unpredictable pauses that undermine the per-core isolation the model depends on.