Broadly speaking, this is the correct style of architecture for a database engine on modern hardware. It is vastly more efficient in terms of throughput than the more traditional architectures common in open source. It also lends itself to elegant, compact implementations. I've been using similar architectures for several years now.
While I have not benchmarked their particular implementation, my first-hand experience is that these types of implementations are always at least 10x faster on the same hardware than a nominally equivalent open source database engine, so the performance claim is completely believable. One of my longstanding criticisms of open source data infrastructure has always been the very poor operational efficiency at a basic architectural level; many closed source companies have made a good business arbitraging the gross efficiency differences.
At some point several years ago, a few people noticed that if you attack the problem of scalable distribution within a single server the same way you would in a large distributed system (e.g. with a shared-nothing architecture), you can realize huge performance increases on a single machine. The caveat is that the resulting software architectures look unorthodox.
The general model looks like this:
- one process per core, each locked to a single core
- use pinned, NUMA-local RAM only (effectively eliminating cross-node memory traffic)
- direct dedicated network queue (bypass kernel)
- direct storage I/O (bypass kernel)
If you do it right, you minimize the amount of silicon that is shared between processes which has surprisingly large performance benefits. Linux has facilities that make this relatively straightforward too.
As a consequence, adjacent cores on the same CPU have only marginally more interaction with each other than cores on different machines entirely. Treating a single server as a distributed cluster of 1-core machines, and writing the software in such a way that the operating system behavior reflects that model to the extent possible, is a great architecture for extreme performance but you rarely see it outside of closed source software.
As a corollary, garbage-collected languages do not work for this at all.
Interesting. From: http://www.scylladb.com/technology/architecture/

"The Scylla design, right, is based on a modern shared-nothing approach. Scylla runs multiple engines, one per core, each with its own memory, CPU and multi-queue NIC."

Now my question is: how portable is Scylla in terms of NIC vendors?
Just to pull an example from memory, the ubiquitous Intel 82599 10GbE NIC silicon has up to 128 TX and RX queues in hardware. IIRC, these are bundled in pairs for direct access in virtualized environments, so in principle you could have 64 virtual cores each with their own dedicated physical hardware queue. This is almost certainly what they were talking about. That is the whole point of this feature in Ethernet silicon; it gives cores (virtual or physical) dedicated network hardware off a single NIC.
Personal pet-peeve of mine. Using "TPS" or "Transactions/sec" to measure something that is in no way transactional. Maybe ops/sec, reads/sec, updates/sec, or something...
Has it been through Jepsen yet?
All the external-facing parts of Scylla are the same as Cassandra's. That includes all the ring stuff and all the network protocols.
So you should expect similar cluster behavior.
I would expect nothing.
If their numbers were astounding with a 10-, 100-, or 1000-node cluster, they would have published numbers for such setups. I call shenanigans on a report that is purposely out of line with the expected use case.
There is nothing commodity about a server with 128GB RAM.
When you introduce other nodes, you get chatter and network traffic....
Modifications to the database software must be shared, yes, but your client application is outside the reach of the AGPL and can remain proprietary.
Same trend, by the way, is in Android development.
One of the latest benchmarks I've seen, "Comparison of Programming Languages in Economics" [1], covers pure number crunching with no IO at all and reports C++ being 1.91x to 2.69x faster than Java. So for any code involving IO, the relative speedup is going to be smaller still.
By replacing bad Java code with excellent, machine-aligned C++, a 10x speedup is possible.
[1] https://github.com/jesusfv/Comparison-Programming-Languages-...
Java has a ton of overhead that C++ doesn't. Each object has metadata which results in more "cold data" in the cache. Each object is a heap allocation (unless you're lucky enough to hit the escape analysis optimization), which again leads to less cache locality because things are distributed around memory. Then there's the garbage collector. Then bounds checking.
a) IO is such a large portion of the problem; b) Hypertable isn't just way, way faster.