I am starting to think you can handle all storage requests for a single logical node on just one core/thread. I have been pushing 5~10 million JSON-serialized entities to disk per second with a single managed thread in .NET Core (using a Samsung 970 Pro for testing). This includes indexing and sequential integer key assignment. This testing completely saturates the drive (over 1 gigabyte per second steady-state). Even getting a 64-bit integer incremented a million times per second across an arbitrary number of threads is a big ask. This is the difference you see when you double down on single-threaded ideology for this type of problem domain.
The technical trick to my success is to run all of the database operations in micro batches (10~1000 microseconds per). I use LMAX Disruptor, so the batches are formed naturally based on throughput conditions. Selecting data structures and algorithms that work well in this type of setup is critical. Append-only is a must with flash and makes orders of magnitude difference in performance. Everything else (b-tree algorithms, etc) follows from this realization.
Put another way, if you find yourself using Task or async/await primitives when trying to talk to something as fast as NVMe flash, you need to rethink your approach. The overhead of multiple threads, task-parallel abstractions, et al. is going to cripple any notion of high throughput in a synchronous storage domain.
For Kafka, we have multiple indexes - a time index and an offset index, which are simple metadata. The trouble is how you handle decompression+checksumming+compression for supporting compacted topics. ( https://github.com/vectorizedio/redpanda/blob/dev/src/v/stor... )
So a single core starts to get saturated while doing both foreground and background requests.
.....
Now assume that you handle that with correct priorities for IO and CPU scheduling.... the next bottleneck will be keeping up w/ background tasks.
So then you start to add more threads. But as you mentioned, and what I tried to highlight in that article, the cost of implicit or simple synchronization is very expensive (as noted by your intuition).
The thread-per-core buffer management with defer destructors is really handy for doing 3 things explicitly:
1. your cross-core communication is explicit - that is, you give it shares as part of a quota so that you understand how priorities are working across the system for any kind of workload. This is helpful for prioritizing foreground vs. background work.
2. the memory address is effectively const once you parse it - so you treat it as largely immutable and you can add hooks (say, crash if modified on a remote core)
3. makes memory accounting fast. i.e.: instead of pushing a global barrier for the allocator, you simply send a message back to the originating core for allocator accounting. This becomes hugely important as you increase the number of cores.
gzip will cap at around 1 MB/s with the strongest compression setting and 50 MB/s with the fastest setting, which is really slow.
The first step to improve Kafka is for Kafka to adopt zstd compression.
Another thing that really hurts is SSL. A desktop CPU with AES instructions can push 1 GB/s, so it's not too bad, but that may not be the CPU you have or the default algorithm used by the software.
That said, there are a lot of other simultaneous considerations at play when we are talking about punching through business entity storage rates that exceed the rated IOPS capacity of the underlying storage medium. My NVMe test drive can only push ~500k write IOPS in the most ideal case, but I am able to write several million logical entities across those operations due to batching/single-writer effects.
I'm curious what about your use required implementing your own storage subsystem rather than using an embedded key value store like RocksDB.
Ended up with a system where each thread accumulated results in small buffers and appended pointers to those buffers to a shared "buffer list", which was very fast due to low contention using a typical spinlock+mutex combo.
The thread that overflowed the buffer list would then become the single writer by taking on the responsibility to accumulate the results to the shared output image. It would start by swapping in a fresh list, so the other threads could carry on.
The system would self-tune by regulating the size of the shared buffer list so that the other threads could keep working while the one "writer thread" accumulated.
Probably had room for improvement, but after this change it scaled almost linearly to at least 32 cores, which was the largest system available for testing at the time.
The reason for not simply allocating a full output image per thread and accumulating post-render was mainly the memory requirements for large output images.
Another thing to consider is that you lose significant performance in a few different dimensions if your storage I/O scheduler design is not tightly coupled to your execution scheduler design. While it requires writing more code it also eliminates a bunch of rough edges. This alone is the reason many database-y applications write their own storage engines. For people that do it for a living, writing an excellent custom storage engine isn't that onerous.
RocksDB is a fine choice for applications where performance and scale are not paramount or your hardware is limited. On large servers with hefty workloads, you'll probably want to use something else.
A secondary requirement is extreme simplicity and safety. My entire implementation is written in managed code and can be understood by a junior developer in one weekend. There is not a single line of code in support of a database feature that we aren't actually going to use.
The final requirement is zero external cost to employ this code. If I own my database implementation, Oracle cannot bill me.
The nice-to-have is being able to follow a breakpoint all the way from user tapping a button down into the b-tree rotation condition logic in the database engine. It also makes profiling performance issues a trivial affair. I like being able to see the actual code in my database engine that is causing a hotpath. This visibility is where additional innovation is possible over time.
Correct me if I’m wrong, but the only number I can find is a guarantee that you do not exceed 500 us of latency when handling a request. And it’s not clear if this is a guarantee at all, since you say only that the system will throw a traceback in case of latency spikes.
I would have liked to see how latency varies under load, how much throughput you can achieve, what the latency long tail looks like on a long-running production load, and comparisons with off-the-shelf systems tuned reasonably.
In Seastar you rewrite a plain loop like

  for (... : collection) {}

as

  return seastar::do_for_each(collection, callback);

Last, a coordinator core for one of the TCP connections from a client will likely make requests to remote cores (say you receive a request on core 44, but the destination is core 66), so having a thread per core with explicit message passing is pretty fundamental.

Here is some code that shows the importance of accounting for the x-core comms explicitly:

  ss::future<std::vector<append_entries_reply>>
  dispatch_hbeats_to_core(ss::shard_id shard, hbeats_ptr requests) {
      return with_scheduling_group(
        get_scheduling_group(),
        [this, shard, r = std::move(requests)]() mutable {
            return _group_manager.invoke_on(
              shard,
              get_smp_service_group(),
              [this, r = std::move(r)](ConsensusManager& m) mutable {
                  return dispatch_hbeats_to_groups(m, std::move(r));
              });
        });
  }

ps. the redpanda link from the article is broken, goes to https://vectorized.io/blog/tpc-buffers/vectorized.io/redpand... 404