When you see advertised numbers like '200 GB/s', that is total memory bandwidth, i.e. all cores combined. For an individual core, the limit will still be around 6 GB/s.
This means even if you write a perfect parser, you cannot go faster. This limit also applies to (de)serializing data like JSON and Protobuf, because those formats must typically be fully parsed before a single field can be read.
If however you use a zero-copy format, the CPU can skip data that it doesn't care about, so you can 'exceed' the 6 GB/s limit.
The Lite³ serialization format I am working on aims to exploit exactly this, and is able to outperform simdjson by 120x in some benchmarks as a result: https://github.com/fastserial/lite3
E.g. dual-channel Zen 1 showing 25GB/s on a single core: https://stackoverflow.com/a/44948720
I wrote some microbenchmarks for single-threaded memcpy
zen 2 (8-channel DDR4)
naive c: 17GB/s
non-temporal avx: 35GB/s

Xeon-D 1541 (2-channel DDR4, my weakest system, ten years old)
naive c: 9GB/s
non-temporal avx: 13.5GB/s

apple silicon tests
(warm = generate new source buffer, memset(0) output buffer, add memory fence, then run the same copy again)

m3
naive c: 17GB/s cold, 41GB/s warm
non-temporal neon: 78GB/s cold+warm

m3 max
naive c: 25GB/s cold, 65GB/s warm
non-temporal neon: 49GB/s cold, 125GB/s warm

m4 pro
naive c: 13.8GB/s cold, 65GB/s warm
non-temporal neon: 49GB/s cold, 125GB/s warm
(I'm not actually sure offhand why Apple Silicon warm is so much faster than cold - the source buffer is filled with new random data each iteration, I'm using memory fences, and I still see the speedup with 16GB src/dst buffers much larger than cache. x86/Linux didn't show any cold/warm difference. My guess would be that it's something about kernel page accounting and not related to the CPU.)
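For reference, the non-temporal AVX variant is essentially a loop of streaming stores. A minimal sketch, assuming 32-byte-aligned buffers and a size that's a multiple of 32 (the actual benchmark code may differ):

#include <immintrin.h>
#include <stddef.h>

// Copy n bytes using non-temporal (streaming) stores.
// Assumes dst/src are 32-byte aligned and n is a multiple of 32.
static void memcpy_nt_avx(void *dst, const void *src, size_t n) {
    const __m256i *s = (const __m256i *)src;
    __m256i *d = (__m256i *)dst;
    for (size_t i = 0; i < n / 32; i++) {
        __m256i v = _mm256_load_si256(s + i);  // ordinary aligned load
        _mm256_stream_si256(d + i, v);         // store that bypasses the cache
    }
    _mm_sfence();  // make the streaming stores globally visible
}

Streaming stores avoid pulling the destination lines into cache before overwriting them, which is where much of the gain over a naive copy loop comes from once the buffers are far larger than cache.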
I really don't see how you can claim either a 6GB/s single core limit on x86 or a 20GB/s limit on Apple Silicon.

I suppose that in real life such ideal conditions do not occur, but it shows how badly the CPU is limited by its memory bandwidth for streaming tasks. Its maximum memory-read bandwidth is 768 bits per clock, only 60% of its peak bit-crunching throughput. DRAM bandwidth is even more limiting. And this is a single core out of at least 12 (and at most 64).
A single threaded Zen2 program could very well take 1 second to scan through your RAM, during which it's entirely trivial to read stuff from disk, so the modern practice of keeping a ton of stuff in RAM might actually be hurting performance.
Algorithms that scan the entire heap, such as garbage collection, whose single-threaded versions are probably slower than the naive Zen 2 memcpy, might run for more than a second even on a comparatively modest 16GB heap (16 GB at ~17 GB/s is already close to a second), which might not even be acceptable.
https://chipsandcheese.com/p/intels-clearwater-forest-e-core...
Are those numbers also measured while skipping the CPU cache?
What makes M-series have 3x the bandwidth (per core), over x86?
Apple are able to push a wider bus at higher frequencies because they aren't limited by signal integrity problems you encounter trying to do the same over sockets and motherboard traces. x86 CPUs like the Ryzen AI Max 395+, when packaged without socketed memory, are able to push equally wide busses at high frequencies.
The tradeoff is that it's non-upgradeable, but (contra some people who claim this is only a cash-grab by Apple to prevent RAM upgrades) it's worth it for the bandwidth.
> If however you use a zero-copy format, the CPU can skip data that it doesn't care about, so you can 'exceed' the 6 GB/s limit.
Of course the "skipping" is by cachelines. A cacheline is effectively a self-contained block of data from a memory throughput perspective: once you've read any part of it, the rest comes for free.
Samsung is selling NVMe SSDs claiming 14 GB/s sequential read speed.
Yes, those numbers are real, but only in very short bursts of strictly sequential reads; sustained speeds will be closer to 8-10 GB/s. And real workloads will be lower than that, because they contain random access.
On Linux the NVMe driver mostly just sets up DMA: the drive's controller writes the pages directly into host memory over the PCIe link, so it is not the CPU that is moving the data. Whenever the CPU is involved in any data movement, the 6 GB/s per core limit still applies.
You still have to load a 64-byte cache line at a time, and most CPUs do some amount of readahead, so you'll need a pretty large "blank" space to see these gains, larger than typical protobufs.
The ability to mutate serialized data allows for 2 things:
1) Services exchanging messages and making small changes each time e.g. adding a timestamp without full reserialization overhead.
2) A message producer can keep a message 'template' ready and only change the necessary fields each time before sending. As a result, serializing becomes practically free from a performance perspective.
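For illustration, here is the "template" pattern in plain C terms, with a made-up fixed flat layout (this is not Lite³'s actual API - Lite³ is schemaless - it just shows the general idea of building the serialized bytes once and patching only what changes):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Made-up flat wire layout; a real format would control padding/endianness.
struct wire_msg {
    uint32_t version;
    uint64_t timestamp_ns;  // the only field that changes per message
    uint8_t  payload[48];
};

int main(void) {
    struct wire_msg tmpl;
    memset(&tmpl, 0, sizeof tmpl);   // build the template once
    tmpl.version = 1;
    memcpy(tmpl.payload, "hello", 5);

    for (uint64_t now = 1; now <= 3; now++) {
        tmpl.timestamp_ns = now;     // patch only the changed field
        // send(sock, &tmpl, sizeof tmpl, 0);  // serialization cost is ~zero
        printf("msg with ts=%llu ready, %zu bytes\n",
               (unsigned long long)tmpl.timestamp_ns, sizeof tmpl);
    }
    return 0;
}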
If you want a schema, it must be enforced by the application through runtime type checking. All messages contain type information. Though I do see the value of adding pydantic-like schema checking in the future.
EDIT: Regarding message size, Lite³ does incur a message-size penalty for being schemaless and fully indexed at the same time. Though if you are using it in an RPC / streaming setting, this can be negated through brotli/zstd/dict compression.
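For many small messages the win mostly comes from a shared dictionary (ZSTD_compress_usingDict), but even the plain one-shot zstd API is only a few lines to bolt onto the send path. A minimal sketch (link with -lzstd):

#include <zstd.h>
#include <stdio.h>

int main(void) {
    // Stand-in for a serialized message; in practice this is the encoder's
    // output buffer.
    const char msg[] = "{\"attributes\":[1,2,3],\"name\":\"Bob\"}";
    size_t src_len = sizeof msg - 1;

    char dst[256];
    size_t n = ZSTD_compress(dst, sizeof dst, msg, src_len, 3 /* level */);
    if (ZSTD_isError(n)) {
        fprintf(stderr, "compress failed: %s\n", ZSTD_getErrorName(n));
        return 1;
    }
    printf("%zu bytes -> %zu bytes\n", src_len, n);
    return 0;
}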
Disclosure: I am the author.
That being said storing trees as serializable flat buffers is definitely useful, if only because you can release them very cheaply.
Which file formats allow partial parsing?
Imagine the following object:
{
"attributes": [ .. some really really long array of whatever .. ],
"name": "Bob"
}
In JSON, if you want to extract the "name" property, you need to parse the whole "attributes" array. However, if you encoded the size of the "attributes" array in bytes, a parser could look at the key "attributes", decide that it's not interested, and jump past it without parsing anything.

You'd typically want some kind of binary format, but for illustration purposes, here's an imaginary XML representation of the JSON data model which achieves this:
<object content-size-in-bytes="8388737">
  <array key="attributes" content-size-in-bytes="8388608">
    .. some 8 MiB large array of values ..
  </array>
  <string key="name" content-size-in-bytes="3">Bob</string>
</object>
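Sketched in C over a made-up length-prefixed binary layout (purely for illustration, not any particular format), skipping the 8 MiB "attributes" value costs one pointer bump instead of 8 MiB of parsing:

#include <stdint.h>
#include <string.h>
#include <stddef.h>

// Made-up layout: [1-byte key length][key][4-byte value length][value] ...
// Returns a pointer to the value for `key` (length in *out_len), or NULL.
// Bounds checks and endianness handling omitted for brevity.
static const uint8_t *find_field(const uint8_t *buf, size_t buf_len,
                                 const char *key, uint32_t *out_len) {
    size_t pos = 0, key_len = strlen(key);
    while (pos < buf_len) {
        uint8_t klen = buf[pos++];
        const uint8_t *k = buf + pos;
        pos += klen;
        uint32_t vlen;
        memcpy(&vlen, buf + pos, 4);
        pos += 4;
        if (klen == key_len && memcmp(k, key, klen) == 0) {
            *out_len = vlen;
            return buf + pos;   // found "name": value starts here
        }
        pos += vlen;            // not interested: jump past the whole value
    }
    return NULL;
}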
If this was stored in a file, you could use fseek to seek past the "attributes" array. If it was compressed, or coming across a socket, you'd need more complicated mechanisms to seek past irrelevant parts of the object.

In the case of append, the old value still lives inside the buffer but is zeroed out. This means that if you keep replacing variable-sized elements, over time the buffer will fragment. You can vacuum a message by recursively writing it from the root to a new buffer. This will clear out the unused space. This operation can be delayed for as long as you like.
If you are only setting fixed-size values like integers or floats, then the buffer never grows as they are always updated in-place.
Btw: in my project README I have benchmarks against Cap'N Proto & Google Flatbuffers.
$ time ./wc-avx2 < bible-100.txt
82113300
real 0m0.395s
user 0m0.196s
sys 0m0.117s
"System" time is the amount of CPU time spent in the kernel on behalf of your process, or at least a fairly good guess at that. (e.g. it can be hard to account for time spent in interrupt handlers) With an old hard drive you would probably still see about 117ms of system time for ext4, disk interrupts, etc. but real time would have been much longer. $ time ./optimized < bible-100.txt > /dev/null
real 0m1.525s
user 0m1.477s
sys 0m0.048s
Here we're bottlenecked on CPU time - 1.477s + 0.048s = 1.525s. The CPU is busy for every millisecond of real time, either in user space or in the kernel.

In the optimized case:
real 0m0.395s
user 0m0.196s
sys 0m0.117s
0.196 + 0.117 = 0.313, so we used 313ms of CPU time but the entire command took 395ms, with the CPU idle for 82ms.

In other words: yes, the author managed to beat the speed of the disk subsystem. With two caveats:
1. not by much - similar attention to tweaking of I/O parameters might improve I/O performance quite a bit.
2. the I/O path is CPU-bound. Those 117ms (38% of all CPU cycles) are all spent in the disk I/O and file system kernel code; if both the disk and your user code were infinitely fast, the command would still take 117ms. (but those I/O tweaks might reduce that number)
Note that the slow code numbers are with a warm cache, showing 48ms of system time - in this case only the ext4 code has to run in the kernel, as data is already cached in memory. In the cold cache case it has to run the disk driver code, as well, for a total of 117ms.
Some other participants said that they measured zero difference in runtime between pshufb+eq and eqx3+orx2, but I think your problem has more classes of whitespace, and for the histogram problem, considerations about how to hash all the words in a chunk of the input dominate considerations about how to obtain the bitmasks of word-start or word-end positions.
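For anyone following along: eqx3+orx2 presumably refers to building the whitespace bitmask from three byte-compares and two ORs, roughly like this (a minimal AVX2 sketch, assuming only space, tab and newline count as whitespace):

#include <immintrin.h>
#include <stdint.h>

// Bit i of the result is set if chunk[i] is ' ', '\t' or '\n'.
static uint32_t whitespace_mask(const uint8_t *chunk) {
    __m256i v   = _mm256_loadu_si256((const __m256i *)chunk);
    __m256i sp  = _mm256_cmpeq_epi8(v, _mm256_set1_epi8(' '));
    __m256i tab = _mm256_cmpeq_epi8(v, _mm256_set1_epi8('\t'));
    __m256i nl  = _mm256_cmpeq_epi8(v, _mm256_set1_epi8('\n'));
    __m256i ws  = _mm256_or_si256(_mm256_or_si256(sp, tab), nl);
    return (uint32_t)_mm256_movemask_epi8(ws);
}

The pshufb+eq variant instead classifies all 32 bytes with one table lookup (_mm256_shuffle_epi8) plus a compare, which starts to pay off once there are more whitespace classes than spare compare/OR slots.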
The performance bottleneck is whatever resource hits saturation first under the workload you actually run: CPU, memory bandwidth, cache/allocations, disk I/O, network, locks/coordination, or downstream latency.
Measure it, prove it with a profile/trace, change one thing, measure again.
On a more serious note, the performance of hardware today is mind-boggling compared with what we all encountered way back when. What I struggle to comprehend, though, is how some software (particularly Windows as an OS, instant messaging applications, etc.) feels less performant now than it ever did.
Even more so when considered in the context of dev 'remote workstations'. I benchmarked perf on AWS instances that were at least 5x slower than an average M1 MacBook, and cost hundreds of dollars per dev per month (easily), and the MacBook was a sunk cost!
Both Telegram and FB messenger are snappy; I didn't use anything else seriously as of late. (Especially not Teams, nor the late Skype.)
The problem is sloppy programming. We knew how to make small, fast, programs 20+ years ago that would just scream on modern hardware. But now everything is bloated and slow. CPUs can retire billions of instructions per second. Discord takes 10+ seconds to open. I’m simply not creative enough to think up how to keep the cpu busy that long opening IRC.
they are forcing me to use the web client...
What if you could take it for granted that mmap()ing a file has the exact same performance characteristics as malloc(), aside from the data not going away when you free the address space? What if arbitrary program memory could be given a filename and casually handed off to the OS to make persistent? A lot of basic software design assumptions are still based on the constraints of the spinning rust era...
fsync() is still slow, and you need that for real persistence. It's not just about spinning rust, there's very good reasons for wanting a different treatment of clearly ephemeral/scratchpad storage.
It's just that mmap is slower than using read/write, because the kernel knows less about your data access patterns and thus has to guess how to populate caches etc.
Instead, imagine if I could just state in one line of system-agnostic code "give me a pointer to /home/user/abc" and it does the right thing--assuming there was some way around mmap's current set of caveats. Imagine if I could turn a memory buffer into a file in one line of code and it Just Worked. Imagine if the OS treated my M.2 SSD as just another chip on the bus instead of still having a good amount of code on the hot path that assumes I'm manually sending bytes to a mechanical drive.
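On POSIX systems the "give me a pointer to /home/user/abc" half is already a short wrapper - the missing part is making it system-agnostic and free of mmap's usual caveats. A minimal sketch of the read-only version:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

// Map a whole file read-only; returns a pointer to its contents (and the
// size via *len), or NULL on failure.
static void *map_file(const char *path, size_t *len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the fd is closed
    if (p == MAP_FAILED) return NULL;
    *len = (size_t)st.st_size;
    return p;
}

// Usage: size_t n; char *data = map_file("/home/user/abc", &n);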
The real joke is on me though, some of these GPU servers actually have 2 TB of RAM now. Crazy engineering!
You can’t really optimise your code for faster sequential reads than the IO is capable of, so the only thing really worth focusing on is how to optimise everything that isn’t already sequential.
Edit: Ah, I saw @wmf has edited their comment to include "related"