When you see advertised numbers like '200 GB/s', that is total memory bandwidth, i.e. all cores combined. For an individual core, the limit will still be around 6 GB/s.
This means even if you write a perfect parser, you cannot go faster. This limit also applies to (de)serializing data like JSON and Protobuf, because those formats must typically be fully parsed before a single field can be read.
If however you use a zero-copy format, the CPU can skip data that it doesn't care about, so you can 'exceed' the 6 GB/s limit.
The Lite³ serialization format I am working on aims to exploit exactly this, and is able to outperform simdjson by 120x in some benchmarks as a result: https://github.com/fastserial/lite3
E.g. dual-channel Zen 1 showing 25GB/s on a single core: https://stackoverflow.com/a/44948720
I wrote some microbenchmarks for single-threaded memcpy
zen 2 (8-channel DDR4)
naive c: 17GB/s
non-temporal avx: 35GB/s

Xeon-D 1541 (2-channel DDR4, my weakest system, ten years old)
naive c: 9GB/s
non-temporal avx: 13.5GB/s

apple silicon tests
(warm = generate new source buffer, memset(0) output buffer, add memory fence, then run the same copy again)

m3
naive c: 17GB/s cold, 41GB/s warm
non-temporal neon: 78GB/s cold+warm

m3 max
naive c: 25GB/s cold, 65GB/s warm
non-temporal neon: 49GB/s cold, 125GB/s warm

m4 pro
naive c: 13.8GB/s cold, 65GB/s warm
non-temporal neon: 49GB/s cold, 125GB/s warm
(I'm not actually sure offhand why Apple Silicon warm is so much faster than cold - the source buffer is filled with new random data each iteration, I'm using memory fences, and I still see the speedup with 16GB src/dst buffers much larger than cache. x86/Linux didn't show any cold/warm difference. My guess would be that it's something about kernel page accounting and not related to the CPU.)
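For reference, the non-temporal AVX variant is essentially a loop of streaming stores. A minimal sketch, assuming 32-byte-aligned buffers and a size that's a multiple of 32 (the actual benchmark code may differ):

#include <immintrin.h>
#include <stddef.h>

// Copy n bytes using non-temporal (streaming) stores.
// Assumes dst/src are 32-byte aligned and n is a multiple of 32.
static void memcpy_nt_avx(void *dst, const void *src, size_t n) {
    const __m256i *s = (const __m256i *)src;
    __m256i *d = (__m256i *)dst;
    for (size_t i = 0; i < n / 32; i++) {
        __m256i v = _mm256_load_si256(s + i);  // ordinary aligned load
        _mm256_stream_si256(d + i, v);         // store that bypasses the cache
    }
    _mm_sfence();  // make the streaming stores globally visible
}

Streaming stores avoid pulling the destination lines into cache before overwriting them, which is where much of the gain over a naive copy loop comes from once the buffers are far larger than cache.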
I really don't see how you can claim either a 6GB/s single core limit on x86 or a 20GB/s limit on Apple Silicon.

I suppose that in real life such ideal conditions do not occur, but it shows how badly the CPU is limited by its memory bandwidth for streaming tasks. Its maximum memory-read bandwidth is 768 bits per clock, only 60% of its peak bit-crunching throughput. DRAM bandwidth is even more limiting. And this is a single core out of at least 12 (and at most 64).
A single threaded Zen2 program could very well take 1 second to scan through your RAM, during which it's entirely trivial to read stuff from disk, so the modern practice of keeping a ton of stuff in RAM might actually be hurting performance.
Algorithms that scan the entire heap, such as garbage collection, whose single-threaded versions are probably slower than the naive Zen 2 memcpy, might run for more than a second even on a comparatively modest 16GB heap (16 GB at ~17 GB/s is already close to a second), which might not even be acceptable.
https://chipsandcheese.com/p/intels-clearwater-forest-e-core...
Are those numbers also measured while skipping the CPU cache?
What makes M-series have 3x the bandwidth (per core), over x86?
Apple are able to push a wider bus at higher frequencies because they aren't limited by signal integrity problems you encounter trying to do the same over sockets and motherboard traces. x86 CPUs like the Ryzen AI Max 395+, when packaged without socketed memory, are able to push equally wide busses at high frequencies.
The tradeoff is that it's non-upgradeable, but (contra some people who claim this is only a cash-grab by Apple to prevent RAM upgrades) it's worth it for the bandwidth.
> If however you use a zero-copy format, the CPU can skip data that it doesn't care about, so you can 'exceed' the 6 GB/s limit.
Of course the "skipping" is by cachelines. A cacheline is effectively a self-contained block of data from a memory throughput perspective: once you've read any part of it, the rest comes for free.
Samsung is selling NVMe SSDs claiming 14 GB/s sequential read speed.
Yes, those numbers are real, but only in very short bursts of strictly sequential reads; sustained speeds will be closer to 8-10 GB/s. And real workloads will be lower than that, because they contain random access.
On Linux the NVMe driver mostly just sets up DMA: the drive's controller writes the pages directly into host memory over the PCIe link, so it is not the CPU that is moving the data. Whenever the CPU is involved in any data movement, the 6 GB/s per core limit still applies.
You still have to load a 64-byte cache line at a time, and most CPUs do some amount of readahead, so you'll need a pretty large "blank" space to see these gains, larger than typical protobufs.
The ability to mutate serialized data allows for 2 things:
1) Services exchanging messages and making small changes each time e.g. adding a timestamp without full reserialization overhead.
2) A message producer can keep a message 'template' ready and only change the necessary fields each time before sending. As a result, serializing becomes practically free from a performance perspective.
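For illustration, here is the "template" pattern in plain C terms, with a made-up fixed flat layout (this is not Lite³'s actual API - Lite³ is schemaless - it just shows the general idea of building the serialized bytes once and patching only what changes):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Made-up flat wire layout; a real format would control padding/endianness.
struct wire_msg {
    uint32_t version;
    uint64_t timestamp_ns;  // the only field that changes per message
    uint8_t  payload[48];
};

int main(void) {
    struct wire_msg tmpl;
    memset(&tmpl, 0, sizeof tmpl);   // build the template once
    tmpl.version = 1;
    memcpy(tmpl.payload, "hello", 5);

    for (uint64_t now = 1; now <= 3; now++) {
        tmpl.timestamp_ns = now;     // patch only the changed field
        // send(sock, &tmpl, sizeof tmpl, 0);  // serialization cost is ~zero
        printf("msg with ts=%llu ready, %zu bytes\n",
               (unsigned long long)tmpl.timestamp_ns, sizeof tmpl);
    }
    return 0;
}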
If you want a schema, it must be enforced by the application through runtime type checking. All messages contain type information. Though I do see the value of adding pydantic-like schema checking in the future.
EDIT: Regarding message size, Lite³ does incur a message-size penalty for being schemaless and fully indexed at the same time. Though if you are using it in an RPC / streaming setting, this can be negated through brotli/zstd/dict compression.
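For many small messages the win mostly comes from a shared dictionary (ZSTD_compress_usingDict), but even the plain one-shot zstd API is only a few lines to bolt onto the send path. A minimal sketch (link with -lzstd):

#include <zstd.h>
#include <stdio.h>

int main(void) {
    // Stand-in for a serialized message; in practice this is the encoder's
    // output buffer.
    const char msg[] = "{\"attributes\":[1,2,3],\"name\":\"Bob\"}";
    size_t src_len = sizeof msg - 1;

    char dst[256];
    size_t n = ZSTD_compress(dst, sizeof dst, msg, src_len, 3 /* level */);
    if (ZSTD_isError(n)) {
        fprintf(stderr, "compress failed: %s\n", ZSTD_getErrorName(n));
        return 1;
    }
    printf("%zu bytes -> %zu bytes\n", src_len, n);
    return 0;
}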
Disclosure: I am the author.
That being said storing trees as serializable flat buffers is definitely useful, if only because you can release them very cheaply.
Which file formats allow partial parsing?
Imagine the following object:
{
"attributes": [ .. some really really long array of whatever .. ],
"name": "Bob"
}
In JSON, if you want to extract the "name" property, you need to parse the whole "attributes" array. However, if you encoded the size of the "attributes" array in bytes, a parser could look at the key "attributes", decide that it's not interested, and jump past it without parsing anything.

You'd typically want some kind of binary format, but for illustration purposes, here's an imaginary XML representation of the JSON data model which achieves this:
<object content-size-in-bytes="8388737">
  <array key="attributes" content-size-in-bytes="8388608">
    .. some 8 MiB large array of values ..
  </array>
  <string key="name" content-size-in-bytes="3">Bob</string>
</object>
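Sketched in C over a made-up length-prefixed binary layout (purely for illustration, not any particular format), skipping the 8 MiB "attributes" value costs one pointer bump instead of 8 MiB of parsing:

#include <stdint.h>
#include <string.h>
#include <stddef.h>

// Made-up layout: [1-byte key length][key][4-byte value length][value] ...
// Returns a pointer to the value for `key` (length in *out_len), or NULL.
// Bounds checks and endianness handling omitted for brevity.
static const uint8_t *find_field(const uint8_t *buf, size_t buf_len,
                                 const char *key, uint32_t *out_len) {
    size_t pos = 0, key_len = strlen(key);
    while (pos < buf_len) {
        uint8_t klen = buf[pos++];
        const uint8_t *k = buf + pos;
        pos += klen;
        uint32_t vlen;
        memcpy(&vlen, buf + pos, 4);
        pos += 4;
        if (klen == key_len && memcmp(k, key, klen) == 0) {
            *out_len = vlen;
            return buf + pos;   // found "name": value starts here
        }
        pos += vlen;            // not interested: jump past the whole value
    }
    return NULL;
}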
If this was stored in a file, you could use fseek to seek past the "attributes" array. If it was compressed, or coming across a socket, you'd need more complicated mechanisms to seek past irrelevant parts of the object.

In the case of append, the old value still lives inside the buffer but is zeroed out. This means that if you keep replacing variable-sized elements, over time the buffer will fragment. You can vacuum a message by recursively writing it from the root to a new buffer. This will clear out the unused space. This operation can be delayed for as long as you like.
If you are only setting fixed-size values like integers or floats, then the buffer never grows as they are always updated in-place.
Btw: in my project README I have benchmarks against Cap'N Proto & Google Flatbuffers.
$ time ./wc-avx2 < bible-100.txt
82113300
real 0m0.395s
user 0m0.196s
sys 0m0.117s
"System" time is the amount of CPU time spent in the kernel on behalf of your process, or at least a fairly good guess at that. (e.g. it can be hard to account for time spent in interrupt handlers) With an old hard drive you would probably still see about 117ms of system time for ext4, disk interrupts, etc. but real time would have been much longer. $ time ./optimized < bible-100.txt > /dev/null
real 0m1.525s
user 0m1.477s
sys 0m0.048s
Here we're bottlenecked on CPU time - 1.477s + 0.048s = 1.525s. The CPU is busy for every millisecond of real time, either in user space or in the kernel.

In the optimized case:
real 0m0.395s
user 0m0.196s
sys 0m0.117s
0.196 + 0.117 = 0.313, so we used 313ms of CPU time but the entire command took 395ms, with the CPU idle for 82ms.

In other words: yes, the author managed to beat the speed of the disk subsystem. With two caveats:
1. not by much - similar attention to tweaking of I/O parameters might improve I/O performance quite a bit.
2. the I/O path is CPU-bound. Those 117ms (38% of all CPU cycles) are all spent in the disk I/O and file system kernel code; if both the disk and your user code were infinitely fast, the command would still take 117ms. (but those I/O tweaks might reduce that number)
Note that the slow code numbers are with a warm cache, showing 48ms of system time - in this case only the ext4 code has to run in the kernel, as data is already cached in memory. In the cold cache case it has to run the disk driver code, as well, for a total of 117ms.
Some other participants said that they measured zero difference in runtime between pshufb+eq and eqx3+orx2, but I think your problem has more classes of whitespace, and for the histogram problem, considerations about how to hash all the words in a chunk of the input dominate considerations about how to obtain the bitmasks of word-start or word-end positions.
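For anyone following along: eqx3+orx2 presumably refers to building the whitespace bitmask from three byte-compares and two ORs, roughly like this (a minimal AVX2 sketch, assuming only space, tab and newline count as whitespace):

#include <immintrin.h>
#include <stdint.h>

// Bit i of the result is set if chunk[i] is ' ', '\t' or '\n'.
static uint32_t whitespace_mask(const uint8_t *chunk) {
    __m256i v   = _mm256_loadu_si256((const __m256i *)chunk);
    __m256i sp  = _mm256_cmpeq_epi8(v, _mm256_set1_epi8(' '));
    __m256i tab = _mm256_cmpeq_epi8(v, _mm256_set1_epi8('\t'));
    __m256i nl  = _mm256_cmpeq_epi8(v, _mm256_set1_epi8('\n'));
    __m256i ws  = _mm256_or_si256(_mm256_or_si256(sp, tab), nl);
    return (uint32_t)_mm256_movemask_epi8(ws);
}

The pshufb+eq variant instead classifies all 32 bytes with one table lookup (_mm256_shuffle_epi8) plus a compare, which starts to pay off once there are more whitespace classes than spare compare/OR slots.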
The performance bottleneck is whatever resource hits saturation first under the workload you actually run: CPU, memory bandwidth, cache/allocations, disk I/O, network, locks/coordination, or downstream latency.
Measure it, prove it with a profile/trace, change one thing, measure again.
On a more serious note, the performance of hardware today is mind-boggling compared with what we all encountered way back when. What I struggle to comprehend, though, is how some software (particularly Windows as an OS, instant messaging applications, etc.) feels less performant now than it ever did.
Even more so when considered in the context of dev 'remote workstations'. I benchmarked perf on AWS instances that were at least 5x slower than an average M1 MacBook, and cost hundreds of dollars per dev per month (easily), and the MacBook was a sunk cost!
Both Telegram and FB messenger are snappy; I didn't use anything else seriously as of late. (Especially not Teams, nor the late Skype.)
The problem is sloppy programming. We knew how to make small, fast, programs 20+ years ago that would just scream on modern hardware. But now everything is bloated and slow. CPUs can retire billions of instructions per second. Discord takes 10+ seconds to open. I’m simply not creative enough to think up how to keep the cpu busy that long opening IRC.
they are forcing me to use the web client...
What if you could take it for granted that mmap()ing a file has the exact same performance characteristics as malloc(), aside from the data not going away when you free the address space? What if arbitrary program memory could be given a filename and casually handed off to the OS to make persistent? A lot of basic software design assumptions are still based on the constraints of the spinning rust era...
fsync() is still slow, and you need that for real persistence. It's not just about spinning rust, there's very good reasons for wanting a different treatment of clearly ephemeral/scratchpad storage.
It's just that mmap is slower than using read/write, because the kernel knows less about your data access patterns and thus has to guess how to populate caches etc.
Instead, imagine if I could just state in one line of system-agnostic code "give me a pointer to /home/user/abc" and it does the right thing--assuming there was some way around mmap's current set of caveats. Imagine if I could turn a memory buffer into a file in one line of code and it Just Worked. Imagine if the OS treated my M.2 SSD as just another chip on the bus instead of still having a good amount of code on the hot path that assumes I'm manually sending bytes to a mechanical drive.
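On POSIX systems the "give me a pointer to /home/user/abc" half is already a short wrapper - the missing part is making it system-agnostic and free of mmap's usual caveats. A minimal sketch of the read-only version:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

// Map a whole file read-only; returns a pointer to its contents (and the
// size via *len), or NULL on failure.
static void *map_file(const char *path, size_t *len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the fd is closed
    if (p == MAP_FAILED) return NULL;
    *len = (size_t)st.st_size;
    return p;
}

// Usage: size_t n; char *data = map_file("/home/user/abc", &n);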
The real joke is on me though, some of these GPU servers actually have 2 TB of RAM now. Crazy engineering!
You can’t really optimise your code for faster sequential reads than the IO is capable of, so the only thing really worth focusing on is how to optimise everything that isn’t already sequential.
Edit: Ah, I saw @wmf has edited their comment to include "related"