
mananaysiempre 1y ago (0 points)

For what it’s worth, the benchmark on the Zstandard homepage[1] shows none of the setups tested breaking 1GB/s on compression, and only the fastest and sloppiest ones breaking 1GB/s on decompression. If you can live with its API limitations, libdeflate is known[2] to squeeze past 1GB/s decompressing normal Deflate compression levels. In any case, asking for multiple GB/s is probably unfair.

Still, looking at those benchmarks, 10MB/s sounds like the absolute minimum reasonable speed, and they’re reporting nearly three orders of magnitude below that. A modern compressor does not run at mediocre dialup speeds; something in there is absolutely murdering the performance.

And I’m willing to believe it’s just the constant-time overhead. The article mentions “a few hundred bytes” per message payload in a stream of messages, and the actual data of the benchmarks implies 1.6KB uncompressed. Even though they don’t reinitialize the compressor on each message, that is still a very very modest amount of data.
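One way to check the constant-overhead theory is to time a single long-lived compressor on a stream of small messages. The following is only a sketch under stated assumptions: it uses Python's `zlib` rather than the Erlang bindings from the article, the message shape and the `make_message` and `per_byte_cost_us` helpers are made up for illustration, and flushing with `Z_SYNC_FLUSH` after each message is a guess at how such a stream would be framed.

```python
import json
import time
import zlib


def make_message(seq: int, size: int = 1600) -> bytes:
    """Build a synthetic JSON message of at least `size` bytes (hypothetical shape)."""
    words = []
    length = 0
    i = 0
    while length < size:
        # Vary the words with seq so consecutive messages are not identical.
        word = f"w{(seq * 31 + i * 7) % 997}"
        words.append(word)
        length += len(word) + 1
        i += 1
    return json.dumps({"seq": seq, "content": " ".join(words)}).encode()


def per_byte_cost_us(messages: list[bytes]) -> float:
    """Average compression time per input byte, in microseconds."""
    comp = zlib.compressobj(6)  # one context reused for the whole stream
    total_bytes = sum(len(m) for m in messages)
    start = time.perf_counter()
    for message in messages:
        comp.compress(message)
        # Assumption: each message is flushed so the receiver can decode it at once.
        comp.flush(zlib.Z_SYNC_FLUSH)
    elapsed = time.perf_counter() - start
    return elapsed / total_bytes * 1e6


if __name__ == "__main__":
    stream = [make_message(seq) for seq in range(5000)]
    cost = per_byte_cost_us(stream)
    # 1 byte per microsecond is 1 MB/s, so throughput in MB/s is 1 / cost.
    print(f"{cost:.4f} us/B, about {1 / cost:.1f} MB/s")
```

The throughput it prints can be compared with the 10MB/s floor above. A single-process loop like this says nothing about production conditions, so it bounds only the per-message overhead of the library itself.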

So it might be that general-purpose compressors are simply a bad tool here from a performance standpoint. I’m not aware of a good tool for this kind of application, though.

[1] https://facebook.github.io/zstd/#benchmarks

[2] https://github.com/zlib-ng/zlib-ng/issues/1486


jhgg 1y ago

One thing to note is that on a given gateway server there are potentially 100k other compression contexts active. Given that each connection is transmitting a trickle of small data in an unpredictable way, from different CPU cores as the processes are scheduled by the Erlang VM, chances are the CPU caches are absolutely being thrashed. I imagine this contributes to some level of fixed overhead here too, especially when you're measuring these timings on a machine serving actual production traffic as opposed to simply running a bunch of small payloads through a single compressor.

mananaysiempre (OP) 1y ago

It’s possible, I guess, but it wouldn’t be my first thought. It’s too slow for that.

A payload of 1.6KB at 45us/B is 75ms, which is below the typical scheduling quantum of about 100ms. (Can’t say anything about Erlang, let alone Erlang bindings to C libraries, but I wouldn’t expect it to be that much smaller either, precisely because of the switching overhead both direct and indirect.) So a single compression operation shouldn’t be getting preempted enough to affect the results.
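The arithmetic behind these figures can be spelled out as a short sketch. The 1.6KB payload, the 45us/B cost and the 10MB/s floor are taken from this thread; the 3 GHz clock is purely an assumption used to turn time into a rough cycle count, and `summarize` is a made-up helper name.

```python
PAYLOAD_BYTES = 1600            # "1.6KB" from the thread, read as 1600 bytes
COST_US_PER_BYTE = 45           # "45us/B" from the thread
FLOOR_BYTES_PER_S = 10_000_000  # the "10MB/s" minimum from the parent comment
ASSUMED_CYCLES_PER_US = 3000    # assumption: a 3 GHz core


def summarize() -> dict:
    """Convert the per-byte cost into time, throughput and cycles per payload."""
    total_us = PAYLOAD_BYTES * COST_US_PER_BYTE
    throughput_bytes_per_s = 1_000_000 / COST_US_PER_BYTE
    return {
        "total_ms": total_us / 1000,
        "throughput_kb_per_s": throughput_bytes_per_s / 1000,
        "times_slower_than_floor": FLOOR_BYTES_PER_S / throughput_bytes_per_s,
        "cycles_per_payload": total_us * ASSUMED_CYCLES_PER_US,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.1f}")
```

With these inputs a payload takes 72 ms (rounded to 75ms above), throughput is about 22 KB/s, which is roughly 450 times below the 10MB/s floor, and one payload costs around 216 million cycles at the assumed clock.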

Typical RAM bandwidth is tens of GB/s (even consumer-class SSDs[1] are single-digit GB/s), so even with tens to hundreds of cores sharing it, that’s not enough to affect anything. Even taking into account that the compressor’s window is measured in megabytes, not kilobytes, that’s likely still not enough (it would be a bad compressor that reread its whole window each time, anyway). And the data we’re compressing is not only minuscule, it has just been generated and is virtually guaranteed to be cached.

Honestly, I almost want to say that the benchmark is measuring the wrong thing somehow, except they’re reporting a 2× speedup switching from one compressor to another. So it can’t be the JSON encoding overhead or whatnot, and, unless one of the Erlang bindings is somehow drastically stupider than the other, it shouldn’t be the FFI overhead, and even those are a huge stretch. The Flying Spaghetti Monster be merciful, I cannot see anything here that we could be spending over a hundred million cycles on.

At this point I’m hoping somebody just mixed up the units, because this is really unsettling.

[1] https://lemire.me/en/talk/perfsummit2020/
