Imagine this use case: we're sending tiny messages over a tiny link. Every byte matters - if we can cut a 10-byte message down to 7, that really helps. So we compress each message as part of a stream, flushing at message boundaries.
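Here's a sketch of that setup using `zlib` from the Python standard library (the discussion is about zstd, but the streaming idea is the same): one long-lived stream, flushed at each message boundary so the receiver can decode each message as it arrives. The sensor messages are invented for illustration.

```python
import zlib

# One long-lived compression stream; the invented messages share most
# of their content, so later chunks benefit from the shared history.
messages = [b'sensor:temp=21.4;unit=C',
            b'sensor:temp=21.5;unit=C',
            b'sensor:temp=21.6;unit=C']

comp = zlib.compressobj()
decomp = zlib.decompressobj()
chunks = []
for msg in messages:
    # Z_SYNC_FLUSH forces out everything buffered so far, so the
    # receiver can fully decode this message before the next arrives.
    chunk = comp.compress(msg) + comp.flush(zlib.Z_SYNC_FLUSH)
    chunks.append(chunk)
    assert decomp.decompress(chunk) == msg

print([len(c) for c in chunks])  # later chunks shrink: shared history
```

The first chunk carries the full cost of fresh content; the later near-duplicate messages compress down to a handful of bytes each.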
However, sometimes messages are lost en route, and we might decide not to retransmit them - there might now be more relevant things to say over the link.
Zstandard has no way to 'undo' a compression operation on a data stream.
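This is the crux of the problem: each flushed chunk references bytes from earlier messages, so a receiver that missed one chunk decodes the next against the wrong history. A small demonstration, again with `zlib` as a stand-in for zstd's streaming mode (message contents invented):

```python
import zlib

# Each chunk depends on everything compressed before it, so a dropped
# chunk desynchronizes the receiver for the rest of the stream.
msgs = [b'the quick brown fox jumps over the lazy dog',
        b'zxqvwjkpyhgfdsamnbtrclue!',
        b'zxqvwjkpyhgfdsamnbtrclue?']  # near-copy of the previous message

comp = zlib.compressobj()
chunks = [comp.compress(m) + comp.flush(zlib.Z_SYNC_FLUSH) for m in msgs]

decomp = zlib.decompressobj()
decomp.decompress(chunks[0])            # message 0 arrives fine
# chunks[1] is lost in transit and never retransmitted...
try:
    out = decomp.decompress(chunks[2])  # ...so message 2 comes out wrong
except zlib.error:
    out = b''
print(out == msgs[2])
```

The third message was encoded as a back-reference into the second, so the receiver reconstructs it from the wrong bytes. Without a way to rewind the compressor's state, the sender cannot pretend the lost message was never compressed.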
Imagine you're compressing Wikipedia and want to get the best ratio possible while also being able to randomly access any article.
If you compress each article individually, phrases like 'citation needed' will end up replicated in most articles.
Another approach is to use a dictionary. This solves the citation needed usecase. But we can still do better. There will be lots of common content between the 'general relativity' and 'special relativity' pages, and likewise between the 'France' and 'Germany' pages. Ideally we'd have different dictionaries for different topics. But the dictionaries themselves have overlap, so it would be good to compress them.
So we end up with a tree of dictionaries to decode any article.
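The basic dictionary idea can be sketched with `zlib`'s `zdict` parameter (zstd has a richer trained-dictionary format, but the principle is the same; the article text here is invented):

```python
import zlib

# Shared boilerplate goes into a dictionary; each article is then
# compressed independently against it, so random access still works.
dictionary = (b'[citation needed] In general relativity, the curvature '
              b'of spacetime is determined by the distribution of matter.')
article = (b'In general relativity, the curvature of spacetime is '
           b'determined by the distribution of matter. [citation needed]')

# Compress the article on its own, then again against the dictionary.
alone = zlib.compress(article)
comp = zlib.compressobj(zdict=dictionary)
with_dict = comp.compress(article) + comp.flush()

decomp = zlib.decompressobj(zdict=dictionary)
assert decomp.decompress(with_dict) + decomp.flush() == article
print(len(alone), '->', len(with_dict))
```

Because most of the article's phrases already appear in the dictionary, the dictionary-assisted compression is far smaller, while each article remains individually decodable.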
However, if you now want to do a full decompression of every article, you ideally don't want to reprocess the dictionary for every decompression. So you want to be able to checkpoint the decompressor state.
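Here's what that checkpointing might look like, sketched with `zlib`'s `.copy()` method on (de)compressor objects (the data is invented, and with zstd you would instead reuse a digested dictionary or saved context): the shared prefix is processed once, then the state is forked per article.

```python
import zlib

# Process the shared dictionary prefix once, then fork the state per
# article with .copy() instead of replaying the prefix every time.
shared_prefix = b'[citation needed] spacetime curvature relativity ' * 4
articles = [b'General relativity relates spacetime curvature to matter.',
            b'Special relativity predates general relativity.']

comp = zlib.compressobj()
preamble = comp.compress(shared_prefix) + comp.flush(zlib.Z_SYNC_FLUSH)

# Encode each article as a fork of the post-prefix compressor state.
encoded = []
for art in articles:
    fork = comp.copy()
    encoded.append(fork.compress(art) + fork.flush(zlib.Z_SYNC_FLUSH))

# Decode: pay for the prefix once, checkpoint, then fork per article.
decomp = zlib.decompressobj()
decomp.decompress(preamble)           # the expensive part, done once
for art, enc in zip(articles, encoded):
    restored = decomp.copy()          # restart from the checkpoint
    assert restored.decompress(enc) == art
```

With a tree of dictionaries, the same trick applies at each level: checkpoint after each dictionary in the chain and fork from the deepest shared ancestor.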
http://fileformats.archiveteam.org/wiki/Zstandard_dictionary
To make a restartable codec using a non-restartable compressor just means you have to implement that layer yourself, outside the non-restartable compressor. As a side note, if I'm not mistaken, the zstd docs suggest that this is already possible - you just have to use the (advanced) buffer-less streaming API, I believe: http://facebook.github.io/zstd/zstd_manual.html#Chapter19
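As a sketch of what that layer would do, here is the lost-message 'undo' built on `zlib`'s `.copy()` (standing in for checkpointing a zstd context): snapshot the compressor before each message, and roll back on loss. This assumes the sender learns of the loss, e.g. via a NACK, before compressing the next message; the message contents are invented.

```python
import zlib

# Restartable layer: snapshot the compressor before each message; if
# that message is reported lost and we choose not to retransmit, roll
# back so sender and receiver histories still match.
comp = zlib.compressobj()
decomp = zlib.decompressobj()

def send(msg, delivered):
    """Compress msg; on loss, rewind the compressor to the snapshot."""
    global comp
    snapshot = comp.copy()
    chunk = comp.compress(msg) + comp.flush(zlib.Z_SYNC_FLUSH)
    if delivered:
        return decomp.decompress(chunk)  # receiver decodes it
    comp = snapshot                      # 'undo' the compression
    return None

assert send(b'position=12,45', True) == b'position=12,45'
send(b'position=12,46', False)           # lost, and we move on
assert send(b'position=13,47', True) == b'position=13,47'
```

Because the sender rewinds to the pre-message snapshot, its history window never includes the lost message, and the receiver's decompressor stays in sync without any retransmission.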