For many applications (file formats), ubiquity is important, so it would be great if zstd became ubiquitous and could be relied on to be available. Say, for example, in future versions of HDF (HDF5 or later).
For context, filesystem compression usually compresses blocks of data individually (for instance, every 64K block of a file will be individually compressed, and when you modify a file in the middle, that block needs to be recompressed entirely). This is usually good enough, and it has some pretty cool properties, like being able to have only compressable parts of a file compressed, or turning on compression on a file and having only new and rewritten blocks get compressed. Because of Zstd's separated dictionary, it seems like it could be feasible to instead store the dictionary in the file's inode and compress the blocks with that dictionary (recomputing the dictionary and recompressing existing blocks when the file allocates 10 4K blocks and then again at 100 blocks, perhaps).
I wonder what different properties such a compression scheme would have. I imagine it would achieve a noticeably smaller size, since each block would no longer need to carry its own modeling overhead and short blocks would compress against a dictionary tuned to the file. One downside would be that a corrupted or overwritten inode would render the file completely unrecoverable, whereas current schemes allow blocks to be decompressed individually. Another downside is that a file couldn't be partially compressed; it would be all or nothing.
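The per-block shared-dictionary idea above can be sketched with Python's stdlib `zlib`, which also supports a preset dictionary via `zdict` (zstd's trained-dictionary API is analogous but lives in a separate library). The dictionary bytes and block contents here are hand-picked for illustration, not trained:

```python
import zlib

# Hypothetical shared dictionary: bytes that occur frequently in the file.
# zstd would *train* this from samples; here we just hand-pick it.
dictionary = b"the quick brown fox jumps over the lazy dog"

# Stand-ins for fixed-size filesystem blocks.
blocks = [b"the quick brown fox " * 10, b"jumps over the lazy dog " * 10]

# Compress each block independently, but against the same shared dictionary,
# so the dictionary is stored once (e.g. in the inode) rather than per block.
compressed = []
for block in blocks:
    c = zlib.compressobj(zdict=dictionary)
    compressed.append(c.compress(block) + c.flush())

# Any single block can still be decompressed on its own -- but only if the
# dictionary survives, which is the inode-corruption downside noted above.
d = zlib.decompressobj(zdict=dictionary)
restored = d.decompress(compressed[1])
assert restored == blocks[1]
```

Rewriting one block only requires recompressing that block; retraining the dictionary (the "at 10 blocks, then at 100 blocks" step) is what would force recompressing everything.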
A trained dictionary is just meaningless but very frequent bits of data, which can be referenced in that fashion as if they preceded the real data.
(But seriously, Mozilla engineers warned the Chrome team that they were rushing the inclusion of Brotli, since the compression wars were heating up. They proceeded anyway, which is unsurprising.)
bz2/gz still predominate for compressed objects in file stores, from what I can see.
It is, and you should definitely at least give it a look. I posted a comment mentioning it the other day in the OpenZFS 2.0 thread [0], and it also came up recently on HN in a thread linked there, but there are some interesting performance graphs comparing different algorithms in the github PR for zstd in ZFS [1]. LZ4 still has its place IMO; ZFS isn't run exclusively on heavier metal, and people use it to good effect on the likes of RPis as well. Sometimes CPU cycles are still the limiter, or every last one is needed elsewhere. I also think it matters a lot less on spinning rust, where $/TB tends to be so much lower. How much one gets out of it is also influenced by application: a general NAS with larger record sizes is going to see different gains vs a database.

But with even vaguely modern desktop CPUs (and their surfeit of cores) and SSDs, particularly in dedicated network storage devices, even an extra 10-30% is worth a lot, and there's usually plenty of CPU to throw at it. Even more so if primary usage is limited to only a 10-50 Gbps connection.
As always though, it's probably best if you can benchmark it with your own stuff and play around a bit, pulling different levers. ZFS is nice that way too, since it's so easy to create a bunch of different test filesystems at the same time.
----
0: https://news.ycombinator.com/item?id=29268907
1: https://github.com/openzfs/zfs/pull/9735#issuecomment-570082...
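The side-by-side benchmarking approach described above might look something like this (pool and dataset names are hypothetical; this is a rough sketch of one way to compare settings on your own data, not a tuned benchmark):

```shell
#!/bin/sh
# Assumes a pool named "tank" and sample data in /data/sample.
# Create one dataset per compression setting and copy the same data in.
for alg in lz4 zstd zstd-9 gzip-6; do
    zfs create -o compression="$alg" "tank/bench-$alg"
    cp -a /data/sample/. "/tank/bench-$alg/"
done

# Compare achieved ratios and on-disk usage per dataset.
zfs get -r compressratio,used tank
```

Timing the copies (or your actual workload) against each dataset then shows the CPU-vs-ratio trade-off for your hardware.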
There's enough grunt left over to be my plex headend as well as long as I avoid transcoding.
I actually wanted to run FreeBSD on it, but the rpi4 wasn't fully ported when I started. In the back of my mind, it should be "safe" to convert over because OpenZFS. Which is kind-of the point.
I avoided de-dup. It burns your CPU. But I think compression is worth it, even with a lot of mp4 and mp3 and jpg content (photo and live-TV PVR archives, aside from music)
There's a lot of FUD about ZFS on small devices and how much memory you "need". I think something said of Solaris got conflated into a ZFS "law" about minimum memory for the ARC, which just isn't really true: it may not be performant, but it works fine on systems with less memory (than 8GB). I chose the 8GB Pi 4 because I could afford it; I would have been fine on 4GB.
Clearly they produce good technology. But the company is morally bankrupt.
Here is a quote from the summary, but I could not find where it was substantiated in the 250-page document:
> Facebook knew that the changes to its policies on the Android mobile phone system, which enabled the Facebook app to collect a record of calls and texts sent by the user, would be controversial. To mitigate any bad PR, Facebook planned to make it as hard as possible for users to know that this was one of the underlying features of the upgrade of their app.
It's also supported by tar in recent Linux distros, if zstd is installed, so "tar acf blah.tar.zst *" works fine, and "tar xf blah.tar.zst" works automatically as well. Give it a try, folks, and retire gzip shortly afterwards.
Just be careful that you're comparing against the best implementation of gzip. One recent re-implementation of zcat was 3.1x faster than /bin/zcat (and the CRC-32 implementation within was 7.3x faster than /bin/crc32). Both programs decode exactly the same file format. They're just different implementations. For details, see: https://nigeltao.github.io/blog/2021/fastest-safest-png-deco...
Was .zs not sufficient if a file format ending in 'std' is so abhorrent?