To me, most of the claims are arguable.
To say 3 levels of headers is "unsafe complexity"... I don't agree. Indirection is fundamental to design.
To say padding is "useless"... I don't understand why padding and byte-alignment are given so much vitriol. Look at how much padding the tar format has. And tar is a good example of how "useless padding" was used to extend the format to support larger files. So this supposed "flaw" has been in tar for decades, with no disastrous effects at all.
The xz decision was not made "blindly". There was thought behind the decision.
And it's pure FUD to say "Xz implementations may choose what subset of the format they support. They may even choose to not support integrity checking at all. Safe interoperability among xz implementations is not guaranteed". You could say this about any software - "oh no, someone might make a bad implementation!" Format fragmentation is essentially a social problem more than a technical problem.
I'll leave it at this for now, but there's more I could write.
3 individual headers for one file format is unnecessary complexity.
> To say padding is "useless"
Padding in general is not useless, but padding in a compression format is very counterproductive.
> And it's pure FUD to say "Xz implementations may choose what subset of the format they support. They may even choose to not support integrity checking at all. Safe interoperability among xz implementations is not guaranteed". You could say this about any software - "oh no, someone might make a bad implementation!" Format fragmentation is essentially a social problem more than a technical problem.
This isn't about "someone making a bad implementation!", it's about crucial features being optional. That is, completely compliant implementations may or may not be able to decompress a given XZ archive, and may or may not be able to validate the archive.
XZ may not have been chosen blindly, but it certainly does not seem like a sensible format. There is no benefit to this complexity. We do not need or benefit from a format that is flexible, as we can just swap format and tool if we want to swap algorithms, like we have done so many times before (a proper compression format is just a tiny algorithm-specific header + trailing checksum, so it is not worth generalizing away).
Any and all benefits of XZ lie in LZMA2. We could have lzip2 and avoid all of these problems.
(I have no opinion as to whether LZIP should supersede GZIP/BZIP2, but XZ certainly seems like a poor choice.)
So all these file formats are unnecessarily complex?
- all OpenDocument formats
- all MS office formats
- all multimedia container formats
- deb/rpm packages
etc?
Just because it's in tar doesn't mean that the design is flawless. tar was created a long time ago, when a lot of things we are concerned with now weren't even thought of.
Deterministic, bit-reproducible archives are one thing that tar has recently struggled with[1], because the archive format was not originally designed with that in mind. With more foresight and a better archive format, this need not have been an issue at all.
[1] - https://lists.gnu.org/archive/html/help-tar/2015-05/msg00005...
While I think he made a case, I somewhat doubt that the other formats are flawless, and the real answer would lie in a more open analysis of all of them.
Like some other compressed formats, an lzip file is just a series of compressed blocks concatenated together, each block starting with a magic number and containing a certain amount of compressed data. There’s no overall file header, nor any marker that a particular block is the last one. This structure has the advantage that you can simply concatenate two lzip files, and the result is a valid lzip file that decompresses to the concatenation of what the inputs decompress to.
Thus, when the decompressor has finished reading a block and sees there’s more input data left in the file, there are two possibilities for what that data could contain. It could be another lzip block corresponding to additional compressed data. Or it could be any other random binary data, if the user is taking advantage of the “trailing data” feature, in which case the rest of the file should be silently ignored.
How do you tell the difference? Simply enough, by checking if the data starts with the 4-byte lzip magic number. If the magic number itself is corrupted in any way? Then the entire rest of the file is treated as “trailing data” and ignored. I hope the user notices their data is missing before they delete the compressed original…
It might be possible to identify an lzip block that has its magic number corrupted, e.g. by checking whether the trailing CRC is valid. However, at least at the time I discovered this, lzip’s decompressor made no attempt to do so. It’s possible the behavior has improved in later releases; I haven’t checked.
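The failure mode described above can be sketched with a toy parser. This is a minimal model, not the real lzip member layout: here each "member" is just the 4-byte magic, a 1-byte payload length, and the payload, but the trailing-data ambiguity works the same way.

```python
MAGIC = b"LZIP"

def split_members(data):
    """Walk concatenated members; stop at the first bytes that aren't magic.

    Mirrors the behavior described above: anything after the first failed
    magic check is silently treated as "trailing data"."""
    members, pos = [], 0
    while pos + 5 <= len(data) and data[pos:pos + 4] == MAGIC:
        n = data[pos + 4]
        members.append(data[pos + 5:pos + 5 + n])
        pos += 5 + n
    return members, data[pos:]  # (decoded members, ignored trailing data)

def member(payload):
    return MAGIC + bytes([len(payload)]) + payload

stream = member(b"first") + member(b"second")
ok, trailing = split_members(stream)          # both members decoded

# Flip one bit in the second member's magic number...
corrupt = bytearray(stream)
corrupt[len(member(b"first"))] ^= 0xFF
bad, trailing2 = split_members(bytes(corrupt))
# ...and the entire second member silently becomes "trailing data".
```

One flipped byte in a magic number is enough to make the decoder quietly drop everything after it, which is exactly the "I hope the user notices" scenario.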
But at least at the time this article was written: pot, meet kettle.
https://lists.debian.org/55C0FE82.7050700@gnu.org
Their advocacy in this thread was so good that I removed lzip from my system.
To add to that, if you need parity to recover from errors, you need to calculate how much based on your storage medium's durability and projected lifespan. It's not the file format's concern. The xz CRC should be irrelevant.
So you've archived two or more copies of each file? That means you're using at least twice as much space (and if you're keeping the original as well, more than twice).
For the likely corruption of the occasional single bit flip here and there, you could do a lot better by using something like par2 and/or dvdisaster (depending on what media you're archiving to).
You haven't?
It took me just one minor "data loss incident" ~20 years ago to very quickly convince me to become a lifetime member of the "backup all the things to a few different locations" club.
> That means you're use at least twice as much space (and if you're keeping the original as well, more than twice).
"Storage is cheap."
If your data is not in three different places it might as well not exist.
https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-co...
...relative to ... ? Is it better than lzip? lzip sounds like it would also use LZMA-based compression, right? This [1] sounds like an interesting and more detailed/up-to-date comparison. Also by the same author BTW.
People began using xz mostly because they (e.g. distro maintainers like Debian) had started seeing 7z files floating around, thought they were cool, and so wanted a format that did what 7z did but was an open standard rather than being dictated by some company. xz was that format, so they leapt on it.
As it turns out, lzip had already been around for a year (though I'm not sure in what state of usability) before the xz project was started, but the people who created xz weren't looking for something that compressed better, they were looking for something that compressed better like 7z, and xz is that.
(Meanwhile, what 7z/xz is actually better at, AFAIK, is long-range identical-run deduplication; this is what makes it the tool of choice in the video-game archival community for making archives of every variation of a ROM file. Stick 100 slight variations of a 5MB file together into one .7z (or .tar.xz) file, and they'll compress down to roughly 1.2x the size of a single variant of the file.)
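That long-range deduplication effect is easy to reproduce with Python's lzma module (which emits raw LZMA/xz-style streams). This toy stands in for the ROM scenario: 20 lightly mutated copies of a 50 KB blob of incompressible random data.

```python
import lzma
import random

random.seed(0)
base = bytes(random.randrange(256) for _ in range(50_000))  # incompressible "ROM"

variants = []
for i in range(20):
    v = bytearray(base)
    v[i * 100] ^= 0xFF               # each variant differs by a single byte
    variants.append(bytes(v))

one = lzma.compress(base)                   # one variant on its own
all20 = lzma.compress(b"".join(variants))   # all 20 back to back
# all20 ends up barely larger than one: the LZMA dictionary (8 MiB at the
# default preset) spans all the repeats, so variants 2..20 compress down
# to little more than match references into earlier copies.
```

The same holds for .tar.xz as long as the dictionary is larger than the span between repeated files; for bigger inputs that is why higher presets (xz -9 uses a 64 MiB dictionary) or 7z's larger dictionaries matter.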
665472 freebsd-11.0-release-amd64-disc1.iso
401728 freebsd-11.0-release-amd64-disc1.iso.xz 5m0.606s
406440 freebsd-11.0-release-amd64-disc1.iso.lz 5m43.375s
430872 freebsd-11.0-release-amd64-disc1.iso.bz2 1m38.654s
440400 freebsd-11.0-release-amd64-disc1.iso.gz 0m27.073s
431740 freebsd-11.0-release-amd64-disc1.iso.zst 0m3.424s
Maybe xz is not good for long-term archiving, but it's both faster and produces smaller files in most scenarios. However, I'm sticking with gz for backups, mainly because of its speed and popularity. If I want to compress anything to the smallest possible size without any regard for CPU time, then I use xz.
Previously discussed here on HN back then:
https://news.ycombinator.com/item?id=12768425
The author has made some minor revisions since then. Here are the main differences to the page compared to when it was first discussed here:
http://web.cvs.savannah.nongnu.org/viewvc/lzip/lzip/xz_inade...
And here's the full page history:
http://web.cvs.savannah.nongnu.org/viewvc/lzip/lzip/xz_inade...
I'm not so sure, using tools suitable for long-term archiving by default might not be a bad practice. The thing about archiving is that it's often hard to know in advance what exactly you want to keep long-term. Using more robust formats probably won't cost much in the short term, but could pay off in the long term.
[1]: https://github.com/lrq3000/pyFileFixity
[2]: http://dvdisaster.net/en/index.html
If you can't control the underlying storage, then ditto. Keeping and maintaining explicit parity chunks is somewhat inconvenient, but it works.
But if you just want to avoid bitrot of your own files, sitting on your own HDD, I'd recommend using a reliable storage system instead. ZFS or, at higher and more complicated levels, Ceph/Rook and its kin. That still offers a posix interface (unlike parity files), while being just as safe.
Parity is ECC with one bit. ECC is usually Reed-Solomon, which is just a fancy name for having more equations than data chunks - that's how it adds redundancy. Usually you should aim for 20-40% extra redundancy.
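The one-bit case is easy to see concretely. A single XOR parity chunk - a minimal sketch, not how par2's actual Reed-Solomon coding works - lets you rebuild any one lost chunk from the survivors:

```python
def xor_chunks(chunks):
    """XOR equally sized chunks together, byte by byte."""
    out = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_chunks(data)            # the single "extra equation"

# Chunk 1 is lost; XOR the surviving chunks with the parity to get it back.
recovered = xor_chunks([data[0], data[2], parity])
```

Reed-Solomon generalizes this to k parity chunks so that any k losses are recoverable, which is where the 20-40% figure comes from: parity chunks as a fraction of data chunks.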
Ceph, HDFS and other distributed storage systems implement erasure coding (which is subtly different from error correction coding), which I would recommend for handling backups.
I think for backup (as in small-scale, "fits on one disk") error-correcting codes are not a really good approach, because IME hard disks with one error you notice usually have made many more errors - or will do so shortly. In that case no ECC will help you. If, on the other hand, you're looking at an isolated error, then only very little data is affected (on average).
For example, a bit error in a chunk in a tool like borg/restic will only break that chunk; a piece of a file or perhaps part of a directory listing.
So for these kinds of scenarios "just use multiple backup drives with fully independent backups" is better and simpler.
xz can be amazing. It can also bite you.
I've had payloads that compress to 0.16 with gzip then compress to 0.016 with xz. Hurray! Then I've had payloads where xz compression is par, or worse. However, with "best or extreme" compression, xz can peg your CPU for much longer. gzip and bzip2 will take minutes while xz -9 takes hours at 100% CPU.
As annoying as that is, getting an order of magnitude better in many circumstances is hard to give up.
My compromise is "xz -1". It usually delivers pretty good results, in reasonable time, with manageable CPU/Memory usage.
FYI. The datasets are largely text-ish. Usually in 250MB-1GB chunks. So talking JSON data, webpages, and the like.
If you store enough of the same type of data, invest in redesigning the application. There's a reason we all use jpegs over zipped bitmaps...
It's because it's an appropriate compression - just like xz can be? Not sure what you're actually suggesting here.
When I last looked into this issue, it seemed that erasure codes, like with Parchive/par/par2, was the way to go. (As others have mentioned here.) I haven't tried it out as I haven't needed that level of robustness.
Then why use the default settings?
I tend to use the maximum settings, which are much more of a memory hog, but I have enough memory where that's not an issue.
Just use the settings that are right for you.
I think he saw "'best' compression" and stopped looking there.
When I burn data (including xz archives) on to DVD for archival storage, I use dvdisaster[2] for the same purpose.
I've tested both by damaging archives and scratching DVDs, and these tools work great for recovery. The amount of redundancy (with a tradeoff for space) is also tuneable for both.
[1] - https://github.com/Parchive/par2cmdline
[2] - http://dvdisaster.net/
This article is likely more relevant to tape archives than anything most people use today.
The author seems to think the xz container file format should do that.
When you remove this requirement, nearly all his arguments become moot.
On the contrary. People archive files to save space, exchange files with each other over unreliable networks that can corrupt data, and store them in RAM and on disks that can corrupt them, even if just temporarily. Compression formats are there to help with that; it is their main purpose. This is why fast and proper checksumming is expected - but not cryptographic hashing like SHA-256, which adds nothing to this goal but overhead.
I can understand the concerns about versioning and fragmented extension implementations though.
Actually, one uses the tape archive utility, tar, to write directly to the tape. (-:
renice 19 -p $$ > /dev/null 2>&1
then ... Use tar + xz to save extra metadata about the file(s), even if it is only one file.
tar cf - ~/test_files/* | xz -9ec -T0 > ./test.tar.xz
If that (or the extra options in tar for xattrs) is not enough, then create a checksum manifest, always sorted:
sha256sum ~/test_files/* | sort -n > ~/test_files/.sha256
Then use the above command to compress it all into a .tar.xz that now contains your checksum manifest.
34M zig-linux-x86_64-0.2.0.cc35f085.tar.gz
33M zig-linux-x86_64-0.2.0.cc35f085.tar.zst
30M zig-linux-x86_64-0.2.0.cc35f085.tar.bz2
24M zig-linux-x86_64-0.2.0.cc35f085.tar.lz
23M zig-linux-x86_64-0.2.0.cc35f085.tar.xz
With maximum compression (the -9 switch), lzip wins but takes longer than xz:
23725264 zig-linux-x86_64-0.2.0.cc35f085.tar.xz 63.05 seconds
23627771 zig-linux-x86_64-0.2.0.cc35f085.tar.lz 83.42 seconds
Perhaps folks are trying to stick with packages that are in their base repo. p7zip is usually outside of the standard base repos.
Packing a bunch of files together as .tgz is a quite universal format and compresses most of the redundancy out. It has some pathological cases but those are rare, and for general files it's still in the same ballpark as other compressors.
I remember using .tbz2 at the turn of the millennium because at the time download/upload times did matter and in some cases it was actually faster to compress with bzip2 and then send over less data.
But DSL broadband pretty much made it not matter any longer: transfers were fast enough that I don't think I've specifically downloaded or specifically created a .tbz2 archive for years. Good old .tgz is more than enough. Files are usually copied in seconds instead of minutes, and really big files still take hours and hours.
None of the compressors really turn a 15-minute download into a 5-minute download consistently. And the download is likely to be fast enough anyway. Disk space is cheap enough that you haven't needed the best compression methods for ages in order to stuff as much data on portable or backup media.
Ditto for p7zip. It has more features and compresses faster and better, but for all practical purposes zip is just as good. Even though it's slower, it won't take more than a breeze to create and transfer, and it unzips virtually everywhere.
What is the probability of a complete HD failure in a year?
tar c foo | gzip > foo.tar.gz
or tar c foo | bzip2 > foo.tar.bz2
Been using these for over 20 years now. Why is it so important to change things, especially as this article points out, for the worse?!
"What medium should be used for long term, high volume, data storage (archival)?" https://superuser.com/q/374609/52739
It mostly focuses on the media instead of formats though.
Amazon Glacier runs on BDXL disc libraries (like a tape library). There's nothing truly expensive about producing BDXL media, there just isn't enough volume in the consumer market to make it worthwhile. If you contract directly with suppliers for a few million discs at a time, that's not an issue (you did say high-volume, right?).
https://storagemojo.com/2014/04/25/amazons-glacier-secret-bd...
For medium-scale users, tape libraries are still the way to go. You can have petabytes of near-line storage in a rack. Storage conditions are not really a concern in a datacenter, which is where they should live.
(CERN has about 200 petabytes of tapes for their long-term storage.)
https://home.cern/about/updates/2017/07/cern-data-centre-pas...
If you mean "high-volume for a small business", probably also tapes, or BD discs with 20% parity encoding to guard against bitrot.
Small users should also consider dumping it in Glacier as a fallback - make it Amazon's problem. If you have a significant stream of data it'll get expensive over time, but if it's business-critical data then you don't really have a choice, do you?
You can also use xz on top of something that can correct errors, such as par2.
> According to [Koopman] (p. 50), one of the "Seven Deadly Sins" (i.e., bad ideas) of CRC and checksum use is failing to protect a message length field. This causes vulnerabilities due to framing errors. Note that the effects of a framing error in a data stream are more serious than what Figure 1 suggests. Not only data at a random position are interpreted as the CRC. Whatever data that follow the bogus CRC will be interpreted as the beginning of the following field, preventing the successful decoding of any remaining data in the stream.
> Except the 'Backward Size' field in the stream footer, none of the many length fields in the xz format is protected by a check sequence of any kind. Not even a parity bit. All of them suffer from the framing vulnerability illustrated in the picture above.
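Koopman's point is easy to demonstrate with a toy framed stream (a hypothetical format, not xz's actual layout): each record is a 4-byte little-endian length plus payload, and nothing protects the length field itself.

```python
import struct
import zlib

def pack(records):
    """Length-prefixed records; the length field itself is unprotected."""
    return b"".join(struct.pack("<I", len(r)) + r for r in records)

def parse(data):
    recs, pos = [], 0
    while pos + 4 <= len(data):
        (n,) = struct.unpack_from("<I", data, pos)
        recs.append(data[pos + 4 : pos + 4 + n])
        pos += 4 + n
    return recs

stream = pack([b"hello", b"world"])
good = parse(stream)                 # [b"hello", b"world"]

corrupt = bytearray(stream)
corrupt[0] ^= 0x02                   # one bit flip: length 5 -> 7
misparsed = parse(bytes(corrupt))    # framing shifts; every record after
                                     # the flipped field decodes wrong

# Protecting the header fixes this: a CRC over the length field lets the
# reader detect the flip before mis-framing everything that follows.
def pack_protected(records):
    out = b""
    for r in records:
        hdr = struct.pack("<I", len(r))
        out += hdr + struct.pack("<I", zlib.crc32(hdr)) + r
    return out
```

A single bit flip in one length field corrupts not just that record but the framing of the entire rest of the stream, which is exactly the cascade the quoted passage describes.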
If your storage fails, maybe you'll have a problem, but you'd have a problem anyway.
Sometimes I feel like genuine technical concerns are buried by the authors being jerks and blowing things way out of proportion. I, for one, tend to lose interest when I hear hyperbolic mudslinging.
Wow ... that is inexcusably idiotic. Whoever designed that shouldn't be programming. Out of professional disdain, I pledge never to use this garbage.
We certainly should have environments where we can tell someone code is shit, it's just silly and counterproductive to then leap to attacks on the abilities of the person behind it.