To me, most of the claims are arguable.
To say 3 levels of headers is "unsafe complexity"... I don't agree. Indirection is fundamental to design.
To say padding is "useless"... I don't understand why padding and byte-alignment are given so much vitriol. Look at how much padding the tar format has. And tar is a good example of how "useless padding" was used to extend the format to support larger files. So this supposed "flaw" has been in tar for decades, with no disastrous effects at all.
The xz decision was not made "blindly". There was thought behind the decision.
And it's pure FUD to say "Xz implementations may choose what subset of the format they support. They may even choose to not support integrity checking at all. Safe interoperability among xz implementations is not guaranteed". You could say this about any software - "oh no, someone might make a bad implementation!" Format fragmentation is essentially a social problem more than a technical problem.
I'll leave it at this for now, but there's more I could write.
3 individual headers for one file format is unnecessary complexity.
> To say padding is "useless"
Padding in general is not useless, but padding in a compression format is very counterproductive.
> And it's pure FUD to say "Xz implementations may choose what subset of the format they support. They may even choose to not support integrity checking at all. Safe interoperability among xz implementations is not guaranteed". You could say this about any software - "oh no, someone might make a bad implementation!" Format fragmentation is essentially a social problem more than a technical problem.
This isn't about "someone making a bad implementation!", it's about crucial features being optional. That is, completely compliant implementations may or may not be able to decompress a given XZ archive, and may or may not be able to validate the archive.
XZ may not have been chosen blindly, but it certainly does not seem like a sensible format. There is no benefit to this complexity. We do not need or benefit from a format that is flexible, as we can just swap format and tool if we want to swap algorithms, like we have done so many times before (a proper compression format is just a tiny algorithm-specific header + trailing checksum, so it is not worth generalizing away).
Any and all benefits of XZ lie in LZMA2. We could have lzip2 and avoid all of these problems.
(I have no opinion as to whether LZIP should supersede GZIP/BZIP2, but XZ certainly seems like a poor choice.)
So all these file formats are unnecessarily complex?
- all OpenDocument formats
- all MS office formats
- all multimedia container formats
- deb/rpm packages
etc?
Just because it's in tar doesn't mean that the design is flawless. tar was created a long time ago, when a lot of things we are concerned with now weren't even thought of.
Deterministic, bit-reproducible archives are one thing that tar has recently struggled with[1], because the archive format was not originally designed with that in mind. With more foresight and a better archive format, this need not have been an issue at all.
[1] - https://lists.gnu.org/archive/html/help-tar/2015-05/msg00005...
While I think he made a case, I somewhat doubt that the other formats are flawless, and the real answer would lie in a more open analysis of all of them.
Like some other compressed formats, an lzip file is just a series of compressed blocks concatenated together, each block starting with a magic number and containing a certain amount of compressed data. There’s no overall file header, nor any marker that a particular block is the last one. This structure has the advantage that you can simply concatenate two lzip files, and the result is a valid lzip file that decompresses to the concatenation of what the inputs decompress to.
Thus, when the decompressor has finished reading a block and sees there’s more input data left in the file, there are two possibilities for what that data could contain. It could be another lzip block corresponding to additional compressed data. Or it could be any other random binary data, if the user is taking advantage of the “trailing data” feature, in which case the rest of the file should be silently ignored.
How do you tell the difference? Simply enough, by checking if the data starts with the 4-byte lzip magic number. If the magic number itself is corrupted in any way? Then the entire rest of the file is treated as “trailing data” and ignored. I hope the user notices their data is missing before they delete the compressed original…
It might be possible to identify an lzip block that has its magic number corrupted, e.g. by checking whether the trailing CRC is valid. However, at least at the time I discovered this, lzip’s decompressor made no attempt to do so. It’s possible the behavior has improved in later releases; I haven’t checked.
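The failure mode described above can be sketched with a toy parser. This is a minimal model, not the real lzip member layout: here each "member" is just the 4-byte magic, a 1-byte payload length, and the payload, but the trailing-data ambiguity works the same way.

```python
MAGIC = b"LZIP"

def split_members(data):
    """Walk concatenated members; stop at the first bytes that aren't magic.

    Mirrors the behavior described above: anything after the first failed
    magic check is silently treated as "trailing data"."""
    members, pos = [], 0
    while pos + 5 <= len(data) and data[pos:pos + 4] == MAGIC:
        n = data[pos + 4]
        members.append(data[pos + 5:pos + 5 + n])
        pos += 5 + n
    return members, data[pos:]  # (decoded members, ignored trailing data)

def member(payload):
    return MAGIC + bytes([len(payload)]) + payload

stream = member(b"first") + member(b"second")
ok, trailing = split_members(stream)          # both members decoded

# Flip one bit in the second member's magic number...
corrupt = bytearray(stream)
corrupt[len(member(b"first"))] ^= 0xFF
bad, trailing2 = split_members(bytes(corrupt))
# ...and the entire second member silently becomes "trailing data".
```

One flipped byte in a magic number is enough to make the decoder quietly drop everything after it, which is exactly the "I hope the user notices" scenario.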
But at least at the time this article was written: pot, meet kettle.
https://lists.debian.org/55C0FE82.7050700@gnu.org
Their advocacy in this thread was so good that I removed lzip from my system.
To add to that, if you need parity to recover from errors, you need to calculate how much based on your storage medium's durability and projected lifespan. It's not the file format's concern. The xz CRC should be irrelevant.
So you've archived two or more copies of each file? That means you're using at least twice as much space (and if you're keeping the original as well, more than twice).
For the likely corruption of the occasional single bit flip here and there, you could do a lot better by using something like par2 and/or dvdisaster (depending on what media you're archiving to).
You haven't?
It took me just one minor "data loss incident" ~20 years ago to very quickly convince me to become a lifetime member of the "backup all the things to a few different locations" club.
> That means you're use at least twice as much space (and if you're keeping the original as well, more than twice).
"Storage is cheap."
If your data is not in three different places it might as well not exist.
https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-co...
...relative to ... ? Is it better than lzip? lzip sounds like it would also use LZMA-based compression, right? This [1] sounds like an interesting and more detailed/up-to-date comparison. Also by the same author BTW.
People began using xz mostly because they (e.g. distro maintainers like Debian) had started seeing 7z files floating around, thought they were cool, and so wanted a format that did what 7z did but was an open standard rather than being dictated by some company. xz was that format, so they leapt on it.
As it turns out, lzip had already been around for a year (though I'm not sure in what state of usability) before the xz project was started, but the people who created xz weren't looking for something that compressed better, they were looking for something that compressed better like 7z, and xz is that.
(Meanwhile, what 7z/xz is actually better at, AFAIK, is long-range identical-run deduplication; this is what makes it the tool of choice in the video-game archival community for making archives of every variation of a ROM file. Stick 100 slight variations of a 5MB file together into one .7z (or .tar.xz) file, and they'll compress down to roughly 1.2x the size of a single variant of the file.)
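That long-range deduplication effect is easy to reproduce with Python's lzma module (which emits raw LZMA/xz-style streams). This toy stands in for the ROM scenario: 20 lightly mutated copies of a 50 KB blob of incompressible random data.

```python
import lzma
import random

random.seed(0)
base = bytes(random.randrange(256) for _ in range(50_000))  # incompressible "ROM"

variants = []
for i in range(20):
    v = bytearray(base)
    v[i * 100] ^= 0xFF               # each variant differs by a single byte
    variants.append(bytes(v))

one = lzma.compress(base)                   # one variant on its own
all20 = lzma.compress(b"".join(variants))   # all 20 back to back
# all20 ends up barely larger than one: the LZMA dictionary (8 MiB at the
# default preset) spans all the repeats, so variants 2..20 compress down
# to little more than match references into earlier copies.
```

The same holds for .tar.xz as long as the dictionary is larger than the span between repeated files; for bigger inputs that is why higher presets (xz -9 uses a 64 MiB dictionary) or 7z's larger dictionaries matter.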
665472 freebsd-11.0-release-amd64-disc1.iso
401728 freebsd-11.0-release-amd64-disc1.iso.xz 5m0.606s
406440 freebsd-11.0-release-amd64-disc1.iso.lz 5m43.375s
430872 freebsd-11.0-release-amd64-disc1.iso.bz2 1m38.654s
440400 freebsd-11.0-release-amd64-disc1.iso.gz 0m27.073s
431740 freebsd-11.0-release-amd64-disc1.iso.zst 0m3.424s
Maybe xz is not good for long-term archiving, but it's both faster and produces smaller files in most scenarios. However, I'm sticking with gz for backups, mainly because of its speed and popularity. If I want to compress anything to the smallest possible size without any regard for CPU time, then I use xz.
Previously discussed here on HN back then:
https://news.ycombinator.com/item?id=12768425
The author has made some minor revisions since then. Here are the main differences to the page compared to when it was first discussed here:
http://web.cvs.savannah.nongnu.org/viewvc/lzip/lzip/xz_inade...
And here's the full page history:
http://web.cvs.savannah.nongnu.org/viewvc/lzip/lzip/xz_inade...
I'm not so sure, using tools suitable for long-term archiving by default might not be a bad practice. The thing about archiving is that it's often hard to know in advance what exactly you want to keep long-term. Using more robust formats probably won't cost much in the short term, but could pay off in the long term.
[1]: https://github.com/lrq3000/pyFileFixity
[2]: http://dvdisaster.net/en/index.html
If you can't control the underlying storage, then ditto. Keeping and maintaining explicit parity chunks is somewhat inconvenient, but it works.
But if you just want to avoid bitrot of your own files, sitting on your own HDD, I'd recommend using a reliable storage system instead. ZFS or, at higher and more complicated levels, Ceph/Rook and its kin. That still offers a posix interface (unlike parity files), while being just as safe.
Parity is ECC with one bit. ECC is usually Reed-Solomon, which is just a fancy name for having more equations than data chunks - that's how it adds redundancy. Usually you should aim for 20-40% extra redundancy.
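The one-bit case is easy to see concretely. A single XOR parity chunk - a minimal sketch, not how par2's actual Reed-Solomon coding works - lets you rebuild any one lost chunk from the survivors:

```python
def xor_chunks(chunks):
    """XOR equally sized chunks together, byte by byte."""
    out = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_chunks(data)            # the single "extra equation"

# Chunk 1 is lost; XOR the surviving chunks with the parity to get it back.
recovered = xor_chunks([data[0], data[2], parity])
```

Reed-Solomon generalizes this to k parity chunks so that any k losses are recoverable, which is where the 20-40% figure comes from: parity chunks as a fraction of data chunks.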
Ceph, HDFS and other distributed storage systems implement erasure coding (which is subtly different from error correction coding), which I would recommend for handling backups.
I think for backup (as in small-scale, "fits on one disk") error-correcting codes are not a really good approach, because IME hard disks with one error you notice usually have made many more errors - or will do so shortly. In that case no ECC will help you. If, on the other hand, you're looking at an isolated error, then only very little data is affected (on average).
For example, a bit error in a chunk in a tool like borg/restic will only break that chunk; a piece of a file or perhaps part of a directory listing.
So for these kinds of scenarios "just use multiple backup drives with fully independent backups" is better and simpler.
xz can be amazing. It can also bite you.
I've had payloads that compress to 0.16 with gzip then compress to 0.016 with xz. Hurray! Then I've had payloads where xz compression is par, or worse. However, with "best or extreme" compression, xz can peg your CPU for much longer. gzip and bzip2 will take minutes while xz -9 takes hours at 100% CPU.
As annoying as that is, getting an order of magnitude better in many circumstances is hard to give up.
My compromise is "xz -1". It usually delivers pretty good results, in reasonable time, with manageable CPU/Memory usage.
FYI. The datasets are largely text-ish. Usually in 250MB-1GB chunks. So talking JSON data, webpages, and the like.
If you store enough of the same type of data, invest in redesigning the application. There's a reason we all use jpegs over zipped bitmaps...
It's because it's an appropriate compression - just like xz can be? Not sure what you're actually suggesting here.
When I last looked into this issue, it seemed that erasure codes, like with Parchive/par/par2, was the way to go. (As others have mentioned here.) I haven't tried it out as I haven't needed that level of robustness.
Then why use the default settings?
I tend to use the maximum settings, which are much more of a memory hog, but I have enough memory where that's not an issue.
Just use the settings that are right for you.
I think he saw "'best' compression" and stopped looking there.
When I burn data (including xz archives) on to DVD for archival storage, I use dvdisaster[2] for the same purpose.
I've tested both by damaging archives and scratching DVDs, and these tools work great for recovery. The amount of redundancy (with a tradeoff for space) is also tuneable for both.
[1] - https://github.com/Parchive/par2cmdline
[2] - http://dvdisaster.net/
This article is likely more relevant to tape archives than anything most people use today.
The author seems to think the xz container file format should do that.
When you remove this requirement, nearly all his arguments become moot.
On the contrary. People archive files to save space, exchange files with each other over unreliable networks that can corrupt data, and store them in RAM and on disks that can corrupt them, even if just temporarily. Compression formats are there to help with that; it is their main purpose. This is why fast and proper checksumming is expected - but not cryptographic hashing like SHA-256, which adds nothing to this goal but overhead.
I can understand the concerns about versioning and fragmented extension implementations though.
Actually, one uses the tape archive utility, tar, to write directly to the tape. (-:
renice 19 -p $$ > /dev/null 2>&1
then ... Use tar + xz to save extra metadata about the file(s), even if it is only one file.
tar cf - ~/test_files/* | xz -9ec -T0 > ./test.tar.xz
If that (or the extra options in tar for xattrs) is not enough, then create a checksum manifest, always sorted:
sha256sum ~/test_files/* | sort -n > ~/test_files/.sha256
Then use the above command to compress it all into a .tar.xz that now contains your checksum manifest.
34M zig-linux-x86_64-0.2.0.cc35f085.tar.gz
33M zig-linux-x86_64-0.2.0.cc35f085.tar.zst
30M zig-linux-x86_64-0.2.0.cc35f085.tar.bz2
24M zig-linux-x86_64-0.2.0.cc35f085.tar.lz
23M zig-linux-x86_64-0.2.0.cc35f085.tar.xz
With maximum compression (the -9 switch), lzip wins but takes longer than xz:
23725264 zig-linux-x86_64-0.2.0.cc35f085.tar.xz 63.05 seconds
23627771 zig-linux-x86_64-0.2.0.cc35f085.tar.lz 83.42 seconds
Perhaps folks are trying to stick with packages that are in their base repo. p7zip is usually outside of the standard base repos.
Packing a bunch of files together as .tgz is a quite universal format and compresses most of the redundancy out. It has some pathological cases but those are rare, and for general files it's still in the same ballpark as other compressors.
I remember using .tbz2 at the turn of the millennium because at the time download/upload times did matter and in some cases it was actually faster to compress with bzip2 and then send over less data.
But DSL broadband pretty much made it not matter any longer: transfers were fast enough that I don't think I've specifically downloaded or specifically created a .tbz2 archive for years. Good old .tgz is more than enough. Files are usually copied in seconds instead of minutes, and really big files still take hours and hours.
None of the compressors really turn a 15-minute download into a 5-minute download consistently. And the download is likely to be fast enough anyway. Disk space is cheap enough that you haven't needed the best compression methods for ages in order to stuff as much data on portable or backup media.
Ditto for p7zip. It has more features and compresses faster and better, but for all practical purposes zip is just as good. Even though it's slower, it won't take more than a breeze to create and transfer, and it unzips virtually everywhere.
What is the probability of a complete HD failure in a year?
tar c foo | gzip > foo.tar.gz
or tar c foo | bzip2 > foo.tar.bz2
Been using these for over 20 years now. Why is it so important to change things, especially as this article points out, for the worse?!
"What medium should be used for long term, high volume, data storage (archival)?" https://superuser.com/q/374609/52739
It mostly focuses on the media instead of formats though.
Amazon Glacier runs on BDXL disc libraries (like a tape library). There's nothing truly expensive about producing BDXL media, there just isn't enough volume in the consumer market to make it worthwhile. If you contract directly with suppliers for a few million discs at a time, that's not an issue (you did say high-volume, right?).
https://storagemojo.com/2014/04/25/amazons-glacier-secret-bd...
For medium-scale users, tape libraries are still the way to go. You can have petabytes of near-line storage in a rack. Storage conditions are not really a concern in a datacenter, which is where they should live.
(CERN has about 200 petabytes of tapes for their long-term storage.)
https://home.cern/about/updates/2017/07/cern-data-centre-pas...
If you mean "high-volume for a small business", probably also tapes, or BD discs with 20% parity encoding to guard against bitrot.
Small users should also consider dumping it in Glacier as a fallback - make it Amazon's problem. If you have a significant stream of data it'll get expensive over time, but if it's business-critical data then you don't really have a choice, do you?
You can also use xz on top of something that can correct errors, such as par2.
> According to [Koopman] (p. 50), one of the "Seven Deadly Sins" (i.e., bad ideas) of CRC and checksum use is failing to protect a message length field. This causes vulnerabilities due to framing errors. Note that the effects of a framing error in a data stream are more serious than what Figure 1 suggests. Not only data at a random position are interpreted as the CRC. Whatever data that follow the bogus CRC will be interpreted as the beginning of the following field, preventing the successful decoding of any remaining data in the stream.
> Except the 'Backward Size' field in the stream footer, none of the many length fields in the xz format is protected by a check sequence of any kind. Not even a parity bit. All of them suffer from the framing vulnerability illustrated in the picture above.
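Koopman's point is easy to demonstrate with a toy framed stream (a hypothetical format, not xz's actual layout): each record is a 4-byte little-endian length plus payload, and nothing protects the length field itself.

```python
import struct
import zlib

def pack(records):
    """Length-prefixed records; the length field itself is unprotected."""
    return b"".join(struct.pack("<I", len(r)) + r for r in records)

def parse(data):
    recs, pos = [], 0
    while pos + 4 <= len(data):
        (n,) = struct.unpack_from("<I", data, pos)
        recs.append(data[pos + 4 : pos + 4 + n])
        pos += 4 + n
    return recs

stream = pack([b"hello", b"world"])
good = parse(stream)                 # [b"hello", b"world"]

corrupt = bytearray(stream)
corrupt[0] ^= 0x02                   # one bit flip: length 5 -> 7
misparsed = parse(bytes(corrupt))    # framing shifts; every record after
                                     # the flipped field decodes wrong

# Protecting the header fixes this: a CRC over the length field lets the
# reader detect the flip before mis-framing everything that follows.
def pack_protected(records):
    out = b""
    for r in records:
        hdr = struct.pack("<I", len(r))
        out += hdr + struct.pack("<I", zlib.crc32(hdr)) + r
    return out
```

A single bit flip in one length field corrupts not just that record but the framing of the entire rest of the stream, which is exactly the cascade the quoted passage describes.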
If your storage fails, maybe you'll have a problem, but you'd have a problem anyway.
Sometimes I feel like genuine technical concerns are buried by the authors being jerks and blowing things way out of proportion. I, for one, tend to lose interest when I hear hyperbolic mudslinging.
Wow ... that is inexcusably idiotic. Whoever designed that shouldn't be programming. Out of professional disdain, I pledge never to use this garbage.
We certainly should have environments where we can tell someone code is shit, it's just silly and counterproductive to then leap to attacks on the abilities of the person behind it.