A few years ago, when I was on a game console team, a hardware engineer came to my desk and said, "Can you find out what's wrong with this disk drive?" It had come from a customer whose complaint was that games sometimes failed to download and game saves became unreadable.
I spent a fun afternoon tracking down what turned out to be a stuck-at-zero bit in that drive's cache. Just above the drive's ECC-it-to-death block storage was this flaky bit of RAM that was going totally unchecked. The console had a Merkle-tree based file system and easily detected the failure, but without that additional checking, the corruption would have been very subtle most of the time.
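A minimal sketch of why a Merkle-tree filesystem catches this kind of failure: every block's hash rolls up into its parent, so a single flipped bit anywhere changes the root and fails verification at read time. The block contents and helper names here are hypothetical, not the console's actual code:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(blocks):
    """Return (root_hash, leaf_hashes) for a list of data blocks."""
    hashes = [sha256(b) for b in blocks]
    level = hashes
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level = level + [level[-1]]
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0], hashes

blocks = [b"game save", b"texture data", b"config"]
root, leaves = build_merkle(blocks)

# Simulate a stuck bit in the drive's cache corrupting one block
corrupted = list(blocks)
corrupted[1] = bytes([corrupted[1][0] ^ 0x01]) + corrupted[1][1:]

bad_root, _ = build_merkle(corrupted)
assert bad_root != root  # the single-bit corruption surfaces at the root
```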
Okay, so that's just one system out of millions, right? What are the chances? Well, at the scale of millions, pretty much any hole in data integrity is going to be found out and affect real, live customers at some not insignificant rate. You really shouldn't be amazed at the number of single-bit memory errors happening on consumer hardware (from consoles to PCs -- and I assume phones). You should expect these failures and determine in advance if they are important to you and your customers.
Just asserting "CRCs are useless" is putting a lot of trust in hardware that has real-world failure modes.
Yes, and he does this over and over again throughout the article. I have personally experienced at least 3 scenarios that he has determined won't happen.
If this guy wrote a filesystem (something he pretends to have enough experience to critique), it would be an unreliable, unusable piece of crap.
Much of this article reads as though this scenario, which I in fact hit, won't happen, that drives do it better, and so on. But it does happen. It happened to me. The drive did not "magically fix itself"; it got worse over time. With ZFS, if it happens again, I can be told where it happened and exactly which files are affected, and that's already better than what I got with that other disk, which didn't have ZFS.
Plus the ZFS tools like snapshotting, send/receive, scrub being able to check integrity while the system is running... Those are great features.
I have no idea what the problem is with this server. There are no SMART failures or kernel messages indicating hardware failure, and the system doesn't hard-crash. The thing is, I don't actually have to care, because ZFS is actively taking care of the problem. Until one of the disks goes so bad that SMART or the kernel's SATA layer or ZFS can point me at it, I can just passively let ZFS continue protecting me.
If this were a RAID, the first risk is that the RAID system wouldn't have a scrub command at all. Some do, but not all. Without such a command, those on-disk ECCs the author heaps so much praise on won't help him. I've got the same ECCs backing my ZFS, and clearly the data is getting corrupted anyway, somehow.
Let's keep the author's context in mind, which is apparently that we're going to use motherboard or software RAID, since he's budgeted $0 for a hardware RAID card, so the chances are higher that there is no scrub or verify command.
If our RAID implementation does happen to have a scrub or verify command, it might be forced to just kick one of the disks out or mark the whole array as degraded, depending on where in the chain the corruption happened. If it does that, it'll take a whole lot longer to rewrite one of the author's cheap 3 TB disks than it took ZFS on my file server to fix the few megs of corrupted blocks.
And that's not all. I have a second anecdote, the plural of which is "data," right? :)
Another ZFS-based system I manage had a disk die outright in it. SMART errors, I/O timeouts, the whole bit. Very easy to diagnose. So I attached a third disk in an external hard disk enclosure to the ailing ZFS mirror, which caused ZFS to start resilvering it.
Before I go on, I want to point out that this shows another case where ZFS has a clear advantage. In a typical hardware RAID setup, a 2-disk mirror is more likely to be done with a 2-port RAID card, because they're cheaper than 4-port and 8-port cards. That means there is a very real chance that you couldn't set up a 3-disk mirror at all, which means you're temporarily reduced to no redundancy during the resilver operation. Even if you've got a spare RAID port on the RAID card or motherboard, you might not have another internal disk slot to put the disk in. With ZFS, I don't need either: ZFS doesn't care if two of a pool's disks are in a high-end RAID enclosure configured for JBOD and the third is in a cheap USB enclosure.
The point of having a temporary 3-disk mirror is that the dying disk wasn't quite dead yet. That means it was still useful for maintaining redundancy during the resilvering operation. With the RAID setup, you might be forced to replace the dying disk with the new disk, which means you lose all your redundancy during the resilver.
Now as it happens, sometime during the resilver operation, `zpool status` began showing corruptions. ZFS was actively fixing them like a trooper, but this was still very bad. It turned out that the cheap USB external disk enclosure I was using for the third disk was flaky, so when resilvering the new disk, it wasn't always able to write reliably. I exported the ZFS pool, moved the new disk to a different external USB enclosure, re-imported the pool, and watched it pick the resilvering process right back up. Once that was done, I detached the dying disk from the mirror and did a scrub pass to clear the data errors, and I was back in business having lost no data, despite the hardware actively trying to kill my data twice over.
There are still cases where I'll use RAID over ZFS, but I'm not under any illusion that ZFS offers no real advantages over RAID. I've seen plenty of evidence to the contrary.
(On a side note, ZFS -- at least OpenZFS -- doesn't support any CRC algorithms for use as its checksum.)
ZFS doesn't have to guess which copy is wrong. It knows, and it will automatically replace it.
More, ZFS will even do this on a ZFS mirror when reading half the data blocks from one disk and half from the other, because it reads the cryptographically-strong checksums in with each data block and checks them before delivering the data to the application. If the checksum doesn't match, it rewrites that block from the redundant copy on the other disk(s).
RAID can't do that. If one of a mirror's data blocks is corrupted on disk but with a correct ECC, so that the two copies don't match yet both read cleanly, RAID can't tell which one is correct, so it'll typically just force the system administrator to choose one disk to overwrite the other with. That trades astronomical odds against serving incorrect data for a coin flip.
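The read-path difference can be sketched in a few lines. This is a toy model, not ZFS's actual code: replicas are byte arrays standing in for the same block on each disk, and `expected_sum` stands in for the checksum ZFS stores in the parent block pointer. A plain mirror has no `expected_sum` at all, which is exactly why it can't break the tie:

```python
import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def mirror_read(replicas, expected_sum):
    """Read one block from a mirror, self-healing on checksum mismatch."""
    for data in replicas:
        if checksum(bytes(data)) == expected_sum:
            # Found a good copy: rewrite any replica that fails verification
            for other in replicas:
                if checksum(bytes(other)) != expected_sum:
                    other[:] = data  # repair from the good copy
            return bytes(data)
    raise IOError("no replica matches the checksum: unrecoverable")

good = b"block contents"
disk_a = bytearray(good)
disk_b = bytearray(good)
disk_b[0] ^= 0x40  # silent corruption on one side of the mirror
stored = checksum(good)  # checksum kept *away* from the data

assert mirror_read([disk_a, disk_b], stored) == good
assert bytes(disk_b) == good  # the bad copy was healed on read
```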
From the idea that SMART reliably detects hard drive failures... to dismissing data protection for no reason other than that it sounds unlikely to the author (which in several cases I know personally to be false, because I've experienced those failures).
ZFS is a very well designed filesystem. Things weren't added haphazardly or because they sounded cool. The author would do well to try to understand why those protections were added.
Sun/Oracle, and a lot of popular third-party documentation, have said as much very openly, and commands like zfs send/recv exist to easily automate zfs cloning (to back up from one zfs fs to another, for example, if you choose to do it that way).
I suspect whoever wrote this missed the boat on why zfs works.
The same "I've never seen it so it's not real" fallacy appears again in the discussion of RAID 5. He says that losing a second drive during a rebuild is "statistically very unlikely" but that's not so. Not only have I seen it many times, but the simple math of disk capacities and interface speeds shows that it's not really all that unlikely. I've seen RAID 6 fail because of overlapping rebuild times, leading people to push for more powerful erasure-coding schemes. Over the lifetime of even a medium-sized system, concurrent failures on RAID 5 are likely enough to justify using something stronger.
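The "simple math" is worth showing. With assumed (not measured) figures of a 3 TB drive, 150 MB/s sustained rebuild speed, a commonly specced unrecoverable-read-error rate of 1 per 10^14 bits, and a 4-disk RAID 5, a back-of-envelope estimate looks like this:

```python
# Back-of-envelope assumptions, not measurements:
disk_bytes = 3e12       # 3 TB drive
rebuild_rate = 150e6    # 150 MB/s sustained sequential
ure_rate = 1e-14        # unrecoverable read errors per bit (typical spec sheet)
disks_read = 3          # surviving disks that must be read fully in a 4-disk RAID 5

rebuild_hours = disk_bytes / rebuild_rate / 3600
bits_read = disk_bytes * 8 * disks_read
p_clean = (1 - ure_rate) ** bits_read  # chance of zero UREs during the rebuild

print(f"rebuild time: ~{rebuild_hours:.1f} hours of sustained full-speed reads")
print(f"chance of an error-free rebuild: ~{p_clean:.0%}")
```

Under these assumptions the array spends hours degraded and the odds of getting through the rebuild without a single read error are roughly a coin flip, which is why "a second failure during rebuild" is not an exotic event at scale.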
I was one of the earliest and most outspoken critics of ZFS hype and FUD when it came out. It was and is no panacea, but that doesn't justify more FUD in the other direction to sell backup products or services.
ZFS certainly isn't a magic wand you should wave at anything and everything, and it doesn't replace backups, but it does make the chances of something going wrong undetected much smaller. And even though the chances are small to begin with, there are times when you just can't accept them at all.
The author seems to misunderstand the purpose of snapshots. As is frequently pointed out [1], snapshots are not in fact backups and should not be used for longer-term storage.
Also, the same argument can be used against backups: "Backups may help, but they depend on the damage being caught before the last backup of the good data is removed. If you save something and come back six months later to find it damaged, your backups might contain only a few months of the damaged file, and the good copy was lost a long time ago."
[1] http://www.cobaltiron.com/2014/01/06/blog-snapshots-are-not-...
I don’t know much about btrfs, so I’ll stick to ZFS-related comments. ZFS does not use CRC; by default it uses the fletcher4 checksum. Fletcher’s checksum is designed to approach CRC’s error-detection properties without the computational overhead usually associated with CRC.
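A simplified Python rendition of the fletcher4 idea, as described in the ZFS sources: four 64-bit running sums over the data taken as 32-bit little-endian words. The real implementation is heavily optimized, vectorized C; this sketch only shows why a single flipped bit changes the result:

```python
import struct

def fletcher4(data: bytes):
    """Simplified fletcher4: four 64-bit running sums over 32-bit LE words."""
    assert len(data) % 4 == 0, "input must be a multiple of 4 bytes"
    a = b = c = d = 0
    mask = (1 << 64) - 1
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return (a, b, c, d)

block = b"some filesystem block padded." + b"\x00" * 3  # pad to 32 bytes
orig = fletcher4(block)

flipped = bytearray(block)
flipped[5] ^= 0x01  # single-bit rot
assert fletcher4(bytes(flipped)) != orig
```

The position-weighted sums (b, c, d) are what push it toward CRC-like detection of reordered and multi-bit errors while staying just adds in a loop.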
Without a checksum, there is no way to tell if the data you read back is different from what you wrote down. As you said, corruption can happen for a variety of reasons – due to bugs or HW failure anywhere in the storage stack. Just like other filesystems, not all types of corruption will be caught even by ZFS, especially on the write-to-disk side. However, ZFS will catch bit rot and a host of other corruptions, while non-checksumming filesystems will just pass the corrupted data back to the application. Hard drives don’t do it better: they have no idea if they’ve bit-rotted over time, and there are many other components that may and do corrupt data. It’s not as rare as you think. The longer you hold data and the more data you have, the higher the chance you will see corruption at some point.
I want to do my best to avoid corrupting data and then giving it back to my users, so I would like to know if my data has been corrupted (not to mention I’d like it to self-heal as well, which is what ZFS will do if there is a good copy available). If you care about your data, use a checksumming filesystem, period. Ideally, a checksumming filesystem that doesn’t keep the checksum next to the data. A typical checksum is less than 0.14 Kb, while the block it’s protecting is 128 Kb by default. I’ll take that 0.1% “waste of space” to detect corruption all day, any day. Now remember that ZFS can also do in-line compression, which will easily save you 3-50% of storage space (depending on the data you’re storing), which makes calling a checksum a “waste of space” even more laughable.
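The overhead arithmetic above is easy to check, taking the figures exactly as quoted (a sub-0.14 Kb checksum against a 128 Kb record):

```python
checksum_kb = 0.14   # upper bound quoted above
record_kb = 128.0    # default ZFS recordsize, as quoted
overhead = checksum_kb / record_kb
print(f"overhead: {overhead:.2%}")  # ~0.11%, i.e. roughly the quoted 0.1%
```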
I do want to say that I wholeheartedly agree with “Nothing replaces backups,” no matter what filesystem you’re using. Backing up between two OpenZFS pools on machines in different physical locations is super easy using zfs snapshotting and send/receive functionality.
ZFS was created to solve actual business problems.
Here's a quote:
- “ZFS has CRCs for data integrity
A certain category of people are terrified of the techno-bogeyman named “bit rot.” These people think that a movie file not playing back or a picture getting mangled is caused by data on hard drives “rotting” over time without any warning. The magical remedy they use to combat this today is the holy CRC, or “cyclic redundancy check.” It’s a certain family of hash algorithms that produce a magic number that will always be the same if the data used to generate it is the same every time.
This is, by far, the number one pain in the ass statement out of the classic ZFS fanboy’s mouth..."
Meanwhile in reality...
ZFS does not use CRCs for checksums.
It's very hard to take someone's view seriously when they are making mistakes at this level.
ZFS allows a range of checksum algorithms, including SHA256, and you can even specify per dataset the strength of checksum you want.
- "Hard drives already do it better"
No, they don't, or Oracle/Sun/OpenZFS developers wouldn't have spent time and money making it.
It makes a bit of a difference when your disk says 'whoops, sorry, CRC fail, that block's gone' and it was holding your whole filesystem together. Or when a power surge or bad component fries the whole drive at once.
ZFS allows optional duplication of metadata or data blocks automatically; as well as multiple levels of RAID-equivalency for automatic, transparent rebuilding of data/metadata in the presence of multiple unreliable or failed devices. Hard drives... don't do that.
Even ZFS running on a single disk can automatically keep 2 (or more) copies on disk of whatever datasets you think are especially important - just set the copies property on the dataset. Regular hard drives don't offer that.
- What about the very unlikely scenario where several bits flip in a specific way that thwarts the hard drive’s ECC? This is the only scenario where the hard drive would lose data silently, therefore it’s also the only bit rot scenario that ZFS CRCs can help with.
Well, that and entire disk failures.
And power failures leading to inconsistency on the drive.
And cable faults leading to the wrong data being sent to the drive to be written.
And drive firmware bugs.
And faulty cache memory or faulty controllers on the hard drive.
And poorly connected drives with intermittent glitches / timeouts in communication.
You get the idea.
I could also point out that ZFS allows you to backup quickly and precisely (via snapshots, and incremental snapshot diffs).
It allows you to detect errors as they appear (via scrubs) rather than find out years later when your photos are filled with vomit coloured blocks.
It also tells you every time it opens a file if it has found an error, and corrected it in the background for you - thank god! This 'passive warning' feature alone lets you quickly realise you have a bad disk or cable so you can do something about it. Consider the same situation with a hard drive over a period of years...
ZFS is a copy-on-write filesystem, so if something naughty happens like a power-cut during an update to a file, your original data is still there. Unlike a hard disk (or RAID).
It's trivial to set up automatic snapshots, which as well as allowing known-point-in-time recovery, are an exceptionally effective way to prevent viruses, user errors etc from wrecking your data. You can always wind back the clock.
Where is the author losing his data (that he knows of, and in his very limited experience)? "All of my data loss tends to come from poorly typed 'rm' commands." ...so, exactly the kind of situation that ZFS snapshots allow instant, certain, trouble-free recovery from in the space of seconds, either by rolling back the filesystem or by conveniently 'dipping into' past snapshots as though they were present-day directories.
Anyway, I do hope Mr/Ms nctritech learns to read the beginner's guide for technologies they critique in the future, and maybe even tries them once or twice, before writing the critique.
What next?
"Why even use C? Everything you can do in C, you can do in PHP anyway!"
Tiny nitpick, but though Oracle now owns and develops ZFS, Sun Microsystems was the company that initially designed and implemented it. Sun worked on it for about 5 years after release before Oracle acquired them.
Unlike most journaled filesystems, ZFS:
- allows a separate high-performance device to be used for the log. This is important because the cost of journalling can be high when lots of fsyncs are being issued to ensure integrity (e.g. run a write performance test on a database like PostgreSQL on ext4 with and without journalling; you'll see a difference).
- the filesystem log can be mirrored physically, to protect against the risk of log device failure [which would endanger writes in flight].
Other similarities/differences:
In a journalling FS, a full consistency check means taking the filesystem offline and running fsck against it. In ZFS, there is continual passive checking of file data and metadata at time of access, as well as the option of an online 'scrub' that is similar to the fsck of a journalled filesystem without requiring the filesystem to be unmounted.
While copy-on-write by itself may not be necessarily strictly superior to journalling, ZFS is strictly superior to either.