A few years ago, when I was on a game console team, a hardware engineer came to my desk and said, "Can you find out what's wrong with this disk drive?" It had come from a customer whose complaint was that games sometimes failed to download and game saves became unreadable.
I spent a fun afternoon tracking down what turned out to be a stuck-at-zero bit in that drive's cache. Just above the drive's ECC-it-to-death block storage was this flaky bit of RAM that was going totally unchecked. The console had a Merkle-tree based file system and easily detected the failure, but without that additional checking, the corruption would have been very subtle most of the time.
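A minimal sketch of why a Merkle-tree filesystem catches this kind of failure: every block's hash rolls up into its parent, so a single flipped bit anywhere changes the root and fails verification at read time. The block contents and helper names here are hypothetical, not the console's actual code:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(blocks):
    """Return (root_hash, leaf_hashes) for a list of data blocks."""
    hashes = [sha256(b) for b in blocks]
    level = hashes
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level = level + [level[-1]]
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0], hashes

blocks = [b"game save", b"texture data", b"config"]
root, leaves = build_merkle(blocks)

# Simulate a stuck bit in the drive's cache corrupting one block
corrupted = list(blocks)
corrupted[1] = bytes([corrupted[1][0] ^ 0x01]) + corrupted[1][1:]

bad_root, _ = build_merkle(corrupted)
assert bad_root != root  # the single-bit corruption surfaces at the root
```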
Okay, so that's just one system out of millions, right? What are the chances? Well, at the scale of millions, pretty much any hole in data integrity is going to be found out and affect real, live customers at some not insignificant rate. You really shouldn't be amazed at the number of single-bit memory errors happening on consumer hardware (from consoles to PCs -- and I assume phones). You should expect these failures and determine in advance if they are important to you and your customers.
Just asserting "CRCs are useless" is putting a lot of trust in hardware that has real-world failure modes.
Yes, and he does this over and over again throughout the article. I have personally experienced at least 3 scenarios that he has determined won't happen.
If this guy wrote a filesystem (something he pretends to have enough experience to critique), it would be an unreliable, unusable piece of crap.
Much of this article reads as though this scenario, which I in fact hit, won't happen, that drives do it better, and so on. But it does happen. It happened to me. The drive did not "magically fix itself"; it got worse over time. With ZFS, if it happens again, I can be told where it happened and exactly which files are affected, and that's already better than what I got with that other disk, which didn't have ZFS.
Plus the ZFS tools like snapshotting, send/receive, scrub being able to check integrity while the system is running... Those are great features.
I have no idea what the problem is with this server. There are no SMART failures or kernel messages indicating hardware failure, and the system doesn't hard-crash. The thing is, I don't actually have to care, because ZFS is actively taking care of the problem. Until one of the disks goes so bad that SMART or the kernel's SATA layer or ZFS can point me at it, I can just passively let ZFS continue protecting me.
If this were a RAID, the first risk is that the RAID system wouldn't have a scrub command at all. Some do, but not all. Without such a command, those on-disk ECCs the author heaps so much praise on won't help him. I've got the same ECCs backing my ZFS, and clearly the data is getting corrupted anyway, somehow.
Let's keep the author's context in mind, which is apparently that we're going to use motherboard or software RAID, since he's budgeted $0 for a hardware RAID card, so the chances are higher that there is no scrub or verify command.
If our RAID implementation does happen to have a scrub or verify command, it might be forced to just kick one of the disks out or mark the whole array as degraded, depending on where in the chain the corruption happened. If it does that, it'll take a whole lot longer to rewrite one of the author's cheap 3 TB disks than it took ZFS on my file server to fix the few megs of corrupted blocks.
And that's not all. I have a second anecdote, the plural of which is "data," right? :)
Another ZFS-based system I manage had a disk die outright in it. SMART errors, I/O timeouts, the whole bit. Very easy to diagnose. So I attached a third disk in an external hard disk enclosure to the ailing ZFS mirror, which caused ZFS to start resilvering it.
Before I go on, I want to point out that this shows another case where ZFS has a clear advantage. In a typical hardware RAID setup, a 2-disk mirror is more likely to be done with a 2-port RAID card, because they're cheaper than 4-port and 8-port cards. That means there is a very real chance that you couldn't set up a 3-disk mirror at all, which means you're temporarily reduced to no redundancy during the resilver operation. Even if you've got a spare RAID port on the RAID card or motherboard, you might not have another internal disk slot to put the disk in. With ZFS, I don't need either: ZFS doesn't care if two of a pool's disks are in a high-end RAID enclosure configured for JBOD and the third is in a cheap USB enclosure.
The point of having a temporary 3-disk mirror is that the dying disk wasn't quite dead yet. That means it was still useful for maintaining redundancy during the resilvering operation. With the RAID setup, you might be forced to replace the dying disk with the new disk, which means you lose all your redundancy during the resilver.
Now as it happens, sometime during the resilver operation, `zpool status` began showing corruptions. ZFS was actively fixing them like a trooper, but this was still very bad. It turned out that the cheap USB external disk enclosure I was using for the third disk was flaky, so when resilvering the new disk, it wasn't always able to write reliably. I exported the ZFS pool, moved the new disk to a different external USB enclosure, re-imported the pool, and watched it pick the resilvering process right back up. Once that was done, I detached the dying disk from the mirror and did a scrub pass to clear the data errors, and I was back in business having lost no data, despite the hardware actively trying to kill my data twice over.
There are still cases where I'll use RAID over ZFS, but I'm not under any illusion that ZFS offers no real advantages over RAID. I've seen plenty of evidence to the contrary.
(On a side note, ZFS -- at least OpenZFS -- doesn't support any CRC algorithms for use as its checksum.)
ZFS doesn't have to guess which copy is wrong. It knows, and it will automatically replace it.
More, ZFS will even do this on a ZFS mirror when reading half the data blocks from one disk and half from the other, because it reads the cryptographically-strong checksums in with each data block and checks them before delivering the data to the application. If the checksum doesn't match, it rewrites that block from the redundant copy on the other disk(s).
RAID can't do that. If one of a mirror's data blocks is corrupted on disk but with a correct ECC, so that the two copies don't match yet both read cleanly, RAID can't tell which one is correct, so it'll typically just force the system administrator to choose one disk to overwrite the other with. That trades astronomical odds against serving incorrect data for a coin flip.
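The read-path difference can be sketched in a few lines. This is a toy model, not ZFS's actual code: replicas are byte arrays standing in for the same block on each disk, and `expected_sum` stands in for the checksum ZFS stores in the parent block pointer. A plain mirror has no `expected_sum` at all, which is exactly why it can't break the tie:

```python
import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def mirror_read(replicas, expected_sum):
    """Read one block from a mirror, self-healing on checksum mismatch."""
    for data in replicas:
        if checksum(bytes(data)) == expected_sum:
            # Found a good copy: rewrite any replica that fails verification
            for other in replicas:
                if checksum(bytes(other)) != expected_sum:
                    other[:] = data  # repair from the good copy
            return bytes(data)
    raise IOError("no replica matches the checksum: unrecoverable")

good = b"block contents"
disk_a = bytearray(good)
disk_b = bytearray(good)
disk_b[0] ^= 0x40  # silent corruption on one side of the mirror
stored = checksum(good)  # checksum kept *away* from the data

assert mirror_read([disk_a, disk_b], stored) == good
assert bytes(disk_b) == good  # the bad copy was healed on read
```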
From the idea that SMART reliably detects hard drive failures... to dismissing data protection for no reason other than that it sounds unlikely to the author (which in several cases I know personally to be false, because I've experienced those failures).
ZFS is a very well designed filesystem. Things weren't added haphazardly or because they sounded cool. The author would do well to try to understand why those protections were added.
Sun/Oracle, and a lot of popular third-party documentation, have said as much very openly, and commands like zfs send/recv exist to easily automate zfs cloning (to back up from one zfs fs to another, for example, if you choose to do it that way).
I suspect whoever wrote this missed the boat on why zfs works.
The same "I've never seen it so it's not real" fallacy appears again in the discussion of RAID 5. He says that losing a second drive during a rebuild is "statistically very unlikely" but that's not so. Not only have I seen it many times, but the simple math of disk capacities and interface speeds shows that it's not really all that unlikely. I've seen RAID 6 fail because of overlapping rebuild times, leading people to push for more powerful erasure-coding schemes. Over the lifetime of even a medium-sized system, concurrent failures on RAID 5 are likely enough to justify using something stronger.
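The "simple math" is worth showing. With assumed (not measured) figures of a 3 TB drive, 150 MB/s sustained rebuild speed, a commonly specced unrecoverable-read-error rate of 1 per 10^14 bits, and a 4-disk RAID 5, a back-of-envelope estimate looks like this:

```python
# Back-of-envelope assumptions, not measurements:
disk_bytes = 3e12       # 3 TB drive
rebuild_rate = 150e6    # 150 MB/s sustained sequential
ure_rate = 1e-14        # unrecoverable read errors per bit (typical spec sheet)
disks_read = 3          # surviving disks that must be read fully in a 4-disk RAID 5

rebuild_hours = disk_bytes / rebuild_rate / 3600
bits_read = disk_bytes * 8 * disks_read
p_clean = (1 - ure_rate) ** bits_read  # chance of zero UREs during the rebuild

print(f"rebuild time: ~{rebuild_hours:.1f} hours of sustained full-speed reads")
print(f"chance of an error-free rebuild: ~{p_clean:.0%}")
```

Under these assumptions the array spends hours degraded and the odds of getting through the rebuild without a single read error are roughly a coin flip, which is why "a second failure during rebuild" is not an exotic event at scale.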
I was one of the earliest and most outspoken critics of ZFS hype and FUD when it came out. It was and is no panacea, but that doesn't justify more FUD in the other direction to sell backup products or services.
ZFS certainly isn't a magic wand you should wave at anything and everything, and it doesn't replace backups, but it does make the chances of something going wrong undetected much smaller. And even though the chances are small to begin with, there are times when you just can't accept them at all.
The author seems to misunderstand the purpose of snapshots. As is frequently pointed out [1], snapshots are not in fact backups and should not be used for longer-term storage.
Also, the same argument can be used against backups: "Backups may help, but they depend on the damage being caught before the last backup of the good data is removed. If you save something and come back six months later to find it damaged, your backups might contain only a few months of the damaged file, and the good copy was lost a long time ago."
[1] http://www.cobaltiron.com/2014/01/06/blog-snapshots-are-not-...
I don’t know much about btrfs, so I’ll stick to ZFS-related comments. ZFS does not use CRC; by default it uses the fletcher4 checksum. Fletcher’s checksum is designed to approach CRC’s error-detection properties without the computational overhead usually associated with CRC.
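A simplified Python rendition of the fletcher4 idea, as described in the ZFS sources: four 64-bit running sums over the data taken as 32-bit little-endian words. The real implementation is heavily optimized, vectorized C; this sketch only shows why a single flipped bit changes the result:

```python
import struct

def fletcher4(data: bytes):
    """Simplified fletcher4: four 64-bit running sums over 32-bit LE words."""
    assert len(data) % 4 == 0, "input must be a multiple of 4 bytes"
    a = b = c = d = 0
    mask = (1 << 64) - 1
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return (a, b, c, d)

block = b"some filesystem block padded." + b"\x00" * 3  # pad to 32 bytes
orig = fletcher4(block)

flipped = bytearray(block)
flipped[5] ^= 0x01  # single-bit rot
assert fletcher4(bytes(flipped)) != orig
```

The position-weighted sums (b, c, d) are what push it toward CRC-like detection of reordered and multi-bit errors while staying just adds in a loop.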
Without a checksum, there is no way to tell if the data you read back is different from what you wrote down. As you said, corruption can happen for a variety of reasons – due to bugs or HW failure anywhere in the storage stack. Just like other filesystems, not all types of corruption will be caught even by ZFS, especially on the write-to-disk side. However, ZFS will catch bit rot and a host of other corruptions, while non-checksumming filesystems will just pass the corrupted data back to the application. Hard drives don’t do it better: they have no idea if they’ve bit-rotted over time, and there are many other components that may and do corrupt data. It’s not as rare as you think. The longer you hold data and the more data you have, the higher the chance you will see corruption at some point.
I want to do my best to avoid corrupting data and then giving it back to my users, so I would like to know if my data has been corrupted (not to mention I’d like it to self-heal as well, which is what ZFS will do if there is a good copy available). If you care about your data, use a checksumming filesystem, period. Ideally, a checksumming filesystem that doesn’t keep the checksum next to the data. A typical checksum is less than 0.14 Kb, while the block it’s protecting is 128 Kb by default. I’ll take that 0.1% “waste of space” to detect corruption all day, any day. Now remember that ZFS can also do in-line compression, which will easily save you 3-50% of storage space (depending on the data you’re storing), which makes calling a checksum a “waste of space” even more laughable.
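The overhead arithmetic above is easy to check, taking the figures exactly as quoted (a sub-0.14 Kb checksum against a 128 Kb record):

```python
checksum_kb = 0.14   # upper bound quoted above
record_kb = 128.0    # default ZFS recordsize, as quoted
overhead = checksum_kb / record_kb
print(f"overhead: {overhead:.2%}")  # ~0.11%, i.e. roughly the quoted 0.1%
```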
I do want to say that I wholeheartedly agree with “Nothing replaces backups,” no matter what filesystem you’re using. Backing up between two OpenZFS pools on machines in different physical locations is super easy using zfs snapshotting and send/receive functionality.
ZFS was created to solve actual business problems.
Here's a quote:
- “ZFS has CRCs for data integrity
A certain category of people are terrified of the techno-bogeyman named “bit rot.” These people think that a movie file not playing back or a picture getting mangled is caused by data on hard drives “rotting” over time without any warning. The magical remedy they use to combat this today is the holy CRC, or “cyclic redundancy check.” It’s a certain family of hash algorithms that produce a magic number that will always be the same if the data used to generate it is the same every time.
This is, by far, the number one pain in the ass statement out of the classic ZFS fanboy’s mouth..."
Meanwhile in reality...
ZFS does not use CRCs for checksums.
It's very hard to take someone's view seriously when they are making mistakes at this level.
ZFS allows a range of checksum algorithms, including SHA256, and you can even specify per dataset the strength of checksum you want.
- "Hard drives already do it better"
No, they don't, or Oracle/Sun/OpenZFS developers wouldn't have spent time and money making it.
It makes a bit of a difference when your disk says 'whoops, sorry, CRC fail, that block's gone' and it was holding your whole filesystem together. Or when a power surge or bad component fries the whole drive at once.
ZFS allows optional duplication of metadata or data blocks automatically; as well as multiple levels of RAID-equivalency for automatic, transparent rebuilding of data/metadata in the presence of multiple unreliable or failed devices. Hard drives... don't do that.
Even ZFS running on a single disk can automatically keep 2 (or more) copies on disk of whatever datasets you think are especially important - just set the copies property on the dataset. Regular hard drives don't offer that.
- What about the very unlikely scenario where several bits flip in a specific way that thwarts the hard drive’s ECC? This is the only scenario where the hard drive would lose data silently, therefore it’s also the only bit rot scenario that ZFS CRCs can help with.
Well, that and entire disk failures.
And power failures leading to inconsistency on the drive.
And cable faults leading to the wrong data being sent to the drive to be written.
And drive firmware bugs.
And faulty cache memory or faulty controllers on the hard drive.
And poorly connected drives with intermittent glitches / timeouts in communication.
You get the idea.
I could also point out that ZFS allows you to backup quickly and precisely (via snapshots, and incremental snapshot diffs).
It allows you to detect errors as they appear (via scrubs) rather than find out years later when your photos are filled with vomit coloured blocks.
It also tells you every time it opens a file if it has found an error, and corrected it in the background for you - thank god! This 'passive warning' feature alone lets you quickly realise you have a bad disk or cable so you can do something about it. Consider the same situation with a hard drive over a period of years...
ZFS is a copy-on-write filesystem, so if something naughty happens like a power-cut during an update to a file, your original data is still there. Unlike a hard disk (or RAID).
It's trivial to set up automatic snapshots, which as well as allowing known-point-in-time recovery, are an exceptionally effective way to prevent viruses, user errors etc from wrecking your data. You can always wind back the clock.
Where is the author losing his data (that he knows of, and in his very limited experience)? "All of my data loss tends to come from poorly typed 'rm' commands." ...so, exactly the kind of situation that ZFS snapshots allow instant, certain, trouble-free recovery from in the space of seconds, either by rolling back the filesystem or by conveniently 'dipping into' past snapshots as though they were present-day directories.
Anyway, I do hope Mr/Ms nctritech learns to read the beginner's guide for technologies they critique in the future, and maybe even tries them once or twice, before writing the critique.
What next?
"Why even use C? Everything you can do in C, you can do in PHP anyway!"
Tiny nitpick, but though Oracle now owns and develops ZFS, Sun Microsystems was the company that initially designed and implemented it. Sun worked on it for about 5 years after release before Oracle acquired them.
Unlike most journaled filesystems, ZFS:
- allows a separate high-performance device to be used for the log. This is important because the cost of journalling can be high when lots of fsyncs are being issued to ensure integrity (e.g. run a write performance test on a database like PostgreSQL on ext4 with and without journalling; you'll see a difference).
- the filesystem log can be mirrored physically, to protect against the risk of log device failure [which would endanger writes in flight].
Other similarities/differences:
In a journalling FS, a full consistency check means taking the filesystem offline and running fsck against it. In ZFS, there is continual passive checking of file data and metadata at time of access, as well as the option of an online 'scrub' that is similar to the fsck of a journalled filesystem without requiring the filesystem to be unmounted.
While copy-on-write by itself may not be necessarily strictly superior to journalling, ZFS is strictly superior to either.