The results vary from annoying (needing to restore / “resilver” with no redundancy until it’s done; massively increased risk of data loss while doing so, due to heavy IO load without redundancy and the pointless loss of the redundancy that already exists) to catastrophic (outright corruption). The corollary is that RAID invariably works poorly with disks connected over an interface that enumerates slowly or unreliably.
Yet most competent active-active database systems have no problems with this scenario!
I would love to see a RAID system that thinks of disks as nodes, properly elects leaders, and can efficiently fast-forward a disk that’s behind. A pile of USB-connected drives would work perfectly, would come up when a quorum was reached, and would behave correctly when only a varying subset of disks is available. Bonus points for also being able to run an array that spans multiple computers efficiently, but that would just be icing on the cake.
I'm not sure what you expect?
RAID1 is a simple data copy, and you made sure both disks contain different data. So there are two possible outcomes: either the system notices this and copies A to B or B to A to re-establish the redundancy, or it fails to notice and you get corruption.
Linux MD allows for partial sync with the bitmap. If the system knows something in the first 5% of the disk changed, it can limit itself to only syncing that 5%.
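To make the bitmap idea concrete, here is a minimal sketch (not MD's actual implementation; class and names are mine): each bitmap bit covers a fixed-size chunk, writes mark their chunks dirty before the data goes down, and a resync after an interruption only has to copy the dirty chunks.

```python
# Hedged sketch of a write-intent bitmap, in the spirit of Linux MD's
# (chunk size and class are illustrative, not MD's real values).

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MiB per bitmap bit (assumed for illustration)

class MirrorWithBitmap:
    def __init__(self, n_chunks):
        self.n_chunks = n_chunks
        self.dirty = set()  # indices of chunks written since the last full sync

    def write(self, offset, length):
        # Mark every chunk the write touches as dirty *before* writing data,
        # so a crash mid-write still leaves the region flagged for resync.
        first = offset // CHUNK_SIZE
        last = (offset + length - 1) // CHUNK_SIZE
        self.dirty.update(range(first, last + 1))

    def resync(self):
        # After an unclean stop, only dirty chunks need copying from the
        # good disk; clean chunks are known to be identical already.
        to_copy = sorted(self.dirty)
        self.dirty.clear()
        return to_copy

m = MirrorWithBitmap(n_chunks=100)
m.write(0, 128 * 1024 * 1024)  # touches chunks 0 and 1
print(m.resync())              # [0, 1] -- 2% of a 100-chunk disk, not 100%
```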
> Yet most competent active-active database systems have no problems with this scenario!
Because they're not RAID. The whole point of RAID is that it's extremely simple. This means it's a brute force method with some downsides, but in exchange it's extremely easy to reason about.
RAID is overkill for home use. It also does not solve backups and snapshots. I use one-way syncthing with unlimited history, plus a USB-SATA adapter.
I have a ZFS mirror, where I have taken one disk out, added files to it elsewhere, returned it and reimported.
The pool immediately resilvered the new content onto the untouched drive.
Doing this on btrfs will require a rebalance, forcing all data on the disks to be rewritten.
I believe btrfs replace will copy only the data that had a replica on the failing drive.
That's a weird argument. Even if it's true, it is now stable, and has been for a long time. btrfs has long been my default, and I'd be wary of switching to something newer just because someone was mad that development took a long time.
This includes plenty of random power losses.
The people on IRC tend to default to "unless you're using an enterprise drive, it's probably buggy and doesn't respect write barriers", which shouldn't have mattered because there was no system crash involved.
Yes, I did test my RAM, I know it's fine. For comparison, I've (unintentionally) run a ZFS system with bad RAM for years and it only manifested as an occasional checksum error.
Just out of curiosity: is there a specific reason you're not using plain-vanilla filesystems which _are_ stable?
Personal anecdote: i've only ever had serious corruption twice, 20-ish years ago, once with XFS and once with ReiserFS, and have primarily used the extN family of filesystems for most of the past 30 years. A filesystem only has to go corrupt on me once before i stop using it.
Edit to add a caveat: though i find the ideas behind ZFS, btrfs, etc., fascinating, i have no personal need for them so have never used them on personal systems (but did use ZFS on corporate Solaris systems many years ago). ext4 has always served me well, and comes with none of the caveats i regularly read about for any of the more advanced filesystems. Similarly, i've never needed an LVM or any such complexity. As the age-old wisdom goes, "complexity is your enemy," and keeping to simple filesystem setups has always served my personal systems/LAN well. i've also never once seen someone recover from filesystem corruption in a RAID environment by simply swapping out a disk (there's always been much more work involved), so i've never bought into the "RAID is the solution" camp.
I've personally had drive failures, fs corruptions due to power loss (which isn't supposed to happen on a CoW filesystem), fs and file corruption due to RAM bitflips, etc. Every time, btrfs handled the situation perfectly, with the caveat that I needed help from the btrfs developers. And they were very helpful!
So yeah, btrfs has a bad rep, but it is not as bad as the common sentiment makes it look.
(note that I still run btrfs raid 1, as I did not find real-world feedback regarding raid 5 or 6)
Try CachyOS (or at least the ZFS-Kernel) it has excellent ZFS integration.
1. The scheduler doesn't really exist. IIRC it is PID % num disks.
2. The default balancing policy is super basic. (IIRC always write to the disk with the most free space).
3. Erasure coding is still experimental.
4. Replication can only be configured at the FS level. bcachefs can configure this per-file or per-directory.
bcachefs is still early, but it shows that it is serious about multi-disk. You can lump any collection of disks together and it mostly does the right thing. It tracks the performance of different drives to make requests optimally, and it balances writes to gradually even out the drives (rather than focusing everything on a newly added disk).
IMHO there is really no comparison. If it wasn't for the fact that bcachefs ate my data I would be using it.
Or that the complexity is such that if a new bug is found, it may take a long time to be fixed because of the complexity, or it is fixed fast and has unexpected knock-on effects even for circumstances on the common path.
Something that takes a long time to be declared stable/reliable because of its complexity, needs to spend a long time after that declaration without significant issues before I'll actually trust it. Things like btrfs definitely live in this category.
bcachefs even won't be something I use for important storage until it has been battle-tested a bit more for a bit longer, though at this point it is much more likely to take over from my current simple ext4-on-RAID arrangement (and when/if it does, my backups might stay on ext4-on-RAID even longer).
Given the rather cheap price of durable storage these days, I would favour rock-solid, high-quality code for storing my data, at the expense of some optimisations. Then again, I still like RAID, instantaneous snapshots, COW, encryption, xattr, resizable partitions, CRC... Is it possible to have all this with acceptable performance and simple building blocks combined and layered on top of each other?
yeah, a feature-rich/complete fs is complicated; that's why we have very few of them.
ZFS does something smarter here, it keeps track of the queue length for each drive in a mirror, and picks the one with the lowest number of pending requests.
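The queue-length policy attributed to ZFS here is easy to sketch (illustrative code, not ZFS's actual vdev mirror code): each read goes to the mirror member with the fewest requests currently in flight.

```python
# Hedged sketch of lowest-queue-depth read selection for a mirror,
# as the comment describes ZFS doing (not the real implementation).

class Mirror:
    def __init__(self, n_disks):
        self.pending = [0] * n_disks  # in-flight request count per disk

    def issue_read(self):
        # Pick the disk with the shortest pending queue; a slow disk
        # accumulates in-flight requests and naturally gets fewer reads.
        disk = min(range(len(self.pending)), key=lambda i: self.pending[i])
        self.pending[disk] += 1
        return disk

    def complete(self, disk):
        self.pending[disk] -= 1

m = Mirror(2)
print(m.issue_read())  # 0 (both queues empty; ties go to the lowest index)
print(m.issue_read())  # 1 (disk 0 now has one request in flight)
m.complete(0)
print(m.issue_read())  # 0 (disk 0 drained back to zero)
```

Unlike PID-based selection, this adapts automatically when one mirror member is slower than the other.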
Personally, I was one of those people. Very excited about the prospects of btrfs, switched several machines over to it to test, ended up with filesystem corruption and had to revert to ext. Now, whenever I peek at btrfs, I never see anything that's compelling over running ZFS, which I've run for close to 15+ years, and run hard, and have never had data loss. Even in the early days with zfs+fuse, where I could regularly crash the zfs fuse; the zfs+fuse developers quickly addressed every crash I ran into, once I put together a stress test.
Is it really? I must have missed the news. Back when it was released completely raw as a default for many distros, there were fundamental design level issues (e.g. "unbound internal fragmentation" reported by Shishkin). Plus all the reports and personal experiences of getting and trying to recover exotically shaped bricks when volume fills to 100% (which could happen at any time with btrfs). Is it all good now? Where can I read about btrfs behaving robustly when no free space is left?
BTW: even SLES (SUSE Linux Enterprise Server) says to use XFS for data and btrfs just for the OS. I wonder why.
You can do a replace, but then you need to buy a new drive.
- I never agreed with the btrfs default of root raid 1 system not booting up if a device is missing. I think the point of raid1 is to minimize downtime when losing a device and if you lose the other device before returning it to good state, that's 100% on you.
- Poor management tools compared to md (though bcachefs might be in the same boat). Some tools are poorly thought out, e.g. there is a tool for defragmentation, but it undoes sharing (so snapshots and deduped files get expanded).
- If a drive in raid1 drops but then later comes back, btrfs is still quite happy.
- Need of using btrfs balance, and in a certain way as well: https://github.com/kdave/btrfsmaintenance/blob/master/btrfs-... .
- At least it used to be difficult to recover when your filesystem becomes full. It helps if you have it on an LVM volume with extra space.
- Snapshotting or having a clone of a btrfs volume is dangerous (due to the uuid-based volume participant scanning)
- I believe raid5/6 is still experimental?
- I've lost a filesystem to btrfs raid10 (but my backups are good).
- I have also gotten my bcachefs into a state where I could no longer write to the filesystem, but I was still able to read it. So I'm inclined to keep using bcachefs for the time being.
Overall I just have the impression that btrfs was complicated and ended up in a design dead-end, making improvements anywhere from hard to very difficult, and I hope that bcachefs has made different base design choices, making future improvements easier.
Yes, the number of developers for bcachefs is smaller, but frankly, as long as it's possible for a project to advance with a single developer, that is going to be the most effective way to go. At the same time, I hope this situation improves in the future.
Add "degraded" to default mount options. Solved.
That's been implemented; in Linux 6.11 bcachefs will correct errors on read. See
> - Self healing on read IO/checksum error
in https://lore.kernel.org/linux-bcachefs/73rweeabpoypzqwyxa7hl...
Making it possible to scrub from userspace by walking and reading everything (tar -c /mnt/bcachefs >/dev/null).
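The same tar trick can be written as a small script (a sketch under the assumption that the filesystem verifies checksums on the read path and reports failures as I/O errors): walk the tree, read every byte, discard it, and collect any read errors.

```python
# Minimal userspace "scrub" in the spirit of the tar command above:
# reading every file forces a checksumming filesystem to verify (and,
# where supported, self-heal) each extent on the read path.

import os

def scrub(root):
    errors = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while f.read(1 << 20):  # read in 1 MiB chunks, discard
                        pass
            except OSError as e:
                errors.append((path, e))  # checksum failures surface as EIO
    return errors
```

Note this only covers file data reachable through the namespace; a real scrub also verifies metadata and unreferenced redundancy, which userspace cannot reach.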
Repro: the supposedly good copy is read into RAM, RAM corrupts a bit, the CRC is recalculated over the corrupted data, and the corrupted copy is written back to disk(s).
Why would it need to recalculate the CRC? The correct CRC (or other hash) for the data is already stored in the metadata trees; it's how it discovered that the data was corrupted in the first place. If it writes back corrupted data, it will be detected as corrupted again the next time.
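The distinction matters, so here is a sketch of the argument (illustrative code using CRC32 as a stand-in for whatever hash the filesystem stores): the stored checksum was computed from the original data at write time, so corrupted bytes fail verification on every subsequent read, whether or not they were written back.

```python
# Sketch of checksum-on-read detection: the stored checksum is fixed
# at write time, so writing corrupted *data* back does not hide the
# corruption. The bug described upthread would additionally require
# the filesystem to recompute the stored checksum from the bad data.

import zlib

stored_crc = zlib.crc32(b"good data")  # recorded in metadata at write time

def read_and_verify(data_on_disk):
    # On every read, recompute the checksum of what came off the disk
    # and compare it to the value stored in the metadata trees.
    return zlib.crc32(data_on_disk) == stored_crc

print(read_and_verify(b"good data"))  # True
print(read_and_verify(b"g00d data"))  # False -- detected again on each read
```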
Our RAM should all be ECC and our OSes should all be on self-healing filesystems.
0 problems in 2.5 months is not necessarily better than 1-2 problems in ~3 years, though. If we're just talking about the single partition boot drive use case, I think I'd go with the option that's had vastly more time to find and eliminate bugs. (If you're conservative about this stuff that probably means ext4, actually.)
- Stability but also
- Constant refactorings
and later
"Disclaimer, my personal data is stored on ZFS"
A bit troubling, I find
"RAID0 behavior is default when using multiple disks" never have I ever had the need for RAID0 or have I seen a customer using it. I think it was at one time popular with gamers before SSDs became popular and cheap.
"RAID 5/6 (experimental)
This is referred to as erasure coding and is listed as “DO NOT USE YET”, "
Well, you've got to start somewhere, but a comparison with btrfs and ZFS seems premature.

> A bit troubling, I find
I appreciated the candor
The approach of the bcachefs developers is that they will only recommend its usage if it's absolutely, 100% stable and won't eat your data. bcachefs isn't in that state yet, and the developers don't pretend it is.
This avoids the kind of trust issues that btrfs has
> The RAID56 feature provides striping and parity over several devices, same as the traditional RAID5/6. There are some implementation and design deficiencies that make it unreliable for some corner cases and the feature should not be used in production, only for evaluation or testing. The power failure safety for metadata with RAID56 is not 100%.
AFAIK ZFS has had deduplication support for a very long time (2009) and now even does opportunistic block cloning with much less overhead.
The new block cloning still had data corruption bugs quite recently.
But it has de-duplication; by your logic, no non-CoW FS should be in that list because they are not comparable.
The chart should have block dedup and file dedup as separate columns if they are deemed not comparable.
In theory full file deduplication exists in every filesystem that has cow/reflink support
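A sketch of what that looks like in practice (hypothetical script, my own function names): find files with identical content, then clone one over each duplicate so they share extents. The actual clone step is left as a comment because the FICLONE ioctl only succeeds on reflink-capable filesystems (btrfs, XFS, bcachefs), not e.g. ext4.

```python
# Hedged sketch of whole-file dedup on a reflink-capable filesystem:
# step 1 (shown, portable) finds byte-identical files by content hash;
# step 2 (commented, Linux + reflink fs only) would clone extents.

import hashlib
import os
from collections import defaultdict

def duplicate_groups(paths):
    by_hash = defaultdict(list)
    for p in paths:
        h = hashlib.sha256()
        with open(p, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        by_hash[h.hexdigest()].append(p)
    # Only groups with two or more members are dedup candidates.
    return [group for group in by_hash.values() if len(group) > 1]

# For each group, cloning would look roughly like (Linux-specific;
# FICLONE is the ioctl behind `cp --reflink`):
#   import fcntl
#   with open(keep, "rb") as src, open(dup, "wb") as dst:
#       fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
```

After the clone, the files are logically independent (CoW on write) but physically share storage, which is exactly full-file dedup.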
btrfs doesn't have a built-in encryption.
> ZFS Encryption Y
I cannot find the discussion right now but I remember reading that they were considering a warning when enabling encryption because it was not really stable and people were running into crashes.
https://github.com/openzfs/zfs/issues?q=is%3Aissue+label%3A%...
I see it more as an administrative problem than an issue with ZFS encryption.
I bought a new SSD and HDD for my desktop this year and looked into running bcachefs, because it offers caching as well as native encryption and CoW. I too determined that it is not production ready yet for my use case; my file system is the last thing I want to be a beta tester of. I investigated using bcache again, but opted for LVM caching, as it offers better tooling and saves one layer of block devices (with LUKS and btrfs on top). Performance is great, and partition manipulations also worked flawlessly.
Hopefully bcachefs gains more traction and will be ready for production use, as it combines several useful features. My current setup still feels like making compromises.
why is this a bad thing?
Never again.
I eagerly await bcachefs reaching maturity!
I have a USB stick with btrfs + LUKS on Arch Linux and it never had a problem like this
Tried again with btrfs and hard freezes again.
Btrfs is also far more reliable than ZFS in my view, because it has far far more real world testing, and is also much more actively developed.
Magical perfect elegant code isn't what makes a good filesystem: real world testing, iteration, and bugfixing is. BTRFS has more of that right now than anything else ever has.
On the other hand, while I haven't used it for /, dipping my toes in bcachefs with recoverable data has been a pleasant experience. Compression, encryption, checksumming, deduplication, easy filesystem resizing, SSD acceleration, ease of adding devices… it's good to have it all in one place.
That's not really true: it's deployed across a wide variety of workloads. Not databases, obviously, but reliability concerns have nothing to do with that.
My point isn't "they use it, it must be good": that's silly. My point is that they employ multiple full time engineers dedicated to finding and fixing the bugs in upstream Linux, and because of that, BTRFS is more well tested in practice than anything else out there today.
It doesn't matter how well thought out or "elegant" bcachefs or ZFS are: they don't have a team of full time engineers with access to thousands upon thousands of machines running the filesystem actively fixing bugs. That's what actually matters.
> Compression, encryption, checksumming, deduplication, easy filesystem resizing, SSD acceleration, ease of adding devices... it's good to have it all in one place.
BTRFS does all of that today.
ZFS has corruption bugs, this one was far worse than anything I've seen in btrfs recently: https://lists.freebsd.org/archives/freebsd-stable/2023-Novem...
I have both bcachefs and ext4 filesystems on the same machine, for different uses.