- Use sane defaults for pool creation. ashift=12, lz4 compression, xattr=sa, acltype=posixacl, and atime=off. Don't even ask me.
- Make encryption just on or off instead of offering five or six options
- Generate the encryption key for me, set up the systemd service to decrypt the pool at start up, and prompt me to back up the key somewhere
- `zfs list` should show if a dataset is mounted or not, if it is encrypted or not, and if the encryption key is loaded or not
- No recursive datasets and use {pool}:{dataset} instead of {pool}/{dataset} to maintain a clear distinction between pools and datasets.
- Don't make me name pools or snapshots. Assign pools the name {hostname}-[A-Z]. Name snapshots {pool name}_{datetime created} and give them numerical shortcuts so I never have to type that all out
- Don't make me type disk IDs when creating pools. Store metadata on the disk so ZFS doesn't get confused if I set up a pool with `/dev/sda` and `/dev/sdb` references and then shuffle around the drives
- Always use `pv` to show progress
- Automatically set up weekly scrubs
- Automatically set up hourly/daily/weekly/monthly snapshots and snapshot pruning
- If I send to a disk without a pool, ask for confirmation and then create a new single disk pool for me with the same settings as on the sending pool
- collapse `zpool` and `zfs` into a single command
- Automatically use `--raw` when sending encrypted datasets, default to `--replicate` when sending, and use `-I` whenever possible when sending
- Provide an obvious way to mount and navigate a snapshot dataset instead of hiding the snapshot filesystem in a hidden directory
Naming pools after hostnames: I have pools on a SAN which can be imported by more than one host.
Weekly scrubs, periodic snapshots, periodic pruning: This is really the job of the OS' scheduler (an equally opinionated view, I admit)
collapsing zpool and zfs commands - sure but why? so you can have zfs -pool XXXX and zfs -volume XXXX?
No recursive datasets? I have use cases where it's very useful.
`zfs list` should show if a dataset is mounted or not, if it is encrypted or not, and if the encryption key is loaded or not: Fully agree!
Don't make me type disk IDs when creating pools: You can address them in 3-4 different ways (by id, by WWN, by label, by sdX etc), and you have to specify in _some_ way which disks you want to go there, so not sure what's the point here.
Store metadata on the disk so ZFS doesn't get confused if I set up a pool with `/dev/sda` and `/dev/sdb` references and then shuffle around the drives: Already happening. Swap a few drives around and import the pool, it will find them.
Some of your suggestions are genuinely OK, at least as defaults, but some indicate you aren't really considering much outside your own usage pattern and needs. ZFS caters to a lot more people than you.
I think zpool is unnecessary as an additional command. For example, `zfs scrub`, `zfs destroy [pool | dataset]`, `zfs add`, `zfs remove` would all have clear meanings. There may be a couple commands that would need explicit disambiguation with a flag like `zfs create`.
And under the OP's proposal, those people would continue to use ZFS entirely unaffected. The OP wasn't proposing changing the behaviour of ZFS, but rather "wrapping" this defined set into a well-defined recipe which could be used by people who aren't so opinionated.
This "dumbed down wrapper" wouldn't even need to be called ZFS, to avoid confusion. Personally I'd like to propose the name ZzzFS: which is ZFS made so simple you can do it in your sleep...
To be fair, that's not ZFS's problem, that is your problem for not keeping up with the times. PEBKAC.
For quite some time now, Linux has had fully-qualified references, e.g.: `/dev/disk/by-id/ata-$manufacturer-$serial-$whatever`
That is what you should be using when building your pools.
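In practice that looks something like this (pool name and device IDs here are made up; list your own with `ls -l /dev/disk/by-id/`):

```shell
# Build a mirrored pool from stable by-id paths instead of sdX names.
# The by-id symlinks survive reboots and drive reordering.
zpool create tank mirror \
  /dev/disk/by-id/ata-VendorX_ModelY_SERIAL001 \
  /dev/disk/by-id/ata-VendorX_ModelY_SERIAL002
```

`zpool status` then shows the stable identifiers too, which makes it much easier to work out which physical disk has failed.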
> Don't make me name [...] snapshots.
You might like this little tool I wrote: https://github.com/rollcat/zfs-autosnap
You put "zfs-autosnap snap" in cron hourly (or however often you want a snapshot), and "zfs-autosnap gc" in cron daily, and it takes care of maintaining a rolling history of snapshots, per the retention policy.
It's not hard writing simple ZFS command wrappers, feel free to take my code and make your own tools.
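Assuming the two subcommands described above, the cron side is just a couple of entries (file path, install path, and schedule are only an example):

```shell
# /etc/cron.d/zfs-autosnap (hypothetical file; adjust paths to your install)
0 * * * *   root  /usr/local/bin/zfs-autosnap snap   # hourly snapshot
30 3 * * *  root  /usr/local/bin/zfs-autosnap gc     # daily pruning per retention policy
```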
I ended up finishing neither, and should pick them back up!
(I snapshot in big chunks with xargs to try to minimise temporal smear - snapshots created in the same `zfs snapshot` command are atomic)
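A minimal sketch of that xargs pattern, assuming a pool named `tank`; every snapshot created by the single `zfs snapshot` invocation is taken atomically, which is the whole point of batching:

```shell
# Snapshot all filesystems under tank in one atomic zfs call.
ts="$(date +%Y-%m-%d-%H:%M)"
zfs list -H -o name -t filesystem -r tank \
  | sed "s/\$/@${ts}/" \
  | xargs zfs snapshot
```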
At $DAYJOB I wrote a bunch of scripts to mechanize building ZFS arrays for whatever expected deployment I'd imagined on that day. Among the tasks was to make LUKS-encrypted volumes on which to put the zvols, standardize the naming schemes, sane defaults like ashift=12, lz4 compression, etc. (This was well before encryption was part of ZFS; I haven't updated the scripts to support encryption in ZFS since it's not really been a problem this way.)
I don't remember many of these flags now, but have a script as reference for documentation, and others on the team don't need to know much about ZFS besides run make-zfs-big-mirror or make-big-zfs-undundant-raid0 and magic happens.
Eventually maybe even that stuff will be automated away by our provisioning, if we ever are in a position to provision systems more than 20 times per year.
Not sure why, and I should probably make the test reproducible.
The ones I find most personally objectionable:
> - Don't make me name pools or snapshots. Assign pools the name {hostname}-[A-Z]. Name snapshots {pool name}_{datetime created} and give them numerical shortcuts so I never have to type that all out
Not naming pools is just bonkers. You don't create pools often enough to not simply name them.
Re: not naming snapshots, you could use `httm` and `zfs allow` for that[0]:
$ httm -S .
httm took a snapshot named: rpool/ROOT/ubuntu_tiebek@snap_2022-12-14-12:31:41_httmSnapFileMount
> - collapse `zpool` and `zfs` into a single command

`zfs` and `zpool` are just immaculate Unix commands, each of which has half a dozen subcommands. One of the smartest decisions the ZFS designers made was not giving you a more complicated single administrative command.
> - Provide an obvious way to mount and navigate a snapshot dataset instead of hiding the snapshot filesystem in a hidden directory
Again -- you can do this very easily via `zfs mount`, but you'll have to trust me that a stable virtual interface also makes it very easy to search for all file versions, something which is much more difficult to achieve with btrfs, et al. See again `httm` [1].
[0]: https://kimono-koans.github.io/opinionated-guide/#dynamic-sn... [1]: https://github.com/kimono-koans/httm
TrueNAS
I like how ZFS is put together. I've been running it for about 13 years. I started with Nexenta, a Solaris fork with Debian userland. I've ported my pool twice, had a bunch of HDD failures, and haven't lost a single byte.
I agree with you on most of the encryption stuff. That is very recent and not fully integrated and the user experience isn't fully baked. I don't agree on unifying zpool and zfs; for a good long time, I served zvols from my zpool, and dividing up storage management and its redundancy configuration from file system management makes sense to me. Similarly, recursive datasets make sense; you want inheritance or something very like it when managing more than a handful of filesystems. I don't agree on pool names (why anyone would want ordinal pool naming and just replicate the problem you just stated re sda, sdb etc. is a bit mysterious), and I don't agree on snapshots (to me this is like preferring commit IDs in git to branch and tag names - manually created snapshots outside periodic pruning should be named).
ZoL on Ubuntu does periodic scrubs by default now. Sometimes I have to stop them because they noticeably impact I/O too much. Periodic snapshots is one of the first cronjobs I created on Nexenta, and while there's plenty of tooling, it also needs configuration - if you are not aware of it, it's an easy way to retain references to huge volumes of data, depending on use case. Not all of my ZFS filesystems are periodically snapshotted the same way.
Likewise, I appreciate being able to name snapshots, but it's annoying to have to manually name the snapshot I create in order to zfs send. The solution there is probably to not make me take a manual snapshot in the first place. `zfs send` should automatically make the snapshot for me. But in general, I don't see why zfs can't default to a generic name and let me override it with a `--name` flag.
Giving it more thought, I think I would keep pool naming. What I don't like is the possibility of having pool name collisions which isn't something you have to think about with, say, ext4 filesystems. But the upshot, as you point out, is with zfs you aren't stuck using sda, sdb, etc.
For snapshots and replication take a look at sanoid (https://github.com/jimsalterjrs/sanoid).
— please provide support for multiple key slots as in LUKS
— please build in the functionality of sanoid and syncoid, so that snapshots and replication don’t need a third party tool
— please build a usable deduplication, so that we don’t have to use external tools such as Restic or Borg
https://www.bsdcan.org/events/bsdcan_2023/sessions/session/1...
You can set "encryption=on", and it will select the default strongest option, currently AES-256-GCM
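For example (dataset name hypothetical; a keyformat must still be chosen at creation time):

```shell
# encryption=on picks the current default cipher (AES-256-GCM today),
# so you don't have to choose between the five or six variants yourself.
zfs create -o encryption=on -o keyformat=passphrase tank/secure
```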
> - Generate the encryption key for me, set up the systemd service to decrypt the pool at start up, and prompt me to back up the key somewhere
Technically it does generate encryption keys internally, which is why the ones you provide can be rotated out. If you use a keyfile then automount with key load is easy (`zfs mount -al`). There is already an auto-mount systemd service created automatically for Debian, however they did not add the -l flag for auto-loading keys because they got stuck in a debate about supporting passphrase prompts at boot. For now you can simply edit it to add the -l flag and it works fine for datasets with keyfiles.
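A sketch of that edit as a systemd drop-in, so the packaged unit file stays untouched across upgrades (unit name and `zfs` path may differ by distro; verify with `systemctl cat zfs-mount.service`):

```shell
# Override the packaged mount unit to also load keys (-l) before mounting.
mkdir -p /etc/systemd/system/zfs-mount.service.d
cat > /etc/systemd/system/zfs-mount.service.d/load-keys.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/sbin/zfs mount -al
EOF
systemctl daemon-reload
```

With keyfile-encrypted datasets this brings everything up at boot with no interaction; passphrase-encrypted datasets would still block waiting for input, which is the debate mentioned above.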
> - Don't make me type disk IDs when creating pools. Store metadata on the disk so ZFS doesn't get confused if I set up a pool with `/dev/sda` and `/dev/sdb` references and then shuffle around the drives
This is no longer the case for ZoL. I know because there is an issue with Linode where storage device identifier assignments consistently get jumbled up with Debian on every boot. ZFS finds the devices all the same, even if a device has a different identifier every boot. I believe this is because it stores its own UUID info on the devices. So you can create pools by referring to devices however you like, because they are only a temporary reference, i.e. use /dev/sda etc. (and I have, and it's fine). I think there is a lot of outdated advice about this floating around still.
> - Automatically set up weekly scrubs
This might be ZoL specific, but, the Debian package does exactly this, sets up systemd weekly scrub.
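Recent upstream OpenZFS also ships per-pool scrub timer units (older Debian packages used an `/etc/cron.d` entry instead); if your package includes them, enabling a weekly scrub for a hypothetical pool `tank` is one command:

```shell
# Enable and start the weekly scrub timer for pool "tank"
# (requires the zfs-scrub-weekly@.timer unit shipped with newer OpenZFS).
systemctl enable --now zfs-scrub-weekly@tank.timer
```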
> - No recursive datasets
Why? This is too useful: inherited encryption roots, datasets with different properties so you can have databases and other filesystems all under the same root dataset, which can then be recursively replicated in one command. If you have no need for recursive datasets, just don't use them, but they have many valid purposes.
how many disks per vdev? how much memory? etc
a lot of the things you've outlined are not universal at all, just situational
In fairness, learning about zfs is like learning about mdadm, lvm and a filesystem all at once… so it’s kinda justifiable in my opinion
One pattern I've found useful when writing wrapper shell scripts: Output the actual command(s) that actually get run, to stderr in yellow, before running them. This also serves as a sanity check.
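A minimal sketch of that pattern as a POSIX shell function (the function name `run` is just my choice here):

```shell
#!/bin/sh
# run: print the command to stderr in yellow before executing it,
# so wrapper scripts show exactly what they are about to do.
run() {
  # \033[33m = yellow, \033[0m = reset
  printf '\033[33m+ %s\033[0m\n' "$*" >&2
  "$@"
}

# Example: the wrapped command's own output still goes to stdout untouched.
run echo "creating pool"
```

Because the trace goes to stderr, you can still pipe or capture the real command's stdout cleanly.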
This should have an option to integrate use of a TPM for super-encrypting the ZFS encryption key(s).
Nooo, that should not be done. They are very different tools, for very different things.
"Sane defaults", are you going to be arbiter of sanity? With the #3 recommendation, "set up systemd for me", I'd rather not have you at that position.
Most of the bullets you wrote induce a "...why?" thought in somebody who has ZFS experience. Why would you unify zpool and zfs? Why would you want automatic weekly scrubs on by default? Do you realize what ZFS scrubbing is and when the right time to perform it is?
I'm a bit agitated by your writing I must confess. You want ZFS to exactly reflect your basic use case so you don't have to move your little finger (automatic naming, automagical configuration). It's not meant to be a hands-off filesystem, you are expected to understand encryption in ZFS in order to use it.
But the most annoying thing is that you did face a steep learning curve and want to avoid it. Why don't you write your own ZFS provisioning tool? Why are you still using /dev/sda and not disk-by-UUID or something more 2023? Etc.
??
- get to know the difference between zpool-attach(8) and zpool-replace(8).
- this one will tell you where your space is used:
# zfs list -t all -o space
NAME AVAIL USED USEDSNAP USEDDS USEDREFRESERV USEDCHILD
(...)
- ZFS Boot Environments is the best feature to protect your OS before major changes/upgrades - this may be useful for a start: https://is.gd/BECTL
- this command will tell you all history about ZFS pool config and its changes:
# zpool history poolname
History for 'poolname':
2023-06-20.14:03:08 zpool create poolname ada0p1
2023-06-20.14:03:08 zpool set autotrim=on poolname
2023-06-20.14:03:08 zfs set atime=off poolname
2023-06-20.14:03:08 zfs set compression=zstd poolname
2023-06-20.14:03:08 zfs set recordsize=1m poolname
(...)
- the guide misses one important piece of info:
--- you can create a 3-way mirror - requires 3 disks and 2 may fail - still no data lost
--- you can create a 4-way mirror - requires 4 disks and 3 may fail - still no data lost
--- you can create an N-way mirror - requires N disks and N-1 may fail - still no data lost
(useful when data is most important and you do not have that many slots/disks)

[0] https://docs.freebsd.org/en/books/handbook/zfs/
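Creating such a mirror is a single command (pool name and FreeBSD-style device names here are hypothetical):

```shell
# 3-way mirror: 3 disks in one vdev, any 2 may fail without data loss.
zpool create tank mirror da0 da1 da2
```

Attaching another disk to an existing mirror vdev with `zpool attach` widens it to an (N+1)-way mirror later.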
[1] https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux...
I had an old HP Microserver with 1GB of ECC RAM lying around so I installed FreeBSD on it. I had 5 old 500GB hard drives lying around too so I set them up in a 5x mirror with help from the FreeBSD Handbook. First time using FreeBSD and it was a breeze.
(I realize now after writing it that maybe snapchat should have occurred to me first, but I have never used it)
I've been using ZFS in combination with rsync for backups for a long time, so I was fairly comfortable with it... and it all worked out, but it was a way bigger time sink than I expected - because I wanted to do it right - and there is a lot of misleading advice on the web, particularly when it comes to running databases and replication.
For databases (you really should at minimum do basic tuning like block size alignment), by far the best resource I found for mariadb/innoDB is from the lets encrypt people [0]. They give reasons for everything and cite multiple sources, which is gold. If you search around the web elsewhere you will find endless contradicting advice, anecdotes and myths that are accompanied with incomplete and baseless theories. Ultimately you should also test this stuff and understand everything you tune (it's ok to decide to not tune something).
For replication, I can only recommend the man pages... yeah, really! ZFS gives you solid replication tools, but they are too agnostic; they are like git plumbing. They don't assume you're going to be doing it over SSH (even though that's almost always how it's being used), so you have to plug it together yourself, and this feels scary at first, especially because you probably want it to be automated, which means considering edge cases... which is why everyone runs to something like syncoid.

But there's something horrible I discovered with replication scripts like syncoid: they don't use ZFS's send --replicate mode! They try to reimplement it in Perl, for "greater flexibility", but incompletely. This is maddening when you are trying to test this stuff for the first time and find that all of the encryption roots break when you do a fresh restore, and not all dataset properties are automatically synced. ZFS takes care of all of this if you simply use the built-in recursive "replicate" option. It's not that hard to script manually once you commit to it; just keep it simple, don't add a bunch of unnecessary crap into the pipeline like syncoid does (it actually slows things down if you test), just use pv to monitor progress and it will fly.
I might publish my replication scripts at some point because I feel like there are no good functional reference scripts for this stuff that deal with the basics without going nuts and reinventing replication badly like so many others.
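A bare-bones sketch of the kind of pipeline described above - pool names, host, and snapshot labels are all made up; `-w` (raw) keeps encrypted data encrypted on the wire, `-R` replicates the whole dataset tree with its properties, and `-I` includes all intermediate snapshots since the last replicated one:

```shell
# Take a recursive snapshot, then stream everything since the previous
# replica snapshot to the backup host, with pv showing progress.
zfs snapshot -r tank@replica-2023-06-20
zfs send -w -R -I tank@replica-2023-06-13 tank@replica-2023-06-20 \
  | pv \
  | ssh backuphost zfs receive -F backup/tank
```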
IME with a decently busy (120K QPS) MySQL DB, you do not need to touch either of these. If you think you do, monitor the time to fill the redo log, and the dirty page percent in the buffer pool. There are probably other parameters you should tune instead.
[0] https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.ht...
One unexpected thing to check (and do check, because your mileage will vary) - the suggestion is usually to align record sizes, which in practice tends to mean reducing the record size on the ZFS filesystem holding the data. I don't doubt that this is at some level more efficient, but I can empirically tell you that it kills compression ratios. Now the funny knock-on effect is that it can - and again, I say can because it will vary by your workload - but it can actually result in worse throughput if you're bottlenecked on disk bandwidth, because compression lets you read/write data faster than the disk is physically capable of, so killing that compression can do bad things to your read/write bandwidth.
I enabled lz4 compression and set recordsize for database datasets to 16k to match innoDB... turns out even at 16k my databases are extremely compressible 3-4x AFAIR (I didn't write the DB schema for the really big DBs, they are not great, and I suspect that there is a lot of redundant data even within 16k of contiguous data)... maybe I could get even more throughput with larger record sizes, but seems unlikely.
As you say, mileage will vary, it's subjective, but then I wasn't using compression before ZFS, so I don't have a comparison. I have only done basic performance testing, overall it's an improvement over ext4, but I've not been trying to fine tune it, I'm just happy to not have made it worse so far while gaining ZFS.
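The tuning described above boils down to something like this (dataset name hypothetical; 16k matches the InnoDB page size):

```shell
# Align ZFS records with InnoDB's 16k pages and enable cheap compression.
zfs create -o recordsize=16k -o compression=lz4 tank/mysql
```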
My only surprise was the volblocksize default, which is pretty bad for most RAIDZ configurations: you need to increase it to avoid losing 50% of raw disk space...
Articles touching this topic :
https://openzfs.github.io/openzfs-docs/Basic%20Concepts/RAID...
https://www.delphix.com/blog/zfs-raidz-stripe-width-or-how-i...
And you end up on one of the ZFS "spreadsheets" out there:
ZFS overhead calc.xlsx https://docs.google.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT6...
RAID-Z parity cost https://docs.google.com/spreadsheets/d/1pdu_X2tR4ztF6_HLtJ-D...
The documentation in question was a PowerPoint presentation with difficult to read styling, somewhat evangelical language, lots of assumptions about knowledge and it was not regularly updated. It was vague on how much RAM was required, mainly just focused on having as much as possible. Needless to say I ignored all the red flags about the technology, the hype and my own knowledge and lost a load of data. Lots of lessons learnt.
The filesystem has gotten a lot more stable, and imo the documentation clearer.
That said, it's "more powerful and more advanced" than traditional journaling filesystems like ext3, and thus comes with more ways to shoot yourself in the foot.
- All redundancy in ZFS is built in the vdev layer. Zpools are created with one or more vdevs, and no matter what, if you lose any single vdev in a zpool, the zpool is permanently destroyed.
- Historically RAIDZs (parity RAIDs) cannot be expanded by adding disks. The only way to grow a RAIDZ is to replace each disk in the array one at a time with a larger disk (and hope no disks fail during the rebuild). So in my very amateur opinion, I would only consider doing a RAIDZ if it is something like a RAIDZ2 or 3 with a large number of disks. For n<=6 and if the budget can stand it, I would do several mirrored vdevs. (Again as an amateur I am less familiar with RW performance metrics of various RAIDs so do more research for prod).
If and only if you a. Have full, on-site backups b. Are fairly sure of your abilities and monitoring then I can suggest RAIDZ1. I have a pool of 3x3 drives, which ships its snapshots a few U down in my rack to the backup target that wakes up daily, and has a pool of 3x4 drives, also in RAIDZ1.
In the event that I suffer a drive failure in my NAS, my plan of action would be to immediately start up the backup, ingest snapshots, and then replace the drive. That should minimize the chance of a 2nd drive failure during resilvering destroying my data.
Truly important data, of course, has off-site as well.
So there aren't any errors in files. There aren't any errors in devices. There aren't any errors detected in scrub(?). And yet at runtime I get a dozen new "errors" showing up in zpool status per day. How?
- if you want to copy files, for example, and connect your drive to another system and mount your zpool there, it sets some pool-membership value on the filesystem, and when you put it back in your system it won't boot unless you set it back - which involved a chroot
- the default settings I had made a snapshot every time I apt installed something, and because that snapshot included my home drive, when I deleted big files thereafter I didn't get any free space back until I figured out what was going on and arbitrarily deleted some old snapshots
- you can't just make a swap file and use it
Isn't this what `zpool export` is for?
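Right - the intended workflow is roughly (pool name hypothetical):

```shell
# On the original system, before pulling the drive:
zpool export tank
# On the other system, and again when moving back:
zpool import tank
```

Export cleanly releases the pool so the next host can import it without any "pool was in use by another system" complaints.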
cat /etc/apt/apt.conf.d/90_zsys_system_autosnapshot
// Takes a snapshot of the system before package changes.
DPkg::Pre-Invoke {"[ -x /usr/libexec/zsys-system-autosnapshot ] && /usr/libexec/zsys-system-autosnapshot snapshot || true";};
// Update our bootloader to list the new snapshot after the update is done to not block the critical path
DPkg::Post-Invoke {"[ -x /usr/libexec/zsys-system-autosnapshot ] && /usr/libexec/zsys-system-autosnapshot update-menu || true";};
but how would I get this to not snapshot , say /home/Downloads ..
make that its own zpool?

What kind of schedule was it? I feel like the low-impact alternative to no snapshots at all is daily snapshots for half a week to a week, and maybe some n-hourly snapshots that last a day or two, which I would not expect to use up very much space.
It's #3 where I need to do some more research/work. I need to spend some time sending snapshots/diffs to cloud blob storage and make sure I can restore. Yes, I know there is rsync.net.
Any experiences to share?
Clarification: Remote end also uses ZFS, so I can use cheap replication with encryption
Borg splits your files up into chunks, encrypts them and dedupes them client-side and then syncs them with the server. Because of the deduping, versioning is cheap and you can configure how many daily, weekly, monthly, &c. copies to keep. For example you could keep 7 days' worth of copies, 6 monthly copies and 10 yearly copies.
Rsync.net have special pricing for customers using Borg/Restic:
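That retention policy maps directly onto `borg prune` flags (repository path hypothetical):

```shell
# Keep 7 daily, 6 monthly, and 10 yearly archives; everything else is pruned.
borg prune --keep-daily 7 --keep-monthly 6 --keep-yearly 10 /path/to/repo
```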
I'm not working with much data though, so even if I wanted to I couldn't get a ZFS send/receive account with rsync.net. I like the way rsync.net gives you separate credentials for managing the snapshots. This way even if my NAS gets compromised I will still have all the periodic snapshots.
For me privacy is my main concern and Restic's security model is good for me. The backup testing features are good too, and rsync.net doesn't charge for traffic, so these two work well together. I don't use the snapshots though because rsync.net already supports this via ZFS.
I do one about every month or so. I should probably add a crontab for that.
Haha, The only part of maintenance that I need to look up every time I do it is replacing a faulty hard drive.
Even this guide skips that.
(Hey looks like it's a sore spot!)
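For the record, a sketch of that procedure (pool and device names here are hypothetical):

```shell
zpool status tank              # identify the FAULTED/UNAVAIL device
zpool offline tank ata-OLD_DISK_SERIAL
# physically swap the disk, then:
zpool replace tank ata-OLD_DISK_SERIAL /dev/disk/by-id/ata-NEW_DISK_SERIAL
zpool status tank              # watch the resilver until it completes
```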
I very much regret the fragmentation of FS design, it has many mothers. "there can only be one" was never going to work, but we seem to have perhaps 4-5 more than we really need. ZFS manages to wrap up a number of behaviours cohesively with good version-dependent signalling so it should always be possible to know you're risking a non-reversible change to your flags. And, it keeps improving.
But, to counter "it keeps improving": so do all the other current, maintained, actively developed FSes, and if somebody tells me they prefer to use Hammer, or one of the Linux FS models with a discrete volume-management and encryption layer, I don't think that's necessarily wrong.
Mainly I regret Apple walking away. That was about Oracle behaviour. It wasn't helpful. A lot of Apple's FS design ideas persist. I never got resource/data forks; they only ever appeared on my radar as .files in the UNIX AUFS backend model of them. Obviously inside Apple's code, it was dealt with somehow. It felt like the wrong lessons about metadata had been learned. Maybe an ex-VMS person went to Apple? Also Apple has a rather "yea maybe or no, dunno" view about case-independent or case-dependent naming. Time Machine is good. Feels like it should fit ZFS well. Oh well.
There's quite a few "quality of life" differences like boot environments (boot into a pre upgraded OS state, even years old), built in SMB server with NFS v4 style ACLs, dtrace, built in snapshot scheduling and management, and Napp-It is an available web UI for management a la FreeNAS/TrueNAS.
It has a few differences, service management is quite different from other things, but overall very underrated as an OS I think.
Frankly, if even Debian can use it, it's a non-issue.
That depends on what you want. If you want a license that will play well with closed source software, then yeah, it's a downside. But the GPL family comes from the perspective of a developer who wants to retain their rights while respecting others' desire for the same. If you care about your rights, then this is an upside.
I agree this is unlikely, but so is someone being born with as much litigiousness as Larry Ellison.
As someone ignorant to file system development, I would almost expect something more likely to be BTRFS getting sued for copying a feature of ZFS or something like that.
If anyone (Oracle or Linus Torvalds) launches a ZFS-related lawsuit, it'll be as an author of GPLv2 kernel code. For the time being the solution has been to ship ZFS separate from the kernel, as any module with a non-GPL licence typically does.
The biggest hurdle to ZFS isn't legal, but technical. The various teams working on ZFS have so far been able to keep up with kernel churn and symbols (eg FPU ones) being made GPL-only.
That said, the kernel devs have made it clear that they don't care if you're open source or proprietary, they will make changes and mark new symbols GPL-only to fuck with you regardless.