It fills me with joy that someone has been coding a fs for 7 years due to perl installs taking too much space. Necessity is the mother of invention.
How opposed would you be to reworking this so it could be supported in the mainline kernel too?
I imagine these days you have more than 300GB hard disk space, making this all moot?
Thanks to mhx I can move them now back to my fast disk. This is also perfect for testers.
Edit: You could even have several read-only shadow copies of the repo for parallel working directory usage, if you hard link the .git directory except for the HEAD ref in each.
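A rough sketch of that idea (a hypothetical helper, assuming everything lives on one filesystem so hard links work): every file is hard-linked into the shadow copy except .git/HEAD, which each copy gets privately:

```python
import os
import shutil

def make_shadow(repo: str, shadow: str) -> None:
    """Create a shadow copy of `repo` where every file is a hard link,
    except .git/HEAD, which gets a private copy per shadow."""
    head_rel = os.path.join(".git", "HEAD")
    for root, dirs, files in os.walk(repo):
        rel = os.path.relpath(root, repo)
        dst_dir = shadow if rel == "." else os.path.join(shadow, rel)
        os.makedirs(dst_dir, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(dst_dir, name)
            if os.path.relpath(src, repo) == head_rel:
                shutil.copy2(src, dst)   # private HEAD per shadow copy
            else:
                os.link(src, dst)        # shared, effectively read-only content
```

The objects, refs, and working tree then cost almost no extra space, while each shadow can have its own HEAD checked out.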
@OP: Can you please explain why you keep 50 gigs of perl around? :-)
I use compressed read-only file systems all the time to save space on my travel laptop. I have one squashfs for firefox, one for the TeX base install, one for LLVM, one for qemu, one for my cross compiler collection. I suspect the gains over squashfs will be far less pronounced than for the pathological "400 perl versions" case.
Sure. I've been the maintainer of a perl portability module (Devel::PPPort) for a long time and every release was tested against basically every possible version (and several build flag permutations) of perl that was potentially out in the wild.
(Not meant sarcastically :-)
If you upload a module to CPAN, you automatically get it tested against a huge matrix of configurations:
http://matrix.cpantesters.org/?dist=Log-Any-Adapter-FileHand...
Very true, and it's definitely a great service!
However, the set of versions/configurations is still limited, and it can take an awful lot of time for the matrix to fill up. I fixed a bug specific to perl-5.10.0 about a week ago, and so far the module hasn't been picked up by that version again.
So while this is definitely good as a service for the general public, it doesn't get you very far if you're trying to build a thing that's supposed to ensure compatibility for other Perl modules across 20 years of Perl history. :)
My use case is the MAME console archives, which are now full of copies of games from different localisations with 99% identical content. 7Z will compress them together and deduplicate, but breaks once the archive exceeds a few gigs.
These archives are already compressed (CHD format, which is 7Z + FLAC for ISOs), but it's deduplication that needs to happen on top of these already compressed files that I'm struggling with.
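For what it's worth, the reason naive block-level dedup stops working after compression can be shown with a toy sketch (using zlib purely as a stand-in for CHD/7Z): two files that are 99% identical raw share almost no fixed-size chunks once each is compressed separately, because a single differing byte changes the rest of the compressed stream:

```python
import hashlib
import zlib

def shared_chunks(a: bytes, b: bytes, size: int = 512) -> float:
    """Fraction of fixed-size chunks of `a` that also occur in `b`."""
    b_hashes = {hashlib.sha256(b[i:i + size]).digest()
                for i in range(0, len(b), size)}
    a_hashes = [hashlib.sha256(a[i:i + size]).digest()
                for i in range(0, len(a), size)]
    return sum(h in b_hashes for h in a_hashes) / len(a_hashes)

# Two "localisations": compressible data, identical except one early byte.
base = b"".join(i.to_bytes(4, "big") * 16 for i in range(1024))  # 64 KiB
variant = b"X" + base[1:]

raw_shared = shared_chunks(base, variant)                          # almost 1.0
comp_shared = shared_chunks(zlib.compress(base), zlib.compress(variant))
```

Which is why dedup has to happen on the decompressed content (or with compression applied on top of the dedup, as DwarFS does), not on the already-compressed files.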
Sorry for the off-topic ask!
[1]: https://en.wikipedia.org/wiki/Windows_Imaging_Format [2]: https://wimlib.net/
If you're using a Windows client, there is a way of enabling this, but it's not exactly supported, for a variety of reasons.
[1]https://docs.microsoft.com/en-us/windows-server/storage/data...
I'll add more benchmarks, this is still WIP and so far I've mainly tried to satisfy my own needs. My intention with DwarFS wasn't to write "a better SquashFS", but to make it better in certain scenarios (huge, highly redundant data) than SquashFS. SquashFS still has the big advantage of being part of the kernel, which makes it a lot more attractive for things like root file systems.
lrzip -UL9 filetarball.tar
It would be a good data point for everyone.
1) Videos from my DSLR
2) RAW images from my DSLR
3) Various movies / TV series I downloaded
4) Game files (most of which are textures and 3D models)
None of that stuff is really compressible.
Having said that I did try to implement a deduplication layer for nbdkit, but what I found was that it wasn't very effective. It turns out that duplicate data in typical VM filesystems isn't common, and the other parts of the filesystem (block free lists etc) were not sufficiently similar to deduplicate given my somewhat naive approach.
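For reference, the naive approach amounts to something like this hypothetical sketch: hash fixed-size blocks of the image and count how many are repeats of an earlier block, which is an upper bound on what such a layer could reclaim:

```python
import collections
import hashlib

def duplicate_block_stats(path: str, block_size: int = 4096):
    """Count how many fixed-size blocks in an image duplicate an earlier
    block -- roughly what a naive dedup layer could reclaim."""
    counts = collections.Counter()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            counts[hashlib.sha256(block).digest()] += 1
    total = sum(counts.values())
    dupes = total - len(counts)
    return total, dupes
```

Running this against a typical VM image tends to confirm the point: outside of zero pages, exact block-level duplicates are rare.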
It is not what you describe but it can help.
Btrfs can dedupe at the block level.
[0]: https://btrfs.wiki.kernel.org/index.php/Gotchas#Parity_RAID
My own experience (as of ~1 year ago) indicates it's brittle to power failures and hard reboots, and will corrupt in annoying ways.
Does the performance benchmark show DwarFS versus single-threaded gzip compressed SquashFS?
> Parallel mksquashfs: Using 12 processors
You could theoretically try to build this with dwarfs, by using overlayfs and then compressing the upper layer again with dwarfs, but that sounds pretty fragile and cumbersome.
Would love some theorycrafting on possible ways to work with DwarFS being a FUSE filesystem.
mkdwarfs crashed with recursive links (1-level, just pointing to itself) and when I removed dirs while running mkdwarfs, which were part of the input path. Which is fair, I assume.
That's odd, it shouldn't crash with links at all, as it doesn't actively follow links. Can you please file a bug if you can reproduce this?
> and when I removed dirs while running mkdwarfs, which were part of the input path
I guess this is fair, but I'll try to take a look anyway. :-)
> On success, mkdwarfs needed 1 hr, and reduced 219 dirs to a size of 970 MB. Not just source files, but also the build and install object files.
My 500 MB image with the 1100+ perls is just installations, from which I've actually removed libperl.a as I've never needed it and it really bloats the image. I've got a separate image with debug information (everything built with -g in case I need to debug the binaries), so the binaries in the main image are essentially all stripped. If I need to debug, I'll just mount the debug image as well, which contains the source files and the stripped debug data.
> 1 hr is a lot, but just think how long squashfs would have needed.
It might be worth trying a lower compression level, especially if you find that mkdwarfs is CPU bound and not I/O bound.
I have a lot of -g info, because I use it mainly for debugging XS problems with old versions. The hashes change for each object, so deduplication is mostly only useful for source files. I really need high compression, which is the default.
1 hr is a lot, but just think how long squashfs would have needed. Totally impractical. Thanks mhx
Because squashfs-tools seemed pretty unmaintained in late 2018 (no activity on the official site & git tree for years and only one mailing list post "can you do a release?" which got a very annoyed response) I released my tooling as "squashfs-tools-ng" and it is currently packaged by a handful of distros, including Debian & Ubuntu.[1]
I also thoroughly documented the on-disk format, after reverse engineering it[2] and made a few benchmarks[3].
For my benchmarks I used an image I extracted from the Debian XFCE LiveDVD (~6.5GiB as tar archive, ~2GiB as XZ compressed SquashFS image). By playing around a bit, I also realized that the compressed meta data is "amazingly small", compared to the actual image file data and the resulting images are very close to the tar ball compressed with the same compressor settings.
I can accept a claim of being a little smaller than SquashFS, but the claimed difference makes me very suspicious. From the README, I'm not quite sure: Does the Raspbian image comparison compare XZ compression against SquashFS with Zstd?
I have cloned the git tree and installed dozens of libraries that this folly thingy needs, but I'm currently swamped in CMake errors (haven't touched CMake in 8+ years, so I'm a bit rusty there) and the build fails with some still missing headers. I hope to have more luck later today and produce a comparison on my end using my trusty Debian reference image which I will definitely add to my existing benchmarks.
Also, is there any documentation on how the on-disk format for DwarFS and its packing works which might explain the incredible size difference?
[1] https://github.com/AgentD/squashfs-tools-ng
[2] https://github.com/AgentD/squashfs-tools-ng/blob/master/doc/...
[3] https://github.com/AgentD/squashfs-tools-ng/tree/master/doc
> Does the Raspbian image comparison compare XZ compression against SquashFS with Zstd?
That's correct. It's not an exhaustive matrix of comparisons.
> Also, is there any documentation on how the on-disk format for DwarFS and its packing works which might explain the incredible size difference?
The format as of 0.2.0 is actually quite simple. It's a list of compressed data blocks, followed by a metadata block (and a schema describing the metadata block). The metadata format is implemented by and documented in [1].
There are probably 3 things that contribute to compression level:
1) Block size. DwarFS can use arbitrary block sizes (artificially limited to powers of two), and uses a much larger block size (16M) by default. SquashFS doesn't seem to be able to go higher than 1M.
2) Ordering files by similarity.
3) Segment deduplication. If segments of files overlap with previously seen data, these segments are referenced instead of written again. The minimum size of these segments can be configured and defaults to 2k. For my primary use case, of the 47.6 GB of input data, 28.2 GB are saved by file-level deduplication, and another 12.4 GB by this segment-level deduplication. So before the "real" compression algorithms actually kick in, there are only 7 GB of data left. As these are ordered by similarity, and stored in rather big blocks, some of the 16M blocks can actually be compressed down to less than 100k.
[1] https://github.com/mhx/dwarfs/blob/main/thrift/metadata.thri...
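A toy sketch of the segment-deduplication idea from point 3 (purely illustrative, not DwarFS's actual algorithm; the real minimum segment size defaults to 2k, here it's tiny for the demo): while scanning the input, whenever a window of min_seg bytes matches already-stored data, emit a back-reference instead of the raw bytes.

```python
def segment_dedupe(data: bytes, min_seg: int = 16):
    """Return a list of ("raw", bytes) and ("ref", offset, length) ops."""
    seen = {}              # window hash key -> offset of window in `stored`
    stored = bytearray()   # raw bytes written so far (the "block")
    ops = []
    raw = bytearray()      # pending raw bytes not yet flushed into an op
    i = 0
    while i < len(data):
        win = bytes(data[i:i + min_seg])
        if len(win) == min_seg and win in seen:
            if raw:
                ops.append(("raw", bytes(raw)))
                raw = bytearray()
            start = seen[win]
            # greedily extend the match against already-stored data
            j = min_seg
            while (i + j < len(data) and start + j < len(stored)
                   and data[i + j] == stored[start + j]):
                j += 1
            ops.append(("ref", start, j))
            i += j
        else:
            raw.append(data[i])
            stored.append(data[i])
            if len(stored) >= min_seg:   # index the newest window position
                seen.setdefault(bytes(stored[-min_seg:]), len(stored) - min_seg)
            i += 1
    if raw:
        ops.append(("raw", bytes(raw)))
    return ops

def rebuild(ops) -> bytes:
    """Inverse of segment_dedupe: resolve refs against the stored raw data."""
    out, stored = bytearray(), bytearray()
    for op in ops:
        if op[0] == "raw":
            out += op[1]
            stored += op[1]
        else:
            _, off, length = op
            out += stored[off:off + length]
    return bytes(out)
```

Only after this step does the general-purpose compressor see the (much smaller) remaining raw data, which is why the similarity ordering in point 2 pays off so well.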
have you investigated why this might be the case?
Very briefly. It looks like clang has a different strategy breaking up the code (which is mostly C++ templates) into actual functions vs. inlining it, and the hot code ultimately performs fewer function calls with clang than it does with gcc. But this is nowhere near a proper analysis of what's going on. :)
Just to clarify that last statement (and something to think about): with HDDs you actually want duplicate assets so that you don't cause seeks, which are VERY slow on the 5400rpm HDDs still found on some/a lot of systems.
* can search archived files potentially faster because read access is potentially faster
* fit more data on bootable media