It fills me with joy that someone has been coding a fs for 7 years due to perl installs taking too much space. Necessity is the mother of invention.
How opposed would you be to reworking this so it could be supported in the mainline kernel too?
I imagine these days you have more than 300GB hard disk space, making this all moot?
Thanks to mhx I can move them now back to my fast disk. This is also perfect for testers.
Edit: You could even have several read-only shadow copies of the repo for parallel working directory usage, if you hard link the .git directory except for the HEAD ref in each.
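A rough sketch of that idea (a hypothetical helper, assuming everything lives on one filesystem so hard links work): every file is hard-linked into the shadow copy except .git/HEAD, which each copy gets privately:

```python
import os
import shutil

def make_shadow(repo: str, shadow: str) -> None:
    """Create a shadow copy of `repo` where every file is a hard link,
    except .git/HEAD, which gets a private copy per shadow."""
    head_rel = os.path.join(".git", "HEAD")
    for root, dirs, files in os.walk(repo):
        rel = os.path.relpath(root, repo)
        dst_dir = shadow if rel == "." else os.path.join(shadow, rel)
        os.makedirs(dst_dir, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(dst_dir, name)
            if os.path.relpath(src, repo) == head_rel:
                shutil.copy2(src, dst)   # private HEAD per shadow copy
            else:
                os.link(src, dst)        # shared, effectively read-only content
```

The objects, refs, and working tree then cost almost no extra space, while each shadow can have its own HEAD checked out.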
@OP: Can you please explain why you keep 50 gigs of perl around? :-)
I use compressed read-only file systems all the time to save space on my travel laptop. I have one squashfs for firefox, one for the TeX base install, one for LLVM, one for qemu, one for my cross compiler collection. I suspect the gains over squashfs will be far less pronounced than for the pathological "400 perl versions" case.
Sure. I've been the maintainer of a perl portability module (Devel::PPPort) for a long time and every release was tested against basically every possible version (and several build flag permutations) of perl that was potentially out in the wild.
(Not meant sarcastically :-)
If you upload a module to CPAN, you automatically get it tested against a huge matrix of configurations:
http://matrix.cpantesters.org/?dist=Log-Any-Adapter-FileHand...
Very true, and it's definitely a great service!
However, the set of versions/configurations is still limited, and it can take an awful lot of time for the matrix to fill up. I fixed a bug specific to perl-5.10.0 about a week ago, and so far the module hasn't been picked up by that version again.
So while this is definitely good as a service for the general public, it doesn't get you very far if you're trying to build a thing that's supposed to ensure compatibility for other Perl modules across 20 years of Perl history. :)
My use case is the MAME console archives, which are now full of copies of games from different localisations with 99% identical content. 7Z will compress them together and deduplicate, but breaks once the archive exceeds a few gigs.
These archives are already compressed (CHD format, which is 7Z + FLAC for ISOs), but it's deduplication that needs to happen on top of these already compressed files that I'm struggling with.
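For what it's worth, the reason naive block-level dedup stops working after compression can be shown with a toy sketch (using zlib purely as a stand-in for CHD/7Z): two files that are 99% identical raw share almost no fixed-size chunks once each is compressed separately, because a single differing byte changes the rest of the compressed stream:

```python
import hashlib
import zlib

def shared_chunks(a: bytes, b: bytes, size: int = 512) -> float:
    """Fraction of fixed-size chunks of `a` that also occur in `b`."""
    b_hashes = {hashlib.sha256(b[i:i + size]).digest()
                for i in range(0, len(b), size)}
    a_hashes = [hashlib.sha256(a[i:i + size]).digest()
                for i in range(0, len(a), size)]
    return sum(h in b_hashes for h in a_hashes) / len(a_hashes)

# Two "localisations": compressible data, identical except one early byte.
base = b"".join(i.to_bytes(4, "big") * 16 for i in range(1024))  # 64 KiB
variant = b"X" + base[1:]

raw_shared = shared_chunks(base, variant)                          # almost 1.0
comp_shared = shared_chunks(zlib.compress(base), zlib.compress(variant))
```

Which is why dedup has to happen on the decompressed content (or with compression applied on top of the dedup, as DwarFS does), not on the already-compressed files.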
Sorry for the off-topic ask!
[1]: https://en.wikipedia.org/wiki/Windows_Imaging_Format [2]: https://wimlib.net/
If you're using a Windows client, there is a way of enabling this, but it's not exactly supported, for a variety of reasons.
[1]https://docs.microsoft.com/en-us/windows-server/storage/data...
I'll add more benchmarks, this is still WIP and so far I've mainly tried to satisfy my own needs. My intention with DwarFS wasn't to write "a better SquashFS", but to make it better in certain scenarios (huge, highly redundant data) than SquashFS. SquashFS still has the big advantage of being part of the kernel, which makes it a lot more attractive for things like root file systems.
lrzip -UL9 filetarball.tar
It would be a good data point for everyone.
1) Videos from my DSLR
2) RAW images from my DSLR
3) Various movies / TV series I downloaded
4) Game files (most of which are textures and 3D models)
None of that stuff is really compressible.
Having said that I did try to implement a deduplication layer for nbdkit, but what I found was that it wasn't very effective. It turns out that duplicate data in typical VM filesystems isn't common, and the other parts of the filesystem (block free lists etc) were not sufficiently similar to deduplicate given my somewhat naive approach.
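For reference, the naive approach amounts to something like this hypothetical sketch: hash fixed-size blocks of the image and count how many are repeats of an earlier block, which is an upper bound on what such a layer could reclaim:

```python
import collections
import hashlib

def duplicate_block_stats(path: str, block_size: int = 4096):
    """Count how many fixed-size blocks in an image duplicate an earlier
    block -- roughly what a naive dedup layer could reclaim."""
    counts = collections.Counter()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            counts[hashlib.sha256(block).digest()] += 1
    total = sum(counts.values())
    dupes = total - len(counts)
    return total, dupes
```

Running this against a typical VM image tends to confirm the point: outside of zero pages, exact block-level duplicates are rare.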
It is not what you describe but it can help.
Btrfs can dedupe at the block level.
[0]: https://btrfs.wiki.kernel.org/index.php/Gotchas#Parity_RAID
My own experience (as of ~1 year ago) indicates it's brittle to power failures and hard reboots, and will corrupt in annoying ways.
Does the performance benchmark show DwarFS versus single-threaded gzip compressed SquashFS?
> Parallel mksquashfs: Using 12 processors
You could theoretically try to build this with dwarfs, by using overlayfs and then compressing the upper layer again with dwarfs, but that sounds pretty fragile and cumbersome.
Would love some theorycrafting on possible ways to work with DwarFS being a FUSE filesystem.
mkdwarfs crashed with recursive links (1-level, just pointing to itself) and when I removed dirs while running mkdwarfs, which were part of the input path. Which is fair, I assume.
That's odd, it shouldn't crash with links at all, as it doesn't actively follow links. Can you please file a bug if you can reproduce this?
> and when I removed dirs while running mkdwarfs, which were part of the input path
I guess this is fair, but I'll try to take a look anyway. :-)
> On success, mkdwarfs needed 1 hr, and reduced 219 dirs to a size of 970 MB. Not just source files, but also the build and install object files.
My 500 MB image with the 1100+ perls is just installations, from which I've actually removed libperl.a as I've never needed it and it really bloats the image. I've got a separate image with debug information (everything built with -g in case I need to debug the binaries), so the binaries in the main image are essentially all stripped. If I need to debug, I'll just mount the debug image as well, which contains the source files and the stripped debug data.
> 1 hr is a lot, but just think how long squashfs would have needed.
It might be worth trying a lower compression level, especially if you find that mkdwarfs is CPU bound and not I/O bound.
I have a lot of -g info, because I use it mainly for debugging XS problems with old versions. The hashes change for each object, so deduplication is mostly only useful for source files. I really need high compression, which is the default.
1 hr is a lot, but just think how long squashfs would have needed. Totally impractical. Thanks mhx
Because squashfs-tools seemed pretty unmaintained in late 2018 (no activity on the official site & git tree for years and only one mailing list post "can you do a release?" which got a very annoyed response) I released my tooling as "squashfs-tools-ng" and it is currently packaged by a handful of distros, including Debian & Ubuntu.[1]
I also thoroughly documented the on-disk format, after reverse engineering it[2] and made a few benchmarks[3].
For my benchmarks I used an image I extracted from the Debian XFCE LiveDVD (~6.5GiB as tar archive, ~2GiB as XZ compressed SquashFS image). By playing around a bit, I also realized that the compressed meta data is "amazingly small", compared to the actual image file data and the resulting images are very close to the tar ball compressed with the same compressor settings.
I can accept a claim of being a little smaller than SquashFS, but the claimed difference makes me very suspicious. From the README, I'm not quite sure: Does the Raspbian image comparison compare XZ compression against SquashFS with Zstd?
I have cloned the git tree and installed dozens of libraries that this folly thingy needs, but I'm currently swamped in CMake errors (haven't touched CMake in 8+ years, so I'm a bit rusty there) and the build fails with some still missing headers. I hope to have more luck later today and produce a comparison on my end using my trusty Debian reference image which I will definitely add to my existing benchmarks.
Also, is there any documentation on how the on-disk format for DwarFS and its packing works which might explain the incredible size difference?
[1] https://github.com/AgentD/squashfs-tools-ng
[2] https://github.com/AgentD/squashfs-tools-ng/blob/master/doc/...
[3] https://github.com/AgentD/squashfs-tools-ng/tree/master/doc
> Does the Raspbian image comparison compare XZ compression against SquashFS with Zstd?
That's correct. It's not an exhaustive matrix of comparisons.
> Also, is there any documentation on how the on-disk format for DwarFS and its packing works which might explain the incredible size difference?
The format as of 0.2.0 is actually quite simple. It's a list of compressed data blocks, followed by a metadata block (and a schema describing the metadata block). The metadata format is implemented by and documented in [1].
There are probably 3 things that contribute to compression level:
1) Block size. DwarFS can use arbitrary block sizes (artificially limited to powers of two), and uses a much larger block size (16M) by default. SquashFS doesn't seem to be able to go higher than 1M.
2) Ordering files by similarity.
3) Segment deduplication. If segments of files overlap with previously seen data, these segments are referenced instead of written again. The minimum size of these segments can be configured and defaults to 2k. For my primary use case, of the 47.6 GB of input data, 28.2 GB are saved by file-level deduplication, and another 12.4 GB by this segment-level deduplication. So before the "real" compression algorithms actually kick in, there are only 7 GB of data left. As these are ordered by similarity, and stored in rather big blocks, some of the 16M blocks can actually be compressed down to less than 100k.
[1] https://github.com/mhx/dwarfs/blob/main/thrift/metadata.thri...
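A toy sketch of the segment-deduplication idea from point 3 (purely illustrative, not DwarFS's actual algorithm; the real minimum segment size defaults to 2k, here it's tiny for the demo): while scanning the input, whenever a window of min_seg bytes matches already-stored data, emit a back-reference instead of the raw bytes.

```python
def segment_dedupe(data: bytes, min_seg: int = 16):
    """Return a list of ("raw", bytes) and ("ref", offset, length) ops."""
    seen = {}              # window hash key -> offset of window in `stored`
    stored = bytearray()   # raw bytes written so far (the "block")
    ops = []
    raw = bytearray()      # pending raw bytes not yet flushed into an op
    i = 0
    while i < len(data):
        win = bytes(data[i:i + min_seg])
        if len(win) == min_seg and win in seen:
            if raw:
                ops.append(("raw", bytes(raw)))
                raw = bytearray()
            start = seen[win]
            # greedily extend the match against already-stored data
            j = min_seg
            while (i + j < len(data) and start + j < len(stored)
                   and data[i + j] == stored[start + j]):
                j += 1
            ops.append(("ref", start, j))
            i += j
        else:
            raw.append(data[i])
            stored.append(data[i])
            if len(stored) >= min_seg:   # index the newest window position
                seen.setdefault(bytes(stored[-min_seg:]), len(stored) - min_seg)
            i += 1
    if raw:
        ops.append(("raw", bytes(raw)))
    return ops

def rebuild(ops) -> bytes:
    """Inverse of segment_dedupe: resolve refs against the stored raw data."""
    out, stored = bytearray(), bytearray()
    for op in ops:
        if op[0] == "raw":
            out += op[1]
            stored += op[1]
        else:
            _, off, length = op
            out += stored[off:off + length]
    return bytes(out)
```

Only after this step does the general-purpose compressor see the (much smaller) remaining raw data, which is why the similarity ordering in point 2 pays off so well.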
have you investigated why this might be the case?
Very briefly. It looks like clang has a different strategy breaking up the code (which is mostly C++ templates) into actual functions vs. inlining it, and the hot code ultimately performs fewer function calls with clang than it does with gcc. But this is nowhere near a proper analysis of what's going on. :)
Just to clarify that last statement (and something to think about): with HDDs you actually want duplicate assets so that you don't cause seeks, which are VERY slow on the 5400rpm HDDs still found on some/a lot of systems.
* can search archived files potentially faster because read access is potentially faster
* fit more data on bootable media