Using `bup` (a deduplicating backup tool built on the git packfile format), I deduplicated 4 Chromium builds into the size of 1. It could probably pack thousands into the size of a few.
Large download/storage requirements for updates are one of NixOS's few drawbacks, and I think deduplication could solve that pretty much completely.
I'm currently evaluating `bupstash` (also written in Rust) as a replacement. It's faster and uses a lot less memory, but it is younger and thus lacks some features.
Here is somebody's benchmark of `bupstash` (unfortunately not including `bup`): https://acha.ninja/blog/encrypted_backup_shootout/
The `bupstash` author is super responsive on Gitter/Matrix, it may make sense to join there to discuss approaches/findings together.
I would really like to eventually have deduplication-as-a-library, to make it easier to put into programs like nix, or other programs, e.g. for versioned "Save" functionality in software like Blender or Meshlab that works with huge files, and for which diff-based incremental saving is more difficult/fragile to implement than deduplicating snapshot-based saving.
I initially built this for having access to 1000+ Perl installations (spanning decades of Perl releases). The compression in this case is not quite as impressive (50 GiB to around 300 MiB), but access times are typically in the millisecond region.
We did some preliminary experiments with git a while back but found we were able to do the packing and extraction much faster and smaller than git was able to manage. However, we haven't had the time to repeat the experiments with our latest knowledge and the latest version of git. So it is entirely possible that git might be an even better answer here in the end. We just haven't done the best experiments yet. It's something to bear in mind. If someone wants, they could measure this fairly easily by unpacking our snapshots and storing them into git.
On our machines, forming a snapshot of one llvm+clang build takes hundreds of milliseconds. Forming a packfile for 2,000 clang builds with elfshaker can take seconds during the pack phase with a 'low' compression level (a minute or two for the best compression level, which gets it down to the ~50-100MiB/mo range), and extracting takes less than a second. Initial experiments with git showed it was going to be much slower.
Down the line maybe it would even be possible to have binaries as “first-class” (save for diff I guess)
> manyclangs is a project enabling you to run any commit of clang within a few seconds, without having to build it.
> It provides elfshaker pack files, each containing ~2000 builds of LLVM packed into ~100MiB. Running any particular build takes about 4s.
I'm not sure the linking step they provide is deterministic/hermetic; if it is, that would prove a decent way to compress the final binaries while shaving off most of the compilation time. Maybe the manyclangs repo could store hashes of the linked binaries, if so?
I'm not seeing any particular tricks done in elfshaker itself to enable this, the packfile system orders objects by size as a heuristic for grouping similar objects together and compresses everything (using zstd and parallel streams for, well, parallelism). Sorting by size seems to be part of the Git heuristic for delta packing: https://git-scm.com/docs/pack-heuristics
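To illustrate that heuristic (my own sketch, not elfshaker's actual code; zlib stands in for zstd since it ships with Python's standard library): sorting objects by size before compressing them as one stream tends to put similar objects next to each other, where the compressor's window can exploit the redundancy.

```python
import zlib

def pack(objects: dict[str, bytes]) -> bytes:
    # Order objects by size so similar objects tend to end up adjacent,
    # then compress the whole stream at once so the compressor's window
    # can exploit redundancy between neighbours.
    ordered = sorted(objects.values(), key=len)
    return zlib.compress(b"".join(ordered), 9)

# Two near-identical objects and one unrelated one: ordering by size
# groups the similar pair together before compression.
objs = {
    "a.o": b"\x7fELF" + b"common section " * 100 + b"variant A",
    "b.o": b"\x7fELF" + b"common section " * 100 + b"variant B",
    "c.o": b"unrelated tiny blob",
}
packed = pack(objs)
print(len(packed), sum(len(v) for v in objs.values()))
```

The packed stream comes out far smaller than the sum of the inputs because the shared bytes of the two similar objects sit within one compression window.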
I'd like to see a comparison with Git and others listed here (same unlinked clang artifacts, compare packing and access): https://github.com/elfshaker/elfshaker/discussions/58#discus...
These ROM-set archives — especially when using more modern compression algorithms, like LZMA/7zip — end up about 1.1x the size of a single one of the contained game ROM images, despite sometimes containing literally hundreds of variant images.
I use SolidWorks PDM at work to control drawings, BOMs, test procedures, etc. In all honesty, PDM does an alright job when it works, but when I have problems with our local server, all hell breaks loose and worst case, the engineers can't move forward.
In that light, I'd love to switch to another option. Preferably something decentralized just to ensure we have more backups. Git almost gets us there but doesn't include things like "where used."
All that being said, am I overlooking some features of Elfshaker that would fit well into my hopes of finding an alternative to PDM?
I also see there's another HN thread that asks the question I'm asking - just not through the lens of Elfshaker: https://news.ycombinator.com/item?id=20644770
If that's reasonably fast, perhaps an approach like that could work: server stores the entire pack, but upon user request extracts a delta between user's version and target binary.
Still, the devil is in the details of building all revisions of all software a single distribution has.
I wonder if this concept could be extended to other binary types that git has problems with, were you able to know/control more about the underlying binary format.
Honestly, I sort of looked at it for a conventional backup strategy... as in, I wonder if it could work as a replacement for tar-zipping up a directory, etc. But I'm not sure if the use case is appropriate.
Unfortunately it won't be uploaded until later, but it will show up on the LLVM YouTube channel:
git-lfs just offloads the storage of the large binaries to a remote site, and then downloads on demand.
If you have a lot of binary assets like artwork or huge excel spreadsheets, it's very useful, because in those cases, without git-lfs, the git repo will get very large, git will get extremely slow, and github will get angry at you for having too large a repo.
But it's not all roses with git-lfs, since now you're reliant on the external network to do checkouts, vs having fetched everything at once w/ the initial clone, and also of course just switching between revisions can get slower since you're network-limited to fetch those large files. (And though I'm not sure, it doesn't seem like git-lfs is doing any local caching.)
So you could imagine where something like having elfshaker embedded in the repo and integrated as a checkout filter could potentially be a useful alternative. Basically an efficient way to store binaries directly in the repo.
(Maybe it would be too small a band of use cases to be practical though? Obviously if you have lots of distinct art assets, that's just going to be big, no matter what...)
Please see our new applicability section which explains the result in a bit more detail:
https://github.com/elfshaker/elfshaker/blob/1bedd4eacd3ddd83...
In manyclangs (which uses elfshaker for storage) we arrange that the object code has stable addresses when you do insertions/deletions, which means you don't need such a filter. But today I learned about such filters, so thanks for sharing your question!
In this comment, you say "20% compression is pretty good". AFAIK, "X% compression" usually means the reduction in size, not the size remaining. Thus, 0.01% compression sounds almost useless, very different from the 10,000x written next to it.
The stored object files are compiled with -ffunction-sections and -fdata-sections, which ensures that insertions/deletions to the object file only have a local effect (they don't cause relative addresses to change across the whole binary).
As you observe, anything which causes significant non-local changes in the data you store is going to have a negative effect when it comes to compression ratio. This is why we don't store the original executables directly.
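A toy way to see why locality matters (my own sketch, with zlib standing in for zstd): store each "section" with either a constant placeholder address (section-local addressing, loosely analogous to `-ffunction-sections`) or its absolute running offset, then measure how much extra compressed space one insertion costs.

```python
import os
import zlib

def delta_cost(old: bytes, new: bytes) -> int:
    # Extra compressed bytes needed to store `new` alongside `old`
    # in one compression window: size(old + new) - size(old).
    return len(zlib.compress(old + new, 9)) - len(zlib.compress(old, 9))

def layout(bodies: list[bytes], absolute: bool) -> bytes:
    # Each "section" is a 4-byte address field plus a 12-byte body.
    # absolute=True embeds the section's running offset (global addressing);
    # absolute=False embeds a constant placeholder (local addressing).
    out, off = [], 0
    for body in bodies:
        addr = off if absolute else 0
        out.append(addr.to_bytes(4, "little") + body)
        off += 16
    return b"".join(out)

bodies = [os.urandom(12) for _ in range(200)]
inserted = bodies[:100] + [os.urandom(12)] + bodies[100:]  # one new section

local_old, local_new = layout(bodies, False), layout(inserted, False)
glob_old, glob_new = layout(bodies, True), layout(inserted, True)

# With local addressing only the new section costs anything extra;
# with absolute addressing every later section's address shifts too.
print(delta_cost(local_old, local_new), delta_cost(glob_old, glob_new))
```

The absolute-address variant pays a delta cost on every section after the insertion point, while the local-address variant pays only for the inserted bytes.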
Is this the reason why manyclangs (using LLVM's CMake-based build system) can be provided easily, but it would be more difficult for gcc? Or is the object -> binary dependency automatically deduced?
> Most of them don't change very often so there are a lot of duplicate files,
> When they do change, the deltas of the [binaries] are not huge.
We need this but for node_modules
node_modules is already tons and tons of files, and when they are large, they are usually minified and hard to split on any "natural" boundary (like ELF sections/symbols, etc.)
It seems obvious that whenever something is saved into IPFS, there might be a similar object already stored. If there is, go make a diff, and only store the diff.
[1] https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli...
It might make sense to check how they do it.
I'd also be interested in how elfshaker compares to those (and `bupstash`, which is written in Rust but doesn't have a FUSE mount yet) in terms of compression and speed.
Did you know of their existence when making elfshaker?
Edit: Question also posted in your Q&A: https://github.com/elfshaker/elfshaker/discussions/58#discus...
* There are many files,
* Most of them don't change very often,
* When they do change, the deltas of the binaries are not huge.
So, if the image files aren't changing very much, then it might work well for you. If the images are changing, their binary deltas would be quite large, so you'd get a compression ratio somewhat equivalent to if you'd concatenated the two revisions of the file and compressed them using ZStandard.
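A quick way to check that intuition (my sketch; zlib standing in for zstd): compare the compressed size of a pair of similar revisions against a pair of unrelated ones.

```python
import os
import zlib

rev1 = os.urandom(4096)                               # first revision
similar = rev1[:2048] + os.urandom(64) + rev1[2048:]  # small local edit
different = os.urandom(4096)                          # fully re-rendered

one = len(zlib.compress(rev1, 9))
similar_pair = len(zlib.compress(rev1 + similar, 9))
different_pair = len(zlib.compress(rev1 + different, 9))

# Similar revisions compress to barely more than one copy;
# unrelated revisions cost roughly two copies.
print(similar_pair / one, different_pair / one)
```

With incompressible-but-similar data the pair costs about 1x one revision; with unrelated data it costs about 2x, i.e. no deduplication benefit at all.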
Thanks
But also, in general, it might not work well for your use case, and our use case is niche. Please give it a try before making assumptions about any suitability for use.
Author here. elfshaker itself does not have a dependency on any architecture to our knowledge. We support the architectures we have use of. Contributions to add missing support are welcome.
manyclangs provides binary pack files for aarch64 because that's what we have immediate use of. If elfshaker and manyclangs prove useful to people, I would love to see resources invested to make them more widely useful.
You can still run the manyclangs binaries on other architectures using qemu [0], with some performance cost, which may be tolerable depending on your use case.
[0] https://github.com/elfshaker/manyclangs/tree/main/docker-qem...
Right now git lfs takes up so much space when storing files locally.
There is also a usability difference: elfshaker stores data in pack files, which are more easily shareable. Each of the pack files released as part of manyclangs is ~100 MiB and contains enough data to materialize ~2,000 builds of clang and LLVM.