Using `bup` (a deduplicating backup tool built on the git packfile format), I deduplicated 4 Chromium builds into the size of 1. It could probably pack thousands into the size of a few.
Large download/storage requirements for updates are one of NixOS's few drawbacks, and I think deduplication could solve that pretty much completely.
I'm currently evaluating `bupstash` (also written in Rust) as a replacement. It's faster and uses a lot less memory, but it is younger and thus lacks some features.
Here is somebody's benchmark of `bupstash` (unfortunately not including `bup`): https://acha.ninja/blog/encrypted_backup_shootout/
The `bupstash` author is super responsive on Gitter/Matrix, it may make sense to join there to discuss approaches/findings together.
I would really like to eventually have deduplication-as-a-library, to make it easier to put into programs like nix, or other programs, e.g. for versioned "Save" functionality in software like Blender or Meshlab that works with huge files, and for which diff-based incremental saving is more difficult/fragile to implement than deduplicating snapshot-based saving.
I initially built this for having access to 1000+ Perl installations (spanning decades of Perl releases). The compression in this case is not quite as impressive (50 GiB to around 300 MiB), but access times are typically in the millisecond region.
We did some preliminary experiments with git a while back but found we were able to do the packing and extraction much faster and smaller than git was able to manage. However, we haven't had the time to repeat the experiments with our latest knowledge and the latest version of git. So it is entirely possible that git might be an even better answer here in the end. We just haven't done the best experiments yet. It's something to bear in mind. If someone wants, they could measure this fairly easily by unpacking our snapshots and storing them into git.
On our machines, forming a snapshot of one llvm+clang build takes hundreds of milliseconds. Forming a packfile for 2,000 clang builds with elfshaker can take seconds during the pack phase with a 'low' compression level (a minute or two for the best compression level, which gets it down to the ~50-100MiB/mo range), and extracting takes less than a second. Initial experiments with git showed it was going to be much slower.
Down the line maybe it would even be possible to have binaries as “first-class” (save for diff I guess)
> manyclangs is a project enabling you to run any commit of clang within a few seconds, without having to build it.
> It provides elfshaker pack files, each containing ~2000 builds of LLVM packed into ~100MiB. Running any particular build takes about 4s.
I'm not sure the linking step they provide is deterministic/hermetic; if it is, that would prove a decent way to compress the final binaries while shaving off most of the compilation time. Maybe the manyclangs repo could store hashes of the linked binaries, if so?
I'm not seeing any particular tricks done in elfshaker itself to enable this, the packfile system orders objects by size as a heuristic for grouping similar objects together and compresses everything (using zstd and parallel streams for, well, parallelism). Sorting by size seems to be part of the Git heuristic for delta packing: https://git-scm.com/docs/pack-heuristics
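To illustrate that heuristic (my own sketch, not elfshaker's actual code; zlib stands in for zstd since it ships with Python's standard library): sorting objects by size before compressing them as one stream tends to put similar objects next to each other, where the compressor's window can exploit the redundancy.

```python
import zlib

def pack(objects: dict[str, bytes]) -> bytes:
    # Order objects by size so similar objects tend to end up adjacent,
    # then compress the whole stream at once so the compressor's window
    # can exploit redundancy between neighbours.
    ordered = sorted(objects.values(), key=len)
    return zlib.compress(b"".join(ordered), 9)

# Two near-identical objects and one unrelated one: ordering by size
# groups the similar pair together before compression.
objs = {
    "a.o": b"\x7fELF" + b"common section " * 100 + b"variant A",
    "b.o": b"\x7fELF" + b"common section " * 100 + b"variant B",
    "c.o": b"unrelated tiny blob",
}
packed = pack(objs)
print(len(packed), sum(len(v) for v in objs.values()))
```

The packed stream comes out far smaller than the sum of the inputs because the shared bytes of the two similar objects sit within one compression window.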
I'd like to see a comparison with Git and others listed here (same unlinked clang artifacts, compare packing and access): https://github.com/elfshaker/elfshaker/discussions/58#discus...
These ROM-set archives — especially when using more modern compression algorithms, like LZMA/7zip — end up about 1.1x the size of a single one of the contained game ROM images, despite sometimes containing literally hundreds of variant images.
I use SolidWorks PDM at work to control drawings, BOMs, test procedures, etc. In all honesty, PDM does an alright job when it works, but when I have problems with our local server, all hell breaks loose and worst case, the engineers can't move forward.
In that light, I'd love to switch to another option. Preferably something decentralized just to ensure we have more backups. Git almost gets us there but doesn't include things like "where used."
All that being said, am I overlooking some features of Elfshaker that would fit well into my hopes of finding an alternative to PDM?
I also see there's another HN thread that asks the question I'm asking - just not through the lens of Elfshaker: https://news.ycombinator.com/item?id=20644770
If that's reasonably fast, perhaps an approach like that could work: server stores the entire pack, but upon user request extracts a delta between user's version and target binary.
Still, the devil is in the details of building all revisions of all software a single distribution has.
I wonder if this concept could be extended to other binary types that git has problems with, were you able to know/control more about the underlying binary format.
Honestly, I sort of looked at it for a conventional backup strategy... as in, I wonder if it could work as a replacement for tar-zipping up a directory, etc. But I'm not sure if the use case is appropriate.
Unfortunately it won't be uploaded until later, but it will show up on the LLVM YouTube channel:
git-lfs just offloads the storage of the large binaries to a remote site, and then downloads on demand.
If you have a lot of binary assets like artwork or huge excel spreadsheets, it's very useful, because in those cases, without git-lfs, the git repo will get very large, git will get extremely slow, and github will get angry at you for having too large a repo.
But it's not all roses with git-lfs, since now you're reliant on the external network to do checkouts, vs having fetched everything at once w/ the initial clone, and also of course just switching between revisions can get slower since you're network-limited to fetch those large files. (And though I'm not sure, it doesn't seem like git-lfs is doing any local caching.)
So you could imagine where something like having elfshaker embedded in the repo and integrated as a checkout filter could potentially be a useful alternative. Basically an efficient way to store binaries directly in the repo.
(Maybe it would be too small a band of use cases to be practical though? Obviously if you have lots of distinct art assets, that's just going to be big, no matter what...)
Please see our new applicability section which explains the result in a bit more detail:
https://github.com/elfshaker/elfshaker/blob/1bedd4eacd3ddd83...
In manyclangs (which uses elfshaker for storage) we arrange that the object code has stable addresses when you do insertions/deletions, which means you don't need such a filter. But today I learned about such filters, so thanks for sharing your question!
In this comment, you say "20% compression is pretty good". AFAIK, "X% compression" usually means the reduction in size, not the size remaining. Thus, 0.01% compression sounds almost useless, very different from the 10,000x written next to it.
The stored object files are compiled with -ffunction-sections and -fdata-sections, which ensures that insertions/deletions to the object file only have a local effect (they don't cause relative addresses to change across the whole binary).
As you observe, anything which causes significant non-local changes in the data you store is going to have a negative effect when it comes to compression ratio. This is why we don't store the original executables directly.
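A toy way to see why locality matters (my own sketch, with zlib standing in for zstd): store each "section" with either a constant placeholder address (section-local addressing, loosely analogous to `-ffunction-sections`) or its absolute running offset, then measure how much extra compressed space one insertion costs.

```python
import os
import zlib

def delta_cost(old: bytes, new: bytes) -> int:
    # Extra compressed bytes needed to store `new` alongside `old`
    # in one compression window: size(old + new) - size(old).
    return len(zlib.compress(old + new, 9)) - len(zlib.compress(old, 9))

def layout(bodies: list[bytes], absolute: bool) -> bytes:
    # Each "section" is a 4-byte address field plus a 12-byte body.
    # absolute=True embeds the section's running offset (global addressing);
    # absolute=False embeds a constant placeholder (local addressing).
    out, off = [], 0
    for body in bodies:
        addr = off if absolute else 0
        out.append(addr.to_bytes(4, "little") + body)
        off += 16
    return b"".join(out)

bodies = [os.urandom(12) for _ in range(200)]
inserted = bodies[:100] + [os.urandom(12)] + bodies[100:]  # one new section

local_old, local_new = layout(bodies, False), layout(inserted, False)
glob_old, glob_new = layout(bodies, True), layout(inserted, True)

# With local addressing only the new section costs anything extra;
# with absolute addressing every later section's address shifts too.
print(delta_cost(local_old, local_new), delta_cost(glob_old, glob_new))
```

The absolute-address variant pays a delta cost on every section after the insertion point, while the local-address variant pays only for the inserted bytes.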
Is this the reason why manyclangs (using LLVM's CMake-based build system) can be provided easily, but it would be more difficult for gcc? Or is the object -> binary dependency automatically deduced?
> Most of them don't change very often so there are a lot of duplicate files,
> When they do change, the deltas of the [binaries] are not huge.
We need this but for node_modules
node_modules is already tons and tons of files, and when they are large, they are usually minified and hard to split on any "natural" boundary (like ELF sections/symbols, etc.)
It seems obvious that whenever something is saved into IPFS, there might be a similar object already stored. If there is, go make a diff, and only store the diff.
[1] https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli...
It might make sense to check how they do it.
I'd also be interested in how elfshaker compares to those (and `bupstash`, which is written in Rust but doesn't have a FUSE mount yet) in terms of compression and speed.
Did you know of their existence when making elfshaker?
Edit: Question also posted in your Q&A: https://github.com/elfshaker/elfshaker/discussions/58#discus...
* There are many files,
* Most of them don't change very often,
* When they do change, the deltas of the binaries are not huge.
So, if the image files aren't changing very much, then it might work well for you. If the images are changing, their binary deltas would be quite large, so you'd get a compression ratio somewhat equivalent to if you'd concatenated the two revisions of the file and compressed them using ZStandard.
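A quick way to check that intuition (my sketch; zlib standing in for zstd): compare the compressed size of a pair of similar revisions against a pair of unrelated ones.

```python
import os
import zlib

rev1 = os.urandom(4096)                               # first revision
similar = rev1[:2048] + os.urandom(64) + rev1[2048:]  # small local edit
different = os.urandom(4096)                          # fully re-rendered

one = len(zlib.compress(rev1, 9))
similar_pair = len(zlib.compress(rev1 + similar, 9))
different_pair = len(zlib.compress(rev1 + different, 9))

# Similar revisions compress to barely more than one copy;
# unrelated revisions cost roughly two copies.
print(similar_pair / one, different_pair / one)
```

With incompressible-but-similar data the pair costs about 1x one revision; with unrelated data it costs about 2x, i.e. no deduplication benefit at all.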
Thanks
But also, in general, it might not work well for your use case, and our use case is niche. Please give it a try before making assumptions about any suitability for use.
Author here. elfshaker itself does not have a dependency on any architecture to our knowledge. We support the architectures we have use of. Contributions to add missing support are welcome.
manyclangs provides binary pack files for aarch64 because that's what we have immediate use of. If elfshaker and manyclangs prove useful to people, I would love to see resources invested to make them more widely useful.
You can still run the manyclangs binaries on other architectures using qemu [0], with some performance cost, which may be tolerable depending on your use case.
[0] https://github.com/elfshaker/manyclangs/tree/main/docker-qem...
Right now git lfs takes up so much space when storing files locally.
There is also a usability difference: elfshaker stores data in pack files, which are more easily shareable. Each of the pack files released as part of manyclangs is ~100 MiB and contains enough data to materialize ~2,000 builds of clang and LLVM.