Then, when you committed a large binary in a format git understood, what git would really be committing in its place would be a directory tree: something like the "resource tree" you see if you open an MKV file, PNG file, etc. in an editor, but realized as actual files in directories. Git would generate it, then commit it.
On checkout, this process would happen in reverse: a matching git-smudge filter could notice a metadata file in each of these generated directories, and collapse the contents of the directory together to form a binary chunk; recursively, up the tree, until you hit the toplevel, and end up with the original large binary again.
Since most of the generated leaf-nodes from this process wouldn't change on each commit, this would eliminate most of the storage overhead of having many historical versions of large files in git. (In exchange for: 1. the potentially-huge CPU overhead of doing this "taking apart" of the file on every commit; 2. the added IOPS for temporarily creating the files to commit them; and 3. the loss of any file-level compression [though git itself compresses its packfiles, so that's a wash.])
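The clean/smudge wiring for this already exists in git. A clean filter transforms a file's content on the way into the object store, and the matching smudge filter reverses it on checkout. A minimal round-trip sketch, using base64 as a stand-in for the hypothetical explode/implode step (clean/smudge filters operate on a single content stream, so this shows the mechanics rather than the tree explosion itself; the filter name "treeify" is made up):

```shell
cd "$(mktemp -d)"
git init -q demo && cd demo
# Toy clean/smudge pair: "clean" runs when content enters the object
# store, "smudge" runs on checkout. base64 stands in for the
# hypothetical explode/implode of a binary into a directory tree.
git config filter.treeify.clean  'base64'
git config filter.treeify.smudge 'base64 -d'
echo '*.bin filter=treeify' > .gitattributes
printf 'hello' > blob.bin
git add .gitattributes blob.bin
# What git actually stored is the cleaned (transformed) content:
git cat-file -p :blob.bin   # prints aGVsbG8=, not the original bytes
```

The real version would need the clean step to emit many small blobs rather than one stream, which is where it departs from what stock filters can do.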
I'm almost inspired to try this out for a simple binary tree format like https://en.wikipedia.org/wiki/Interchange_File_Format. But ELF wouldn't be too hard, either! (You could even go well past the "logical tree" of ELF by splitting the text section into objects per symbol, and ensuring the object code for each symbol is stored in a PIC representation in git, even if it isn't in the binary.)
I remember reading about this technique years ago and thinking “cool, when this catches on, updates for all my software will be tiny”. But no, for some reason macOS updates are still gigabytes in size. I have no idea why.
https://github.com/elfshaker/elfshaker#applicability
In summary, for manyclangs, we compile with -ffunction-sections and -fdata-sections, and store the resulting object files. These are fairly robust to insertions and deletions, since the addresses are section relative, so the damage of any addresses changing is contained within the sections. A somewhat surprising thing is that this works well enough when building many revisions of clang/llvm -- as you go from commit to commit, many commits have bit identical object files, even though the build system often wants to rebuild them because some input has changed.
elfshaker packs use a heuristic of sorting all unique objects by size, before concatenating them and storing them with zstandard. This gives us an amortized cost-per-commit of something like 40kiB after compression with zstandard.
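That heuristic is simple enough to sketch in a few lines: sort the unique objects by size, concatenate them, and compress the whole stream once, so that similar-sized (and often similar) objects end up near each other in the compression window. A toy version, with gzip standing in for zstandard just to keep it dependency-free (elfshaker's actual packing is more involved):

```shell
cd "$(mktemp -d)"
mkdir objs
printf 'aaaa'   > objs/a.o
printf 'bb'     > objs/b.o
printf 'cccccc' > objs/c.o
# Sort unique objects by size, concatenate, compress once
find objs -type f -printf '%s %p\n' \
  | sort -n \
  | cut -d' ' -f2- \
  | xargs cat \
  | gzip > pack.gz
# Pack order is by size: b.o (2 bytes), then a.o (4), then c.o (6)
gzip -dc pack.gz   # prints bbaaaacccccc
```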
[0] (edit: despite the playful name suggesting otherwise -- when we chose the name we planned to do more with ELF files, but it turned out to be unnecessary for our use case)
Perforce is still used by game developers and other creatives because it handles large binaries quite well.
In fact, I'm not sure if they still do it, but one of the game engines (Unreal, I think) used to have a free tier that also included a free Perforce install.