I've considered using it before as an alternative to Git LFS.
* git diff doesn't work in any sensible way
* if you forget and do `git add` instead of `git annex add`, everything is fine, but you've now spoilt the nice thing that git annex does of de-duping files. (git annex only stores one copy of identical files)
* for our use case (which I'm sure is the wrong way of doing things) it's possible to overwrite the single copy of a file that git annex stores, which rather spoils the point of the thing. I do think it's down to the way we use it, though, so not specifically a git annex problem
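To make the de-duping point concrete, here is a minimal sketch of content-addressed storage in general. It is not git annex's actual implementation: `add_file`, `demo`, the flat object directory, and the file names are all made up for illustration. The only idea it shows is that identical content maps to one stored object.

```python
import hashlib
import tempfile
from pathlib import Path


def add_file(store: Path, index: dict[str, str], name: str, data: bytes) -> str:
    """Store data under its SHA-256 digest and record name -> digest.

    Hypothetical helper: the layout is illustrative, not git annex's.
    """
    digest = hashlib.sha256(data).hexdigest()
    obj = store / digest
    if not obj.exists():  # identical content is written only once
        obj.write_bytes(data)
    index[name] = digest
    return digest


def demo() -> tuple[int, int]:
    """Add three names, two with identical content; return (names, objects)."""
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)
        index: dict[str, str] = {}
        add_file(store, index, "a.bin", b"same bytes")
        add_file(store, index, "copy-of-a.bin", b"same bytes")
        add_file(store, index, "b.bin", b"other bytes")
        return len(index), len(list(store.iterdir()))


if __name__ == "__main__":
    names, objects = demo()
    print(f"{names} names, {objects} stored objects")
```

Because every name with the same content resolves to the same object, anything that can write to that object in place changes it for all of those names.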
The _great_ thing about git annex is it can be self-hosted. For various reasons we can't put our source data in one of the systems that uses git-lfs.
We've got about 800 GB of data in git annex and I've been happy with it despite the limitations.
`git annex config --set annex.largefiles 'largerthan=1kb and not (mimeencoding=us-ascii or mimeencoding=utf-8)'`
> By default, git-annex add adds all files to the annex (except dotfiles), and git add adds files to git (unless they were added to the annex previously). When annex.largefiles is configured, both git annex add and git add will add matching large files to the annex, and the other files to git. (Source: https://git-annex.branchable.com/git-annex/)

Note that `git add` will add large files unlocked, though, since (as far as I understand) it assumes, to be safe, that you're still modifying them:

> If you use git add to add a file to the annex, it will be added in unlocked form from the beginning. This allows workflows where a file starts out unlocked, is modified as necessary, and is locked once it reaches its final version. (Source: https://git-annex.branchable.com/git-annex-unlock/)
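For anyone wondering what that `annex.largefiles` expression selects, here is a rough Python sketch of the same rule. It is an approximation, not what git-annex runs: as far as I know, git-annex determines `mimeencoding` with libmagic, whereas this sketch just tries a UTF-8 decode as a stand-in, and treating `1kb` as 1000 bytes is my assumption. The function names are hypothetical.

```python
def looks_like_text(data: bytes) -> bool:
    """Crude stand-in for 'mimeencoding=us-ascii or mimeencoding=utf-8'.

    Assumption: a successful UTF-8 decode counts as text (ASCII is a
    subset of UTF-8). libmagic may classify some files differently.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def goes_to_annex(data: bytes, threshold: int = 1000) -> bool:
    """Approximate 'largerthan=1kb and not (text encodings)'.

    Assumption: '1kb' is taken to mean 1000 bytes.
    """
    return len(data) > threshold and not looks_like_text(data)


if __name__ == "__main__":
    samples = {
        "large binary": b"\xff\x00" * 1000,
        "small binary": b"\xff\x00" * 10,
        "large text": b"plain text\n" * 1000,
    }
    for label, data in samples.items():
        target = "annex" if goes_to_annex(data) else "git"
        print(f"{label}: {target}")
```

The effect is that source code and other text stay in plain git, where `git diff` still works, and only sizeable binary files go to the annex.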
Hugging Face uses git-lfs for large datasets with good success. git-lfs on GitHub gets very pricey at higher volumes of data. I would love the affordability of object storage, just with a better git blob storage interface that will still be around in the future.
Most of these systems do their own hash calculations and are not interchangeable with each other. I feel like git-lfs has the momentum in data science at the moment, but it needs some better options for people who want a low-cost storage option that they can control.
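As a sketch of why the tools don't interoperate, here is what git-lfs and git-annex would each record for the same bytes, as far as I understand the git-lfs v1 pointer format and git-annex's SHA256E key format. The function names are hypothetical, and the extension handling is simplified: git-annex derives the extension from the file name, whereas here it is passed in by hand.

```python
import hashlib


def lfs_pointer(data: bytes) -> str:
    """Text of a git-lfs pointer file (spec v1) for the given content."""
    digest = hashlib.sha256(data).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{digest}\n"
        f"size {len(data)}\n"
    )


def annex_key(data: bytes, extension: str = "") -> str:
    """git-annex style key for the SHA256E backend (simplified)."""
    digest = hashlib.sha256(data).hexdigest()
    return f"SHA256E-s{len(data)}--{digest}{extension}"


if __name__ == "__main__":
    payload = b"hello"
    print(lfs_pointer(payload))
    print(annex_key(payload, ".txt"))
```

In this particular pairing the digest happens to be the same SHA-256, but what gets committed to git differs (a small pointer file versus a key-based link), so one tool cannot read the other's repository without a conversion step.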
Hugging Face is great, but it's one more service to onboard if you're in an enterprise. And data privacy/retention/governance means that many people would like their data to reside on their own infrastructure.
If AWS were to give us a low-cost hosted git-lfs service on top of S3, it would be very popular.
If anyone knows of some good alternatives, please let us know!
When does it use hard links? As far as I remember it used symlinks unless you used something like annex.hardlink (described in the man page: https://git-annex.branchable.com/git-annex/)