It gets super slow (you wait minutes) when a few thousand files are tracked. And thousands of files do have to be tracked if you have, e.g., one 10GB file per day and region plus the artifacts generated from it.
You are encouraged to model your pipeline in DVC (think of make); without that, it can only track artifacts. However, it cannot run tasks in parallel, so running a pipeline takes a lot of time even on a beefy machine, because only one core is used. Obviously, you also cannot use other tools (e.g. Snakemake) to distribute/parallelize across multiple machines. Running one (part of a) stage also has some overhead, because DVC does checks before and a commit after running the task's executable.
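For readers unfamiliar with it: a DVC pipeline is declared in a `dvc.yaml` file roughly like the sketch below (stage, script, and file names here are made up); `dvc repro` then runs the stages in dependency order, one at a time:

```yaml
# dvc.yaml — hypothetical two-stage pipeline
stages:
  preprocess:
    cmd: python preprocess.py data/raw.csv data/clean.csv
    deps:
      - preprocess.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv model.pkl
    deps:
      - train.py
      - data/clean.csv
    outs:
      - model.pkl
```

`dvc repro` skips a stage whose dependencies' checksums are unchanged; that checksumming and committing of outputs is where the per-stage overhead mentioned above comes from.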
Sometimes you get merge conflicts if you manually run one part of a (partially parametrized) stage on one machine and the other part on another machine. These conflicts are cumbersome to fix.
Currently, I think they are more focused on ML features like experiment tracking (where I prefer other, more mature tools) than on performance and data safety.
There is an alternative implementation from a single developer [0] that fixes some of these problems. However, I do not use it because it probably will not see the same development progress and testing as DVC.
This sounds negative, but I think it is currently one of the best tools in this space.
[0]: https://github.com/kevin-hanselman/dud
[1]: https://github.com/kevin-hanselman/dud/tree/main/integration...
Encouraged to do what?
You might want to slow down on the use of parentheses; we are both getting lost in them.
However, it's fantastic for tracking artifacts throughout a project that have been generated by other means, for keeping those artifacts tightly in sync with Git, and for making it easy to share them without forcing people to re-run expensive pipelines.
However, I agree that in general it's best for smaller projects and use cases. For example, it still shares the primary deficiency of Make: it can only track files on the file system, not things like whether a database table has been created (unless you 'touch' your own sentinel files).
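The sentinel-file trick is exactly that: after a non-file step, 'touch' a file that the file-based tool can track. A sketch in Python (table and path names are hypothetical):

```python
import sqlite3
from pathlib import Path

# Hypothetical sentinel path; the DB table itself is invisible to Make/DVC.
SENTINEL = Path("db/.users_table_created")

def ensure_users_table(db_path: str = "app.db") -> None:
    """Create the table, then 'touch' a sentinel file so a file-based
    tool (Make, DVC) can treat the step as done."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
    con.commit()
    con.close()
    SENTINEL.parent.mkdir(parents=True, exist_ok=True)
    SENTINEL.touch()  # the tracked "output" standing in for the DB state

ensure_users_table()
print(SENTINEL.exists())  # True
```

The downside, of course, is that the sentinel can drift out of sync with the real database state, which is the commenter's point.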
DVC is the best tool I found, in spite of being dead slow and complex (it tries to do many things).
What alternatives would you recommend?
https://www.crunchbase.com/organization/iterative-ai/company...
It should be easy to opt out through `dvc config core.analytics false` or an env variable `DVC_ANALYTICS=False`.
Could you please clarify what you mean by `several lines of code`? We tried to make what we collect very open and visible (it prints a large message when it starts) and to make it easy to disable.
That’s your prerogative, as it’s your project, but it makes me wonder what else you’re doing that’s against users’ best interests and in your own.
That, combined with the nature of re-using the same filename for the metadata files, meant that it was common for folks to commit the binary and push it. Again, lots of history rewriting to get git sizes back down.
Maybe solutions to my problems exist, but I spent hours wrestling with it, trying to fix these bad states, and it caused me much distress.
Also configuring the backing store was generally more painful, especially if you needed >2GB.
DVC was easy to use from the first moment. The separate meta files meant it couldn't get into mixed clean/smudge states. If you aren't in a cloud workflow already, the backing store is a bit tricky, but even without AWS I made it work.
1. All git-lfs files are kept in the same folder
2. No one can push commits directly to one of the main branches; they need to raise a PR. This means commits go through review, it's easy to tell if someone has accidentally committed a binary, and we can just delete their branch from the remote, bringing the size back down.
For example, I think CodeOcean might use git-lfs under the hood but handles upload/download separately from the UI. In the sample below, you can clone the repo from the Capsule menu, while data and results are downloadable from a contextual menu available on each, respectively.
Ideally I'd love to use git-lfs on top of S3, directly. I've looked into git-annex and various git-lfs proxies, but I'm not sure they're maintained well enough to be trusting it with long-term data storage.
Huggingface datasets are built on git-lfs and it works really well for them for storage of large datasets. Ideally I'd love for AWS to offer this as a hosted thin layer on top of S3, or for some well funded or supported community effort to do the same, and in a performant way.
If you know of any such solution, please let me know!
It comes with a smart versioning approach, checks the Δ based on the checksum and has a feature to visualize the lineage.
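The checksum-based delta idea is independent of W&B; a minimal sketch of the general mechanism (not W&B's actual implementation) — compare each file's digest against a stored manifest so only changed files need to be re-uploaded:

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """MD5 of a file's contents, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def changed_files(paths, manifest):
    """Return paths whose checksum differs from the stored manifest,
    i.e. the delta that actually needs to be versioned/uploaded."""
    return [p for p in paths if manifest.get(str(p)) != file_digest(p)]

# toy demo
Path("a.txt").write_text("v1")
Path("b.txt").write_text("v1")
manifest = {str(p): file_digest(p) for p in [Path("a.txt"), Path("b.txt")]}
Path("b.txt").write_text("v2")  # modify one file
print([str(p) for p in changed_files([Path("a.txt"), Path("b.txt")], manifest)])
# ['b.txt']
```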
You can also use your existing object store and link it for very large / sensitive data.[2]
Disclaimer: I work at W&B.
[1]: https://docs.wandb.ai/guides/data-and-model-versioning/model...
[2]: https://docs.wandb.ai/guides/artifacts/track-external-files#...
Thinking more abstractly, there is a benefit to code and data living "next" to each other, if possible: committed atomically to one codebase, with the code loading/using the data without connecting to yet another system.
Another difference is that for DVC (surprisingly) data versioning itself is just one of the fundamental layers needed to provide holistic ML experiment tracking and versioning. So, DVC has a layer to describe an ML project, run it, and capture and version its inputs/outputs. In that sense DVC becomes a more opinionated / higher-level tool, if that makes sense.
Example: 100 TB, with writes every 10 mins.
Or 1 TB of Parquet, 40% of which is rewritten daily.
Maybe Pachyderm or Dolt would be better tools here.
I'd need such a tool to manage features, checkpoints and labels. This doesn't do any of that. Nor does it really handle merging multiple versions of data.
And I'd really like the code to be handled separately from the data; Git is not the place to do this. The choice of pairing code and data should happen at a higher level and be tracked along with the results. That's not going in a repo; MLflow or TensorBoard handle it better.
What's the case for handling code and data separately? In my experience, the primary motivation for using such a tool are easy reproducibility through easy tracking of code, hyperparams, and data. It's not obvious to me how that goal would be advanced by tracking code and data separately.
Handling code and data separately is important to allow easy updates to one or the other. Loose coupling allows quicker updates, rather than having to bump versions on both as DVC requires. DVC is also far heavier weight: it pulls the data referenced in the .dvc files, and you have to pick out on the CLI which ones you want.
Downloading to a local cache when needed, from your actual scripts, works much better. It's just like what transformers does for pre-trained models.
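That pattern is a cache-on-first-use download helper. A rough sketch (the cache location and fetch function are placeholders, not the transformers API):

```python
import hashlib
import tempfile
from pathlib import Path

# Placeholder cache dir; a real tool would use something like ~/.cache/my-datasets
CACHE_DIR = Path(tempfile.mkdtemp())

def cached_fetch(url: str, fetch) -> Path:
    """Return a local path for `url`, calling `fetch` only on a cache miss
    (the same pattern transformers uses for pre-trained weights)."""
    key = hashlib.sha256(url.encode()).hexdigest()
    local = CACHE_DIR / key
    if not local.exists():
        local.write_bytes(fetch(url))  # real code: an HTTP GET with retries
    return local

calls = []
def fake_fetch(url):  # stand-in for a network download
    calls.append(url)
    return b"payload"

p1 = cached_fetch("https://example.com/data.bin", fake_fetch)
p2 = cached_fetch("https://example.com/data.bin", fake_fetch)
print(len(calls), p1 == p2)  # 1 True — the second call hit the cache
```

Because the cache key is derived from the URL, bumping the dataset version is just pointing the script at a new URL; no lockstep version bump of code and data is needed.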
Also, one thing is to physically provide a way to version data (e.g. partitioned Parquet files, cloud versioning, etc.), but another is to also have a mechanism for saving/codifying the dataset version into the project. E.g., to answer which version of the data a model was built with, you would need to save some identifier / hash / list of the files that were used. DVC takes care of that part as well.
(It has mechanics to cache data that you download, Makefile-like pipelines, etc.)
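A minimal sketch of that "codify the dataset version" idea: hash the input files and write the result to a small lock file that gets committed alongside the code (file names are hypothetical, and this is not DVC's actual lock-file format):

```python
import hashlib
import json
from pathlib import Path

def dataset_fingerprint(files) -> dict:
    """Record exactly which data a model was built with: per-file
    checksums plus a combined hash over all of them."""
    entries = {}
    combined = hashlib.sha256()
    for p in sorted(files):
        digest = hashlib.sha256(Path(p).read_bytes()).hexdigest()
        entries[str(p)] = digest
        combined.update(digest.encode())
    return {"files": entries, "dataset_hash": combined.hexdigest()}

# toy demo: fingerprint one data file and write the lock file
Path("train.csv").write_text("x,y\n1,2\n")
meta = dataset_fingerprint(["train.csv"])
Path("dataset.lock.json").write_text(json.dumps(meta, indent=2))
print(sorted(meta["files"]))  # ['train.csv']
```

Committing `dataset.lock.json` to Git answers the "which data built this model?" question without storing the data itself in the repo.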