It gets super slow (you wait minutes) when a few thousand files are tracked. And thousands of files do have to be tracked if you have, e.g., one 10GB file per day and region plus the artifacts generated from it.
You are encouraged to model your pipeline in DVC (think of make); without that, it can only track artifacts. However, it cannot run tasks in parallel, so running a pipeline takes a lot of time even on a beefy machine, because only one core is used. Obviously, you also cannot use other tools (e.g. Snakemake) to distribute/parallelize across multiple machines. Running one (part of a) stage also has some overhead, because DVC does checks before and a commit after running the task's executable.
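For readers unfamiliar with it: a DVC pipeline is declared in a `dvc.yaml` file roughly like the sketch below (stage, script, and file names here are made up); `dvc repro` then runs the stages in dependency order, one at a time:

```yaml
# dvc.yaml — hypothetical two-stage pipeline
stages:
  preprocess:
    cmd: python preprocess.py data/raw.csv data/clean.csv
    deps:
      - preprocess.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv model.pkl
    deps:
      - train.py
      - data/clean.csv
    outs:
      - model.pkl
```

`dvc repro` skips a stage whose dependencies' checksums are unchanged; that checksumming and committing of outputs is where the per-stage overhead mentioned above comes from.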
Sometimes you get merge conflicts if you manually run one part of a (partially parametrized) stage on one machine and the other part on another machine. These conflicts are cumbersome to fix.
Currently, I think they are more focused on ML features like experiment tracking (where I prefer other, more mature tools) than on performance and data safety.
There is an alternative implementation from a single developer [0] that fixes some of these problems. However, I do not use it because it probably will not see the same development progress and testing as DVC.
This sounds negative, but I think it is currently one of the best tools in this space.
[0]: https://github.com/kevin-hanselman/dud
[1]: https://github.com/kevin-hanselman/dud/tree/main/integration...
Encouraged to do what?
You might want to slow down on the use of parentheses; we are both getting lost in them.
However, it's fantastic for tracking artifacts throughout a project that have been generated by other means, for keeping those artifacts tightly in sync with Git, and for making it easy to share them without forcing people to re-run expensive pipelines.
However, I agree that in general it's best for smaller projects and use cases. For example, it still shares the primary deficiency of Make: it can only track files on the file system, not things like whether a database table has been created (unless you 'touch' your own sentinel files).
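The sentinel-file trick is exactly that: after a non-file step, 'touch' a file that the file-based tool can track. A sketch in Python (table and path names are hypothetical):

```python
import sqlite3
from pathlib import Path

# Hypothetical sentinel path; the DB table itself is invisible to Make/DVC.
SENTINEL = Path("db/.users_table_created")

def ensure_users_table(db_path: str = "app.db") -> None:
    """Create the table, then 'touch' a sentinel file so a file-based
    tool (Make, DVC) can treat the step as done."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)")
    con.commit()
    con.close()
    SENTINEL.parent.mkdir(parents=True, exist_ok=True)
    SENTINEL.touch()  # the tracked "output" standing in for the DB state

ensure_users_table()
print(SENTINEL.exists())  # True
```

The downside, of course, is that the sentinel can drift out of sync with the real database state, which is the commenter's point.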
DVC is the best tool I found, in spite of being dead slow and complex (it tries to do many things).
What alternatives would you recommend?
https://www.crunchbase.com/organization/iterative-ai/company...
It should be easy to opt out through `dvc config core.analytics false` or an env variable `DVC_ANALYTICS=False`.
Could you please clarify what you mean by `several lines of code`? We tried to make what we collect very open and visible (it prints a large message when it starts) and to make it easy to disable.
That’s your prerogative, as it’s your project, but it makes me wonder what else you’re doing that’s against users’ best interests and in your own.
That, combined with the nature of re-using the same filename for the metadata files, meant that it was common for folks to commit the binary and push it. Again, lots of history rewriting to get git sizes back down.
Maybe solutions to my problems exist, but I spent hours wrestling with it, trying to fix these bad states, and it caused me much distress.
Also configuring the backing store was generally more painful, especially if you needed >2GB.
DVC was easy to use from the first moment. The separate meta files meant it couldn't get into mixed clean/smudge states. If you aren't in a cloud workflow already, the backing store is a bit tricky, but even without AWS I made it work.
1. All git-lfs files are kept in the same folder
2. No one can push commits directly to one of the main branches; they need to raise a PR. This means commits go through review, it's easy to tell if someone has accidentally committed a binary, and we can just delete their branch from the remote, bringing the size back down.
For example, I think CodeOcean might use git-lfs under the hood but handles upload/download separately from the UI. In the sample below, you can clone the repo from the Capsule menu, while data and results are downloadable from a contextual menu available on each, respectively.
Ideally I'd love to use git-lfs on top of S3, directly. I've looked into git-annex and various git-lfs proxies, but I'm not sure they're maintained well enough to be trusting it with long-term data storage.
Huggingface datasets are built on git-lfs and it works really well for them for storage of large datasets. Ideally I'd love for AWS to offer this as a hosted thin layer on top of S3, or for some well funded or supported community effort to do the same, and in a performant way.
If you know of any such solution, please let me know!
It comes with a smart versioning approach, checks the Δ based on the checksum and has a feature to visualize the lineage.
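The checksum-based delta idea is independent of W&B; a minimal sketch of the general mechanism (not W&B's actual implementation) — compare each file's digest against a stored manifest so only changed files need to be re-uploaded:

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """MD5 of a file's contents, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def changed_files(paths, manifest):
    """Return paths whose checksum differs from the stored manifest,
    i.e. the delta that actually needs to be versioned/uploaded."""
    return [p for p in paths if manifest.get(str(p)) != file_digest(p)]

# toy demo
Path("a.txt").write_text("v1")
Path("b.txt").write_text("v1")
manifest = {str(p): file_digest(p) for p in [Path("a.txt"), Path("b.txt")]}
Path("b.txt").write_text("v2")  # modify one file
print([str(p) for p in changed_files([Path("a.txt"), Path("b.txt")], manifest)])
# ['b.txt']
```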
You can also use your existing object store and link it for very large / sensitive data.[2]
Disclaimer: I work at W&B.
[1]: https://docs.wandb.ai/guides/data-and-model-versioning/model...
[2]: https://docs.wandb.ai/guides/artifacts/track-external-files#...
Thinking more abstractly, there is a benefit to code and data living "next" to each other, if possible: committed atomically to one codebase, with the code loading/using the data without connecting to yet another system.
Another difference is that for DVC (surprisingly) data versioning itself is just one of the fundamental layers needed to provide holistic ML experiment tracking and versioning. So, DVC has a layer to describe an ML project, run it, and capture and version its inputs/outputs. In that sense DVC becomes a more opinionated / higher-level tool, if that makes sense.
Example: 100 TB, with writes every 10 mins.
Or 1 TB of Parquet, 40% of which is rewritten daily.
Maybe Pachyderm or Dolt would be better tools here.
I'd need such a tool to manage features, checkpoints and labels. This doesn't do any of that. Nor does it really handle merging multiple versions of data.
And I'd really like the code to be handled separately from the data; Git is not the place to do this. The choice of pairing code and data should happen at a higher level and be tracked along with the results. That's not going in a repo; MLflow or TensorBoard handle it better.
What's the case for handling code and data separately? In my experience, the primary motivation for using such a tool are easy reproducibility through easy tracking of code, hyperparams, and data. It's not obvious to me how that goal would be advanced by tracking code and data separately.
Handling code and data separately is important to allow easy updates to one or the other. Loose coupling allows quicker updates, rather than having to bump versions on both as DVC requires. DVC is also far heavier weight: it pulls the data referenced in the .dvc files, and you have to pick out on the CLI which ones you want.
Downloading to a local cache when needed, from your actual scripts, works much better. It's just like what transformers does for pre-trained models.
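That pattern is a cache-on-first-use download helper. A rough sketch (the cache location and fetch function are placeholders, not the transformers API):

```python
import hashlib
import tempfile
from pathlib import Path

# Placeholder cache dir; a real tool would use something like ~/.cache/my-datasets
CACHE_DIR = Path(tempfile.mkdtemp())

def cached_fetch(url: str, fetch) -> Path:
    """Return a local path for `url`, calling `fetch` only on a cache miss
    (the same pattern transformers uses for pre-trained weights)."""
    key = hashlib.sha256(url.encode()).hexdigest()
    local = CACHE_DIR / key
    if not local.exists():
        local.write_bytes(fetch(url))  # real code: an HTTP GET with retries
    return local

calls = []
def fake_fetch(url):  # stand-in for a network download
    calls.append(url)
    return b"payload"

p1 = cached_fetch("https://example.com/data.bin", fake_fetch)
p2 = cached_fetch("https://example.com/data.bin", fake_fetch)
print(len(calls), p1 == p2)  # 1 True — the second call hit the cache
```

Because the cache key is derived from the URL, bumping the dataset version is just pointing the script at a new URL; no lockstep version bump of code and data is needed.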
Also, one thing is to physically provide a way to version data (e.g. partitioned Parquet files, cloud versioning, etc.), but another is to also have a mechanism for saving/codifying the dataset version into the project. E.g., to answer which version of the data a model was built with, you would need to save some identifier / hash / list of the files that were used. DVC takes care of that part as well.
(It has mechanics to cache data that you download, Makefile-like pipelines, etc.)
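A minimal sketch of that "codify the dataset version" idea: hash the input files and write the result to a small lock file that gets committed alongside the code (file names are hypothetical, and this is not DVC's actual lock-file format):

```python
import hashlib
import json
from pathlib import Path

def dataset_fingerprint(files) -> dict:
    """Record exactly which data a model was built with: per-file
    checksums plus a combined hash over all of them."""
    entries = {}
    combined = hashlib.sha256()
    for p in sorted(files):
        digest = hashlib.sha256(Path(p).read_bytes()).hexdigest()
        entries[str(p)] = digest
        combined.update(digest.encode())
    return {"files": entries, "dataset_hash": combined.hexdigest()}

# toy demo: fingerprint one data file and write the lock file
Path("train.csv").write_text("x,y\n1,2\n")
meta = dataset_fingerprint(["train.csv"])
Path("dataset.lock.json").write_text(json.dumps(meta, indent=2))
print(sorted(meta["files"]))  # ['train.csv']
```

Committing `dataset.lock.json` to Git answers the "which data built this model?" question without storing the data itself in the repo.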