Most things are settled, but we expect to collect a LOT of data that will be labeled and/or auto-labeled (on the order of 100 million video clips).
We will be training multiple models for different tasks from that data and we need a good system to organize it.
Does anybody have tips or experiences with this kind of thing? We can use any on-premise or cloud solution.
Specifically, we would need:
* Data ingestion pipeline (data will come from field personnel)
* Data versioning
* The ability to define datasets that are subsets of the whole collected data
* Inexpensive storage (e.g. S3 or similar)
* Branching/merging for maintaining production training datasets
* Metadata storage and query capabilities
* A user interface for less tech-savvy people (e.g. a git-like command line is fine for engineers, but not for field personnel who are not in IT)
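To make the "datasets as subsets" point concrete, here is a minimal sketch of one way it could work: the clips themselves stay in S3, only metadata plus object keys live in a database, and a "dataset" is just a saved metadata query materialized as a list of keys, so nothing ever needs to be checked out. All names here (the `clips` table, `materialize_dataset`) are hypothetical, not any existing tool's API:

```python
import sqlite3

# Hypothetical sketch: clips live in S3; only metadata + object keys live in a DB.
# A "dataset" is then a saved query over that metadata, materialized as a list
# of object keys -- no file checkout required.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE clips (
        s3_key      TEXT PRIMARY KEY,  -- e.g. raw/<uuid>.mp4 in some bucket
        label       TEXT,              -- human or auto label
        source      TEXT,              -- which field team / device uploaded it
        ingested_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO clips VALUES (?, ?, ?, ?)",
    [
        ("raw/a.mp4", "pedestrian", "team-1", "2024-01-03"),
        ("raw/b.mp4", "vehicle",    "team-1", "2024-01-04"),
        ("raw/c.mp4", "pedestrian", "team-2", "2024-01-05"),
    ],
)

def materialize_dataset(where_clause, params=()):
    """Return the sorted list of S3 keys matching a metadata query."""
    rows = conn.execute(
        f"SELECT s3_key FROM clips WHERE {where_clause} ORDER BY s3_key",
        params,
    ).fetchall()
    return [r[0] for r in rows]

pedestrians = materialize_dataset("label = ?", ("pedestrian",))
print(pedestrians)  # ['raw/a.mp4', 'raw/c.mp4']
```

At 100M rows this would obviously be a real database rather than SQLite, but the idea scales: versioning and branching then operate on the (small) key lists, not on the petabytes of raw clips.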
I know of tools like https://dvc.org/ but a) they are just layers on top of git, b) they break apart on huge datasets without a folder hierarchy (git tree objects just don't work for linear lists of items), c) they are only usable by IT personnel, and d) they require checking out at least part of the dataset.
Our datasets would be 100,000,000 clips x 100 MB = 10 PB of raw data. Training data should be delivered to training nodes over the network; we just can't have a full checkout of that data.
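The network-delivery requirement can be sketched as a streaming loader: training nodes pull batches of clips on demand instead of checking anything out. The fetch function is injected so this runs without cloud credentials; in production it would be a thin wrapper around S3 GetObject (e.g. boto3's `s3.get_object(Bucket=..., Key=...)["Body"].read()`). `stream_clips` and the stub store are assumptions for illustration only:

```python
def stream_clips(keys, fetch, batch_size=4):
    """Yield batches of clip payloads, fetched over the network on demand.

    `keys` is an iterable of object keys (e.g. a materialized dataset);
    `fetch` maps one key to its bytes. No local checkout is ever made.
    """
    batch = []
    for key in keys:
        batch.append(fetch(key))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Stub object store standing in for S3 so the sketch is runnable offline.
fake_store = {f"raw/{i}.mp4": bytes([i]) for i in range(10)}
batches = list(stream_clips(sorted(fake_store), fake_store.__getitem__, batch_size=4))
print(len(batches))  # 3 batches: 4 + 4 + 2 clips
```

A real version would add prefetching and parallel range-GETs so the GPUs never wait on the network, but the contract is the same: the training node only ever holds the current batch in memory, never the 10 PB corpus.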