For now, checking out commits and building snapshots from the diff history happens on localhost, but that may change, as mentioned in the video. I am evaluating smarter navigation of the git history starting from the nearest available snapshot, building snapshots with Spark, and other ways to save on data transfer and compute. This approach also makes it possible to hibernate or clean up the S3 history of datasets that are no longer needed to create snapshots, such as deleted datasets, provided snapshots of earlier commits are not required. Individual data entries could also be removed for GDPR compliance using S3 object versioning, independently of git.
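To make "navigating from the nearest available snapshot" concrete, here is a minimal sketch of the idea, not the prototype's actual code. The function name `nearest_snapshot` and the `parents` and `snapshots` inputs are hypothetical, and I am assuming the commit graph is already loaded into a plain dict. It walks backwards from the target commit until it reaches a commit that already has a snapshot, then returns the commits whose diffs would need to be replayed on top of it.

```python
from collections import deque


def nearest_snapshot(target, parents, snapshots):
    """Find the closest ancestor of `target` that already has a snapshot.

    parents:   dict mapping commit id -> list of parent commit ids (hypothetical input)
    snapshots: set of commit ids for which a built snapshot exists (hypothetical input)

    Returns (snapshot_commit, commits_to_replay), where commits_to_replay is
    ordered oldest to newest and ends with `target`. Returns (None, None) if
    no ancestor has a snapshot.
    """
    # came_from[c] is the child through which we reached commit c.
    came_from = {target: None}
    queue = deque([target])
    while queue:
        commit = queue.popleft()
        if commit in snapshots:
            # Walk forward from the snapshot back up to the target.
            replay = []
            node = came_from[commit]
            while node is not None:
                replay.append(node)
                node = came_from[node]
            return commit, replay
        for parent in parents.get(commit, ()):
            if parent not in came_from:
                came_from[parent] = commit
                queue.append(parent)
    return None, None


# Linear history a <- b <- c <- d, with a snapshot stored only for b.
history = {"d": ["c"], "c": ["b"], "b": ["a"], "a": []}
base, replay = nearest_snapshot("d", history, {"b"})
print(base, replay)
```

Breadth-first search picks the snapshot that is the fewest commits away, which is only a rough proxy for the real cost: the size of each diff and how merge commits are replayed are ignored here.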
The prototype already solves the pain point I built it for. Previously I had no way to (1) uniquely identify and (2) serve behind an API multiple versions of a collection of datasets and config parameters, (3) without overburdening HDDs because of small but frequent changes to any dataset in the repo, and (4) while still seeing the diffs in git for each commit, so that changes can be discussed collaboratively and reverted or edited further if necessary.
Some background: I am building natural language AI algorithms that are (a) easily retrainable on editable training datasets, meaning that changes or deletions in the training data take effect quickly, leave no trace of past training, and do not require retraining the entire language model (sounds impossible), and (b) able to explain decisions by tracing them back to individual training data entries. LLMs have fixed training datasets, whereas editable datasets call for a system that manages data efficiently. I also wanted something that integrates naturally with common, tried and tested tools such as Git, S3, and MySQL, hence the Data Manager.
I am considering open-sourcing it: is that the best way to go? Which license should I choose?