Currently, my workflow for data analysis / modelling is essentially:
1. Write SQL query for desired dataset
2. Run query to produce CSV
3. Hash the file as an identifier
4. Upload the file to S3
5. Reference the file in Jupyter notebook / scripts etc.
6. Return to step 1 or 2 (depending on whether I'm updating a report or creating a new experiment with new data).
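Steps 3–5 boil down to content-addressing the CSV so the notebook can pin an exact dataset version. A minimal sketch of what I do (bucket name and paths are made up, and the boto3 upload is commented out since it needs credentials):

```python
import hashlib
from pathlib import Path

def content_hash(path: str) -> str:
    """SHA-256 of the file contents, used as an immutable identifier."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def s3_key(path: str) -> str:
    """Key the dataset by its hash so identical re-runs dedupe naturally."""
    return f"datasets/{content_hash(path)}/{Path(path).name}"

# Upload (hypothetical bucket; assumes boto3 is installed and configured):
# import boto3
# boto3.client("s3").upload_file("result.csv", "my-bucket", s3_key("result.csv"))
```

The notebook then references `s3://my-bucket/datasets/<hash>/result.csv`, so old reports keep pointing at the exact bytes they were built from.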
I'm curious whether people have experience using tools such as DVC [0] for managing experiments. Git LFS could be useful, but it seems aimed more at binary assets than at datasets running to many GBs.
[0] https://dvc.org/
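From what I can tell, DVC would replace most of the manual steps above with something like this (remote name and paths here are placeholders, not a tested setup):

```shell
dvc init                                          # once per repo
dvc remote add -d store s3://my-bucket/dvc-store  # point DVC at S3
dvc add data/result.csv       # hashes the file, writes data/result.csv.dvc
git add data/result.csv.dvc data/.gitignore
git commit -m "Track dataset with DVC"
dvc push                      # uploads the blob to the S3 remote
```

The appeal is that the hash and the S3 upload (my steps 3–4) become implicit, and the small `.dvc` pointer file is versioned in git alongside the code that uses it.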