Show HN: Git for datasets and config table versioning, that commits diffs

(dropbox.com)

1 point | eliomattia | 3y ago | 0 comments

I am building the Data Manager to version datasets and configuration tables in a storage-efficient way, and to easily identify dataset versions and deploy them to S3 to feed other code. It works on top of git for versioning but calculates and commits only incremental differences, both locally and in the cloud. For certain use cases, committing diffs can enable collaboration on huge repositories without full checkouts: you work from a logical checkout of a few kilobytes and let other machines merge your contributions into branches.

D:\install\dir\dm>dm

will:

* make sure the Data Manager is in sync with the Git HEAD
* process the data pipelines configured in data-manager-config.json in the installation folder
* for each source dataset, calculate the diffs against the state represented by the HEAD
* commit those diffs in a readable format that the Data Manager can also parse
* build snapshots and post them to S3 if configured
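To make the diff step concrete, here is a minimal Python sketch of a row-level diff between two CSV snapshots keyed by a primary key. It is not the Data Manager's actual algorithm or commit format, which are not shown here; the function names, the `id` key column, and the sample rows are all hypothetical.

```python
import csv
import io


def load_rows(csv_text, key):
    """Index the rows of a CSV document by the value of its primary-key column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def diff_rows(old, new):
    """Return the rows added, removed, and changed between two keyed snapshots."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return added, removed, changed


# Hypothetical data: the state at HEAD versus the current source file.
head_csv = "id,name,price\n1,apple,1.00\n2,pear,2.50\n3,plum,0.75\n"
work_csv = "id,name,price\n1,apple,1.10\n3,plum,0.75\n4,fig,3.00\n"

added, removed, changed = diff_rows(
    load_rows(head_csv, "id"), load_rows(work_csv, "id")
)
print(sorted(added), sorted(removed), sorted(changed))
```

Keying rows by a primary key is also why a key change is awkward in a scheme like this: the old and new rows would show up as an unrelated removal and addition.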

The installation .msi comes with sample \datasets, and running dm.exe will automatically create sample \repos. Supported data sources: CSV, xlsx. You can create snapshots by tagging commits (by default there is a customizable “api_” tag prefix filter). Snapshots, identified by tag_name:commit_sha, can be posted to S3. Files larger than a configurable threshold will also be posted to S3, if configured, and referenced indirectly in the repo. The best current use case is multiple datasets of a couple of gigabytes each that change daily.
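As an illustration of the tag_name:commit_sha identifiers and the prefix filter, here is a small Python sketch. The function name, the tag names, and the shortened SHAs are made up for the example, and whether the Data Manager uses full or abbreviated SHAs is an assumption left open.

```python
def snapshot_ids(tags, prefix="api_"):
    """Build tag_name:commit_sha identifiers for tags that pass the prefix filter."""
    return [
        f"{name}:{sha}"
        for name, sha in sorted(tags.items())
        if name.startswith(prefix)
    ]


# Hypothetical tags mapped to made-up commit SHAs.
tags = {"api_v1": "3f2a9c1", "wip": "77de012", "api_v2": "b81c4e0"}
print(snapshot_ids(tags))
```

Only the two tags starting with “api_” yield snapshot identifiers; the “wip” tag is filtered out.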

You need git installed and available in PATH (check with git --version), and you need to allow the executable (dm.exe) in your antivirus and flag it as trusted. Current usage constraints: data must be structured and tabular, dataset primary key changes are not allowed (there’s a workaround), and merge features are a work in progress. This early prototype replays history using a naive algorithm.
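If you want to script the git prerequisite check, a minimal Python sketch could look like the following. The helper name is hypothetical and this is not part of the Data Manager; it only wraps the same git --version call mentioned above.

```python
import shutil
import subprocess


def git_version():
    """Return the output of `git --version`, or None if git is not on PATH."""
    exe = shutil.which("git")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or None


version = git_version()
print(version if version else "git not found in PATH")
```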

AWS configuration goes in C:\Users\<username>\.aws\ as two files with no file extension: (1) config and (2) credentials.

(1) config content:

  [default]
  region=us-west-1

(2) credentials content:

  [default]
  aws_access_key_id=AKIA...
  aws_secret_access_key=wJalrXU...

The S3 bucket name goes in data-manager-config.json. The bucket should be available in the configured region and accessible with the provided access key.

  { "s3": { "default-bucket": "mybucket" ... } }
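A quick way to sanity-check these settings is to parse them yourself. The Python sketch below reads the region from the INI-style config text and the bucket name from the JSON settings. It parses inline strings so it is self-contained, and the JSON is reduced to the single key shown above, which is an assumption about the rest of the file. The function names are hypothetical, and this is not how the Data Manager itself loads its configuration.

```python
import configparser
import json

# Inline copies of the file contents, for illustration only.
config_text = "[default]\nregion=us-west-1\n"
settings_text = '{"s3": {"default-bucket": "mybucket"}}'


def read_region(text, profile="default"):
    """Return the region configured for a profile in AWS config-style INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser[profile]["region"]


def read_bucket(text):
    """Return the default S3 bucket name from the JSON settings text."""
    return json.loads(text)["s3"]["default-bucket"]


print(read_region(config_text), read_bucket(settings_text))
```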
