Running

    D:\install\dir\dm>dm

will:

* make sure the Data Manager is in sync with the Git HEAD
* process the data pipelines configured in data-manager-config.json in the installation folder
* for each source dataset, calculate the diffs against the state represented by the HEAD
* commit those diffs in a readable format that the Data Manager can also parse
* build snapshots and post them to S3, if configured
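To illustrate the diff step, here is a minimal Python sketch of a row-level diff between the dataset as of HEAD and the current source file. It is not the Data Manager's actual algorithm; the function names (load_rows, diff_rows), the sample data, and the "id" key column are all assumptions made up for the example.

```python
import csv
import io

def load_rows(csv_text, key):
    """Index CSV rows by their primary-key column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}

def diff_rows(old, new):
    """Return (added, removed, changed) keys between two keyed row sets."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, removed, changed

# Hypothetical data: the dataset as of HEAD versus the current source file.
head_csv = "id,name,qty\n1,apple,3\n2,pear,5\n3,plum,7\n"
source_csv = "id,name,qty\n1,apple,3\n2,pear,6\n4,fig,2\n"

added, removed, changed = diff_rows(
    load_rows(head_csv, "id"), load_rows(source_csv, "id")
)
print("added:", added, "removed:", removed, "changed:", changed)
```

Keying rows by a stable primary key is what makes this kind of diff possible.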
The installation .msi comes with sample \datasets, and running dm.exe will automatically create sample \repos. Supported data sources: CSV and XLSX.

You can create snapshots by tagging commits. By default only tags with the "api_" prefix are picked up, and the prefix filter is customizable. Snapshots, identified by tag_name:commit_sha, can be posted to S3. Files larger than a custom threshold will also be posted to S3, if configured, and referenced indirectly in the repo.

The current best use case is multiple datasets of a couple of gigabytes each, with daily changes.
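As a rough illustration of the tag prefix filter and the tag_name:commit_sha naming, here is a small Python sketch. The function snapshot_ids, the tag names, and the shortened SHAs are hypothetical; how the Data Manager actually reads tags from the repo is not shown here.

```python
def snapshot_ids(tags, prefix="api_"):
    """Build tag_name:commit_sha ids for the tags that pass the prefix filter."""
    return [
        f"{name}:{sha}"
        for name, sha in sorted(tags.items())
        if name.startswith(prefix)
    ]

# Hypothetical tags mapped to the (shortened) commits they point at.
tags = {
    "api_2023-05-01": "3f2a9c1",
    "wip": "77be012",
    "api_2023-05-02": "a81d4e0",
}
print(snapshot_ids(tags))
```

Only the two "api_" tags produce snapshot ids; the "wip" tag is ignored.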
You need Git installed and available in PATH (check with git --version), and you need to allow dm.exe in your antivirus by flagging the executable as trusted. Current usage constraints: data must be structured and tabular; dataset primary key changes are not allowed (there's a workaround); merge features are a work in progress. This early prototype replays history using a naive algorithm.
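If you want to script the Git prerequisite check, a minimal Python sketch could look like this. The helper name git_version is made up for the example; it only wraps the same git --version call mentioned above.

```python
import shutil
import subprocess

def git_version():
    """Return the output of `git --version`, or None if git is not usable."""
    if shutil.which("git") is None:
        return None  # git is not on PATH
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True
    )
    return result.stdout.strip() if result.returncode == 0 else None

print(git_version() or "git not found in PATH")
```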
Related post: https://news.ycombinator.com/item?id=35806843
AWS configuration goes in C:\Users\<username>\.aws\ as two files with no file extension: (1) config and (2) credentials.

(1) config content:

    [default]
    region=us-west-1

(2) credentials content:

    [default]
    aws_access_key_id=AKIA...
    aws_secret_access_key=wJalrXU...
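To sanity-check the two AWS files before running dm, something like the following Python sketch may help. It parses the INI-style contents with the standard configparser module. The function read_profile and the placeholder key values are assumptions for illustration, and it only checks that the expected entries exist, not that the keys are valid.

```python
import configparser

def read_profile(config_text, credentials_text, profile="default"):
    """Report the region and whether both keys exist for one profile."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    creds = configparser.ConfigParser()
    creds.read_string(credentials_text)
    return {
        "region": config.get(profile, "region", fallback=None),
        "has_access_key": creds.has_option(profile, "aws_access_key_id"),
        "has_secret_key": creds.has_option(profile, "aws_secret_access_key"),
    }

# Placeholder contents; in practice read the two files from the .aws folder.
config_text = "[default]\nregion=us-west-1\n"
credentials_text = (
    "[default]\n"
    "aws_access_key_id=EXAMPLE_KEY_ID\n"
    "aws_secret_access_key=EXAMPLE_SECRET\n"
)
print(read_profile(config_text, credentials_text))
```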
The S3 bucket name goes in data-manager-config.json. The bucket should be available in the configured region and accessible using the provided access key.

    { "s3": { "default-bucket": "mybucket" ... } }
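A quick way to confirm the bucket setting is present is to load the JSON and look up the key. This Python sketch assumes only the "s3" / "default-bucket" structure shown above; the helper name default_bucket is hypothetical, and the other settings in the file are left out.

```python
import json

def default_bucket(config_text):
    """Return s3.default-bucket from the config JSON, or None if it is not set."""
    config = json.loads(config_text)
    return config.get("s3", {}).get("default-bucket")

# Minimal stand-in for the contents of data-manager-config.json.
sample = '{"s3": {"default-bucket": "mybucket"}}'
print(default_bucket(sample))
```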