
0 points · perl4ever · 6y ago · 0 comments

The notebook/procedure thing. Like, doesn't everybody everywhere operate on a basis of mixed manual/automated procedures, where the work needs to transition fluidly from one to the other, yet still be controlled, recorded, verified, and structured?


westurner · 6y ago

REES is one solution to reproducibility of the computational environment.

> BinderHub ( https://mybinder.org/ ) creates docker containers from {git repos, Zenodo, FigShare,} and launches them in free cloud instances also running JupyterLab by building containers with repo2docker (with REES (Reproducible Execution Environment Specification)). This means that all I have to do is add an environment.yml to my git repo in order to get Binder support so that people can just click on the badge in the README to launch JupyterLab with all of the dependencies installed.

> REES supports a number of dependency specifications: requirements.txt, Pipfile.lock, environment.yml, apt.txt, postBuild. With an environment.yml, I can install the necessary CPython/PyPy version and everything else.
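To make the "add an environment.yml, then click the badge" step concrete, here is a minimal sketch. The environment contents and the someuser/somerepo names are hypothetical placeholders; the URL pattern is the public mybinder.org `v2/gh` form as I understand it, with `urlpath=lab` to open JupyterLab, so check it against the Binder docs before relying on it.

```python
from urllib.parse import quote

# Hypothetical minimal environment.yml; the package list is an
# illustrative assumption, not taken from any real repo.
ENVIRONMENT_YML = """\
name: example
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy
"""


def binder_url(org, repo, ref="HEAD", lab=True):
    """Build a mybinder.org launch URL for a GitHub repo.

    `ref` is a branch, tag, or commit; pinning a commit is better
    for reproducibility than a moving branch.
    """
    url = "https://mybinder.org/v2/gh/{}/{}/{}".format(
        quote(org, safe=""), quote(repo, safe=""), quote(ref, safe="")
    )
    if lab:
        # Ask Binder to open JupyterLab instead of the classic UI.
        url += "?urlpath=lab"
    return url


def binder_badge(org, repo, ref="HEAD"):
    """Return the Markdown badge line to paste into a README."""
    return "[![Binder](https://mybinder.org/badge_logo.svg)]({})".format(
        binder_url(org, repo, ref)
    )


print(binder_badge("someuser", "somerepo", "main"))
```

The badge line goes in the README; the environment.yml goes at the repo root (or in a binder/ directory).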

REES: https://repo2docker.readthedocs.io/en/latest/specification.h...

REES configuration files: https://repo2docker.readthedocs.io/en/latest/config_files.ht...

Storing a container built with repo2docker in a container registry is one way to increase the likelihood that it'll be possible to run the same analysis pipeline with the same data and get the same results years later.
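A rough sketch of that build-and-archive step, assuming the `jupyter-repo2docker` CLI with its `--no-run`, `--image-name`, and `--ref` options and a plain `docker push`. The registry name, image tag, and repo URL are hypothetical, and the script only prints the commands rather than running them.

```python
import shlex


def repo2docker_build_cmd(repo, image_name, ref=None):
    """Compose a repo2docker command that builds an image without launching it."""
    cmd = ["jupyter-repo2docker", "--no-run", "--image-name", image_name]
    if ref:
        # Pin the build to a tag or commit so the image maps to known source.
        cmd += ["--ref", ref]
    cmd.append(repo)
    return cmd


def docker_push_cmd(image_name):
    """Compose the command that uploads the built image to its registry."""
    return ["docker", "push", image_name]


# Hypothetical names, for illustration only.
image = "registry.example.org/lab/analysis:v1.0"
commands = [
    repo2docker_build_cmd("https://github.com/someuser/somerepo", image, ref="v1.0"),
    docker_push_cmd(image),
]
for cmd in commands:
    # Print instead of executing; pass each list to subprocess.run to run it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Archiving the image only covers the environment; the data the pipeline reads still has to be versioned separately.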

...

Pachyderm ( https://pachyderm.io/platform/ ) does Data Versioning, Data Pipelines (with commands that each run in a container), and Data Lineage (~ "data provenance"). What other platforms are there for versioning data and recording data provenance?
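For a sense of what a lineage record has to capture, here is a minimal, generic sketch. It is not Pachyderm's (or any other platform's) API; the function names and record fields are my own assumptions. The idea is just to hash the inputs, the command, the container image, and the outputs so a result can be traced back to exactly what produced it.

```python
import hashlib
import json


def sha256_hex(data):
    """Content hash used to identify a blob of bytes."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(command, image, inputs, outputs):
    """Build a provenance record for one pipeline step.

    `inputs` and `outputs` map a name to the raw bytes of that dataset.
    The record id is a hash over everything else in the record, so the
    same command, image, and data always yield the same id.
    """
    record = {
        "command": list(command),
        "image": image,
        "inputs": {name: sha256_hex(blob) for name, blob in inputs.items()},
        "outputs": {name: sha256_hex(blob) for name, blob in outputs.items()},
    }
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["id"] = sha256_hex(payload)
    return record


# Hypothetical step: a script turning raw.csv into summary.csv.
rec = provenance_record(
    command=["python", "summarize.py"],
    image="registry.example.org/lab/analysis:v1.0",
    inputs={"raw.csv": b"a,b\n1,2\n"},
    outputs={"summary.csv": b"mean_a,mean_b\n1,2\n"},
)
print(json.dumps(rec, indent=2, sort_keys=True))
```

Real platforms add storage, deduplication, and a graph of these records across steps; the sketch only shows the per-step record.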

...

Recording manual procedures is an area where we've somewhat departed from the "write in a lab notebook with a pen" practice. CoCalc records all (collaborative) inputs to the notebook with a timeslider for review.

In practice, people use notebooks for displaying generated charts, for manual exploratory analysis (which does introduce bias), for demonstrating APIs, and for teaching.

Is JupyterLab an ideal IDE? Nope, not by a long shot. nbdev makes it easier to write a function in a notebook, sync it to a module, edit it with a more complete data-science IDE (like RStudio, VSCode, Spyder, etc.), and then copy it back into the notebook. https://github.com/fastai/nbdev

westurner · 6y ago

> What other platforms are there for versioning data and recording data provenance?

Quilt also versions data and data pipelines: https://medium.com/pytorch/how-to-iterate-faster-in-machine-...

https://github.com/quiltdata/quilt (Python)
