
0 points | mlthoughts2018 | 7y ago | 0 comments

How do you handle custom environment requirements, whether it’s Python version, library version, or more complex things in the environment that some code might run on?

Basically, suppose I wanted everything that I could define in a Docker container to be available “as the environment” in which the notebook is running. How do I do that?

I ask because I’ve started to see an alarming proliferation of “notebook as a service” platforms that don’t offer that type of full environment spec, if they offer any configuration of the run time environment at all.

I’ve taught probability and data science at university level and worked in machine learning in a variety of businesses too, and I’d say for literally all use cases, from the quickest little pure-pedagogy prototype of a canned Keras model to a heavily customized use case with custom-compiled TensorFlow, different data assets for testing vs ad hoc exploration vs deployment, etc., the absolute minimum needed before anything can be said to offer “reproducibility” is a complete specification of the run time environment and artifacts.

The trend to convince people that a little “poke around with scripts in a managed environment” offering is value-additive is dangerous, very similar to MATLAB’s approach to entwine all data exploration with the atrocious development habits that are facilitated by the console environment (and to specifically target university students with free licenses, to use a drug dealer model to get engineers hooked on MATLAB’s workflow model and use that to leverage employers to oblige by buying and standardizing on abjectly bad MATLAB products).

Any time I meet young data scientists I always try to encourage them to avoid junk like that. It’s vital to begin experiments with fully reproducible artifacts like thick archive files or containers, and to structure code into meaningful reproducible units even for your first ad hoc explorations, and to absolutely always avoid linear scripting as an exploratory technique (it is terrible and ineffective for such a task).

Kaggle Kernels seems like a cool idea, so long as the programmer must fully define artifacts that describe the complete entirety of the run time environment, and nobody is sold on the Kool Aid of just linear scripting in some other managed environment.

Each kernel for example could have a link back to a GitHub repo containing a Dockerfile and build scripts for what defined the precise environment the notebook is running in. Now that’s reproducible.
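As a much lighter-weight illustration of "the environment is an artifact you commit", here is a minimal Python sketch that records the interpreter version, platform, and installed package versions as a JSON manifest. It is not a substitute for the Dockerfile described above (it says nothing about system libraries or compilers), and the names `capture_environment` and `write_manifest` are made up for this example.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment() -> dict:
    """Collect a minimal description of the current Python runtime."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip distributions with incomplete metadata
            packages[name] = version
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }


def write_manifest(path: str) -> dict:
    """Write the captured environment to a JSON file and return it."""
    env = capture_environment()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(env, f, indent=2, sort_keys=True)
    return env
```

Calling `write_manifest("environment.json")` next to a notebook gives reviewers at least a record of what was installed when it ran; a container recipe would still be needed to actually rebuild that environment.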


westurner 7y ago

Here are the Kaggle Kernels Dockerfiles:

- Python: https://github.com/Kaggle/docker-python/blob/master/Dockerfi...

- R: https://github.com/Kaggle/docker-rstats/blob/master/Dockerfi...

https://mybinder.org builds containers (and launches free cloud instances) on demand with repo2docker from a (commit hash, branch, or tag) repo URL: https://repo2docker.readthedocs.io/en/latest/config_files.ht...

mlthoughts2018 (OP) 7y ago

That’s a great first step! Adding the ability to customize on a per-notebook basis would be impressive.

gertlex 7y ago

Regarding "thick archive files or containers" for reproducibility: I'm curious what (at least in your view) the solution to reproducibility looked like prior to easily shareable containers like Docker? (I'm also not sure what a "thick archive" would be.)

For a brief window of time, I was aware of colleagues distributing Ubuntu VirtualBox VMs to provide complex software environments to students, which sounded like it mostly worked. Not sure if that approach was used to package up reproducible research, too.

mlthoughts2018 (OP) 7y ago

Before containers and even widespread VMs, “thick archives” basically just meant a tar file that contained all of the build tooling in addition to the project code.

So you might create an archive with a whole compiler toolchain and shell scripts / makefiles to invoke it locally on the host machine.
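To make that concrete, here is a minimal sketch using Python's standard `tarfile` module. It assumes the project code and the build tooling live in two separate directories; the function name `build_thick_archive` and the `project/` and `toolchain/` layout inside the archive are invented for illustration, not a convention any particular build system used.

```python
import tarfile
from pathlib import Path


def build_thick_archive(project_dir, toolchain_dir, out_path):
    """Bundle project code and build tooling into one gzip-compressed tar file.

    Returns the sorted list of member names so the caller can check the contents.
    """
    project_dir, toolchain_dir = Path(project_dir), Path(toolchain_dir)
    with tarfile.open(out_path, "w:gz") as tar:
        # Directories are added recursively under fixed top-level names.
        tar.add(project_dir, arcname="project")
        tar.add(toolchain_dir, arcname="toolchain")
    with tarfile.open(out_path, "r:gz") as tar:
        return sorted(tar.getnames())
```

The recipient unpacks the archive and runs the bundled scripts or makefiles, so nothing about the build depends on tools already installed on the host.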

Usually a project would have a build system that auto-generated these archives for any combination of platform / compiler options targeted for support. So you’d choose the macOS archive if you use a Mac (maybe further separated based on your architecture’s precision and which compiler, etc.)

It leads to a Chinese Menu problem of multiplicity: archive files for X platforms times Y precisions times Z compilers, etc. (especially painful for embedded devices).
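The multiplication is easy to see in a small sketch. The platform, precision, and compiler names below are placeholders, and the file naming scheme is hypothetical:

```python
from itertools import product


def archive_names(project, platforms, precisions, compilers):
    """Return one archive file name per (platform, precision, compiler) combination."""
    return [
        f"{project}-{plat}-{prec}-{comp}.tar.gz"
        for plat, prec, comp in product(platforms, precisions, compilers)
    ]


names = archive_names(
    "myproject",                      # hypothetical project name
    ["linux", "macos", "windows"],    # X = 3 platforms
    ["fp32", "fp64"],                 # Y = 2 precisions
    ["gcc", "clang"],                 # Z = 2 compilers
)
print(len(names))  # 3 * 2 * 2 = 12 archives to build, test, and host
```

Adding one more compiler or one more target platform grows the list multiplicatively, which is where the maintenance pain comes from.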

It’s a reasonable way to ship the entire build artifacts though.

VMs are a perfectly good way to distribute reproducible research. Though I think containers are the best way currently because of the usability of most container APIs (standard recipes & build experience, managed container repos, etc.).

In principle you could build convenience APIs around thick archives or VMs too, it just seems less common for whatever reason.

_fbpt 7y ago

>avoid linear scripting as an exploratory technique

What do you recommend instead for exploratory {data analysis? science?}

mlthoughts2018 (OP) 7y ago

The same thing you do for other types of development. Place separate units of logic into well modularized functions / classes / units of organization; factor out any aspects of config; add a makefile or other build scripts.

An experiment would most often be the creation or modification of a config file followed by just invoking a build command.
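A minimal sketch of that shape in Python, assuming a JSON config and a toy "experiment" (a trimmed mean). Every name here (`ExperimentConfig`, `load_config`, `trim`, `run_experiment`) is hypothetical; the point is only that parameters live in the config and logic lives in small functions.

```python
import json
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    values: tuple
    trim_fraction: float = 0.0


def load_config(text: str) -> ExperimentConfig:
    """Parse a JSON config string into a typed, immutable config object."""
    raw = json.loads(text)
    return ExperimentConfig(
        values=tuple(raw["values"]),
        trim_fraction=raw.get("trim_fraction", 0.0),
    )


def trim(values, fraction):
    """Drop the given fraction of points from each end of the sorted data."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return ordered[k: len(ordered) - k] if k else ordered


def run_experiment(config: ExperimentConfig) -> dict:
    """Run the analysis described by the config and return summary results."""
    kept = trim(config.values, config.trim_fraction)
    return {"n": len(kept), "mean": statistics.mean(kept)}


if __name__ == "__main__":
    config = load_config('{"values": [1, 2, 3, 4, 100], "trim_fraction": 0.2}')
    print(run_experiment(config))
```

Trying a different trim fraction means editing the config and re-running, not commenting out lines or re-executing a cell; in a real project the config would sit in its own file and the run would be a makefile target.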

No cell-by-cell evaluation, no commenting things out to run differently, no magic constants or big sequences of plotting code sprinkled all over.

The program itself to explore data or fit a model might be an imperative program, but that doesn’t mean it should exist in a single large functional unit that receives modification through commenting things out, re-running a notebook cell to change parameters, etc.

While there is obviously a trade-off regarding how much design effort to put into an experiment, most often people are not putting any design into it, nowhere close to the boundary where the trade-off matters at all. Basic things like organizing separate functions, putting constants into a simple config file, etc., cost almost nothing but drastically improve usability and clarity, so you should pretty much always treat those efforts as worth it from the very start of a project.
