Basically, suppose I wanted everything that I could define in a Docker container to be available “as the environment” in which the notebook is running. How do I do that?
I ask because I’ve started to see an alarming proliferation of “notebook as a service” platforms that don’t offer that type of full environment spec, if they offer any configuration of the run time environment at all.
I’ve taught probability and data science at university level and worked in machine learning in a variety of businesses too, and I’d say for literally all use cases, from the quickest little pure-pedagogy prototype of a canned Keras model to a heavily customized use case with custom-compiled TensorFlow, different data assets for testing vs ad hoc exploration vs deployment, etc., the absolutely minimum thing needed before anything can be said to offer “reproducibility” is complete specification of the run time environment and artifacts.
The trend to convince people that a little “poke around with scripts in a managed environment” offering is value-additive is dangerous, very similar to MATLAB’s approach to entwine all data exploration with the atrocious development havits that are facilitated by the console environment (and to specifically target university students with free licenses, to use a drug dealer model to get engineers hooked on MATLAB’s workflow model and use that to leverage employers to oblige by buying and standardizing on abjectly bad MATLAB products).
Any time I meet young data scientists I always try to encourage them to avoid junk like that. It’s vital to begin experiments with fully reproducible artifacts like thick archive files or containers, and to structure code into meaningful reproducible units even for your first ad hoc explorations, and to absolutely always avoid linear scripting as an exploratory technique (it is terrible and ineffective for such a task).
Kaggle Kernels seems like a cool idea, so long as the programmer must fully define artifacts that describe the complete entirety of the run time environment, and nobody is sold on the Kool Aid of just linear scripting in some other managed environment.
Each kernel for example could have a link back to a GitHub repo containing a Dockerfile and build scripts for what defined the precise environment the notebook is running in. Now that’s reproducible.