Reproducible machine learning with Jupyter and Quilt
Jupyter notebooks document the interaction of code and data. Code dependencies are simple to express:
import numpy as np
import pandas as pd
Data dependencies, on the other hand, are messier: custom scripts acquire files from the network, parse files in a variety of formats, populate data structures, and wrangle data. As a result, reproducing data dependencies across machines, across collaborators, and over time can be a challenge. Domino's Reproducibility Engine meets this challenge by assembling code, data, and models into a unified hub.
We can think of reproducible machine learning as an equation in three variables:
code + data + model = reproducible machine learning
The open source community has produced strong support for reproducing the first variable, code. Tools like git, pip, and Docker ensure that code is versioned and uniformly executable. Data, however, poses entirely different challenges. Data is larger than code, comes in a variety of formats, and needs to be efficiently written to disk and read into memory. In this article, we'll explore an open source data router, Quilt, that versions and marshals data. Quilt does for data what pip does for code: it packages data into reusable, versioned building blocks that are accessible in Python.
In the next section, we'll set up Quilt to work with Jupyter. Then we'll work through an example that reproduces a random forest classifier.
Launch a Jupyter notebook with Quilt
In order to access Quilt, Domino cloud users can select the "Default 2017-02 + Quilt" Compute environment in Project settings. Alternatively, add the following lines to requirements.txt under Files:
Next, launch a Jupyter Workspace and open a Jupyter notebook with Python.
Quilt packages for machine learning
Let's build a machine learning model with data from Wes McKinney's Python for Data Analysis, 2nd Edition. The old way of accessing this data was to clone Wes' git repository, navigate folders, inspect files, determine formats, parse files, and then load the parsed data into Python.
With Quilt the process is simpler:
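import quilt
# Install the "titanic" folder of akarve/pydata_book at the "features" tag
quilt.install("akarve/pydata_book/titanic", tag="features", force=True)
# Python versions prior to 2.7.9 will display an SNIMissingWarning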
The above code materializes the data from the "titanic" folder of the akarve/pydata_book package. We use the "features" tag to fetch a specific version of the package where a collaborator has done some feature engineering. Each Quilt package has a catalog entry for documentation, a unique hash, and a historical log ($ quilt log akarve/pydata_book).
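To see why a unique hash is enough to pin a specific version of the data, consider the following minimal sketch. It is an illustration only and not Quilt's actual hashing scheme: the helper content_hash and the two CSV snapshots are hypothetical, and SHA-256 is an assumed choice of hash function.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies these exact bytes."""
    return hashlib.sha256(data).hexdigest()


# Two hypothetical snapshots of the same small CSV file
v1 = b"PassengerId,Survived\n1,0\n2,1\n"
v2 = b"PassengerId,Survived\n1,0\n2,0\n"  # one label edited

hash_v1 = content_hash(v1)
hash_v2 = content_hash(v2)

# Identical bytes always map to the same hash; any edit yields a different one
same_again = content_hash(v1) == hash_v1
changed = hash_v1 != hash_v2
print(same_again, changed)
```

A collaborator who records the hash alongside a notebook can later check that the data they fetched is byte-for-byte the data the notebook was written against.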
We can import data from Wes' book as follows:
from quilt.data.akarve import pydata_book as pb
If we evaluate pb.titanic in Jupyter, we'll see that it's a GroupNode that contains DataNodes:
We can access the data in pb.titanic as follows:
features = pb.titanic.features()
train = pb.titanic.train()
trainsub = train[features.values]
Note the parentheses in the code sample above. Parentheses instruct Quilt to "load data from disk into memory." Quilt loads tabular data, as in features, as a pandas DataFrame.
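The load-on-call behavior can be pictured with a small sketch. This is not Quilt's implementation: LazyTable is a hypothetical class, io.StringIO stands in for a file on disk, and the rows are parsed into plain dictionaries rather than a pandas DataFrame.

```python
import csv
import io


class LazyTable:
    """Hypothetical stand-in for a data node: remembers a source, parses only when called."""

    def __init__(self, opener):
        self._opener = opener  # zero-argument function returning a file-like object
        self.loads = 0         # how many times the source has been read

    def __call__(self):
        self.loads += 1
        with self._opener() as handle:
            return list(csv.DictReader(handle))


# io.StringIO stands in for a CSV file stored on disk
node = LazyTable(lambda: io.StringIO("Pclass,Age\n3,22\n1,38\n"))

loads_before = node.loads  # referencing the node has not parsed anything yet
rows = node()              # the parentheses trigger the load
print(loads_before, node.loads, rows)
```

Deferring the parse until the call means that importing a large package stays cheap, and only the tables a notebook actually uses are read into memory.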
Let's convert our training data into numpy arrays that are usable in scikit-learn:
trainvecs = trainsub.values
trainlabels = train['Survived'].values
Now let's train a random forest classifier on our data, followed by a five-fold cross-validation to measure our accuracy:
from sklearn.model_selection import cross_val_score as cvs
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=4, random_state=0)
scores = cvs(rfc, trainvecs, trainlabels, cv=5)
print(scores.mean())  # mean accuracy across the five folds
The model scores 81% mean accuracy. Let's serialize the model.
# sklearn.externals.joblib was removed in scikit-learn 0.23; on newer releases use `import joblib`
from sklearn.externals import joblib
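The serialize-and-reload round trip can be sketched with the standard library's pickle module standing in for joblib, which follows the same dump and load pattern. Everything in this sketch is an assumption made for illustration: the dictionary is a hypothetical placeholder for a trained model, and the file name model.pkl is not taken from the package.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: a plain dictionary of parameters
model = {"max_depth": 4, "random_state": 0, "feature_weights": [0.25, 0.5, 0.25]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")  # assumed file name, for illustration only

    # Serialize the object to disk
    with open(path, "wb") as handle:
        pickle.dump(model, handle)

    # Deserialize it into a new, independent object
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == model, restored is model)
```

The restored object is equal to the original but is a separate copy, which is what lets a collaborator on another machine work with the same model.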
We can now add the serialized model to a Quilt package so that collaborators can replicate our experiment with both the training data and trained model. For simplicity, the titanic sub-package already contains our trained random forest model. You can load the model as follows:
from sklearn.externals import joblib
model = joblib.load(pb.titanic.model_pkl2())
# requires scikit-learn version 0.19.1
To verify that it's the same model we trained above, repeat the cross-validation:
scores = cvs(model, trainvecs, trainlabels, cv=5)
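The five-fold cross-validation used above can be pictured as partitioning the rows into five folds and holding out each fold once for testing. The sketch below is a simplified illustration with a hypothetical helper, k_fold_indices, and plain contiguous folds; for classifiers, scikit-learn uses stratified folds that preserve class proportions, which this sketch does not attempt.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds and yield (train, test) index lists."""
    # Spread the remainder over the first folds so that sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(12, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every row is used for testing exactly once and for training in the other four rounds, which is why the mean of the five scores is a fairer accuracy estimate than a single train/test split.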
Expressing data dependencies
Oftentimes a single Jupyter notebook depends on multiple data packages. We can express data dependencies in a quilt.yml as follows:
quilt.yml is like requirements.txt, but for data. As a result of using quilt.yml, your code repository remains small and fast.
quilt.yml accompanies your Jupyter notebook files so that anyone who wants to reproduce your notebooks can type quilt install in Terminal and get to work.
We demonstrated how Quilt works in conjunction with Domino's Reproducibility Engine to make Jupyter notebooks portable and reproducible for machine learning. Quilt's Community Edition is powered by an open source core. Code contributors are welcome.