At the recent Strata Data event in NYC, Paige Roberts of Syncsort had a chance to sit down and speak with Josh Poduska, the Chief Data Science Officer at Domino Data Lab. In part one, Roberts and Poduska discussed data science maturity levels and how to make data science more achievable and practical.
Part two focuses on lineage, both data and model lineage for auditability, and how to tackle both data and model bias.
Roberts: At Syncsort we approach it from the other side, from the data side, rather than the model side. You mentioned tracking model lineage. One of the things we’ve been focused on a lot in the last few years is trying to get the data lineage tracked, for a similar purpose, so you can tell where the data came from, what changes were made, who touched it, and what they did. When you’re tracking model lineage, what are you tracking exactly?
Poduska: I think the way to answer that is to think about what makes a project reproducible. In science in general, there’s a problem of being able to reproduce results. When we think about data science, what does it take to reproduce a data science project? You have to know the hardware it was run on. You have to know the software that was used, including the environment, from the version of the operating system down to the packages and tools of the data science software, whether that’s Python or R, or anything else.
What Domino does is automatically capture all of that inside its data science platform. It allows you to share that code, those results, and those runs of the experiments with colleagues, and it allows those colleagues to collaborate with you, all using the exact same environment so that you can reproduce the results.
Underneath the hood we have Docker and Kubernetes that we’re using to help do this, and we can leverage any web-based UI for data science. Python scripting, R scripting in RStudio, Jupyter notebooks, Zeppelin notebooks, H2O, DataRobot, SAS has a web UI version that we partner with … Really, we’re open to anybody who can play in that space.
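As an illustration of the kind of details Poduska lists (hardware, operating system, language version, package versions), a minimal Python sketch of an environment snapshot might look like the following. This is not Domino's implementation, and the function name `capture_environment` and the layout of the snapshot are hypothetical; it uses only the Python standard library.

```python
import json
import platform
from importlib import metadata


def capture_environment():
    """Hypothetical helper: describe the machine and software a run used."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata has no name
            packages[name] = dist.version
    return {
        "machine": platform.machine(),        # hardware architecture
        "processor": platform.processor(),    # may be empty on some systems
        "os": platform.system(),
        "os_release": platform.release(),
        "python_version": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Store this snapshot next to a run's results so the run can be retraced.
    print(json.dumps(capture_environment(), indent=2))
```

Saving a snapshot like this alongside each experiment's code and results is one simple way to record what a later reproduction would need to match.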
Roberts: Okay. So if it tracks everything that was done while I was training the model, when I go to put it in production, I know what I have to do to get it there. Is that the idea?
Poduska: Yes. Additionally, it helps if you ever have to backtrack, retrace your steps, and explain why your model does what it does. Did you account for this data, or that data? Did you consider this scenario? You can look back through Domino’s project structure and see the discussions that were had, because it captures discussions between data scientists. It captures the collaborative work that was done and the intermediate results, even if those intermediate results were run on a different version of the software or a different operating system.
Domino captures all of that, and you can very quickly spin up a Docker container, for example, which will allow you to see what was done at that step. It’s really a breadcrumb trail of everything that is happening, which enables data scientists to work faster and smarter. They can build on prior art, on what other data scientists are doing. Another added benefit of all this is for data science managers and leaders. They can now see the 30,000-foot overview of the trajectory of their team. For example, one large institution realized, after adopting Domino, that it had two groups working on the exact same project in isolation.
Roberts: Oh, no! [laughing]
Poduska: They had no idea until they brought it all together in one place.
Roberts: I can see where that’s a huge benefit. You get project acceleration. Redundancy elimination. You get that feedback loop.
Bias in data science is a big problem that people are running up against more and more. People are realizing that your data set can be biased, your model can be biased, or your whole question can be biased if you’re not careful. With this model lineage, can you see what you’ve done along the way to tackle that?
Poduska: Those who are thinking about “model bias” build different steps into the process, whether in data preparation or model building. Domino can help you track that along the way, so it definitely helps. It’s still up to the user, though. You can build a model that’s biased in Domino just like you could with any other tool.
We feel that technology can help enable some of these best practices. It’s not going to be a silver bullet.
Roberts: Yeah, people have to act responsibly with data science. So, I guess one of the things we at Syncsort pay a lot of attention to is the data preparation aspect, particularly the data engineer’s job of getting production data pipelines up and running. Is there any data preparation aspect to the Domino platform?
Poduska: One of the reasons data scientists love using Domino is that you can bring in any of your favorite tools, and they work right inside the platform. Whether you’re doing some type of automated data preparation, maybe with H2O or DataRobot, or whether you have a favorite package in Python, Spark, or R that you want to use, all of that is accessible right inside the same platform. So, you can switch back and forth between hardware and software tooling, and still have the project view in one place, with all the steps and details that you have worked through viewable and searchable in one location.
Roberts: That’s pretty cool.
Poduska: Yeah! The reason I came to Domino is I saw the vision of where the company is trying to go. And it’s definitely the right time for the data science industry to start growing up a little bit, and thinking like this. Domino gives data science teams what they want: agility, parallel execution, and solutions to DevOps problems. It also enables management to track what the team is doing and what the ROI is.
Roberts: So, is there anything coming up that you’d like to let the blog readers know about?
Poduska: Yes, actually. Domino has what we call “Data Science Pop-ups.” They can be thought of as small conferences geared specifically for data science practitioners and data science leaders. Half of the day is for practitioners, with hands-on workshops, and the other half is for leaders, getting into the specific challenges that face data science leaders today. There was one on September 19 in Austin. The next ones will be in Chicago, New York, and Boston. Next year, we’re planning one in London.
Roberts: Sounds very cool. Maybe I can make it to one.
Thanks for the conversation.
Any time. I’m glad we at least got to have lunch. It’s good to see you again.