Data Science Leaders | Episode 44 | 23:22 | March 29, 2022
Without a clearly defined methodology, complex projects with multiple technical and business stakeholders often fall apart. The risk is especially high when trying to scale data science work in an enterprise organization.
That’s why David Von Dollen, Head of AI at Volkswagen of America, integrated agile methodology with CRISP-DM to help his team navigate roadblocks and accelerate progress on the path to model deployment. He shares how this hybrid approach enables his team to be more strategic about project lifecycles, unlocking real business impact even faster.
Plus, David provides advice for building relationships with key business stakeholders and shares his philosophy on using the art of data science to benefit humanity.
Welcome to another episode of the Data Science Leaders podcast. I'm your host, Dave Cole. Today's guest is David Von Dollen. David is the Head of AI for the Americas at Volkswagen. David, how are you doing today?
I'm doing well. Thanks for having me.
We're going to be talking about CRISP-DM, an acronym. What is it? It's a process that David has made his own. I'd love to hear a little bit about that.
The second thing we're going to be talking about is building out horizontal relationships within the organization. We're going to be talking about David's philosophy there.
Last, but not least, we’ll talk about data science outside of work: taking your data science expertise and applying it to the world outside of Volkswagen. David's got some interesting takes there.
Why don't we start at the top? David, what is CRISP-DM?
CRISP-DM is a process and a methodology. It was conceived in 1996 by a consortium of companies that included SPSS, NCR, and Daimler-Benz. Essentially it allows business and data science teams to understand the different phases of data science projects. Originally it was for data mining, but you can extend it to data science, machine learning or AI, and it's cyclical, so you have different phases within CRISP-DM: a business understanding phase, a data understanding phase, a data preparation phase, modeling, evaluation, and then eventually deployment. You can use it to help the business understand the phases that a data science project might need to go through before reaching an eventual deployment, where you might need to establish some proof of concept, or test out some hypotheses and answer some experimental questions.
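As an illustration only (this is not any tooling from David's team, just a sketch of the structure he describes), the cyclical nature of the six phases can be captured in a few lines of Python:

```python
from enum import Enum

class Phase(Enum):
    """The six CRISP-DM phases, in their usual order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6

def next_phase(current: Phase, evaluation_passed: bool = True) -> Phase:
    """Advance through the cycle. A failed evaluation loops back to
    business understanding instead of proceeding to deployment, and
    deployment itself feeds learnings into the next iteration."""
    if current is Phase.EVALUATION and not evaluation_passed:
        return Phase.BUSINESS_UNDERSTANDING
    if current is Phase.DEPLOYMENT:
        return Phase.BUSINESS_UNDERSTANDING
    return Phase(current.value + 1)
```

The key detail the sketch encodes is that the process never terminates: deployment restarts the cycle rather than ending it.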
So it starts with pretty standard stuff, right, like understanding the business, making sure you understand the data, then actually doing some of the data preparation, modeling, evaluation, etc.
How have you made this process your own? Have you just taken this process as-is, or have you changed it up?
I mean, I'm not going to take full credit for it. I have to give credit to my colleagues, people I've worked with and the current AI team. It's a continual process of refinement. One way that we have hybridized it is by integrating it into an Agile process. As you develop a new feature or epic for a product or initiative, you can use these phases to organize story points under that epic, which you can then map to Agile sprint cycles. From there, through the feedback process, you can understand where you are in your project lifecycle and iterate to eventually move towards deployment.
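To make the epic/story mapping concrete, here is a hypothetical sketch; the epic name, story titles, point values and sprint numbers are invented for illustration, not taken from any real backlog:

```python
# Hypothetical Agile epic whose stories are tagged with CRISP-DM phases.
epic = {
    "name": "Churn prediction MVP",
    "stories": [
        {"title": "Interview retention team on churn definition",
         "phase": "business understanding", "points": 3, "sprint": 1},
        {"title": "Profile customer activity tables",
         "phase": "data understanding", "points": 5, "sprint": 1},
        {"title": "Build feature pipeline",
         "phase": "data preparation", "points": 8, "sprint": 2},
        {"title": "Train baseline model",
         "phase": "modeling", "points": 5, "sprint": 3},
        {"title": "Hold-out evaluation vs. business KPI",
         "phase": "evaluation", "points": 3, "sprint": 3},
        {"title": "Ship batch scoring job",
         "phase": "deployment", "points": 8, "sprint": 4},
    ],
}

def sprint_plan(epic: dict) -> dict:
    """Group (phase, points) pairs by sprint so the business can see
    which CRISP-DM phase each sprint is working through."""
    plan = {}
    for story in epic["stories"]:
        plan.setdefault(story["sprint"], []).append(
            (story["phase"], story["points"]))
    return plan

for sprint, items in sorted(sprint_plan(epic).items()):
    print(f"Sprint {sprint}: {items}")
```

Laid out this way, a sprint check-in can report progress in CRISP-DM terms ("we're in data preparation") rather than purely in story points.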
It seems like you've implemented a bit of a hybrid, right? I'm staring at the diagram here. It is a circle, right, as you might imagine. When you deploy the model, you learn new things. You want to end up adding additional features so that's where that cycle comes into play. Imagine if you have five, six or seven projects going on at any one point in time. Does the business understand where you are in each of the phases of all these various models that you're trying to build? Is that where it's helpful? Where does the business come into play in the various stages?
I think that by using an Agile methodology, you can set up a regular cadence with the business so that you have check-ins. You'll work towards sprint goals which are linked to larger milestones. I think in a data science context, those milestones link to certain questions that you might ask of your data set. You might have a hypothesis or the business might want to build a model around the data, or make some predictions. One of the first things you can do is test those assumptions using this process.
It also helps people to understand that it's not necessarily linear, but more cyclical. The dependencies between different phases can cycle back and forth, right? You can go from understanding the business problem to asking a question in response to that, then exploring your data and generating some descriptive statistics around it. That might refine your understanding of the business problem before you reformulate the question, so it can cycle around.
I like to tie it to Agile because you want the cycle to move you towards a deployment. You don't want to stay in this cycle forever. With that in mind, I think it can help you to understand where you're at in your project as well as iterate your MVP to get closer and closer towards that deployment.
What I do know about CRISP-DM is that, while the diagram shows the various phases, it does look waterfall-esque. The thinking behind it is that as you understand the requirements, look at the data and prepare it, you can go back to the business before trying to move on with missing information. You can say, "Hey, your hypothesis here is missing a feature. It is going to prevent us from moving forward in this project. We need to revisit how we're going to attack this problem. How are we going to build this model?"
That makes a lot of sense to me. It seems like a mix of understanding where you are in the process, what step comes in next and who’s responsible for owning these various stages in CRISP-DM. Having said this, there is some practicality to it. Do I have that right?
Yeah, absolutely. Another important feature is that it allows a data scientist to really track what has been done before. Often in projects you can get caught up in these loops and lose track of what you're doing. Maybe you tested your model with certain hyperparameters and got a certain set of results. Then you might find yourself going around in circles. Integrating it with Agile allows you to document it, so you can understand and reflect on the whole timeline of the project, how certain things evolved and what you might have tried in the past that worked or not. Then you can build on top of that.
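David's point about documenting runs so the team doesn't circle back to configurations it has already tried can be illustrated with a toy experiment log. This is a sketch, not a stand-in for real tracking tools; the class and method names are invented:

```python
import json

class ExperimentLog:
    """Tiny illustrative experiment tracker: record each hyperparameter
    configuration and its score so past work stays visible."""

    def __init__(self):
        self._runs = []

    def record(self, params: dict, score: float) -> None:
        self._runs.append({"params": params, "score": score})

    def already_tried(self, params: dict) -> bool:
        # Serialize with sorted keys so equivalent configs compare equal
        # regardless of the order the keys were written in.
        key = json.dumps(params, sort_keys=True)
        return any(json.dumps(r["params"], sort_keys=True) == key
                   for r in self._runs)

    def best(self) -> dict:
        return max(self._runs, key=lambda r: r["score"])

log = ExperimentLog()
log.record({"lr": 0.1, "depth": 3}, score=0.81)
log.record({"lr": 0.01, "depth": 5}, score=0.87)
print(log.already_tried({"depth": 3, "lr": 0.1}))  # True: no need to rerun
```

Even something this small answers the two questions that stop a team from going in circles: "have we tried this before?" and "what was our best result so far?"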
Great. How about the roles on your team? Do you have a single data scientist doing all of these things or are there various roles based on where you are in the process?
That's a great question. The way I see the formulation of a really crack data science team, a really top-notch team, is that you might have some software engineers helping build out all these systems, doing more data-centric AI or data engineering tasks: building out and maintaining different databases and repositories. This would all be in the effort to process that data and build pipelines to transform it, to do feature engineering and selection.
Then you might have more traditional data scientists who come from a statistics or machine learning, computer science background. Their bread and butter is building machine learning models and maybe R&D, working towards practical applications of that. You might have other people who are responsible for interfacing with the business or enabling a self-service model for descriptive statistics. You might also have people who work in data visualization. Maybe that goes hand-in-hand with the last case, people who can distill the business problem and technical solution, then translate that back in a way that's easily digestible.
Yeah, that's storytelling. What I hear you say is that you do have multiple roles on your team that engage in various stages. Knowing when you've gone from one stage to the next is sort of a tip-off to that team or individual that they need to get engaged. Does that sound about right?
Yeah. You can also have experts in different domain areas. You could have an expert in logistics and optimization, another in computer vision and one in natural language processing. You have all these different applications of AI which could potentially be owned by people on your team, or various teams. Within the team what I try to do as a data science leader is facilitate open information and knowledge sharing. We're all pretty much pushing towards the same goal. If you look at the different pillars of AI, natural language processing and computer vision, there's so much research that has been and is currently being done. Just sharing that information amongst the group is also a key point that I try to support as a leader.
Do you have any advice to fellow data leaders out there around the best ways to share knowledge?
It can be as informal as writing a blog post or sharing a Medium article in Slack. It can be facilitating lunch-and-learns, where you might be investigating a new technology, people might be talking about a new paper, there's a new algorithm that's been developed, etc. Just facilitating those kinds of conversations keeps things interesting for teams and makes people feel engaged.
With the pace of innovation going on in the world of data science, if you're not carving time out to do those things, you can be left behind.
Switching gears here, I'd like to move on to our next topic, talking about building out horizontal relationships in the organization. How do you view your key stakeholders? What advice do you have in building out those relationships?
The first thing is trying to understand their pain, right? Where are the bottlenecks? Where are the KPIs that they're measuring their performance against, that could be improved through data science, machine learning or data? Once you have an understanding in that domain and you can set up that communication with your stakeholder, it becomes a lot easier to figure out how to translate that pain into a potential solution. That way you can find these win-win scenarios. There are a lot of people thinking about how to push machine learning into production nowadays and thinking about productionizing AI.
At the end of the day you have to think about your end-user and how they are going to use your machine learning models or algorithms to optimize their process or a mundane task, or even drive some KPI that might be a source of pain for them.
Right. You have got to get to know them, understand the KPIs they care about and what bottlenecks they have. When you mention "win-win," I assume you're talking about potentially solving problems for them and for yourself or your team—something along those lines?
As you're building out those relationships, understanding what success is, what is your opinion on your role as a data science leader? Do you see yourself as essentially a shared service to the various parts of Volkswagen? Do you think or know how to potentially allow your users to self-serve? Do you maybe build out a platform that allows them to answer questions on their own? Where do you fall with your opinion here?
You could think about it from the different levels of analytics, right? You have descriptive analytics and statistics. At the bottom level: what is happening in my business? What are some statistics that I can track to understand the current state of the business? On the lowest level, I think organizations that set up self-service models just for that can become quicker, more agile. They can start tracking how to become more efficient. Another level up, you think about predictive analytics. That's where a data science team comes in because you use machine learning to augment that process. You're thinking a step ahead, right? What's happening today, but what's going to happen tomorrow? At the highest level, you have optimizing analytics, where you optimize off of your prediction.
When you think about things from a descriptive level, those are the things that you want to try and automate. With the machine learning component, it becomes a little trickier. I think we're entering an era where automated machine learning is becoming more prevalent in the enterprise. I don't think, as data science folks, we're ever going to automate ourselves out of a job, though. I think there's always going to be a need for people to test our assumptions and pipelines, and make sure the models are performing the way they should, monitoring for concept drift and that kind of thing.
That's kind of where I lie in regard to automated self-service ML. It's a great place to enable businesses, and I definitely agree with enabling self-service AI. It's good to also be skeptical and help people understand that your machine learning model is not always going to be 100% correct. Help them understand the trade-offs and risks when the machine learning model is wrong. How does that translate to the business value? That's where a data science team can come in and help.
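The concept-drift monitoring David mentions can be illustrated with a toy mean-shift check. This assumes a deliberately simple approach (comparing the mean of recent model scores against a reference window); production monitoring would use proper statistical tests per feature, which this sketch is not:

```python
from statistics import mean, stdev

def drift_alert(reference: list, recent: list,
                z_threshold: float = 3.0) -> bool:
    """Flag drift when the mean of recent values has shifted more than
    z_threshold standard errors away from the reference mean."""
    ref_mean = mean(reference)
    ref_sd = stdev(reference)
    # Standard error of the mean for a window of this size.
    se = ref_sd / len(recent) ** 0.5
    return abs(mean(recent) - ref_mean) > z_threshold * se

# Reference scores captured at training time, then two recent windows:
reference = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.53, 0.47]
stable = [0.49, 0.51, 0.50, 0.52]   # looks like the reference
shifted = [0.70, 0.72, 0.69, 0.71]  # population has moved

print(drift_alert(reference, stable))   # False
print(drift_alert(reference, shifted))  # True
```

The value of even a crude check like this is that it turns "the model is quietly getting worse" into an explicit alert the team can act on.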
I'm hearing you say that on the descriptive side, there have been solutions in place for self-serve that have been around for some time. When it comes to more predictive analytics, that's where you tread cautiously, handle it with care. It sounds like from a self-service standpoint, it's more of a partnership than just, "Here's the AutoML solution. Have at it. Build models on your own." It sounds like you would advocate for more of a partnership between your team and the business side of the house.
Yeah, I think so. When you're building a model, you might be approximating your data in a way that underfits or overfits it. There's that classic trade-off in data science: as machine learning practitioners, we're constantly trying to figure out how to build models that generalize well and are explainable. All those dimensions are things you want to help people understand instead of just sitting them down in front of a UI and saying, "Okay, here's some data. Build a model."
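The underfitting/overfitting trade-off David describes can be demonstrated with a toy comparison between a model that ignores the inputs entirely and one that memorizes the training set. This is a deliberately extreme illustration, not a realistic modeling workflow:

```python
import random

random.seed(0)

def noisy_sample(n):
    """Points where y = 2x plus Gaussian noise: the signal to capture."""
    return [(x, 2 * x + random.gauss(0, 1.0))
            for x in (random.uniform(0, 10) for _ in range(n))]

train, test = noisy_sample(30), noisy_sample(30)

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Underfitting: predict the global training mean, ignoring x entirely.
mean_y = sum(y for _, y in train) / len(train)
def underfit(x):
    return mean_y

# Overfitting: memorize training points (1-nearest-neighbour lookup).
def overfit(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

print(f"underfit: train={mse(underfit, train):.2f} test={mse(underfit, test):.2f}")
print(f"overfit:  train={mse(overfit, train):.2f} test={mse(overfit, test):.2f}")
```

The memorizer scores a perfect zero error on the training data yet a nonzero error on the test data: exactly the generalization gap a UI alone won't explain to a business user.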
Let's move on to the next topic, talking about data science outside of work. I know this is a passion of yours. Most of us who are passionate about the art of data science welcome the opportunity to potentially solve challenges outside of our workday. What is your philosophy on it and what have you done there?
I've kind of been in a constant state of professional development and personal development. I definitely believe in lifelong learning. There are so many different areas that data science overlaps with that I find really interesting.
One organization that I'd like to give a shout-out to is DataKind. I'm sure a lot of folks have heard about DataKind. I've never actually been directly involved with one of their projects but they're doing work that I think is really commendable. They use data science to understand global poverty, using time series to improve access to sanitary conditions and water. I think we're living in an age where we have access to so much information and systems to process that information. I think using our time towards benefiting humanity in better ways can only be a good thing, right? You can build up your AI superpowers but how are you going to help humanity with those AI superpowers? That's what I like to think about.
Those of you out there can certainly research DataKind. It looks very interesting, basically using data to help with various humanitarian issues: education, poverty, human rights. I think that all of us, whether we have the time or we need to make the time, can use our brain power to help. I certainly am fascinated by DataKind and I think it makes a lot of sense.
Well, David, I really enjoyed the conversation. Is there anything else you want to leave us with? Any pearls of wisdom?
Just make sure to always QA your stuff before pushing to production.
Yes. I think all of us have dealt with the pitfalls of not strictly QA’ing our models before they make their way into production. David, I have thoroughly enjoyed the conversation. If people want to reach out to you on LinkedIn and shoot you questions, is that okay?
Awesome. Well, David, I really appreciate it. Thanks for your time.
Thank you, Dave.
Data Science Leaders is a podcast for data science teams that are pushing the limits of what machine learning models can do at the world’s most impactful companies.
In each episode, host Dave Cole interviews a leader in data science. We'll discuss how to build and enable data science teams, create scalable processes, collaborate cross-functionally, communicate with business stakeholders, and more.
Our conversations will be full of real stories, breakthrough strategies, and critical insights—all data points to build your own model for enterprise data science success.