Data Science Leaders | Episode 33 | 38:04 | January 11, 2022
Supply Chain Solutions and the Role of the ML Engineer
When highly disruptive events like the COVID-19 pandemic occur, data science teams may have to throw historical data out the window. Models trained on what happened in the past simply don’t work in a radically different present.
In this episode, Karin Chu, VP Data Science and Digital Analytics at Peapod Digital Labs, discusses how her team is tackling that challenge head on, particularly as the global supply chain crisis impacts sectors from grocery to apparel.
Plus, she explains why two things are so vital to the success of a data science team: ML engineers and a culture of communication.
- How data science teams are navigating the supply chain crisis
- The vital role of an ML engineer
- Tips for communicating about data science in business
Welcome to another episode of the Data Science Leaders Podcast. I am your host, Dave Cole. Today's guest is Karin Chu. Karin is the VP of Data Science and Digital Analytics at Peapod Digital Labs. She holds a PhD in statistics from Texas A&M. Welcome Karin!
Thank you and hi, Dave.
It's great to be here. Today’s topic is global supply chain challenges brought about by COVID, and how that has created an opportunity for your team, since it has been a focus area of yours.
The second topic is diving into the role of the ML engineer. Why do I need to have an ML engineer on my team? What does an ML engineer do? What does the career path look like for an ML engineer etc.?
We talk a lot, in the Data Science Leaders podcast, about the importance of communication for data scientists and data science leaders. I know you have a unique perspective there.
Before we start diving into the agenda topics maybe you could tell me a little bit about Peapod Digital Labs. What is Peapod?
Peapod is the ecommerce and digital arm of Ahold Delhaize. In the US it may be foreign to a lot of consumers, but Ahold Delhaize is one of the world's largest grocers, based out of the Netherlands. In the US, our stores are mostly on the east coast, so your east coast listeners will have heard of brands like Giant, Hannaford, Food Lion, Stop & Shop and, most recently, the acquisition of FreshDirect in New York. Peapod Digital Labs, or PDL, enables both online and omnichannel experiences for our grocery customers.
My role as VP of the Data Science and Digital Analytics team for PDL is to oversee the process of building the algorithms for customers' online experiences, as well as all things related to Google Analytics: the tagging, the tracking of the data, monitoring where users go and what kind of experiences they are having. Then, of course, the ecommerce and loyalty metrics. It is basically the end-to-end user experience from an analytical perspective.
Got it. Do you do that across the various grocery store chains?
Very interesting. I imagine, given the nature of your business, that global supply chain challenges have introduced opportunities for your team to apply data science to some of them. Is that true?
It absolutely is. My training is in statistics; I am a classically trained statistician. Everything we do is based on historical data: what kind of patterns did we see last year, last month, last decade? We then layer seasonality over that. What happened with COVID is that none of those assumptions held anymore. This is a once-in-a-lifetime kind of event. We don't have any historical data to look back on. We don't have any seasonality to speak of, really, to feed our algorithms. That is important because all the models and machine learning algorithms that we built in the past don't work anymore. In statistics there's a whole science called Bayesian statistics, which is about looking at your priors. That doesn't work here.
Yeah. No priors.
So what do you do? There are a few methods to manage supply chain issues. Traditionally you have the standard supply chain management approaches via operations research: finding the optimal path that will minimize your costs. That's one approach. There are multiple assumptions with that as well, which no longer hold, either, due to COVID. That's one challenge. The other approach is your very classical model. Your demand and forecast, time series models, simple regression models. Those, again, all depend on your historical data. Those also no longer hold, due to COVID. You can still apply them but there's going to be a lot of variability in your forecast, to the point that it causes doubt over the accuracy of your numbers.
In the absence of all that, what's the alternative? From a tactical standpoint, there's a whole family of algorithms or logic called recommendation systems. Recommendation engines are really about looking at whether persons A and B are very similar in terms of their shopping behavior, or in terms of where they live, their preferences and whatnot. If person A tends to buy items one to three, then person B is likely to also buy items one to three, to a certain extent. If certain items are out of stock, are there other items that we could recommend to this particular customer based on what others like them have purchased in the past, or based on other items that are similar to the ones that are out of stock? Building recommendation systems in the absence of having historical data is something that we are relying quite heavily on, not just in the grocery business, but other industries like fashion retail. Anytime something is out of stock you are able to come up with recommendations that could replace that.
The customer will then give you feedback: whether or not they purchase it, whether or not they like it. From there you are able to update your recommendations, all in real time. In a perfect world you would have the supply chain fixed, with your optimal delivery path and minimized cost, then layer on the forecast of how much inventory to keep on hand, and then all the recommendations: similar items, brands or people. In the absence of all that, there are still tools that can help manage the process without relying on that prior knowledge being in place. In this environment, it may behoove organizations to look into that type of approach.
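As a rough illustration of the item-similarity idea Karin describes, here is a minimal sketch of item-based recommendations for out-of-stock substitution. Everything in it (the toy purchase matrix, cosine similarity as the similarity measure) is a hypothetical example, not her team's actual system:

```python
import numpy as np

def item_similarity(purchases):
    """Cosine similarity between items, given a customers-x-items
    purchase matrix (rows: customers, columns: items)."""
    norms = np.linalg.norm(purchases, axis=0)
    norms[norms == 0] = 1.0  # avoid divide-by-zero for never-bought items
    normalized = purchases / norms
    return normalized.T @ normalized

def substitutes(sim, item, k=2):
    """Top-k most similar items to `item`, excluding itself."""
    order = np.argsort(sim[item])[::-1]
    return [i for i in order if i != item][:k]

# Toy purchase matrix: 4 customers x 3 items. Items 0 and 1 are
# bought by the same customers, so they look like substitutes.
purchases = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
], dtype=float)

sim = item_similarity(purchases)
print(substitutes(sim, item=0, k=1))  # item 1 is the closest substitute
```

Because the similarity comes from how shoppers co-purchase items rather than from a demand forecast, it keeps working even when historical demand patterns have broken down.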
Hopefully COVID is a once in a lifetime event. Focusing on your industry, just looking at the way it has impacted folks going to the grocery store, it has changed what people purchase from the early days of COVID: tissues, hand sanitizers and things like that. More recently, as the economy has started to improve and bounce back, although not everybody has, there's this big demand for humans to help with that supply chain. The demand has increased faster than we can get some of those folks back to work and that has created these supply chain disruptions. When you say that we don't have the historical data, are you talking about the unique elements of demand for the types of things that people are purchasing being different from what they have purchased historically? Is it more on the supply chain side where goods that normally would've been very easy to purchase to meet the demand, that's no longer the case and you have to find other, similar products and put those in front of consumers?
Excellent point. It's actually both. The products that inherently would be available are no longer available for whatever reason. That's on the supply side. On the demand side, consumer behaviors have also changed. In the early days of COVID, like you mentioned, people were going out there buying tons of toilet paper. That was unheard of prior to that. How you incorporate that kind of phenomenon into your algorithm becomes a real challenge. I would say that there are no priors because your customer behavior, people's behavior, has changed, as well as the amount of supply that is available. On top of that, the economy has changed. A lot of people are no longer working because of COVID. How do you factor that into the equation? Then you have the government stimulus coming in. That, too, at different time points. And that's no longer part of your regular pay cycle. How do you bake that into the equation?
I would say, as a statistician, there are a lot of different ways that we can look at the data. Perhaps running a lot more simulations, Monte Carlo simulations, for example, putting in a lot more probabilities. These are real challenges. I think we're in a place in our science where this is something we've not encountered in the past.
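The Monte Carlo idea mentioned above can be sketched minimally. Everything here (the demand distribution, the disruption probability, the spike factor) is a hypothetical illustration of the technique, not Karin's actual model:

```python
import random

def simulate_demand(base_demand, disruption_prob, spike_factor,
                    n_days, n_runs, seed=0):
    """Monte Carlo sketch: simulate daily demand where each day has some
    probability of a disruption-driven spike, and return the 95th
    percentile of total demand across simulated periods."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_runs):
        total = 0.0
        for _ in range(n_days):
            # ordinary day: noisy demand around the baseline
            demand = rng.gauss(base_demand, base_demand * 0.1)
            if rng.random() < disruption_prob:
                demand *= spike_factor  # panic-buying style spike
            total += max(demand, 0.0)
        totals.append(total)
    totals.sort()
    # plan inventory to the 95th percentile of simulated demand
    return totals[int(0.95 * n_runs)]

p95 = simulate_demand(base_demand=100, disruption_prob=0.05,
                      spike_factor=4.0, n_days=7, n_runs=5000)
print(round(p95))  # an inventory level covering 95% of simulated weeks
```

Instead of extrapolating from a single historical trend, the simulation makes the uncertainty explicit and lets you plan against a distribution of outcomes.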
One thing, though, we do have is a vast amount of data that we're collecting today. It will be very interesting to see how we, as a community, manage and treat that data, say, five and even 10 years from now. How do we handle this data? Do we smooth it out? Do we incorporate it, set it as a new baseline? What are we going to do with that? I think we're still living through that process right now but it'll be very interesting and useful to see how this data will be applied for our future studies.
Yeah. My vote is to exclude that data altogether, just so we can wipe this entire last couple years from memory.
That's right. It didn't exist.
Yeah, exactly. In all seriousness, though, I have to imagine that there are probably some lessons learned here that might be applicable when—hopefully—things get back to normal. Do you agree?
I mean, are there some things that you're doing today, thanks to COVID, that would've been very helpful if you had a new sort of recommendation engine that suggests different products, pre-COVID? Does that make sense?
It makes sense. There are a couple of things. One is that it definitely expedited the growth of ecommerce. Prior to this, and this applies to many industries, there was still a lot of hedging around our algorithms: maybe we put this one into production, maybe we stick with our business logic. COVID threw that out the window. In this environment, ecommerce will become a focal point of our economy in more and more scenarios, and if that's the case, we need to be more prepared. Basically, our growth and maturity in ecommerce skyrocketed within the past two years. I guess it has been two years since the initial COVID lockdown. So there is that aspect.
On top of that, how do you manage all the data that's coming in? I'm sure the audience is familiar with Google Colab, Python notebooks and the like. You can't just run all these analyses in your notebook and expect to go live with them. There's a whole process of building the algorithm and then deploying it. This really taught us how to go from running an algorithm in a simple notebook to deploying it in real time, because the entire landscape is constantly evolving. One of the biggest learnings for the industry, and this cuts across grocery, banking, retail and so on, is that we had to become very mature and very stable as a platform very quickly.
There is a humongous lesson learned from that experience that will carry us for many years, I hope. Also, it's just that mentality. We need to be much more agile, much more flexible, in terms of what we need to have in place, who our key players are, where the data comes from. I could go on and on about that. This has definitely moved us forward by leaps and bounds from where we were two years ago.
That all makes perfect sense. It's not just the behavior, what the customers are buying, that has changed, but also how they're buying. They're buying their groceries now through a website and having them delivered, thanks to COVID, a lot more often, I imagine, than pre-COVID. Because more buyers have moved online, that's an opportunity for ML to play an outsized role.
This is a good segue into our next topic, which is talking about ML engineers and what role they've potentially played in helping roll these models out very quickly and in a scalable way. What is an ML engineer?
An ML engineer, I would say, is a role that's fairly new to most industries. I've been doing data science for quite some time, and I remember back in the day when you built an algorithm: you would build a model, which generated some coefficients or weights, and you would pass those weights over to your tech team to put into production. The model ran, everything got scored and out came the results. That's it, pretty straightforward. Every now and then you'd update your weights and your model and call it a day. That's not the case in today's tech environment.
Models are becoming much more dynamic and complicated. If you're building a recommendation system, there are a whole bunch of ways to do it. A whole bunch of features that need to go in there and you have to marry multiple data sources. You've got your online click stream data, product data, shopper data. You have to be able to combine all that data, put that all into the model and then generate the output. At the end of the day, that model needs to get deployed. Deployment no longer means sending these coefficients over to the tech team for them to implement. It's no longer that simple because you're talking about hundreds of weights, at least.
There's got to be a way in which the model can be deployed such that, on the front end, if a user comes to our website and clicks on something, off to the side we recommend items that go with it. If they click on strawberry jam, maybe we recommend peanut butter, bread or butter, so you get complementary items off to the side. For that, there's a lot of logic, coupled with business rules, kicking off in the back end. The process of packaging the data science algorithms and putting them into production, such that the model kicks in to receive all that data and generate the output, is more or less the role of the ML engineer. I know I'm simplifying it. An ML engineer basically automates the ML algorithms or models.
It's not DevOps, because DevOps engineers don't typically deal with data, whereas the ML engineer does. They take the machine learning algorithms and package them so they're automatically deployed, trained, scaled and monitored: everything that is wrapped around deploying the algorithm.
I know you didn't ask this, but there is another piece to this, which would be the data engineer. In your modern ML algorithm deployment environment you do have the data engineer on the front end that takes in all the data from all these different sources (metadata, click stream data, etc.) then cleans them, preps and crunches them, creates features with them and then passes those over to the machine learning data scientists who built the algorithm. The algorithm is then passed over to the ML engineer for deployment. Those are kind of the three pillars in terms of deploying the algorithms.
If you're working with your web team, recommending that they leverage a model your team has built, say behind a REST API, I have to imagine they're going to raise their eyebrows. This is in the context of sending in those attributes and spitting back out a list of products to recommend alongside whatever product the customer is viewing. I imagine their response would be that they have strict latency requirements: sub-second response times for users, page load times to be concerned with. A data scientist might not be an expert in standing up an API, making it performant and making it scale to the tens of thousands, if not millions, of users hitting the site.
It sounds to me, at least in that real-time use case, that's where an ML engineer comes into play. A lot of folks want to talk about real-time models. I'd argue that most machine learning models, that are put into production, are more batch than they are real-time, in most companies. I could be wrong but that has just been my experience. Do you see an ML engineer playing a role in those batch models or do you think, actually, a data engineer plays a role there?
I think the ML engineer would still play a role in that world. Think about whether the model is being called in real time and scored in real time. For example, user experience on site. If you come to the website and you click on Item A, and then because you clicked on that, pick out an apple or a shirt, depending on what industry you're talking about, it leads you to a different site, a different page. Basically, based on your experience, your customer journey will be different than another customer. That would be more of a real-time response. So in that case, obviously, an ML engineer and also DevOps will play critical roles in mapping that out. However, if it's something like theoretically running recommendation systems or search algorithms, with the items already scored and put into a table somewhere, your model would then basically be querying a table.
So to your point then, in that case, is it still an ML engineer who's doing that job or can a data engineer do that? I would say the ML engineer, in this case, would be responsible for the training, maintenance and upkeep of that model and its results. That person needs to make sure, once the model is deployed, that the tables are being populated accordingly and that the results are accurate. In that case, it will be a blend between the data engineer, who's managing the database and data lake environment, and the ML engineer, who's ensuring that the model is being trained on time and monitoring any issues—performance monitoring. I would say that would be a blend.
In most cases, though, I would say, yes, you could run the model as a batch and write the results as a batch, but then it depends on how often you want to update them. I've been in companies where some databases are updated on a daily basis, some on an hourly basis. Then there are other types of models where you generate a pickle file. The pickle file is put somewhere in the environment such that you're constantly calling and scoring it in real time. The scores are generated, used, and then thrown away; there's no persistence to them. It all depends on how you want to track it. There are multiple ways, and it depends on the use case. There is one thing I do want to go back on, because you brought up a very interesting point: that the data scientist's role is building the algorithms. Looking back at my career across industries, I started out in banking, then moved to healthcare and then into retail.
In the early days of machine learning, when these algorithms were being built, many times they were not actually deployed, unfortunately. The reason is that the business assumed the data scientists, the modelers, had both roles: building the model and then deploying it. But data scientists are not familiar with how to actually package this up. You're building software, basically, and how do you do that? It becomes a very complicated process. Their subject matter expertise is not on the deployment side; it is in building the logic, the statistical logic. Some of them are also very familiar with the data and the data management part, but the deployment was trickier. As a result, a lot of algorithms never got deployed.
When they're not able to go into production, you have someone manually babysitting them, for lack of a better way of saying it. As a result, the benefit and business value of those algorithms are underutilized. As time progressed, more and more companies realized they had all this wealth of data sitting there and wondered how to capitalize on it. That's when the thought set in: let's bring in a data scientist. But the algorithms did not get deployed. What was the issue? Through this trial and error, companies are now beginning to realize that there's a whole separate class of practitioners, ML engineers, that they need on staff to get these models into production. It's one of the evolutions of our science, which is all good, but it was a very interesting learning process, in my opinion, to get where we are today.
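The lightweight deployment pattern Karin mentions earlier, serializing a model to a pickle file and scoring it in real time without persisting the scores, can be sketched as follows. The model class, its weights and the file path are all hypothetical stand-ins; in practice the pickled object would be a trained estimator:

```python
import os
import pickle
import tempfile

class ToyModel:
    """Stand-in for a trained model. In a real system this would be a
    fitted scikit-learn estimator or similar; the weights are made up."""
    weights = [0.5, 1.5, -0.2]

    def predict(self, features):
        # score = weighted sum of each row's features
        return [sum(w * f for w, f in zip(self.weights, row))
                for row in features]

# "Deploy": serialize the trained model once to a known location...
path = os.path.join(tempfile.gettempdir(), "model.pkl")
with open(path, "wb") as fh:
    pickle.dump(ToyModel(), fh)

# ...then, in the serving process, load it and score each request as it
# arrives. The score is returned and thrown away; nothing is persisted.
with open(path, "rb") as fh:
    model = pickle.load(fh)

request_features = [[1.0, 2.0, 3.0]]  # features for one incoming request
print(model.predict(request_features))  # a score close to 2.9
```

This is the "no persistence" case from the conversation: unlike the batch-scored table, scores exist only for the lifetime of the request.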
This is a good thing, because the market is realizing that getting models into production requires a dedicated role carved out for it, right? There are so many models that need to make their way into production that there needs to be that new role.
That's a good thing for all data scientists out there. It's also a recognition that data scientists are good at certain things but maybe not at others. One thing that you clarified for me, which is great, is the role of the ML engineer. It's not just about deploying the model and working with, in this case, the web team to make sure the model is performant. It's also about monitoring the model, making sure there's no drift, and that it's as accurate as it was when it was first trained.
It's about being responsible for the maintenance of the model in terms of retraining it. I imagine there is a very strong handoff between the data scientist and the ML engineer, such that they can take on that role. And over time, as you have more and more models in production, I imagine your ML engineering team grows too. My last question on the ML engineering topic: where are you finding these folks? What do their backgrounds look like? What do their career paths look like?
I built several data science teams throughout my career. The one I'm leading right now is my fourth data science team. I built the teams through a lot of trial and error earlier on.
I would say the best operating model, from what I've seen and experienced so far, is having these roles under the same umbrella. I think everyone knows what it takes to be a successful data scientist: to be a successful machine learning modeler, you need to know your stats. You need to know programming: Python, R, Spark, you name it. You also need to know how to handle data. To be a successful ML engineer, in my opinion, you also need to know a little bit about the algorithms, because at the end of the day the handoff between the modeling team and the engineering team is, "Here's my model, help me deploy it."
You can't just take a model without understanding the inputs you should be getting and the outputs you should be seeing. There needs to be some understanding, or appreciation, of what the algorithm does and how it was built, such that if something breaks during training or performance monitoring, the engineer knows what to look for and can triage or potentially fix it. If the actual foundation, the model itself, breaks, then it has to go back to the data scientists. But when it doesn't, and most of the time it won't, it's really up to the engineer. For the engineer to be able to troubleshoot, they need to understand how the model works.
The engineer needs to have, obviously, some basic understanding of how the algorithm works and, of course, the data component and programming of it, and how to link that up to the front end, building the API, building the software, if need be. It’s heavier on the engineering side than the data science side, but they definitely need to have some data science background. At the end of the day, they are very connected.
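The performance monitoring described above is often implemented with a drift statistic comparing production scores against a training-time baseline. A common choice is the Population Stability Index; here is a minimal sketch (the 0.2 alert threshold is a widely used rule of thumb, not something from this conversation, and thresholds vary by team):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a training-time score sample
    (`expected`) and a production score sample (`actual`)."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            i = min(int((x - lo) / width), bins - 1)
            counts[i] += 1
        # small floor so empty bins don't blow up the log term
        return [max(c / len(sample), 1e-4) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Scores at training time vs. scores seen in production today
train_scores = [i / 100 for i in range(100)]        # roughly uniform 0..1
prod_scores = [0.8 + i / 500 for i in range(100)]   # shifted high: drift
print(psi(train_scores, prod_scores) > 0.2)  # True: flag for retraining
```

A check like this, run on a schedule against each deployed model's recent scores, is one concrete way an ML engineer can catch drift before accuracy visibly degrades.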
Got it. Great. Let's move on to the last topic: the importance of communication as a data science leader. You have a bit of a unique story in terms of how you got into the world of statistics and data science. I'd love for you to share that with our audience.
I came to the US when I was pretty young. I was a tween, so pre-teen, and I didn't speak any English. I only knew two words, one of which I can't share with your audience. The only thing that was universal, that didn't require understanding the language, was math. I was an average math student back in Taiwan, but I came to the US and was a stellar math student here. This was in sixth grade. I stuck with math because I did quite well and I liked it, and it didn't require communication. As I progressed through high school, I had to learn the language. I also ended up tutoring a lot of folks through undergrad. I had to learn how to explain sine and cosine and why they needed to know this stuff. Complex numbers: what the heck are those? How can numbers be negative? I had to explain all these concepts to the students.
It became very clear to me that communication is very important. That led me down the path of finishing my math degree and deciding what came next. I decided to go into statistics, which is basically applied math in a more probabilistic realm. I got my PhD in statistics from Texas A&M, and through that entire journey I continued to teach classes, give talks and so on. What I walked away with after my PhD was the knowledge of how to build for practical use cases in the real world. While I enjoyed doing stats and building models, what I really enjoyed was putting the models to use.
In order to do that well, I needed to learn how to communicate. If no one understands why I'm building certain things, or I can't relay the value of my model, that's not what I was looking for. I stayed in industry and did more education and training. The longer I've been doing this, the more importance I place on being able to communicate. One of the biggest challenges I have seen throughout my career, and I've been doing this for a long time, is that businesses know the value of data. They know there is value in using machine learning algorithms, but how do you connect these algorithms to day-to-day business value?
That's a real art, and a challenge in itself. If you have a team that's focused exclusively on building the most accurate predictive algorithms, you really have to ask them: what is the value? What is the ROI? What is the conversion rate? What's the add-to-cart rate? What's the lift? That's where we need to be heading: measuring the actual value. You can build the most accurate model, but how much revenue does it drive? Those are not easy questions, and they're especially tough for a data scientist who's focused on building the models, but they need to be addressed.
To make sure your models are addressing those business values in the most efficient and effective way, you need to understand the business. Otherwise your business is asking you one thing and you end up building something totally different. That's not sustainable. You need to understand the business. How do you do that? You can't understand the business by just sitting there, working and programming on your own. I’m not saying that's not a good thing, but there are different aspects to our roles. To understand the business, you need to be able to communicate. You need to be able to speak the business language.
If you're working in the fashion industry, you need to understand the SKUs and the styles. If you're working in the grocery business, again, you still need to understand the SKUs, but also the seasonality and things like that. Getting down to the business language, understanding what their needs are: all of this is wrapped around communication. Data scientists, I would say, in this day and age have multiple hats they need to wear, and being a good communicator is definitely one of them.
Definitely. Well, I learned a few things there. When you were talking about communication and talking to the business: ROI, lift, conversion rates, that's their language. They understand the importance of the models' outcomes, but talking about the how, and which algorithms you used, sometimes, not always, causes their eyes to glaze over.
Some of the better folks on the business side are interested in that. They're also very interested in the features, and which ones are primary contributors to the accuracy of the model and so on. That informs, potentially, the business decisions that they make down the road. The other thing I learned is that data science leaders out there need to be hiring from Taiwan.
That was definitely a very interesting experience growing up.
Yeah. I can only imagine, because having grown up in the US and not having to deal with coming into a foreign country and learning the language, clearly you learned a lot along the way. It sounded like you were giving back. You were teaching others, not just in statistics, but also those who had similar language barriers. That's also one way in which to hone your craft and skills, to teach others. I think sometimes that's the best way.
I had a blast.
Likewise, thank you.
Today we discussed global supply chain issues and how that's impacting your world, Karin. Also, we got to go deep on what it means to be an ML engineer, and why data science teams need them. Last, but not least, we talked about the importance of communication. Thank you so much for being on the Data Science Leaders podcast.
Thank you for having me.
If people want to reach out and chat with you, can they hit you up on LinkedIn?
Of course, anytime. Send me a note on LinkedIn. Happy to do that.
Well, thank you so much, Karin. It was a blast.
Thank you, Dave.
About the show
Data Science Leaders is a podcast for data science teams that are pushing the limits of what machine learning models can do at the world’s most impactful companies.
In each episode, host Dave Cole interviews a leader in data science. We’ll discuss how to build and enable data science teams, create scalable processes, collaborate cross-functionally, communicate with business stakeholders, and more.
Our conversations will be full of real stories, breakthrough strategies, and critical insights—all data points to build your own model for enterprise data science success.