Data Science Leaders | Episode 40 | 26:29 | March 1, 2022
Getting to Ground Truth with Strategies from ML in Electronics Manufacturing
Many people assume that once you establish a manufacturing line, the hard work is done and things remain relatively static. The reality, especially in electronics manufacturing, is entirely different.
Constantly changing data streams and endlessly dynamic variables present some unique challenges for data scientists in the field. But there are lessons on data sharing, model adoption, and real-time impact that ML professionals in any field can learn from.
In this episode, Alon Malki, Senior Director of Data Science at NI (National Instruments), opens a window into the world of data science in electronics manufacturing. Plus, he shares why human-in-the-loop processes are essential to gaining buy-in for AI in the enterprise.
- Data science in electronics manufacturing
- Strategies for sharing data to improve manufacturing processes
- Human-in-the-loop applications
- Looking for challenge-motivated data science talent
Hello, welcome to another episode of the Data Science Leaders podcast. I'm your host, Dave Cole, and today's guest is Alon Malki. Alon is the Senior Director of Data Science at National Instruments. How are you doing today, Alon?
Great. I'm doing fine. Thanks for inviting me here.
Great. So Alon is based in Israel and what we're going to be talking a little bit about today is data science in the manufacturing of electronics, which is Alon's world here and what he's been doing for probably at least the last, I'd say 10 years, does that sound about right?
Quite close. Yes.
Yeah. Quite close. All right, so I'm pretty close. Something interesting that Alon has experience with is sharing data between parties in order to improve the overall manufacturing process. That is one of the things he's working on, and we're going to talk a little bit about why that's important. I think there are some universal data science issues around ground truth that Alon has a lot of experience with, at least in the world of electronics, that I think are interesting.
And then we're going to talk a little bit about building models in the lab versus production. If we have some time, we'll talk about human in the loop in data science.
So why don't we start at the top? So why don't you just ground us all, Alon, in what your world is. Talk a little bit about just the basics of manufacturing electronics, and then where data science plays a role.
Cool. The area we are working in is industrial IoT, where we manufacture semiconductors and electronics. OptimalPlus, which is the company that NI acquired, was in that field for about 15 years. The customers we serve are companies like NVIDIA, Qualcomm, practically all the leaders in semiconductors and electronics, Bosch, Continental, and companies like that. So, we have agents in the factories connected to the manufacturing machines and testing machines, or to databases and systems that collect data. We take the data directly from the sources. We have a layer of edge analytics on the factory floor that includes rules and models and predictions, etc., but we also stream the data to the cloud or to the customer's data center, where we have the big data analytics, because we stream data from many factories and keep long testing periods in the cloud.
This is where we actually bring the tools for big data analytics and machine learning, for building the models. This is where you build the model, where you have all the data and the full computing power. And then with a click of a button, you can push the model or the rule into the factory flow, so it can run in real time as part of the manufacturing line.
Got it. So these agents that you have in the manufacturing side, what exactly are they?
So it depends. On test machines in semiconductors, for example, we actually have an agent deployed on every type of tester. It runs on the tester itself, hooked into the operating system and the tester's software, and actually collects everything that happens during testing. In other areas, we can connect to the logs of a certain machine or to the APIs of a certain machine. The main idea is collecting all the measurements of test data that are relevant for the process.
Got it. These agents are sending in real time, they're streaming the results of the tests and they're also streaming some logs for some of the other machines. So you mentioned edge analytics, so talk to us a little bit about what you mean by edge analytics. I mean, I have a pretty good idea, but maybe not everyone is familiar with that term.
Sure. So let me explain for a second what the use cases are that we handle; that will help us go through it. One thing we are trying to do is, for example, identifying units where all the tests say the unit is okay, but we can make a prediction that these units are going to fail in production. So, we try to catch escapes: units that are bad but look good to the process itself. We have capabilities to improve and optimize the process itself. We have capabilities for adaptive manufacturing, which is monitoring certain parameters that typically drift in production, and from time to time calculating and adjusting the knobs, like changing configurations of the machines in real time. When these things are done manually, it takes days to identify that there's a drift and calculate what needs to be fixed.
And when we put a model in, it happens in seconds or minutes. It saves tons of money and increases quality, because when the process is not tuned you may have failures in the field. Some of these predictions or analytics or insights can happen later on, a day or a week later, and those can run in the cloud. But most things need to run in line as part of the manufacturing, because a unit should not move to the next station if our model says it's a bad unit, or if you want to skip it, that decision needs to be in line. So, this is edge analytics.
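The adaptive manufacturing idea Alon describes, watching a drifting parameter and adjusting a machine knob automatically, could be sketched roughly like this in Python. Everything here is an illustrative assumption (the window size, tolerance, and the gain used to turn drift into a configuration offset), not NI's actual implementation:

```python
from collections import deque

class DriftMonitor:
    """Track a drifting test parameter and suggest a knob correction.

    Hypothetical sketch: window size, tolerance, and the gain that
    translates drift into a configuration offset are all assumptions.
    """
    def __init__(self, target, tolerance, window=200, gain=0.5):
        self.target = target        # nominal value the process should hold
        self.tolerance = tolerance  # allowed deviation before acting
        self.readings = deque(maxlen=window)
        self.gain = gain            # fraction of the drift to correct per step

    def add_reading(self, value):
        """Record one streaming measurement from the line."""
        self.readings.append(value)

    def correction(self):
        """Return a knob offset if the rolling mean has drifted, else 0."""
        if len(self.readings) < self.readings.maxlen:
            return 0.0              # not enough data yet
        mean = sum(self.readings) / len(self.readings)
        drift = mean - self.target
        if abs(drift) <= self.tolerance:
            return 0.0
        return -self.gain * drift   # nudge the machine back toward target
```

In use, each new measurement would be fed in as it streams from the agent, and a nonzero correction would be applied to the machine configuration, in seconds rather than the days a manual recalibration takes.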
One important thing I want to mention is that most of the companies that are doing analytics in the industrial IoT field, are actually monitoring the machines themselves. So in a factory you have tens or maybe hundreds of machines, they produce parameters and logs, and most of the companies collect this data and try to come up with use cases like predictive maintenance and things like that.
We, on the other hand, track the products themselves. Those tens or hundreds of machines produce millions of units every day, and we are tracking those units. If you think about semiconductors, for example, just in the testing stage you have millions of units, each one passing through ten stations, and in each station you have thousands of tests, so huge amounts of data. This is the data that we collect. And we track not only the chip; we can also track it when the chip is placed on a board, the board is placed in a system, and the system is built into a product. We track all this genealogy. We call it product lifecycle, and we track all the phases and allow you to make decisions according to that.
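The chip-to-board-to-system genealogy Alon describes could be modeled, very roughly, as a linked tree of units. This is just an illustrative sketch; the class, field names, and levels are assumptions, not NI's data model:

```python
from dataclasses import dataclass, field

@dataclass
class Unit:
    """One tracked unit (chip, board, system); names are illustrative."""
    unit_id: str
    level: str                              # e.g. "chip", "board", "system"
    parent: "Unit | None" = None
    children: list = field(default_factory=list)
    test_results: dict = field(default_factory=dict)

    def place_into(self, parent):
        """Record that this unit was assembled into a parent unit."""
        self.parent = parent
        parent.children.append(self)

    def lineage(self):
        """Walk up the genealogy: chip -> board -> system -> ..."""
        node, chain = self, []
        while node is not None:
            chain.append((node.level, node.unit_id))
            node = node.parent
        return chain
```

With a structure like this, a failure found at the system level can be traced back through the board to the individual chip, which is the basis for the root-cause and recall analysis discussed later.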
Got it. What I heard you say is that most companies will just focus on the predictive maintenance of the machines themselves, but what you're calling lifecycle analytics, is looking at the products that are being produced, so the specific semiconductor, or the specific piece of electronics that is being produced in the manufacturing flow, all the tests that it goes through, and then you decide… There's an opportunity for you and your team to then identify whether or not that specific product that has been in manufacturing is good or bad and worth going to the next step in the process. Is that right?
Exactly. And again, we are not here to replace the tests themselves, which try to decide if the chip is good or not. We come on top of that. We come in places where the test program says this unit is good, and we say, "Hey, we looked at it. I know you think it's good; it's not good. Don't send it to be in an airplane."
So what happens if the testing machines themselves say it's passed all the tests, but then when you're looking at your models and analyzing the data essentially, and they predict that, you know what? This is not as good as it looks. It looks bad. Then what?
So usually before we actually deploy the model, we prove to the customer that the model makes sense, right? That the results make sense, using historical data. And then we deploy it. So in many cases, once our rule with the machine learning model decides that a unit is bad, it'll be scrapped. They will kill the unit. In other places, they can just do a retest, or have it examined by a human, etc. It depends on the cost of the unit.
These are the tests though, that are happening on the manufacturing floor for the products. But then the product goes out into the real world, and you can test these products hundreds of times in a bunch of different ways, but until they go into the real world, there are still maybe electronics that slip through the cracks that still end up having errors or issues. So, how have you addressed that problem?
So, first of all, we don't address all the problems. Some things go to production and fail. In many cases, when our customers have failures, they get reports on that, and then we do root cause analysis. We analyze those RMAs against the batches that were produced together with them and try to identify the root cause. The main idea is to identify which other units we can maybe do an early recall on, to avoid those units failing in production, maybe even before they've been sent out of the factory. If you identify bad units in the field, identify the problems with a certain batch of a component, and can track the other units, it's much cheaper to catch them before they go out to the field. And of course, when we are doing the root cause analysis, we also allow the customers to improve their process, so that next time this type of error will not happen again.
Just one more thing before you actually go to the field: remember that the customer of a chip manufacturer is usually an electronics manufacturer. So production for the chip manufacturer is sometimes just the next factory. In that case, it's easier to catch the problem, because the electronics manufacturer can have a problem with the board and then realize that the problem is actually in the chip that came onto the board. And that brings in data sharing, et cetera; maybe we'll get to that also.
Yeah. So that's what I was going to segue into, which is the data sharing element. So you have the manufacturer of the chip then gets sent to their customer, which may take a number of chips, put it on a board and then they run their tests. And then they identify that one of these chips is bad. If you're the manufacturer of the chip, you want to know that that chip was bad. So, then there is some sharing of data. Correct?
Exactly. It's even more than that. Sometimes the electronics manufacturer doesn't know that the chip is the problem. He puts the chip on the board, he assumes the chip is fine, and then he gets failures on the board. He tests the board and there are problems. So in some cases, we suspect that the problem is not in the boards themselves but maybe in the chips. Then we need to convince the chip supplier to share his data: different test parameters or manufacturing parameters of those batches of chips that he sent to that customer. Once we have that, we can do the correlation between those chip results and identify that, "Hey, this series of problems in the boards can actually be correlated to a certain geographical location on the wafer of the chips."
And there is a geographical location on the wafer for the chip itself, a bad sector or something like that on the wafer.
Yeah. In semiconductors, a wafer is a piece of silicon that you print the chips on top of before you cut them out. So the wafer itself has a geographic aspect to it. If there's a problem in a certain area of the wafer, it's a material problem. So sometimes we identify boards that are failing, and we see that all the chips placed on the failing boards are located in certain areas on the wafers. So we know there was a problem in the wafer itself, at the semiconductor manufacturer, not at the electronics manufacturer.
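The kind of correlation Alon describes, mapping chips from failing boards back to die positions on the wafer and looking for spatial clustering, could be sketched like this. The function, the lookup tables, and the region size are all hypothetical names for illustration:

```python
from collections import Counter

def failing_regions(board_failure_chips, chip_to_wafer_xy, region_size=5):
    """Bucket the wafer coordinates of chips from failing boards into
    coarse regions and count failures per region.

    board_failure_chips: chip IDs taken from failing boards.
    chip_to_wafer_xy: dict chip_id -> (x, y) die position on the wafer.
    All names and the region size are illustrative assumptions.
    """
    counts = Counter()
    for chip_id in board_failure_chips:
        x, y = chip_to_wafer_xy[chip_id]
        region = (x // region_size, y // region_size)  # coarse wafer zone
        counts[region] += 1
    return counts.most_common()  # hottest wafer regions first
```

If one region dominates the counts, that points at a material or process problem at the semiconductor manufacturer rather than at the board assembly step.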
Got it. So at the end of the day for your team, there's data coming from the agents and the original factories themselves. There are models that your team is building that are actually improving the tests on the factory floor of where the chips are being made. They also are actually changing certain configurations that you mentioned to improve the overall quality of the chips being made. And then also, you're getting the results of the data from the customer who is receiving these chips from the original manufacturer, and you're combining that all. You're doing this to try to get to ground truth.
You're trying to understand which of these chips were successful, which failed, which had issues, and why they had issues, in order to improve your model. I mean, there's a lot going on there. There are multiple different parties you're receiving the data from. What are some of the challenges you've faced in this entire process, for those data science leaders out there who are trying to improve their models while dealing with, I have to imagine, massive amounts of data coming from all over the place? What are some things that you have learned over the years?
So, I think the most important thing we learned and we learned it the hard way, is that you cannot just take a bunch of data to your lab and try to build a model. Because reality in the manufacturing world is very dynamic. Things are changing all the time.
People may assume that once you set up a manufacturing line, things are static. The reality is completely different. A machine's maintenance state is changing. Configurations are changing. Operators are changing. Even test programs are changing from time to time. So everything is changing. For example, if you take three months of data and try to predict the next month, sometimes it'll work just fine; sometimes you'll realize the data is changing so fast that with three months you cannot actually teach a model. So you'll decide to build a model from three weeks of data and predict the next week, then throw it away and train the next model. But this is just one example.
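The rolling "train on three weeks, predict the next week, then retrain" pattern Alon mentions could be sketched as a simple window generator. This is a generic illustration, not NI's pipeline; lengths are in record counts here for simplicity rather than calendar weeks:

```python
def rolling_windows(records, train_len, predict_len):
    """Yield (train, predict) slices over time-ordered records.

    Illustrates training on a short recent window and predicting the
    next slice, then sliding forward and retraining; all parameters
    are assumptions for the sketch.
    """
    step = predict_len  # advance by one prediction window each time
    last_start = len(records) - train_len - predict_len
    for start in range(0, last_start + 1, step):
        train = records[start:start + train_len]
        predict = records[start + train_len:start + train_len + predict_len]
        yield train, predict
```

Each yielded pair corresponds to one short-lived model: fit on `train`, score `predict`, then discard and move on, which is the behavior Alon describes when three months of history is already stale.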
Another thing that can happen: you can build a model with very good accuracy per unit, and then realize that practically, in the manufacturing flow, in this specific operation, you cannot make decisions on a specific unit; you need to make decisions on a batch. So all the models that we built are now completely different. What we learned is to do this from day one. There was a stage where we would build the initial model just by looking at the data, splitting it into train data, test data, etc., and only then try to adjust it to production after we had initial results. We even dropped that method, because sometimes we had to start completely from the beginning. So today, when we get the data, first we learn what exactly happens in the use case: what exactly happens in the factory flow, what the rate of data change is, what the constraints are. Only then do we start to develop the model, from the first minute, exactly as it will behave in production. So, this is one important thing.
Yeah. So there's something interesting here. So, I think for most of us in the data science world, we all…not assume, but we hope that the data engineers that we're working with, the data quality, the format of the data is relatively static. Yes, there are use cases where the data can change. You're dealing with semi-structured data and the like, but that's not quite what I'm talking about here. You're dealing with structured data, and because of what's going on in the manufacturing floor, the data itself is evolving and changing. There might be new tests or a new machine that was added. And I have to imagine that you have to get ahead of that in some way. You have to build in safeguards to say that, "Hey the data has fundamentally changed, we need to retrain our model."
So, how have you built those safeguards? How have you found a way to proactively recognize that? I can't imagine you're waiting for a human to send you an email or something and say , "Hey, we just added a new machine. FYI, you're getting a whole different data set. And let me explain it to you." So, what have you done to adjust to the changing data?
This is a great question. We built quite a strong infrastructure to manage the lifecycle of the model on the factory floor. Once we deploy a model, together with it we deploy another pipeline, a scheduled monitoring pipeline, continuously checking a bunch of things, starting from data distribution changes and score distribution changes; it also depends on the use case. So we continuously monitor different things and try to identify cases where we need to retrain the model. By the way, we also do a scheduled retrain anyway and see if the new model behaves better than the old one. But if we see that there is a drift or something is changing, we automatically retrain the model on the new data, try to come up with a new model, and check the performance. If it's better, we prompt the user to replace the model. Otherwise, we alert him to call the data scientist as fast as possible.
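A minimal version of the score-distribution drift check in such a monitoring pipeline could use something like the Population Stability Index. This is a generic sketch under my own assumptions (the 0.2 threshold is a common rule of thumb, and the binning is simplified), not NI's monitoring code:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two score samples.

    A rough drift signal; values above ~0.2 are often read as
    significant drift. Bin edges come from the expected
    (training-time) distribution.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def hist(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / width), bins - 1)
            counts[max(i, 0)] += 1
        total = len(xs)
        return [max(c / total, 1e-6) for c in counts]  # avoid log(0)

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def should_retrain(train_scores, live_scores, threshold=0.2):
    """Trigger retraining when the live score distribution drifts."""
    return psi(train_scores, live_scores) > threshold
```

A scheduled job would compare recent live scores against the scores seen at training time and, exactly as Alon describes, kick off a retrain (or page a data scientist) when the check fires.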
Got it. Yeah, that makes a lot of sense. Most data science teams have started with scheduled retraining, but I think the more sophisticated teams out there are actually monitoring drift in real time and then triggering the retraining of their models accordingly. That's very interesting. I want to talk briefly too: what other words of wisdom do you have? One thing I know you mentioned wanting to underscore is being careful about the changing training data set, like we just discussed, because things are so dynamic that models based on a static training set aren't necessarily going to work in production. But what other words of wisdom have you collected over the years for data science leaders?
One important thing that we also realized is how hard it is to actually get users to adopt models in production, especially when those models come to replace decisions that were previously made by humans. It takes quite a long time, because users find it hard to trust those models. They don't have the time and priority to give you feedback and let you investigate the results when the model behaves differently from the original human decisions. So originally we would run our models in shadow mode, just in parallel to the human decisions on the factory floor. Then we realized that once you have discrepancies and you come to the users, they don't have time for you. Sometimes they find it hard to admit that they made a mistake. And sometimes they will argue among themselves about what the right decision is.
Because one of them made the labels when you trained the model, and one of them is making the decision today. So which one is right? So eventually, with almost every model that we deploy, we bring it together with a human-in-the-loop application. The main idea is that instead of starting with a decision-making application, we start with a decision-support application. The model sits on the production line, gets the inputs, and makes the decision, but instead of sending it to the manufacturing management system, it sends it to a human being. The human has a nice interface where he can either approve the decisions or correct them. Only then do we take the final decisions and send them to the manufacturing systems.
What we get from that is, first of all, the model doesn't have to be perfect when we start. It can be quite good, but it doesn't have to be perfect. Then we let the user give us his corrections, and we improve the model with them. We don't need to waste his time resolving discrepancies, because he's just working his line, and he actually has a better interface and even a system that helps him. So it's easier for him, but it still gives us all the feedback that we need. And the more we improve the model, the more accurate the confidence scores we produce become. So eventually we introduce a threshold, so fewer and fewer decisions actually go to the user, and eventually only very low-confidence decisions go to the user. Most of them just go through automatically.
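The confidence-threshold routing Alon describes, auto-acting on high-confidence decisions and sending the rest to a human, reduces to something like the following. The thresholds are illustrative assumptions that a team would tighten as the model earns trust:

```python
def route_decision(score, approve_threshold=0.9, reject_threshold=0.1):
    """Route a model decision in a human-in-the-loop flow.

    score: model's confidence that the unit is good (0..1).
    High-confidence pass/fail goes straight to the manufacturing
    system; anything in between goes to a human for review.
    Thresholds here are illustrative, not NI's values.
    """
    if score >= approve_threshold:
        return ("auto", "pass")    # confident good unit: ship it onward
    if score <= reject_threshold:
        return ("auto", "fail")    # confident bad unit: scrap or retest
    return ("human", "review")     # uncertain: ask the operator
```

As the model improves, the band between the two thresholds shrinks, which is exactly the progression Alon describes: eventually only rare low-confidence cases reach the human.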
In order to gain the trust of the human on the manufacturing floor that previously was making decisions all on his own instead, you start just by reporting, "Hey, I think there's a potential issue with this chip or with this batch" or what have you. And then you have the human say, "Yep. You're right". Or "No, that doesn't look right to me." And, that is helping to train your model. So, you're helping to get direct human feedback, but also it's helping to gain trust. So that individual is seeing that, and over time, noticing that the model is getting better.
And then you actually are then, depending on the confidence score, you're deciding whether or not to directly flag, I'm just throwing this out here, but just flag whether or not that that chip is bad or not. But if it's low confidence, then that's when the human actually goes in and says, "Yeah, you're right." Or "No, you're not right." And then eventually, does it ever get to a point where the human is not in the loop or are you always sending those low confidence scores to the human?
So in many cases, we get to a point where practically all the units just pass automatically. In many cases, this is the case. There are use cases that are too sophisticated, and there are cases where we don't want to take the risk. In our industry, a mistake can be very expensive, so we prefer not to take the risk. But in many cases, if you reduce the manual effort by 95 or 98%, that is a huge saving. And it's also an improvement in quality, because people make mistakes. Models, once they're tuned correctly, make far fewer mistakes than humans. We have tons of examples of that.
Well, this has been fascinating. One of the things that I've just been thinking about is that your world, Alon, is a data scientist's dream. Most data scientists love to know that the work they're doing is having a direct impact, and it seems to me your world is one where you can measure that impact in a pretty concrete way. I mean, yes, there are going to be gaps; there may be downstream customers from the manufacturing floor who don't share the data you need to get ground truth. But by and large, you're actually seeing those results in real time. Not only that, it's not a static thing. It's a very dynamic environment where you're constantly having to improve upon the models that you're building.
So when it comes to recruiting and you're looking for data scientists out there, is there anything that you particularly look for? Do you look for people who have a background in electronics from manufacturing or is that just merely a nice to have?
So a background in electronics is nice to have. For me, the most important thing that I'm looking for is people that are motivated by challenges. The most important thing that we need is someone that when there's a problem, there's a challenge, he jumps on it. He will read whatever he needs to read. He will try whatever he needs to try. He will fight until he wins. This is what we need, because we are working with customers in production. We have different use cases every day. We have new things that surprise us all the time. So this is what I'm looking for practically.
That's awesome. Well, Alon, I've thoroughly enjoyed our conversation. I've learned a lot, and the world that you live in certainly sounds exciting and dynamic to me. If I were a little younger, I'd love to be a member of your team. So, thanks so much for being on the Data Science Leaders podcast. I really appreciate it.
Thanks for hosting me here. It was great. I really enjoyed it. Thanks.
About the show
Data Science Leaders is a podcast for data science teams that are pushing the limits of what machine learning models can do at the world’s most impactful companies.
In each episode, host Dave Cole interviews a leader in data science. We’ll discuss how to build and enable data science teams, create scalable processes, collaborate cross-functionally, communicate with business stakeholders, and more.
Our conversations will be full of real stories, breakthrough strategies, and critical insights—all data points to build your own model for enterprise data science success.