What if we could predict how long our models will last in the field?
Is there a mathematical way to estimate mean time to failure for a specific model?
In this episode, Dave Cole is joined by Celeste Fralick, Chief Data Scientist at McAfee, to discuss AI reliability and how it can help predict model decay.
Welcome to the Data Science Leaders podcast. I am Dave Cole, your host. And today we have an interesting topic. We’re going to be talking primarily about AI reliability. And if you’re wondering what AI reliability is, then you’ve come to the right place. Our guest today is Celeste Fralick. She is currently the Chief Data Scientist at McAfee. She hails from North Texas, originally growing up in Alaska. That’s quite the latitudinal change there, Celeste. She also has her PhD from Arizona State in biomedical engineering. And interestingly enough, her dissertation was “A Statistical Clinical Decision Support Tool for Determining Thresholds in Remote Monitoring Using Predictive Analytics.” So, when I saw that dissertation, Celeste, I immediately thought that must be the source of your passion for AI reliability. So talk to us a little bit: what is AI reliability?
Well, hi Dave, and thanks for having me on. Oddly enough, the only thing that title doesn’t have in it is “novel.” But the dissertation, which I wrote while working full-time and going to school part-time, really focused on whether grandma was going to have a COPD episode, a chronic obstructive pulmonary disease episode. She was having coughing and things like that. And could I predict when she would have a COPD episode? At the end of the day, I could predict within 10 days. So, that was pretty cool.
So, at any rate, AI reliability is really about assessing and predicting your intended function over time in the field. And the one thing that we don’t do around AI reliability is…have it. We don’t have it at all. From the National Institute of Standards and Technology in the US, and from ISO, we’ve got AI trustworthiness specs. I think they’re really great at identifying trustworthiness, but they have stopped short of AI reliability and measuring reliability in the field.
Right. So, when you say functions, I mean, are you talking about like trained models that are out there in production, producing predictions? And then dumb it down for your host here, what do you mean by reliability, specifically? What are you measuring? Is it the accuracy of the model? Help us understand that.
When I think of reliability, the general definition is the probability that a product, a system, or a service will perform its intended function adequately for a specific period of time. And so if I dumb it down, I’d say, don’t you want to know how long your model lasts in the field? We have model decay, but do we have a mathematical comparison like we do with, say, a software or hardware product, where we can say my mean time to failure is X for this model? We don’t have that. I can’t compare company to company. And so as a customer purchasing a product with a model embedded in it, AI embedded in it, I don’t know how long that’s going to last.
I don’t know how long the retraining period is. I don’t know about concept drift. I don’t know about data decay. I don’t know about any of those things. And NIST and ISO specs do a really great job of describing what things make up reliability, but they don’t say “reliability.” They speak about robustness. They speak about trustworthiness. They speak about resilience. I believe those three things can go into reliability, and then we can identify those specific items that will drive your model decay.
So, can you walk us through maybe an example of a model that you’ve put into production, and what sort of processes you have put in place to measure its reliability? One thing that’s also interesting is that you talk not just about monitoring the model to ensure that it is still producing accurate results and accurate predictions, at least compared to how the model performed during training. You have also discussed being able to predict, as you were saying, the mean time to decay: how long that model actually stays useful, right? So, it’s almost like you’re building a model on top of the performance of the model.
Right, exactly. Exactly. In fact, the thing that I recommend is when you have a specific development operations or DevOps arena, and you have an MLOps arena of an organization, both of those should have incorporated monitors within those processes to monitor certain items that you do see in the NIST and ISO specs. The concept drift, the data decay, the data pipeline. And when I speak of data decay, I’m looking at the changes of the data over time, I’m looking at the change of the labels over time. So if I put those monitors in place, then I know that when those things start to crap out, I know that my model decay will also start crapping out. And so there’s a number of things that you can look at. If you look at model decay going down over time, you can look at a certain number of contributors, a certain number of monitors, that can contribute to that decay over time.
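To make the idea of distribution monitors concrete, here is a minimal sketch (hypothetical names, not McAfee’s actual tooling): track per-feature statistics such as mean, variance, skewness, and kurtosis in a current window and flag any that move too far from a training-time baseline.

```python
import statistics

def moments(xs):
    """Mean, variance, skewness, and excess kurtosis of a feature window."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    n = len(xs)
    if sd == 0:
        return {"mean": m, "variance": 0.0, "skewness": 0.0, "kurtosis": 0.0}
    skew = sum(((x - m) / sd) ** 3 for x in xs) / n
    kurt = sum(((x - m) / sd) ** 4 for x in xs) / n - 3
    return {"mean": m, "variance": sd ** 2, "skewness": skew, "kurtosis": kurt}

def drift_alerts(baseline, current, tol=0.5):
    """Names of statistics that moved more than `tol` from the baseline window."""
    b, c = moments(baseline), moments(current)
    return sorted(k for k in b if abs(c[k] - b[k]) > tol)
```

In practice the tolerance per statistic would come from control limits set during validation, not a single fixed number; this just shows the shape of such a monitor.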
The best part of all of this is that you can mathematically derive all of this and you can then predict it to get your mean time to decay, MTTD. So it’s a wonderful type of measurement that if you say, “Yes, I want MTTD,” then you’ve got to go back and say, “First of all, how am I going to measure it? Am I going to do it by model? Am I going to do it by file type? Am I going to do it by family of threats, because I’m in security?”
So you make that decision, and then you go back into your processes of DevOps and MLOps, and you go, “Okay, what monitors do I have in place? Do I have a monitor of AML, adversarial machine learning, or model hacking? Do I have explainability? Do I have data decay? Do I have the volumes, the skewness, the kurtosis, the variance, those type of things. Do I monitor those? If I monitor those and they go down or up, how does that impact the model decay?” And with that, you can get a mathematical end result. Does that make sense?
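One minimal, purely illustrative way to turn those monitored metrics into a mean-time-to-decay estimate (not the actual method used at McAfee) is to fit a least-squares trend to an efficacy series and extrapolate to the day it crosses an acceptable floor:

```python
def estimate_mttd(days, accuracy, threshold=0.85):
    """Fit a least-squares line through (day, accuracy) points and
    extrapolate to the day accuracy crosses the acceptable threshold."""
    n = len(days)
    mx = sum(days) / n
    my = sum(accuracy) / n
    slope = (sum((d - mx) * (a - my) for d, a in zip(days, accuracy))
             / sum((d - mx) ** 2 for d in days))
    if slope >= 0:
        return None  # no observed decay; MTTD undefined from this window
    intercept = my - slope * mx
    return (threshold - intercept) / slope
```

Real decay is rarely linear, so a production version would likely fit an appropriate reliability distribution instead; the point is only that once monitors produce a decay curve, the crossing time is computable.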
It does make sense. One thing I did want to clear up for our audience here: you talk about DevOps and MLOps. Can you help me understand what you mean by that?
It depends. That’s what a good data scientist would always say. It depends on the company you’re in. There are many models out there. So, DevOps can include the model development, the model training, the model evaluation, the model packaging, the data prep, the data cleaning. Whereas MLOps is taking that model, or models, that the data scientists created and putting it into operations. So you look more along the lines of model governance, model discovery, model security. And certainly you look at monitors in the field like efficacy. So in our case, we tend to build a model embedded in a product, and then in MLOps we will validate that it works in the product and that it works in conjunction with other products, and then put it in silent mode to get feedback.
What is silent mode?
Silent mode is where we’re getting the information, but it’s not impacting the customer at all. So it’s not live.
That’s interesting. So you put a model in silent mode and obviously you’re starting to monitor its prediction and it might take some time to get ground truth. And then, so what are you actually looking for during that silent mode? Are you looking to make sure that you’re measuring its accuracy and things like that?
Right, we’re measuring all the model metrics such as the ROC curve, the FP (the false positive), the true positive, the false negative, true negative. We’re looking at whether it has any spikes in it, or not, whether the trend goes down, whether the trend goes up and how it compares to the model in the field.
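The silent-mode metrics she lists all fall out of a confusion matrix once ground truth arrives. A minimal sketch (illustrative only) of computing the counts and the two rates that define a point on the ROC curve:

```python
def confusion_metrics(y_true, y_pred):
    """Confusion-matrix counts plus the rates behind a single ROC point."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "tpr": tp / (tp + fn) if tp + fn else 0.0,  # true positive rate (ROC y-axis)
        "fpr": fp / (fp + tn) if fp + tn else 0.0,  # false positive rate (ROC x-axis)
    }
```

Tracking these over successive silent-mode batches is what reveals the spikes and trends she describes.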
I’m curious who owns the reliability of the model? Is that the MLOps side of the house, or is that the DevOps side of the house, or is it a joint effort?
It’s really joint at this point. MTTD is a new type of mathematical model, and what we’d like to do is be able to predict MTTD, mean time to decay; that would help us immensely with resourcing. Right now, our focus is on implementing the monitors throughout DevOps and MLOps, and being very specific about what we want to measure on AI reliability. The reason I’ve been driving reliability is because of my background. I started in reliability and quality, and nobody has addressed this. When I do see papers about reliability, it’s not about the system.
For example, Microsoft has some really great monitors in place for measuring the reliability of software products. They have a great paper out there that speaks to that, but it doesn’t speak about the system as a whole. So, I think at the end of the day, what we want to do is have a bellwether measurement, like hardware has had for many, many years, and compare it from company to company to company. Because who wins at the end? We all do. The customer especially wins, because they can now do a one-to-one-to-one comparison of the companies that offer embedded AI.
That’s interesting. I would think to be able to compare the reliability of a model between companies, it would have to be a fairly standard type of model. It would be sort of predicting the same thing I would imagine, right?
Right. In hardware we have FITs and mean time to failure. FIT is failures in time per 10 to the ninth device-hours. So, you can actually compare company to company to company. I could go from one company, look at the reliability of a particular DRAM, and compare it to another company’s. In software we have typically looked at errors detected at each stage. And when you look at CISQ, the quality assessment of your software requirements, they speak about resiliency. They speak about stability. They speak about trustworthiness. And it’s like models really don’t fit into software, nor do they really fit into hardware. So, we’ve got to come up with something, and I’m not saying that I’ve got all the answers. I think what we’ve got to do is dream big, start small, and be able to just put something out there and let people throw rocks at it and get better.
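For readers unfamiliar with FIT arithmetic: since a FIT rate counts failures per 10^9 device-hours, mean time between failures is just its reciprocal. A tiny illustration:

```python
def fit_to_mtbf_hours(fit_rate):
    """Convert a FIT rate (failures per 1e9 device-hours) to MTBF in hours."""
    return 1e9 / fit_rate

# A part rated at 100 FIT fails, on average, once per 10 million device-hours.
```

This reciprocal relationship is exactly the kind of simple, comparable bellwether that models lack today.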
Interesting. I mean, do you have any ideas of what that might look like? Have you gone down that path at all? Certainly you’re mentioning meantime to decay for a model, but are there any other sort of reliability metrics that you’ve tried to put into practice that all your models fit to?
Well, we specifically look at efficacy: how well our model is working for the customer. Because at the end of the day, we want our model to work perfectly for as long as possible. What I’m looking for is how to mathematically predict “as long as possible.”
Right, right. That makes a lot of sense.
It kind of all came together in reliability. And I started many, many, many years ago, about two miles west of here at Texas Instruments. And I had some surgery and I was working, I had just come back from Singapore and I was doing a really good job. I was an engineer. And the Vice President of Quality called me. Now, that was about 10 layers up, right? And he called me and I’m recovering from surgery at home. So, he’s got my home phone number, landline then, and I’m going, well, “Hey, Ron, how are you?” And I’m kind of half-drugged. And he goes, “How would you like to do reliability in Dallas for Texas Instruments?” I went, “Okay.” When you’re young… One of the things that I was always taught from him was learn as much as you can about the area that you’re studying, every portion of it, and then you will continue to increase your knowledge, increase your stature, and grow in your career. He was a great mentor.
That’s great. I’ve read some things out there and some of the things you’ve said, and you’ve reiterated a number of times: really don’t forget the fundamentals. When you’re going deep into a particular area, and especially in the world of data science there’s many areas to go into, don’t forget the fundamentals. Is there anything that you do or any recommendations that you have to stay up on your fundamentals? I know myself, I have to look things up and I kick myself when I read it over again. But is there anything that you do as a process?
I’m actually the same way. I had hired three women, not intentionally but they just happened to be the best for the job, and we ran advanced analytics at McAfee. We’ve since dispersed them. They’re all PhDs and they brought to light state of the art work. And because they’re young and hungry, they always sent me the greatest and the latest. The thing that I find interesting is that most data scientists—the PhD, masters, whatever—don’t have a solid foundation on statistics. If you’re going to monitor to drive to an excellent reliability of your model in the field, you’ve got to look at things like skewness, and kurtosis, and mean, and distribution, and volume of your features and the type of your features. Those type of basics need to be put in at the very beginning. And they’re a great start for monitors.
One of the challenges in the data science field, as you’ve pointed out, is you really need to be fairly wide in your expertise. You need to have data expertise. You need to have, ideally, a statistical background. You obviously need to be able to code. You need to be able to talk to business users and understand the business challenges and translate that and be able to take that as an input, as you build out your model to make sure that it’s actually driving business value. So it is not easy, right? And I can just imagine that you will forget things and it’s okay.
I do response screening for design of experiments on a routine basis. And I look at response screening metrics over and over again. I’ve done it this year at least twice. Every time, I’ve had to go back to my basics and read it over again. And sadly, I don’t think that’s because of old age; it’s because your brain is so full! And when you start looking at things like skewness and kurtosis, you’ve got to go, “What was that again? I know I learned it sometime. Why would it make a difference?” And so, you do, you forget. There’s just too much. I like to use the people around me who are experts.
I’m not a great coder, so if I have an issue, I go to somebody that has great coding experience. I started AML studies at our company about three years ago, adversarial machine learning. But I brought in a lady from Intel that I had known there and she has taken it much further. So if I want to keep up on AML, I just go to her and say, “Give me the best papers you’ve got.” You’ve got to be humble. You’ve got to know that you don’t know everything no matter how many years of schooling or experience you have.
Right, how many blog posts you read, or what have you.
I have a library of stuff, and you know, I’m still stupid, so…
Yeah. Have a large bookcase, I guess, another pro tip from Celeste here. So, we’ve talked a bit about AI reliability and the importance of not just looking at the efficacy of your model, but also thinking through how long your model is going to be effective. And I imagine determining thresholds and things of that nature. Let’s talk a little bit, you just mentioned adversarial machine learning. I have no idea what that is. Talk to us a little bit about that.
Adversarial machine learning really reared its ugly head in the literature about, I would say, five years ago. Think of it as model hacking. So, it’s a great monitor to place in your MLOps to look for drift, drift within your false positives or false negatives. Because here is what an adversary will do: he won’t know your features, he won’t know your model. He’ll come in and start pinging, and the response will come back, and it’s like, oh, you got a response here. And so what he or she is doing is actually changing a column in supervised learning, changing the column of a feature from zero to one, or one to zero.
If I were to try to describe it in my own words: essentially, this is a bad actor who is purposefully trying to create data drift by overloading the inputs that go into your model. And as you retrain it, the model will drift thanks to the one or two features, or whatever it is, that they’re overloading and skewing.
Definitely. And what happens is that you’d think that they would have to know the model, but there’s a great paper by Papernot. It specifically shows that you can ping with a regression model, and if the company is using a neural net, you can actually still impact that model, even though you don’t know what the company’s model is. It’s called transferability. This particular paper shows that there are a number of different models that you can use that have a particular penetration rate against other models. So, it’s a wonderful XY grid that says, this is what we did: we used neural nets and it impacted generalized regression this much, or we used generalized regression and it impacted neural nets this much. And in some cases you’ll find that it impacts it up to 98%.
In some cases it impacted only 30%. We did some analysis a couple of years ago where we looked at 700 features in a white box setting. So we knew the features, we knew the model. We went in and we just changed 10 of those features from zero to one or one to zero, of 700. And we literally changed the model to now detect malicious as benign. So, in other words, the file used to come in as malicious. Now with a change of only 10 features out of 700, it is now benign. And so you’ve got to be really careful about measuring that drift, the false positives and false negatives, for evasion and poisoning by bad actors.
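The white-box experiment she describes can be sketched with a toy linear detector (purely illustrative; the real study and real detectors are far more complex): rank the binary features by how much flipping each one lowers the malicious score, and flip only a small budget of them.

```python
def score(features, weights, bias):
    """Toy linear detector: a positive score means 'malicious'."""
    return sum(w * f for w, f in zip(weights, features)) + bias

def greedy_flip_attack(features, weights, budget=10):
    """White-box evasion sketch: flip the binary features whose 0<->1
    change most reduces the detector's score, up to `budget` flips."""
    feats = list(features)
    # Flipping feature i changes the score by weights[i] * (1 - 2 * feats[i]).
    order = sorted(range(len(feats)),
                   key=lambda i: weights[i] * (1 - 2 * feats[i]))
    for i in order[:budget]:
        if weights[i] * (1 - 2 * feats[i]) < 0:  # only flips that lower the score
            feats[i] = 1 - feats[i]
    return feats
```

Even in this toy, flipping a handful of heavily weighted features out of hundreds is enough to push a malicious sample below the detection threshold, which is the effect she describes with 10 of 700 features.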
Golly, it comes as little surprise to somebody like yourself in the cybersecurity industry that the bad actors are getting more and more sophisticated. Essentially what I’m hearing you say is, not only are they potentially authoring and creating viruses and all sorts of different ways to sort of entrap folks like myself out there, but also they are going after the detection software itself. So the models that are used to detect their viruses and they’re trying to skew those models so that they can fly under the radar. That is terrifying.
I call it model hacking. Well, there is good news…and kind of bad news. The good news is that when we started looking at it a couple of years ago, two, three years ago, there had not been a report at all of any AML attack within the company. There is some literature that someone in, I think it was Australia, went in and went to a number of different companies—not McAfee—but they went to a number of different companies and they were able to utilize AML techniques and hack in. And that was reported in the literature. The good part was that the researchers are ahead of the bad guys. That’s the really good part. That’s the first time that we’ve really been ahead of the bad guys. Instead of reactive, we’re proactive.
Unfortunately, there’s always bad news, because bad actors can read too. And so they’re going to read all the literature out there, and there’s been a couple of reports out there that it’s starting to infiltrate. So it’s just a matter of time.
The important thing is that there’s awareness from the community and there’s people like yourself who are working to hopefully stay ahead of the game as best they can. Hey, sadly, without these bad actors McAfee wouldn’t be in existence. It wouldn’t be needed, I suppose. So, I’m glad we went down that path because I really didn’t know that much about it. But I can imagine this is not just applicable to viruses and software and hackers and things of that nature. It’s applicable to fraud and all sorts of different areas where, heck, even getting your loan approved. If you really, really badly wanted to make sure that you crossed that threshold, you could potentially, I imagine, if you’re sophisticated enough, hack away at the model.
AML, the bias you speak of, anomalies, cyber checks: those are all monitors that you need to put in place. And eventually they’ll give you a visual of model decay. And then you can mathematically calculate your MTTD.
Then what you’re doing with that MTTD, I assume, it helps you understand how often you need to retrain your model. That makes a whole heck of a lot of sense. If you have hundreds of models in production and your team is only so large and you’re trying to figure out what your team should be doing for the next two, three months… you don’t want your models to basically all decay at the same time, right? You create this very lumpy sort of job for your team to do so, I imagine it helps to spread the workload.
Absolutely, it impacts a lot of things, including resources. If you can predict MTTD, you can plan for those resources. You can certainly meet, and exceed, your customer expectations, even assess and predict how long your model performs. It just goes on ad nauseam. There are reasons why there are books and books on reliability. So, I’m glad my two worlds kind of collided with reliability modeling and AI. It brought me to this point here with you.
And now I’m getting completely off topic, but I think it’s something that is important. I do talk on the Data Science Leaders podcast about the importance of diversity in data science. One of the statistics I’ve thrown out in prior episodes is that, I think, 15 to 22% of data scientists are women. I have to imagine that out of that 15 to 22%, even fewer are data science leaders like yourself. Do you have any pearls of wisdom for the women out there who are aspiring DSLs, or even aspiring data scientists, and maybe even advice for the men out there, male leaders, in terms of building out a more diverse team?
Maybe it’s because I came from a very unique background. My undergraduate degree is in microbiology and chemistry; my master’s and my PhD are in biomedical engineering, but they really focused on neural networks and design of experiments. What I find interesting is that when we want someone young, maybe an intern, we go to software, we go to the computer scientists. I would highly recommend going to other areas like biomedical engineering. I’ve never seen such a vast array of knowledge that looks at a system from soup to nuts as biomedical engineers have. And that’s how they’re trained. I always go back to what I was taught by my mentors: learn your basics. And the first question I always ask anyone I hire is, “Do you have a good course, a year or two, in statistics and regression analysis?”
If you don’t have those, then you can’t understand some of the other basics that go into model development. So, I think it’s critical for people, whether women or men, to come into industry with a solid background in data science, as well as statistics and coding. I’m a little biased, no pun intended, but I’m a big fan of design of experiments. It’s really looking at the combinatorial and factorial impact of different contributors to your output, and how you mathematically model that within a designed space. I’m a big fan of that, but I have found that nobody in the security industry knows anything about it.
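For readers new to design of experiments, a minimal two-level factorial sketch (hypothetical, stripped to the bare idea) shows the combinatorial flavor she means: enumerate every low/high combination of the factors, then estimate each factor’s main effect on the output.

```python
from itertools import product

def full_factorial(factors):
    """Every low/high (-1/+1) combination for a 2^k factorial design."""
    return [dict(zip(factors, levels))
            for levels in product((-1, 1), repeat=len(factors))]

def main_effect(runs, responses, factor):
    """Mean response at the factor's +1 level minus the mean at its -1 level."""
    hi = [r for run, r in zip(runs, responses) if run[factor] == 1]
    lo = [r for run, r in zip(runs, responses) if run[factor] == -1]
    return sum(hi) / len(hi) - sum(lo) / len(lo)
```

Real DOE adds interaction terms, replication, and fractional designs, but the core is exactly this: a structured grid of runs that lets you attribute output changes to specific contributors.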
If you’ve got something unique, like regression analysis or design of experiments, above and beyond the typical coding and capstone project of here’s a model, here’s my data, here’s my output. If you have something more to add to that, that would be awesome. I like to find people who are good at something, who have a passion in a specific area. Like I love DOE, design of experiments, in neural networks.
Can you tell? I am not going to hire somebody that is really good in neural networks and design of experiments. Why should I? I need to surround myself with people that know a whole lot more about control theory, for example. I mean, that’s a good broad range.
Right. So, I’m hearing you, and to sort of recap: make sure that you have a variety of skills and expertise on your data science team. It’s not good to just have 10 data scientists whose expertise is in neural nets, or maybe just design of experiments. You want to have that breadth amongst the team. And then the second thing I heard you say is that having a strong statistical background is very important to enter the world of data science. But you don’t need to have a PhD in statistics if you have one or two years of stats. And there are a lot of different disciplines for folks coming out of college, and even in their master’s degrees, where statistics is really important. You don’t have to have your master’s in statistics per se. But if, like yourself, you had sort of a biomedical background where statistics was extremely important for being able to predict, I believe it was, chronic obstructive pulmonary disease…
In fairness, I had to Google it while you were talking. So, that background, that statistical background, can be applied to other areas in the business world. And that helps as you build out that diverse team. So it’s diversity of background and diversity in expertise.
You know what, Dave? I’ve done data in eight to 10 different markets in my career: printers, HVACs, windmills, hardware, computer chips, software, you name it, I’ve done it. One bizarre thing that I’ve noted is that the data in cybersecurity, which has very few females, and the data in healthcare are very similar. Very, very similar. Yeah. And they’re both very complicated, because you never know what you’re going to get.
Well, you’re bucking the trend and you have the long career to boot, and hopefully there’ll be many other data scientists who follow in your footsteps. So male and female, and it’s extremely impressive. So, I very much thank you. This has been fantastic. I mean, some things that we learned today about AI reliability and the importance of understanding the decay of your model, and not just the efficacy, but how long your model is going to be effective out in the field. It helps with resourcing. It helps with knowing when to retrain the model, and so on. Very, very interesting stuff. If people are interested in reaching out to you and learning more, do you have social media or some way to get a hold of you?
Sure. I’m on Twitter, @purkinje16.
Purkinje, and that’s P-U-R-K-I-N-J-E. They’re actually fibers in the heart and fibers in the brain.
I was going to ask!
Yeah. Also, on LinkedIn under Celeste Fralick. But Purkinje is funner, isn’t it?
Yeah. It was a pleasure having you on the Data Science Leaders podcast. Thank you so much for joining us. This has been great.
44:16 | Episode 06 | June 08, 2021
Data Science Leaders is a podcast for data science teams that are pushing the limits of what machine learning models can do at the world’s most impactful companies.
In each episode, host Dave Cole interviews a leader in data science. We’ll discuss how to build and enable data science teams, create scalable processes, collaborate cross-functionally, communicate with business stakeholders, and more.
Our conversations will be full of real stories, breakthrough strategies, and critical insights—all data points to build your own model for enterprise data science success.