Data Science Leaders | Episode 16 | 29:22 | August 17, 2021
How Computer Science & Statistics Fundamentals Can Advance Data Science in 2021
Computer scientists can be fearless, pushing the limits of computational power and the scale of data we can analyze. On the flipside, statisticians can be intensely skeptical, always measuring error and bringing a critical perspective.
In this episode, Chris shares his thoughts on how computer science and statistics fundamentals can help us continue to push data science forward. Plus, he offers advice on how to conduct all-important exploratory data analysis (EDA) effectively.
- How computer scientists influenced data science
- What statistical thought brings to the equation
- Tips and tricks for doing EDA right
Hello, welcome to another episode of the Data Science Leaders podcast. I am your host, Dave Cole. Today, our guest is Chris Volinsky. Chris, you have been at AT&T for the better part of 23 years, is that correct?
23 years, that’s right.
Very impressive. Your LinkedIn profile is perhaps one of the simplest I’ve seen, and it’s great. We’re going to talk a little bit about your 23 years at AT&T before we dive into our discussion topics. You’re the Assistant VP of Data Science and AI at AT&T, you have a PhD in statistics from the University of Washington, and your first job out of college was at AT&T and you worked your way up to where you are today.
I’d like to dive into your move, Chris, from statistician to data scientist. A lot has been made of the differences between machine learning practitioners, data scientists, and statisticians. I’ve talked with a lot of folks on the Data Science Leaders podcast who have a statistics background. One thing we haven’t discussed is the difference between machine learning and statistics. What was the path like for you, going from statistics to a title with ‘data science’ in the name?
The second topic we’re going to dive into is perhaps one of the most critical parts of the data science lifecycle, which is EDA: exploratory data analysis. That’s really, as a data scientist, rolling up your sleeves with the data, really trying to understand it, understand anomalies and relationships between various features, what features should you use, etc. I know you’re passionate about EDA and you have some good ideas and thoughts for the data science leaders out there, even data scientists, in terms of best practices. All right, so let’s dive in. You started your schooling in statistics, and you moved. What was your first role at AT&T? I’m curious. Where did you start?
I actually started when I was in grad school. I did an internship at AT&T for the summer. I was in the University of Washington’s fairly theoretical program. I was learning a lot of math and stats. I went to AT&T for the summer and I remember my advisor said to me the first day, he said “Most of the stuff you learn in grad school is going to be irrelevant here.” This was back in the late 90s, and it was before the big data revolution and everything. We were one of the places where you really had to deal with large datasets.
He said, “The theory you’ve learned breaks down with large datasets. Everything’s going to be significant, and you have to do a lot of stuff that’s kind of ad hoc. You’re not going to be proving asymptotics. You’ve just got to get stuff done.” That was really powerful to me. It was kind of exciting to hear that what I was learning in school was irrelevant.
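The advisor's point that "everything's going to be significant" can be sketched with a small calculation. This is only an illustration under assumed numbers: the effect size (0.01 standard deviations) and the sample sizes are hypothetical, and the helper name z_statistic is not from the conversation.

```python
import math


def z_statistic(mean_diff, std_dev, n):
    """z-statistic for testing an observed mean difference against zero."""
    standard_error = std_dev / math.sqrt(n)
    return mean_diff / standard_error


# Hypothetical effect: a shift of 0.01 standard deviations, which is
# practically negligible in most business settings.
small_n = z_statistic(mean_diff=0.01, std_dev=1.0, n=50)
large_n = z_statistic(mean_diff=0.01, std_dev=1.0, n=10_000_000)

print(f"n = 50:         z = {small_n:.3f}")
print(f"n = 10,000,000: z = {large_n:.1f}")
```

With 50 observations the statistic is far below the usual 1.96 cutoff, while with ten million observations the same tiny effect is overwhelmingly "significant". The test answers whether an effect is exactly zero, not whether it matters.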
That summer, I was working on real world examples. I was working on fraud detection. I was working on churn analytics. I was studying how people use communication tools and products. It wasn’t the toy examples I was learning in grad school where ‘N=50’ was a large dataset. It was really exciting. I got really jazzed that summer, and worked right into a job when I graduated out of school. You talk about the journey. I would have called myself an Applied Statistician at the time, but over the time I’ve been in this job it was applied statistics, and then it was knowledge discovery and databases, and then it was a data miner. Then I was doing big data and then I was a data scientist. Now I’m doing AI. It’s like every couple of years it gets rebranded.
For the most part, it’s still the same stuff. What we’re doing is we’re solving problems with data, particularly with large datasets. The amount of computation has clearly changed over the years, and the algorithms have changed over the years, but at its heart we’re still just solving problems with data.
Right. That’s on a super high level, Chris, right? But if we were to go a level down, do you see differences between what a data scientist does today versus what you maybe were doing 15 years ago in your first 10 years at AT&T?
Clearly, the difference is in our ability to analyze large datasets, thanks to the computational advances that have been made over the years. Culturally, I try not to get too much into the culture wars between statistician and machine learner or computer scientist, although there are significant and real differences that are worth talking about and exploring.
I think that early on, when it was really just statisticians doing this stuff, if I compare that to now where the field is a nice blend of computer science and statistics… computer scientists brought a different perspective and a different angle into the field, that I think was necessary. Statisticians are very rigorous: they want to prove things; they want to make sure stuff is asymptotically right before they apply anything. Sometimes, that is a barrier to getting stuff done.
The computer scientist kind of came into the field with a fearless attitude of “Let’s just throw a lot of things at the wall and see what works. Let’s see how far we can push the limits of computational power and the sizes of data that we can analyze. Let’s create algorithms that do interesting things without exactly knowing what the mathematical properties or asymptotic theory behind them are.” I think that really, over the last 10-15 years, has pushed the field forward in a way that I don’t think the statisticians would have managed on their own.
In preparation for today’s podcast, I did a bit of reading about the difference between statistics and machine learning. Quite frankly, it is not straightforward. It’s hard to boil it down to a pithy response. What you just highlighted does come through in what I was reading, which is that a data scientist’s or machine learning approach to building a model is a little bit about the ends justifying the means. It’s like all that really matters is how accurate your model is at the end of the day. There’s over-fitting and things you’ve got to be careful about, but how many features you use, the relationships between the features, understanding how the data was collected, things like that, which a statistician would care deeply about, don’t really seem to matter to a data scientist.
A data scientist is really just about increasing the accuracy of the model, and better outcomes. I don’t know if that’s your perspective as well, but if I was to have to boil it down, that’s how I would boil it down.
Yeah, I think that’s a good way of putting it. If I think back to my training, we spent a pretty small percentage of time on prediction. There was a lot of time on distributional theory, parameter estimation, data and data collection, sampling, things like that, but very little time on prediction. I think what we’ve learned is that in the real world, predictive models are really important and really useful. That difference in perspective and how you approach problems is really, really valuable.
Yeah, and I would agree that with the advent of open source, hard drives being cheap, Hadoop, big data, the advent of the cloud, access to GPUs, the computational power at the fingertips of a data scientist has increased. That’s really pushed forward what you can do with the algorithms that are based in statistics. Hopefully both sides, statistics and machine learning/data science, are helping each other and pushing things forward. What is for sure happening, I think, is in schools you are seeing data science disciplines and focuses, but statistics isn’t going anywhere. It’s not like statistics departments are closing up shop and are like, “Oh, we’re done. Machine learning’s taking over.”
You can still get a PhD in stats and there’s still a lot of theory and applied statistics around, and it’s still very important, right?
I feel like statistics was definitely slow to get to the party, no doubt. What I’d like to say about statistics and statisticians is a kind of different perspective. There’s something about statistical thought that brings a lot of value to the data science process.
Statistics, by definition, is a field that’s based on uncertainty. They focus on uncertainty as the core part of the science. That’s just a really unique and interesting perspective for science to take. It creates, for those of us who are trained in statistics, it creates an intense skepticism of almost everything that you see.
In fact, I remember a meeting early on where there was a group of people who were building a model in another organization. They wanted to add a statistician. They said, “You guys are like the lawyers. You come in and you tell us what we can’t do, but that’s valuable. There’s value in that because you guys are so skeptical and you’re such critical thinkers.” Because we’re always thinking about errors. We’re always measuring the error. We’re always talking about how wrong you are instead of how right you are. I think that way of thinking of problems is really valuable, even in today’s age. We’re kind of in an age of rampant proliferation of dashboards everywhere, and everyone’s focused on metrics and a number, a metric, that either goes up or down this month. That’s valuable in a business sense, but a statistician’s way of thinking of that, they’re going to look at that number and they’re going to go, “What is the error? What is the spread? How much of this change I’m seeing this month is due to random variation? How much is due to a real effect?” And you can try and measure those things.
I think that way of approaching problems, always thinking about what is the error, what’s the expected error, what do I expect to happen compared to what actually does happen, I think that type of thought isn’t as ingrained in a computer scientist. When you bring these two things together, the power of the computation and the skepticism of the results, I think that’s where you get the real sweet spot of data science.
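The dashboard question raised above, "How much of this change I'm seeing this month is due to random variation?", can be sketched as a rough check. This is a deliberately crude illustration rather than a recommended method: the monthly values are made up, and change_in_context is a hypothetical helper that simply compares the latest change with the spread of past month-over-month changes.

```python
import statistics


def change_in_context(history, latest):
    """Compare the latest month-over-month change with past changes.

    Returns the latest change, the standard deviation of past changes,
    and their ratio (how many 'typical swings' the latest change is).
    """
    past_changes = [b - a for a, b in zip(history, history[1:])]
    typical_swing = statistics.stdev(past_changes)
    latest_change = latest - history[-1]
    return latest_change, typical_swing, latest_change / typical_swing


# Hypothetical monthly values of a dashboard metric.
history = [100, 102, 99, 101, 103, 100, 102]
latest = 104

latest_change, typical_swing, ratio = change_in_context(history, latest)
print(f"latest change: {latest_change:+d}")
print(f"typical month-to-month swing: {typical_swing:.2f}")
print(f"ratio: {ratio:.2f}")
```

In this made-up series the metric went up by 2, but past months routinely moved by more than that, so the increase is well within ordinary month-to-month noise.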
Yeah, and I want to talk about that. From what you’ve seen today in terms of the folks that you’re hiring as data scientists, do you feel—and I’m going to put you on the spot here—but do you feel that they could use a bit more statistics? Like they need to go back to school and take some basic stats courses, maybe more advanced stats courses. Is that something that you’d like to see happen more from the folks who call themselves data scientists today?
I’ll go back to my previous answer where I think that type of thinking is really valuable, dare I say necessary, to be a good data scientist. Now, do you need formal statistical training, where you understand the Central Limit Theorem backwards and forwards and you can prove it? Can you prove asymptotic theory about parameter estimates? No, you don’t need that to be a good data scientist. There’s a sweet spot there.
There’s a sweet spot where I think you can gain the value of statistical methods and thought without again having to get a PhD. I think the best data science training out there combines these methods and these points of view. One of my favorites is the Coursera series that’s taught by the faculty at Johns Hopkins, the biostatisticians from Johns Hopkins. They were one of the first online MOOCs to help teach data science, that’s still out there. Their course is just a really great collection/combination of kind of the best of all worlds. They talk about things like data skepticism and reproducible research that don’t necessarily get covered in all topics. I think the perspective that’s brought from the biostatisticians who put the program together helps to infuse all of the computational training that you need with the statistical thought and brings it together.
I think the best data science training out there is exactly incorporating both of those, and I would encourage people to seek it out if they don’t have it.
I think in summary, you need to have that mix. If you personally are a data scientist or even a data science leader, and you’re like “Man, I really could use a refresher on statistics.” The Johns Hopkins course was mentioned. I’m sure there are others out there that you should take a look at. But then also, I think if you’re a statistician and you’re on a team that has the label of data science, your perspective is very valuable. I like that lawyer analogy, thinking about the rules of the road when working with a data scientist. I think what I heard is one plus one equals three, like the two together can be very powerful.
If you’re looking to mix up your team and maybe you have a great data scientist that doesn’t have that sort of classical statistical training and background, it might be great to pair that data scientist with somebody who has a strong foundation in stats, or even get them trained up.
So, a lot of goodness there. Clearly for you, your career and your titles have sort of changed. I laughed when I heard data mining, because I remember 15-20 years ago: that was kind of the label and I haven’t heard that in such a long time. These labels come and go, and I do think that to your point, the advent of increased computational power, the cloud, no longer having to worry about sampling because gosh, we can just store as much data as we need. It’s very cheap. All these things have sort of changed the game in good ways, and it really improved our profession as a whole.
All right, let’s move on. I want to talk about EDA, exploratory data analysis. First of all, can you describe it in your own words, and then let’s dive into how to do EDA right.
EDA, what it comes down to is how do you explore and describe a dataset. It’s usually part of the beginning of the pipeline of the data science process. You want to answer a question, you get access to data, and so you want to explore that data. You want to explore it to learn from it, to try and understand the distributions of the variables that are involved. But you also want to explore it to understand things like missing data. Is there missing data in the dataset that you have? How are you going to deal with it: are you going to delete those variables, or are you going to impute them? You also want to look for outliers, and you want to look for skewed variable distributions.
This is all going to impact what you do when you ultimately get to the more fun stuff like modeling. You’ve got to do the due diligence of looking at the dataset and understanding it, to know what models are appropriate and really understand and live with your data, getting a little bit of subject matter expertise perhaps. I think the dirty little secret of data science, or at least it’s not true everywhere but it’s true in my world, is that I live in a world where I’m in a data science organization that’s centralized within a big corporation. A lot of the questions that come to us are, “Help me understand stuff from my data.”
For a lot of those questions, really, EDA might be good enough: give someone insight into their data, help them understand correlations and trends, make good plots and just help people understand their data. Hopefully and, I think, typically, that leads to more advanced modeling, the type of stuff that we would consider real data science like a random forest model, or some kind of mixed effects model, or deep learning. But a lot of times that EDA is really what the client needs. They’re not sophisticated with data. We are the experts. Sometimes they need just a really good description and visualization of the data to help them understand their customers.
It’s a really critical point of what we do as data scientists: spend time and be good at that EDA phase of analysis. Don’t rush through it so that you can get to the sexy modeling. You’ve got to spend a significant amount of time with it because there’s a lot to be learned. Based on what you learned from your EDA, you might have to go back to the source of the data and have them help explain: why does this look like that? Is this an anomaly? Is there a problem with the data collection? If you don’t do that, then the stuff that comes later isn’t going to be worthwhile because it’s going to be garbage in/garbage out. I’m always encouraging teams to spend more time than they think in the EDA phase because it’s really a critical part of what we do as data scientists.
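The checks described in this answer (missing data, outliers, skewed distributions) can be sketched for a single numeric column. This is a minimal illustration using only the Python standard library: the sample values are invented, summarize_column is a hypothetical helper, and the 1.5 * IQR outlier rule is a common convention swapped in here, not something prescribed in the conversation.

```python
import statistics


def summarize_column(values):
    """Basic EDA summary for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    n_missing = len(values) - len(present)

    mean = statistics.mean(present)
    median = statistics.median(present)
    std = statistics.pstdev(present)

    # Flag outliers with the common 1.5 * IQR rule (an assumed convention).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]

    # Moment-based skewness: a positive value indicates a long right tail.
    if std == 0:
        skew = 0.0
    else:
        skew = sum((v - mean) ** 3 for v in present) / (len(present) * std ** 3)

    return {
        "n_missing": n_missing,
        "mean": mean,
        "median": median,
        "outliers": outliers,
        "skew": skew,
    }


# Hypothetical column with two missing entries and one suspicious value.
column = [12, 15, None, 14, 13, 16, 15, None, 14, 95]
summary = summarize_column(column)
print(summary)
```

A summary like this does not decide anything by itself. It raises the questions worth taking back to the data owner: whether to drop or impute the missing entries, and whether the value 95 is an anomaly or a data collection problem.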
I couldn’t agree more. I’m curious, thinking of the garbage in/garbage out, I think all of us probably in our careers have seen models built on crummy data where the data scientists, maybe, didn’t truly understand the data. They didn’t actually validate assumptions that they were making about certain fields they were using with the business, and all sorts of different snafus. If more attention had been paid to truly understanding the data and making sure that you fully understand it before you start throwing models at it and start training your models, then a lot of time would have not been wasted etc.
In your mind, what is the output of an EDA? Is it simply just “Hey, as a data scientist, I want you to write a bunch of SQL queries and look at correlations between various variables,” or features, or columns, or whatever you want to call them? Or do you have something more formal, more structured in terms of an expectation? Is the goal really just for the data scientist to learn a lot of stuff? Or is there an actual physical output that should be consumed and can be shared with others? What is your perspective on that?
It depends on the situation. I would say that when you take on a project or you’re trying to answer a business problem and you get access to a new dataset, you go through the phase of EDA. That phase has to have a distinct end where something is presented, or delivered, or reported on. That’s usually a presentation or a deck somebody walks through, and almost always contains data visualization. Data visualization is obviously a key part of this. You want to see what the distributions look like visually, you want to look at correlations between variables visually, you want to understand basic relationships between your response variable and your independent variables before modeling. You want to look at the skew and things like that.
Having good data visualization skills is really a key part of that whole thing. I think it’s a critical part of the data scientist’s toolbox. Especially nowadays with all these great tools that we have, the Tableaus and Power BIs. I’m an R coder so I love the interactive Shiny displays that people make so that you can kind of explore and interact with the data as part of the EDA. More people have the ability to create these really nice visualizations as part of their EDA explorations. I think that really helps the whole data science pipeline.
I’m curious, and maybe the answer is no here, whether you feel BI has a role to play in EDA. There are business intelligence-type tools out there that a data scientist might use. Or should they stick with some of the open source packages?
I absolutely feel they have a role. I think BI gets a bad rap. BI gets this rap of “Oh, people are just creating dashboards. It’s not real science,” kind of thing. I’ve seen some displays of data in the BI tools that are really sophisticated with all kinds of conditioning on the right variables. Putting those things together, figuring out what variables to condition on, how to organize your thoughts, whether to make it a stacked bar or a box plot, it actually takes talent to do those things well. When you see one that’s done well, where they’re literally showing eight different dimensions of the data on one page in a way that’s digestible, that’s a skill that people have that’s very valuable to allow others to get access to learning from data. I think we should incorporate more of those tools rather than kind of push them off to the side.
Yeah, I agree. I’m going to throw this one out there to get your reaction, but I think there’s this interesting sort of symbiotic relationship between, say, a data engineer and a data scientist. We hear all the time about how much data wrangling is part of a data scientist’s job, and gosh, data engineers can be used for some of the hard, challenging, even repeatable data wrangling problems and for providing the datasets. I’ve never heard of the same relationship with somebody who’s a BI professional, like bringing in a BI professional to assist with that EDA process. My guess is that’s because you as a data scientist have to be part of the EDA process to really know which models you want to use, which features you think are important, and which data needs to be cleansed or normalized.
I do think data scientists who may not be great at building these visualizations out, since, as you said, it takes talent, could potentially partner with somebody on the BI side. I don’t know.
Yeah, I think that makes a lot of sense. There’s a lot of talk about the model of putting together pods of teams to solve data science problems where you have different people with different skill sets, and you lock them in a room or a virtual room, and then have them get the job done. I think there’s a role for someone like that in a pod as part of the EDA process. Like you, I haven’t heard that be done a lot, but I think that’s an interesting model to pursue.
Great. I think there’s an emphasis there on EDA and on some of the things we discussed about creating an artifact, i.e., building out a deck. I imagine that this deck is a sort of phase one output to, say, your business users, or the rest of your team. Is this something that you typically see being presented? Or is it simply an artifact where the data scientists check the box: “I’ve done it. I feel like I’ve learned as much about the data as I need before I move on to the next step”?
Yeah, it’s got to be a living part of the process. We talk about the data science pipeline, but we all know it’s not a linear path from A to Z. You start a project, you get data, you learn from it, you do EDA, you start modeling. But then you learn something when you do the modeling that kind of feeds back into how you do the data, and the whole process is very circular and iterative. It’s never going to be a dead document that goes into a drawer somewhere because you’re going to need to revise it and revisit it over time as that pipeline keeps spinning around and around. It definitely should be a living document.
A good EDA can get fed into a data monitoring platform. You want to make sure that your data is not drifting, or there’s not significant change points. You want to feed the things you learned in the EDA into some kind of monitoring system that the engineers might use to make sure that as whatever process you’re working with, you can be alerted if something goes haywire.
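One way to picture feeding EDA findings into monitoring is to store a baseline summary of a feature and compare each new batch against it. The sketch below is an assumption-laden illustration, not a description of any particular monitoring platform: the values are invented, build_baseline and drift_alert are hypothetical helpers, and the rule (flag a batch whose mean is more than three standard errors from the baseline mean) is just one simple choice among many.

```python
import math
import statistics


def build_baseline(values):
    """Store the summary statistics learned about a feature during EDA."""
    return {"mean": statistics.mean(values), "std": statistics.stdev(values)}


def drift_alert(baseline, batch, threshold=3.0):
    """Return True if the batch mean is far from the baseline mean.

    'Far' means more than `threshold` standard errors, where the standard
    error uses the baseline standard deviation and the batch size.
    """
    standard_error = baseline["std"] / math.sqrt(len(batch))
    z = (statistics.mean(batch) - baseline["mean"]) / standard_error
    return abs(z) > threshold


# Hypothetical feature values seen during EDA, then two later batches.
baseline = build_baseline([10, 12, 11, 9, 10, 11, 12, 10, 9, 11])
stable_batch = [10, 11, 10, 12, 9]
shifted_batch = [15, 16, 14, 15, 17]

print("stable batch alert: ", drift_alert(baseline, stable_batch))
print("shifted batch alert:", drift_alert(baseline, shifted_batch))
```

The stable batch stays quiet and the shifted batch raises an alert. In practice the thresholds and the statistics tracked would come out of what the EDA revealed about each input.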
Yeah, that’s a great point. It’s interesting, we typically talk about EDA within the first couple of steps. The first step might be a hypothesis or a thought before you dive into EDA. The last step is monitoring the model once it’s out in production. I didn’t think we would be talking about the last step in the whole data science lifecycle and how EDA plays a role there, but you’re right, it absolutely does. If you’re working with your ML engineers on that last mile problem, having an artifact that they can read over so that they can monitor the inputs into your model and look for drift, and things of that nature, I think makes perfect sense.
This has been great. Chris, I really appreciate you taking the time. We’ve covered a lot of topics today. You talked a bit about statistics and machine learning. We also talked about EDA and the important role there. I really thank you for being on the Data Science Leaders podcast. This has been a lot of fun.
Yeah, I had a great time. It’s good talking to you, Dave.
Great. If people want to get in touch with you, I assume they can reach out to you via LinkedIn?
I’m on LinkedIn. I’m on Twitter. My handle is @StatPumpkin. Don’t ask me why. It just happened.
@StatPumpkin, where did that come from, Chris?
There’s not an exciting story there. It was just one of those things where I was looking for a handle and I was just mashing words together that I was a statistician and I like pumpkins. It happened to be October when I created it. Orange is my favorite color, so it just kind of happened. I didn’t expect that to be the handle that would stick with me for 20 years. I kind of just thought it would be a fun thing that I would create. But here I am, @StatPumpkin.
@StatPumpkin. Boom. All right, well if you want to get in touch with Chris, @StatPumpkin on Twitter. Thanks again, and I hope you have a great rest of your week.
Take care. Thanks for the conversation.
About the show
Data Science Leaders is a podcast for data science teams that are pushing the limits of what machine learning models can do at the world’s most impactful companies.
In each episode, host Dave Cole interviews a leader in data science. We’ll discuss how to build and enable data science teams, create scalable processes, collaborate cross-functionally, communicate with business stakeholders, and more.
Our conversations will be full of real stories, breakthrough strategies, and critical insights—all data points to build your own model for enterprise data science success.