In this post, Josh Poduska, Chief Data Scientist at Domino Data Lab, advocates for a common taxonomy of terms within the data science industry. The proposed definitions enable data science professionals to cut through the hype and increase the speed of data science innovation.
The noise around AI, data science, machine learning, and deep learning is reaching a fever pitch. As this noise has grown, our industry has experienced a divergence in what people mean when they say “AI”, “machine learning”, or “data science”. It can be argued that our industry lacks a common taxonomy. If there is a taxonomy, then we, as data science professionals, have not done a very good job of adhering to it. This has consequences. Two consequences include the creation of a hype-bubble that leads to unrealistic expectations and an increasing inability to communicate, especially with non-data science colleagues. In this post, I’ll cover concise definitions and then argue how it is vital to our industry that we be consistent with how we define terms like "AI".
- Data Science: A discipline that uses code and data to build models that are put into production to generate predictions and explanations.
- Machine Learning: A class of algorithms or techniques for automatically capturing complex data patterns in the form of a model.
- Deep Learning: A class of machine learning algorithms that uses neural networks with more than one hidden layer.
- AI: A category of systems that operate in a way that is comparable to humans in the degree of autonomy and scope.
Our terms have a lot of star power. They inspire people to dream and imagine a better world which leads to their overuse. More buzz around our industry raises the tide that lifts all boats, right? Sure, we all hope the tide will continue to rise. But, we should work for a sustainable rise and avoid a hype bubble that will create widespread disillusionment if it bursts.
I recently attended Domino's rev conference, a summit for data science leaders and practitioners. I heard multiple leaders seeking advice on how to help executives, mid-level managers, and even new data scientists have proper expectations of data science projects without sacrificing enthusiasm for data science. Unrealistic expectations slow down progress by deflating the enthusiasm when projects yield less than utopian results. They also make it harder than it should be to agree on project success metrics and ROI goals.
The frequent overuse of “AI” when referring to any solution that makes any kind of prediction has been a major cause of this hype. Because of frequent overuse, people instinctively associate data science projects with near perfect human-like autonomous solutions. Or, at a minimum, people perceive that data science can easily solve their specific predictive need, without any regard to whether their organizational data will support such a model.
Incorrect use of terms also gums up conversations. This can be especially damaging in the early planning phases of a data science project when a cross-functional team assembles to articulate goals and design the end solution. I know a data science manager that requires his team of data scientists to be literally locked in a room for an our hour with business leaders before he will approve any new data science project. Okay, the door is not literally locked, but it is shut, and he does require them to discuss the project for a full hour. They’ve seen a reduction in project rework as they’ve focused on early alignment with business stakeholders. The challenge of explaining data science concepts is hard enough as it is. We only make this harder when we can’t define our own terms.
I’ve been practicing data science for a long time now. I’ve worked with hundreds of analytical leaders and practitioners from all over the world. Since AI and deep learning came on the scene, I’ve increasingly had to pause conversations and ask questions to discover what people really mean when they use certain terms. For example, how would you interpret these statements which are based on conversations I’ve had?
- “Our goal is to make our solution AI-driven within 5 years.”
- “We need to get better at machine learning before we invest in deep learning.”
- “We use AI to predict fraud so our customers can spend with confidence.”
- “Our study found that organizations investing in AI realize a 10% revenue boost.”
One has to ask a series of questions to be able to understand what is really going on.
The most common term-confusion I hear is when someone talks about AI solutions, or doing AI, when they really should be talking about building a deep learning or machine learning model. It seems that far too often the interchange of terms is on purpose, with the speaker hoping to get a hype-boost by saying “AI”. Let’s dive into each of the definitions and see if we can come to an agreement on a taxonomy.
First of all, I view data science as a scientific discipline, like any other scientific discipline. Take biology, for example. Biology encompasses a set of ideas, theories, methods, and tools. Experimentation is common. The biological research community is continually adding to the discipline’s knowledge base. Data science is no different. Practitioners do data science. Researchers advance the field with new theory, concepts, and tools.
The practice of data science involves marrying code (usually some statistical programming language) with data to build models. This includes the important and dominant initial steps of data acquisition, cleansing, and preparation. Data science models usually make predictions (e.g., predict loan risk, predict disease diagnosis, predict how to respond to a chat, predict what objects are in an image). Data science models can also explain or describe the world for us (e.g., which combination of factors are most influential in making a disease diagnosis, which customers are most similar to each other and how). Finally, these models are put into production to make predictions and explanations when applied to new data. Data science is a discipline that uses code and data to build models that are put into production to generate predictions and explanations.
It can be difficult to craft a definition for data science while, at the same time, distinguishing it from statistical analysis. I came to the data science profession via educational training in math and statistics as well as professional experience as a statistician. Like many of you, I was doing data science before it was a thing.
Statistical analysis is based on samples, controlled experiments, probabilities, and distributions. It usually answers questions about likelihood of events or the validity of statements. It uses different algorithms like t-test, chi-square, ANOVA, DOE, response surface designs, etc. These algorithms sometimes build models too. For example, response surface designs are techniques to estimate the polynomial model of a physical system based on observed explanatory factors and how they relate to the response factor.
One key point in my definition is that data science models are applied to new data to make future predictions and descriptions, or “put into production”. While it is true that response surface models can be used on new data to predict a response, it is usually a hypothetical prediction about what might happen if the inputs were changed. The engineers then change the inputs and observe the responses that are generated by the physical system in its new state. The response surface model is not put into production. It does not take new input settings by the thousands, over time, in batches or streams, and predict responses.
My data science definition is by no means fool-proof, but I believe putting predictive and descriptive models into production starts to capture the essence of data science.
Machine learning as a term goes back to the 1950s. Today, it is viewed by data scientists as a set of techniques that are used within data science. It is a toolset or a class of techniques for building the models mentioned above. Instead of a human explicitly articulating the logic for a model, machine learning enables computers to generate (or learn) models on their own. This is done by processing an initial set of data, discovering complex hidden patterns in that data, and capturing those patterns in a model so they can be applied later to new data in order to make predictions or explanations. The magic behind this process of automatically discovering patterns lies in the algorithms. Algorithms are the workhorses of machine learning. Common machine learning algorithms include the various neural network approaches, clustering techniques, gradient boosting machines, random forests, and many more. If data science is a discipline like biology, then machine learning is like microscopy or genetic engineering. It is a class of tools and techniques with which the discipline is practiced.
Deep learning is the easiest of these terms to define. Deep learning is a class of machine learning algorithms that uses neural networks with more than one hidden layer. Neural networks themselves date back to the 1950s. Deep learning algorithms have recently become very popular starting in the 1980s, with a lull in the 1990s and 2000s, followed by a revival in our decade due to relatively small tweaks in the way the deep networks were constructed that proved to have astonishing effects. Deep learning can be applied to a variety of use cases including image recognition, chat assistants, and recommender systems. For example, Google Speech, Google Photos, and Google Search are some of the original solutions built using deep learning.
AI has been around for a long time. Long before the recent hype storm that has co-opted it with buzzwords. How do we, as data scientists, define it? When and how should we use it? What is AI to us? Honestly, I’m not sure anyone really knows. This might be our “emperor has no clothes" moment. We have the ambiguity and the resulting hype that comes from the promise of something new and unknown. The CEO of a well known data science company was recently talking with our team at Domino when he mentioned “AI”. He immediately caught himself and said, "I know that doesn’t really mean anything. I just had to start using it because everyone is talking about it. I resisted for a long time but finally gave in.”
That said, I’ll take a stab at it: AI is a category of systems that people hope to create which have the defining characteristic that they will be comparable to humans in the degree of autonomy and scope of operation.
To extend our analogy, if data science is like biology and machine learning is like genetic engineering, then AI is like disease resistance. It’s the end result, a set of solutions or systems that we are striving to create through the application of machine learning (often deep learning) and other techniques.
Here’s the bottom line. I believe that we need to draw a distinction between techniques that are part of AI solutions, AI-like solutions, and true AI solutions. This includes AI building blocks, solutions with AI-ish qualities, and solutions that approach human autonomy and scope. These are three separate things. People just say “AI” for all three far too often.
- Deep learning is not AI. It is a technique that can be used as part of an AI solution.
- Most data science projects are not AI solutions. A customer churn model is not an AI solution, no matter if it used deep learning or logistic regression.
- A self driving car is an AI solution. It is a solution that operates with complexity and autonomy that approaches what humans are capable of doing.
Remember those cryptic statements from above? In each case I asked questions to figure out exactly what was going on under the hood. Here is what I found.
- An executive said: “Our goal is to make our solution AI-driven within 5 years.”
The executive meant: “We want to have a couple machine learning models in production within 5 years.”
- A manager said: “We need to get better at machine learning before we invest in deep learning.”
The manager meant: “We need to train our analysts in basic data science principles before we are ready to try deep learning approaches.”
- A marketer said: “We use AI to predict fraud so our customers can spend with confidence.”
The marketer meant: “Our fraud score is based on a logistic regression model that has been working well for years.”
- An industry analyst said: “Our study found that organizations investing in AI realize a 10% revenue boost.”
The industry analyst meant: “Organizations that have any kind of predictive model in production realize a 10% revenue boost.”
Whether you 100% agree with my definitions or not, I think we can all agree that there is too much hype in our industry today, especially around AI. Each of us has seen how this hype limits real progress. I argue that a lot of the hype is from misuse of the terms of data science. My ask is that, as data science professionals, we try harder to be conscious of how we use these key terms, and that we politely help others who work with us learn to use these terms in the right way. I believe that the quicker we can iterate to an agreed-upon taxonomy and insist on adherence to it, the quicker we can cut through hype and increase our speed of innovation as we build the solutions of today and tomorrow.