Data Science Leaders | Episode 29 | 36:50 | November 30, 2021
To Patent or Not to Patent? How to Weigh the Options for Your Team
Should your team patent its data science work? With open source such an important part of the data science community, patents almost seem antithetical to the ethos of the field itself.
But it turns out, there are some very good reasons to pursue data science patents in business.
In this episode, Kli Pappas, Associate Director of Global Analytics at Colgate-Palmolive, shares his team's process for deciding whether to patent an algorithmic process—and what benefits it can bring. Plus, he talks about why a statistical background is so important for teams that generate data.
- The transition from getting a PhD in chemistry to the analytics world
- Finding the balance between statistical and computer science backgrounds
- Why you should patent your data science work and how to do it
Hello, and welcome to another episode of the Data Science Leaders podcast! I’m your host, Dave Cole and today's guest is Kli Pappas. How are you doing today, Kli?
Great. How are you doing, Dave?
Great. So, Kli, your background: you're an Associate Director of Global Analytics at Colgate-Palmolive. You also have a PhD in Chemistry from Princeton. I'll dive quickly into how you segued your career from chemistry into being head of an analytics team. Before we do that, I always like to sort of tease the agenda at the outset, here.
We're going to be talking about how to balance your team, between the traditional stats background that we see amongst analytic teams and a software engineer background. We'll go to our meaty topic, a unique one, which is patenting your work—why, as a data science leader, you might want to consider filing for a patent.
How did you go from getting a PhD in Chemistry, which is not for the faint of heart, to moving more into an analytics world here?
Yeah, it's funny. I always had an interest in computer science. I started off as a computer science major in college and I learned some Visual Basic—no one writes Visual Basic anymore.
Exactly. But that doesn't apply too much these days.
For good reasons, I'm sure. During my PhD time, there was a lot of work going on in chemistry as a whole: trying to figure out how to use machine learning specifically in neural networks, to learn chemical reactions. If you look at chemistry, it's still very much done the same way it's been done for 200 years. You run an experiment, see what the outcome is, change a variable and run another experiment, iterating as you go.
There's a lot of work that's been going on, to try to collect the corpus of experiments that have been done. The aim is to do some modeling work, to enable the prediction of outcomes to design catalysts that make better drug molecules. That piqued my interest. When I started working at Colgate, it was just about the time that data science was really coming into itself as a field. I really saw an opportunity there, to do something very similar on the Colgate side, by focusing on the massive amount of product development work that goes on—which most people don't realize.
You see a product on the shelf. What's behind that product is usually thousands of experimental products that a person actually made in a lab at the company, and iterated on. There's a tremendous amount of information there, in figuring out how to use that data so that scientists can predict the properties of new products, before they go into the lab and experiment on it. That kicked the whole thing off.
Describing chemistry as being a process by which you experiment, change a variable and then wait for the chemical reaction, is a simple explanation for what we do in the data science world, at scale. So at a very high level, there's a high amount of parallels there.
It sounds like that's exactly what you've done. What you're doing today, within your team, is trying to keep those real-world experiments to a minimum and work on the ones with the highest probability of success. Does that sound about right to you?
Yeah. I think there's an interesting transition there, to the first topic that we wanted to talk through: data science is a science. Sometimes we forget the "science" part of the term, right? What we forget, and you see this a lot with people who come into data science from the computer science or IT fields, is that data science should be hypothesis-driven research. It's not just about data manipulation. It's about having a hypothesis about what the real world is like, then doing experiments on your data to confirm or reject that hypothesis.
That's a concept that comes naturally to some people from the sciences. It’s a conceptual framework, that sometimes you see people who are coming over from IT or computer sciences miss. That's why it's a natural fit, I think. You see a lot of people with backgrounds in physics, chemistry, biology, moving over to data science. They have that notion that, really, this is a science and it's about hypothesis-driven research, falsifiable hypothesis-driven research. That’s what science is.
Hypothesis-driven research, where you're using data to validate your hypothesis, having a strong statistical background, makes a lot of sense. What I've always found to be a commonality amongst PhD candidates in the hard sciences, is that they tend to have a very strong stats background. Is that your experience as well?
Yeah. Agreed. Some of the technical details are sometimes missing, but the conceptual framework around why, where and how you would apply statistics, variability and data provenance is baked into people with a background in hard science. One of the things that you get from coming out of the hard sciences is that you generate data and then you analyze it.
If you just come into data science from an outside field, data is a received truth. It's like the golden tablets come down and the data is what it is. I think there's sometimes a miss from data scientists, thinking about what the provenance of the data is. What did it actually look like at the time people were generating it, whether generating it means running an experiment or clicking a button on a website?
That insight into the provenance of the data is critically important to formulating hypotheses about it. People with a science background know that because they do experiments. That's also why, if you look at the way data science is typically described, you have this Venn diagram of three things: computer science, statistics and domain knowledge.
To me, the domain knowledge part of it is really getting at how well you understand where the data came from and how it was generated. That's really going to inform your analysis strategy.
Right. You might think of domain knowledge as also truly understanding the business problem. If you truly understand the business problem, you also truly understand how the data came to be and what the various fields actually mean in the real world, and so on. So, we’ve talked about folks in the hard sciences, who have PhDs, that tend to have a strong statistical background. I imagine you also look to hire people who actually have a PhD in stats, and things of that nature.
You also balance out your team with folks who have more of your typical computer science/software engineer type of background. How do you see those two roles and backgrounds evolving in your team?
The best quote that I've seen, regarding statistics, computer science and data science, is that you must have an appreciation of both and a mastery of one. As a hiring manager, it’s always the case that nobody has that perfect Venn diagram. You have people who, by background or disposition, sit more on the stats side and people who sit more on the computer science side.
Sometimes you can see that immediately, when you're interviewing people, or looking at their work. Is this person using R or are they using Python? Are they writing notebooks or leaning more on writing .py files and structured code? Sometimes it comes out directly.
In terms of balancing those two roles, there are really a couple of things that I think about. One is the maturity of the analytics organization itself. If you think about an organization at a very immature state on the analytics continuum, a lot of the work that's going to drive a ton of value is not deep statistical work. It's relatively simple things: arithmetic stuff, or slicing and dicing of data and scaling it.
If your organization is early on the analytics curve, that will drive a ton of value and you probably don't have a lot of people, so you need people who can have a big impact and scale what they're doing. Further along on the continuum, when that low-hanging fruit has been taken, that's when the meatier problems come up, where you can start to support ad hoc projects and you need some people with more of a statistical background to dig in.
The second thing I think about is the type of problem that you work on. Is data received as truth or is data something that is generated? If you're looking at an organization that has research and development within it—pharma or consumer packaged goods—a lot of the data is generated internally.
Oftentimes, you try to build a model or something and it's not all the way there. You have people who will ask you what data they should be generating to make the model better. That’s not a question that data scientists who come from the CS side are used to handling.
Data is this big thing that's out there and you get it and it is what it is. It is a question that statisticians are very well versed at answering. There's a whole subfield of statistics called design of experiments, which is all about how to design well-balanced experiments. If those types of questions apply to what you do—you have people in your organization, running a clinical study or doing a consumer test, who are going to be generating data—then people with a statistics background are really best situated and have firmer training in how to make sure that the data that gets generated is well balanced.
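To make the design-of-experiments idea concrete, a full-factorial design is the simplest balanced design: every combination of factor levels is run exactly once, so no level is over- or under-represented. A minimal sketch in Python (the factor names and levels here are hypothetical, purely for illustration):

```python
from itertools import product
from collections import Counter

# Hypothetical factors a formulation scientist might vary.
factors = {
    "active_pct": [0.5, 1.0, 1.5],   # active-ingredient concentration
    "flavor": ["mint", "cinnamon"],
    "thickener": ["A", "B"],
}

# Full-factorial design: every combination of levels appears exactly once.
runs = [dict(zip(factors, combo)) for combo in product(*factors.values())]

print(len(runs))  # 3 * 2 * 2 = 12 experimental runs

# The design is balanced: each flavor appears in exactly half the runs.
flavor_counts = Counter(run["flavor"] for run in runs)
print(flavor_counts)
```

In practice, statisticians reach for fractional-factorial or optimal designs when the full grid is too expensive to run, but the balance property is the same idea.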
On my team, we do a decent amount of work in that area because we work with our global technology team and support the clinical function and consumer testing. In those cases, data provenance is very important. Scientists can go run the experiments that we ask them to run, to build out a more robust data set. My statistics people are great with doing that.
I should say there's kind of a third and fourth category of people. We talked about people who studied stats and then learned to code. Those are your statistical data scientists. You have people who studied CS and learned statistics or data science. They did a master's degree or a bootcamp. Those are your CS data scientists.
There's a third group of people, who I call “passionate converts.” That's the group that I would put myself in. They studied almost anything. Some have an advanced degree in the hard sciences. Some come from engineering, a totally unrelated field. Just by disposition, they know how to code and ask questions. They did a master's degree or sometimes all they did was a bootcamp. A lot of those people are very good if you get them in and provide the right training because they have the passion to ask questions. I give a lot of credit to people who change careers and fields. If you've been in a field for five to 10 years and you leave that field entirely, do a data science bootcamp or get a master's degree to switch fields, you're inherently, to me, a pretty special person. Not a lot of people do that.
I have a lot of love for people in that passionate convert area because they're curious thinkers. They know how to self-learn which, in data science, is 90% just looking stuff up, reading and teaching yourself how to do things. That’s the third bucket. There's a fourth bucket that I don't have a lot of experience with, that's new, which is people who just study data. They have a Bachelor's degree in Data Science.
Because everyone knows data science is super hot now, the sexiest job of the decade or whatever, a lot of universities now have data science undergraduate programs. It's pretty nascent. There are lots of questions there around whether they actually understand a domain or is their domain just slicing and dicing of data? That’s a new area.
In that fourth bucket, you have a Bachelor's in Data Science, you're probably in your early to mid-20s, relatively new to this career. It is interesting. Right? Compare somebody like that, 10 years from now, to the passionate convert who didn't start in any formal data science training, probably had a more well-rounded background and purposefully went into data science with a passion for it. It'll be interesting to compare and contrast a few years from now, to see how they will differ.
My personal take is that the passionate convert is going to be extremely deep on the domain expertise side but might not have the formal stats or programming experience. To your point, so much of that can be learned in some way: on the job or through more formal online training as well. I would put myself in that bucket, for the record. I'm glad to hear that you're in that bucket too.
That was a fascinating conversation. I want to switch gears abruptly and talk about the patent that you filed. Briefly describe what it is and then we can talk about the process. I think what everyone is probably interested in, including myself, is the why.
Initial disclaimer here: I am not an attorney, I have a PhD in Chemistry. I definitely don't speak on behalf of the attorneys at my company, but I have been through this. I have more than 10 granted patents. I've been through the patent process a lot of times, both in the chemistry side and in product development. Some recent patents relate to data science.
I've spent a lot of time thinking about what it means to patent work on the leadership side: working with ideas and thinking about ideas in a way to get ownership over them. Not just ownership, I guess. The other thing to start with here is the elephant in the room when it comes to patenting, as it relates to data science. There is this notion that the patent system itself is antithetical to the ethos of open source data science.
You're a traitor.
Yeah. Right. You're taking these ideas and that's not what data science is. I don't think that that's the case. I'll try to talk you through why I think that that's not the case, by way of background, the thing that you were referencing before. We recently filed a patent on a process to take the massive amount of historical product formulation data that Colgate has.
I'm talking about formulations for hundreds of thousands of products, 1% of which go to market: all the ingredients that go into them and all of the properties of the product; what it felt like when we ran a consumer test; what the clinical efficacy was; the pH; active ingredient stability; flavor spoilage; thickness; grittiness; color stability in things like toothpaste, body wash, or hand soap.
The process takes all of that information and uses it to build an algorithm that can take, just digitally, a list of ingredients and then predict what the properties of that product would be: consumer attributes, clinical attributes and physical properties. But you can't patent an algorithm, so really it's about the process.
We’re taking that and reducing it to practice, where you make a product then measure it and use the outcome to feed back into the algorithm and process. We actually won the Edison Patent Award from the patent council of New Jersey. We won an award in a segment there for the work, because it's really the first patent of its kind in the CPG industry, using models in this way to develop an overall process.
Taking a step back, where did this come from and how did we think about it? When you look at the intellectual property related to the work that data scientists do, there are really three ways that you can go. You have a copyright, so you can copyright your code. Most code that's out there, like on GitHub, is copyrighted.
The important distinction for copyright is that it protects the expression and not the idea itself. The expression would be the text of the code but not the idea behind what it does. NumPy or SciPy, for example, do numerical transformations. The copyrights on those are not copyrighting the idea of a distance matrix; they're copyrighting the actual code that does it. So that's one avenue you can go down.
The other avenue is trade secret. I think, by far, the majority of code that gets written in data science, is a trade secret. Most companies with data scientists working on stuff, are not putting all of their work on GitHub publicly. That’s sort of strike number one against the ethos that data science is meant to be open. How much code do you think, written by data scientists, actually goes into open source? So trade secrets are a route that you can go down.
Open source is another path you can take, which is interesting. There are three requirements for something to be considered open source. It has to be publicly accessible. People have to be able to use it without royalties. And you have to be able to make derivatives of it. But even open source, the foundation of modern data science, like those open source libraries in Python, is all copyrighted.
They have different licenses associated with them that say how you can and can't use them. Something like NumPy is copyrighted open source software. And some licenses say that if you make derivatives, your license has to be at least as open as theirs. That's called a copyleft license. Even the open source community has stuff that is protected by copyright, with rules around how you can use it. You can read about the different types of licenses.
Many people are not aware of that in data science. If you're a data science leader, you should make yourself generally aware of the packages that your team is using. Go to GitHub and check the licenses to make sure that you can use them in the way that you're using them. You can get yourself into trouble if you don't do that. The big things are usually fine but it's worth looking at.
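As a first pass before reading the full license text on GitHub, you can inspect the license metadata that installed Python packages declare about themselves. A rough sketch using the standard library's `importlib.metadata` (the `License` field is free text and many packages only declare a `License ::` classifier, so this is a starting point, not legal advice):

```python
from importlib.metadata import distributions

def installed_licenses():
    """Map each installed package name to its declared license info."""
    out = {}
    for dist in distributions():
        meta = dist.metadata
        # Some packages fill the free-text License field; others only
        # declare a "License ::" trove classifier. Check both.
        classifiers = [c for c in (meta.get_all("Classifier") or [])
                       if c.startswith("License ::")]
        out[meta["Name"]] = meta["License"] or "; ".join(classifiers) or "unknown"
    return out

for pkg, lic in sorted(installed_licenses().items(), key=lambda kv: str(kv[0])):
    print(f"{pkg}: {lic}")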
That brings us to the last thing, which is patenting. What does it mean and when would you patent data science work? Why do patents have a bad name? There are patent trolls and lots of negative connotations around what patents are and what they do. Patents have been around for a long time. The patent laws in our constitution say that Congress can give out patents to people.
What a patent really is, and how this ties into the ethos of data science, is not ownership of an idea. A patent is an exchange. It's an agreement between you and the government that you will disclose your idea in full detail. You will publicly disclose it and, in exchange, you get limited exclusivity to use it. Patents usually last 20 years.
A patent is a mechanism for the government to make sure that ideas get put out there in the world. To incentivize people to put ideas out there, they give you limited exclusivity for 20 years over that idea. Then it becomes totally open, public domain, entirely for anyone to use that idea.
Patenting in this space is a little bit complicated because you can't patent math. You cannot file a patent on an algorithm. An algorithm is a fact. There's a notorious part of patent law, called Section 101, which asks: is the subject matter patentable? You cannot patent a math equation.
Here's an example: you can't patent a neural network but you can patent a neural network that creates music. You're taking the algorithm, reducing it to practice and making something useful from it. There's a hurdle that you have to cross when you're patenting something, which is the non-obviousness hurdle. The thing that you're patenting needs to not be obvious to someone skilled in the art.
The patent examiner on that one deemed that it was not obvious that you could take an algorithm and make a Mozart, for example. There's a patent out there in the world. You can look it up. Someone patented that. So the consideration for patenting, really, is whether there’s an idea behind your algorithm. Are you reducing it to practice in a practical way? Are you making something from it? That's usually the step that needs to be there.
You need to have the algorithm and produce a tangible output from it. What you're patenting in that case is not the algorithm but the process. It's the process of using the algorithm, to come up with a meaningful outcome. Take my case, for example. We have an algorithm that predicts the properties of products. The process is like producing the product from it.
Because someone could use the algorithm, that's fine. We haven't patented the algorithm itself, but we're patenting the useful application of that to make a product. You need to make sure that's there if you're thinking about patenting something. There has to be some utility or reduction to practice, and some non-obvious part of it.
What it looks like in practice is that you would typically work with a patent attorney and put together a draft of what the idea is. You would try to frame it in such a way that they understand what the process is. What is the idea behind it? From there, it's really the patent attorney's job to make sure that it checks the boxes for what is a patentable invention. Then that gets filed.
The amount of time that it takes is extremely variable. Typically, it takes years from filing to granting; it can take more than five years. The US Patent Office will let you fast-track patents. You pay extra money and then it gets more immediate consideration. That's another route to go down. In that case, I think they guarantee that it will be examined in less than a year. That's sort of what it looks like in practice, to take your work and patent it.
This has been fascinating in terms of understanding how the whole patent process works as it applies to data science, including some general rubrics like non-obviousness, which determines what is patentable. The burning question here for me is why? Why would you go about patenting a process, specifically the process that you designed? I imagine it's for competitive, defensible reasons.
Most companies and teams would lean towards trade secrets for things. With patents, you're fully disclosing what you're doing. Patents are enforceable only when someone sues to enforce them. It's not like a patent stops people. It doesn't fly in the window and break your keyboard when you're doing something. You have to find out that someone is infringing on it and then decide to sue them for it, then have all the litigation around it.
There are lots of reasons not to patent things. How are you going to enforce it? Would you want to enforce it? Would you want to pay the money that it would take to do it? So why do it? There are a couple of considerations that come in. One is being able to talk about your work at any level of detail. Say you copyright a piece of code. That's great. You can go out and talk about the idea behind your code, and no one can steal the implementation based on your talk, because they don't know exactly what you did.
Your code is a trade secret and you have a copyright and you keep it within your company. So great, you go out and talk about it. When the code behind your invention is not special, right, when anyone could do it, when you could go out to a conference and talk about the idea and then anyone else can write their own code and do it, then that's problematic. In that case, the code is not what's special, it's the idea that's special.
That’s reason number one, why you might want a patent. You want yourself or your team to be able to go out into the world and talk at conferences, to investors, shareholders, whoever it might be, about this great idea that you guys have, that you're using to make money for your business, better connect with your consumers or people around the world. That idea itself, you don't want anyone to take, but you want to be able to talk about it.
The converse side to that is purely defensive. Maybe you don't want to patent the idea because you don't want to disclose it, but you're kind of worried that someone else might come along if you keep it as a trade secret. This is a general issue with trade secrets. Right? Coca-Cola can keep the recipe of Coca-Cola a trade secret, but if someone else discovers what it is, they can patent it and then they have the patent for it.
You could be doing something for 20 years and keeping it as a trade secret. Someone else patents that and then you're now excluded from doing it. The patent law recently changed in the U.S. on this. It used to be the first person to invent something. You had wars that would go on over patents, where someone would grab their lab notebook and bring it and say, "As you can see, in my lab notebook, I did this experiment in 1982."
Then it's like, "Okay, well actually you don't have the patent. This guy did it in 1982." So the US patent system switched over to what's called “first-to-file,” which means it does not matter anymore who was the first person to make the discovery or come up with the idea. The only thing that matters is who was the first person to file for the patent. They did that because of all the complexities associated with deciding who was the first to invent.
You have cases going to court and scribbled stuff on pieces of paper. Did you discover the idea? So now it's first to file, which means you have an idea, you keep it a trade secret. I have the idea. Maybe I will talk to you somewhere, or whatever it is. I file a patent on it. Now it's my idea. I own it. You can't do it anymore. So there's a defensive side to it which is, if you think you could run into that situation, that would be kind of reason number two, to consider patenting, if it seems reasonable to you that this is a really important idea.
The source code behind it is generic; there are 100 different ways someone skilled in the field could do it. If you don't want to be blocked from using your idea in the future, then you should consider patenting. Say you file the patent and the patent gets rejected. That still protects you, because the act of filing the patent puts that idea into the public domain. So no one else can patent that idea anymore.
Right. You're first to file. You're first to file the rejected patent.
Well, yeah, and you don't have the idea, but now it's public domain. It's not an original idea for anyone else. If someone else comes along, they can't patent the idea. Just filing the application puts it in the public domain, protects you from someone else taking it. If the patent gets granted, you have exclusivity. If it doesn't get granted, you're at least not blocked from using the idea. So that's another route you can go down.
There are other ways to put things into the public domain, by writing an article out of it. Putting something into the public domain through a patent application, is a nice route to go down because maybe you will get the patent granted. Then, not only have you prevented anyone else from saying it's their idea, but you also now have exclusivity over it because the patent office thought about the hurdles for a patent.
That's number one and number two reasons to do it. To be able to talk about it and to protect yourself. The last reason, which I think should be the last consideration, is getting exclusivity over it. Even though that's the purpose of the patent system, there are a lot of complications there. How do you know who's doing what, and whether or not your organization or legal team will want to go after a patent?
You don't strike me as the type of guy to go after all your competitors, willy nilly.
We don't do that. It's not generally a good thing to do. It's not something that anyone in the data science field really does. So that's the last consideration which is, I think the opposite of when most people think about patents. That would be number one on the list. Well, why would I patent something? I want to make sure no one else does it. For me, that's number three on the list. Number one, is so that you can talk about it and feel safe talking about it. And number two, is so that no one else says it's their idea, when it was your idea.
Those first two totally resonate with me. When I heard you say, "I want to be able to talk about it openly," I wondered why you couldn’t just have them sign an NDA. I imagine an NDA just means they're not allowed to disclose it to others. It doesn't mean that they can't necessarily turn around and file a patent.
Maybe that would violate the NDA. I don't know. I'm not a lawyer, but the idea that you're worried about your innovation from a process standpoint, that something else might preempt you and file it, thus preventing you from rolling out this innovative process, makes a lot of sense to me for sure.
That is, for me, the major consideration.
Last thing I have on this topic that I'm curious about...are there any lessons learned? If you had to do it all over again, are there things that you might tweak?
On the patenting side?
All sorts of stuff. It took me a long time to realize, after having the team, that this is something we need to think about with everything that we do. Now that it's on my mind, every time that we generate original work, it's a conversation we have: is this an original idea? Does this already exist out there in the world? If it doesn't then we need to have a conversation. That's number one.
Number two is going back to where we started, which is the ethos of data science: talking to the people on your data science team about what this is really about, why we're thinking about patenting things, to not leave a sour taste in anyone's mouth about it. Most people think it’s cool to get a patent and then in the back of their head they’re thinking about all the open source elements that they use.
Now, Facebook has Prophet and LinkedIn has its own time series library. All these big companies have open source projects that they're rolling out into the world. It seems to go against the ethos. Have the conversation with your team over this being about public disclosure. It’s actually about making sure that other people don't say they had your idea and take it away from you so you can't exercise it anymore.
Understand that it's out there in the arsenal, alongside copyrighting, open sourcing and trade secrets. It’s not automatically about reverting to a trade secret. In many cases the patent, or at least the patent application, route is there. Make sure that the team understands what this is about, why it's important for them and for the work that they do.
Yeah. Well, Kli, that was fascinating. I knew very little about the patent process. In my early years in my career, it was a point of pride to have X number of patents. My vision of that person is that they're this sort of mad scientist and innovator. These days it's fairly rare but it's good to understand why and when you should use it and how it is applicable.
In this world of open source, there is a place for the patent process. I think you did a great job of summarizing when and how to think about it. Maybe the data science teams out there should be thinking about this as they do their work.
Well, this has been great.
I really appreciate you taking the time to be here. If people want to get in touch with you and learn directly, can they link up with you on LinkedIn?
Yeah, definitely. Whether you're new to the field, or in the field already and you want to share ideas, the best way to find me is on LinkedIn. Send me a direct message. Would love to talk to anyone listening who's interested.
Great. Thanks again, Kli, and have a great rest of your week!
You too Dave. Thank you.
About the show
Data Science Leaders is a podcast for data science teams that are pushing the limits of what machine learning models can do at the world’s most impactful companies.
In each episode, host Dave Cole interviews a leader in data science. We’ll discuss how to build and enable data science teams, create scalable processes, collaborate cross-functionally, communicate with business stakeholders, and more.
Our conversations will be full of real stories, breakthrough strategies, and critical insights—all data points to build your own model for enterprise data science success.