We recently caught up with Eduardo Arino de la Rubia, a principal developer for the Ingram Content Group who holds two roles in the company. In the first of these roles, Eduardo works with Lightning Source, a print-on-demand provider. Manufacturing books on demand has all sorts of interesting data science applications, from predicting downtime and using machine vision and image processing for quality control, all the way to using natural language processing to understand the content that passes through their facilities. In the second, Eduardo works with Ingram’s Chief Venture Capital Officer to fund interesting startups. He is a big proponent of using data science in the due diligence process for early to mid-stage venture investing.
How did you get started in data science, and what do you think made you successful throughout your career?
I had a great deal of interest in HPC in the 90s, particularly parallel processing. I fell in love with the toolkits that were available, PVM and MPI. I was very fortunate that the university I attended had a SIMD supercomputer available for students and encouraged interested students to explore and learn this way. It was my first introduction to the approaches currently taken by GPUs, for example. One of my professors also introduced me to Genetic Algorithms and soft computing, and I was quickly mesmerized by these bioinspired approaches to optimization and problem solving. To be honest, I dropped off the data science and machine learning radar for a few years after that. The dotcom boom and the “web as a platform” really interested me, and therefore took up a great deal of my time. In the last ten years, though, I’ve spent an increasing amount of time learning image processing, which is surprisingly numerical in nature. That has also been a natural conduit back to more recent advancements in machine learning, as I have integrated them into my day-to-day work. I also think that one of my ‘secrets’ to success in computing has been how I actively recast myself as a student. For example, I signed up for the General Assembly Data Science class when it was available, I signed up for the Coursera/Johns Hopkins Data Science specialization when it was available, and so on. Considering yourself a student allows you to not become too rigid within a single orthodoxy and encourages you to expect new and interesting approaches that will surprise you. Always be willing to accept and enjoy surprise, even if the surprise is that your preconceptions were wrong, and be thankful for the opportunity to both learn and be surprised.
I’ve also had a great time being involved in the fantastic meetup communities around data science topics, and I’m very fortunate that in Los Angeles we have an active and welcoming community. One of my collaborators, Szilard Pafka, started a Data Science meetup nearly 5 years ago (an R meetup in name, but the topics were clearly DS), and this has fostered a very welcoming and collegial environment over time. We launched DataScience.LA earlier this year to help keep the community learning and growing, to help people who are interested in the field learn more about it, and to attract interesting speakers. It’s important to be willing to volunteer and give your time, speak at meetups, and encourage others, as that’s when you learn the most and forge the strongest relationships.
What tools do you use in your analysis? What new tools or techniques are you most excited about?
I am first and foremost a programmer, and I have been for a very long time. This should translate into an advantage, but honestly I believe it was one of the greater disadvantages during my adoption of modern data science tools. When I was first introduced to the R ecosystem, I judged it as a programmer. While R is indeed a programming language and can be judged as one, to do so is to completely miss the point of the language. R (and RStudio) are an interactive environment, and an interface into some of the world’s most advanced analytics algorithms.
Needless to say, I am now a complete convert and an active R user. To be very specific, I love the RStudio IDE and IPython notebooks. They are both quite powerful and each can learn from the other. I mostly use RStudio, and when I work in R, I am a proponent of the Hadleyverse, a series of APIs built by Hadley Wickham which modernize analysis and visualization.
The “tool and technique” I’m most excited about is the growth of tools for reproducibility. The work being done by Domino Data Lab is quite cutting-edge in this space, and their tools are a fantastic addition to anyone’s tool belt. I am also a proponent of Yihui Xie’s knitr package, as it’s a fantastic tool for building markdown documents which encapsulate the results of an analysis in a reproducible, easy-to-communicate fashion.
Is there an exciting application of data science you can share with us, from either a business perspective or a personal perspective?
I think that education is going to be the single biggest industry to be revolutionized by data science. There are so many facets of education which can be helped by data and the proper application of predictive analytics. Similarly, data science in the nonprofit space is growing by leaps and bounds and can help us address some of the largest societal problems we’re currently facing.
I know you were at Strata a few weeks ago, so let’s talk a bit about that...
What trends did you notice at Strata+Hadoop World?
Strata was both very exciting and very disappointing. I understand that the purpose of a conference such as Strata is to build hype about the future of an ecosystem, but I certainly felt like this was accomplished at the expense of current capabilities. For example, the market is too fragmented and there are too many players offering partial solutions. There is also the problem of overpromising and under-delivering, which I experienced first-hand at Strata. I understand that Big Data is exciting and sexy, but to pretend these tools are ready for production and available as a turnkey solution is disingenuous and distracts from the issues still at hand. The things we’ve been doing well with big data, such as batch processing and ETL, have gotten better. Those things are certainly solid. The storyline around “interactive real time access” to the data, the promise of opening up an organization's entire data universe to the BI analysts, is still a bit of a pipe dream.
The biggest story is, without a doubt, Spark. There is no doubt in my mind that Spark is going to consume the entire big data world, and for those organizations that genuinely need the stream processing and in-memory computing capabilities of Spark, that future can’t arrive fast enough.
We hear a lot of companies talking about Hadoop... How much is hype and how much is real, from your perspective as an industry practitioner?
It’s 90% hype, unfortunately. Let me be clear: if your company needs Hadoop, then it’s the single most central component of your way of doing business. However, a large number of organizations want to include the Hadoop ecosystem as a part of their toolset, and very few of them need it. If you need something like Hadoop, which organizations with giant data genuinely do, then you’re already using it and already aware of the limitations and complexity. If you can solve your problems with a commercial MPP or column store, don’t overburden your future self with a complex ecosystem. Make sure you really have pushed the limits of conventional approaches first and are not merely jumping on the latest bandwagon.
Did you find any new innovative companies and/or solutions? Why do you think these new players are important to the future of analytics?
I didn’t “find them” at Strata, but it was really great to see them. I am a huge fan of H2O and the work they are doing. They really are building cutting-edge machine learning algorithms which scale, and providing clean and intuitive interfaces. I think the sky's the limit for them.
How are organizations supposed to know which technologies they are going to need to invest in when there is a seemingly endless number of choices and combinations?
This is probably the least popular answer I can give, but the fastest computer is the one you can buy tomorrow. Pure technology is seldom where an organization can get the most value for the dollar. Unless your current tooling has genuinely been exhausted and you’re hitting real limitations, take a good long look at your approaches and your processes.
Are you following an analytical methodology which will yield results, or vanity metrics?
Don’t start with the data; start with genuine, impactful questions a human would ask. Buy technologies that will make your current staff as productive as they possibly can be, make sure they have what they actually need to do their jobs, and let them prototype and play. My employer is very good about this: if we want to test out a new series of technologies, we can very quickly build up a sandbox infrastructure and test claims in real-world situations. If you let people sandbox and give them the ability to be innovative, they’ll come back to you with solid approaches founded on sound principles, instead of going off of vendor promises.
What are your predictions for Strata in San Jose next year?
I hope to see a maturing ecosystem with less talk about the glorious future and more discussion of current successes. I think the traditional names in analytics and BI are struggling to find a foothold, and the upcoming conference in San Jose will provide transparency and separate those who can show that they “get it” from those who will be left behind. I think that the Spark platform will have stabilized by then, so expect to see a high number of vendors talking about how their platform is now Spark-enabled. Strata is no longer a little data science conference; it is very much a mainstream conference with massive industry players, and it’s going to be interesting to watch it adapt.
What’s your advice to companies trying to bring new analytics and big data capabilities into their organizations?
Unless you are 100% certain that you have big data and need big data capabilities, do not take the step to a big data toolchain lightly. There is a great deal of technical debt that comes with any big data implementation, and you only want to start paying interest on that debt when it’s absolutely necessary.
As for analytics, companies need to have the right questions. Almost every company wants a magic black box that will give answers capable of driving decisions and guiding the business in general, but in order to do that, the organization’s data has to be rich enough and the leadership has to be willing to change when presented with evidence. A square peg will not fit in a round hole no matter how hard you push, so make sure your organization is genuinely prepared to challenge fundamental assumptions about its relationship with customers, suppliers, and its impact on the world around it.
What advice do you have to someone entering the field, trying to establish a career in data science?
I would first ask, “Why do you want to be a data scientist?” The fact of the matter is that, as sexy as the field is right now, data science is an interdisciplinary field that requires a great deal of elbow grease and does not have a clear path marked as you head towards the title of data scientist.
Your data won’t be clean, most of your analyses are going to be wrong or inconclusive, and when you do find something interesting, you are going to have to present your findings to potentially incredulous decision makers. Are you someone who genuinely enjoys constantly learning? I don’t mean, “are you willing to be taught,” but are you someone who constantly seeks out several sources of new information? That’s the first part. Are you mathematically minded? You don’t have to have a PhD in mathematics, but without a proper foundation in statistics and linear algebra, life as a data scientist is going to be tough. Are you fastidious? Are you capable of looking at weird error messages about misplaced punctuation and finding some perverse joy in it? Are you skeptical? When someone hands you a data file which they claim represents X, do you take them at their word, or do you question whether the data was correctly captured, logged, and handed off to you?
Do you find pleasure in explaining things which you find elementary to an audience that finds them conceptually impenetrable? You can have every skill listed above, but if you’re not capable of communicating calmly, kindly, and with a good-natured approach, this will be a hard field for you to succeed in.
I don’t mean to sound too negative, but this is not a trivial field or an endeavor to be taken lightly. This is a fantastically complex, wonderfully open, spectacularly interdisciplinary and rewarding field. If you pick it, and succeed, you will be doing things which are (to the lay person) indistinguishable from magic. You will be predicting the future, you will be detecting lies, you will tell someone what flavor ice-cream they would love even before they have had the opportunity to ponder the question. The groundwork has been completed, the foundation has been laid, and the data’s secrets are waiting to be unlocked.
We are extremely grateful to Eduardo for his time. You can follow him at @earino.