Introduction: New Monthly Series!
Welcome to a new monthly series! I’ll summarize highlights from recent industry conferences, new open source projects, interesting research, great examples, amazing people, etc. – all pointed at how to level up your organization’s data science practices.
Key Theme: Machine Learning Models
Themes. Amidst the flurry and noise in the tech industry, I’m always trying to discern them. Technology evolves rapidly and there are almost always underlying drivers hidden in plain view … if one could only spot them early, articulate the nuances involved, and begin to understand their dynamics and likely trajectories. Discussing themes helps us establish a cognitive framework – or, if you will, a heads-up, a launchpad, a HOWTO – for understanding what’s happening overall in the industry.
One key theme I’ve been tracking through conferences, articles, papers, etc., is about machine learning models. There’s such a rich dialog about ML models and their impact on how we do data science, at all levels of the organization. We’re seeing advances in both the tech and the dialog: about ML model reproducibility, about security, about transparency, about ethics and bias, about compliance, about workflow and process mining, about best practices for how to handle models in production. Ultimately we’re seeing the impact of ML models way beyond open source and IT teams: looking at issues such as security, data privacy, bias, ethics, compliance, etc., ML models impact the much larger scope of open society.
Let’s take a look…
Strata Data Conference in London, held this year on May 21–24: check out this rich set of cautionary tales from David Talby (Pacific AI) about “Why machine-learned models crash and burn in production and what to do about it” – see video. David differentiates the process required for machine learning from the process used by, say, mobile web development. To wit, in “traditional” Agile process, expert contributions tend to happen up-front, while in ML it’s the other way around.
For example, when developing a web app, generally the architects and tech leads pour more time into the early stages of a project, while less experienced people become more involved as that project matures. In ML, initial training of a model from a dataset – that’s a “getting started” homework assignment. Once a model is running in production, the iteration required to respond to edge cases, identified bias, problems in feature engineering, etc., that’s where core expertise in ML becomes imperative. In other words, stark reality looms up when you’re handling those nuanced, troublesome issues which make or break a product. Excellent food for thought when discussing best practices for ML models in production.
David’s talk got me thinking. Five minutes before my presentation at Strata UK, I scrambled to combine David’s key points with related notes from Pete Warden, resulting in a somewhat controversial slide, followed by loads of excellent hallway discussions. Especially the latter.
OSCON in Portland, July 16–19: building on best practices for models in production, my favorite talk of the entire conference was “Data science in production” by Richa Khandelwal (Nike) – see video, slides. Richa is a software engineering manager at Nike working on data engineering and infrastructure in the Personalization group. I’ve had the honor of meeting with their team multiple times.
Richa presented about how to level up data engineering teams to “meet in the middle” with data science teams, for better enterprise collaboration. One simple but often overlooked point: require your data engineers to take hands-on data science training. Overall, Richa’s points provide thought leadership and role modeling for highly effective data science in enterprise.
Full disclosure, I’m co-chair for JupyterCon, held on August 21–25 in NYC. Admittedly I’m biased; our biases are important to declare up-front.
There are far too many excellent talks and forums to mention, plus the videos are still coming out, so I’ll cover more of that content later. In short, JupyterCon is one of the better places to gain a wide perspective of data science across industry verticals, government, journalism, academia, scientific research, etc. One frequently repeated comment: “So much diversity, yet so much common ground.”
Speaking of scientific research, we featured a wide array: startling results presented by Michelle Gill (Benevolent AI) in “Data science as a catalyst for scientific discovery” about biomedical work and AI in drug synthesis … LIGO and the discovery of gravitational waves by Will Farr – quite likely short-listed for a Nobel Prize in Physics … and in a different domain, “computational propaganda” analysis where Columbia grad students in journalism leveraged data science to make the Sunday front page of NY Times; see “The reporter’s notebook” by Mark Hansen.
IMO, Michelle Ufford and crew from Netflix stole the show with popular open source projects such as nteract, papermill, vdom, etc. During the Business Summit Roundtable, as moderator I tried to throw a curveball question:
Q: There appear to be some tensions between corporate efforts to certify data science professionals vs. the rising practice of citizen data science. How is this unfolding in enterprise organizations? Along with that, what is the role of self-service data analytics for domain experts? What works, what doesn't work?
If the leadership at your organization is not grappling with that question already, they almost certainly will be. It’s a hard problem, where company culture has a large impact on the process required for machine learning in production.
Michelle Ufford in particular provided a key insight: in data science, certify the process used by a team, rather than the individual people. IMO, data science is fundamentally a team sport, so the notion of individual certifications – decontextualized from team capabilities – has always struck me as awkward. Michelle also gave some of the best advice in response to audience questions about ethics in data science. She emphasized that while Netflix invests significantly in developing open source tools, guiding the culture is more crucial for effective data science at scale. Be sure to watch her keynote.
I gave a keynote talk about industry trends we’d seen during 2017–2018 related to Jupyter. Notably, data science in enterprise now embraces Jupyter for analytics infrastructure at scale. That’s rolling out in a “leap frog” effect when compared with an earlier generation of, say, Apache Hadoop. To name just a few examples: Capital One, Two Sigma, Bloomberg, DoD, Booz Allen Hamilton, PayPal.
Another major theme is reproducibility, how it’s central to both scientific research and open source – plus how much those two disciplines have yet to learn from each other. We see the two converging within data science, where enterprise decision-making depends on reproducibility. In other words, consider how teams need to collaborate across organizations large and small. Oddly enough, it’s within the highly regulated environments such as finance, healthcare, defense/intelligence, etc., where we see rapid evolution of open source. That’s also precisely where the dialog is so crucial about collaboration, data privacy, ethics, security, compliance, etc. Extending from that, my keynote dug into implications of reproducibility w.r.t. governance, looking at guidance from Karl Popper.
Glimpses into the future
With some of those talks and insights in mind, let’s glimpse into the future of AI plus its impact on organizations…
If you haven’t seen the conditional video synthesis work from UC Berkeley (Caroline Chan, Shiry Ginosar, Tinghui Zhou, Alexei Efros) – stop what you’re doing and watch “Everybody Dance Now”. Using generative adversarial networks, they show motion transfer in video with “do as I do” training. Guaranteed double-take.
Meanwhile, reinforcement learning recently moved from pure research to large-scale industry use cases. “Notes from the first Ray meetup” describes how the world’s largest “unicorn” company, Ant Financial in China, has multiple production deployments based on Ray.
Another must-watch news item: check the recent Wired article, “Using Artificial Intelligence to fix Wikipedia’s Gender Problem”, about the Quicksilver AI software from Primer. Quicksilver identifies and corrects gender imbalance in scientists’ biographies. The software exemplifies a newer area of AI which leverages natural language generation and human-in-the-loop practices. Watch for more of that, especially in enterprise use cases.
Also, Ben Lorica and I surveyed organizations worldwide about AI adoption in enterprise, with 8000+ responses. Check out this free mini-book about our analysis. We found surprises, particularly the industry uptake for ethics and privacy in ML workflows, as well as how process differs for data science work among enterprise shops which have more sophisticated ML experience.
Orange Silicon Valley held a one-day conference “Artificial intelligence, Real business 2018” on July 13 in San Francisco, where Danny Lange gave the closing keynote. Danny led machine learning at Uber, Amazon, Microsoft, etc., and in one episode of the O’Reilly Data Show, he describes how his teams kept using the Unity game engine for simulations to train machine learning … so eventually he became head of AI at Unity. Smart move! At the OrangeSV keynote, Danny showed a “Roadmap to Artificial General Intelligence” – often abbreviated as AGI – with biological inspirations and technical directions for each component. Fascinating.
Check out Danny’s talk this Thursday in SF.