When it comes to data, we assume that bigger is better. In fact, the age of Big Data brings a new array of challenges pertaining to data science modeling that today’s practitioners must tackle head-on, according to statistician Nate Silver, who founded FiveThirtyEight.com and is known for his analysis of political polls.
“We have better information and more data, we’re getting smarter all the time, and yet the world still remains often quite unpredictable to us,” Silver said in a keynote at Rev, a two-day summit that brought together 300 data science leaders in San Francisco in May.
The volume of information at our fingertips has grown exponentially, with 90 percent of the world’s data created in the last two years, according to IBM. Silver identified three main problems that stem from this data explosion:
- People are more likely to interpret the same evidence in different ways. Silver pointed to the example of the 2016 presidential election, when the same data produced models that led to wildly divergent forecasts.
- The signal-to-noise ratio is lower, increasing the risk of finding false positives. A massive volume of data points means you can test a huge array of relationships, but that doesn’t mean apparent correlations are actually reliable in driving predictions or describing systems (this is one reason behind the replication crisis in academia). For example, the Federal Reserve tracks 509,000 economic statistics in real time, producing more than 258 billion potential two-way relationships. Only a small share are likely to be meaningful. “Finding those proverbial needles in the haystack is an important part of data science today,” said Silver.
- It’s harder to distinguish between a feature and a bug. In a competitive landscape, data scientists are constantly trying to discover what competitors have missed or where they erred. When they land on something, it’s difficult to know whether the finding is significant or a red herring, and whether to trust the algorithm or experience and intuition. “In those edge cases…we do need human capital,” Silver said. “It’s not all a matter of automating everything.”
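The arithmetic behind the second point can be sketched in a few lines of Python. This is only an illustration, not Silver's own calculation: the function names, the number of simulated series, and the 0.36 correlation cutoff (roughly the 5 percent two-sided critical value for 30 observations) are all assumptions chosen for the example. The first function counts pairings among 509,000 series, both unordered and ordered (the ordered count is the one that exceeds 258 billion). The second generates purely random, unrelated series and counts how many pairs still look correlated.

```python
import math
import random


def count_pairs(n):
    """Return (unordered, ordered) counts of two-way pairings among n series."""
    unordered = math.comb(n, 2)
    return unordered, 2 * unordered


def pearson(x, y):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spurious_pairs(n_series=40, n_obs=30, threshold=0.36, seed=0):
    """Count pairs of independent noise series whose |r| exceeds threshold.

    All parameters are illustrative assumptions, not figures from the talk.
    """
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_series)]
    hits = 0
    for i in range(n_series):
        for j in range(i + 1, n_series):
            if abs(pearson(series[i], series[j])) > threshold:
                hits += 1
    return hits, math.comb(n_series, 2)


if __name__ == "__main__":
    unordered, ordered = count_pairs(509_000)
    print(f"unordered pairs: {unordered:,}; ordered pairs: {ordered:,}")
    hits, total = spurious_pairs()
    print(f"{hits} of {total} pairs of pure noise pass the cutoff")
```

Because every simulated series is independent noise, any pair that passes the cutoff is a false positive by construction. The more relationships you test, the more of these turn up.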
With awareness of these emerging problems growing, Silver had advice for data scientists trying to navigate the potential pitfalls of wrangling Big Data as they work to build predictive models. His insights drew on Bayes’ theorem, a result from probability theory that describes how new evidence should change the probability assigned to an event:
- Think in terms of probabilities. Start by explicitly defining the chance that something will happen (known as a prior probability), and update this prediction as you gather new information. Make sure to communicate uncertainty—the margin of error can be as important as the forecast.
- Start with a prior probability that stems from a strong culture. Coming up with a prior belief will inherently be subjective and based on experience. But you can avoid the problems of groupthink and build a stronger hypothesis if your prior probability derives from teams that are diverse, in terms of both demographics and skill set; free to think and speak independently; and decentralized, allowing a view of conditions on the ground.
- Engage in trial and error. The way to improve your models is to test, learn from your experience and iterate over time. Initially, the learning curve is steep, and you improve quickly. But “when you are in a competitive environment…you gain your competitive advantage through relatively marginal and incremental gains,” Silver said. One of the reasons that weather forecasters have gotten better at making predictions than practitioners in many other fields is that they get daily practice and continually feed input back into their algorithms. But trial and error has its limits: Take more time to perfect your model when erroneous approximations can lead to a crisis.
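The prior-then-update loop described above can be written as a minimal Python sketch of Bayes’ theorem. Everything specific here is an assumption made for illustration: the function names, the 30 percent starting prior, and the likelihoods (evidence that is twice as likely to appear if the hypothesis is true as if it is false). None of these numbers come from Silver’s talk.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return the posterior probability of a hypothesis after one observation.

    Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E), where
    P(E) = P(E | H) * P(H) + P(E | not H) * (1 - P(H)).
    """
    numerator = p_evidence_if_true * prior
    evidence = numerator + p_evidence_if_false * (1 - prior)
    return numerator / evidence


def update_sequence(prior, observations):
    """Apply bayes_update once per observation, returning every belief along the way.

    Each observation is a (p_evidence_if_true, p_evidence_if_false) pair.
    """
    beliefs = [prior]
    for p_true, p_false in observations:
        beliefs.append(bayes_update(beliefs[-1], p_true, p_false))
    return beliefs


if __name__ == "__main__":
    # Hypothetical numbers: a 30% prior, then three supporting observations,
    # each twice as likely under the hypothesis as under its alternative.
    for step, belief in enumerate(update_sequence(0.30, [(0.8, 0.4)] * 3)):
        print(f"after {step} observations: {belief:.3f}")
```

Each pass uses the previous posterior as the new prior, which mirrors the test, learn and iterate cycle: the belief moves a little with every new piece of evidence instead of jumping to certainty.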
With the right process in place, data science leaders have an opportunity to pave the way forward in the era of Big Data. When it comes to data science, “we’re not in the dark ages,” Silver said. “We’re in an age of discovery.”