Data science represents the next era of analytics driving the enterprise. Enterprises that capitalize on its potential will outcompete their rivals, increase efficiency, and generate new revenue streams. Today’s IT teams are challenged to centralize data science infrastructure in a way that will increase governance without constraining data scientists’ freedom and flexibility.
Failure to act will result in a “wild west” of siloed, inconsistent technologies sprinkled across the enterprise, operating beyond IT’s purview and hindering the business’s opportunity to drive value from its data science investment.
Successful CIOs and IT leaders move data science from the business’s periphery to its core with structure and discipline that provide unbridled access to the latest technologies, visibility and auditability, and close alignment with the business.
Implementing the right platform will deliver a win-win-win: IT achieves better governance while enabling innovation that unlocks new business value. Data scientists gain self-service and agility. The business earns a bigger return from its investment in data science.
In this field guide:
Data science at large blends statistics with computer science to find patterns in big data and use those patterns to predict outcomes or to recommend actions or decisions.
Data science represents the next frontier for the data-driven business, which has been evolving for decades:
This evolution of data management and analytics paved the way for data science, a term popularized around 2010, sometimes also called “quantitative research” or “decision science.” Data science encompasses machine learning (ML), the computational process of making predictions based on data inputs and continually improving those predictions as data changes. ML is just one type of weapon in the broad arsenal of data science.
For decades, organizations have aspired to become data-driven. It took years to develop technologies that make it possible to efficiently capture, store and manage data from the systems that are instrumenting today’s world. Now that the data is available, it can benefit every person and every department across the enterprise, which is driving fast and furious adoption of analytics and data science.
Data science is widely recognized as a discipline that should become a core organizational capability, with the potential to drive new revenue streams, automate decisions, improve products and enhance customer experiences to increase a firm’s competitive advantage. This potential is driving significant investment from executives.
IT organizations have an opportunity to help companies realize the full potential of this investment by providing the infrastructure that helps make data science a core organizational capability, rather than a collection of siloed people and tools.
“Based on the simple fact that there’s just a huge amount more data than ever before, our greatest challenge is making sense of that data,” Salesforce.com CEO Marc Benioff said in a 2015 interview with Fortune. “And we need a new generation of tools to be able to organize and view the data. We need a new generation of executives who understand how to manage and lead through data. And we also need a new generation of employees who are able to help us organize and structure our businesses around that data… We need more data science.”
Previous generations of data technologies have involved centralized, monolithic components: a BI server, a database server, a data lake platform, for example. Data science work, in contrast, involves dozens of smaller tools and technologies, many of which are designed to be used locally on data scientists’ workstations.
According to a 2017 study by KDnuggets, the most popular languages for data science are Python and R, both of which rely on IDEs and development tools that run on end users’ machines. On top of that, these languages have rich ecosystems of “packages,” which provide supplemental functionality for more specialized purposes. Many of these packages and tools are open source and available for download online, and data scientists regularly download dozens or hundreds of packages to use in their day-to-day work.
In the last several years, the open source ecosystem around these tools and packages has flourished, driving rapid innovation, frequent updates, and availability of entirely new packages every month.
In other words, modern data science work lives across dozens or hundreds of clients, not in a centralized server.
Data scientists, eager to stay on the cutting edge and utilize the latest techniques, experiment liberally with a variety of tools and packages. That pace of experimentation is increasing as the open source ecosystem innovates more rapidly. The combination of client-based work, a large number of easily accessible technologies, and a desire for rapid experimentation has created a “wild west” of data science tooling in most organizations. Inconsistent technologies are spread across disparate parts of the organization without governance or transparency around any of them.
Worse, in many organizations, “shadow IT” is cropping up to support these systems. For example, a small team might install RStudio or Jupyter (both free downloads) on a shared server to use for their group, without considering support requirements or consistency with other parts of the organization.
Beyond the obvious problems, this “wild west” of siloed data science work creates several other issues:
Data scientists will err on the side of innovation, driven by a desire to use the latest technology and largest machines to develop better models faster than competitors. They are unlikely to perceive the medium- and long-term consequences of a lack of standardization and governance. Like water flowing around rocks in a river, they will find the path of least resistance: if IT isn’t offering them what they need, they will find workarounds, install tools locally, and unintentionally put the organization at risk over the long run.
It’s natural, but overly simplistic, to view the situation as a trade-off between innovation and safety/security. That framing traps the CIO or IT leader between stifling business progress and competitiveness on one side and endorsing chaos and risk on the other. But it is a false dichotomy, and it misses an opportunity to align the goals and incentives of stakeholders across the business.
Within the challenges above lies a tremendous opportunity to bring order to chaos while enabling a critical business transformation. It’s a pivotal point in many organizations’ journey toward becoming truly data-driven, and if built correctly, an effective data science function will transform every business.
A data science platform is where all data science work takes place. It acts as the system of record for predictive models. If databases and data lakes were the central architectural components of incumbent generations, the foundational technology for the data science era is the data science platform.
Unlike a database, a data science platform doesn’t house your data—instead, it houses the artifacts and work product associated with data science workflows. Just as sales organizations use a CRM to create maturity and scalability, and engineering organizations use version control, enterprises are deploying data science platforms to create more maturity and discipline around data science work.
Where a data science platform fits in your infrastructure.
Data science platforms allow IT organizations to rein in the wild west of data science tools, assets and infrastructure spread across the organization. Instead of working in disparate local environments, data scientists do their work in one central place. In order to support the range of use cases involved in data science work, an effective data science platform will provide:
Moving data science work onto a centralized platform will ensure that:
Capabilities provided by a modern data science platform.
Implementing a data science platform to centralize data science work will reduce risk and support burden for IT organizations. But getting buy-in from other parts of the organization, especially data scientists who are likely to balk at talk of “governance,” will be critical. A key part of the CIO’s and IT leader’s challenge is delivering effective, tailored communications to different stakeholders and rallying them behind a shared goal for successful data science. Doing so requires empathy to understand the unique motivations and perspectives of different constituents. Fortunately, there is a wide variety of benefits that can be communicated to align interests.
To data scientists, whose priority is to innovate as quickly as possible by taking advantage of the best and newest tools in a self-service environment:
To executives, whose priority is to derive ROI from investments in data science by quickly integrating insights to improve business processes:
To the rest of the IT organization, whose priority is to control infrastructure costs and maintain a single, integrated environment:
By successfully navigating each internal stakeholder’s concerns and deploying a data science platform, everyone wins: IT management successfully mitigates risk through governance and centralization, while delivering productivity gains for data scientists. Establishing a data science platform leaves IT poised for success, and the business is equipped to drive faster innovation.
The “build vs. buy” decision can be a difficult one. Companies that set out to build their own typically do so for two reasons:
Before heading down this path, consider several costs associated with a homegrown solution:
You haven’t built the CRM system that your sales team uses, or the version control system that your engineers use—a data science platform is no different.
As organizations increasingly strive to become model-driven, they recognize the necessity of a data science platform. According to a recent survey report, 86% of model-driven companies differentiate themselves by using a data science platform. And yet the question of whether to build or buy still remains.
For most organizations, purchasing a data science platform is the right choice from both a business strategy and project cost efficiency perspective. However, many organizations confuse the criticality of models to their long-term success with the need to build the underlying platform themselves. In a few select situations, the platform itself is the differentiator.
These organizations have highly specialized workflows (e.g., Uber), a stellar track record of internal software development (e.g., Airbnb), and deep data science expertise that recognizes the unique traits of models (e.g., Google).
For the vast majority of organizations, the competitive differentiator is not the platform, but the entire organizational capability — what we call Model Management — encompassing many different technologies, stakeholders, and business processes. Buying the platform is the logical choice for most.
You’re probably thinking, “Of course Domino, the data science platform vendor, believes everyone should buy a data science platform.” We do have our opinion on the topic, but this opinion stems from thousands of interactions with organizations of all shapes and sizes around the world. Most that have opted to build their own platform have stalled or failed. Those who have purchased a platform are operationalizing data science at scale.
These interactions and experiences working with organizations trying to decide whether they should build or buy led us to develop an objective framework to facilitate the decision process. It includes three major factors:
The scope of building, managing and operating a data science platform needs to be carefully examined. Many organizations underestimate the total cost of ownership in the build approach.
In a four-year scenario where an organization builds a data science platform supporting 30 data scientists at first (and growing at 20% annual rate in subsequent years), we estimated the TCO of building to be over $30 million while the TCO of buying is only a fraction of that. See Figure 1 below for a yearly side-by-side comparison of the TCOs of the two approaches.
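The headcount assumption in this scenario can be made concrete with a short calculation. The sketch below works out only the team size per year from the two figures stated above (30 data scientists at first, 20% annual growth). It does not attempt to reproduce the TCO estimate itself, since the underlying cost inputs are not given here, and the function name `team_size_by_year` is a hypothetical one chosen for illustration.

```python
def team_size_by_year(initial: int = 30, growth: float = 0.20, years: int = 4) -> list[int]:
    """Return the rounded headcount for each year of the scenario.

    Year 1 uses the initial headcount; each later year grows by `growth`.
    """
    return [round(initial * (1 + growth) ** year) for year in range(years)]


sizes = team_size_by_year()
for year, size in enumerate(sizes, start=1):
    print(f"Year {year}: {size} data scientists")
```

Under these assumptions the team grows from 30 to roughly 52 data scientists by the fourth year, which is one reason the cost of a homegrown platform compounds over time.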
By devoting resources to building a data science platform, an organization is inevitably choosing to divest from other projects. This choice can be unwise, especially if it comes at the expense of the organization’s core competency, which will eventually hurt revenue.
Data science is not an easy endeavor to take on, and it is wise to de-risk as much as possible. Risk factors such as talent acquisition and retention, skill requirement changes, and platform feature requirement changes need to be considered carefully before deciding to build. On the flip side, an organization that decides to buy should also choose its vendor very carefully.
Ultimately, organizations need to decide where their differentiation lies with data science: in the models they build and overall organizational capability, or in the underlying infrastructure? For most, it is the former, so a “buy” approach likely offers the lowest TCO and most aligned strategic choice.
The data science platform landscape can be overwhelming. There are dozens of products describing themselves using similar language despite addressing different problems for different types of users.
The three segments that have crystallized are:
The table below summarizes these segments:
Types and use cases of data science platforms.
These solutions help data analysts build models by automating tasks in data science, including training models, selecting algorithms, and creating features. These solutions are targeted primarily at non-expert data scientists or data scientists interested in shortcutting tedious steps in their process to build quick baseline models.
These “automated machine learning” solutions help spread data science work by bringing non-expert data scientists into the model building process, offering drag-and-drop interfaces. They often include functionality for deploying the models that have been automatically trained, and they are increasingly integrating interpretability and explainability features for those models as well. They work best when the data is cleanly prepped and consistently structured.
These tools support a breadth of use cases including data science, data engineering, and model operations. They provide both drag-and-drop and code interfaces and have strong footholds in a number of enterprises, and may even offer unique capabilities or algorithms for specific micro-verticals.
While these solutions offer great breadth of functionality, users must leverage the proprietary user interfaces or programming languages to express their logic.
This group of solutions targets code-first data scientists who use statistical programming languages and spend their days in computational notebooks (e.g., Jupyter) or IDEs (e.g., RStudio), leveraging a mix of open-source and commercial packages and tools to develop sophisticated models. These data scientists require the flexibility to use a constantly evolving software and hardware stack to optimize each step of their model lifecycle.
These code-first data science platforms orchestrate the necessary infrastructure to accelerate power users’ workflows and create a system of record for organizations with hundreds or thousands of models.
Enterprises with teams of data scientists select these solutions to enable accelerated experimentation for individuals while simultaneously driving collaboration and governance for the organization. Key features include scalable compute, environment management, auditability, knowledge management, and reproducibility.
Data science is unlike other technical disciplines, and models are not like software or data. Therefore, a data science platform requires a different type of technology platform.
Below are the top 10 questions IT leaders should ask of data science platforms to ensure the platform handles the uniqueness of data science work.
An ideal data science platform should work with existing infrastructure. It should offer the flexibility to be hosted in the cloud (e.g., in a VPC, a vendor-managed private cloud), on-premises, or in a hybrid configuration. In every case, the platform should be based on a single codebase, regardless of where it is hosted. If business requirements call for changes in infrastructure, the ideal platform provides the flexibility to adapt to those changes.
Data science requires free-form experimentation and access to the latest advances in open-source tooling to achieve breakthroughs. However, enterprises need to place guardrails on experimentation and tools to guard against breaches and protect company IP. So, a data science platform must support various native data science tools (JupyterLab, RStudio, SAS, etc.) through an open and flexible approach, while giving IT teams the capabilities to govern data science environments and provision pre-approved ones.
This approach will remove the data science shadow IT challenge and ensure that IT infrastructure is not exposed to unnecessary risks.
Data scientists’ work requires somewhat unpredictable access to different sizes of hardware, including GPUs, when doing intense work like deep learning. Reserving large hardware instances that sit idle is too expensive, so a data science platform should provide elastic access to different types of machines and software packages. These environments should be available with a single click, removing DevOps tasks from data scientists’ daily work.
IT teams should be able to control which users have access to which environments, and also have complete visibility into the costs, time, and usage of each of these environments. Finally, when provisioning resources, the platform should support parallel execution, so that multiple experiments can run at the same time.
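To illustrate what running multiple experiments in parallel looks like from a data scientist's side, here is a minimal sketch using Python's standard `concurrent.futures` module. The `run_experiment` function and its toy scoring formula are hypothetical stand-ins for a real training run; on an actual platform each experiment would typically be dispatched to its own provisioned machine rather than to a local thread.

```python
from concurrent.futures import ThreadPoolExecutor


def run_experiment(learning_rate: float) -> dict:
    """Hypothetical stand-in for a training run; returns a toy score."""
    score = 1.0 - abs(learning_rate - 0.1)
    return {"learning_rate": learning_rate, "score": score}


learning_rates = [0.01, 0.1, 0.5]

# Launch all experiments at once; map() returns results in input order.
with ThreadPoolExecutor(max_workers=len(learning_rates)) as pool:
    results = list(pool.map(run_experiment, learning_rates))

# Pick the best-scoring configuration once every run has finished.
best = max(results, key=lambda result: result["score"])
print(best)
```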
An ideal platform for data science should work with existing user security practices such as Single Sign-On (SSO). However, in data science, providing authorization and authentication security isn’t enough. Data science is different, and a complete platform also provides an audit trail of all data science work (code, data, packages, environments, comments) for an individual user that ensures reproducibility and auditability of the users’ work.
Along with this visibility and auditability, IT should have access to a flexible permission model to govern access to models, projects, data, experiments, hardware, and software packages that scales to support growth to hundreds of users.
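A flexible permission model of this kind is often organized around roles. The following is a minimal sketch, assuming a simple role-based scheme in which roles are granted actions on resource types and users are assigned roles. The class name, role names, and resource labels are all hypothetical and are not taken from any particular platform.

```python
from dataclasses import dataclass, field


@dataclass
class PermissionModel:
    # Maps a role name to the set of (resource_type, action) pairs it allows.
    role_grants: dict = field(default_factory=dict)
    # Maps a user name to the set of roles assigned to that user.
    user_roles: dict = field(default_factory=dict)

    def grant(self, role: str, resource_type: str, action: str) -> None:
        self.role_grants.setdefault(role, set()).add((resource_type, action))

    def assign(self, user: str, role: str) -> None:
        self.user_roles.setdefault(user, set()).add(role)

    def is_allowed(self, user: str, resource_type: str, action: str) -> bool:
        # A user is allowed if any of their roles carries the grant.
        return any(
            (resource_type, action) in self.role_grants.get(role, set())
            for role in self.user_roles.get(user, set())
        )


perms = PermissionModel()
perms.grant("data_scientist", "project", "read")
perms.grant("data_scientist", "hardware:gpu", "launch")
perms.grant("auditor", "model", "read")
perms.assign("alice", "data_scientist")
perms.assign("bob", "auditor")

print(perms.is_allowed("alice", "hardware:gpu", "launch"))
print(perms.is_allowed("bob", "hardware:gpu", "launch"))
```

Because grants attach to roles rather than to individuals, adding the hundredth user is a single role assignment, which is what lets a scheme like this scale.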
Keeping a comprehensive and thorough system of record in the data science lifecycle can significantly reduce regulatory and operational risks. An ideal data science platform preserves the entire lifecycle of a model as a system of record. All revisions of a project should be tracked to enable easy retrieval of any experiment for audits, risk governance, and compliance checks. For example, a model developed to predict insurance policyholder risk may need to be audited and adjusted based on new personal privacy laws.
A full model provenance log would enable one to trace back every step of model creation, understand how specific sensitive personal data impacts the model, and how that sensitive data was used in development of the model. Additionally, a data scientist could start from any point in that model creation process to fork off and develop an updated model without starting from scratch, accelerating new model development while reducing compliance risk.
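One way to picture a provenance log is as a chain of revisions, each pointing to its parent. The sketch below is an illustrative assumption rather than a description of any specific product: the `ModelRevision` and `ProvenanceLog` names, and the fields they record, are hypothetical. It shows both tracing a model back through every step and forking a new model from an earlier point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelRevision:
    revision_id: str
    parent_id: Optional[str]  # None for the first revision
    code_version: str
    data_snapshot: str
    environment: str
    note: str = ""


class ProvenanceLog:
    def __init__(self) -> None:
        self._revisions: dict = {}

    def record(self, revision: ModelRevision) -> None:
        if revision.revision_id in self._revisions:
            raise ValueError(f"duplicate revision: {revision.revision_id}")
        if revision.parent_id is not None and revision.parent_id not in self._revisions:
            raise KeyError(f"unknown parent: {revision.parent_id}")
        self._revisions[revision.revision_id] = revision

    def lineage(self, revision_id: str) -> list:
        """Return the revisions from the root down to `revision_id`."""
        chain = []
        current = self._revisions[revision_id]
        while current is not None:
            chain.append(current)
            parent = current.parent_id
            current = self._revisions[parent] if parent is not None else None
        return list(reversed(chain))


log = ProvenanceLog()
log.record(ModelRevision("r1", None, "code-a", "data-2019", "env-1", "baseline"))
log.record(ModelRevision("r2", "r1", "code-b", "data-2019", "env-1", "tuned"))
# Fork from r1 with a privacy-scrubbed data snapshot instead of starting over.
log.record(ModelRevision("r3", "r1", "code-a", "data-2019-scrubbed", "env-1", "fork"))

print([rev.revision_id for rev in log.lineage("r3")])
```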
Data science is different from software development: models require retraining, are developed in an experimental fashion, and are made using many different software tools. There is no need to “retrain” software code, but production models do need to be retrained frequently. A data science platform provides a single, comprehensive system of record (SOR) for models, which is much more than keeping track of code versions and issues.
Data science assets include code, data, discussion threads, hardware tiers, software package versions, parameters, results, and more. Git and Jira are not built for an experimental process. Furthermore, data scientists will reject systems built on Git, Jira, and Jenkins because they hinder their work instead of accelerating it.
A data science platform accelerates model development and deployment, with access to elastic compute, automatic experiment tracking, full reproducibility, model-based collaboration, streamlined model-deployment, and a knowledge base of building blocks to enable rapid model development.
A data science platform needs to provide simple, fast, and secure access to all types of data sources, including Hadoop, Spark, flat files, and databases. These connections must be encrypted in transit, handle failover, and be able to transfer large amounts of data for model training and experimentation.
Data science also involves lots of data manipulation and creation of new “features,” which are derived from other data. Since the data and features often change in each experiment, a snapshot of that data needs to be captured and revisioned so that the model and its data are auditable and reproducible and meet compliance requirements in regulated industries.
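A common way to revision a data snapshot is to identify it by a hash of its contents, so that any change to the data or to a derived feature yields a new identifier. The sketch below is one possible approach under that assumption, using only the Python standard library. The `snapshot_id` helper and the sample rows are hypothetical, and a real platform would hash files or tables rather than small in-memory lists.

```python
import hashlib
import json


def snapshot_id(rows: list) -> str:
    """Return a stable SHA-256 content hash for JSON-serializable rows."""
    # Sorting keys makes the hash independent of dictionary key order.
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


raw = [
    {"customer": 1, "spend": 120.0},
    {"customer": 2, "spend": 80.0},
]

# Adding a derived feature changes the content, and therefore the snapshot ID.
with_feature = [dict(row, high_spender=row["spend"] > 100) for row in raw]

print(snapshot_id(raw))
print(snapshot_id(with_feature))
```

Storing the snapshot identifier alongside each experiment is what allows an auditor to confirm exactly which data a given model was trained on.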
Model operations involves deploying models to production and then monitoring, retraining, and updating them there. Model deployment is the process of making a model usable in production, for example as a simple visual (a chart or graph), an interactive application, or an API, so that it can be consumed interactively by people or programmatically by other systems.
An ideal data science platform should allow data scientists to self-serve and deploy models directly in any of these modes, with IT approval and oversight. Once a model is deployed, the platform should monitor its performance and provide the ability to retrain and revise it in production, capturing full model provenance for audit records.
Lastly, the platform should ensure that end-users have a direct feedback path, from the model to the data scientists, to ensure rapid iteration on the model.
A data science platform should provide an elastic and flexible compute infrastructure to meet the dynamic resource requirements of data science projects. Poor resource provisioning can lead to unexpectedly high hardware-usage bills or unrealistic requests for additional hardware.
The platform should also provide visibility and controls to ensure compute resources are properly allocated and consumed by the correct users on data science teams. Visibility and controls of hardware are important, but the platform should also expose the usage of different software tools by users, for specific projects too. This level of detail helps IT leaders plan for future projects and adjust spend and tooling to be commensurate with the projects that drive the most value. It also enables IT leaders to have collaborative discussions with data science leaders on project ROI.
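The kind of usage visibility described here amounts to rolling up resource consumption by user and by project. Below is a minimal sketch of that roll-up. The usage records, hardware tier names, and hourly rates are invented purely for illustration and do not reflect real prices or any real platform's reporting format.

```python
from collections import defaultdict

# Hypothetical usage records: (user, project, hardware_tier, hours).
usage = [
    ("alice", "churn", "gpu", 10),
    ("alice", "churn", "small-cpu", 4),
    ("bob", "pricing", "small-cpu", 20),
    ("bob", "churn", "gpu", 1),
]

# Hypothetical hourly rates in dollars per hardware tier.
rates = {"small-cpu": 0.5, "gpu": 2.0}


def cost_by(records: list, hourly_rates: dict, key_index: int) -> dict:
    """Sum compute cost grouped by the record field at `key_index`."""
    totals = defaultdict(float)
    for record in records:
        tier, hours = record[2], record[3]
        totals[record[key_index]] += hourly_rates[tier] * hours
    return dict(totals)


cost_per_user = cost_by(usage, rates, key_index=0)
cost_per_project = cost_by(usage, rates, key_index=1)
print(cost_per_user)
print(cost_per_project)
```

Grouping the same records by project rather than by user is what supports the ROI conversation: spend can be compared directly against the value each project delivers.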
Even though data science platforms are built to support the unique model development lifecycle, they should integrate with existing software development processes. The platform should provide a workflow that applies a Dev-Test-Production progression to the unique aspects of model development. This workflow should ensure the process captures all model assets, including code, data, comments, tools, packages, and even the development environments. Capturing all model asset information ensures that one can revert to previous model versions and promote to the latest model version in a seamless and auditable manner.
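A Dev-Test-Production workflow for models can be sketched as a small promotion routine that refuses to advance a model unless its assets are captured, and that records each promotion for audit. The example below is an assumption-laden illustration: the stage names, the `REQUIRED_ASSETS` set, and the `ModelVersion` and `promote` names are all hypothetical, and a real workflow would capture more assets than the three checked here.

```python
from dataclasses import dataclass, field

STAGES = ["dev", "test", "production"]
# Assets that must be captured before a model may be promoted (illustrative).
REQUIRED_ASSETS = {"code", "data", "environment"}


@dataclass
class ModelVersion:
    name: str
    version: int
    assets: dict
    stage: str = "dev"
    # Audit trail of (from_stage, to_stage, approved_by) tuples.
    history: list = field(default_factory=list)


def promote(model: ModelVersion, approved_by: str) -> ModelVersion:
    """Move a model to the next stage, recording who approved it."""
    missing = REQUIRED_ASSETS - set(model.assets)
    if missing:
        raise ValueError(f"cannot promote, missing assets: {sorted(missing)}")
    index = STAGES.index(model.stage)
    if index == len(STAGES) - 1:
        raise ValueError("model is already in production")
    next_stage = STAGES[index + 1]
    model.history.append((model.stage, next_stage, approved_by))
    model.stage = next_stage
    return model


model = ModelVersion(
    name="churn",
    version=3,
    assets={"code": "abc123", "data": "snap-01", "environment": "py-env-7"},
)
promote(model, approved_by="it-reviewer")
promote(model, approved_by="it-reviewer")
print(model.stage, model.history)
```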
If you choose to design and build your own data science platform, the video below shares recommendations and lessons on designing them to be sustainable and scalable.
This field guide covered the types of data science platforms, the challenges they solve, and the choice between building and buying. CIOs and IT leaders who decide to implement a data science platform can read the complete guide to making the build vs buy decision.