Domino manages the time-consuming DevOps chores that are typically required to provide access to GPU resources. The unmatched power of NVIDIA DGX hardware is easily accessible via Domino so data scientists can focus on critical work, and IT teams eliminate infrastructure configuration and debugging tasks.
“Before we had Domino, it took an intern eight weeks to just get access to a GPU. Now it's just a push button and it's those savings added up across thousands of employees at scale that really make the difference.”
Senior Manager of AI and AiMLabs, Lockheed Martin
Time is money, and that’s critically important when members of your data science team are forced for wait on experiments involving huge datasets and computationally intensive frameworks.
The combination of Domino and the NVIDIA DGX platform gives data scientists immediate access to parallel computing and massive floating point processing. With this amount of power just a few clicks away, important research such as deep learning can be completed in a fraction of the time.
Many deep learning and AI training jobs require more than a single GPU or single multi-GPU machine. But setting up a multi-node cluster is so hard that many teams decide to leave these resources in place, even if that means they’ll often sit idle when there are no complex experimentation to be run.
Dedicated resources and low utilization are eliminated with Domino. The platform automatically creates and manages multi-node clusters, and releases clusters when training is done. Domino supports ephemeral clusters using Spark and Ray, and will be adding support for Dask later in 2021.
Domino gives IT visibility into who is accessing GPU hardware and how it’s being used. Permissions can be set to ensure that employees without proper entitlements are not burning through valuable resources, and power users have full access to maximize the use of the hardware. Usage information and tracking can inform a centralized IT team to easily allocate resources and chargebacks, while also measuring the business value that the GPU-enabled model is generating.
Domino administrators can easily divide a single DGX system into several different hardware tiers to support a variety of different users and their use cases. For example, 1 and 2-GPU partitions can be allocated for basic research, while 4 and 8-GPU partitions are set aside for training workloads. By providing different compute options, more data scientists can use the system at the same time, and companies get the maximum benefit from their GPU investment.
With NVIDIA Multi-Instance GPU (MIG) technology on the NVIDIA A100 Tensor Core GPU, admins can take this even further, allowing up to 56 concurrent notebooks or hosted models, each with an independent GPU instance.
When Lockheed Martin wanted to centralize access to data science tooling, streamline collaboration and knowledge sharing, and automate DevOps tasks, they turned to Domino. Today, they attribute $20 million in annual cost savings to their use of Domino, which includes a 10x increase in data scientist productivity thanks to self-serve access to resources that include NVIDIA GPUs.
“Domino makes it easy for our data scientists to rapidly access NVIDIA GPUs so we can support complex use cases like training deep neural networks.”
Lead Data Scientist, Lockheed Martin
Domino’s growing partner ecosystem helps our customers accelerate the development and delivery of models with key capabilities of infrastructure automation, seamless collaboration, and automated reproducibility. This greatly increases the productivity of data scientists and removes bottlenecks in the data science lifecycle.