On October 6, 2020, data scientist Peter Shen from Janssen Pharmaceuticals, a division of Johnson & Johnson, presented “Multi-GPU Machines for Computer Vision-based Deep Learning Models in Histopathology” at the NVIDIA GPU Technology Conference (GTC) for developers. Peter was joined by Katherine Shakman, Domino Data Lab Field Data Scientist. This post provides highlights from their talk, along with a link to the full session recording.
Data Science at Janssen
Janssen is the pharmaceutical arm of Johnson & Johnson, a multinational healthcare leader. The company applies computational data science across immunology, computational chemistry, and biology to develop new drugs, optimize clinical trials, and automate diagnostic techniques.
Working with Domino and NVIDIA, Janssen has accelerated the training of deep learning models, in some cases by as much as ten times, to more quickly and accurately diagnose and characterize cancer cells through whole-slide image analysis. This is a crucial step in its effort to deliver precision medicine. Based on early results, Peter anticipates that once deployed in a clinical setting, one model will deliver a fourfold increase in the number of patients who can be screened as positive for clinical trial eligibility.
Artificial intelligence can transform healthcare, giving researchers new insights to discover novel treatments and deliver precision medicine to patients. But doing so requires the ability to analyze enormous data sets. In his talk, Peter dove into the specifics of how Janssen is using deep learning to analyze whole-slide images of biopsy and surgical specimens (called histopathology images). Each image typically ranges from two to five gigabytes, and most clinical trials generate thousands of these images. Large clinical trials, Peter said, can generate up to 100,000 images.
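Images this large cannot be fed to a neural network whole, so whole-slide pipelines typically break each multi-gigabyte slide into fixed-size tiles before training. The talk does not detail Janssen's preprocessing, but the idea can be sketched in a few lines of Python; the slide and tile dimensions below are illustrative, not from the talk:

```python
# Hypothetical tiling sketch: compute the top-left (x, y) coordinate of
# every full tile in a whole-slide image. Real pipelines would then read
# and decode each tile region from the slide file.

def tile_grid(slide_width, slide_height, tile_size=512, stride=512):
    """Return (x, y) top-left coordinates for every full tile in the slide."""
    coords = []
    for y in range(0, slide_height - tile_size + 1, stride):
        for x in range(0, slide_width - tile_size + 1, stride):
            coords.append((x, y))
    return coords

# A 100,000 x 80,000 pixel slide (a realistic scan size at full
# magnification) yields tens of thousands of 512 x 512 tiles.
coords = tile_grid(100_000, 80_000)
print(len(coords))  # 195 * 156 = 30420 tiles
```

Each tile then becomes one training example, which is how a single slide can contribute thousands of samples to a convolutional network.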
By training deep learning models to distinguish differences among patients at a cellular level in these images, researchers can better identify viable therapeutic targets, determine clinical trial eligibility, or predict a patient’s potential response to a given therapy.
If we’re able to deploy this model into a clinic, we’re able to get a 4X increase in the number of patients that we can screen as positive for eligibility in our clinical trials.
To support this work, Janssen built a unified framework for deep learning and distributed training, using the Domino data science platform to provide data scientists with self-service infrastructure access to diverse tools, languages, data sets, and scalable compute, including NVIDIA GPUs, which are critical for training deep learning models on large data sets. In his discussion, Peter shared how Domino is helping the team develop deep learning models more rapidly, in some cases as much as 10 times faster. (Of course, getting these models to production will require strong partnership among data science, IT, and business leaders. Peter joined data science leaders from easyJet and PointRight to discuss challenges and best practices in this area specifically during their webinar “Reaching Across the Aisle.”)
In terms of the unified framework specifically, Peter emphasized four benefits: the ability to
- Run multiple experiments with different hyperparameters in parallel. In the past, Peter said, training models with a small data set on a single-GPU machine could take up to nine days per epoch. With large data sets, this became even more challenging. And in one case, researchers found that training a convolutional neural network model for classifying images was not even feasible, due to memory constraints and challenges in fully utilizing multi-GPU compute resources.
- Reproduce results so researchers can translate results for one cohort of patients to other patients or other settings.
- Track metrics for each model and store models in a centralized location so they can analyze whether models are working correctly and pinpoint the exact data set being used.
- Easily integrate models into data science workflows (in other words, making research reproducible and reusable across multiple business use cases). For Peter this includes reducing the use of ad hoc code bases that work only for one set of experiments, and ensuring that models are explainable and interpretable.
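The first benefit, running hyperparameter experiments in parallel, can be sketched as follows. This is a minimal illustration, not Janssen's actual code: `train()` is a placeholder that a real version would replace with a training run pinned to device `f"cuda:{gpu_id}"` using the team's deep learning framework, and the learning-rate grid is invented for the example:

```python
# Sketch: launch one hyperparameter experiment per GPU, in parallel.
from concurrent.futures import ThreadPoolExecutor

def train(gpu_id, params):
    # Placeholder training run; real code would build and train the model
    # on GPU `gpu_id` and return an actual validation metric.
    return {"gpu": gpu_id, **params, "val_metric": None}

# Illustrative learning-rate grid: one experiment per available GPU.
grid = [{"lr": lr} for lr in (1e-3, 1e-4, 1e-5, 1e-6)]
with ThreadPoolExecutor(max_workers=len(grid)) as pool:
    results = list(pool.map(lambda job: train(*job), enumerate(grid)))
print([r["gpu"] for r in results])  # [0, 1, 2, 3]
```

Because each experiment occupies its own GPU, the wall-clock time for the whole grid is roughly that of the slowest single run rather than the sum of all runs.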
We built a flexible platform that really allows us to iterate through different model training, and do that also in a distributed fashion.
Peter also presented three examples of how this approach is helping accelerate research by:
- Reducing training time of deep learning models by 10X. For example, in one case researchers achieved almost a linear acceleration when they scaled model training to six GPUs versus just one. In another example, they reduced the time to run each epoch for training, from two hours to two minutes, by experimenting with different image formats.
- Optimizing models with the ability to use a higher number of images in both model training and validation.
- Better predicting clinical trial eligibility, with a 4X increase in eligible patients based on results from one experiment. “If we’re able to deploy this model into a clinic, we’re able to get a 4X increase in the number of patients that we can screen as positive for eligibility in our clinical trials.” With these results, the team is moving its proof of concept into the next phases of development, which include additional validation and data collection, to create a robust, generalizable model for therapy-eligibility prediction that can be used across use cases.
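The epoch-time reduction from experimenting with image formats is consistent with a common "decode once, train many times" pattern: rather than re-decoding multi-gigabyte source images every epoch, tiles are extracted once and cached in a fast-to-load binary format that the training loop reads directly. The talk does not specify Janssen's format choice; the sketch below uses Python's `pickle` purely to illustrate the pattern, with toy lists standing in for decoded image arrays:

```python
# Sketch: pay the image-decoding cost once, then reuse a compact binary
# cache every epoch. File names and "tiles" are illustrative.
import os
import pickle
import tempfile

def build_tile_cache(tiles, cache_path):
    """One-time cost: serialize decoded tiles to a fast binary cache."""
    with open(cache_path, "wb") as f:
        pickle.dump(tiles, f)

def load_tile_cache(cache_path):
    """Per-epoch cost: a single fast deserialization."""
    with open(cache_path, "rb") as f:
        return pickle.load(f)

# Toy "tiles": in practice these would be decoded image arrays.
tiles = [[i] * 4 for i in range(1000)]
cache = os.path.join(tempfile.mkdtemp(), "tiles.pkl")
build_tile_cache(tiles, cache)
for _ in range(3):  # every epoch reuses the same cache
    batch_source = load_tile_cache(cache)
assert batch_source == tiles
```

When the original per-epoch cost is dominated by decoding large source images, moving that work out of the training loop can account for order-of-magnitude epoch speedups like the one Peter described.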
About the Speakers
In his role at Janssen Pharmaceuticals, Peter Shen supports the research and development of new pharmaceuticals through data-driven decision-making. Prior to Janssen, Peter was a graduate student researcher at the Dana-Farber Cancer Institute, served as a product manager at both Aimsio and Billion Health, and worked as a bioinformatics co-op at both the BC Cancer Agency and the Public Health Agency of Canada.
Katherine Shakman empowers and supports data science teams across a variety of industries. Katie’s background is in health data science and neuroscience. She believes computational tools will transform the way we interact with our world and each other, particularly in healthcare and the life sciences, and she is working to make that transformation benefit society. In her doctoral research, Katie used neural imaging and behavioral analysis to study interactions between the neural circuits regulating attention and memory in insects. She applies her skills in experimental design, problem solving, project management, analytics, machine learning, data visualization, and technical communication to shape the future of technology.
To learn more
Watch the webinar “Multi-GPU Machines for Computer Vision-based Deep Learning Models in Histopathology” to learn more about the key technical challenges Janssen faced and how they addressed them.