By David Bloch, Data Science Evangelist, Domino on May 01, 2020 in Perspective
Image source: https://www.r-project.org/
The latest update of R dropped on the 24th of April, taking the statistical computing language to Version 4.0. The update is packed with new features, bug fixes and changes to the underlying behavior and syntax of existing functions that help optimize implementation and reusability of code.
Because R 4.0 changes core components, R packages need to be re-installed to ensure that they function correctly. This update can create model reproducibility headaches, and it revives the age-old challenge of keeping versions consistent across teams of data scientists, so that they can co-develop models and be confident that models developed on one machine can be run by other data scientists and deployed into production.
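To make the re-installation point concrete, here is a minimal sketch in Python (chosen only for illustration) that flags packages built under a pre-4.0 version of R. It assumes each package's build information follows the "Built" field that R writes into an installed package's DESCRIPTION file; the package names, versions, and timestamps below are hypothetical sample data, and the helper names are made up for this example.

```python
import re

# Hypothetical sample of "Built" fields, in the format assumed to match what R
# records in each installed package's DESCRIPTION file.
installed = {
    "ggplot2": "R 3.6.3; ; 2020-03-10 09:12:44 UTC; unix",
    "data.table": "R 4.0.0; x86_64-pc-linux-gnu; 2020-04-27 18:03:11 UTC; unix",
    "dplyr": "R 3.6.1; x86_64-pc-linux-gnu; 2019-11-02 14:25:30 UTC; unix",
}


def built_major_minor(built_field):
    """Extract (major, minor) of the R version a package was built under."""
    match = re.match(r"R (\d+)\.(\d+)\.\d+", built_field)
    if match is None:
        raise ValueError(f"unrecognized Built field: {built_field!r}")
    return int(match.group(1)), int(match.group(2))


def needs_reinstall(packages, target=(4, 0)):
    """Return sorted names of packages built under an R version older than target."""
    return sorted(
        name
        for name, built in packages.items()
        if built_major_minor(built) < target
    )


# List the packages that would have to be re-installed after the upgrade.
print(needs_reinstall(installed))
```

On a real system the same information would come from the R library itself rather than a hand-written dictionary; the point is only that every package built under an older release has to be rebuilt, which is where version drift creeps in.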
A full list of the changes made in R 4.0 can be found at the Comprehensive R Archive Network.
Environment Management is a key component of Domino’s platform. It enables teams to configure and maintain multiple analytics environments that can easily be applied to their projects and workbenches in a safe, seamless manner, with minimal to no IT interaction. Developers don’t need to configure their own local machines or set up new virtual environments to test updates or new packages, and an environment can easily be rolled out to other team members once deployed.
This functionality is often used to produce “gold standard” production environments where all packages and software versions have been tested and assessed for being worthy of production, alongside research and development environments where data scientists can test the latest and greatest packages and software updates.
Like all functions within Domino, environment management is fully version-controlled, meaning that changes can be tracked over time and full environments can be restored if a breaking change occurs in an underlying package or software component. This is critical for major upgrades like R 4.0 that could otherwise be disruptive and cause your models to break. Having a platform that can recall and use the exact environments needed to reproduce your model, even years later, is key.
So how does Domino help?
- Compute Environments can be shared across users. Once an admin creates a new R 4.0 environment, they can make it immediately available to everyone. There’s no email blast saying “Here are instructions on how to upgrade R on your laptop or environment”.
- Domino allows you to have multiple compute environments and toggle between them easily. Updating R requires you to re-install all your packages, which is likely to bump up the versions of those packages too, so there are a lot of places for breaking changes. In Domino, you can create a new R 4.0 compute environment and then try out your code on it. If it doesn’t work, perhaps because you rely on a package whose maintainers haven’t updated it yet, you can toggle back almost immediately. Otherwise, you’re upgrading your laptop or server, testing your code, and then downgrading again. This is a pain if you’re a single user, but a real headache if you’ve got a lot of users. There’s no email blast going out saying “We’re upgrading tonight, so can everyone please test their stuff before EOD and let me know if there are issues”.
- Finally, because Domino versions and tracks environments automatically, an upgrade doesn’t break your old work: your old code can still run on the old environment. In fact, you have a record of exactly which environment, and which version of that environment, the code works with. There’s no email blast going out saying “Does anyone remember what version of R Jim was using, before he left, when he created the plots for our quarterly report?”
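The idea behind the points above can be sketched as a small data structure: every change to an environment is stored as a new immutable revision, and each piece of work records the revision it ran on. This is an illustrative model only, not Domino’s API or implementation; the class names, the `publish` and `get` methods, the environment name, and the pinned package versions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Revision:
    """One immutable snapshot of an environment (hypothetical model)."""
    number: int
    r_version: str
    packages: tuple  # pinned (name, version) pairs


@dataclass
class Environment:
    """A named environment that keeps every revision ever published."""
    name: str
    revisions: list = field(default_factory=list)

    def publish(self, r_version, packages):
        """Append a new revision; earlier revisions are never modified."""
        rev = Revision(
            number=len(self.revisions) + 1,
            r_version=r_version,
            packages=tuple(sorted(packages.items())),
        )
        self.revisions.append(rev)
        return rev

    def get(self, number):
        """Look up a revision by its 1-based number."""
        if not 1 <= number <= len(self.revisions):
            raise KeyError(f"no revision {number} in {self.name}")
        return self.revisions[number - 1]

    @property
    def latest(self):
        return self.revisions[-1]


env = Environment("team-r")
old = env.publish("3.6.3", {"ggplot2": "3.3.0"})
new = env.publish("4.0.0", {"ggplot2": "3.3.0"})

# Each run records the revision it used, so old work stays reproducible
# even after the environment moves on to R 4.0.
run_log = {"quarterly-report-plots": old.number}
pinned = env.get(run_log["quarterly-report-plots"])
print(pinned.r_version, env.latest.r_version)
```

Because revisions are only ever appended, toggling back after a failed upgrade amounts to pointing a project at an earlier revision number rather than rebuilding anything.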
The tools and technology that underpin data science are constantly evolving. The usual approach by data science teams has been to stand up virtual machine environments as development hubs and test packages there, then roll out instructions for each data scientist to follow in order to update their own development environment. This approach often led to package and version incompatibilities that required troubleshooting and hampered data scientists’ progress.
Environment Management makes it easy for data scientists to test new packages and software updates. More importantly, these environments are easily shared across data science teams, so the days of spending hours trying to set up an environment to test a peer’s work are long gone.
This gives data scientists the ability to set up gold standard, production-worthy environments that can be trusted, without sacrificing the freedom and flexibility to try emerging packages and software as they become available.