Whether you are a machine learning enthusiast or a ninja data scientist training models for all sorts of applications, you may have heard of the need to use graphics processing units (GPUs) to squeeze out the best performance when training and scaling your models. In short, a training task that takes a few minutes on a CPU with a small dataset may take hours, days, or even weeks on a larger dataset if a GPU is not used. GPU acceleration is a topic we have previously addressed in the blog. For example, take a look at the recent posts “On-demand Spark clusters with GPU acceleration” and “Faster Deep Learning with GPUs and Theano”.
In this post, we will explore some of the reasons why GPUs are the best friends of machine learning and deep learning, and how this friendship has come about. A good place to start is by looking at the central processing unit, or CPU, and building our understanding of how GPU acceleration comes about from there. We will then talk about some developments using libraries such as NVIDIA’s CUDA and RAPIDS.
Central Processing Unit (CPU), or the Brains of a Computer
You may have seen some photographs depicting computers that take up a room or two rather than fit in your hand. Take for instance the Electronic Numerical Integrator and Computer (ENIAC), which in the mid-1940s was one of the first programmable, general-purpose computers available. A computer such as this was effectively purpose-built: the ENIAC was a base-10 machine (not binary!), with 20 electronic accumulators to serve as registers for storing intermediate arithmetic results, and it occupied 167 square meters. Since then, the central processing units of computers have changed a lot and it is fair to say that many of us do not think twice about the capacity of the processors that power not only our computers, but also phones, televisions, cars, and vending machines. If we stop to consider this for a moment we can see that CPUs are effectively the brains of these devices, telling other components what to do and when to do it, and following the instructions provided by the software. Without a CPU, you may not be able to read this post as easily as you are now.
The advent of the microprocessor in the early 1970s marks a change in the brains that computers used. With the announcement of Intel’s 4004 chip 50 years ago, almost to the day, we have the start of a journey that has continued to this day. The Intel 4004 was effectively the world’s first general-purpose programmable processor, and this achievement was followed by the announcement of Intel’s 8-bit 8080 microprocessor, which enabled us to represent signed numbers in the range of −128 to +127. It may not sound like a lot, but it was good enough to perform arithmetic calculations without needing a room of its own. Apart from Intel’s chips, other notable 8-bit competitors include the Zilog Z80 and Motorola’s 6800 microprocessors. The appearance of 16-bit designs was quickly followed by 32-bit versions, with chips such as the Intel i386 capable of representing signed numbers in the range of roughly +/-2x10^9. Another notable example of 32-bit chips is the MC68000 from Motorola, used in the Apple Lisa as well as in the Atari ST and Commodore Amiga. From the 1990s the norm has been to use 64-bit processors. An important mention should go to the AMD64 as well as Intel’s x86-64 chips. Both are capable of backward compatibility with 32-bit legacy applications as well as 64-bit native ones.
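The ranges quoted above follow directly from the word size. As a minimal sketch, assuming signed numbers are stored in two's complement (the function name `signed_range` is ours, purely for illustration), we can compute the range for any number of bits:

```python
from typing import Tuple


def signed_range(bits: int) -> Tuple[int, int]:
    """Return (lowest, highest) value of a two's-complement signed integer."""
    if bits < 1:
        raise ValueError("bits must be a positive integer")
    # One bit is spent on the sign, leaving bits - 1 for the magnitude.
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for bits in (8, 16, 32, 64):
    low, high = signed_range(bits)
    print(f"{bits:>2}-bit: {low:,} to {high:,}")
```

For 8 bits this gives −128 to +127, and for 32 bits the limits are just over two billion in either direction, in line with the figures mentioned for the 8080 and the i386.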
The story continues, for example with the announcement of the Apple M1 in 2020, followed by the M1 Pro and M1 Max in 2021. Let us take the M1 for instance: designed as a CPU and GPU combo used in Apple’s tablets, laptops, and desktops, it is marketed as the “fastest CPU core in low-power silicon, the world’s best CPU performance per watt, the world’s fastest integrated graphics in a personal computer, and breakthrough machine learning performance with the Apple Neural Engine.” And there you have it, machine learning accelerated thanks to a chip.
One way to think about the evolution of the microprocessor and its capabilities is encapsulated in the well-known “Moore’s Law,” which suggests that the complexity of an integrated circuit, compared to its cost, doubles every two years or so. This empirical observation has held quite well since the 1970s. As such, the self-fulfilling prophecy of Moore’s Law has influenced not only the electronics industry but also other industries where keeping up with the growth is a tacit requirement. Nonetheless, keeping up that exponential curve is becoming harder, and even Gordon Moore went on record in 2015 saying that the eponymous law may be out of vogue in the next few years.
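To get a feel for what doubling every two years means, here is a rough sketch. It assumes that transistor count is a fair proxy for the “complexity” in Moore’s Law, and it takes the commonly quoted figure of roughly 2,300 transistors for the Intel 4004 in 1971 as its starting point. The function name `projected_count` is hypothetical, and the output is an idealized projection rather than a record of real chips.

```python
def projected_count(start_count, start_year, target_year, doubling_years=2.0):
    """Project a count forward, assuming it doubles every `doubling_years`."""
    doublings = (target_year - start_year) / doubling_years
    return start_count * 2 ** doublings


# Assumption: the Intel 4004 (1971) had roughly 2,300 transistors.
for year in (1971, 1981, 1991, 2001, 2011, 2021):
    estimate = projected_count(2_300, 1971, year)
    print(f"{year}: ~{estimate:,.0f} transistors")
```

Fifty years at this pace amounts to 25 doublings, a factor of about 33.5 million, which shows why even a small slowdown in the doubling period matters so much.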
In any event, CPUs are not the only powerhouse in a computer. Although there is a lot you can tune in a CPU, including the number of cores, the clock speed, and the number of threads, other components may be better placed for some tasks. Bring on the mighty graphics processing unit.
Graphics Processing Unit (GPU), the Soul of a Computer
If the CPU is the brain of our computer, some say that the GPU is its soul. Here I argue that the GPU is at the heart of machine learning tasks. A GPU is the electronic component that lets us accelerate the creation of images in the section of our computer's random-access memory (RAM) that is used to drive the video display. This portion of RAM is called a frame buffer, and a typical video card contains frame buffer circuitry that lets us display things on our monitor. As you can imagine, the use of GPUs for video games has been an important application for quite some time. In the early days of arcades, RAM for frame buffering was not the cheapest thing to get and instead of adding more memory, specialized circuits were used to help the CPU with the task. A good example of these early souls is the Television Interface Adaptor used by Atari.
You may be familiar with, or have even used, some of the early monochromatic monitors of the 1980s. If so, you may have used Hitachi’s ACRTC HD63484 processor, supporting an impressive 4K resolution. We mentioned the Commodore Amiga when talking about CPUs. Well, you will not be surprised to hear that it also shipped with custom graphics hardware that accelerated bitmap manipulation. If you are not of that vintage, you surely have heard of the Video Graphics Array, or VGA, introduced in the late 1980s by IBM and used as a display standard for many years. You could use hardware like this to render two-dimensional graphics in an accelerated way, but the requirement for 3D rendering by computers and console games provided the pathway to the creation of boards such as the Sega Model 1 and Model 2 or the Namco System 22.
None of these examples were actually called GPUs. The term is said to have been used first by Sony in reference to the 32-bit Toshiba graphics chip used in the PlayStation One released in December 1994. Others argue that the term should be attributed to the marketing of NVIDIA’s GeForce 256. One important aspect of the GeForce 256 was the offloading of tasks like geometry calculation from the CPU. Whereas the GeForce 256 was able to deliver 50 gigaflops of floating-point performance and was built on a 220nm lithography process, a more modern equivalent such as the GeForce RTX 2080 Ti supports 14.2 teraflops and is built on a 12nm process. In other words, gamers are a happy bunch, with GPUs that let them experience flagship gaming titles in their full glory, using shadow volumes, elevation mapping, and accurate refraction among many other things. However, that acceleration has not gone unnoticed by other users with equally compelling applications.
Take for instance the repurposing of some old PlayStation consoles to build a supercomputer to study black holes. The use of the Cell processors and Tesla CUDA GPUs inside the gaming console is reported to have contributed to speeding up calculations over a traditional computer by a factor of nearly 10 for the purposes of modeling gravitational waves. The move from gaming and computer graphics towards other applications did not happen by chance. The reasons are clear when you compare a CPU with a GPU. For example, the memory requirement for a CPU is higher than that of a GPU, while GPUs offer higher throughput because they process many instructions in parallel, as opposed to the largely serial nature of the CPU. Companies such as NVIDIA have been thinking strategically about these applications, including machine learning, for quite some time.
GPUs at the Heart of Deep Learning
The variety of problems that can be tackled with machine learning and deep learning algorithms has been growing, and with access to bigger and more diverse datasets, the complexity of these models increases. The acceleration provided by GPUs is therefore a welcome feature that many machine learning practitioners and data scientists are keen to exploit. Consider for example the training of a simple neural network with an architecture comprising one input layer and one output layer. If the input layer has 4 units and the output layer has 2, we have 4 × 2 = 8 connections, or weights, that need to be computed (leaving bias terms aside). This should be an easy problem to solve even by hand. As the number of layers and units increases, we have a larger and larger number of parameters that need to be calculated and tracked, particularly as we apply backpropagation to our neural network. For thousands of parameters, a robust CPU may be of help, but as we continue to move into the depths of the hidden layers of the neural networks employed in deep learning, the number of parameters grows into the tens of billions.
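The counting argument above generalizes to any stack of fully connected layers: each pair of consecutive layers contributes one weight per connection. The following is a minimal sketch of that arithmetic. The helper `count_parameters` and the larger layer sizes are hypothetical, chosen only for illustration, and bias terms are left out unless requested.

```python
def count_parameters(layer_sizes, include_biases=False):
    """Count the trainable parameters of a fully connected network.

    `layer_sizes` lists the number of units per layer, input layer first.
    """
    pairs = list(zip(layer_sizes, layer_sizes[1:]))
    # One weight per connection between consecutive layers.
    weights = sum(n_in * n_out for n_in, n_out in pairs)
    # Optionally, one bias per unit in every layer after the input.
    biases = sum(n_out for _, n_out in pairs) if include_biases else 0
    return weights + biases


print(count_parameters([4, 2]))               # the 4-input, 2-output example
print(count_parameters([784, 512, 512, 10]))  # a hypothetical deeper network
```

The first call reproduces the 8 connections of the example, while adding just two hidden layers of a few hundred units already pushes the count into the hundreds of thousands.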
If instead of CPUs we employ GPUs, our calculations are accelerated thanks to the larger number of cores and the parallelization that they enable. NVIDIA has made it possible to harness the power of their GPUs with the creation of a parallel computing platform and programming model called Compute Unified Device Architecture, or CUDA for short, which provides programmers with direct, unfettered access to the GPU’s parallel computational elements. Designed to work with robust programming languages such as Fortran and C/C++, CUDA lets “the developer express massive amounts of parallelism and direct the compiler to the portion of the application that maps to the GPU” as NVIDIA themselves put it. Parallelization is therefore within reach of many applications, not only for machine learning but also in areas that use high-performance computing such as physics, scientific modeling, medical sciences, and even the creative arts.
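To make the idea of expressing parallelism a little more concrete, here is a CPU-only sketch in plain Python that mimics how a CUDA kernel is usually organized: the work is split into blocks of threads, and each thread handles the single element whose index is `block_idx * block_dim + thread_idx`. This is not CUDA code and does not touch a GPU. The names `saxpy_kernel` and `launch` are hypothetical, and the nested loops stand in for work that a GPU would run concurrently.

```python
def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """Work done by one emulated thread: out[i] = a * x[i] + y[i]."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(x):  # guard: the last block may have more threads than data
        out[i] = a * x[i] + y[i]


def launch(kernel, n, block_dim, *args):
    """Emulate a kernel launch over n elements; return the number of blocks."""
    grid_dim = (n + block_dim - 1) // block_dim  # ceiling division
    # On a GPU these iterations would execute in parallel, not one by one.
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)
    return grid_dim


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 10.0, 10.0, 10.0, 10.0]
out = [0.0] * len(x)
blocks = launch(saxpy_kernel, len(x), 2, 2.0, x, y, out)
print(blocks, out)
```

The point to notice is that no emulated thread depends on the result of another, which is what lets a GPU spread the same computation across thousands of cores.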
The specialization for particular applications may even run to the core of the GPU, pun definitely intended. Take for instance the Titan RTX, with specifications to “power AI, machine learning and creative workflows,” compared to the GTX 1080Ti which is #GameReady. Similarly, the libraries that power these devices are specialized too. Look for example at the CUDA-X AI libraries that have given rise to implementations such as RAPIDS, letting data scientists and data engineers collaborate in the construction of end-to-end pipelines running entirely on the GPU: from data preparation tasks with a pandas-like dataframe library called cuDF, through to implementations of popular machine learning algorithms in cuML, and even a library for network analysis called cuGraph. RAPIDS is able to integrate with other deep learning libraries and even has support for Python-based visualization libraries, making it easier for data scientists to concentrate on what they know best. The Domino MLOps platform has comprehensive integration with the aforementioned technologies and, thanks to the Domino/NVIDIA partnership, can leverage GPU-accelerated compute, seamlessly use RAPIDS, and utilize NVIDIA containers out-of-the-box. You can learn more about this by watching the recording of the joint Domino/NVIDIA webinar titled “How to Simply Run Complex AI Training & Inference Workloads with Domino & NVIDIA”.
In this post, we covered some of the reasons why GPU acceleration is the thing to consider when increasing the complexity of our models. Whereas your CPU is a critical component of your computer, a GPU carries out more specialized tasks. A CPU is suited for tasks where latency or per-core performance is important so that things are executed quickly. A GPU is a general-purpose parallel processor that may have started life powering graphics and 3D rendering tasks suitable for games, but today we are able to exploit it to make machine learning tasks more efficient and faster.
With dedicated libraries such as NVIDIA's CUDA and CUDA-X AI, we are able to make better use of our GPUs. Furthermore, specialized high-level libraries and APIs such as RAPIDS make it easier than ever to get started. NVIDIA has a number of resources including some online courses on deep learning, accelerated computing, and data science.