
    On-Demand Spark clusters with GPU acceleration

    March 29, 2021   15 min read

    Apache Spark has become the de facto standard for processing large amounts of stationary and streaming data in a distributed fashion. The addition of the MLlib library, consisting of common learning algorithms and utilities, opened up Spark for a wide range of machine learning tasks and paved the way for running complex machine learning workflows on top of Apache Spark clusters. Some of the key benefits of using Spark for machine learning include: 

    • Distributed Learning - Parallelize compute-heavy workloads such as distributed training or hyper-parameter tuning 
    • Interactive Exploratory Analysis - Efficiently load large data sets in a distributed manner. Explore and understand the data using a familiar interface with Spark SQL 
    • Featurization and Transformation - Sample, aggregate, and re-label large data sets.

    At the same time, the use of Spark by Data Scientists presents its own set of challenges: 

    • Complexity – Apache Spark uses a layered architecture that mandates a master node, a cluster manager, and a set of worker nodes. Quite often Spark is not deployed in isolation but sits on top of a virtualized infrastructure (e.g. virtual machines or OS-level virtualization). Maintaining the cluster and the underlying infrastructure configuration can be a complex and time-consuming task 
    • Lack of GPU acceleration – Complex machine learning workloads, especially those involving Deep Learning, benefit from GPU architectures that are well adapted for vector and matrix operations. The executor-level, CPU-centric parallelization that Spark provides is typically no match for the large and fast registers and optimized bandwidth of the GPU architecture 
    • Cost – Keeping a Spark cluster up and running while using it only intermittently can quickly become a costly exercise (especially if Spark is running in the cloud). Quite often Spark is only needed for a fraction of the ML pipeline (e.g. data pre-processing), as the result set it produces fits comfortably in something like a cuDF DataFrame

    To address the challenges associated with complexity and cost, Domino offers the ability to dynamically provision and orchestrate a Spark cluster directly on the infrastructure backing the Domino instance. This allows Domino users to get quick access to Spark without having to rely on their IT team to create and manage one for them. The Spark workloads are fully containerized on the Domino Kubernetes cluster, and users can access Spark interactively through a Domino workspace (e.g. JupyterLab) or in batch mode through a Domino job or spark-submit. Moreover, because Domino can provision and de-provision clusters automatically, users can spin up Spark clusters on demand, use them as part of a complex pipeline, and tear them down once the stage they were needed for is complete.

    Spark driver interacting with worker nodes within Domino

    To address the need for GPU-accelerated Spark, Domino has teamed up with NVIDIA. The Domino platform has been capable of leveraging GPU-accelerated hardware (both in the cloud and on-premises) for quite some time, and thanks to its underlying Kubernetes architecture it can natively deploy and use NGC containers out of the box. This, for example, enables Data Scientists to natively use NVIDIA RAPIDS, a suite of software libraries built on CUDA-X AI that gives them the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. In addition, Domino supports the integration of the RAPIDS Accelerator for Apache Spark, which combines the power of the RAPIDS cuDF library and the scale of the Spark distributed computing framework. The RAPIDS Accelerator library also has a built-in accelerated shuffle based on UCX that can be configured to leverage GPU-to-GPU communication and RDMA capabilities. These capabilities allow Domino to provide streamlined access to GPU-accelerated ML/DL frameworks and GPU-accelerated Apache Spark components through a unified and Data Scientist-friendly UI. 

    Showing GPU accelerated architecture - Spark components with NVIDIA SQL/DF plugin and accelerated ML/DL frameworks on top of Spark 3.0 core

    Configuring Spark clusters with RAPIDS Accelerator in Domino 

    By default, Domino does not come with a Spark-compatible Compute Environment (Docker image), so our first task is to create one. Creating a new Compute Environment is a well-documented process, so feel free to check the official documentation if you need a refresher.  

    The key steps are to give the new environment a name (e.g. Spark 3.0.0 GPU) and use bitnami/spark:2.4.6 as the base image. Domino’s on-demand Spark functionality has been developed and tested using open-source Spark images from Bitnami. You could also use the bitnami/spark:3.0.0 image: because we replace the Spark installation inside the image anyway, the choice doesn't really matter.  

    Screenshot of the Domino New Environment UI

    Next, we need to edit the Compute Environment’s Dockerfile to bring Spark up to 3.0.0, add the NVIDIA CUDA drivers, the RAPIDS accelerator, and the GPU discovery script. Adding the code below to the Dockerfile instructions triggers a compute environment rebuild.

    USER root


    RUN apt-get update && apt-get install -y wget && rm -r /var/lib/apt/lists /var/cache/apt/archives

    ENV HADOOP_HOME=/opt/hadoop
    ENV HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
    ENV SPARK_HOME=/opt/bitnami/spark

    ### Remove the pre-installed Spark since it is pre-bundled with hadoop but preserve the python env
    WORKDIR /opt/bitnami
    RUN rm -rf ${SPARK_HOME}

    ### Install the desired Hadoop-free Spark distribution
    RUN wget -q${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
    tar -xf spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
    rm spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
    mv spark-${SPARK_VERSION}-bin-without-hadoop ${SPARK_HOME} && \
    chmod -R 777 ${SPARK_HOME}/conf

    ### Install the desired Hadoop libraries
    RUN wget -q${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz && \
    tar -xf hadoop-${HADOOP_VERSION}.tar.gz && \
    rm hadoop-${HADOOP_VERSION}.tar.gz && \
    mv hadoop-${HADOOP_VERSION} ${HADOOP_HOME}

    ### Setup the Hadoop libraries classpath
    RUN echo 'export SPARK_DIST_CLASSPATH="$(hadoop classpath):${HADOOP_HOME}/share/hadoop/tools/lib/*:/opt/sparkRapidsPlugin"' >> ${SPARK_HOME}/conf/

    ### This is important to maintain compatibility with Bitnami
    RUN /opt/bitnami/scripts/spark/


    RUN apt-get update && apt-get install -y --no-install-recommends \
    gnupg2 curl ca-certificates && \
    curl -fsSL | apt-key add - && \
    echo "deb /" > /etc/apt/sources.list.d/cuda.list && \
    echo "deb /" > /etc/apt/sources.list.d/nvidia-ml.list && \
    apt-get purge --autoremove -y curl && \
    rm -rf /var/lib/apt/lists/*

    ENV CUDA_VERSION 10.1.243


    # For libraries in the cuda-compat-* package:
    RUN apt-get update && apt-get install -y --no-install-recommends \
    cuda-cudart-$CUDA_PKG_VERSION \
    cuda-compat-10-1 && \
    ln -s cuda-10.1 /usr/local/cuda && \
    rm -rf /var/lib/apt/lists/*

    # Required for nvidia-docker v1
    RUN echo "/usr/local/nvidia/lib" >> /etc/ && \
    echo "/usr/local/nvidia/lib64" >> /etc/

    ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
    ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64

    # nvidia-container-runtime
    ENV NVIDIA_REQUIRE_CUDA "cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411"


    RUN apt-get update && apt-get install -y --no-install-recommends \
    cuda-libraries-$CUDA_PKG_VERSION \
    cuda-nvtx-$CUDA_PKG_VERSION \
    libcublas10= \
    libnccl2=$NCCL_VERSION-1+cuda10.1 && \
    apt-mark hold libnccl2 && \
    rm -rf /var/lib/apt/lists/*

    RUN apt-get update && apt-get install -y --no-install-recommends \
    cuda-nvml-dev-$CUDA_PKG_VERSION \
    cuda-command-line-tools-$CUDA_PKG_VERSION \
    cuda-libraries-dev-$CUDA_PKG_VERSION \
    cuda-minimal-build-$CUDA_PKG_VERSION \
    libnccl-dev=$NCCL_VERSION-1+cuda10.1 \
    libcublas-dev= \
    && \
    rm -rf /var/lib/apt/lists/*

    ENV LIBRARY_PATH /usr/local/cuda/lib64/stubs

    LABEL com.nvidia.cudnn.version="${CUDNN_VERSION}"

    RUN apt-get update && apt-get install -y --no-install-recommends \
    libcudnn7=$CUDNN_VERSION-1+cuda10.1 \
    libcudnn7-dev=$CUDNN_VERSION-1+cuda10.1 \
    && \
    apt-mark hold libcudnn7 && \
    rm -rf /var/lib/apt/lists/*

    # GPU Discovery Script
    ENV SPARK_RAPIDS_DIR=/opt/sparkRapidsPlugin
    RUN wget -q -P $SPARK_RAPIDS_DIR
    RUN chmod +x $SPARK_RAPIDS_DIR/
    RUN echo 'export SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=1 -Dspark.worker.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/"' >> ${SPARK_HOME}/conf/


    RUN wget -q -P $SPARK_RAPIDS_DIR
    RUN wget -q -P $SPARK_RAPIDS_DIR
    ENV SPARK_CUDF_JAR=${SPARK_RAPIDS_DIR}/cudf-0.14-cuda10-1.jar
    ENV SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}/rapids-4-spark_2.12-0.1.0.jar
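
    The discovery script referenced in SPARK_WORKER_OPTS above is how a Spark worker finds out which GPUs it can hand to executors: Spark runs the script and expects a single JSON object on standard output, containing a resource name and a list of addresses. The Python sketch below only illustrates that output format. It is not the script the Dockerfile downloads, and the function name format_gpu_resource is hypothetical.

```python
import json


def format_gpu_resource(gpu_indices):
    """Return the JSON string a Spark resource discovery script is expected to print.

    gpu_indices: iterable of GPU indices visible to the worker (illustrative input;
    a real discovery script would typically query the driver, e.g. via nvidia-smi).
    """
    # Spark treats resource addresses as opaque strings, so indices are stringified.
    return json.dumps({"name": "gpu", "addresses": [str(i) for i in gpu_indices]})


if __name__ == "__main__":
    # A worker with two GPUs (indices 0 and 1).
    print(format_gpu_resource([0, 1]))
```

    With -Dspark.worker.resource.gpu.amount=1, as set in the Dockerfile, each worker advertises a single GPU, so the address list would be expected to contain at least one entry.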


    We can verify that the new environment has been built successfully by inspecting the Revisions section and making sure that the active revision is the most recent one.

    Showing Domino's Revisions tab with successfully built version of the GPU compute environment.

    Now that we have a Spark environment with the RAPIDS accelerator in place, we need to create a Workspace environment - an environment that will host the IDE that we'll use to interact with Spark.

    The process of creating a custom PySpark workspace environment is fully covered in the official Domino documentation. It is similar to how we built the Spark environment above; the key differences are that we use a Domino base image (instead of the Bitnami one) and that we also need to configure Pluggable Workspaces Tools. The latter enables access to the web-based tools inside the compute environment (e.g. JupyterLab).

    To build the workspace environment we create a new Compute Environment (Spark 3.0.0 RAPIDS Workspace Py3.6) using dominodatalab/base:Ubuntu18_DAD_Py3.6_R3.6_20200508 as the base image and we add the following contents to the Dockerfile instructions section:


    RUN mkdir -p /opt/domino

    ### Modify the Hadoop and Spark versions below as needed.
    ENV HADOOP_HOME=/opt/domino/hadoop
    ENV HADOOP_CONF_DIR=/opt/domino/hadoop/etc/hadoop
    ENV SPARK_HOME=/opt/domino/spark

    ### Install the desired Hadoop-free Spark distribution
    RUN rm -rf ${SPARK_HOME} && \
    wget -q${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
    tar -xf spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
    rm spark-${SPARK_VERSION}-bin-without-hadoop.tgz && \
    mv spark-${SPARK_VERSION}-bin-without-hadoop ${SPARK_HOME} && \
    chmod -R 777 ${SPARK_HOME}/conf

    ### Install the desired Hadoop libraries
    RUN rm -rf ${HADOOP_HOME} && \
    wget -q${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz && \
    tar -xf hadoop-${HADOOP_VERSION}.tar.gz && \
    rm hadoop-${HADOOP_VERSION}.tar.gz && \
    mv hadoop-${HADOOP_VERSION} ${HADOOP_HOME}

    ### Setup the Hadoop libraries classpath and Spark related envars for proper init in Domino
    RUN echo "export SPARK_HOME=${SPARK_HOME}" >> /home/ubuntu/.domino-defaults
    RUN echo "export HADOOP_HOME=${HADOOP_HOME}" >> /home/ubuntu/.domino-defaults
    RUN echo "export HADOOP_CONF_DIR=${HADOOP_CONF_DIR}" >> /home/ubuntu/.domino-defaults
    RUN echo "export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native" >> /home/ubuntu/.domino-defaults
    RUN echo "export PATH=\$PATH:${SPARK_HOME}/bin:${HADOOP_HOME}/bin" >> /home/ubuntu/.domino-defaults
    RUN echo "export SPARK_DIST_CLASSPATH=\"\$(hadoop classpath):${HADOOP_HOME}/share/hadoop/tools/lib/*\"" >> ${SPARK_HOME}/conf/

    ### Complete the PySpark setup from the Spark distribution files
    RUN python install

    ### Optionally copy spark-submit to spark-submit.sh to be able to run it from Domino jobs
    RUN spark_submit_path=$(which spark-submit) && \
    cp ${spark_submit_path} ${spark_submit_path}.sh

    ENV SPARK_RAPIDS_DIR=/opt/sparkRapidsPlugin
    RUN wget -q -P $SPARK_RAPIDS_DIR
    RUN wget -q -P $SPARK_RAPIDS_DIR
    ENV SPARK_CUDF_JAR=${SPARK_RAPIDS_DIR}/cudf-0.14-cuda10-1.jar
    ENV SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}/rapids-4-spark_2.12-0.1.0.jar

    Notice that we also add the RAPIDS accelerator at the end and set a number of environment variables to make the plugin readily available in the preferred IDE (e.g. JupyterLab). We also add the following mapping to the Pluggable Workspaces Tools section in order to make Jupyter and JupyterLab available through the Domino UI.

    title: "Jupyter (Python, R, Julia)"
    iconUrl: "/assets/images/workspace-logos/Jupyter.svg"
    start: [ "/var/opt/workspaces/jupyter/start" ]
    port: 8888
    rewrite: false
    internalPath: "/////{
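
    The SPARK_CUDF_JAR and SPARK_RAPIDS_PLUGIN_JAR variables set in the workspace Dockerfile can then be picked up from a notebook when configuring a Spark session. The sketch below is one possible way to do that, not a Domino-prescribed recipe: the helper names rapids_spark_conf and build_session are hypothetical, the GPU amounts are illustrative defaults that need tuning for the actual hardware, and the master URL and other cluster settings are assumed to be supplied by the environment. The property names themselves (spark.plugins, the extraClassPath settings, and the gpu resource amounts) are standard Spark 3.0 and RAPIDS Accelerator settings.

```python
import os


def rapids_spark_conf(env=None, gpus_per_executor=1, gpu_fraction_per_task=0.25):
    """Build a dict of Spark properties that enable the RAPIDS Accelerator.

    env: mapping holding SPARK_CUDF_JAR and SPARK_RAPIDS_PLUGIN_JAR
         (defaults to os.environ, as set by the workspace Dockerfile).
    The two numeric defaults are illustrative assumptions, not recommendations.
    """
    env = os.environ if env is None else env
    cudf_jar = env["SPARK_CUDF_JAR"]
    plugin_jar = env["SPARK_RAPIDS_PLUGIN_JAR"]
    classpath = f"{plugin_jar}:{cudf_jar}"
    return {
        # Loads the RAPIDS SQL plugin on the driver and the executors.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Make both jars visible to the driver and executor JVMs.
        "spark.driver.extraClassPath": classpath,
        "spark.executor.extraClassPath": classpath,
        # GPU scheduling: GPUs per executor and the share each task claims.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        "spark.task.resource.gpu.amount": str(gpu_fraction_per_task),
    }


def build_session(conf, app_name="rapids-example"):
    """Create a SparkSession from the properties above (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily; only needed here

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

    In a workspace, calling build_session(rapids_spark_conf()) would be the intended usage. A fractional spark.task.resource.gpu.amount lets several tasks share one GPU, which is why the default here is below 1.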

    Nikolay Manchev

    Nikolay Manchev is the Principal Data Scientist for EMEA at Domino Data Lab. In this role, Nikolay helps clients from a wide range of industries tackle challenging machine learning use-cases and successfully integrate predictive analytics in their domain specific workflows. He holds an MSc in Software Technologies, an MSc in Data Science, and is currently undertaking postgraduate research at King's College London. His area of expertise is Machine Learning and Data Science, and his research interests are in neural networks and computational neurobiology.
