Run a genomics analysis in a JupyterLab notebook

This tutorial shows you how to run a single-cell genomics analysis using Dask, NVIDIA RAPIDS, and GPUs, which you can configure on Managed Service for Apache Spark. You can configure Managed Service for Apache Spark to run Dask either with its standalone scheduler or with YARN for resource management.

This tutorial configures Managed Service for Apache Spark with a hosted JupyterLab instance to run a notebook featuring a single-cell genomics analysis. Using a Jupyter Notebook on Managed Service for Apache Spark lets you combine the interactive capabilities of Jupyter with the workload scaling that Managed Service for Apache Spark enables. With Managed Service for Apache Spark, you can scale out your workloads from one to many machines, which you can configure with as many GPUs as you need.

This tutorial is intended for data scientists and researchers. It assumes that you are experienced with Python and have basic knowledge of Dask, NVIDIA RAPIDS, and Jupyter notebooks.

Objectives

  • Create a Managed Service for Apache Spark cluster that is configured with GPUs, JupyterLab, and open source components.
  • Run a notebook on Managed Service for Apache Spark.

Costs

In this document, you use the following billable components of Google Cloud:

  • Managed Service for Apache Spark
  • Cloud Storage
  • GPUs
To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin

    1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

      Roles required to select or create a project

      • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
      • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

      Go to project selector

    2. Verify that billing is enabled for your Google Cloud project.

    3. Enable the Dataproc API.

      Roles required to enable APIs

      To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

      Enable the API

Prepare your environment

    1. Select a location for your resources.

      REGION=REGION
      

    2. Create a Cloud Storage bucket.

      gcloud storage buckets create gs://BUCKET --location=${REGION}
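Bucket names must be globally unique and follow the Cloud Storage naming rules. If you script this step, a quick pre-check can reject an invalid name before the create call fails. A minimal sketch, using a hypothetical `validate_bucket_name` helper (not part of gcloud) that covers only a simplified subset of the rules:

```shell
# Hypothetical helper: checks a bucket name against a simplified subset of
# the Cloud Storage naming rules (3-63 chars; lowercase letters, digits,
# dots, dashes, underscores; starts and ends with a letter or digit).
validate_bucket_name() {
  local name="$1"
  local re='^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
  if [[ "$name" =~ $re ]]; then
    echo "ok"
  else
    echo "invalid"
  fi
}

validate_bucket_name "my-genomics-bucket"   # prints "ok"
validate_bucket_name "Bad_Bucket!"          # prints "invalid"
```

This does not check global uniqueness, which you only find out about when the create call runs.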
      

    3. Copy the following initialization actions to your bucket.

      SCRIPT_BUCKET=gs://goog-dataproc-initialization-actions-${REGION}
      gcloud storage cp ${SCRIPT_BUCKET}/gpu/install_gpu_driver.sh gs://BUCKET/gpu/install_gpu_driver.sh
      gcloud storage cp ${SCRIPT_BUCKET}/dask/dask.sh gs://BUCKET/dask/dask.sh
      gcloud storage cp ${SCRIPT_BUCKET}/rapids/rapids.sh gs://BUCKET/rapids/rapids.sh
      gcloud storage cp ${SCRIPT_BUCKET}/python/pip-install.sh gs://BUCKET/python/pip-install.sh
      
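The four copy commands above differ only in the script path, so if you script this step, a loop keeps the source and destination URIs in sync and lets you review them before copying. A sketch, assuming `gs://example-bucket` stands in for the bucket you created in the previous step (the echo line only previews each copy; uncomment the gcloud line to run it):

```shell
SCRIPT_BUCKET=gs://goog-dataproc-initialization-actions-REGION
BUCKET=gs://example-bucket   # assumption: replace with your bucket

for script in gpu/install_gpu_driver.sh dask/dask.sh rapids/rapids.sh python/pip-install.sh; do
  echo "copy ${SCRIPT_BUCKET}/${script} -> ${BUCKET}/${script}"
  # gcloud storage cp "${SCRIPT_BUCKET}/${script}" "${BUCKET}/${script}"
done
```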

Create a Managed Service for Apache Spark cluster with JupyterLab and open source components

    1. Create a Managed Service for Apache Spark cluster.

      gcloud dataproc clusters create CLUSTER_NAME \
          --region REGION \
          --image-version 2.0-ubuntu18 \
          --master-machine-type n1-standard-32 \
          --master-accelerator type=nvidia-tesla-t4,count=4 \
          --initialization-actions gs://BUCKET/gpu/install_gpu_driver.sh,gs://BUCKET/dask/dask.sh,gs://BUCKET/rapids/rapids.sh,gs://BUCKET/python/pip-install.sh \
          --initialization-action-timeout=60m \
          --metadata gpu-driver-provider=NVIDIA,dask-runtime=yarn,rapids-runtime=DASK,rapids-version=21.06,PIP_PACKAGES="scanpy==1.8.1,wget" \
          --optional-components JUPYTER \
          --enable-component-gateway \
          --single-node


    The cluster has the following properties:

    • --region: the region where your cluster is located.
    • --image-version: 2.0-ubuntu18, the cluster image version
    • --master-machine-type: n1-standard-32, the main machine type.
    • --master-accelerator: the type and count of GPUs on the main node, four nvidia-tesla-t4 GPUs.
    • --initialization-actions: the Cloud Storage paths to the installation scripts that install GPU drivers, Dask, RAPIDS, and extra dependencies.
    • --initialization-action-timeout: the timeout for the initialization actions.
    • --metadata: passed to the initialization actions to configure the cluster with NVIDIA GPU drivers, YARN as the Dask runtime, RAPIDS version 21.06, and the scanpy and wget pip packages.
    • --optional-components: configures the cluster with the Jupyter optional component.
    • --enable-component-gateway: allows access to web UIs on the cluster.
    • --single-node: configures the cluster as a single node (no workers).
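The --initialization-actions and --metadata flags each take a single comma-separated string, which is easy to mistype by hand. If you generate the create command from a script, building those strings from data structures keeps the individual pieces readable. A minimal sketch (the bucket name is a placeholder; PIP_PACKAGES is left out here because its value itself contains a comma and therefore needs shell quoting, as in the command above):

```python
bucket = "gs://example-bucket"  # assumption: replace with your bucket

# The four initialization actions, in the order the cluster runs them.
init_actions = ",".join(
    f"{bucket}/{path}"
    for path in [
        "gpu/install_gpu_driver.sh",
        "dask/dask.sh",
        "rapids/rapids.sh",
        "python/pip-install.sh",
    ]
)

# Metadata keys read by the initialization actions.
metadata = ",".join(
    f"{key}={value}"
    for key, value in {
        "gpu-driver-provider": "NVIDIA",
        "dask-runtime": "yarn",
        "rapids-runtime": "DASK",
        "rapids-version": "21.06",
    }.items()
)

print(f"--initialization-actions {init_actions}")
print(f"--metadata {metadata}")
```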

Access the Jupyter Notebook

    1. In the Google Cloud console, open the Managed Service for Apache Spark Clusters page.
      Open Clusters page
    2. Click your cluster and click the Web Interfaces tab.
    3. Click JupyterLab.
    4. Open a new terminal in JupyterLab.
    5. Clone the clara-parabricks/rapids-single-cell-examples repository and check out the dataproc/multi-gpu branch.

      git clone https://github.com/clara-parabricks/rapids-single-cell-examples.git
      cd rapids-single-cell-examples
      git checkout dataproc/multi-gpu


    6. In JupyterLab, navigate to the rapids-single-cell-examples/notebooks directory and open the 1M_brain_gpu_analysis_uvm.ipynb Jupyter Notebook.

    7. To clear all the outputs in the notebook, select Edit > Clear All Outputs.

    8. Read the instructions in the cells of the notebook. The notebook uses Dask and RAPIDS on Managed Service for Apache Spark to guide you through a single-cell RNA-seq workflow on 1 million cells, including processing and visualizing the data. To learn more, see Accelerating Single Cell Genomic Analysis using RAPIDS.

Clean up

    To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

    1. In the Google Cloud console, go to the Manage resources page.

      Go to Manage resources

    2. In the project list, select the project that you want to delete, and then click Delete.
    3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete individual resources

    1. Delete your Managed Service for Apache Spark cluster.
      gcloud dataproc clusters delete CLUSTER_NAME \
          --region=REGION
      
    2. Delete the bucket:
      gcloud storage buckets delete gs://BUCKET

What's next