Run a Spark job on Google Kubernetes Engine

This document shows you how to create a Managed Service for Apache Spark on GKE virtual cluster, and then run a Spark job on the cluster.

Overview of options

While Managed Service for Apache Spark on GKE offers fine-grained control over containerized environments, Google Cloud also provides fully managed and serverless options that can simplify operations and accelerate development. For a comparison of Managed Service for Apache Spark deployment options, see Decide on the best Spark service.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  3. Verify that you have the permissions required to complete this guide.

  4. Verify that billing is enabled for your Google Cloud project.

  5. Enable the Dataproc API.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the API

  6. Install the Google Cloud CLI.

  7. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  8. To initialize the gcloud CLI, run the following command:

    gcloud init
  9. You must have created a standard (not Autopilot) Google Kubernetes Engine (GKE) zonal or regional cluster with Workload Identity enabled on the cluster.
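The prerequisite GKE cluster can be created with the gcloud CLI. The following is a minimal sketch with hypothetical project, cluster, and region names; the --workload-pool flag is what enables Workload Identity on the cluster.

```shell
# Hypothetical values; replace with your own.
PROJECT_ID="my-project"
GKE_CLUSTER="my-gke-cluster"
REGION="us-central1"

# Workload Identity is enabled by setting the cluster's workload pool.
# The create command is commented out because it provisions billable resources.
# gcloud container clusters create "${GKE_CLUSTER}" \
#     --project="${PROJECT_ID}" \
#     --region="${REGION}" \
#     --workload-pool="${PROJECT_ID}.svc.id.goog"
echo "workload pool: ${PROJECT_ID}.svc.id.goog"
```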

Required roles

Certain IAM roles are required to run the examples on this page. Depending on organization policies, these roles may have already been granted. To check role grants, see Do you need to grant roles?.

For more information about granting roles, see Manage access to projects, folders, and organizations.

User roles

To get the permissions that you need to create a Managed Service for Apache Spark cluster, ask your administrator to grant you the following IAM roles:

Service account role

To ensure that the Compute Engine default service account has the necessary permissions to create a Managed Service for Apache Spark cluster, ask your administrator to grant the Dataproc Worker (roles/dataproc.worker) IAM role to the Compute Engine default service account on the project.
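As a sketch, the grant can be made with gcloud projects add-iam-policy-binding. PROJECT_ID and PROJECT_NUMBER below are hypothetical placeholders; the Compute Engine default service account is named after the project number.

```shell
# Hypothetical placeholders; replace with your project's values.
PROJECT_ID="my-project"
PROJECT_NUMBER="123456789012"
SA="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"

# Commented out because it modifies the project's IAM policy:
# gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
#     --member="serviceAccount:${SA}" \
#     --role="roles/dataproc.worker"
echo "member: serviceAccount:${SA}"
```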

Create a virtual cluster

A Managed Service for Apache Spark on GKE virtual cluster is created as the deployment platform for Managed Service for Apache Spark components. It's a virtual resource: unlike a Managed Service for Apache Spark on Compute Engine cluster, it doesn't include separate master and worker VMs.

  • Managed Service for Apache Spark on GKE creates node pools within a GKE cluster when you create a Managed Service for Apache Spark on GKE virtual cluster.

  • Managed Service for Apache Spark on GKE jobs are run as pods on these node pools. The node pools and scheduling of pods on the node pools are managed by GKE.

  • Create multiple virtual clusters. You can create and run multiple virtual clusters on a GKE cluster to improve resource utilization by sharing node pools across the virtual clusters.

    • Each virtual cluster:
      • is created with separate properties, including Spark engine version and workload identity
      • is isolated within a separate GKE namespace on the GKE cluster

Console

  1. In the Google Cloud console, go to the Managed Service for Apache Spark Clusters page.

    Go to Clusters

  2. Click Create cluster.

  3. In the Create Managed Service for Apache Spark cluster dialog, click Create in the Cluster on GKE row.

  4. In the Set up cluster panel:

    1. In the Cluster Name field, enter a name for the cluster.
    2. In the Region list, select a region for the Managed Service for Apache Spark on GKE virtual cluster. This region must be the same region where your existing GKE cluster is located (which you select in the next item).
    3. In the Kubernetes Cluster field, click Browse to select your existing GKE cluster.
    4. Optional: In the Cloud Storage staging bucket field, you can click Browse to select an existing Cloud Storage bucket that Managed Service for Apache Spark on GKE will use to stage artifacts. Leave this field blank to have Managed Service for Apache Spark on GKE create a staging bucket.
  5. In the left panel, click Configure Node pools, then in the Node pools panel, click Add a pool.

    1. To reuse an existing Managed Service for Apache Spark on GKE node pool:
      1. Click Reuse existing node pool.
      2. Input the name of the existing node pool and select its Role. At least one node pool must have the DEFAULT role.
      3. Click Done.
    2. To create a new Managed Service for Apache Spark on GKE node pool:
      1. Click Create a new node pool.
      2. Input the following node pool values:
        • Node pool name
        • Role: At least one node pool must have the DEFAULT role.
        • Location: Specify a zone within the Managed Service for Apache Spark on GKE cluster region.
        • Node pool machine type
        • CPU platform
        • Preemptibility
        • Min: Minimum node count.
        • Max: Maximum node count. The maximum node count must be greater than 0.
    3. Click Add a pool to add more node pools. All node pools must have the same location. You can add a total of four node pools.
  6. (Optional) If you have set up a Managed Service for Apache Spark Persistent History Server (PHS) to view Spark job history on active and deleted Managed Service for Apache Spark on GKE clusters, click Customize cluster. Then, in the History server cluster field, browse for and choose your PHS cluster. The PHS cluster must be located in the same region as the Managed Service for Apache Spark on GKE virtual cluster.

  7. Click Create to create the Managed Service for Apache Spark cluster. Your Managed Service for Apache Spark on GKE cluster appears in a list on the Clusters page. Its status is Provisioning until the cluster is ready to use, and then the status changes to Running.

gcloud

Set environment variables, then run the gcloud dataproc clusters gke create command locally or in Cloud Shell to create a Managed Service for Apache Spark on GKE cluster.

  1. Set environment variables:

    DP_CLUSTER=cluster-name
    REGION=region
    GKE_CLUSTER=gke-cluster-name
    BUCKET=bucket-name
    DP_POOLNAME=node-pool-name
    PHS_CLUSTER=phs-cluster-name
    
    Notes:

    • DP_CLUSTER: Set the Managed Service for Apache Spark virtual cluster name, which must start with a lowercase letter, followed by up to 54 lowercase letters, numbers, or hyphens, and cannot end with a hyphen.
    • REGION: The region must be the same as the region where the GKE cluster is located.
    • GKE_CLUSTER: The name of your existing GKE cluster.
    • BUCKET: (Optional) You can specify the name of a Cloud Storage bucket, which Managed Service for Apache Spark will use to stage artifacts. If you do not specify a bucket, Managed Service for Apache Spark on GKE will create a staging bucket.
    • DP_POOLNAME: The name of a node pool to create on the GKE cluster.
    • PHS_CLUSTER: (Optional) Managed Service for Apache Spark PHS Server to use to view Spark job history on active and deleted Managed Service for Apache Spark on GKE clusters. The PHS cluster must be located in the same region as the Managed Service for Apache Spark on GKE virtual cluster.
  2. Run the command:

    gcloud dataproc clusters gke create ${DP_CLUSTER} \
        --region=${REGION} \
        --gke-cluster=${GKE_CLUSTER} \
        --spark-engine-version=latest \
        --staging-bucket=${BUCKET} \
        --pools="name=${DP_POOLNAME},roles=default" \
        --setup-workload-identity \
        --history-server-cluster=${PHS_CLUSTER}
    
    Notes:

    • --spark-engine-version: The Spark image version used on the Managed Service for Apache Spark cluster. You can use an identifier, such as 3, 3.1, or latest, or you can specify the full subminor version, such as 3.1-dataproc-5.
    • --staging-bucket: Delete this flag to have Managed Service for Apache Spark on GKE create a staging bucket.
    • --pools: This flag specifies a new or existing node pool that Managed Service for Apache Spark will create or use to run the workload. List the Managed Service for Apache Spark on GKE node pool settings, separated by commas, for example:
      --pools=name=dp-default,roles=default,machineType=e2-standard-4,min=0,max=10
      
      You must specify the node pool name and role. Other node pool settings are optional. You can use multiple --pools flags to specify multiple node pools. At least one node pool must have the default role. All node pools must have the same location.
    • --setup-workload-identity: This flag enables Workload Identity bindings. These bindings allow the Kubernetes service accounts (KSAs) to act as the default Managed Service for Apache Spark VM Service Account (Data Plane identity) of the virtual cluster.
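Because cluster creation fails on an invalid name, a quick local pre-flight check can be useful. The following is an illustrative sketch of the naming rule described in the notes above (start with a lowercase letter, followed by up to 54 lowercase letters, digits, or hyphens, with no trailing hyphen); it is not part of the service.

```shell
# Prints "valid" if the name matches the documented pattern, else "invalid".
validate_name() {
  if [[ "$1" =~ ^[a-z]([a-z0-9-]{0,53}[a-z0-9])?$ ]]; then
    echo "valid"
  else
    echo "invalid"
  fi
}

validate_name "dp-gke-cluster"    # valid
validate_name "dp-gke-cluster-"   # invalid: ends with a hyphen
validate_name "1cluster"          # invalid: starts with a digit
```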

REST

Complete a virtualClusterConfig as part of a cluster.create API request.

Before using any of the request data, make the following replacements:

  • PROJECT: Google Cloud project ID
  • REGION: Dataproc virtual cluster region (same region as the existing GKE cluster region)
  • DP_CLUSTER: Dataproc cluster name
  • GKE_CLUSTER: GKE cluster name
  • NODE_POOL: Node pool name
  • PHS_CLUSTER: Persistent History Server (PHS) cluster name
  • BUCKET: (Optional) Staging bucket name. Leave this empty to have Managed Service for Apache Spark on GKE create a staging bucket.

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters

Request JSON body:

{
  "clusterName":"DP_CLUSTER",
  "projectId":"PROJECT",
  "virtualClusterConfig":{
    "auxiliaryServicesConfig":{
      "sparkHistoryServerConfig":{
        "dataprocCluster":"projects/PROJECT/regions/REGION/clusters/PHS_CLUSTER"
      }
    },
    "kubernetesClusterConfig":{
      "gkeClusterConfig":{
        "gkeClusterTarget":"projects/PROJECT/locations/REGION/clusters/GKE_CLUSTER",
        "nodePoolTarget":[
          {
            "nodePool":"projects/PROJECT/locations/REGION/clusters/GKE_CLUSTER/nodePools/NODE_POOL",
            "roles":[
              "DEFAULT"
            ]
          }
        ]
      },
      "kubernetesSoftwareConfig":{
        "componentVersion":{
          "SPARK":"latest"
        }
      }
    },
    "stagingBucket":"BUCKET"
  }
}
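One way to send the request is with curl, as sketched below: an abbreviated body is written to request.json and checked for well-formedness, and the POST itself is shown commented out because it needs gcloud credentials and real values in place of the placeholders.

```shell
# Abbreviated request body; see the full JSON above for all fields.
cat > request.json <<'EOF'
{
  "clusterName": "DP_CLUSTER",
  "projectId": "PROJECT",
  "virtualClusterConfig": {
    "kubernetesClusterConfig": {
      "gkeClusterConfig": {
        "gkeClusterTarget": "projects/PROJECT/locations/REGION/clusters/GKE_CLUSTER"
      }
    }
  }
}
EOF

# Confirm the body is well-formed JSON before sending it.
python3 -m json.tool request.json > /dev/null && echo "request.json: valid JSON"

# gcloud auth print-access-token supplies the OAuth token:
# curl -X POST \
#     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#     -H "Content-Type: application/json; charset=utf-8" \
#     -d @request.json \
#     "https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters"
```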


You should receive a JSON response similar to the following:

{
  "projectId":"PROJECT",
  "clusterName":"DP_CLUSTER",
  "status":{
    "state":"RUNNING",
    "stateStartTime":"2022-04-01T19:16:39.865716Z"
  },
  "clusterUuid":"98060b77-...",
  "statusHistory":[
    {
      "state":"CREATING",
      "stateStartTime":"2022-04-01T19:14:27.340544Z"
    }
  ],
  "labels":{
    "goog-dataproc-cluster-name":"DP_CLUSTER",
    "goog-dataproc-cluster-uuid":"98060b77-...",
    "goog-dataproc-location":"REGION",
    "goog-dataproc-environment":"prod"
  },
  "virtualClusterConfig":{
    "stagingBucket":"BUCKET",
    "kubernetesClusterConfig":{
      "kubernetesNamespace":"dp-cluster",
      "gkeClusterConfig":{
        "gkeClusterTarget":"projects/PROJECT/locations/REGION/clusters/GKE_CLUSTER",
        "nodePoolTarget":[
          {
            "nodePool":"projects/PROJECT/locations/REGION/clusters/GKE_CLUSTER/nodePools/NODE_POOL",
            "roles":[
              "DEFAULT"
            ]
          }
        ]
      },
      "kubernetesSoftwareConfig":{
        "componentVersion":{
          "SPARK":"3.1-..."
        },
        "properties":{
          "dpgke:dpgke.unstable.outputOnly.endpoints.sparkHistoryServer":"https://...",
          "spark:spark.eventLog.dir":"gs://BUCKET/.../spark-job-history",
          "spark:spark.eventLog.enabled":"true"
        }
      }
    },
    "auxiliaryServicesConfig":{
      "sparkHistoryServerConfig":{
        "dataprocCluster":"projects/PROJECT/regions/REGION/clusters/PHS_CLUSTER"
      }
    }
  }
}

Submit a Spark job

After your Managed Service for Apache Spark on GKE virtual cluster is running, submit a Spark job using the Google Cloud console, gcloud CLI, or the Managed Service for Apache Spark jobs.submit API (by using direct HTTP requests or the Cloud Client Libraries).

gcloud CLI Spark job example:

gcloud dataproc jobs submit spark \
    --region=${REGION} \
    --cluster=${DP_CLUSTER} \
    --class=org.apache.spark.examples.SparkPi \
    --jars=local:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000

gcloud CLI PySpark job example:

gcloud dataproc jobs submit pyspark \
    --region=${REGION} \
    --cluster=${DP_CLUSTER} \
    local:///usr/lib/spark/examples/src/main/python/pi.py \
    -- 10

gcloud CLI SparkR job example:

gcloud dataproc jobs submit spark-r \
    --region=${REGION} \
    --cluster=${DP_CLUSTER} \
    local:///usr/lib/spark/examples/src/main/r/dataframe.R
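After submission, you can monitor jobs from the CLI. The following is a sketch with hypothetical region and cluster names; the gcloud commands are commented out because they require credentials and a running cluster.

```shell
# Hypothetical values; replace with your own.
REGION="us-central1"
DP_CLUSTER="dp-gke-cluster"

# List jobs submitted to the virtual cluster:
# gcloud dataproc jobs list --region="${REGION}" --cluster="${DP_CLUSTER}"

# Stream the driver output of a specific job until it finishes:
# gcloud dataproc jobs wait JOB_ID --region="${REGION}"
echo "monitoring jobs on ${DP_CLUSTER} in ${REGION}"
```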

Clean up