Quickstart: Deploy a Slurm cluster on GKE

This document explains how to deploy and configure a basic Slurm cluster on Google Kubernetes Engine (GKE) by using the open-source Slurm Helm chart and the Slurm Operator add-on for GKE. The setup includes a Slurm controller (slurmctld), a REST API (slurmrestd), a login node for user access, and a single worker node (slurmd) managed by the Slurm Operator add-on for GKE.

This document is for data administrators, operators, and developers who want to enable and configure a Slurm cluster on GKE.

Before reading this document, ensure that you're familiar with the Slurm Operator add-on for GKE.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
  • Ensure you have already generated an SSH key pair. This key pair is required only if you want to set up OS Login.

  • Ensure you have a running GKE cluster with Slurm Operator enabled. If not, create one:

    gcloud container clusters create CLUSTER_NAME \
        --cluster-version=VERSION \
        --location=LOCATION \
        --project=PROJECT_ID \
        --addons=SlurmOperator
    

    Replace the following:

    • CLUSTER_NAME: the name of the new cluster.
    • VERSION: the GKE version, which must be 1.35.2-gke.1842000 or later. You can also use the --release-channel option to select a release channel. The release channel must have a default version of 1.35.2-gke.1842000 or later.
    • LOCATION: the location of the cluster.
    • PROJECT_ID: the ID of the project.

    For more information, see Enable the Slurm Operator add-on for GKE.
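
The minimum version requirement above can be checked before you create or reuse a cluster. The following is a minimal sketch, assuming version strings always have the form MAJOR.MINOR.PATCH-gke.BUILD. The name gke_version_at_least is a hypothetical helper, not a gcloud command, and the candidate value is an example only.

```shell
# Sketch: compare a GKE version string against a minimum version.
# gke_version_at_least is a hypothetical helper, not part of gcloud.

# Succeeds (exit status 0) when version $1 is at least version $2.
gke_version_at_least() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    # Split on "." and "-gke." so each version becomes four numeric fields.
    split(have, h, /-gke\.|\./)
    split(want, w, /-gke\.|\./)
    for (i = 1; i <= 4; i++) {
      if (h[i] + 0 > w[i] + 0) exit 0
      if (h[i] + 0 < w[i] + 0) exit 1
    }
    exit 0
  }'
}

MIN_VERSION="1.35.2-gke.1842000"
candidate="1.35.3-gke.1000000"   # example value only

if gke_version_at_least "$candidate" "$MIN_VERSION"; then
  echo "$candidate meets the minimum $MIN_VERSION"
else
  echo "$candidate is older than $MIN_VERSION"
fi
```

To check an existing cluster, you could pass the value printed by gcloud container clusters describe CLUSTER_NAME --location=LOCATION --format="value(currentMasterVersion)" as the first argument.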

(Optional) Configure OS Login

OS Login simplifies SSH access management by linking your Linux user account to your IAM identity. This configuration lets you manage access to Slurm nodes by using IAM permissions.

  1. Grant necessary IAM roles. Ensure your user account has the necessary IAM roles in the project:

    • roles/compute.osLogin: lets you manage your own OS Login profile.
    • roles/compute.instanceAdmin.v1: provides permissions to manage compute instances.
    • roles/iam.serviceAccountUser: lets users act as a service account, which is often needed for node operations.

    For more information about the required roles, see the guide to set up OS Login.

  2. Add your SSH key to OS Login by uploading your public SSH key:

    gcloud compute os-login ssh-keys add --key-file=PATH_TO_PUBLIC_KEY --project=PROJECT_ID
    

    Alternatively, you can add a key that's loaded in your ssh-agent:

    gcloud compute os-login ssh-keys add --key="$(ssh-add -L | grep publickey | head -n 1)" --project=PROJECT_ID
    
  3. Enable OS Login in your project metadata:

    gcloud compute project-info add-metadata --metadata enable-oslogin=TRUE --project=PROJECT_ID
    
Best practice:

To manage OS Login across multiple projects in an organization, consider enforcing it with the compute.requireOsLogin Organization Policy Service constraint. For more information, see Enable and configure OS Login in GKE.
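
Step 1 lists the roles but not a command for granting them. The following sketch prints, without running, one gcloud projects add-iam-policy-binding command per role so that you can review each command first. The helper name print_oslogin_role_bindings and the USER_EMAIL placeholder are assumptions made for illustration. Whether you are allowed to grant these roles yourself depends on your own permissions in the project.

```shell
# Sketch: print (do not run) one role-binding command per role from step 1.
# print_oslogin_role_bindings is a hypothetical helper name.
print_oslogin_role_bindings() {
  project_id="$1"
  member="$2"   # for example user:alice@example.com
  for role in \
    roles/compute.osLogin \
    roles/compute.instanceAdmin.v1 \
    roles/iam.serviceAccountUser
  do
    printf 'gcloud projects add-iam-policy-binding %s --member=%s --role=%s\n' \
      "$project_id" "$member" "$role"
  done
}

# PROJECT_ID and USER_EMAIL are placeholders to replace with your own values.
print_oslogin_role_bindings PROJECT_ID user:USER_EMAIL
```

After you review the printed commands, you can paste them into your shell one at a time.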

(Optional) Add a compute node pool

If you want to run Slurm compute workloads on separate nodes, you can create a dedicated node pool for them.

gcloud container node-pools create NODE_POOL_NAME \
    --cluster=CLUSTER_NAME \
    --location=LOCATION \
    --machine-type=MACHINE_TYPE \
    --num-nodes=NUM_NODES \
    --node-taints=slurm-worker=true:NoSchedule

Replace the following:

  • NODE_POOL_NAME: the name of the new node pool.
  • CLUSTER_NAME: the name of your cluster.
  • LOCATION: the location of your cluster.
  • MACHINE_TYPE: the machine type for the nodes (for example, n2-standard-4).
  • NUM_NODES: the number of nodes in the node pool.
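
The --node-taints value has the form key=value:effect, and only Pods with a matching toleration can be scheduled onto the tainted nodes. As a rough illustration of how the two relate, the following sketch splits a taint string and prints the corresponding toleration entry. The helper name taint_to_toleration is hypothetical.

```shell
# Sketch: derive a Kubernetes toleration entry from a key=value:effect taint.
# taint_to_toleration is a hypothetical helper name.
taint_to_toleration() {
  taint="$1"
  key="${taint%%=*}"     # text before the first "="
  rest="${taint#*=}"     # text after the first "="
  value="${rest%%:*}"    # text before the ":"
  effect="${rest#*:}"    # text after the ":"
  printf '%s\n' \
    "- key: \"$key\"" \
    "  operator: \"Equal\"" \
    "  value: \"$value\"" \
    "  effect: \"$effect\""
}

taint_to_toleration "slurm-worker=true:NoSchedule"
```

The printed entry has the same fields as the tolerations list in the podSpec block of the values.yaml file used later in this document.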

Deploy Slurm using Helm

This section guides you through deploying the Slurm cluster components by using the Slurm Helm chart. The Helm chart deploys slurmctld, slurmrestd, and slurmd components within the GKE cluster.

  1. Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials CLUSTER_NAME \
        --location=LOCATION
    

    Replace CLUSTER_NAME with your cluster name and LOCATION with the location of your cluster.

  2. Verify that you are running Helm 3.8.0 or later:

    helm version
    

    The output is similar to the following:

    version.BuildInfo{Version:"v3.17.3", GitCommit:"e4da49785aa6e6ee2b86efd5dd9e43400318262b", GitTreeState:"clean", GoVersion:"go1.23.7"}
    

    If needed, you can install Helm by following the official Helm documentation.

  3. Find an available image tag:

    1. In the Google Cloud console, go to the Artifact Registry repository page that includes the slinky/slurmd package.

    2. Note one of the image tag values, for example 25.11-ubuntu24.04-gke.4. You use this tag for the IMAGE_TAG placeholder in the following configuration file.

  4. Save the following configuration to a new file named values.yaml:

    controller:
        slurmctld:
            image:
                repository: gcr.io/gke-release/slinky/slurmctld
                tag: IMAGE_TAG
        reconfigure:
            image:
                repository: gcr.io/gke-release/slinky/slurmctld
                tag: IMAGE_TAG
    
    restapi:
        replicas: 1
        slurmrestd:
            image:
                repository: gcr.io/gke-release/slinky/slurmrestd
                tag: IMAGE_TAG
    
    nodesets:
        slinky:
            replicas: 1
            slurmd:
                image:
                    repository: gcr.io/gke-release/slinky/slurmd
                    tag: IMAGE_TAG
    
            # The podSpec block is optional and only required when using
            # a dedicated node pool for compute nodes.
            podSpec:
                nodeSelector:
                    cloud.google.com/gke-nodepool: NODE_POOL_NAME
                tolerations:
                    - key: "slurm-worker"
                      operator: "Equal"
                      value: "true"
                      effect: "NoSchedule"
    
    loginsets:
        slinky:
            enabled: true
            replicas: 1
            login:
                image:
                    repository: gcr.io/gke-release/slinky/login
                    tag: IMAGE_TAG
    

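    Replace IMAGE_TAG with the tag that you noted in the previous step, for example 25.11-ubuntu24.04-gke.4. If you keep the optional podSpec block, also replace NODE_POOL_NAME with the name of your dedicated node pool; otherwise, remove the block.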

  5. Install the Slurm Helm chart by using the values.yaml file:

    helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
        --namespace=slurm --create-namespace --version 1.0.2 -f values.yaml
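
The Helm version check in step 2 and the IMAGE_TAG substitution in step 4 can both be scripted. The following is a minimal sketch under stated assumptions: helm_version_ok and render_values are hypothetical helper names, the tag is assumed to contain no "/" or "&" characters, and values.rendered.yaml is an illustrative output file name.

```shell
# Sketch: two preflight helpers to run before `helm install`.
# Both function names are hypothetical, not Helm commands.

# Succeeds when a Helm version such as "v3.17.3" or "v3.17.3+ge4da497"
# is 3.8.0 or later.
helm_version_ok() {
  awk -v v="$1" 'BEGIN {
    sub(/^v/, "", v)        # drop the leading "v"
    sub(/[+-].*$/, "", v)   # drop build metadata or a pre-release suffix
    split(v, p, ".")
    if (p[1] + 0 > 3) exit 0
    if (p[1] + 0 < 3) exit 1
    if (p[2] + 0 >= 8) exit 0
    exit 1
  }'
}

# Writes the values file $1 to stdout with IMAGE_TAG replaced by the tag $2.
render_values() {
  sed "s/IMAGE_TAG/$2/g" "$1"
}

# Run the checks only when helm and values.yaml are actually present.
if command -v helm >/dev/null 2>&1; then
  helm_version_ok "$(helm version --short)" || echo "Helm is older than 3.8.0" >&2
fi
if [ -f values.yaml ]; then
  render_values values.yaml 25.11-ubuntu24.04-gke.4 > values.rendered.yaml
  if grep -q 'IMAGE_TAG' values.rendered.yaml; then
    echo "values.rendered.yaml still contains IMAGE_TAG" >&2
  fi
fi
```

If you render to a separate file like this, pass that file to the -f option of helm install instead of values.yaml.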
    

Verify the Slurm installation

You can verify that Slurm is deployed on the cluster by using kubectl.

  1. Check Pod status:

    kubectl get pods --namespace slurm
    

    The output should be similar to the following, and show the Running status for all Pods:

    NAME                                  READY   STATUS    RESTARTS   AGE
    slurm-controller-0                    3/3     Running   0          60s
    slurm-login-slinky-5d79cd755c-mf62z   1/1     Running   0          60s
    slurm-restapi-6b4ccb479f-njlp9        1/1     Running   0          60s
    slurm-worker-slinky-0                 2/2     Running   0          60s
    
  2. To see the registered nodes, execute the sinfo command on the login node:

    kubectl exec -it deployment/slurm-login-slinky -n slurm -- sinfo
    

    The output should list the slinky partition and the worker node.
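
Reading the Pod table by eye works for four Pods, but the check can also be automated. The following sketch filters kubectl get pods output for Pods that are not Running or whose READY counts do not match. The helper name pods_not_ready is hypothetical, and the sample table is adapted from the output shown above with two rows changed for illustration.

```shell
# Sketch: print the names of Pods that are not fully ready.
# Reads `kubectl get pods` table output on stdin.
# pods_not_ready is a hypothetical helper name.
pods_not_ready() {
  awk 'NR > 1 {
    split($2, r, "/")   # READY column, for example 3/3
    if ($3 != "Running" || r[1] != r[2]) print $1
  }'
}

# Illustrative input; on a real cluster, pipe in the kubectl output instead:
#   kubectl get pods --namespace slurm | pods_not_ready
sample='NAME                                  READY   STATUS    RESTARTS   AGE
slurm-controller-0                    3/3     Running   0          60s
slurm-login-slinky-5d79cd755c-mf62z   0/1     Pending   0          60s
slurm-restapi-6b4ccb479f-njlp9        1/1     Running   0          60s
slurm-worker-slinky-0                 1/2     Running   0          60s'

printf '%s\n' "$sample" | pods_not_ready
```

An empty result means that every Pod in the table is Running and fully ready. As an alternative, kubectl wait --for=condition=Ready pod --all --namespace slurm --timeout=300s blocks until the Pods are ready or the timeout expires.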

Run a Slurm job

  1. To run a job, you need to access the Slurm login node. How you access the login node depends on whether you configured OS Login.

    1. If you configured OS Login, access the login node by using SSH. To do this, get the external IP address of the slurm-login-slinky Service:

      kubectl get service --namespace slurm slurm-login-slinky
      

      The output looks like this:

      NAME                 TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)        AGE
      slurm-login-slinky   LoadBalancer   10.X.X.X        X.X.X.X        22:30171/TCP   5m
      

      Copy the value of the EXTERNAL-IP column, and then connect to the login node by using SSH:

      ssh OSLOGIN_USERNAME@EXTERNAL_IP
      

      Replace the following:

      • EXTERNAL_IP: the IP address obtained in the previous step.
      • OSLOGIN_USERNAME: your OS Login username.
    2. If you did not configure OS Login, you can still access the login node by using the kubectl exec command:

      kubectl exec -it deployment/slurm-login-slinky -n slurm -- bash
      
  2. Run an interactive job. After you're on the login node, you can run a command on a compute node by using the srun command-line utility:

    srun hostname
    

    The output includes the hostname of the slurm-worker-slinky-0 Pod.
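
Besides interactive srun commands, Slurm also accepts batch scripts through sbatch. The following sketch writes a minimal batch script that wraps the same srun hostname command and submits it only when sbatch is available. The file name hello.sbatch and the job name are illustrative assumptions. Where the output file ends up depends on whether the login and worker Pods share storage, which this document does not describe.

```shell
# Sketch: write a minimal Slurm batch script and submit it if sbatch exists.
job_file="hello.sbatch"   # hypothetical file name

cat > "$job_file" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --output=hello-%j.out

# Runs on the allocated compute node; %j above expands to the job ID.
srun hostname
EOF

if command -v sbatch >/dev/null 2>&1; then
  sbatch "$job_file"
else
  echo "sbatch not found; run this on the Slurm login node" >&2
fi
```

After you submit the job, the squeue command shows whether it is still pending or running.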

Clean up

To avoid incurring charges, clean up the resources created in this document.

  1. Uninstall the Helm deployment: This command removes all Kubernetes resources deployed by the Helm chart.

    helm uninstall slurm --namespace slurm
    
  2. Delete the Slurm namespace:

    kubectl delete namespace slurm
    
  3. Delete the GKE cluster:

    gcloud container clusters delete CLUSTER_NAME \
        --location=LOCATION
    

    Replace the following:

    • CLUSTER_NAME: the name of the cluster to delete.
    • LOCATION: the location of the cluster.

What's next