This document shows you how to configure Google Cloud Managed Lustre shared storage for Slurm jobs on Google Kubernetes Engine (GKE). Shared storage is essential for Slurm clusters to help ensure that the Slurm login and worker nodes can access the same configuration files, scripts, and job data.
Before reading this document, ensure that you're familiar with the Slurm Operator add-on for GKE.
Before you begin
Before you start, make sure that you have performed the following tasks:
- Enable the Google Cloud Managed Lustre API and the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
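If you prefer the command line to the console for the first prerequisite, the APIs can also be enabled with the gcloud CLI. The following is a minimal sketch, not an official procedure: the helper name enable_lustre_apis is hypothetical, and the service name lustre.googleapis.com is an assumption that you should confirm for your project.

```shell
# Sketch: enable the Managed Lustre and GKE APIs for one project.
# Assumption: lustre.googleapis.com is the Managed Lustre service name.
# enable_lustre_apis is a hypothetical helper name, not part of gcloud.
enable_lustre_apis() {
  local project_id="$1"
  gcloud services enable lustre.googleapis.com container.googleapis.com \
    --project "${project_id}"
}
```

Call the helper as enable_lustre_apis PROJECT_ID, where PROJECT_ID is your Google Cloud project ID.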
Set up the cluster
To use shared storage with Slurm, you need a GKE cluster with the Slurm Operator add-on for GKE enabled. In this section, you complete the following configuration steps:
- Create a VPC network.
- Set up a GKE cluster in the same VPC network as your Managed Lustre instance. You also set up OS Login for secure access and consistent user identification across the cluster, which is useful when sharing the /home directory.
- Enable the Managed Lustre CSI driver on the cluster.
Complete the following steps to set up the cluster:
Configure the VPC network by completing the steps in the Set up a VPC network section.
Create a GKE cluster with the Slurm Operator add-on for GKE and OS Login enabled. To create the cluster, complete the steps in Deploy a Slurm cluster on GKE and use the --network flag to specify the VPC network that you created in the previous step.
Enable the Managed Lustre CSI driver on the cluster:
    gcloud container clusters update CLUSTER_NAME \
        --location CONTROL_PLANE_LOCATION \
        --project PROJECT_ID \
        --update-addons=LustreCsiDriver=ENABLED
Replace the following:
- CLUSTER_NAME: the name of the cluster.
- CONTROL_PLANE_LOCATION: the Compute Engine location of the control plane of your cluster. Provide a region for regional clusters, or a zone for zonal clusters.
- PROJECT_ID: your Google Cloud project ID.
When the Managed Lustre CSI driver is enabled, GKE automatically creates StorageClasses for provisioning Managed Lustre instances.
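To confirm that the StorageClasses exist after you enable the driver, you can list them and filter by name. This sketch assumes that the automatically created classes contain "lustre" in their names, as lustre-rwx-125mbps-per-tib does. The helper name list_lustre_storageclasses is hypothetical.

```shell
# Sketch: print the StorageClasses whose names mention Lustre.
# Assumption: the auto-created classes include "lustre" in their names.
list_lustre_storageclasses() {
  # -o name prints one "storageclass.storage.k8s.io/NAME" entry per line.
  kubectl get storageclass -o name | grep -i lustre
}
```

If the helper prints nothing, the driver might not be enabled yet or the cluster update might still be in progress.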
Configure Managed Lustre
In this section, you use the StorageClasses to dynamically provision a volume through a PersistentVolumeClaim (PVC).
Create the Slurm namespace:
    kubectl create namespace slurm
Create a manifest file named lustre-pvc.yaml to define the PVC:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: slurm-lustre-pvc
      namespace: slurm
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 72000Gi
      storageClassName: lustre-rwx-125mbps-per-tib
Apply the manifest:
    kubectl apply -f lustre-pvc.yaml
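After you apply the manifest, you can query the phase of the claim to see whether provisioning has finished. The following sketch uses the PVC name and namespace from the manifest as defaults. The helper name pvc_phase is hypothetical. The claim can report Pending for a while, and if the StorageClass uses WaitForFirstConsumer binding (which this document doesn't state), it stays Pending until a Pod mounts it.

```shell
# Sketch: print the current phase of a PVC (for example, Pending or Bound).
# Defaults match the manifest above; pvc_phase is a hypothetical helper name.
pvc_phase() {
  local pvc="${1:-slurm-lustre-pvc}"
  local namespace="${2:-slurm}"
  kubectl get pvc "${pvc}" --namespace "${namespace}" \
    -o jsonpath='{.status.phase}'
}
```

Running pvc_phase with no arguments checks slurm-lustre-pvc in the slurm namespace.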
Use shared storage in Slurm configurations
After creating your PVC, you can configure the Slurm Operator resources to mount the shared storage.
Configure Slurm login node and worker nodes
Find an available image tag:
In the Google Cloud console, go to the Artifact Registry repository page that includes the slinky/slurmd package.
Note one of the image tag values, for example 25.11-ubuntu24.04-gke.4. You use this tag in place of the IMAGE_TAG placeholder in the following configuration file.
Save the following configuration to a file named values.yaml:
    controller:
      slurmctld:
        image:
          repository: gcr.io/gke-release/slinky/slurmctld
          tag: IMAGE_TAG
      reconfigure:
        image:
          repository: gcr.io/gke-release/slinky/slurmctld
          tag: IMAGE_TAG
    restapi:
      replicas: 1
      slurmrestd:
        image:
          repository: gcr.io/gke-release/slinky/slurmrestd
          tag: IMAGE_TAG
    nodesets:
      slinky:
        replicas: 1
        slurmd:
          image:
            repository: gcr.io/gke-release/slinky/slurmd
            tag: IMAGE_TAG
          volumeMounts:
            - name: data-vol
              mountPath: /data
        podSpec:
          nodeSelector:
            cloud.google.com/gke-nodepool: NODE_POOL_NAME
          volumes:
            - name: data-vol
              persistentVolumeClaim:
                claimName: slurm-lustre-pvc
    loginsets:
      slinky:
        enabled: true
        replicas: 1
        login:
          image:
            repository: gcr.io/gke-release/slinky/login
            tag: IMAGE_TAG
          volumeMounts:
            - name: data-vol
              mountPath: /data
        podSpec:
          volumes:
            - name: data-vol
              persistentVolumeClaim:
                claimName: slurm-lustre-pvc
Replace the following:
- IMAGE_TAG: the tag that you noted in the previous step.
- NODE_POOL_NAME: the name of the node pool where you want to deploy the Slurm worker Pods.
Upgrade the Slurm chart by using the values.yaml file:
    helm upgrade slurm oci://ghcr.io/slinkyproject/charts/slurm \
        --version 1.0.2 \
        --namespace=slurm \
        -f values.yaml
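Before you run the upgrade, the IMAGE_TAG and NODE_POOL_NAME placeholders must be replaced in values.yaml. If you want to script that substitution instead of editing the file by hand, one possible approach uses sed. The helper name render_values and the output file name in the usage note are hypothetical.

```shell
# Sketch: fill in the two placeholders from a template and print the result.
# render_values is a hypothetical helper; it does not modify the template.
render_values() {
  local image_tag="$1"
  local node_pool="$2"
  local template="${3:-values.yaml}"
  # Use "|" as the sed delimiter so that tags containing "/" would not break it.
  sed -e "s|IMAGE_TAG|${image_tag}|g" \
      -e "s|NODE_POOL_NAME|${node_pool}|g" "${template}"
}
```

For example, render_values 25.11-ubuntu24.04-gke.4 NODE_POOL_NAME values.yaml > values.rendered.yaml writes a filled-in copy, with your own node pool name in place of NODE_POOL_NAME, which you can then pass to helm with the -f flag.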
Verify shared storage
To verify that shared storage is mounted correctly, follow these steps:
Check that the PVCs and PVs are bound:
    kubectl get pvc -n slurm
The output should show the status of all PVCs as Bound to their PVs.
Connect to the Slurm login node by completing the steps in Configure OS Login.
On the login node, check the mount paths:
    df -h /data
Check the mount paths on the worker nodes:
    srun -N 1 df -h /data
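The df checks confirm that /data is mounted on each node, but not that both nodes see the same files. As an optional extra check, you can write a marker file on the login node and read it back from a worker. This is a sketch to run on the login node. The helper name check_shared_write and the marker file name are hypothetical, and the sketch assumes that your user can write to /data.

```shell
# Sketch: write a file on the login node and read it back through srun.
# check_shared_write is a hypothetical helper; it returns 0 on success.
check_shared_write() {
  local mount_dir="${1:-/data}"
  local marker="${mount_dir}/shared-storage-check.$$"
  local seen
  # Write the marker from the node where this function runs.
  echo "written-by-login-node" > "${marker}" || return 1
  # Read the marker from one worker node allocated by Slurm.
  seen="$(srun -N 1 cat "${marker}")"
  rm -f "${marker}"
  [ "${seen}" = "written-by-login-node" ]
}
```

A zero exit status from check_shared_write suggests that the login node and the worker share the same file system at the given path.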
Clean up
Clean up the Slurm cluster and its resources by following the directions in the Clean up section of Deploy a Slurm cluster on GKE.
What's next
- Learn more about Managed Lustre on GKE.