Debug completed Ray Jobs with Ray History Server

Ray History Server lets you access the Ray Dashboard and its logs after a Ray cluster has been terminated.

This document describes how to configure and deploy the Ray History Server on Google Kubernetes Engine (GKE) clusters running Ray workloads. This document also explains how to access terminated RayCluster data using a local Ray Dashboard.

By default, the Ray Dashboard and its logs exist only while the Ray cluster is running. When running jobs on ephemeral Ray clusters, the debug data is lost as soon as the cluster terminates. Previously, preserving this data for debugging required keeping idle clusters running, which consumed unnecessary compute resources.

Ray History Server persists this data, which lets you terminate clusters immediately after a job finishes to optimize resource usage. Developers can continue to access the dashboard, review logs, and troubleshoot issues after the compute resources are released.

When configured, Ray History Server acts as the backend for the Ray Dashboard. For more information about using the dashboard, see Ray Dashboard.

Cost

Ray History Server uses Cloud Storage. For more information, see Cloud Storage pricing.

Ray History Server container images are stored in Artifact Registry. For more information, see Artifact Registry pricing.

Before you begin

Before you start, make sure that you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running the gcloud components update command. Earlier gcloud CLI versions might not support running the commands in this document.
  • Install and update helm.
  • Make sure your Cloud Storage bucket is created.

Requirements and limitations

Ray History Server requires KubeRay version v1.6 or later and uses Ray version v2.55.

This document assumes that you are familiar with the following concepts and operations:

Set up a GKE cluster

In this section, you set up necessary variables and the GKE cluster.

Configure environment variables

The following environment variables are used throughout this document:

export LOCATION="LOCATION"
export PROJECT_NAME="PROJECT_NAME"
export PROJECT_NUMBER="PROJECT_NUMBER"
export GKE_CLUSTER_NAME="GKE_CLUSTER_NAME"
export GCS_BUCKET="GCS_BUCKET"
export GCP_SA="GCP_SA"
export RAY_JOB="RAY_JOB"
export NAMESPACE="NAMESPACE"
  • LOCATION: Region or zone of the cluster
  • PROJECT_NAME: Google Cloud project name
  • PROJECT_NUMBER: Google Cloud project number
  • GKE_CLUSTER_NAME: Name of the GKE cluster
  • GCS_BUCKET: Name of Cloud Storage bucket
  • GCP_SA: Name of the service account
  • RAY_JOB: Name of the Ray Job
  • NAMESPACE: Namespace where Ray History Server lives in a GKE cluster

Create a GKE cluster

The GKE cluster must have Workload Identity Federation for GKE enabled to access Cloud Storage.

gcloud

Create a Standard cluster with Workload Identity Federation for GKE enabled.

gcloud container clusters create GKE_CLUSTER_NAME \
  --location=LOCATION \
  --workload-pool=PROJECT_NAME.svc.id.goog

Set the kubectl context to the GKE cluster:

gcloud container clusters get-credentials GKE_CLUSTER_NAME \
  --location=LOCATION
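As an optional sanity check, you can confirm that kubectl now targets the new cluster:

```shell
# Optional: verify that kubectl points at the new GKE cluster
kubectl config current-context
kubectl get nodes
```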

Configure storage with Workload Identity Federation and service accounts

Set up the necessary permissions for Cloud Storage bucket access. For more information, see Workload Identity Federation.

Cloud Storage stores the logs and events emitted by Ray, which Ray History Server uses to reconstruct the Ray Dashboard. The Cloud Storage bucket must be created with the following setting:

  --uniform-bucket-level-access
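If you haven't created the bucket yet, the following command is one way to create it with uniform bucket-level access enabled, using the LOCATION and GCS_BUCKET values defined earlier:

```shell
# Create the bucket with uniform bucket-level access
gcloud storage buckets create gs://GCS_BUCKET \
  --location=LOCATION \
  --uniform-bucket-level-access
```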

To create the Kubernetes service account, run the following command:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ray-history-server
  namespace: NAMESPACE
automountServiceAccountToken: true
EOF

Bind the roles/storage.objectUser role to the Kubernetes service account:

gcloud storage buckets add-iam-policy-binding gs://GCS_BUCKET \
  --member "principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_NAME.svc.id.goog/subject/ns/NAMESPACE/sa/ray-history-server" \
  --role "roles/storage.objectUser"
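To confirm that the binding took effect, you can inspect the bucket's IAM policy (an optional check):

```shell
# Optional: look for roles/storage.objectUser bound to the
# ray-history-server principal in the output
gcloud storage buckets get-iam-policy gs://GCS_BUCKET \
  --format="json(bindings)"
```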

Build Ray History Server images

To build the custom image, follow the steps outlined in the KubeRay documentation.

Install KubeRay

Add and update the KubeRay repository:

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

To install the KubeRay operator, run the helm install command:

helm install kuberay-operator kuberay/kuberay-operator
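You can optionally confirm that the operator deployed successfully. The `app.kubernetes.io/name` label below is an assumption based on common Helm chart conventions; adjust it if your chart version labels the Pod differently:

```shell
# Optional: verify the Helm release and the operator Pod
helm status kuberay-operator
kubectl get pods -l app.kubernetes.io/name=kuberay-operator
```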

Set up Ray History Server

Configure RBAC roles for Ray History Server

Prepare necessary cluster roles and RBAC roles for the Ray History Server components:

kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: raycluster-reader
rules:
- apiGroups: ["ray.io"]
  resources: ["rayclusters"]
  verbs: ["list", "get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: historyserver
subjects:
- kind: ServiceAccount
  name: ray-history-server
  namespace: NAMESPACE
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: raycluster-reader
EOF
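You can optionally confirm that the binding grants the expected access by impersonating the service account:

```shell
# Prints "yes" when the ClusterRoleBinding is in place
kubectl auth can-i list rayclusters.ray.io \
  --as=system:serviceaccount:NAMESPACE:ray-history-server
```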

Deploy Ray History Server

Create a YAML file named HISTORY_SERVER_FILE_NAME with the following manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
 name: historyserver-demo
 labels:
   app: historyserver
spec:
 replicas: 1
 selector:
   matchLabels:
     app: historyserver
 template:
   metadata:
     labels:
       app: historyserver
   spec:
     serviceAccountName: ray-history-server
     containers:
     - name: historyserver
       env:
         - name: GCS_BUCKET
           value: "GCS_BUCKET"
       image: historyserver:v0.1.0
       imagePullPolicy: IfNotPresent
       command:
       - historyserver
       - --runtime-class-name=gcs
       - --ray-root-dir=log
       ports:
       - containerPort: 8080
       resources:
         limits:
           cpu: "500m"

Apply the HISTORY_SERVER_FILE_NAME using kubectl:

kubectl apply -f HISTORY_SERVER_FILE_NAME
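To check that the Deployment rolled out successfully, you can wait for the Pod to become ready and then inspect its logs:

```shell
# Wait for the history server Pod to become ready
kubectl rollout status deployment/historyserver-demo
# Optional: inspect recent history server logs
kubectl logs deployment/historyserver-demo --tail=20
```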

Add a service manifest SERVICE_FILE_NAME for Ray History Server:

apiVersion: v1
kind: Service
metadata:
 name: historyserver
 labels:
   app: historyserver
spec:
 selector:
   app: historyserver
 ports:
 - protocol: TCP
   name: http
   port: 30080
   targetPort: 8080
 type: ClusterIP

Apply the service manifest using kubectl:

kubectl apply -f SERVICE_FILE_NAME
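As an optional check, verify that the Service resolves to the history server Pod:

```shell
# The endpoints list should show the Pod IP behind targetPort 8080
kubectl get svc historyserver
kubectl get endpoints historyserver
```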

Deploy a Ray Job with an ephemeral Ray Cluster

The collector component of Ray History Server runs on each RayCluster Pod, collecting the necessary logs and events and exporting them to Cloud Storage.

Add the following environment variables required by Ray History Server:

  • RAY_enable_core_worker_ray_event_to_aggregator and RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR enable the Ray event export API.
  • RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES lists the types of events for the Ray History Server to collect.
  • GCS_BUCKET tells the collector which Cloud Storage bucket to use.

Note: The RayCluster commands are used for setting up Ray History Server and retrieving the node_id as part of the collector container setup. The commands also help ensure that the logs are saved during restart or termination.

  • The role field tells the collector which Ray node the collector belongs to.
  • The runtime-class-name field determines the storage client.
  • The ray-cluster-name field defines the name of the RayCluster.
  • The ray-root-dir field tells Ray History Server the root directory of the logs.
  • The events-port field tells Ray History Server which port the events come from.

The following snippet shows an example RayJob manifest:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: RAY_JOB
  namespace: NAMESPACE
spec:
  entrypoint: "python -c 'import ray; ray.init(); print(ray.cluster_resources())'"
  shutdownAfterJobFinishes: true
  rayClusterSpec:
    headGroupSpec:
      rayStartParams:
        dashboard-host: 0.0.0.0
      template:
        spec:
          serviceAccountName: ray-history-server
          containers:
          - name: ray-head
            image: rayproject/ray:2.53.0
            env:
            - name: RAY_enable_ray_event
              value: "true"
            - name: RAY_enable_core_worker_ray_event_to_aggregator
              value: "true"
            - name: RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR
              value: "http://localhost:8084/v1/events"
            - name: RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES
              value: "TASK_DEFINITION_EVENT,TASK_LIFECYCLE_EVENT,ACTOR_TASK_DEFINITION_EVENT,TASK_PROFILE_EVENT,DRIVER_JOB_DEFINITION_EVENT,DRIVER_JOB_LIFECYCLE_EVENT,ACTOR_DEFINITION_EVENT,ACTOR_LIFECYCLE_EVENT,NODE_DEFINITION_EVENT,NODE_LIFECYCLE_EVENT"
            command:
            - /bin/sh
            - -c
            - 'echo "=========================================="; [ -d "/tmp/ray/session_latest" ] && dest="/tmp/ray/prev-logs/$(basename $(readlink /tmp/ray/session_latest))/$(cat /tmp/ray/raylet_node_id)" && echo "dst is $dest" && mkdir -p "$dest" && mv /tmp/ray/session_latest/logs "$dest/logs"; echo "========================================="'
            # This hook retrieves and persists the node_id for the collector
            lifecycle:
              postStart:
                exec:
                  command:
                  - /bin/sh
                  - -lc
                  - --
                  - |
                    GetNodeId(){
                      while true;
                      do
                        nodeid=$(ps -ef | grep raylet | grep node_id | grep -v grep | grep -oP '(?<=--node_id=)[^ ]*')
                        if [ -n "$nodeid" ]; then
                          echo "$(date) raylet started: ${nodeid}" >> /tmp/ray/init.log
                          echo $nodeid > /tmp/ray/raylet_node_id
                          break
                        else
                          sleep 1
                        fi
                      done
                    }
                    GetNodeId
            volumeMounts:
            - name: ray-dir
              mountPath: /tmp/ray
          - name: collector
            image: COLLECTOR_IMAGE
            env:
            - name: GCS_BUCKET
              value: "GCS_BUCKET"
            command:
            - collector
            - --role=Head
            - --runtime-class-name=gcs
            - --ray-cluster-name=RAY_JOB
            - --ray-root-dir=log
            - --events-port=8084
            volumeMounts:
            - name: ray-dir
              mountPath: /tmp/ray
          volumes:
          - name: ray-dir
            emptyDir: {}
    workerGroupSpecs:
    - groupName: cpu
      replicas: 1
      template:
        spec:
          serviceAccountName: ray-history-server
          containers:
          - name: ray-worker
            image: rayproject/ray:2.53.0
            env:
            - name: RAY_enable_ray_event
              value: "true"
            - name: RAY_enable_core_worker_ray_event_to_aggregator
              value: "true"
            - name: RAY_DASHBOARD_AGGREGATOR_AGENT_EVENTS_EXPORT_ADDR
              value: "http://localhost:8084/v1/events"
            - name: RAY_DASHBOARD_AGGREGATOR_AGENT_PUBLISHER_HTTP_ENDPOINT_EXPOSABLE_EVENT_TYPES
              value: "TASK_DEFINITION_EVENT,TASK_LIFECYCLE_EVENT,ACTOR_TASK_DEFINITION_EVENT,TASK_PROFILE_EVENT,DRIVER_JOB_DEFINITION_EVENT,DRIVER_JOB_LIFECYCLE_EVENT,ACTOR_DEFINITION_EVENT,ACTOR_LIFECYCLE_EVENT,NODE_DEFINITION_EVENT,NODE_LIFECYCLE_EVENT"
            command:
            - /bin/sh
            - -c
            - 'echo "=========================================="; [ -d "/tmp/ray/session_latest" ] && dest="/tmp/ray/prev-logs/$(basename $(readlink /tmp/ray/session_latest))/$(cat /tmp/ray/raylet_node_id)" && echo "dst is $dest" && mkdir -p "$dest" && mv /tmp/ray/session_latest/logs "$dest/logs"; echo "========================================="'
            lifecycle:
              postStart:
                exec:
                  command:
                  - /bin/sh
                  - -lc
                  - --
                  - |
                    GetNodeId(){
                      while true;
                      do
                        nodeid=$(ps -ef | grep raylet | grep node_id | grep -v grep | grep -oP '(?<=--node_id=)[^ ]*')
                        if [ -n "$nodeid" ]; then
                          echo $nodeid > /tmp/ray/raylet_node_id
                          break
                        else
                          sleep 1
                        fi
                      done
                    }
                    GetNodeId
            volumeMounts:
            - name: ray-dir
              mountPath: /tmp/ray
          - name: collector
            image: COLLECTOR_IMAGE
            env:
            - name: GCS_BUCKET
              value: "GCS_BUCKET"
            command:
            - collector
            - --role=Worker
            - --runtime-class-name=gcs
            - --ray-cluster-name=RAY_JOB
            - --ray-root-dir=log
            - --events-port=8084
            volumeMounts:
            - name: ray-dir
              mountPath: /tmp/ray
          volumes:
          - name: ray-dir
            emptyDir: {}
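To submit the job, save the manifest to a file (RAYJOB_FILE_NAME is a placeholder of your choosing), apply it, and watch the job until it finishes:

```shell
kubectl apply -f RAYJOB_FILE_NAME
# Watch the RayJob status until the job completes and the cluster terminates
kubectl get rayjob RAY_JOB -n NAMESPACE -w
```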

Access terminated RayClusters by using the local Ray Dashboard

Port-forward the historyserver service so that it can be accessed by the local Ray Dashboard:

kubectl port-forward svc/historyserver 8080:30080

Start the local Ray Dashboard

Install Ray locally. Make sure you use the nightly version.

pip uninstall -y ray
pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp313-cp313-manylinux2014_x86_64.whl"

For more information about the latest nightly versions, see Ray nightlies.

To start the Ray Dashboard, run the ray start command:

ray start --head --num-cpus=1 --proxy-server-url=http://localhost:8080

Configure RayCluster for the Ray Dashboard

The Ray Dashboard uses cookies to determine which RayCluster to display, so you must find and set the cookies for the cluster that you want to inspect.

To select a historical cluster, first get the list of all Ray clusters and their sessions.

In your browser, list your Ray cluster sessions by navigating to the following URL:

http://localhost:8265/clusters

The endpoint call result should look something like the following:

[
 {
  "name": "ratjob",
  "namespace": "default",
  "sessionName": "session_2026-03-20_10-50-19_089740_1",
  "createTime": "2026-03-20T10:50:19Z",
  "createTimeStamp": 1774003819
 },
 {
  "name": "ray-cluster-hs",
  "namespace": "default",
  "sessionName": "session_2026-03-18_17-11-25_410478_1",
  "createTime": "2026-03-18T17:11:25Z",
  "createTimeStamp": 1773853885
 },
 {
  "name": "raycluster-historyserver",
  "namespace": "default",
  "sessionName": "session_2026-02-20_13-03-16_320452_1",
  "createTime": "2026-02-20T13:03:16Z",
  "createTimeStamp": 1771592596
 }
]

Copy a Ray cluster session and navigate to this endpoint in the browser:

http://localhost:8265/enter_cluster/default/raycluster-historyserver/SESSION_ID

The cookies are set when the endpoint loads.
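If you prefer the command line, the following is a sketch of the same flow using curl with a cookie jar (the cookies.txt filename is arbitrary):

```shell
# Store the session cookies, then reuse them for subsequent requests
curl -c cookies.txt \
  "http://localhost:8265/enter_cluster/default/raycluster-historyserver/SESSION_ID"
curl -b cookies.txt "http://localhost:8265/clusters"
```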

A successful request produces output like the following:

{
 "name": "ratjob",
 "namespace": "default",
 "result": "success",
 "session": "session_2026-03-20_10-41-12_950419_1"
}

You can then access the dashboard and its logs at the following endpoint:

http://localhost:8265

Using the RayJob example, the Ray Dashboard looks like the following examples.

Ray job status page using Ray History Server as a backend.

Ray job log page using Ray History Server as a backend.

What's next