"Managed Service for Apache Spark" is the new name for the product formerly known as "Dataproc on Compute Engine" (cluster deployment) and "Google Cloud Serverless for Apache Spark" (serverless deployment).
Hadoop data storage
Managed Service for Apache Spark integrates with Apache Hadoop and the Hadoop Distributed
File System (HDFS). The following features and considerations can be important
when selecting compute and data storage options for Managed Service for Apache Spark
clusters and jobs:
HDFS with Cloud Storage:
Managed Service for Apache Spark uses HDFS for storage. It also automatically
installs the HDFS-compatible Cloud Storage connector, which lets you use
Cloud Storage in parallel with HDFS. You can move data in and out of a
cluster by uploading it to or downloading it from HDFS or Cloud Storage.
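As a sketch, data can be moved between HDFS and Cloud Storage from a cluster node with standard Hadoop tools; the file names, HDFS paths, and the `gs://example-bucket` bucket below are hypothetical:

```shell
# Upload a local file into the cluster's HDFS (paths are hypothetical).
hdfs dfs -put sales.csv /data/sales.csv

# Copy the same file to Cloud Storage through the Cloud Storage connector.
hdfs dfs -cp /data/sales.csv gs://example-bucket/data/sales.csv

# Bulk-copy an HDFS directory to Cloud Storage with DistCp.
hadoop distcp hdfs:///data gs://example-bucket/data
```

Because the connector registers the `gs://` scheme with Hadoop, the same commands work in either direction, so downloading from Cloud Storage into HDFS only reverses the source and destination arguments.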
VM disks:
By default, when no local SSDs are attached, HDFS data and intermediate
shuffle data are stored on VM boot disks, which are
Persistent Disks.
If you use local SSDs,
HDFS data and intermediate shuffle data are stored on the SSDs.
Persistent disk (PD) size and type affect performance and VM size, whether you use HDFS or Cloud Storage
for data storage.
VM boot disks are deleted when the cluster is deleted.
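As a minimal sketch, worker disk size, disk type, and local SSDs can be set when you create a cluster with `gcloud`; the cluster name, region, and values shown here are hypothetical examples, not recommendations:

```shell
# Create a cluster whose workers use 1000 GB pd-ssd boot disks
# plus two local SSDs for HDFS and shuffle data.
# "example-cluster" and the region are hypothetical.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --worker-boot-disk-size=1000GB \
    --worker-boot-disk-type=pd-ssd \
    --num-worker-local-ssds=2
```

With `--num-worker-local-ssds` set, HDFS and intermediate shuffle data land on the local SSDs instead of the boot disks, as described above.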