This document explains how to set up and configure Apache Spark and Apache Hive to use the Lakehouse runtime catalog. You learn how to create a Lakehouse for Apache Iceberg Hive catalog, configure your Spark sessions to connect to the metastore, and run workloads to create tables that you can query directly in BigQuery.
Before you begin
- Read About Hive Catalogs in Lakehouse runtime catalog to understand how Spark connects to the Lakehouse runtime catalog.
- Review Supported storage formats and data types.
- Review Limitations and considerations.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
-
Verify that billing is enabled for your Google Cloud project.
Enable the Lakehouse for Apache Iceberg, Dataproc API APIs.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (
roles/serviceusage.serviceUsageAdmin), which contains theserviceusage.services.enablepermission. Learn how to grant roles.-
Verify that billing is enabled for your Google Cloud project.
Enable the Lakehouse for Apache Iceberg, Dataproc API APIs.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (
roles/serviceusage.serviceUsageAdmin), which contains theserviceusage.services.enablepermission. Learn how to grant roles.
Required roles
To get the permissions that you need to use Lakehouse runtime catalog, ask your administrator to grant you the following IAM roles on your project:
-
Create a Managed Service for Apache Spark cluster:
Dataproc Editor (
roles/dataproc.editor) -
Cluster service account:
- Dataproc Worker (
roles/dataproc.worker) - Storage Object User (
roles/storage.objectUser) - BigLake Editor (
roles/biglake.editor) - Service Usage Consumer (
roles/serviceusage.serviceUsageConsumer)
- Dataproc Worker (
-
Write access to all Lakehouse runtime catalog resources:
BigLake Editor (
roles/biglake.editor) -
Read-only access to all Lakehouse runtime catalog resources:
BigLake Viewer (
roles/biglake.viewer)
For more information about granting roles, see Manage access to projects, folders, and organizations.
You might also be able to get the required permissions through custom roles or other predefined roles.
For instructions, see Granting a single role.
General workflow
To use the Lakehouse runtime catalog with Spark and Hive, you follow this general workflow:
- Create a Lakehouse for Apache Iceberg Hive catalog.
- Configure your Spark session by using your preferred tool (such as Managed Service for Apache Spark or BigQuery Studio).
- Perform database and table operations within your Spark session.
- Submit batch workloads to Managed Service for Apache Spark and query the resulting tables directly from BigQuery.
Create a Lakehouse Hive catalog
To use the Lakehouse runtime catalog with Spark and Hive, you must first create a Hive catalog.
A Lakehouse Hive catalog is a collection of Hive databases. Before you run Spark jobs, create a catalog to register it with Lakehouse Metastore. The catalog has a name and a Cloud Storage location where the Hive data resides.
gcloud
To create a Hive catalog, run the following command:
gcloud alpha biglake hive catalogs create BIGLAKE_CATALOG_ID \
--project=PROJECT_ID
--location-uri="gs://GCS_WAREHOUSE_PATH" \
--primary_location=REGION \
--description="DESCRIPTION"
Replace the following:
BIGLAKE_CATALOG_ID: the Hive catalog name.GCS_WAREHOUSE_PATH: the Cloud Storage path that stores your Hive warehouse.PROJECT_ID: your Google Cloud project ID.REGION: the primary region of the metastore. For single-region buckets, it should match the bucket region. For dual-region or multi-region buckets, it should be one of the constituent regions and where the primary replica of the metastore is intended. The other region becomes the secondary replica.DESCRIPTION: a description of the catalog.
cURL
- To create a Hive catalog, run the following command:
curl -X POST -s -i -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-d '{"locationUri": "gs://GCS_WAREHOUSE_PATH", "description": "DESCRIPTION"}' \
-H "Content-Type:application/json" \ "https://biglake.googleapis.com/hive/v1alpha/projects/PROJECT_ID/catalogs?hiveCatalogId=BIGLAKE_CATALOG_ID&primary_location=REGION"
Replace the following:
BIGLAKE_CATALOG_ID: the Hive catalog name.GCS_WAREHOUSE_PATH: the Cloud Storage path that stores your Hive warehouse.PROJECT_ID: your Google Cloud project ID.REGION: the primary region of the metastore. For single-region buckets, it should match the bucket region. For dual-region or multi-region buckets, it should be one of the constituent regions and where the primary replica of the metastore is intended. The other region becomes the secondary replica.DESCRIPTION: a description of the catalog.
Configure and use Spark and Hive
To use the Lakehouse runtime catalog, you must configure your Spark session with specific properties. You can set these properties when you create a Managed Service for Apache Spark cluster or specify them each time you create a session.
These properties include details such as the client factory, Google Cloud project ID, default catalog, and warehouse directory. After the session is established, you can perform fundamental operations like listing existing databases, creating new databases, defining tables, and inserting data.
spark-sql
- Use SSH to connect to your Managed Service for Apache Spark cluster's primary node.
Run
spark-sqlon the command line with the following properties to start an interactive Spark SQL session:spark-sql \ --conf spark.hive.metastore.client.factory.class=com.google.cloud.bigquery.metastore.client.BigLakeMetastoreClientFactory \ --conf spark.hive.metastore.blms.project.id=PROJECT_ID \ --conf spark.hive.metastore.blms.catalog.default=BIGLAKE_CATALOG_ID \ --conf spark.hive.metastore.warehouse.dir=gs://GCS_WAREHOUSE_PATHAfter the session starts, Spark connects to Lakehouse runtime catalog.
Run the following commands to create and query resources:
-- Show all the databases in the current project. SHOW DATABASES; -- Create a database. CREATE DATABASE spark_blms_database; -- Create a Parquet datasource table. CREATE TABLE spark_blms_database.parquet_quick_start (id INT, name STRING) USING PARQUET; -- Insert data into the table. INSERT INTO TABLE spark_blms_database.parquet_quick_start VALUES (1, 'my-first-user');Replace the following:
PROJECT_ID: your Google Cloud project ID.BIGLAKE_CATALOG_ID: the Hive catalog name.GCS_WAREHOUSE_PATH: the Cloud Storage path that stores your Hive warehouse.
Jupyter Notebook
- Complete the instructions to run a Jupyter notebook on a Managed Service for Apache Spark cluster.
- Access the Jupyter web interface from the Web interfaces tab the cluster's details page in the Google Cloud console.
In a new notebook, create a Spark session and then run the following queries:
from pyspark.sql import SparkSession # If a Spark session exists, stop it first by running spark.stop() spark = SparkSession.builder\ .master("local")\ .appName("Lakehouse runtime catalog tutorial")\ .config("spark.hive.metastore.client.factory.class", "com.google.cloud.bigquery.metastore.client.BigLakeMetastoreClientFactory")\ .config("spark.hive.metastore.blms.project.id", "PROJECT_ID")\ .config("spark.hive.metastore.warehouse.dir", "gs://GCS_WAREHOUSE_PATH")\ .config("spark.hive.metastore.blms.catalog.default", "BIGLAKE_CATALOG_ID")\ .getOrCreate() # Show all the databases. df = spark.sql("SHOW DATABASES;") df.show() # Create a database. spark.sql("CREATE DATABASE jupyter_blms_db") # Create a Parquet datasource table. spark.sql("CREATE TABLE jupyter_blms_db.parquet_table(id INT, name STRING) USING PARQUET") # Insert data into the table. spark.sql("INSERT INTO TABLE jupyter_blms_db.parquet_table VALUES (1, 'my-first-user');") # Query from table. spark.sql("SELECT * FROM jupyter_blms_db.parquet_table;").show()Replace the following:
PROJECT_ID: your Google Cloud project ID.GCS_WAREHOUSE_PATH: the Cloud Storage path that stores your Hive warehouse.BIGLAKE_CATALOG_ID: the Hive catalog name.
BigQuery notebook
In the Google Cloud console, go BigQuery.
In the Explorer pane, click + ADD, and then click Python notebook.
In a code cell, configure the Managed Service for Apache Spark session and Lakehouse runtime catalog properties:
from google.cloud.dataproc_spark_connect import DataprocSparkSession from google.cloud.dataproc_v1 import Session import os os.environ['DATAPROC_SPARK_CONNECT_DEFAULT_DATASOURCE'] = "" session = Session() session.environment_config.execution_config.ttl = {"seconds": 864000} session.runtime_config.version = "2.3" session.runtime_config.properties = { "spark.hive.metastore.blms.project.id": "PROJECT_ID", "spark.hive.metastore.blms.catalog.default": "BIGLAKE_CATALOG_ID", "spark.hive.metastore.warehouse.dir": "gs://GCS_WAREHOUSE_PATH", "spark.hive.metastore.client.factory.class": "com.google.cloud.bigquery.metastore.client.BigLakeMetastoreClientFactory", "spark.sql.catalogImplementation": "hive" } spark = DataprocSparkSession.builder.dataprocSessionConfig(session).getOrCreate() print("Spark session created successfully")In another code cell, create a database and table:
# Create a database spark.sql("CREATE DATABASE bq_spark_blms_database;") # Create a parquet datasource table spark.sql("CREATE TABLE bq_spark_blms_database.parquet_quick_start (id INT, name STRING) USING PARQUET;") # Insert data into the table spark.sql("INSERT INTO TABLE bq_spark_blms_database.parquet_quick_start VALUES (1, 'my-first-user');") # Query from table spark.sql("select * from bq_spark_blms_database.parquet_quick_start;").show()Replace the following:
PROJECT_ID: your Google Cloud project ID.GCS_WAREHOUSE_PATH: the Cloud Storage path that stores your Hive warehouse.BIGLAKE_CATALOG_ID: the Hive catalog name.
Hive CLI
To use the Hive CLI, you must configure it to connect to the metastore. You can do this in one of two ways:
- Option 1: Permanent configuration. Update the
/etc/hive/conf/hive-site.xmlfile on your primary node. - Option 2: Temporary configuration. Provide configuration flags when you start the CLI.
To configure the CLI and run your query session, follow these steps:
- Use SSH to connect to your Managed Service for Apache Spark cluster's primary node.
Depending on your configuration method, do one of the following:
To configure permanently (Option 1):
Open
/etc/hive/conf/hive-site.xmlon the primary node and add the following properties:<property> <name>hive.metastore.blms.project.id</name> <value>PROJECT_ID</value> <description></description> </property> <property> <name>hive.metastore.client.factory.class</name> <value>com.google.cloud.bigquery.metastore.client.BigLakeMetastoreClientFactory</value> <description></description> </property> <property> <name>hive.metastore.blms.catalog.default</name> <value>BIGLAKE_CATALOG_ID</value> <description></description> </property> <property> <name>hive.metastore.warehouse.dir</name> <value>gs://GCS_WAREHOUSE_PATH</value> <description></description> </property>Start the Hive CLI:
hive
To configure temporarily on startup (Option 2): Start the Hive CLI with configuration flags:
hive \ --hiveconf hive.metastore.blms.project.id="PROJECT_ID" \ --hiveconf hive.metastore.blms.catalog.default="BIGLAKE_CATALOG_ID" \ --hiveconf hive.metastore.client.factory.class=com.google.cloud.bigquery.metastore.client.BigLakeMetastoreClientFactory \ --hiveconf hive.metastore.warehouse.dir="gs://GCS_WAREHOUSE_PATH"
Then, run the following commands in the Hive CLI session:
show databases; create database hive_query_test; create table hive_query_test.parquet_table (id INT, name STRING) stored as PARQUET; select * from hive_query_test.parquet_table;Replace the following:
PROJECT_ID: your Google Cloud project ID.BIGLAKE_CATALOG_ID: yhe Hive catalog name.GCS_WAREHOUSE_PATH: yhe Cloud Storage path that stores your Hive warehouse.
Submit a Managed Service for Apache Spark batch job
You can submit a PySpark batch workload to Managed Service for Apache Spark that uses the Lakehouse runtime catalog.
This PySpark snippet initializes a Spark session configured to connect with the Lakehouse runtime catalog. It sets essential properties such as the Lakehouse client factory, Google Cloud project ID, default catalog, and warehouse directory. After establishing the session, the code demonstrates how to list existing databases, create a new database, and define a Parquet-formatted table within that database by using Spark SQL commands.
Create a Python file with a PySpark job:
from pyspark.sql import SparkSession spark = ( SparkSession.builder.appName("Lakehouse runtime catalog tutorial") .config( "spark.hive.metastore.client.factory.class", "com.google.cloud.bigquery.metastore.client.BigLakeMetastoreClientFactory") .config("spark.hive.metastore.blms.project.id", "PROJECT_ID") .config("spark.hive.metastore.blms.catalog.default", "BIGLAKE_CATALOG_ID") .config( "spark.hive.metastore.warehouse.dir", "gs://GCS_WAREHOUSE_PATH", ) .enableHiveSupport() .getOrCreate() ) # Show all the databases. spark.sql("SHOW DATABASES;").show() # Create a database. spark.sql("CREATE DATABASE dp_serverless_test") # Create a Parquet datasource table. spark.sql( "CREATE TABLE dp_serverless_test.parquet_table(id INT, name STRING) USING" " PARQUET" ) # Query from table. spark.sql("SELECT * from dp_serverless_test.parquet_table").show()Replace the following:
PROJECT_ID: your Google Cloud project ID.BIGLAKE_CATALOG_ID: the Hive catalog name.GCS_WAREHOUSE_PATH: the Cloud Storage path that stores your Hive warehouse.
Submit the batch job:
gcloud dataproc batches submit pyspark PYTHON_SCRIPT_FILE \ --version=2.2 \ --project=PROJECT_ID \ --region=REGION \ --deps-bucket=gs://CLOUD_STORAGE_BUCKETReplace the following:
PYTHON_SCRIPT_FILE: The path to the PySpark application file. It can be a local path or the Cloud Storage object path.PROJECT_ID: Your project ID.REGION: The region to run the batch job in.CLOUD_STORAGE_BUCKET: The name of the Cloud Storage bucket used to stage any workload's dependencies.
Query the table from BigQuery
After you create resources from Spark in the Lakehouse runtime catalog, you can query them from BigQuery Studio.
In the Google Cloud console, go to BigQuery.
In the query editor, enter the following statement:
SELECT * FROM `PROJECT_ID.BIGLAKE_CATALOG_ID.DATABASE_NAME.TABLE_NAME`;Replace the following:
PROJECT_ID: your project ID.BIGLAKE_CATALOG_ID: your Hive catalog name.DATABASE_NAME: your database name.TABLE_NAME: your table name.