Set up cross-cloud Lakehouse

This document describes how to set up a cross-cloud Lakehouse to query data from a Databricks Unity Catalog catalog directly within Google Cloud. This capability unifies your data analytics by integrating your external data sources with your existing Google Cloud environment.

Afterward, you can use Lakehouse for Apache Iceberg to manage access to your federated data.

Before you begin

  1. Review the Lakehouse overview to understand how Lakehouse manages access to data.
  2. Read About cross-cloud Lakehouse to understand how it works.
  3. Understand how to use regional Secret Manager secrets. This is required to set up a cross-cloud Lakehouse.
  4. Generate an OAuth Service Principal (Client ID and Secret) within your remote catalog provider (for example, Databricks) that has read access to the target catalog. This process is outside of the scope of this documentation.
  5. Optional: If you plan to route queries over a private interconnect between your Google Cloud VPC and AWS VPC, ensure that you have an active Amazon Web Services (AWS) account, provision a Cross-Cloud Interconnect or Partner Interconnect, establish BGP sessions with your Cloud Router, and verify that you have the required IAM permissions in both cloud environments.
  6. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  7. Verify that billing is enabled for your Google Cloud project.

  8. Enable the BigLake and Secret Manager APIs.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the APIs
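
    As a command-line alternative, the following is a minimal sketch that checks billing and enables both APIs with the gcloud CLI. It assumes the service names biglake.googleapis.com and secretmanager.googleapis.com. The function names are illustrative and are not part of any Google Cloud tooling.

```shell
# Hypothetical helper: prints the billingEnabled field for the project.
# $1 = project ID
check_billing() {
  gcloud billing projects describe "$1" --format="value(billingEnabled)"
}

# Hypothetical helper: enables the two APIs that this setup depends on.
# $1 = project ID
enable_lakehouse_apis() {
  gcloud services enable \
    biglake.googleapis.com \
    secretmanager.googleapis.com \
    --project="$1"
}

# Usage (replace PROJECT_ID with your project ID):
#   check_billing PROJECT_ID
#   enable_lakehouse_apis PROJECT_ID
```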

Required roles

To get the permissions that you need to set up cross-cloud Lakehouse, ask your administrator to grant you the following IAM roles on your project:

  • Manage Lakehouse catalogs: BigLake Admin (roles/biglake.admin)
  • Manage secrets: Secret Manager Admin (roles/secretmanager.admin)
  • Route traffic over private interconnect: Compute Network Admin (roles/compute.networkAdmin), Service Directory Viewer (roles/servicedirectory.viewer), and Service Directory PSC Authorized Service (roles/servicedirectory.pscAuthorizedService)

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.
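
If you are the administrator granting these roles, the following sketch shows one way to do it with gcloud projects add-iam-policy-binding. The helper name and the example principal are assumptions for illustration. Pass only the roles that apply to your setup, because the last three roles listed above are needed only for private interconnect routing.

```shell
# Hypothetical helper: grants each listed role to one principal on a project.
# $1 = project ID, $2 = principal (for example, user:alex@example.com),
# remaining arguments = role names
grant_roles() {
  project="$1"
  member="$2"
  shift 2
  for role in "$@"; do
    gcloud projects add-iam-policy-binding "$project" \
      --member="$member" \
      --role="$role"
  done
}

# Usage without a private interconnect:
#   grant_roles PROJECT_ID user:alex@example.com \
#     roles/biglake.admin roles/secretmanager.admin
```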

Limitations and considerations

This section lists the limitations and considerations for using cross-cloud Lakehouse.

  • Supported catalogs: Databricks Unity Catalog on Amazon Web Services (AWS) and Google Cloud.
  • Supported cloud providers: Private interconnect with cross-cloud Lakehouse is supported only when the remote cloud provider is Amazon Web Services (AWS). You can use either a Cross-Cloud Interconnect or a Partner Interconnect.
  • BigQuery UI browsing: The BigQuery UI does not support browsing the federated catalog tree natively. You must verify the setup using the CLI and query tables using the 4-part table path.
  • Network routing: If a private interconnect (such as a customer-owned Cross-Cloud Interconnect (CCI) or a Partner Interconnect) is not configured, queries route over the public internet. This might result in higher AWS egress fees and less predictable performance.
  • Data freshness: The --refresh-interval flag for the federated catalog determines how often metadata is synchronized. A shorter interval provides fresher data but might incur additional API costs from the remote catalog provider.
  • Iceberg Metrics Reporting: Iceberg Metrics Reporting isn't available for federated catalogs. Set the rest-metrics-reporting-enabled property to false in your Iceberg client when accessing a federated catalog.

General workflow

To set up and use cross-cloud Lakehouse, follow these general steps:

  • Set up Cross-Cloud Interconnect (Optional): Configure a private connection between your Google Cloud VPC and your remote cloud provider.
  • Set up federation: Create a secret in Secret Manager with your remote catalog credentials. Then, create a federated catalog in Lakehouse and grant it access to the secret.
  • Verify the connection: Ensure Lakehouse can successfully connect to your remote catalog.
  • Query data: Run queries against your federated data using BigQuery or Managed Service for Apache Spark. For more information, see Use cross-cloud Lakehouse.
  • Configure permissions: Use Identity and Access Management (IAM) to manage who can view and query the federated data.

Set up Cross-Cloud Interconnect (Optional)

Queries to your remote catalog travel over the public internet by default. To help enhance security and compliance, provide predictable performance, and reduce data transfer costs, use a private interconnect. This establishes a dedicated, private network connection between your Google Cloud Virtual Private Cloud (VPC) and your remote cloud provider's network (for example, AWS).

You can provision and configure either of the following private interconnect options between your Google Cloud VPC and your AWS VPC:

  • Cross-Cloud Interconnect
  • Partner Interconnect

After you provision the interconnect, establish BGP sessions between your Cloud Router in Google Cloud and your AWS VPC so that the two networks exchange routes.

To enable private querying, you must configure a path from Lakehouse to your Amazon S3 bucket through your private interconnect. There are two architectural flows you can follow to configure this routing:

  • Internal Load Balancer (ILB) routing (Recommended): This flow uses a Google Cloud Internal Load Balancer to distribute requests across Hybrid Connectivity Network Endpoint Groups (NEGs) pointing to multiple AWS Elastic Network Interfaces (ENIs). This flow is essential for load balancing, scalability, and high availability.
  • Direct endpoint routing: This flow connects Service Directory directly to a single AWS Interface VPC Endpoint IP address.

Select the configuration flow that matches your architecture requirements:

Internal Load Balancer

To configure an Internal Load Balancer (ILB) to distribute requests across multiple AWS ENIs for high availability and load balancing, follow these steps:

Configure AWS networking

First, create an Amazon S3 VPC Interface Endpoint (AWS PrivateLink):

  1. In the AWS VPC console, create an Interface Endpoint for Amazon S3.
  2. For the service name, specify com.amazonaws.<var>AWS_REGION</var>.s3.
  3. Select the VPC and subnets that are connected through Direct Connect to your Google Cloud VPC.
  4. Attach Security Groups to the endpoint to control inbound access.
  5. After the endpoint is created, AWS provisions an Elastic Network Interface (ENI) in each selected subnet. Note the private IP address of each ENI.

Next, configure Security Groups:

  • Ensure that the Security Group or groups attached to the Amazon S3 Endpoint ENIs allow inbound TCP traffic on port 443 from the relevant IP ranges of your Google Cloud VPC.
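
If you prefer the AWS CLI to the AWS console, the steps above can be sketched as follows. This is a minimal example under stated assumptions: the helper names are hypothetical, a single subnet and security group are passed for brevity, and DNS options are left at their defaults, which you might need to adjust for your VPC.

```shell
# Hypothetical helper: creates the Amazon S3 interface endpoint.
# $1 = AWS region, $2 = VPC ID, $3 = subnet ID, $4 = security group ID
create_s3_interface_endpoint() {
  aws ec2 create-vpc-endpoint \
    --region "$1" \
    --vpc-endpoint-type Interface \
    --vpc-id "$2" \
    --service-name "com.amazonaws.$1.s3" \
    --subnet-ids "$3" \
    --security-group-ids "$4"
}

# Hypothetical helper: allows inbound HTTPS from one Google Cloud VPC range.
# $1 = AWS region, $2 = security group ID, $3 = CIDR range of the Google Cloud VPC
allow_https_from_gcp() {
  aws ec2 authorize-security-group-ingress \
    --region "$1" \
    --group-id "$2" \
    --protocol tcp \
    --port 443 \
    --cidr "$3"
}
```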

Configure Google Cloud networking

First, create Hybrid Connectivity Network Endpoint Groups (NEGs) for each AWS Amazon S3 ENI private IP address. These NEGs are zonal.

# Repeat for each AWS ENI IP address
gcloud compute network-endpoint-groups create NEG_NAME_1 \
    --project=PROJECT_ID \
    --network-endpoint-type=NON_GCP_PRIVATE_IP_PORT \
    --zone=GCP_ZONE \
    --network=VPC_NETWORK \
    --default-port=443

gcloud compute network-endpoint-groups update NEG_NAME_1 \
    --project=PROJECT_ID \
    --zone=GCP_ZONE \
    --add-endpoint="ip=AWS_ENI_IP_1,port=443"

# Example for a second ENI
gcloud compute network-endpoint-groups create NEG_NAME_2 \
    --project=PROJECT_ID \
    --network-endpoint-type=NON_GCP_PRIVATE_IP_PORT \
    --zone=GCP_ZONE \
    --network=VPC_NETWORK \
    --default-port=443

gcloud compute network-endpoint-groups update NEG_NAME_2 \
    --project=PROJECT_ID \
    --zone=GCP_ZONE \
    --add-endpoint="ip=AWS_ENI_IP_2,port=443"

Replace the following:

  • NEG_NAME_1, NEG_NAME_2: unique identifiers for your NEGs.
  • PROJECT_ID: your Google Cloud project ID.
  • GCP_ZONE: the Google Cloud zone. For example, us-east4-a.
  • VPC_NETWORK: the Google Cloud VPC network name.
  • AWS_ENI_IP_1, AWS_ENI_IP_2: the private IP addresses of your AWS ENIs.
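
When you have more than two ENIs, repeating the commands by hand becomes error-prone. The following sketch wraps the same two commands in a loop. The helper name and the s3-eni-neg-N naming scheme are assumptions for illustration, and all NEGs are created in a single zone, as in the example above.

```shell
# Hypothetical helper: creates one hybrid connectivity NEG per ENI IP address.
# $1 = project ID, $2 = zone, $3 = VPC network, remaining arguments = ENI IPs
create_s3_negs() {
  project="$1"
  zone="$2"
  network="$3"
  shift 3
  index=1
  for ip in "$@"; do
    neg="s3-eni-neg-$index"   # illustrative naming scheme
    gcloud compute network-endpoint-groups create "$neg" \
      --project="$project" \
      --network-endpoint-type=NON_GCP_PRIVATE_IP_PORT \
      --zone="$zone" \
      --network="$network" \
      --default-port=443
    gcloud compute network-endpoint-groups update "$neg" \
      --project="$project" \
      --zone="$zone" \
      --add-endpoint="ip=$ip,port=443"
    index=$((index + 1))
  done
}

# Usage:
#   create_s3_negs PROJECT_ID GCP_ZONE VPC_NETWORK AWS_ENI_IP_1 AWS_ENI_IP_2
```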

Next, configure the Internal Load Balancer (ILB):

  1. Create a health check:

    gcloud compute health-checks create tcp HEALTH_CHECK_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --port=443

    Replace the following:

    • HEALTH_CHECK_NAME: a unique identifier for your health check.
    • PROJECT_ID: your Google Cloud project ID.
    • REGION: the Google Cloud region. For example, us-east4. This must be the same region as the federated catalog.
  2. Create a regional backend service:

    gcloud compute backend-services create BACKEND_SERVICE_NAME \
        --project=PROJECT_ID \
        --load-balancing-scheme=INTERNAL \
        --protocol=TCP \
        --region=REGION \
        --health-checks=HEALTH_CHECK_NAME \
        --health-checks-region=REGION

    Replace the following:

    • BACKEND_SERVICE_NAME: a unique identifier for your backend service.
  3. Add the NEGs to the backend service:

    gcloud compute backend-services add-backend BACKEND_SERVICE_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --network-endpoint-group=NEG_NAME_1 \
        --network-endpoint-group-zone=GCP_ZONE \
        --balancing-mode=CONNECTION
    
    gcloud compute backend-services add-backend BACKEND_SERVICE_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --network-endpoint-group=NEG_NAME_2 \
        --network-endpoint-group-zone=GCP_ZONE \
        --balancing-mode=CONNECTION

    Replace the following:

    • NEG_NAME_1, NEG_NAME_2: the names of your zonal NEGs created in the previous step.
    • GCP_ZONE: the Google Cloud zone where your NEGs are located. For example, us-east4-a.
  4. Create a forwarding rule:

    gcloud compute forwarding-rules create FORWARDING_RULE_NAME \
        --project=PROJECT_ID \
        --region=REGION \
        --load-balancing-scheme=INTERNAL \
        --network=VPC_NETWORK \
        --subnet=GCP_SUBNET \
        --ip-protocol=TCP \
        --ports=443 \
        --backend-service=BACKEND_SERVICE_NAME \
        --backend-service-region=REGION

    After creation, note the internal IP address assigned to the forwarding rule. This is your ILB_IP_ADDRESS.

    Replace the following:

    • FORWARDING_RULE_NAME: a unique identifier for your forwarding rule.
    • VPC_NETWORK: the Google Cloud VPC network name.
    • GCP_SUBNET: the Google Cloud subnet name within your VPC network where the ILB will be provisioned.
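
To look up the ILB_IP_ADDRESS value instead of copying it from the creation output, you can describe the forwarding rule and print only its IPAddress field. This is a minimal sketch, and the helper name is hypothetical.

```shell
# Hypothetical helper: prints the internal IP address of a forwarding rule.
# $1 = forwarding rule name, $2 = project ID, $3 = region
get_ilb_ip() {
  gcloud compute forwarding-rules describe "$1" \
    --project="$2" \
    --region="$3" \
    --format="value(IPAddress)"
}

# Usage:
#   ILB_IP_ADDRESS="$(get_ilb_ip FORWARDING_RULE_NAME PROJECT_ID REGION)"
```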

Configure Service Directory

Register the ILB's IP address in Service Directory, so Lakehouse can discover it.

  1. Create a namespace for your remote cloud:

    gcloud service-directory namespaces create NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION

    Replace the following:

    • NAMESPACE: a unique identifier for your namespace.
    • PROJECT_ID: your Google Cloud project ID.
    • REGION: the Google Cloud region. For example, us-east4. This must be the same region as the federated catalog.
  2. Create a service in the Service Directory namespace:

    gcloud service-directory services create SERVICE_NAME \
        --namespace=NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION

    Replace the following:

    • SERVICE_NAME: a unique identifier for your service.
  3. Create an endpoint for the ILB in the service:

    gcloud service-directory endpoints create ENDPOINT_NAME \
        --project=PROJECT_ID \
        --namespace=NAMESPACE \
        --service=SERVICE_NAME \
        --location=REGION \
        --network=projects/PROJECT_NUMBER/global/networks/VPC_NETWORK \
        --address=ILB_IP_ADDRESS \
        --port=443

    Replace the following:

    • ENDPOINT_NAME: a unique identifier for your endpoint.
    • PROJECT_NUMBER: your Google Cloud project number. Use your project number in the --network flag.
    • ILB_IP_ADDRESS: the internal IP address of your ILB forwarding rule.
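
Because the --network flag expects the project number rather than the project ID, it can help to look up the number and build the network path in one place. The following sketch shows one way to do that. Both helper names are hypothetical.

```shell
# Hypothetical helper: prints the project number for a project ID.
# $1 = project ID
get_project_number() {
  gcloud projects describe "$1" --format="value(projectNumber)"
}

# Hypothetical helper: builds the value for the --network flag.
# $1 = project number, $2 = VPC network name
network_path() {
  printf 'projects/%s/global/networks/%s\n' "$1" "$2"
}

# Usage:
#   PROJECT_NUMBER="$(get_project_number PROJECT_ID)"
#   network_path "$PROJECT_NUMBER" VPC_NETWORK
```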

Direct endpoint

To configure Service Directory to route traffic directly to a single AWS Interface VPC Endpoint IP address, follow these steps:

  1. Create an Interface VPC Endpoint for Amazon S3 inside your AWS VPC. Note the IP address and port of this endpoint.
  2. Create a namespace for your remote cloud:

    gcloud service-directory namespaces create NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION

    Replace the following:

    • NAMESPACE: a unique identifier for your namespace.
    • PROJECT_ID: your Google Cloud project ID.
    • REGION: the Google Cloud region. For example, us-east4. This must be the same region as the federated catalog.
  3. Create a service in the Service Directory namespace:

    gcloud service-directory services create SERVICE_NAME \
        --namespace=NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION

    Replace the following:

    • SERVICE_NAME: a unique identifier for your service.
  4. Create an endpoint in the service containing the routing information for your Amazon S3 Interface VPC Endpoint:

    gcloud service-directory endpoints create ENDPOINT_NAME \
        --service=SERVICE_NAME \
        --namespace=NAMESPACE \
        --project=PROJECT_ID \
        --location=REGION \
        --address=S3_VPCE_IP_ADDRESS \
        --port=S3_VPCE_PORT \
        --network=projects/PROJECT_NUMBER/global/networks/VPC_NETWORK

    Replace the following:

    • ENDPOINT_NAME: a unique identifier for your endpoint.
    • S3_VPCE_IP_ADDRESS: the IP address of your Amazon S3 Interface VPC Endpoint. For example, 10.0.1.45.
    • S3_VPCE_PORT: the port number of your Amazon S3 Interface VPC Endpoint. For example, 443.
    • PROJECT_NUMBER: your Google Cloud project number. Use your project number in the --network flag.
    • VPC_NETWORK: the Google Cloud VPC network name associated with your private interconnect.

Set up federation

To query your data, you must set up a Lakehouse federated catalog that connects to your remote catalog. The following examples demonstrate this process for a Databricks Unity Catalog catalog.

Create a regional secret

Federation requires credentials to access the remote catalog. Lakehouse uses regional Secret Manager secrets to securely store and retrieve these credentials to authenticate with your remote provider.

For Databricks, you must create a Service Principal in your Databricks account and generate an OAuth Client ID and Client Secret. Ensure this Service Principal has read access to the target Unity Catalog catalog. You then format these credentials as a JSON payload to store in Secret Manager.

  1. Create a JSON file named credentials.json with your payload:

    {
      "client_id": "CLIENT_ID",
      "client_secret": "CLIENT_SECRET"
    }

    Replace the following:

    • CLIENT_ID: the OAuth Client ID for your Databricks Service Principal.
    • CLIENT_SECRET: the OAuth Client Secret for your Databricks Service Principal.
  2. Configure the regional endpoint for Secret Manager:

    By default, Secret Manager uses a global endpoint. However, cross-cloud Lakehouse requires that your secret be stored in the same region as your Lakehouse catalog. To interact with regional secrets using the gcloud CLI, override the default API endpoint in your active gcloud CLI configuration with the regional endpoint. For example, the regional endpoint for us-east4 is secretmanager.us-east4.rep.googleapis.com.

    gcloud config set api_endpoint_overrides/secretmanager https://secretmanager.REGION.rep.googleapis.com/

    Replace the following:

    • REGION: the Google Cloud region where your Secret Manager secret is stored. For example, us-east4. To avoid connectivity issues, create your secret and your catalog in the same region.
  3. Upload the payload to Secret Manager:

    gcloud secrets create DATABRICKS_SECRET_NAME \
      --location="REGION" \
      --project="PROJECT_ID" \
      --data-file=credentials.json

    Replace the following:

    • DATABRICKS_SECRET_NAME: a name for your Databricks secret.
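
Because credentials.json holds a client secret in plain text, you might want to create it with restrictive permissions and delete it after the upload. The following is a minimal sketch of that idea. The helper name is hypothetical, and it assumes that the client ID and secret contain no characters that need JSON escaping, such as double quotes or backslashes.

```shell
# Hypothetical helper: writes the credential payload readable only by its owner.
# $1 = output path, $2 = OAuth client ID, $3 = OAuth client secret
write_credentials() {
  (
    umask 077   # new files are created with mode 600
    printf '{\n  "client_id": "%s",\n  "client_secret": "%s"\n}\n' "$2" "$3" > "$1"
  )
}

# Usage:
#   write_credentials credentials.json CLIENT_ID CLIENT_SECRET
#   ...run the gcloud secrets create command shown above...
#   rm -f credentials.json
#
# To return to the global Secret Manager endpoint later:
#   gcloud config unset api_endpoint_overrides/secretmanager
```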

Create a federated catalog

Create the federated catalog using the Google Cloud console or the gcloud alpha biglake iceberg catalogs create command.

Console

  1. In the Google Cloud console, go to Lakehouse.

    Go to Lakehouse

  2. Click Create catalog.

  3. Click Federated catalog.

    The Catalog configuration details appear.

  4. For Federated catalog source, select Unity (Databricks).

  5. For Data location, select the Lakehouse region where you want to create the federated catalog. For example, us-east4. To minimize latency, even over the public internet, do the following when selecting a region:

    • If your Unity Catalog catalog is on AWS, select the Google Cloud region closest to your AWS region.
    • If your Unity Catalog catalog is on Google Cloud, select the exact same region.
  6. Click Continue.

    The Connection details appear.

  7. In the Remote catalog details section, under Unity instance name, enter your target Databricks instance name. For example: abcd.cloud.databricks.com.

  8. Under Unity catalog name, enter the name of the target Databricks Unity Catalog catalog to federate to.

  9. In the Authentication and network section, under Secret, enter the name of your Databricks secret. Use the following format: projects/PROJECT_ID/locations/REGION/secrets/DATABRICKS_SECRET_NAME.

  10. Optional: Under Service directory name, enter the path to your Service Directory service. Use the following format: projects/PROJECT_ID/locations/REGION/namespaces/NAMESPACE/services/SERVICE_NAME. This is only required if you configured a private interconnect.

  11. Click Create.

gcloud CLI

Public internet (no CCI)

If you don't configure CCI, the connection securely travels over the public internet.

gcloud alpha biglake iceberg catalogs create FEDERATED_CATALOG_NAME \
--project="PROJECT_ID" \
--primary-location="REGION" \
--catalog-type="federated" \
--federated-catalog-type="unity" \
--secret-name="projects/PROJECT_ID/locations/REGION/secrets/DATABRICKS_SECRET_NAME" \
--unity-instance-name="UNITY_INSTANCE_NAME" \
--unity-catalog-name="UNITY_CATALOG_NAME" \
--refresh-interval="REFRESH_INTERVAL" \
--namespace-filters="NAMESPACE_FILTERS"

Replace the following:

  • PROJECT_ID: your Google Cloud project ID.

  • REGION: the Lakehouse region where the federated catalog is created. For example, us-east4. To minimize latency, do the following when selecting a region:

    • If your Unity Catalog catalog is on AWS, select the Google Cloud region closest to your AWS region.
    • If your Unity Catalog catalog is on Google Cloud, select the exact same region.

  • DATABRICKS_SECRET_NAME: the name of your Databricks secret.

  • UNITY_INSTANCE_NAME: your target Databricks instance name. For example: abcd.cloud.databricks.com.

  • UNITY_CATALOG_NAME: the name of the target Databricks Unity Catalog catalog to federate to.

  • REFRESH_INTERVAL: Specifies how often to update the catalog's information. Set this value as a duration, for example, 330s or 5m30s. Shorter intervals update data more often but can cost more in API calls. Longer intervals can cost less, but the queried data might not reflect your most current dataset. If omitted or if you set the value to 0s, then updates will be disabled.

  • NAMESPACE_FILTERS: (OPTIONAL) a comma-separated list of namespaces to federate. For example, ns1,ns2. If omitted, all namespaces will be included.

Customer-owned (CCI)

If you configured a private interconnect (such as Dedicated CCI or Partner Interconnect), provide the Service Directory service reference to ensure that Lakehouse routes traffic privately.

gcloud alpha biglake iceberg catalogs create FEDERATED_CATALOG_NAME \
--project="PROJECT_ID" \
--primary-location="REGION" \
--catalog-type="federated" \
--federated-catalog-type="unity" \
--secret-name="projects/PROJECT_ID/locations/REGION/secrets/DATABRICKS_SECRET_NAME" \
--unity-instance-name="UNITY_INSTANCE_NAME" \
--unity-catalog-name="UNITY_CATALOG_NAME" \
--refresh-interval="REFRESH_INTERVAL" \
--namespace-filters="NAMESPACE_FILTERS" \
--service-directory-name="projects/PROJECT_ID/locations/REGION/namespaces/NAMESPACE/services/SERVICE_NAME"

Replace the following:

  • PROJECT_ID: your Google Cloud project ID.
  • PROJECT_NUMBER: your Google Cloud project number.
  • REGION: the Lakehouse region where the federated catalog is created. For example, us-east4. This must be the same region as the Service Directory namespace and regional secret. To minimize latency, do the following when selecting a region:
    • If your Unity Catalog catalog is on AWS, select the Google Cloud region closest to your AWS region.
    • If your Unity Catalog catalog is on Google Cloud, select the exact same region.
  • DATABRICKS_SECRET_NAME: the name of your Databricks secret.
  • UNITY_INSTANCE_NAME: your target Databricks instance name. For example: abcd.cloud.databricks.com.
  • UNITY_CATALOG_NAME: the name of the target Databricks Unity Catalog catalog to federate.
  • REFRESH_INTERVAL: Specifies how often to update the catalog's information. Set this value as a duration, for example, 330s or 5m30s. Shorter intervals update data more often but can cost more in API calls. Longer intervals can cost less, but the queried data might not reflect your most current dataset. If omitted or if you set the value to 0s, then updates will be disabled.
  • NAMESPACE_FILTERS: (OPTIONAL) a comma-separated list of namespaces to federate. For example, ns1,ns2. If omitted, all namespaces will be included.
  • NAMESPACE: the Service Directory namespace you created during private interconnect setup.
  • SERVICE_NAME: the Service Directory service name you created during private interconnect setup.
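
Because the secret, the Service Directory namespace, and the catalog must all share one region, deriving both resource paths from a single set of variables can help avoid a mismatch. The following sketch builds the values for the --secret-name and --service-directory-name flags. The helper names are hypothetical.

```shell
# Hypothetical helper: builds the value for the --secret-name flag.
# $1 = project ID, $2 = region, $3 = secret name
secret_resource() {
  printf 'projects/%s/locations/%s/secrets/%s\n' "$1" "$2" "$3"
}

# Hypothetical helper: builds the value for the --service-directory-name flag.
# $1 = project ID, $2 = region, $3 = namespace, $4 = service name
service_directory_resource() {
  printf 'projects/%s/locations/%s/namespaces/%s/services/%s\n' "$1" "$2" "$3" "$4"
}

# Usage: reuse the same REGION value for every path and for --primary-location.
#   REGION=us-east4
#   secret_resource PROJECT_ID "$REGION" DATABRICKS_SECRET_NAME
#   service_directory_resource PROJECT_ID "$REGION" NAMESPACE SERVICE_NAME
```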

Grant the federated catalog access to the secret

When the catalog is created, Lakehouse provisions a unique service account for it (returned as biglake-service-account in the resource description).

Grant this service account permission to access the secret that you created earlier in this document. IAM policy changes can take a few minutes to propagate.

# Required to use regional secrets
gcloud config set api_endpoint_overrides/secretmanager https://secretmanager.REGION.rep.googleapis.com/
gcloud secrets add-iam-policy-binding DATABRICKS_SECRET_NAME \
  --project="PROJECT_ID" \
  --location="REGION" \
  --member="serviceAccount:$(gcloud alpha biglake iceberg catalogs describe FEDERATED_CATALOG_NAME \
      --project="PROJECT_ID" \
      --location="REGION" \
      --format='value(biglake-service-account)')" \
  --role="roles/secretmanager.secretAccessor"
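
The command above nests the describe call inside the --member flag, so an empty result would produce an invalid member value. As an alternative, the following sketch splits the two steps and stops early if no service account is returned. It reuses the same commands and flags as above. The helper name is hypothetical.

```shell
# Hypothetical helper: looks up the catalog's service account, then grants it
# access to the secret. Assumes the regional Secret Manager endpoint override
# shown above is already set.
# $1 = catalog name, $2 = project ID, $3 = region, $4 = secret name
grant_catalog_secret_access() {
  sa="$(gcloud alpha biglake iceberg catalogs describe "$1" \
    --project="$2" \
    --location="$3" \
    --format='value(biglake-service-account)')"
  if [ -z "$sa" ]; then
    echo "No service account returned for catalog $1" >&2
    return 1
  fi
  gcloud secrets add-iam-policy-binding "$4" \
    --project="$2" \
    --location="$3" \
    --member="serviceAccount:$sa" \
    --role="roles/secretmanager.secretAccessor"
}

# Usage:
#   grant_catalog_secret_access FEDERATED_CATALOG_NAME PROJECT_ID REGION DATABRICKS_SECRET_NAME
```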

Verify the connection

Use the describe command to verify that Lakehouse can connect to your remote catalog:

gcloud alpha biglake iceberg catalogs describe FEDERATED_CATALOG_NAME \
     --project="PROJECT_ID" \
     --location="REGION"

To verify that the federated catalog service account has access to the secret, run the following command:

# Required to use regional secrets
gcloud config set api_endpoint_overrides/secretmanager https://secretmanager.REGION.rep.googleapis.com/
gcloud secrets get-iam-policy DATABRICKS_SECRET_NAME \
     --project="PROJECT_ID" \
     --location="REGION"

In the output, verify that the biglake-service-account service account has the roles/secretmanager.secretAccessor role assigned to it.
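
To avoid reading the full policy by eye, you can filter the output down to the members that hold the accessor role. The following sketch uses the gcloud CLI --flatten, --filter, and --format flags. The helper name is hypothetical, and it assumes that the regional Secret Manager endpoint override is already set.

```shell
# Hypothetical helper: prints the members that hold the Secret Accessor role.
# $1 = secret name, $2 = project ID, $3 = region
secret_accessors() {
  gcloud secrets get-iam-policy "$1" \
    --project="$2" \
    --location="$3" \
    --flatten="bindings[].members" \
    --filter="bindings.role:roles/secretmanager.secretAccessor" \
    --format="value(bindings.members)"
}

# Usage: the catalog's service account should appear in the output.
#   secret_accessors DATABRICKS_SECRET_NAME PROJECT_ID REGION
```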

What's next