Cross-cloud Lakehouse lets you query data stored in other cloud providers directly from Google Cloud without migrating files or building complex ETL pipelines.
As part of Lakehouse for Apache Iceberg, this capability lets you perform unified analytics and apply AI across your distributed datasets using BigQuery, standalone Apache Spark environments, or Managed Service for Apache Spark.
Use cases
Cross-cloud Lakehouse supports several key use cases for accessing data across multiple cloud providers:
- Reduced data movement: Query data stored in other cloud environments directly, which simplifies data access and processing.
- Unified analytics: Perform advanced analytics with consistent features and hardware optimization across all your data, regardless of where it resides.
- Cross-cloud AI and ML: Apply AI models, autonomous agents, and machine learning directly to your remote data without migrating it.
How cross-cloud Lakehouse works
Cross-cloud Lakehouse queries remote data using the following process:
- Metadata discovery: Lakehouse connects to remote Apache Iceberg REST catalogs, such as Databricks Unity Catalog, and discovers the data without copying any files. Lakehouse authenticates to the remote catalog using credentials stored in Secret Manager.
- Secure transport: Routing traffic over a private interconnect (for example, Cross-Cloud Interconnect (CCI) or Partner Interconnect) can reduce data transfer costs compared to the public internet and makes latency more predictable.
- Optimized execution: As queries read data from remote clouds, Lakehouse temporarily caches those data segments locally within Google Cloud on specialized storage. Subsequent queries use the local cache, which avoids a significant portion of cross-cloud egress charges.
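The effect of the local cache on egress can be illustrated with a toy model. This is a minimal sketch, not the Lakehouse implementation: the `SegmentCache` class, the least-recently-used eviction policy, and the segment sizes are all assumptions made for illustration. It only shows why repeated reads of the same segments stop generating cross-cloud transfer.

```python
from collections import OrderedDict


class SegmentCache:
    """Toy LRU cache keyed by data-segment ID (hypothetical model)."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.segments: "OrderedDict[str, int]" = OrderedDict()
        self.egress_bytes = 0  # bytes fetched from the remote cloud so far

    def read(self, segment_id: str, size_bytes: int) -> bool:
        """Return True on a cache hit; on a miss, count the fetch as egress."""
        if segment_id in self.segments:
            self.segments.move_to_end(segment_id)  # mark as recently used
            return True
        self.egress_bytes += size_bytes  # miss: data crosses clouds
        # Evict least-recently-used segments until the new one fits.
        while self.segments and self.used_bytes + size_bytes > self.capacity_bytes:
            _, evicted_size = self.segments.popitem(last=False)
            self.used_bytes -= evicted_size
        if size_bytes <= self.capacity_bytes:
            self.segments[segment_id] = size_bytes
            self.used_bytes += size_bytes
        return False


# Two identical "queries" that each read three 100-byte segments.
cache = SegmentCache(capacity_bytes=300)
first_query = [cache.read(s, 100) for s in ("a", "b", "c")]
egress_after_first = cache.egress_bytes
second_query = [cache.read(s, 100) for s in ("a", "b", "c")]
egress_after_second = cache.egress_bytes
```

In this model the first query misses on every segment and the second query is served entirely from the cache, so the egress counter does not grow after the first pass. Real savings depend on cache size, eviction behavior, and how much queries overlap.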
Core concepts
This section describes the key components essential to using cross-cloud Lakehouse.
Apache Iceberg REST catalog federation
Catalog federation is the metadata layer. You connect Lakehouse to remote Apache Iceberg REST catalogs, such as Databricks Unity Catalog, and Lakehouse discovers the data without copying any files. Lakehouse authenticates through Workload Identity Federation (OIDC) or OAuth credentials, so it does not require long-lived access keys.
Transport layer
The transport layer determines the network path that queries take to reach remote data. You can configure Lakehouse to query data stored in remote cloud providers over either the public internet or a dedicated private interconnect.
Select the transport method that matches your architectural and security requirements:
Customer-owned (CCI)
You can configure BigQuery to query data stored in Amazon S3 buckets in Amazon Web Services (AWS) over a private, dedicated network connection using either Cross-Cloud Interconnect or Partner Interconnect.
Using a private interconnect provides the following benefits:
- Enhanced security: Data travels across a private network connection between Google Cloud and AWS, avoiding the public internet.
- Reduced costs: Egress charges from AWS are potentially lower than internet egress charges, especially when you already have private interconnect capacity available.
- Consistent performance: More predictable network latency and bandwidth compared to the public internet.
Architecture overview
To enable private querying, you configure a path from BigQuery to your Amazon S3 bucket through your private interconnect. A key component in the Google Cloud Virtual Private Cloud (VPC) network is an Internal Load Balancer (ILB). The ILB distributes requests from BigQuery to the private endpoints for Amazon S3 within your AWS VPC, which are provisioned using AWS PrivateLink.
Using an ILB with multiple Elastic Network Interfaces (ENIs) as backends is essential for load balancing, scalability, and high availability. This applies whether you use Dedicated CCI or Partner Interconnect.
The private query workflow follows this process:
- BigQuery uses a connection configured with a Service Directory service.
- Service Directory resolves the service name to the internal IP address of the Google Cloud ILB.
- The ILB receives the requests from BigQuery and distributes them to configured backends.
- The ILB backends are Hybrid Connectivity Network Endpoint Groups (NEGs), each pointing to the private IP address of an ENI in your AWS VPC.
- Traffic flows from the ILB, through the NEGs, across the private interconnect, to the AWS ENIs.
- The AWS ENIs, part of an Amazon S3 VPC Interface Endpoint (AWS PrivateLink), provide private access to the Amazon S3 service.
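The resolution and distribution steps above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the class and function names, the service name `s3-private`, the IP addresses, and the round-robin distribution are all hypothetical, and a real ILB chooses backends according to its own balancing configuration.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class HybridNeg:
    """A hybrid connectivity NEG pointing at one ENI's private IP in the AWS VPC."""
    name: str
    eni_ip: str


class InternalLoadBalancer:
    """Toy ILB that cycles through its NEG backends (round-robin is an assumption)."""

    def __init__(self, ip: str, backends: List[HybridNeg]) -> None:
        if not backends:
            raise ValueError("an ILB needs at least one backend")
        self.ip = ip
        self._next_backend: Iterator[HybridNeg] = cycle(backends)

    def route(self) -> HybridNeg:
        """Pick the backend that receives the next request."""
        return next(self._next_backend)


def route_request(
    service_name: str,
    service_directory: Dict[str, str],
    load_balancers: Dict[str, InternalLoadBalancer],
) -> str:
    """Resolve the service name to an ILB IP, then return the ENI IP chosen by the ILB."""
    ilb_ip = service_directory[service_name]  # Service Directory lookup
    neg = load_balancers[ilb_ip].route()      # ILB distributes to a NEG
    return neg.eni_ip                         # traffic crosses the interconnect to this ENI


# Hypothetical names and private (RFC 1918) addresses, for illustration only.
ilb = InternalLoadBalancer(
    "10.128.0.5",
    [HybridNeg("neg-a", "10.0.1.10"), HybridNeg("neg-b", "10.0.2.10")],
)
service_directory = {"s3-private": ilb.ip}
load_balancers = {ilb.ip: ilb}
targets = [route_request("s3-private", service_directory, load_balancers) for _ in range(4)]
print(targets)
```

With two backends, consecutive requests alternate between the two ENI addresses, which is the property that makes multiple ENIs useful for availability and scale.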
Public internet (no CCI)
If you do not configure a private interconnect, queries to your remote catalog travel over the public internet by default.
When querying data over the public internet, consider the following implications:
- Standard encryption: Data access requests and data transfers are encrypted in transit using standard TLS protocols across the public internet.
- Egress costs: Data transfer incurs standard internet egress charges from your remote cloud provider (for example, AWS), which are typically higher than private interconnect egress rates.
- Variable latency: Network performance, bandwidth, and latency depend on public internet routing and congestion, resulting in less predictable query execution times compared to a dedicated private interconnect.
- Simplified setup: Requires no additional networking infrastructure, VPC peering, or Service Directory configuration in Google Cloud or your remote cloud provider.
Architecture overview
When querying data over the public internet, Lakehouse connects directly to your remote catalog and object storage endpoints without requiring private Google Cloud or remote cloud networking infrastructure.
The public internet query workflow follows this process:
- BigQuery initiates a query against a federated table defined in your Lakehouse catalog.
- Lakehouse authenticates with your remote Apache Iceberg catalog (for example, Databricks Unity Catalog) using credentials stored in Secret Manager.
- Lakehouse retrieves the table metadata and manifest files across the public internet to identify the relevant underlying data files (for example, in Amazon S3).
- Data access requests for the underlying objects are sent directly from Google Cloud over the public internet using standard TLS encryption.
- The remote storage service verifies the request using temporary, scoped credentials vended by Lakehouse and returns the requested data blocks across the public internet to Google Cloud.
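The metadata retrieval step can be made concrete by looking at the shape of a table-load request in the Apache Iceberg REST catalog specification. The sketch below only builds the URL and headers and sends nothing. The helper name `build_load_table_request`, the host, the prefix, the namespace, and the token are hypothetical, and whether Lakehouse issues requests in exactly this form, or requests vended credentials from the catalog through this header, is an assumption rather than something the workflow above states.

```python
from typing import Dict, List, Tuple
from urllib.parse import quote


def build_load_table_request(
    catalog_uri: str,
    prefix: str,
    namespace: List[str],
    table: str,
    token: str,
) -> Tuple[str, Dict[str, str]]:
    """Build the URL and headers for an Iceberg REST 'load table' call (GET)."""
    # The REST spec joins multi-level namespaces with the 0x1F unit separator.
    encoded_namespace = quote("\x1f".join(namespace), safe="")
    parts = ["v1"]
    if prefix:  # the prefix segment is optional and catalog-specific
        parts.append(quote(prefix, safe=""))
    parts += ["namespaces", encoded_namespace, "tables", quote(table, safe="")]
    url = catalog_uri.rstrip("/") + "/" + "/".join(parts)
    headers = {
        # Placeholder bearer token; a real one would come from a secret store.
        "Authorization": f"Bearer {token}",
        # Spec-defined header asking the catalog for temporary storage credentials.
        "X-Iceberg-Access-Delegation": "vended-credentials",
    }
    return url, headers


# Hypothetical endpoint and identifiers, for illustration only.
url, headers = build_load_table_request(
    "https://catalog.example.com/api/iceberg/",
    "main",
    ["sales", "eu"],
    "orders",
    "EXAMPLE_TOKEN",
)
print(url)
```

The response to such a request carries the table's metadata location, from which an engine reads manifest files to decide which data files to fetch from object storage.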