Limitations and considerations

Integrating Spark and Hive with the Lakehouse runtime catalog eliminates the operational overhead of maintaining a self-hosted Hive Metastore (HMS) while enabling unified metadata sharing and direct table queries in BigQuery.

This document describes the functional constraints and service considerations of this integration. Before you migrate or build open-source data pipelines on the Lakehouse runtime catalog, review these limitations to determine whether this preview meets your technical requirements.

If you are looking for configuration and query instructions instead of limits, see Use Spark and Hive with the Lakehouse runtime catalog.

Lakehouse runtime catalog limitations

This section lists the limitations of using the Lakehouse runtime catalog with various services.

Metastore limitations

  • Managed Service for Apache Spark supports only PySpark jobs with the Lakehouse runtime catalog.
  • The Dataproc API doesn't support setting Lakehouse runtime catalog properties in the properties field.
  • You can't create Managed Service for Apache Spark clusters that use Kerberos, because Lakehouse runtime catalog doesn't support delegation token or primary key APIs.
  • Databases and tables can use a Cloud Storage location_uri that differs from the location of their Hive catalog, as long as the Cloud Storage bucket is in the same region as the Hive catalog.

Table limitations

  • Table renaming isn't supported.
  • Partition renaming isn't supported.
  • Deleting tables or databases doesn't remove associated files from Cloud Storage.
  • Case-insensitive search isn't supported.
  • Clustering and bucketing aren't supported.

Partition batch size

The Lakehouse runtime catalog stores and retrieves partition information for use in partition pruning. It's optimized for reads rather than writes, which speeds up queries that use partition pruning.

To keep partition ingestion performant, the partition batch size is limited to 900.

Set the following Hive and Spark properties, which determine the batch size of partition operations:

  • SET hive.msck.repair.batch.size = 900;
  • SET spark.sql.addPartitionInBatch.size = 900;

BigQuery limitations

  • By default, BigQuery doesn't support ARRAY<ARRAY<>> or ARRAY<MAP<>> data types. Support for MAP must be added to an allowlist. Contact biglake-help@google.com if your workloads use MAP extensively.
  • MAP key types support only primitive data types. You can't use ARRAY, STRUCT, or MAP as key types.
  • During the preview, BigQuery can query only data from Cloud Storage. The following limitations apply:
    • Table location URIs can't include a wildcard (*).
    • Table location URIs must be directories.
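As a rough illustration of the two URI rules above, the following sketch checks a table location before you register the table. The function name check_table_location is hypothetical, and the "looks like a file" test is only a heuristic that assumes a final path segment with a dot is a file name. Cloud Storage has no true directories, so this check can flag a prefix such as v1.0 incorrectly.

```python
from typing import List
from urllib.parse import urlparse


def check_table_location(uri: str) -> List[str]:
    """Return a list of problems found in a table location URI (empty if none)."""
    problems = []
    parsed = urlparse(uri)

    # During the preview, BigQuery can query only data from Cloud Storage.
    if parsed.scheme != "gs" or not parsed.netloc:
        problems.append("location must be a Cloud Storage URI (gs://bucket/...)")

    # Table location URIs can't include a wildcard.
    if "*" in uri:
        problems.append("location must not contain a wildcard (*)")

    # Heuristic (assumption): a last segment with a dot and no trailing slash
    # is treated as a file name rather than a directory.
    last_segment = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    if "." in last_segment and not parsed.path.endswith("/"):
        problems.append("location looks like a file; a directory is required")

    return problems
```

A location such as gs://my-bucket/warehouse/sales passes, while gs://my-bucket/warehouse/sales/*.parquet is reported for both the wildcard and the file-like final segment. The bucket and path names here are placeholders.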

Cross-region replication and disaster recovery limitations

The Lakehouse runtime catalog offers cross-region replication and disaster recovery to improve your catalog's availability and resilience.

When using the Lakehouse runtime catalog with Hive catalogs, the following limitations apply:

  • Hive catalogs don't provide full disaster recovery capabilities, such as user-initiated failover.

  • When you create a Hive catalog, you must set its primary_location to match your Cloud Storage bucket's region. The Lakehouse runtime catalog then automatically copies the metadata to a secondary region based on your bucket's dual-region or multi-region configuration. This secondary metadata copy is read-only, and you can't promote it to primary. Data redundancy relies on your bucket's dual-region or multi-region settings and is separate from Lakehouse runtime catalog metadata replication.

Considerations for using Lakehouse runtime catalog as a Hive metastore replacement

The preview version of the Lakehouse runtime catalog supports a subset of the Hive Metastore interface. This design prioritizes compatibility with the Spark ExternalCatalog, which doesn't require full compatibility with the Hive Metastore.

Resource mapping

The following table maps Hive Metastore resources to the Lakehouse runtime catalog resources and their required Identity and Access Management (IAM) permissions.

Hive Metastore resource    Lakehouse runtime catalog resource    IAM permission
Catalog                    Catalog                               biglake.catalogs.*
Database                   Database                              biglake.namespaces.*
Table                      Table                                 biglake.tables.*

Governance

The Hive Metastore (HMS) provides governance at the table, column, and partition levels. The Lakehouse runtime catalog provides table-level and partition-level IAM permissions. Column-level governance isn't supported.

Storage limitations

  • All BigQuery external table limitations apply.

Partition limitations

  • Tracking column-level statistics at the partition level isn't supported.
  • The BatchCreateHivePartitions API limits calls to 900 partitions.
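Because each call accepts at most 900 partitions, a client that registers more partitions than that has to split them across several calls. The following sketch shows one way to do the splitting. The function name chunk_partitions is hypothetical, and the code deliberately stops short of calling the API, whose request format isn't described in this document.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

# Per-call limit stated in this document for BatchCreateHivePartitions.
MAX_PARTITIONS_PER_CALL = 900


def chunk_partitions(
    partitions: Iterable[T], size: int = MAX_PARTITIONS_PER_CALL
) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `size` partitions, preserving order."""
    if not 1 <= size <= MAX_PARTITIONS_PER_CALL:
        raise ValueError(f"size must be between 1 and {MAX_PARTITIONS_PER_CALL}")
    iterator = iter(partitions)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

With 2,000 partition specs, this yields batches of 900, 900, and 200 items, and each batch would map to one API call in your own client code.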

What's next