This document details how data types and storage formats behave when integrating Spark and Hive with BigQuery through the Lakehouse runtime catalog.
Specifically, this document provides:
- Supported storage formats: A compatibility breakdown of formats like Parquet, ORC, Avro, CSV, and JSON across Hive SerDe and Spark data sources.
- Data type mappings: The conversion rules from Spark data types to BigQuery data types.
Use this page to verify that your table schemas and storage formats align with the metastore before you run workloads or query tables across engines.
Supported storage formats between Hive and Spark
The following sections describe the storage format and data source compatibility between Hive, Spark, and BigQuery.
Detailed storage format mapping
BigQuery determines the storage format of a table based on the
input_format, output_format, and SerDe library in the metadata. The
following table maps these properties to the BigQuery storage format.
| Input format | Output format | SerDe library | BigQuery storage format |
|---|---|---|---|
| org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | Parquet |
| org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | org.apache.hadoop.hive.ql.io.orc.OrcSerde | ORC |
| org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat | org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat | org.apache.hadoop.hive.serde2.avro.AvroSerDe | Avro |
| org.apache.hadoop.mapred.TextInputFormat | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | org.apache.hadoop.hive.serde2.OpenCSVSerde | CSV |
| org.apache.hadoop.mapred.TextInputFormat | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | org.openx.data.jsonserde.JsonSerDe | JSON |
| org.apache.hadoop.mapred.TextInputFormat | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | org.apache.hive.hcatalog.data.JsonSerDe | JSON |
Hive SerDe compatibility
The following table lists the compatibility of Hive SerDe table formats with BigQuery.
| Format | Spark SQL DDL syntax | Queryable from BigQuery |
|---|---|---|
| Parquet | CREATE TABLE ... STORED AS PARQUET | Yes |
| ORC | CREATE TABLE ... STORED AS ORC | Yes |
| Avro | CREATE TABLE ... STORED AS AVRO | Yes |
| CSV | CREATE TABLE ... ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' | Yes |
| JSON | CREATE TABLE ... ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' | Yes |
Spark data source compatibility
The following table lists the compatibility of Spark data source table formats with BigQuery.
CSV and JSON SerDe tables are queryable from BigQuery. However, CSV and JSON Spark data source tables are not.
| Format | Spark SQL DDL syntax | Queryable from BigQuery |
|---|---|---|
| Parquet | CREATE TABLE ... USING PARQUET | Yes |
| ORC | CREATE TABLE ... USING ORC | Yes |
| Avro | CREATE TABLE ... USING AVRO | Yes |
| CSV | CREATE TABLE ... USING CSV | No |
| JSON | CREATE TABLE ... USING JSON | No |
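The two compatibility tables can be combined into a single check that runs before a table is created. The following Python sketch is one possible way to encode them. The labels `hive_serde` and `spark_datasource` and the function `is_queryable_from_bigquery` are hypothetical names chosen for illustration, not identifiers from Spark, Hive, or BigQuery.

```python
from typing import Optional

# Compatibility values transcribed from the two tables in this document.
# "hive_serde" covers STORED AS / ROW FORMAT tables; "spark_datasource"
# covers CREATE TABLE ... USING tables. Both labels are hypothetical.
QUERYABLE = {
    "hive_serde": {
        "PARQUET": True, "ORC": True, "AVRO": True, "CSV": True, "JSON": True,
    },
    "spark_datasource": {
        "PARQUET": True, "ORC": True, "AVRO": True, "CSV": False, "JSON": False,
    },
}


def is_queryable_from_bigquery(table_kind: str, fmt: str) -> Optional[bool]:
    """Return True or False as listed, or None if the format is not listed."""
    if table_kind not in QUERYABLE:
        raise ValueError(f"unknown table kind: {table_kind!r}")
    return QUERYABLE[table_kind].get(fmt.upper())


print(is_queryable_from_bigquery("hive_serde", "csv"))        # True
print(is_queryable_from_bigquery("spark_datasource", "csv"))  # False
```

Returning `None` for formats that appear in neither table keeps the sketch from claiming anything this document does not state.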
Supported data types from Spark to BigQuery
The following table maps Spark data types to BigQuery data types.
| Spark data type | BigQuery data type |
|---|---|
| BYTE or TINYINT | INT64 |
| SMALLINT or SHORT | INT64 |
| INT or INTEGER | INT64 |
| BIGINT or LONG | INT64 |
| DECIMAL or NUMERIC | BIGNUMERIC |
| FLOAT | FLOAT64 |
| DOUBLE | FLOAT64 |
| REAL | FLOAT64 |
| BOOLEAN | BOOL |
| STRING | STRING |
| VARCHAR | STRING |
| CHAR or CHARACTER | STRING |
| BINARY | BYTES |
| DATE | DATE |
| TIMESTAMP or TIMESTAMP_LTZ | TIMESTAMP |
| ARRAY | ARRAY |
| STRUCT<col_name: type1, ...> | STRUCT<col_name: type1, ...> |
| MAP<key_type, value_type> | ARRAY<STRUCT<key: key_type, value: value_type>> |

To enable the MAP mapping, send an email to biglake-help@google.com. This is only necessary if your workloads use MAP.
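To preview how a schema would translate, the type table can be applied recursively to a Spark type string. The following Python sketch is a minimal illustration under several assumptions that this document does not confirm: element, field, key, and value types are mapped recursively; a precision or length suffix such as `DECIMAL(10, 2)` or `VARCHAR(255)` is ignored; struct fields use the `name: type` form; and BigQuery restrictions on nested types are not checked. The function `spark_to_bigquery` is a hypothetical helper, and its output follows the notation of the table above rather than BigQuery DDL syntax.

```python
import re
from typing import List

# Scalar mappings transcribed from the table in this document.
SCALARS = {
    "BYTE": "INT64", "TINYINT": "INT64",
    "SMALLINT": "INT64", "SHORT": "INT64",
    "INT": "INT64", "INTEGER": "INT64",
    "BIGINT": "INT64", "LONG": "INT64",
    "DECIMAL": "BIGNUMERIC", "NUMERIC": "BIGNUMERIC",
    "FLOAT": "FLOAT64", "DOUBLE": "FLOAT64", "REAL": "FLOAT64",
    "BOOLEAN": "BOOL",
    "STRING": "STRING", "VARCHAR": "STRING",
    "CHAR": "STRING", "CHARACTER": "STRING",
    "BINARY": "BYTES",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP", "TIMESTAMP_LTZ": "TIMESTAMP",
}


def _split_top_level(text: str) -> List[str]:
    """Split on commas that are not nested inside <...> or (...)."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def spark_to_bigquery(spark_type: str) -> str:
    """Map a Spark type string to the BigQuery type listed in the table."""
    t = spark_type.strip()
    upper = t.upper()
    if upper.startswith("ARRAY<") and t.endswith(">"):
        # Assumption: the element type is mapped recursively.
        return f"ARRAY<{spark_to_bigquery(t[6:-1])}>"
    if upper.startswith("MAP<") and t.endswith(">"):
        # MAP becomes an array of key/value structs, as in the table.
        key, value = _split_top_level(t[4:-1])
        return (
            f"ARRAY<STRUCT<key: {spark_to_bigquery(key)}, "
            f"value: {spark_to_bigquery(value)}>>"
        )
    if upper.startswith("STRUCT<") and t.endswith(">"):
        fields = []
        for field in _split_top_level(t[7:-1]):
            # Assumption: fields are written as "name: type".
            name, field_type = field.split(":", 1)
            fields.append(f"{name.strip()}: {spark_to_bigquery(field_type)}")
        return "STRUCT<" + ", ".join(fields) + ">"
    # Assumption: drop a trailing (precision, scale) or (length) suffix.
    base = re.sub(r"\(.*\)$", "", upper).strip()
    if base not in SCALARS:
        raise ValueError(f"type not listed in the mapping table: {spark_type!r}")
    return SCALARS[base]


print(spark_to_bigquery("MAP<STRING, INT>"))
# ARRAY<STRUCT<key: STRING, value: INT64>>
print(spark_to_bigquery("STRUCT<a: TINYINT, b: ARRAY<DOUBLE>>"))
# STRUCT<a: INT64, b: ARRAY<FLOAT64>>
```

Types that do not appear in the table raise `ValueError` rather than guessing a mapping.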