Supported storage formats and data types

This document details how data types and storage formats behave when integrating Spark and Hive with BigQuery through the Lakehouse runtime catalog.

Specifically, this document provides:

  • Supported storage formats: A compatibility breakdown of formats like Parquet, ORC, Avro, CSV, and JSON across Hive SerDe and Spark data sources.
  • Data type mappings: The precise conversion rules between Spark and BigQuery data types.

Use this page to verify that your table schemas and storage formats align with the metastore before you run workloads or query tables across engines.

Supported storage formats between Hive and Spark

The following sections describe the storage format and data source compatibility between Hive, Spark, and BigQuery.

Detailed storage format mapping

BigQuery determines the storage format of a table based on the input_format, output_format, and SerDe library in the metadata. The following table maps these properties to the BigQuery storage format.

Input format | Output format | SerDe library | BigQuery storage format
--- | --- | --- | ---
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | Parquet
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | org.apache.hadoop.hive.ql.io.orc.OrcSerde | ORC
org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat | org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat | org.apache.hadoop.hive.serde2.avro.AvroSerDe | Avro
org.apache.hadoop.mapred.TextInputFormat | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | org.apache.hadoop.hive.serde2.OpenCSVSerde | CSV
org.apache.hadoop.mapred.TextInputFormat | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | org.openx.data.jsonserde.JsonSerDe | JSON
org.apache.hadoop.mapred.TextInputFormat | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | org.apache.hive.hcatalog.data.JsonSerDe | JSON
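
The mapping above can be sketched as a simple lookup, which may be handy for checking table metadata before running a workload. This is a minimal illustration built only from the rows in the table: the function name `bigquery_storage_format` is hypothetical, and the assumption that all three properties must match a row exactly is ours, not a statement about how BigQuery resolves formats internally.

```python
from typing import Optional

# Shared text-based input/output formats used by the CSV and JSON rows.
TEXT_INPUT = "org.apache.hadoop.mapred.TextInputFormat"
TEXT_OUTPUT = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"

# (input_format, output_format, serde_library) -> BigQuery storage format,
# copied from the table in this section.
FORMAT_TABLE = {
    ("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
     "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
     "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"): "Parquet",
    ("org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
     "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
     "org.apache.hadoop.hive.ql.io.orc.OrcSerde"): "ORC",
    ("org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat",
     "org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat",
     "org.apache.hadoop.hive.serde2.avro.AvroSerDe"): "Avro",
    (TEXT_INPUT, TEXT_OUTPUT,
     "org.apache.hadoop.hive.serde2.OpenCSVSerde"): "CSV",
    (TEXT_INPUT, TEXT_OUTPUT,
     "org.openx.data.jsonserde.JsonSerDe"): "JSON",
    (TEXT_INPUT, TEXT_OUTPUT,
     "org.apache.hive.hcatalog.data.JsonSerDe"): "JSON",
}


def bigquery_storage_format(input_format: str,
                            output_format: str,
                            serde_library: str) -> Optional[str]:
    """Return the storage format for a listed combination, or None.

    Hypothetical helper: exact matching of all three values is an
    assumption made for this sketch.
    """
    return FORMAT_TABLE.get((input_format, output_format, serde_library))
```

A combination that does not appear in the table returns `None`, which a caller could treat as a signal to review the table definition.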

Hive SerDe compatibility

The following table lists the compatibility of Hive SerDe table formats with BigQuery.

Format | Spark SQL DDL syntax | Queryable from BigQuery
--- | --- | ---
Parquet | CREATE TABLE ... STORED AS PARQUET | Yes
ORC | CREATE TABLE ... STORED AS ORC | Yes
Avro | CREATE TABLE ... STORED AS AVRO | Yes
CSV | CREATE TABLE ... ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' | Yes
JSON | CREATE TABLE ... ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' | Yes

Spark data source compatibility

The following table lists the compatibility of Spark data source table formats with BigQuery.

Note: Hive SerDe tables in CSV and JSON format are queryable from BigQuery, but Spark data source tables in CSV and JSON format are not.

Format | Spark SQL DDL syntax | Queryable from BigQuery
--- | --- | ---
Parquet | CREATE TABLE ... USING PARQUET | Yes
ORC | CREATE TABLE ... USING ORC | Yes
Avro | CREATE TABLE ... USING AVRO | Yes
CSV | CREATE TABLE ... USING CSV | No
JSON | CREATE TABLE ... USING JSON | No
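
The two compatibility tables can be combined into a single pre-flight check. The sketch below is illustrative only: the constants `HIVE_SERDE` and `SPARK_DATA_SOURCE` and the function `is_queryable_from_bigquery` are hypothetical names introduced here, and the data is transcribed from the tables in this section rather than read from any API.

```python
# Hypothetical labels for the two ways a table can be declared in Spark SQL.
HIVE_SERDE = "hive_serde"                # STORED AS ... / ROW FORMAT SERDE ...
SPARK_DATA_SOURCE = "spark_data_source"  # USING ...

# (table kind, format) -> queryable from BigQuery, per the tables above.
QUERYABLE = {
    (HIVE_SERDE, "PARQUET"): True,
    (HIVE_SERDE, "ORC"): True,
    (HIVE_SERDE, "AVRO"): True,
    (HIVE_SERDE, "CSV"): True,
    (HIVE_SERDE, "JSON"): True,
    (SPARK_DATA_SOURCE, "PARQUET"): True,
    (SPARK_DATA_SOURCE, "ORC"): True,
    (SPARK_DATA_SOURCE, "AVRO"): True,
    (SPARK_DATA_SOURCE, "CSV"): False,
    (SPARK_DATA_SOURCE, "JSON"): False,
}


def is_queryable_from_bigquery(table_kind: str, storage_format: str) -> bool:
    """Look up whether a table kind and format pair is listed as queryable.

    Raises ValueError for combinations that this page does not list,
    rather than guessing an answer.
    """
    key = (table_kind, storage_format.upper())
    if key not in QUERYABLE:
        raise ValueError(f"Combination not listed: {key}")
    return QUERYABLE[key]
```

For example, `is_queryable_from_bigquery(SPARK_DATA_SOURCE, "csv")` returns `False`, while the same format declared as a Hive SerDe table returns `True`.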

Supported data types from Spark to BigQuery

The following table maps Spark data types to BigQuery data types.

Spark data type | BigQuery data type
--- | ---
BYTE or TINYINT | INT64
SMALLINT or SHORT | INT64
INT or INTEGER | INT64
BIGINT or LONG | INT64
DECIMAL or NUMERIC | BIGNUMERIC
FLOAT | FLOAT64
DOUBLE | FLOAT64
REAL | FLOAT64
BOOLEAN | BOOL
STRING | STRING
VARCHAR | STRING
CHAR or CHARACTER | STRING
BINARY | BYTES
DATE | DATE
TIMESTAMP or TIMESTAMP_LTZ | TIMESTAMP
ARRAY | ARRAY
STRUCT<col_name: type1, ...> | STRUCT<col_name: type1, ...>
MAP<key_type, value_type> | ARRAY<STRUCT<key: key_type, value: value_type>>

MAP support is not enabled by default. To enable it, send an email to biglake-help@google.com. This step is necessary only if your workloads use MAP.
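
To make the nested cases concrete, the following sketch applies the mapping table recursively to a Spark type string. It is a simplified illustration, not the catalog's conversion logic: the helper names `split_top_level` and `to_bigquery_type` are hypothetical, type parameters such as the precision in `DECIMAL(10,2)` are dropped as a simplifying assumption, only the `name: type` struct field form shown in the table is handled, and the output uses this page's notation rather than validated BigQuery DDL.

```python
import re

# Scalar mappings transcribed from the table in this section.
SCALAR_TYPES = {
    "BYTE": "INT64", "TINYINT": "INT64",
    "SMALLINT": "INT64", "SHORT": "INT64",
    "INT": "INT64", "INTEGER": "INT64",
    "BIGINT": "INT64", "LONG": "INT64",
    "DECIMAL": "BIGNUMERIC", "NUMERIC": "BIGNUMERIC",
    "FLOAT": "FLOAT64", "DOUBLE": "FLOAT64", "REAL": "FLOAT64",
    "BOOLEAN": "BOOL",
    "STRING": "STRING", "VARCHAR": "STRING",
    "CHAR": "STRING", "CHARACTER": "STRING",
    "BINARY": "BYTES",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP", "TIMESTAMP_LTZ": "TIMESTAMP",
}


def split_top_level(text):
    """Split on commas that are not nested inside <...> or (...)."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return parts


def to_bigquery_type(spark_type):
    """Map a Spark type string to the BigQuery type listed on this page."""
    t = spark_type.strip()
    if t.endswith(">") and "<" in t:
        head, inner = t.split("<", 1)
        head, inner = head.strip().upper(), inner[:-1]
        if head == "ARRAY":
            return f"ARRAY<{to_bigquery_type(inner)}>"
        if head == "MAP":
            parts = split_top_level(inner)
            if len(parts) != 2:
                raise ValueError(f"Malformed MAP type: {spark_type}")
            key, value = (to_bigquery_type(p) for p in parts)
            # MAP becomes an array of key/value structs.
            return f"ARRAY<STRUCT<key: {key}, value: {value}>>"
        if head == "STRUCT":
            fields = []
            for field in split_top_level(inner):
                # Assumes the "name: type" form; other forms raise ValueError.
                name, field_type = field.split(":", 1)
                fields.append(f"{name.strip()}: {to_bigquery_type(field_type)}")
            return "STRUCT<" + ", ".join(fields) + ">"
        raise ValueError(f"Unsupported type: {spark_type}")
    # Drop parameters such as (10,2) or (20) before the lookup (assumption).
    base = re.sub(r"\(.*\)$", "", t.upper()).strip()
    if base not in SCALAR_TYPES:
        raise ValueError(f"Unsupported type: {spark_type}")
    return SCALAR_TYPES[base]
```

With this sketch, `to_bigquery_type("MAP<STRING, INT>")` returns `ARRAY<STRUCT<key: STRING, value: INT64>>`, and `to_bigquery_type("STRUCT<a: DECIMAL(10,2), b: ARRAY<VARCHAR(20)>>")` returns `STRUCT<a: BIGNUMERIC, b: ARRAY<STRING>>`.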
