Migrating tables to Iceberg

Apache Iceberg supports migrating data from legacy table formats like Apache Hive or directly from data files stored in the following formats:

  • Apache Parquet
  • Apache ORC
  • Apache Avro

Migrating data from other formats (e.g. CSV, JSON, SequenceFile) requires rewriting the data, since these formats do not support the features needed for schema evolution or type safety.

Apache Spark and Trino provide procedures that translate existing table metadata into Iceberg metadata, referencing the existing data files, and register the result with a compatible catalog. The recipes in this chapter show how to use these procedures (a brief sketch of the calls follows the list):

  • The snapshot procedure creates Iceberg metadata without affecting the original table so you can test compatibility with readers and writers. Table snapshots should be used temporarily for testing.
  • The migrate procedure replaces the original table with an Iceberg table.
  • The add_files procedure adds directories of data files to an existing Iceberg table.
  • Tables already in Iceberg format can be migrated between different catalog implementations using REGISTER TABLE.
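
As a quick preview, the Spark procedure calls look roughly like the following sketch (Trino exposes equivalent statements). The catalog, database, table names, and paths are placeholders; the recipes that follow cover the options each procedure accepts.

-- Create a test Iceberg table from db.events without changing the original
CALL spark_catalog.system.snapshot('db.events', 'db.events_iceberg_test');

-- Replace db.events with an Iceberg table in place
CALL spark_catalog.system.migrate('db.events');

-- Add existing Parquet files under a path to an Iceberg table
CALL spark_catalog.system.add_files(
  table => 'db.events',
  source_table => '`parquet`.`/path/to/files`'
);

-- Register an existing Iceberg table's metadata with another catalog
CALL other_catalog.system.register_table(
  table => 'db.events',
  metadata_file => '/path/to/table/metadata/v3.metadata.json'
);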

Rewriting vs. referencing existing data

Leaving data in place and overlaying Iceberg metadata is the fastest way to test and migrate existing datasets. It is also cost effective because minimal data processing is necessary to manage the dataset going forward with Iceberg.

However, when migrating to Iceberg there are benefits to rewriting data files as opposed to simply referencing the existing ones. A few examples (see the rewrite sketch after this list):

  • Hidden partitioning – If you want to partition data using built-in partition transforms for date/time based on a timestamp field, rewriting the data will restructure all of the existing data to take advantage of the appropriate layout.
  • Column identifiers – Iceberg projects columns by ID, so it expects column identifiers to be written into data files. Existing data files generally do not contain them, so a name-to-field-ID mapping is required when the files are left in place. Rewriting the data writes the column identifiers into the files and keeps schema evolution safe.
  • Object store layout – Legacy Hive and path-structured data layouts are the main source of performance issues with object stores like Amazon S3 and contribute to partition hotspots and throttling. Iceberg is not limited to a specific physical layout and supports multiple strategies including an object store layout optimized for cloud storage.
  • Physical data location – Leaving data in place requires that the location of the original data does not change. If the intention is to migrate to a new bucket or storage location/prefix when migrating to Iceberg, then rewriting the data will be required as all data locations are absolute references in Iceberg metadata.
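
For instance, a rewrite can be a single CTAS into a new Iceberg table that applies a hidden partition transform and the object store layout. The table and column names below are placeholders, and the options shown are only one possible combination:

-- Rewrite a legacy table into an Iceberg table partitioned by day(event_ts)
CREATE TABLE db.events_iceberg
USING iceberg
PARTITIONED BY (days(event_ts))
TBLPROPERTIES ('write.object-storage.enabled'='true')
AS SELECT * FROM db.events_legacy;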

Read and write operation support during migration

One of the most common problems with migrating is moving both producers and consumers of data at the same time. When transitioning ownership of data to an Iceberg table, new data must be committed to the table through the Iceberg commit path. Iceberg relies on metadata to track changes to the dataset which means that adding files to directories under the table location does not automatically add them to the dataset.

Producers and consumers of data must transition together or they will be out of sync. The migration tools provided by Iceberg allow for replacing the catalog entries, but coordination is required during migration.
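
Concretely, once the table is Iceberg, new data is committed with ordinary writes (for example, INSERT INTO from Spark), and files that a legacy producer still writes directly under the table location remain invisible until they are committed, for example by registering them with add_files. The names and paths below are placeholders:

-- New data must be committed through Iceberg
INSERT INTO db.events SELECT * FROM staging.events_batch;

-- Files dropped directly under the table location are not part of the table
-- until they are added explicitly
CALL spark_catalog.system.add_files(
  table => 'db.events',
  source_table => '`parquet`.`/warehouse/db.db/events/new_batch`'
);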

Connecting to a Hive metastore

The recipes in this chapter convert tables from a Hive metastore. If you don’t have one running and want to test out the commands, you can run one locally in Docker:

# Run a Hive Metastore
mkdir -p /tmp/hive/data/warehouse
docker run -d -p 9083:9083 --env SERVICE_NAME=metastore \
  --name hive-metastore \
  --mount type=bind,source=/tmp/hive/data/warehouse,target=/opt/hive/data/warehouse \
  apache/hive:4.0.0-beta-1

To connect to a Hive metastore from Spark, the following settings are needed:

  • spark.hadoop.hive.metastore.uris – sets the Hive metastore URI for Spark’s session catalog
  • spark.sql.catalog.spark_catalog – sets up an Iceberg catalog to wrap the session catalog
  • spark.sql.catalog.spark_catalog.type=hive – sets the Iceberg catalog to connect to HMS
  • spark.sql.catalog.spark_catalog.uri – sets the Hive metastore URI for the Iceberg catalog

Here’s a complete example that runs Spark locally and connects to the Docker Hive metastore:

# Start a Spark SQL shell (3.5.0)
./bin/spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.spark_catalog.uri=thrift://localhost:9083 \
  --conf spark.sql.catalog.spark_catalog.warehouse=/tmp/hive/data/warehouse \
  --conf spark.sql.warehouse.dir=/tmp/hive/data/warehouse \
  --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083
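
Once the shell starts, you can sanity-check the metastore connection and create a small Hive-format table to experiment with in the following recipes. The table and column names here are only examples:

-- Confirm the session catalog can reach the metastore
SHOW DATABASES;

-- Create a small Parquet table in the Hive metastore to try the procedures on
CREATE TABLE default.events_legacy (id bigint, event_ts timestamp, message string)
USING parquet;

INSERT INTO default.events_legacy VALUES (1, current_timestamp(), 'hello iceberg');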