Hive ADD FILES

MIGRATING TO ICEBERG

The ADD FILES procedure in Apache Iceberg adds data in existing files to a table without rewriting the data. The procedure inspects each data file’s metadata to build Iceberg metadata and statistics, then commits this information to the table. If you are migrating from an Apache Hive table, the SNAPSHOT and MIGRATE procedures are preferred because they provide a reliable workflow for conversion; add_files, however, is a useful tool for custom migrations.
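
If you do use those procedures, they are invoked the same way as add_files. A minimal sketch, assuming a hypothetical Hive table hive_db.source registered in an Iceberg-enabled session catalog:

-- Create a new Iceberg table from the Hive table, leaving the original in place
CALL system.snapshot('hive_db.source', 'hive_db.source_iceberg');

-- Or convert the Hive table in place to an Iceberg table
CALL system.migrate('hive_db.source');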

Data files added by this procedure must be in a supported format, namely Parquet, ORC, or Avro.

Data files from Hive tables require setting a name mapping on the Iceberg table so that columns tracked only by name can be resolved against Iceberg’s ID-based schema. File schemas must be compatible with the table schema (after the name mapping is applied) or projected column values will be null. This is done automatically when using the SNAPSHOT and MIGRATE procedures.

For this example, a Hive metastore is not required. You can use any catalog that supports Iceberg.

Creating a name mapping

Hive Parquet and Avro tables use name-based schemas, but Iceberg resolves columns based on unique field IDs for safer and more reliable schema evolution. Name-based resolution can be mapped into Iceberg’s ID system using a name mapping.

If a Parquet or Avro file doesn’t have field IDs in its metadata, Iceberg uses the table’s name mapping to assign IDs based on the file’s field names.

To create a name mapping, you can use the utility class MappingUtil in the Java API. The name mapping must be stored as JSON in table properties as schema.name-mapping.default. This Java example shows how to create a name mapping from an Iceberg schema and set it on a table.

import org.apache.iceberg.Table;
import org.apache.iceberg.mapping.MappingUtil;
import org.apache.iceberg.mapping.NameMapping;
import org.apache.iceberg.mapping.NameMappingParser;

Table table = catalog.loadTable(...);

// Create a name mapping
NameMapping nameMapping = MappingUtil.create(table.schema());
String nameMappingJson = NameMappingParser.toJson(nameMapping);

// Add the name mapping to the target table
table.updateProperties()
  .set("schema.name-mapping.default", nameMappingJson)
  .commit();

The name mapping created in this example uses the Iceberg table’s names. If the names in your Iceberg table differ from the names in the data files you’re importing, you will need to add or replace the names in the generated JSON mapping.

You can also write the mapping directly. The JSON format is straightforward:

[ {
  "field-id" : 1,
  "names" : [ "id" ]
}, {
  "field-id" : 2,
  "names" : [ "data", "data_alias" ]
} ]
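
A hand-written mapping can also be set with SQL rather than the Java API. A minimal sketch, assuming a hypothetical table db.target whose data files contain the two fields mapped above:

-- Store the mapping JSON in the table properties (hypothetical table name)
ALTER TABLE db.target SET TBLPROPERTIES (
  'schema.name-mapping.default' = '[{"field-id": 1, "names": ["id"]}, {"field-id": 2, "names": ["data", "data_alias"]}]'
);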

Creating example data files

To exercise this procedure, first create source Parquet files by writing to a Hive table that uses Parquet as its storage format.

CREATE DATABASE IF NOT EXISTS cookbook;
USE cookbook;

CREATE TABLE parquet_data (s string) USING PARQUET
    LOCATION 'file:/tmp/parquet_data/';
-- Time taken: 0.091 seconds

-- Insert a row to create a data file

INSERT INTO parquet_data values ('parquet data');
-- Time taken: 0.621 seconds

There is now one Parquet data file with the value "parquet data", which we will add to an Iceberg table.
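
You can optionally verify the file before importing it by reading the directory directly in Spark SQL; a quick check, assuming the file:/tmp/parquet_data location used in this example:

SELECT * FROM parquet.`file:/tmp/parquet_data`;
-- parquet data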

Running the add_files procedure

To use the add_files procedure, first create an Iceberg table that the files will be added to.

CREATE TABLE ice (s string) USING ICEBERG;

Next, call the procedure with the destination table and a Parquet dataset.

CALL system.add_files('cookbook.ice', '`parquet`.`file:/tmp/parquet_data`');
-- 1  1  -- added_files_count, changed_partition_count

SELECT * FROM ice;
-- parquet data

SELECT file_path FROM cookbook.ice.files;
-- file:/tmp/parquet_data/part-00000-117c016d-06210e4f5425-c000.snappy.parquet

Now that the files are added, the Iceberg metadata contains entries for the data files in their original location, and those files are considered part of the table going forward.
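
To confirm how the commit was recorded, you can also query the table's snapshots metadata table; a short sketch (snapshot details will vary in your environment):

-- add_files commits the new file entries as an append snapshot
SELECT committed_at, operation FROM cookbook.ice.snapshots;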