Apache Iceberg Guide¶

Tags guide formats iceberg multi-engine

Iceberg provides table format features with true multi-engine support. This guide covers how to benchmark with Iceberg across different query engines, catalog configuration, and multi-engine workflows.

Overview¶

Apache Iceberg was created at Netflix to solve a specific problem: running the same analytics across multiple query engines without locking into one vendor. Unlike Hive tables (Hive metastore-dependent), Iceberg tables work identically across Spark, Trino, Flink, Athena, Snowflake, and more.

For benchmarking, Iceberg’s engine independence enables fair comparisons. We can load TPC-H data once, store it in Iceberg format, and run the same queries on Spark, Trino, and Athena. The data is identical across engines, isolating engine performance from data format differences.

We added Iceberg support to BenchBox for users building multi-engine lakehouses and those running on platforms where Iceberg is the native format (Athena, Starburst, Snowflake).

Multi-Engine Support¶

Supported Engines¶

The same Iceberg table is readable by:

Apache Spark
Trino and Presto
Apache Flink
Amazon Athena
Snowflake
BigQuery
DuckDB (via extension)

This multi-engine support comes from a carefully designed metadata layer that doesn’t depend on any specific query engine’s implementation.

Why Engine Independence Matters¶

For benchmarking, engine independence means:

Load data once, query from multiple engines
Fair comparisons (identical data across engines)
Production-like conditions for multi-engine architectures
No vendor lock-in in your benchmark methodology

Iceberg Architecture¶

Metadata Files¶

iceberg_table/
├── metadata/
│   ├── v1.metadata.json
│   ├── v2.metadata.json (current version)
│   └── snap-xxxxx.avro (manifest list)
├── data/
│   ├── partition=2024-01/
│   │   └── data-xxxxx.parquet
│   └── partition=2024-02/
│       └── data-xxxxx.parquet
└── (catalog points here)

Manifest Lists and Files¶

Metadata files: Track table state, schema, and partitioning
Manifest lists: Point to manifest files per snapshot
Manifest files: List data files and their statistics

Data Files¶

Data files are typically Parquet (or ORC, Avro) containing actual data. Iceberg manages the metadata layer; the underlying storage format is Parquet in most cases.

Iceberg vs Delta Lake¶

Aspect	Iceberg	Delta Lake
Engine lock-in	None	Databricks-optimized
Catalog options	REST, Hive, Glue, Nessie	Unity Catalog, Hive
Partition evolution	Yes (metadata-only)	Limited
Hidden partitioning	Yes	No
Community	Apache project	Linux Foundation

Both formats provide ACID transactions and time travel. The key difference is Iceberg’s broader engine support and more flexible partitioning.

Performance Considerations¶

Manifest Overhead¶

Iceberg tracks files through a hierarchy of manifests:

Read current metadata file
Read manifest list (points to manifests)
Read relevant manifest files
Filter to needed data files based on partition pruning
Read data files

Overhead characteristics:

Grows with file count and table history
Mitigated by manifest caching (engine-dependent)
Partition pruning reduces manifest reads

For TPC-H benchmarks with standard data loads, manifest overhead is typically under 50ms. For tables with millions of files, consider compaction.

Mitigation Strategies¶

Use partition pruning to reduce manifest reads
Enable manifest caching where available
Compact manifests for tables with many small files
Use expire_snapshots to clean up historical metadata

Advanced Partitioning¶

Partition Evolution¶

Iceberg allows changing partition schemes without rewriting data:

-- Original table partitioned by date
CREATE TABLE events (...) PARTITIONED BY (event_date);

-- Add hour-level partitioning for new data (no rewrite)
ALTER TABLE events ADD PARTITION FIELD hour(event_time);

Benchmark implications:

Partition evolution is metadata-only (fast)
Queries automatically use appropriate partitioning per file
Historical data keeps its original partitioning

Hidden Partitioning¶

Iceberg can partition on transforms without exposing partition columns in queries:

-- Create table partitioned by month transform
CREATE TABLE orders
  PARTITIONED BY (month(order_date))
  AS SELECT * FROM raw_orders;

-- Query by exact date, Iceberg applies partition pruning automatically
SELECT * FROM orders WHERE order_date = '2024-03-15';

Benchmark implications:

Simpler query writing (no partition column gymnastics)
Partition pruning happens automatically
Consistent query behavior across engines

Partition Transforms¶

Iceberg supports several partition transforms:

Transform	Example	Use Case
identity	`identity(region)`	Exact value partitioning
year	`year(date_col)`	Annual aggregations
month	`month(date_col)`	Monthly aggregations
day	`day(date_col)`	Daily partitions
hour	`hour(timestamp_col)`	Hourly partitions
bucket	`bucket(100, id)`	Distribute by hash
truncate	`truncate(10, string_col)`	Prefix partitioning

Catalog Configuration¶

REST Catalog¶

The REST catalog is the most portable option, working across engines:

benchbox run --platform spark --format iceberg --catalog rest

AWS Glue¶

For AWS-native workflows with Athena, EMR:

benchbox run --platform spark --format iceberg --catalog glue

Hive Metastore¶

For Spark/Hadoop ecosystems:

benchbox run --platform spark --format iceberg --catalog hive

Nessie¶

For git-like versioning (experimental):

benchbox run --platform spark --format iceberg --catalog nessie

Catalog	Use Case	BenchBox Support
REST Catalog	Standard, portable	Default
Hive Metastore	Spark/Hadoop ecosystems	Supported
AWS Glue	AWS-native (Athena, EMR)	Supported
Nessie	Git-like versioning	Experimental

Multi-Engine Workflows¶

Write with One Engine, Read with Another¶

A key Iceberg benefit: load data with Spark, query with Trino (or vice versa).

BenchBox multi-engine benchmark:

# 1. Generate and load data with Spark
benchbox run --platform spark --benchmark tpch --scale 10 \
  --format iceberg --phases load

# 2. Run queries with Trino (same Iceberg tables)
benchbox run --platform trino --benchmark tpch --scale 10 \
  --format iceberg --phases power

# 3. Run queries with Athena (same tables via Glue)
benchbox run --platform athena --benchmark tpch --scale 10 \
  --format iceberg --phases power

Ensuring Consistent Results¶

BenchBox validates query results against reference answers, catching:

Type handling differences between engines
NULL behavior variations
Ordering differences

# Compare results from different engines
benchbox compare-results spark-results.json trino-results.json --validate

Common Multi-Engine Issues¶

1. Schema evolution lag

Some engines support new Iceberg schema features before others. If you add a column with Spark, older Trino versions might not see it.

2. Partition pruning variations

Partition pruning implementation differs by engine. One engine might prune more aggressively than another.

3. Statistics availability

Column-level statistics may not be shared across all engines. This affects query optimization.

4. Caching effects

Engine-specific caching can skew comparisons. BenchBox uses cold cache between runs to mitigate this.

When to Use Iceberg¶

Best-Fit Scenarios¶

Scenario	Why Iceberg
Multi-engine comparison	Same data, fair comparison
Lakehouse architectures	Production-like conditions
AWS-native platforms	Athena, EMR native support
Cross-platform portability	Engine-independent format

When to Use Parquet Instead¶

Scenario	Why Parquet
Single-engine benchmarks	Simpler, no catalog needed
Local development	No external catalog required
Quick comparisons	Less setup overhead
DuckDB-only	Native Parquet is faster