Delta Lake Guide¶
Delta Lake adds transaction logs and ACID semantics on top of Parquet. This guide covers how transaction overhead affects benchmarks, when to use OPTIMIZE and Z-ORDER, and Databricks-specific considerations.
Overview¶
Delta Lake started as Databricks’ solution to the “data lake reliability problem”: too many small files, no consistency guarantees, and no way to roll back bad data loads. It solves these problems by adding a transaction log on top of Parquet.
For benchmarking, Delta Lake matters because:
- It’s the default format on Databricks (a major analytics platform)
- Transaction overhead affects measured performance
- OPTIMIZE and Z-ORDER change query characteristics
- You might want to compare Delta vs raw Parquet
We added Delta Lake support to BenchBox because Databricks is one of the most widely-used analytics platforms, and understanding Delta’s performance characteristics helps interpret results.
How Delta Lake Works¶
Transaction Log Architecture¶
Delta Lake stores data in Parquet files, just like plain Parquet. The difference is the _delta_log directory, which tracks:
- Which Parquet files belong to the table
- Table schema and metadata
- Transaction history (enabling time travel)
```text
delta_table/
├── _delta_log/
│   ├── 00000000000000000000.json   # Transaction 0: create table
│   ├── 00000000000000000001.json   # Transaction 1: insert data
│   └── 00000000000000000002.json   # Transaction 2: more inserts
├── part-00000-xxxx.parquet
├── part-00001-xxxx.parquet
└── part-00002-xxxx.parquet
```
Every read starts by consulting the transaction log to identify valid files. Every write creates a new log entry.
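For intuition, here is a minimal Python sketch that replays the JSON log entries to recover the current file list. The `add`/`remove` action names come from the Delta transaction protocol; the sketch assumes a local table path and deliberately ignores checkpoints and newer protocol features, which a real Delta reader must handle.

```python
import json
from pathlib import Path


def live_files(table_path: str) -> set[str]:
    """Replay _delta_log JSON commits to find the Parquet files currently in the table.

    Simplified sketch: ignores checkpoint files and assumes every commit is a
    plain JSON file of newline-delimited actions.
    """
    files: set[str] = set()
    log_dir = Path(table_path) / "_delta_log"
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])          # file became part of the table
            elif "remove" in action:
                files.discard(action["remove"]["path"])   # file was logically deleted
    return files


print(live_files("delta_table"))
```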
ACID Guarantees for Benchmarking¶
| Property | What it means | Benchmark implication |
|---|---|---|
| Atomicity | Writes succeed or fail completely | No partial data during multi-table loads |
| Consistency | Schema enforced on write | No type mismatches between runs |
| Isolation | Concurrent reads/writes don’t conflict | Reproducible during multi-phase benchmarks |
| Durability | Committed data persists | Results survive process crashes |
For benchmarking, ACID guarantees matter most during data loading. With Delta Lake, a failed load doesn’t leave partial data. With raw Parquet, you might have orphaned files to clean up.
Delta Lake Ecosystem¶
| Variant | Maintainer | Notes |
|---|---|---|
| Delta Lake OSS | Linux Foundation | Core features, open protocol |
| Databricks Delta | Databricks | Photon optimization, Unity Catalog |
| delta-rs | Delta Lake project | Rust implementation, DuckDB/Polars |
BenchBox supports Delta Lake through:
- Native Databricks integration (full feature set)
- delta-rs for DuckDB (read-only via extension)
- Spark with open-source Delta Lake
Performance Considerations¶
Transaction Log Overhead¶
Every Delta Lake read starts with the transaction log:
1. Read latest checkpoint (if exists)
2. Read transaction log entries since checkpoint
3. Build list of valid Parquet files
4. Read Parquet files
Overhead characteristics:
- Small tables: Negligible (a few milliseconds)
- Many transactions: Grows linearly with log entries
- Mitigated by checkpoints: Every 10 transactions by default
For TPC-H benchmarks with a single data load, the overhead is typically under 10ms. For tables with thousands of transactions, consider creating a checkpoint or running OPTIMIZE before measuring.
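If you generate tables with the deltalake Python library (as BenchBox does for the DuckDB path), you can also force a checkpoint yourself rather than waiting for the default interval. A minimal sketch, assuming a local table path:

```python
from deltalake import DeltaTable

dt = DeltaTable("delta_table")
print(f"version before checkpoint: {dt.version()}")

# Collapse the JSON log entries accumulated so far into a Parquet checkpoint,
# so readers no longer have to replay every commit individually.
dt.create_checkpoint()
```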
OPTIMIZE: File Compaction¶
Delta Lake’s OPTIMIZE command compacts small files into larger ones:
```sql
OPTIMIZE lineitem;
```
After a data load that creates many small files, OPTIMIZE consolidates them. This reduces:
- File listing overhead
- Number of file opens
- Transaction log size
Benchmark results: Queries against a freshly loaded Delta table (no OPTIMIZE) run about 12% slower than against equivalent Parquet. After OPTIMIZE, the difference drops to ~2%.
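On Databricks, the SQL command above is all you need. For tables generated with the deltalake Python library, compaction is available there too; a minimal sketch, assuming a local lineitem table path (the 256 MB target size is an illustrative choice, not a recommendation):

```python
from deltalake import DeltaTable

dt = DeltaTable("delta_table/lineitem")

# Rewrite many small files into fewer large ones, similar to SQL OPTIMIZE.
metrics = dt.optimize.compact(target_size=256 * 1024 * 1024)
print(metrics)  # files added/removed and their sizes
```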
Z-ORDER: Data Clustering¶
Z-ORDER clusters data by specified columns:
```sql
OPTIMIZE lineitem ZORDER BY (l_shipdate, l_orderkey);
```
Z-ORDER reorders data so rows with similar values are stored together. Queries filtering on Z-ORDER columns skip more data.
Benchmark results: For date-filtered queries, Z-ORDER provides 17-30% improvement:
| Query | Without Z-ORDER | With Z-ORDER | Improvement |
|---|---|---|---|
| Q1 | 1.8s | 1.5s | 17% |
| Q4 | 0.9s | 0.7s | 22% |
| Q6 | 0.5s | 0.35s | 30% |
Q6 benefits most because it filters on l_shipdate with a narrow range. Z-ORDER clusters matching rows together, reducing data scanned.
Trade-off: Z-ORDER adds significant write overhead. For TPC-H SF10 lineitem, Z-ORDER takes 3-5 minutes on a small Databricks cluster.
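The same trade-off can be reproduced outside Databricks: the deltalake Python library also exposes Z-ordering, so you can measure the write cost yourself. A sketch, assuming a local lineitem table:

```python
from deltalake import DeltaTable

dt = DeltaTable("delta_table/lineitem")

# Rewrites the data files clustered on the Z-ORDER columns; expect this to take
# noticeably longer than plain compaction on large tables.
metrics = dt.optimize.z_order(["l_shipdate", "l_orderkey"])
print(metrics)
```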
Databricks-Specific Features¶
Photon Engine¶
Databricks Delta includes Photon, a C++ vectorized execution engine optimized for Delta tables. Photon provides significant speedups for scan-heavy queries.
For benchmarks, you may want to:
- Disable Photon to measure baseline performance
- Enable Photon for realistic production comparisons
- Document which engine was used in your methodology
Auto-Compaction¶
Databricks can run OPTIMIZE automatically in the background for streaming tables. For benchmarking, consider disabling auto-compaction to measure baseline performance and have full control over when compaction occurs.
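One way to do this is per table, via Delta table properties. A PySpark sketch, assuming the ambient `spark` session on a Databricks notebook or job and a table named `lineitem`; the `delta.autoOptimize.*` property names are the documented switches, but confirm them against the runtime version you benchmark on:

```python
# Turn off optimized writes and auto-compaction for one table so that file
# compaction only happens when the benchmark explicitly runs OPTIMIZE.
spark.sql("""
    ALTER TABLE lineitem SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'false',
        'delta.autoOptimize.autoCompact'   = 'false'
    )
""")
```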
Liquid Clustering¶
Liquid Clustering is Databricks’ newer alternative to Z-ORDER. It provides dynamic clustering without requiring explicit OPTIMIZE commands. This is relevant for streaming workloads but less common in TPC benchmarking scenarios.
DuckDB Delta Support¶
delta-rs Extension¶
As of January 2026, DuckDB’s delta extension (built on delta-rs):
- Supports read operations (SELECT, aggregations, joins)
- Supports Delta Lake protocol versions 1 and 2
- Does not support write operations (load via Python)
- May have compatibility issues with the newest Delta features
Limitations¶
DuckDB Delta support is read-only. BenchBox generates Delta tables using the deltalake Python library, then queries with DuckDB. This workflow is useful for testing Delta read performance across platforms.
```bash
# Install delta extension (one-time)
duckdb -c "INSTALL delta; LOAD delta;"

# Run with Delta Lake format
benchbox run --platform duckdb --benchmark tpch --scale 1 --format delta
```
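A minimal sketch of that write-with-Python, read-with-DuckDB workflow; the table path and data below are placeholders, not what BenchBox generates:

```python
import duckdb
import pyarrow as pa
from deltalake import write_deltalake

# Write a tiny Delta table with the deltalake library (DuckDB cannot write Delta).
table = pa.table({"l_orderkey": [1, 2, 3], "l_quantity": [17.0, 36.0, 8.0]})
write_deltalake("delta_table/lineitem_mini", table)

# Read it back through DuckDB's delta extension.
con = duckdb.connect()
con.execute("INSTALL delta")
con.execute("LOAD delta")
print(con.sql("SELECT count(*) FROM delta_scan('delta_table/lineitem_mini')"))
```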
Open-Source vs Databricks Delta¶
| Aspect | OSS Delta Lake | Databricks Delta |
|---|---|---|
| OPTIMIZE | Manual | Auto-compact option |
| Z-ORDER | Supported | Enhanced (Liquid Clustering) |
| Caching | Standard | Photon acceleration |
| Protocol | Same | Same |
Results are comparable across implementations when using the same Delta protocol version and disabling Databricks-specific optimizations.
When to Use Delta Lake¶
Best-Fit Scenarios¶
| Scenario | Why Delta Lake |
|---|---|
| Databricks benchmarks | Native format, optimized |
| Multi-phase loads | ACID ensures consistency |
| Load + query workloads | Measure end-to-end |
| Time travel testing | Benchmark historical queries |
When to Use Parquet Instead¶
| Scenario | Why Parquet |
|---|---|
| Cross-platform comparison | Universal support |
| Maximum simplicity | No transaction overhead |
| Storage efficiency focus | No metadata overhead |
| Non-Spark/Databricks | Limited Delta support |
Benchmark Design Tips¶
- Include OPTIMIZE in the load phase: Measure realistic query performance
- Measure with and without Z-ORDER: Quantify the benefit for your queries
- Track transaction log size: Grows with table history
- Use VACUUM before benchmarks: Remove old versions for a clean baseline (see the sketch after the commands below)
```bash
# Enable OPTIMIZE after load (default on Databricks)
benchbox run --platform databricks --benchmark tpch --format delta --optimize

# Skip OPTIMIZE (measure raw load performance)
benchbox run --platform databricks --benchmark tpch --format delta --no-optimize

# Enable Z-ORDER on specific columns
benchbox run --platform databricks --benchmark tpch --format delta \
  --zorder="l_shipdate,l_orderkey"
```
See Also¶
- Table Format Guides: Overview of all formats
- Format Conversion Reference: CLI commands for format conversion
- Parquet Deep Dive: Foundation format for Delta Lake
- Apache Iceberg Guide: Alternative table format