Delta Lake Guide¶
Delta Lake adds transaction logs and ACID semantics on top of Parquet. This guide covers how transaction overhead affects benchmarks, when to use OPTIMIZE and Z-ORDER, and Databricks-specific considerations.
Overview¶
Delta Lake started as Databricks’ solution to the “data lake reliability problem”: too many small files, no consistency guarantees, and no way to roll back bad data loads. It solves these problems by adding a transaction log on top of Parquet.
For benchmarking, Delta Lake matters because:
- It’s the default format on Databricks (a major analytics platform)
- Transaction overhead affects measured performance
- OPTIMIZE and Z-ORDER change query characteristics
- You might want to compare Delta vs raw Parquet
We added Delta Lake support to BenchBox because Databricks is one of the most widely-used analytics platforms, and understanding Delta’s performance characteristics helps interpret results.
How Delta Lake Works¶
Transaction Log Architecture¶
Delta Lake stores data in Parquet files, just like plain Parquet. The difference is the _delta_log directory, which tracks:
- Which Parquet files belong to the table
- Table schema and metadata
- Transaction history (enabling time travel)
```text
delta_table/
├── _delta_log/
│   ├── 00000000000000000000.json   # Transaction 0: create table
│   ├── 00000000000000000001.json   # Transaction 1: insert data
│   └── 00000000000000000002.json   # Transaction 2: more inserts
├── part-00000-xxxx.parquet
├── part-00001-xxxx.parquet
└── part-00002-xxxx.parquet
```
Every read starts by consulting the transaction log to identify valid files. Every write creates a new log entry.
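For intuition, here is a minimal Python sketch that replays the JSON log entries to recover the current file list. The `add`/`remove` action names come from the Delta transaction protocol; the sketch assumes a local table path and deliberately ignores checkpoints and newer protocol features, which a real Delta reader must handle.

```python
import json
from pathlib import Path


def live_files(table_path: str) -> set[str]:
    """Replay _delta_log JSON commits to find the Parquet files currently in the table.

    Simplified sketch: ignores checkpoint files and assumes every commit is a
    plain JSON file of newline-delimited actions.
    """
    files: set[str] = set()
    log_dir = Path(table_path) / "_delta_log"
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])          # file became part of the table
            elif "remove" in action:
                files.discard(action["remove"]["path"])   # file was logically deleted
    return files


print(live_files("delta_table"))
```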
ACID Guarantees for Benchmarking¶
| Property | What it means | Benchmark implication |
|---|---|---|
| Atomicity | Writes succeed or fail completely | No partial data during multi-table loads |
| Consistency | Schema enforced on write | No type mismatches between runs |
| Isolation | Concurrent reads/writes don’t conflict | Reproducible during multi-phase benchmarks |
| Durability | Committed data persists | Results survive process crashes |
For benchmarking, ACID guarantees matter most during data loading. With Delta Lake, a failed load doesn’t leave partial data. With raw Parquet, you might have orphaned files to clean up.
Delta Lake Ecosystem¶
| Variant | Maintainer | Notes |
|---|---|---|
| Delta Lake OSS | Linux Foundation | Core features, open protocol |
| Databricks Delta | Databricks | Photon optimization, Unity Catalog |
| delta-rs | Delta Lake project | Rust implementation, DuckDB/Polars |
BenchBox supports Delta Lake through:
- Native Databricks integration (full feature set)
- delta-rs for DuckDB (read-only via extension)
- Spark with open-source Delta Lake
Performance Considerations¶
Transaction Log Overhead¶
Every Delta Lake read starts with the transaction log:
1. Read latest checkpoint (if exists)
2. Read transaction log entries since checkpoint
3. Build list of valid Parquet files
4. Read Parquet files
Overhead characteristics:
- Small tables: Negligible (a few milliseconds)
- Many transactions: Grows linearly with log entries
- Mitigated by checkpoints: Every 10 transactions by default
For TPC-H benchmarks with a single data load, the overhead is typically under 10ms. For tables with thousands of transactions, consider creating a checkpoint or running OPTIMIZE before measuring.
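If you generate tables with the deltalake Python library (as BenchBox does for the DuckDB path), you can also force a checkpoint yourself rather than waiting for the default interval. A minimal sketch, assuming a local table path:

```python
from deltalake import DeltaTable

dt = DeltaTable("delta_table")
print(f"version before checkpoint: {dt.version()}")

# Collapse the JSON log entries accumulated so far into a Parquet checkpoint,
# so readers no longer have to replay every commit individually.
dt.create_checkpoint()
```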
OPTIMIZE: File Compaction¶
Delta Lake’s OPTIMIZE command compacts small files into larger ones:
```sql
OPTIMIZE lineitem;
```
After a data load that creates many small files, OPTIMIZE consolidates them. This reduces:
- File listing overhead
- Number of file opens
- Transaction log size
Benchmark results: Queries against a freshly loaded Delta table (no OPTIMIZE) run about 12% slower than against equivalent Parquet. After OPTIMIZE, the difference drops to ~2%.
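On Databricks, the SQL command above is all you need. For tables generated with the deltalake Python library, compaction is available there too; a minimal sketch, assuming a local lineitem table path (the 256 MB target size is an illustrative choice, not a recommendation):

```python
from deltalake import DeltaTable

dt = DeltaTable("delta_table/lineitem")

# Rewrite many small files into fewer large ones, similar to SQL OPTIMIZE.
metrics = dt.optimize.compact(target_size=256 * 1024 * 1024)
print(metrics)  # files added/removed and their sizes
```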
Z-ORDER: Data Clustering¶
Z-ORDER clusters data by specified columns:
```sql
OPTIMIZE lineitem ZORDER BY (l_shipdate, l_orderkey);
```
Z-ORDER reorders data so rows with similar values are stored together. Queries filtering on Z-ORDER columns skip more data.
Benchmark results: For date-filtered queries, Z-ORDER provides 17-30% improvement:
| Query | Without Z-ORDER | With Z-ORDER | Improvement |
|---|---|---|---|
| Q1 | 1.8s | 1.5s | 17% |
| Q4 | 0.9s | 0.7s | 22% |
| Q6 | 0.5s | 0.35s | 30% |
Q6 benefits most because it filters on l_shipdate with a narrow range. Z-ORDER clusters matching rows together, reducing data scanned.
Trade-off: Z-ORDER adds significant write overhead. For TPC-H SF10 lineitem, Z-ORDER takes 3-5 minutes on a small Databricks cluster.
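The same trade-off can be reproduced outside Databricks: the deltalake Python library also exposes Z-ordering, so you can measure the write cost yourself. A sketch, assuming a local lineitem table:

```python
from deltalake import DeltaTable

dt = DeltaTable("delta_table/lineitem")

# Rewrites the data files clustered on the Z-ORDER columns; expect this to take
# noticeably longer than plain compaction on large tables.
metrics = dt.optimize.z_order(["l_shipdate", "l_orderkey"])
print(metrics)
```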
Databricks-Specific Features¶
Photon Engine¶
Databricks Delta includes Photon, a C++ vectorized execution engine optimized for Delta tables. Photon provides significant speedups for scan-heavy queries.
For benchmarks, you may want to:
- Disable Photon to measure baseline performance
- Enable Photon for realistic production comparisons
- Document which engine was used in your methodology
Auto-Compaction¶
Databricks can run OPTIMIZE automatically in the background for streaming tables. For benchmarking, consider disabling auto-compaction to measure baseline performance and have full control over when compaction occurs.
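One way to do this is per table, via Delta table properties. A PySpark sketch, assuming the ambient `spark` session on a Databricks notebook or job and a table named `lineitem`; the `delta.autoOptimize.*` property names are the documented switches, but confirm them against the runtime version you benchmark on:

```python
# Turn off optimized writes and auto-compaction for one table so that file
# compaction only happens when the benchmark explicitly runs OPTIMIZE.
spark.sql("""
    ALTER TABLE lineitem SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'false',
        'delta.autoOptimize.autoCompact'   = 'false'
    )
""")
```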
Liquid Clustering¶
Liquid Clustering is Databricks’ newer alternative to Z-ORDER. It provides dynamic clustering without requiring explicit OPTIMIZE commands. This is relevant for streaming workloads but less common in TPC benchmarking scenarios.
DuckDB Delta Support¶
delta-rs Extension¶
As of January 2026, DuckDB’s delta extension (built on delta-rs):
- Supports read operations (SELECT, aggregations, joins)
- Supports Delta Lake protocol versions 1 and 2
- Does not support write operations (load via Python)
- May have compatibility issues with the newest Delta features
Limitations¶
DuckDB Delta support is read-only. BenchBox generates Delta tables using the deltalake Python library, then queries with DuckDB. This workflow is useful for testing Delta read performance across platforms.
```bash
# Install delta extension (one-time)
duckdb -c "INSTALL delta; LOAD delta;"

# Run with Delta Lake format
benchbox run --platform duckdb --benchmark tpch --scale 1 --format delta
```
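A minimal sketch of that write-with-Python, read-with-DuckDB workflow; the table path and data below are placeholders, not what BenchBox generates:

```python
import duckdb
import pyarrow as pa
from deltalake import write_deltalake

# Write a tiny Delta table with the deltalake library (DuckDB cannot write Delta).
table = pa.table({"l_orderkey": [1, 2, 3], "l_quantity": [17.0, 36.0, 8.0]})
write_deltalake("delta_table/lineitem_mini", table)

# Read it back through DuckDB's delta extension.
con = duckdb.connect()
con.execute("INSTALL delta")
con.execute("LOAD delta")
print(con.sql("SELECT count(*) FROM delta_scan('delta_table/lineitem_mini')"))
```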
Open-Source vs Databricks Delta¶
| Aspect | OSS Delta Lake | Databricks Delta |
|---|---|---|
| OPTIMIZE | Manual | Auto-compact option |
| Z-ORDER | Supported | Enhanced (Liquid Clustering) |
| Caching | Standard | Photon acceleration |
| Protocol | Same | Same |
Results are comparable across implementations when using the same Delta protocol version and disabling Databricks-specific optimizations.
When to Use Delta Lake¶
Best-Fit Scenarios¶
| Scenario | Why Delta Lake |
|---|---|
| Databricks benchmarks | Native format, optimized |
| Multi-phase loads | ACID ensures consistency |
| Load + query workloads | Measure end-to-end |
| Time travel testing | Benchmark historical queries |
When to Use Parquet Instead¶
| Scenario | Why Parquet |
|---|---|
| Cross-platform comparison | Universal support |
| Maximum simplicity | No transaction overhead |
| Storage efficiency focus | No metadata overhead |
| Non-Spark/Databricks | Limited Delta support |
Benchmark Design Tips¶
- Include OPTIMIZE in the load phase: Measure realistic query performance
- Measure with and without Z-ORDER: Quantify the benefit for your queries
- Track transaction log size: Grows with table history
- Use VACUUM before benchmarks: Remove old versions for a clean baseline (see the sketch after the commands below)
```bash
# Enable OPTIMIZE after load (default on Databricks)
benchbox run --platform databricks --benchmark tpch --format delta --optimize

# Skip OPTIMIZE (measure raw load performance)
benchbox run --platform databricks --benchmark tpch --format delta --no-optimize

# Enable Z-ORDER on specific columns
benchbox run --platform databricks --benchmark tpch --format delta \
  --zorder="l_shipdate,l_orderkey"
```
See Also¶
- Table Format Guides: Overview of all formats
- Format Conversion Reference: CLI commands for format conversion
- Parquet Deep Dive: Foundation format for Delta Lake
- Apache Iceberg Guide: Alternative table format