convert - Convert Data Formats

Tags: reference, cli

Convert benchmark data from TBL (pipe-delimited) format to optimized columnar and table formats: Apache Parquet, Delta Lake, or Apache Iceberg.

Basic Syntax

benchbox convert --input PATH --format FORMAT [OPTIONS]

Why Convert Formats?

Converting TPC benchmark data to columnar formats provides several benefits:

  • Query Performance: Faster query execution due to columnar storage and compression (improvement varies by query selectivity)

  • Storage Efficiency: Better compression ratio compared to raw TBL files (typically 3-5x)

  • Platform Compatibility: Native support in modern data platforms (Databricks, Snowflake, etc.)

  • Analytics Features: Partition pruning, predicate pushdown, and statistics-based optimization
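
A quick illustration of the columnar advantage: an engine can read only the columns a query touches instead of whole rows. A minimal PyArrow sketch (file path illustrative):

import pyarrow.parquet as pq

# Columnar layout means only the two requested columns are read from disk,
# not all 16 columns of a lineitem row as in the pipe-delimited TBL file
table = pq.read_table(
    './data/lineitem.parquet',
    columns=['l_shipdate', 'l_extendedprice'],
)
print(table.num_rows, table.schema.names)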

Core Options

Required:

  • --input PATH: Input directory containing benchmark data and _datagen_manifest.json

  • --format FORMAT: Target format (parquet, delta, iceberg)

Output:

  • --output PATH: Output directory (default: same as input)

Compression:

  • --compression CODEC: Compression algorithm (default: snappy)

    • snappy - Fast compression, moderate ratio (default)

    • gzip - Better ratio, slower compression

    • zstd - Best ratio, moderate speed

    • none - No compression

Partitioning:

  • --partition COLUMN: Column(s) to partition by (can be specified multiple times)

  • Creates Hive-style partitioning (column=value/ directories; see the example layout under Partitioned Output below)

    • Enables partition pruning for filtered queries

Schema:

  • --benchmark NAME: Benchmark name for schema lookup (auto-detected from manifest if not specified)

Validation:

  • --validate/--no-validate: Validate row counts after conversion (default: enabled)

    • Ensures data integrity and TPC compliance

    • Disable with --no-validate for faster conversion (not TPC compliant)

Debugging:

  • --verbose, -v: Enable verbose output

Supported Formats

Apache Parquet

Columnar storage format with efficient compression and encoding.

benchbox convert --input ./data/tpch_sf1 --format parquet

Characteristics:

  • Single file per table (or partitioned directory)

  • Excellent query engine support (DuckDB, Spark, Polars, etc.)

  • Row group statistics for predicate pushdown

  • No external dependencies required (uses PyArrow)
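
The row group statistics mentioned above can be inspected directly with PyArrow; engines compare these min/max values against query predicates to skip row groups that cannot match (path illustrative):

import pyarrow.parquet as pq

meta = pq.ParquetFile('./data/orders.parquet').metadata
# Per-column min/max and null counts, written into the Parquet footer
stats = meta.row_group(0).column(0).statistics
print(stats.min, stats.max, stats.null_count)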

Delta Lake

Open table format with ACID transactions and time travel.

benchbox convert --input ./data/tpch_sf1 --format delta

Characteristics:

  • Directory-based format with _delta_log/ transaction log

  • ACID transactions and concurrent writes

  • Time travel (query historical versions)

  • Schema evolution support

  • Requires: deltalake package (uv add deltalake)
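
Time travel is available through the deltalake package; a minimal sketch, assuming the table has been written at least once (path illustrative):

from deltalake import DeltaTable

# Each write appends a new version to _delta_log/;
# version=0 reads the table as of its first commit
dt = DeltaTable('./data/lineitem', version=0)
print(dt.version(), dt.to_pandas().shape)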

Apache Iceberg

Modern table format with hidden partitioning and schema evolution.

benchbox convert --input ./data/tpch_sf1 --format iceberg

Characteristics:

  • Directory-based format with metadata/ directory

  • Hidden partitioning (partition values derive from column transforms, so queries filter on ordinary columns and still benefit from pruning)

  • Schema evolution without rewriting data

  • Snapshot isolation

  • Requires: pyiceberg package (uv add pyiceberg)

Examples

Basic Conversion

# Convert TPC-H data to Parquet with default settings
benchbox convert --input ./benchmark_runs/tpch_sf1 --format parquet

# Convert to Delta Lake
benchbox convert --input ./benchmark_runs/tpch_sf1 --format delta

Compression Options

# Use Zstd for best compression ratio
benchbox convert --input ./data/tpch_sf10 --format parquet --compression zstd

# Use gzip for compatibility with older systems
benchbox convert --input ./data/tpch_sf10 --format parquet --compression gzip

# No compression for debugging
benchbox convert --input ./data/tpch_sf1 --format parquet --compression none
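
To confirm which codec a converted file actually uses, inspect the Parquet footer with PyArrow (path illustrative):

import pyarrow.parquet as pq

# Each column chunk records its codec: SNAPPY, GZIP, ZSTD, or UNCOMPRESSED
meta = pq.ParquetFile('./data/tpch_sf10/customer.parquet').metadata
print(meta.row_group(0).column(0).compression)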

Partitioned Output

# Partition lineitem table by ship date
benchbox convert --input ./data/tpch_sf1 --format parquet \
    --partition l_shipdate

# Multiple partition columns (hierarchical)
benchbox convert --input ./data/tpch_sf1 --format parquet \
    --partition l_returnflag --partition l_linestatus

# Partition Delta Lake by date
benchbox convert --input ./data/tpch_sf10 --format delta \
    --partition l_shipdate
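
With Hive-style partitioning, each partition value becomes a directory. For the first example above, the output looks roughly like this (file names illustrative):

lineitem/
  l_shipdate=1992-01-02/
    part-0.parquet
  l_shipdate=1992-01-03/
    part-0.parquet
  ...

A filter such as WHERE l_shipdate = DATE '1992-01-02' then only reads the matching directory (partition pruning).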

Performance Optimization

# Skip validation for faster conversion (not TPC compliant)
benchbox convert --input ./data/tpch_sf100 --format parquet --no-validate

# Verbose output for debugging
benchbox convert --input ./data/tpch_sf1 --format parquet --verbose

Separate Output Directory

# Convert to a different directory
benchbox convert --input ./raw_data/tpch_sf1 --format parquet \
    --output ./converted_data/tpch_sf1_parquet

Output

The convert command displays progress and summary information:

Converting to PARQUET
Input: ./benchmark_runs/tpch_sf1
Compression: snappy
Row validation: enabled

Converting 8 tables...
  ✓ customer: 150,000 rows, compression: 2.45x
  ✓ lineitem: 6,001,215 rows, compression: 3.12x
  ✓ nation: 25 rows, compression: 1.89x
  ✓ orders: 1,500,000 rows, compression: 2.87x
  ✓ part: 200,000 rows, compression: 2.34x
  ✓ partsupp: 800,000 rows, compression: 2.56x
  ✓ region: 5 rows, compression: 1.67x
  ✓ supplier: 10,000 rows, compression: 2.23x

Summary:
  Tables converted: 8
  Total rows: 8,661,245
  Source size: 1024.5 MB
  Output size: 342.1 MB
  Overall compression: 2.99x
  Manifest updated: ./benchmark_runs/tpch_sf1/_datagen_manifest.json

Using Converted Data

With DuckDB

-- Read Parquet file
SELECT * FROM read_parquet('./data/customer.parquet');

-- Read partitioned Parquet dataset
SELECT * FROM read_parquet('./data/lineitem/**/*.parquet', hive_partitioning=true);

-- Read Delta Lake table (requires delta extension)
INSTALL delta;
LOAD delta;
SELECT * FROM delta_scan('./data/lineitem');
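
DuckDB's iceberg extension can read the Iceberg output in the same way; a sketch, assuming the extension is available in your DuckDB build:

-- Read Apache Iceberg table (requires iceberg extension)
INSTALL iceberg;
LOAD iceberg;
SELECT * FROM iceberg_scan('./data/lineitem');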

With Python

import duckdb
import pyarrow.parquet as pq

# Read Parquet with PyArrow
table = pq.read_table('./data/customer.parquet')
df = table.to_pandas()

# Read Parquet with DuckDB
conn = duckdb.connect()
df = conn.execute("SELECT * FROM read_parquet('./data/customer.parquet')").fetchdf()

# Read Delta Lake
from deltalake import DeltaTable
dt = DeltaTable('./data/customer')
df = dt.to_pandas()
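
The Iceberg output can be read without a catalog via pyiceberg's StaticTable; a minimal sketch, assuming the table's current metadata file (both paths illustrative):

# Read Apache Iceberg (catalog-free, read-only)
from pyiceberg.table import StaticTable

tbl = StaticTable.from_metadata('./data/customer/metadata/v1.metadata.json')
df = tbl.scan().to_pandas()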

Prerequisites

The convert command requires:

  1. Generated benchmark data: Run benchbox run --phases generate first

  2. Valid manifest: _datagen_manifest.json in the input directory

For Delta Lake and Iceberg formats, install optional dependencies:

# Delta Lake support
uv add deltalake

# Apache Iceberg support
uv add pyiceberg

Troubleshooting

“Manifest not found”

Ensure you’ve generated benchmark data first:

benchbox run --platform duckdb --benchmark tpch --scale 1 --phases generate

“Could not get schemas from benchmark”

The benchmark type couldn’t be auto-detected. Specify it explicitly:

benchbox convert --input ./data --format parquet --benchmark tpch

“deltalake package not installed”

Install the Delta Lake package:

uv add deltalake

“pyiceberg package not installed”

Install the PyIceberg package:

uv add pyiceberg

See Also