DataFrame Benchmarking Quickstart¶
Run TPC-H using DataFrame APIs (Polars, Pandas) instead of SQL.
Why DataFrame Benchmarking?¶
Compare paradigms: SQL vs DataFrame execution on same queries
Data science workflows: Benchmark tools you actually use
No database required: Pure Python execution
Quick Start with Polars¶
benchbox run --platform polars-df --benchmark tpch --scale 0.1
This runs all 22 TPC-H queries using Polars’ DataFrame API.
Available DataFrame Platforms¶
Platform |
CLI Name |
Best For |
|---|---|---|
Polars |
|
High performance, production |
Pandas |
|
Compatibility, reference |
PySpark |
|
Distributed computing |
DataFusion |
|
Arrow-native processing |
SQL vs DataFrame Comparison¶
# SQL execution (DuckDB)
benchbox run --platform duckdb --benchmark tpch --scale 0.1 -o sql.json
# DataFrame execution (Polars)
benchbox run --platform polars-df --benchmark tpch --scale 0.1 -o polars.json
# Compare
benchbox compare sql.json polars.json
How It Works¶
BenchBox translates TPC-H queries into DataFrame operations:
SQL (TPC-H Q1):
SELECT l_returnflag, l_linestatus,
SUM(l_quantity), SUM(l_extendedprice)
FROM lineitem
WHERE l_shipdate <= DATE '1998-12-01' - INTERVAL 90 DAY
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus
DataFrame (Polars):
lineitem.filter(
pl.col("l_shipdate") <= date(1998, 9, 2)
).group_by(
"l_returnflag", "l_linestatus"
).agg(
pl.col("l_quantity").sum(),
pl.col("l_extendedprice").sum()
).sort("l_returnflag", "l_linestatus")
Both produce identical results.
Performance Tuning¶
Enable auto-tuning for optimal performance:
benchbox run --platform polars-df --benchmark tpch --df-tuning auto
Or use a custom configuration:
# View available settings
benchbox df-tuning show-defaults --platform polars
# Use custom tuning
benchbox run --platform polars-df --benchmark tpch --df-tuning tuning.yaml
Pandas Example¶
# Run with Pandas
benchbox run --platform pandas-df --benchmark tpch --scale 0.01
# Note: Pandas is slower than Polars for large datasets
# Use scale 0.01-0.1 for Pandas benchmarks
Cross-Platform DataFrame Comparison¶
# Compare DataFrame platforms
benchbox run --platform polars-df --benchmark tpch -o polars.json
benchbox run --platform pandas-df --benchmark tpch -o pandas.json
benchbox run --platform datafusion-df --benchmark tpch -o datafusion.json
benchbox compare polars.json pandas.json datafusion.json
TPC-DS Support¶
DataFrame mode also supports TPC-DS (99 queries):
benchbox run --platform polars-df --benchmark tpcds --scale 1
Note: TPC-DS requires scale factor >= 1 due to data generator constraints.
Programmatic Usage¶
from benchbox.platforms.dataframe import PolarsDataFrameAdapter
from benchbox import TPCH
adapter = PolarsDataFrameAdapter()
benchmark = TPCH(scale_factor=0.1)
# Run benchmark
results = benchmark.run_dataframe(adapter)
# Access results
for query_result in results.query_results:
print(f"{query_result.query_id}: {query_result.execution_time_ms:.1f}ms")
Next Steps¶
DataFrame Platform Guide - Detailed configuration
DataFrame Migration Guide - Adopting DataFrame mode
Tuning Reference - Performance optimization