Migrating to DataFrame Benchmarking¶
This guide explains how to adopt DataFrame-based benchmarking alongside or instead of SQL-based execution.
Overview¶
BenchBox supports two execution paradigms:
| Paradigm | When to Use | Example Platforms |
|---|---|---|
| SQL Mode | Database servers, cloud warehouses | Snowflake, BigQuery, DuckDB |
| DataFrame Mode | In-memory analytics, data science | Polars, Pandas, PySpark |
Both modes run the same benchmarks (TPC-H and TPC-DS), so results can be compared directly across paradigms.
Quick Migration¶
From SQL to DataFrame¶
SQL Mode:
benchbox run --platform duckdb --benchmark tpch --scale 0.1
DataFrame Mode:
benchbox run --platform polars-df --benchmark tpch --scale 0.1
The only change to the command is the platform name: DataFrame platforms are identified by a -df suffix.
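Because the two invocations differ only in the platform argument, a script that drives both modes can build them from one template. The sketch below assumes nothing beyond the flags shown above; `build_run_command` is a hypothetical helper, not part of BenchBox, and it only assembles the argument list without executing anything.

```python
import shlex


def build_run_command(platform: str, benchmark: str = "tpch", scale: str = "0.1") -> list[str]:
    """Assemble a `benchbox run` argument list (hypothetical helper, not a BenchBox API)."""
    return [
        "benchbox", "run",
        "--platform", platform,
        "--benchmark", benchmark,
        "--scale", scale,
    ]


sql_cmd = build_run_command("duckdb")
df_cmd = build_run_command("polars-df")

# Collect the positions where the two commands disagree: only the platform value.
differing = [(a, b) for a, b in zip(sql_cmd, df_cmd) if a != b]

print(shlex.join(sql_cmd))
print(shlex.join(df_cmd))
```

With BenchBox installed, either list could be passed to `subprocess.run(cmd, check=True)`.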
Platform Mapping¶
| SQL Platform | DataFrame Equivalent | Notes |
|---|---|---|
| DuckDB | | Similar performance profile |
| DataFusion | | Same engine, different API |
| PostgreSQL | | Reference implementation |
| Spark SQL | | Same cluster, different API |
Family Architecture¶
DataFrame platforms are organized into two families based on their API style:
Expression Family (Declarative)¶
Uses col() and lit() functions with method chaining:
# Polars syntax shown; PySpark and DataFusion use the same col()/lit() style with their own method names
df.filter(col("amount") > lit(100)).group_by("customer").agg(col("amount").sum())
Platforms: polars-df, pyspark-df, datafusion-df
Pandas Family (Imperative)¶
Uses string column access and boolean indexing:
# Pandas, Modin, Dask, cuDF
df[df["amount"] > 100].groupby("customer")["amount"].sum()
Platforms: pandas-df, modin-df, dask-df, cudf-df
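When scripting across several DataFrame platforms, it can help to know which family a platform belongs to before choosing a query style. The lookup below is a minimal sketch built only from the two platform lists above; `FAMILIES` and `family_of` are hypothetical names, and BenchBox may expose this information differently.

```python
# Platform names come from the two lists above; the helper itself is hypothetical.
FAMILIES = {
    "expression": ("polars-df", "pyspark-df", "datafusion-df"),
    "pandas": ("pandas-df", "modin-df", "dask-df", "cudf-df"),
}


def family_of(platform: str) -> str:
    """Return the API family ("expression" or "pandas") for a DataFrame platform name."""
    for family, platforms in FAMILIES.items():
        if platform in platforms:
            return family
    raise ValueError(f"not a known DataFrame platform: {platform!r}")


print(family_of("polars-df"))
print(family_of("dask-df"))
```

SQL platform names such as `duckdb` are deliberately rejected, since they belong to neither family.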
Data Compatibility¶
DataFrame and SQL modes share the same data files:
benchmark_runs/datagen/
├── tpch_sf0.1/ # Used by both modes
│ ├── lineitem.csv
│ ├── orders.csv
│ └── ...
No data regeneration required when switching between modes.
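One way to see why no regeneration is needed: the data location depends on the benchmark and scale factor, not on the platform. The sketch below assumes the directory naming `<benchmark>_sf<scale>` and the `.csv` extension, both inferred from the single `tpch_sf0.1` example above; the helper names are hypothetical, and the scale is passed as a string to avoid guessing how other scale factors are formatted.

```python
from pathlib import PurePosixPath

# Root directory as shown in the listing above.
DATAGEN_ROOT = PurePosixPath("benchmark_runs/datagen")


def datagen_dir(benchmark: str, scale: str) -> PurePosixPath:
    """Assumed naming: <benchmark>_sf<scale>, inferred from the tpch_sf0.1 example."""
    return DATAGEN_ROOT / f"{benchmark}_sf{scale}"


def table_file(benchmark: str, scale: str, table: str) -> PurePosixPath:
    """Path of one generated table file (CSV extension assumed from the listing)."""
    return datagen_dir(benchmark, scale) / f"{table}.csv"


# The path has no platform component, so duckdb and polars-df would read the same file.
shared = table_file("tpch", "0.1", "lineitem")
print(shared)
```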
Tuning Configuration¶
DataFrame platforms support performance tuning:
# View platform defaults
benchbox df-tuning show-defaults --platform polars
# Auto-detect optimal settings
benchbox run --platform polars-df --benchmark tpch --df-tuning auto
# Custom configuration
benchbox run --platform polars-df --benchmark tpch --df-tuning ./tuning.yaml
Example tuning file:
# tuning.yaml
platform: polars
settings:
  parallelism:
    n_threads: 8
  memory:
    streaming_enabled: true
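If tuning files are generated by a script (for example, to sweep thread counts), the nested layout can be written from a plain dictionary. This is a minimal sketch using only the standard library; it assumes that `parallelism` and `memory` nest under `settings` as the key names suggest, and `to_yaml_lines` is a hypothetical helper that handles only nested mappings of scalars, not general YAML.

```python
def to_yaml_lines(mapping: dict, indent: int = 0) -> list[str]:
    """Render a nested dict of scalars as YAML lines (deliberately minimal)."""
    pad = "  " * indent
    lines = []
    for key, value in mapping.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml_lines(value, indent + 1))
        else:
            # YAML booleans are lowercase, unlike Python's True/False.
            if isinstance(value, bool):
                value = "true" if value else "false"
            lines.append(f"{pad}{key}: {value}")
    return lines


# Keys and values mirror the example tuning file.
tuning = {
    "platform": "polars",
    "settings": {
        "parallelism": {"n_threads": 8},
        "memory": {"streaming_enabled": True},
    },
}

text = "\n".join(to_yaml_lines(tuning)) + "\n"
print(text, end="")
```

Writing `text` to `./tuning.yaml` would produce a file that can be passed to `--df-tuning`. For anything beyond flat scalars, a real YAML library is the safer choice.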
Programmatic Usage¶
SQL Mode¶
from benchbox import DuckDBAdapter, TPCH
adapter = DuckDBAdapter()
benchmark = TPCH(scale_factor=0.1)
results = benchmark.run(adapter)
DataFrame Mode¶
from benchbox.platforms.dataframe import PolarsDataFrameAdapter
from benchbox import TPCH
adapter = PolarsDataFrameAdapter()
benchmark = TPCH(scale_factor=0.1)
results = benchmark.run_dataframe(adapter)
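The two snippets differ in the adapter class and in the method called (`run` versus `run_dataframe`). A small wrapper can keep both calls side by side. The sketch below relies only on those two method names from the examples above; `run_both_modes` and `StubBenchmark` are hypothetical, the stub exists so the code runs without BenchBox installed, and whether a single `TPCH` instance can be reused for both calls is an assumption to verify.

```python
def run_both_modes(benchmark, sql_adapter, df_adapter) -> dict:
    """Run one benchmark object in SQL mode and in DataFrame mode."""
    return {
        "sql": benchmark.run(sql_adapter),
        "dataframe": benchmark.run_dataframe(df_adapter),
    }


class StubBenchmark:
    """Stand-in for TPCH; returns labels instead of real benchmark results."""

    def run(self, adapter):
        return f"sql via {adapter}"

    def run_dataframe(self, adapter):
        return f"dataframe via {adapter}"


results = run_both_modes(StubBenchmark(), "duckdb-adapter", "polars-adapter")
print(results)
```

In real use, the stub and the string placeholders would be replaced by `TPCH(scale_factor=0.1)`, `DuckDBAdapter()`, and `PolarsDataFrameAdapter()`.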
Cross-Paradigm Comparison¶
Run the same benchmark on both paradigms:
# SQL execution
benchbox run --platform duckdb --benchmark tpch --scale 1 -o sql_results.json
# DataFrame execution
benchbox run --platform polars-df --benchmark tpch --scale 1 -o df_results.json
# Compare results
benchbox compare sql_results.json df_results.json
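For custom analysis beyond `benchbox compare`, per-query timings from the two runs can be compared directly. The layout of the result JSON files is not described here, so this sketch assumes you have already extracted timings into plain dictionaries keyed by query name; `per_query_ratio` is a hypothetical helper, and the numbers are placeholders that only exercise the function, not measurements.

```python
def per_query_ratio(sql_times: dict[str, float], df_times: dict[str, float]) -> dict[str, float]:
    """Return DataFrame time divided by SQL time for each query present in both runs."""
    shared = sorted(set(sql_times) & set(df_times))
    # Skip queries with a zero SQL time to avoid division by zero.
    return {q: df_times[q] / sql_times[q] for q in shared if sql_times[q] > 0}


# Placeholder timings in seconds (not real results).
sql_times = {"Q1": 2.0, "Q6": 0.5}
df_times = {"Q1": 1.0, "Q6": 1.0, "Q9": 3.0}

ratios = per_query_ratio(sql_times, df_times)
print(ratios)
```

A ratio below 1 means the DataFrame run was faster for that query; queries missing from either run are ignored.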
Limitations¶
- TPC-H and TPC-DS only: DataFrame mode currently supports only these two benchmarks.
- Result validation: DataFrame results are validated by approximate comparison, within a floating-point tolerance.
- No multi-stream: DataFrame mode runs single-stream power tests only.
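To illustrate what approximate comparison means for result validation, the sketch below compares two result sets row by row, using `math.isclose` for floating-point values and exact equality for everything else. It is not BenchBox's validator; the function names and the tolerance values are assumptions chosen for illustration.

```python
import math


def values_match(expected, actual, rel_tol=1e-6, abs_tol=1e-9) -> bool:
    """Compare two cell values; floats use a tolerance, everything else is exact."""
    both_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in (expected, actual)
    )
    if both_numeric and (isinstance(expected, float) or isinstance(actual, float)):
        return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol)
    return expected == actual


def rows_match(expected_rows, actual_rows, **tol) -> bool:
    """Compare two ordered lists of row tuples, cell by cell."""
    if len(expected_rows) != len(actual_rows):
        return False
    return all(
        len(e) == len(a) and all(values_match(x, y, **tol) for x, y in zip(e, a))
        for e, a in zip(expected_rows, actual_rows)
    )


# 0.1 + 0.2 is not exactly 0.3 in binary floating point, but it is within tolerance.
print(rows_match([("A", 0.1 + 0.2)], [("A", 0.3)]))
print(rows_match([("A", 1.0)], [("A", 1.1)]))
```

The comparison is order-sensitive, so rows from queries without an `ORDER BY` would need to be sorted on both sides first.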