# DataFrame Cross-Platform Comparison
> **DEPRECATED:** The `compare-dataframes` command is deprecated. Use `benchbox compare --run` instead.
>
> Migration:
>
> ```bash
> # OLD
> benchbox compare-dataframes -p polars-df -p pandas-df
>
> # NEW
> benchbox compare --run -p polars-df -p pandas-df
> ```
>
> See the Platform Comparison Guide for the unified comparison interface.
This guide covers how to compare DataFrame platform performance using the `benchbox compare-dataframes` command. Use it to make data-driven decisions about which DataFrame library to use for your workload.
## Overview

BenchBox provides a comprehensive benchmark suite for:

- **Cross-platform comparisons**: Polars vs Pandas vs DataFusion vs PySpark
- **SQL vs DataFrame API**: compare SQL execution against the DataFrame API on the same data
- **Performance visualization**: generate charts and reports automatically
## Prerequisites

- BenchBox installed with the DataFrame extras:

  ```bash
  uv add benchbox --extra dataframe
  ```

- Benchmark data generated (TPC-H)
## Quick Start

### List Available Platforms

```bash
benchbox compare-dataframes --list-platforms
```
The output shows installed and available platforms:

```text
DataFrame Platforms

Installed:
  polars-df      (expression, single_node)     [lazy, streaming]
  pandas-df      (pandas,     single_node)     [standard]
  datafusion-df  (expression, single_node)     [lazy]

Not installed:
  modin-df       (pandas,     distributed)
  cudf-df        (pandas,     gpu_accelerated)
  dask-df        (pandas,     distributed)
  pyspark-df     (expression, distributed)

Install extras with: uv add benchbox --extra dataframe-<name>
```
### Compare Two Platforms

```bash
# Compare Polars vs Pandas on TPC-H SF 0.01
benchbox compare-dataframes -p polars-df -p pandas-df --scale 0.01
```
### Compare Against SQL

```bash
# Compare the Polars DataFrame API vs DuckDB SQL
benchbox compare-dataframes -p polars-df --vs-sql duckdb --scale 0.01
```
## Command Reference

```bash
benchbox compare-dataframes [OPTIONS]
```
### Platform Selection

| Option | Description |
|---|---|
| `-p` | DataFrame platforms to compare (repeatable) |
| `--vs-sql` | Compare against SQL platform |
| `--list-platforms` | Show available platforms and exit |
### Benchmark Configuration

| Option | Default | Description |
|---|---|---|
| `--benchmark` | `tpch` | Benchmark to run |
| `--scale` | `0.01` | Scale factor |
| `--queries` | all | Comma-separated query IDs (e.g., `Q1,Q6,Q10`) |
| `--warmup` | | Warmup iterations |
| `--iterations` | `3` | Benchmark iterations |
### Output Options

| Option | Default | Description |
|---|---|---|
| `--output` | none | Output directory for results |
| `--format` | | Output format (`markdown` or `json`) |
| `--generate-charts` | false | Generate visualization charts |
| `--theme` | | Chart theme (e.g., `dark`) |
| | auto | Directory containing benchmark data |
## Usage Examples

### Cross-Platform DataFrame Comparison

Compare multiple DataFrame platforms:

```bash
# Compare Polars, Pandas, and DataFusion
benchbox compare-dataframes \
  -p polars-df \
  -p pandas-df \
  -p datafusion-df \
  --scale 0.1 \
  --iterations 5
```
Example output:

```text
DataFrame Platform Comparison

Platforms: polars-df, pandas-df, datafusion-df
Scale factor: 0.1
Queries: all
Iterations: 5

Running benchmarks...

============================================================
RESULTS
============================================================

Fastest: polars-df
Slowest: pandas-df

Platform        Geomean (ms)    Total (ms)    Success
------------------------------------------------------------
polars-df             125.34       2847.23       100%
datafusion-df         142.67       3245.89       100%
pandas-df             312.45       7098.34       100%

Query Winners:
  Q1: polars-df
  Q2: datafusion-df
  Q3: polars-df
  ...
============================================================
```
### SQL vs DataFrame Comparison

Compare the DataFrame API against SQL for the same queries:

```bash
benchbox compare-dataframes \
  -p polars-df \
  --vs-sql duckdb \
  --scale 0.1
```
Example output:

```text
SQL vs DataFrame Comparison

SQL Platform: duckdb
DataFrame Platform: polars-df
Scale factor: 0.1
Queries: all

Running benchmarks...

============================================================
SQL vs DataFrame RESULTS
============================================================

SQL Platform: duckdb
DataFrame Platform: polars-df

DataFrame faster: 15 queries (68.2%)
SQL faster: 7 queries
Average speedup: 1.24x

Query     SQL (ms)    DataFrame (ms)    Speedup
------------------------------------------------------------
Q1           45.23             38.12      1.19x
Q2          123.45             89.34      1.38x
Q3           67.89             72.45      0.94x
...
============================================================
```
### Generate Reports and Charts

Save results with visualization:

```bash
benchbox compare-dataframes \
  -p polars-df \
  -p pandas-df \
  --scale 1 \
  --output ./comparison_results \
  --format markdown \
  --generate-charts \
  --theme dark
```
This creates:

```text
comparison_results/
├── comparison.md              # Markdown report
└── charts/
    ├── comparison_bar.png
    ├── comparison_bar.html
    ├── platform_distribution.png
    ├── platform_distribution.html
    ├── query_heatmap.png
    └── query_heatmap.html
```
### Specific Query Subset

Run only specific queries for faster iteration:

```bash
benchbox compare-dataframes \
  -p polars-df \
  -p pandas-df \
  --queries Q1,Q6,Q10,Q14 \
  --iterations 10
```
### JSON Output for Automation

Export results for programmatic analysis:

```bash
benchbox compare-dataframes \
  -p polars-df \
  -p pandas-df \
  --format json \
  --output ./results
```
The JSON output includes:

```json
{
  "config": {
    "scale_factor": 0.01,
    "query_ids": null,
    "iterations": 3
  },
  "results": [
    {
      "platform": "polars-df",
      "query_results": [...],
      "geometric_mean_ms": 125.34,
      "total_time_ms": 2847.23,
      "success_rate": 100.0
    },
    ...
  ],
  "summary": {
    "fastest_platform": "polars-df",
    "slowest_platform": "pandas-df",
    "query_winners": {"Q1": "polars-df", ...}
  }
}
```
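A short script can then pull the headline numbers out of the saved report. This is a minimal sketch based on the schema above; the `comparison.json` file name inside the `--output` directory is an assumption, so adjust the path to the file the command actually writes:

```python
import json
from pathlib import Path

# Load the saved report (the "comparison.json" name is assumed; use the
# actual file written to your --output directory).
report = json.loads(Path("results/comparison.json").read_text())

# Print per-platform geometric means, using the schema shown above.
for entry in report["results"]:
    print(f"{entry['platform']}: {entry['geometric_mean_ms']:.2f} ms")

print(f"Fastest: {report['summary']['fastest_platform']}")
```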
## Programmatic API

For custom analysis, use the Python API directly:

```python
from benchbox.core.dataframe import (
    BenchmarkConfig,
    DataFrameBenchmarkSuite,
    SQLVsDataFrameBenchmark,
    run_quick_comparison,
    run_sql_vs_dataframe,
)
```
### Quick Comparison

```python
from pathlib import Path

from benchbox.core.dataframe import run_quick_comparison

# Run a quick comparison against pre-generated TPC-H data
results = run_quick_comparison(
    platforms=["polars-df", "pandas-df"],
    data_dir=Path("benchmark_runs/tpch/sf001/data"),
    scale_factor=0.01,
)

for result in results:
    print(f"{result.platform}: {result.geometric_mean_ms:.2f}ms")
```
### Full Suite with Statistics

```python
from pathlib import Path

from benchbox.core.dataframe import (
    BenchmarkConfig,
    DataFrameBenchmarkSuite,
)

# Configure the benchmark
config = BenchmarkConfig(
    scale_factor=0.1,
    query_ids=["Q1", "Q6", "Q10"],
    warmup_iterations=2,
    benchmark_iterations=5,
)

# Create the suite
suite = DataFrameBenchmarkSuite(config=config)

# Run the comparison
results = suite.run_comparison(
    platforms=["polars-df", "pandas-df"],
    data_dir=Path("benchmark_runs/tpch/sf01/data"),
)

# Get the summary
summary = suite.get_summary(results)
print(f"Fastest: {summary.fastest_platform}")
print(f"Speedup ratio: {summary.speedup_ratio:.2f}x")
```
### SQL vs DataFrame

```python
from pathlib import Path

from benchbox.core.dataframe import run_sql_vs_dataframe

summary = run_sql_vs_dataframe(
    sql_platform="duckdb",
    df_platform="polars-df",
    data_dir=Path("benchmark_runs/tpch/sf001/data"),
    scale_factor=0.01,
)

print(f"DataFrame faster: {summary.df_faster_count} queries")
print(f"Average speedup: {summary.average_speedup:.2f}x")
```
### Generate Charts Programmatically

```python
from pathlib import Path

from benchbox.core.dataframe import (
    DataFrameBenchmarkSuite,
    DataFrameComparisonPlotter,
)

# Run the comparison
suite = DataFrameBenchmarkSuite()
data_dir = Path("benchmark_runs/tpch/sf001/data")
results = suite.run_comparison(["polars-df", "pandas-df"], data_dir)

# Generate charts in multiple formats
plotter = DataFrameComparisonPlotter(results, theme="dark")
exports = plotter.generate_charts(
    output_dir="charts/",
    formats=["png", "html", "svg"],
    dpi=300,
)

for chart_type, paths in exports.items():
    print(f"{chart_type}: {paths}")
```
## Platform Categories

The benchmark suite categorizes platforms by capability:

| Category | Platforms | Use Case |
|---|---|---|
| Single Node | Polars, Pandas, DataFusion | In-memory analysis, medium datasets |
| Distributed | PySpark, Dask, Modin | Large datasets, cluster computing |
| GPU Accelerated | cuDF | CUDA-enabled GPU acceleration |
### Platform Capabilities

```python
from benchbox.core.dataframe import PLATFORM_CAPABILITIES

for platform, cap in PLATFORM_CAPABILITIES.items():
    features = []
    if cap.supports_lazy:
        features.append("lazy")
    if cap.supports_streaming:
        features.append("streaming")
    if cap.supports_gpu:
        features.append("gpu")
    if cap.supports_distributed:
        features.append("distributed")
    print(f"{platform}: {', '.join(features) or 'standard'}")
```
## Best Practices

### 1. Start Small

Begin with SF 0.01 to verify everything works:

```bash
benchbox compare-dataframes -p polars-df -p pandas-df --scale 0.01
```
### 2. Use Multiple Iterations

Reduce variance with multiple iterations:

```bash
benchbox compare-dataframes \
  -p polars-df \
  -p pandas-df \
  --iterations 5 \
  --warmup 2
```
### 3. Match Your Production Scale

Test at scale factors representative of your production data:

| Data Size | Recommended SF |
|---|---|
| < 1 GB | 0.01 - 0.1 |
| 1-10 GB | 0.1 - 1 |
| 10-100 GB | 1 - 10 |
| > 100 GB | 10+ (distributed platforms) |
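If you script benchmark runs, the table above can be encoded directly. A minimal sketch; `pick_scale_factor` and its thresholds are illustrative, not part of the BenchBox API:

```python
# Illustrative helper, not part of BenchBox: maps approximate production
# data size to the upper end of the recommended scale-factor range above.
def pick_scale_factor(data_size_gb: float) -> float:
    if data_size_gb < 1:
        return 0.1
    if data_size_gb < 10:
        return 1.0
    if data_size_gb < 100:
        return 10.0
    return 100.0  # > 100 GB: use SF 10+ and prefer distributed platforms

print(pick_scale_factor(5))  # 1.0
```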
### 4. Compare Apples to Apples

When comparing SQL vs DataFrame:

- Use the same underlying data
- Use the same machine/resources
- Use the same query semantics
### 5. Document Your Results

Use markdown output for documentation:

```bash
benchbox compare-dataframes \
  -p polars-df \
  -p pandas-df \
  --scale 1 \
  --format markdown \
  --output ./docs/benchmark_results
```
## Interpreting Results

### Geometric Mean

The geometric mean summarizes performance across all queries in a single number; unlike a simple total, it is not dominated by a few slow queries. Lower is better.
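As a sketch of what this metric computes (the times below are illustrative, not from a real run):

```python
import math
from statistics import geometric_mean

# Geometric mean of per-query runtimes: exp(mean(log t)).
times_ms = [38.12, 89.34, 72.45]
gm = math.exp(sum(math.log(t) for t in times_ms) / len(times_ms))
assert abs(gm - geometric_mean(times_ms)) < 1e-9  # same result
print(f"Geometric mean: {gm:.2f} ms")  # ≈ 63.8 ms
```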
### Success Rate

The percentage of queries that completed successfully. Expect 100% for production-ready platforms.
### Query Winners

Shows which platform was fastest for each query. Useful for identifying platform strengths:

- Polars often wins aggregation-heavy queries
- DuckDB may win complex joins
- Pandas can be competitive on small datasets
### Speedup Ratio

For the SQL vs DataFrame comparison, speedup is the SQL time divided by the DataFrame time:

- **> 1.0**: DataFrame faster
- **< 1.0**: SQL faster
- **~1.0**: Approximately equal
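Using the Q1 numbers from the example output above:

```python
# Q1 from the SQL vs DataFrame example output above.
sql_ms, df_ms = 45.23, 38.12
speedup = sql_ms / df_ms
print(f"{speedup:.2f}x")  # 1.19x -> DataFrame faster
```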
## Troubleshooting

### Data Not Found

```text
Error: Data directory not found: benchmark_runs/tpch/sf001/data
```

Generate data first:

```bash
benchbox run --platform duckdb --benchmark tpch --scale 0.01
```
### Platform Not Available

```text
Warning: Platform modin-df not available, skipping
```

Install the required extra:

```bash
uv add benchbox --extra dataframe-modin
```
### Memory Issues at Large Scale

For large scale factors, consider:

- Using streaming-capable platforms (Polars)
- Using distributed platforms (PySpark, Dask)
- Reducing the number of platforms compared concurrently, e.g. by running them one at a time as sketched below
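A minimal sketch of the last point, built on `run_quick_comparison` from the Programmatic API section; it benchmarks each platform in a separate call so only one engine holds data in memory at a time:

```python
from pathlib import Path

from benchbox.core.dataframe import run_quick_comparison

data_dir = Path("benchmark_runs/tpch/sf01/data")

# Benchmark platforms sequentially instead of in one comparison run,
# so only one engine's data structures are resident at a time.
for platform in ["polars-df", "pandas-df"]:
    results = run_quick_comparison(
        platforms=[platform],
        data_dir=data_dir,
        scale_factor=0.1,
    )
    print(f"{platform}: {results[0].geometric_mean_ms:.2f} ms")
```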