DataFrame Platforms¶
BenchBox supports native DataFrame benchmarking alongside traditional SQL database benchmarking. DataFrame platforms execute TPC-H queries using native DataFrame APIs rather than SQL, enabling direct performance comparison between different DataFrame libraries.
Why DataFrame Benchmarking Matters¶
The Challenge¶
Modern data engineering teams face a choice between two paradigms:
SQL-based platforms (DuckDB, BigQuery, Snowflake) - Query using SQL strings
DataFrame libraries (Pandas, Polars, PySpark) - Query using native APIs
Traditionally, comparing these approaches required:
Writing queries twice (SQL + DataFrame)
Custom benchmarking infrastructure
Non-standardized measurement methodologies
The BenchBox Solution¶
BenchBox provides unified TPC-H benchmarking across both paradigms:
# SQL mode - queries executed via SQL
benchbox run --platform duckdb --benchmark tpch --scale 1
# DataFrame mode - queries executed via native DataFrame API
benchbox run --platform polars-df --benchmark tpch --scale 1
Same benchmark. Same scale factor. Same metrics. Different execution paradigm.
Key Benefits¶
| Benefit | Description |
|---|---|
| Apples-to-apples comparison | Compare SQL vs DataFrame on identical workloads |
| Library evaluation | Benchmark Polars vs Pandas vs PySpark on real analytics queries |
| Migration assessment | Measure performance impact of paradigm switches |
| Optimization validation | Test DataFrame API optimizations against SQL baselines |
Supported Platforms¶
BenchBox supports 4 production-ready DataFrame platforms with 3 more in development.
Expression Family (Production-Ready)¶
Libraries using expression objects and declarative, lazy-evaluation style:
| Platform | CLI Name | Status | Best For |
|---|---|---|---|
| Polars | `polars-df` | Production-ready | High-performance single-node analytics (recommended) |
| DataFusion |  | Production-ready | Arrow-native workflows, Rust performance |
| PySpark | `pyspark-df` | Production-ready | Distributed big data processing |
Pandas Family (Production-Ready)¶
Libraries using string-based column access and imperative style:
| Platform | CLI Name | Status | Best For |
|---|---|---|---|
| Pandas | `pandas-df` | Production-ready | Data science prototyping, ecosystem compatibility |
Coming Soon¶
Infrastructure is in place for these platforms:
| Platform | CLI Name | Family | Status | Notes |
|---|---|---|---|---|
| Modin |  | Pandas | Infrastructure ready | Ray/Dask backends |
| Dask |  | Pandas | Infrastructure ready | Parallel computing |
| cuDF |  | Pandas | Infrastructure ready | NVIDIA GPU acceleration |
Quick Start¶
Installation¶
# Polars DataFrame (recommended - core dependency)
# Already included in base BenchBox installation
# Pandas DataFrame
uv add benchbox --extra dataframe-pandas
# PySpark DataFrame
uv add benchbox --extra dataframe-pyspark
# Install all DataFrame platforms
uv add benchbox --extra dataframe-all
Running Your First DataFrame Benchmark¶
# Run TPC-H on Polars DataFrame
benchbox run --platform polars-df --benchmark tpch --scale 0.01
# Run TPC-H on Pandas DataFrame
benchbox run --platform pandas-df --benchmark tpch --scale 0.01
# Run TPC-H on PySpark DataFrame (local mode)
benchbox run --platform pyspark-df --benchmark tpch --scale 0.01
Comparing SQL vs DataFrame¶
# SQL mode (Polars SQL interface)
benchbox run --platform polars --benchmark tpch --scale 0.1
# DataFrame mode (Polars expression API)
benchbox run --platform polars-df --benchmark tpch --scale 0.1
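To make the distinction concrete, here is the same filter-and-aggregate written both ways in plain Polars. This is an illustrative sketch, not BenchBox code; the Parquet path and query are placeholders.

```python
import datetime as dt
import polars as pl

lineitem = pl.scan_parquet("lineitem.parquet")  # placeholder path

# SQL mode: the query is a string handed to Polars' SQL interface
sql_ctx = pl.SQLContext(lineitem=lineitem)
sql_result = sql_ctx.execute(
    """
    SELECT l_returnflag, SUM(l_quantity) AS sum_qty
    FROM lineitem
    WHERE l_shipdate <= CAST('1998-09-02' AS DATE)
    GROUP BY l_returnflag
    """
).collect()

# DataFrame mode: the same logic expressed as native API calls
df_result = (
    lineitem.filter(pl.col("l_shipdate") <= dt.date(1998, 9, 2))
    .group_by("l_returnflag")
    .agg(pl.col("l_quantity").sum().alias("sum_qty"))
    .collect()
)
```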
Architecture: Family-Based Design¶
BenchBox uses a family-based architecture that enables 95%+ code reuse across DataFrame libraries.
The Two Families¶
Python DataFrame libraries cluster into two syntactic families based on API design:
Pandas Family¶
Libraries using string-based column access and imperative style:
# Pandas-style syntax
df = df[df['l_shipdate'] <= cutoff]
result = df.groupby(['l_returnflag', 'l_linestatus']).agg({
    'l_quantity': 'sum',
    'l_extendedprice': 'sum'
})
Members: Pandas, Modin, cuDF, Dask, Vaex
Expression Family¶
Libraries using expression objects and declarative style:
# Expression-style syntax
result = (
    df.filter(col('l_shipdate') <= lit(cutoff))
    .group_by('l_returnflag', 'l_linestatus')
    .agg(
        col('l_quantity').sum().alias('sum_qty'),
        col('l_extendedprice').sum().alias('sum_base_price')
    )
)
Members: Polars, PySpark, DataFusion
Why Family-Based Works¶
Libraries within each family are intentionally API-compatible:
Modin, cuDF, Dask - designed as Pandas drop-in replacements
PySpark, DataFusion - share expression-based conceptual model with Polars
Result: Write query once per family, run on multiple platforms.
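As a rough sketch of how this works, an expression-family context only needs to expose `get_table`, `col`, and `lit` for a given library. The class below is illustrative, not BenchBox's actual implementation; it shows the Polars side, and a PySpark-backed context would expose `pyspark.sql.functions.col` and `lit` instead.

```python
import polars as pl

class PolarsExpressionContext:  # illustrative name, not the real BenchBox class
    def __init__(self, tables: dict):
        self._tables = tables
        self.col = pl.col   # expression builders used by query implementations
        self.lit = pl.lit

    def get_table(self, name: str):
        return self._tables[name]

# An expression-style query like the one above runs unchanged against this
# context or a PySpark-backed one, because both expose the same
# get_table/col/lit surface.
```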
Platform Configuration¶
Polars DataFrame (polars-df)¶
benchbox run --platform polars-df --benchmark tpch --scale 1 \
--platform-option streaming=true \
--platform-option rechunk=true
| Option | Default | Description |
|---|---|---|
| `streaming` | `false` | Enable streaming mode for large datasets |
| `rechunk` | `true` | Rechunk data for better memory layout |
|  | - | Limit rows to read (for testing) |
Pandas DataFrame (pandas-df)¶
benchbox run --platform pandas-df --benchmark tpch --scale 1 \
--platform-option dtype_backend=pyarrow
| Option | Default | Description |
|---|---|---|
| `dtype_backend` |  | Backend for nullable dtypes (`numpy_nullable` or `pyarrow`) |
PySpark DataFrame (pyspark-df)¶
benchbox run --platform pyspark-df --benchmark tpch --scale 1 \
--platform-option driver_memory=8g \
--platform-option shuffle_partitions=8
| Option | Default | Description |
|---|---|---|
|  |  | Spark master URL |
| `driver_memory` |  | Memory for driver process |
| `shuffle_partitions` | CPU count | Partitions for shuffle operations |
|  |  | Enable Adaptive Query Execution |
See PySpark DataFrame Platform for detailed configuration options.
Scale Factor Guidelines¶
DataFrame platforms have memory constraints. Recommended scale factors are listed below, followed by a rough sizing sketch.
| Platform | Max Recommended SF | Memory Required (SF=1) | Notes |
|---|---|---|---|
| Pandas | 10 | ~6 GB | Eager evaluation, high memory |
| Polars | 100 | ~4 GB | Lazy evaluation, efficient |
| Modin | 10 | ~6 GB + overhead | Distributed overhead |
| Dask | 100+ | Configurable | Disk spillover supported |
| cuDF | 1-10 | GPU VRAM | Limited by GPU memory |
| PySpark | 1000+ | Cluster memory | Distributed processing |
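As a sizing aid, assume memory grows roughly linearly with scale factor from the SF=1 figures above (an approximation, not a guarantee; streaming and lazy engines typically need less resident memory than this estimate).

```python
# Back-of-the-envelope memory estimate based on the SF=1 column above
SF1_MEMORY_GB = {"pandas": 6, "polars": 4}

def estimated_memory_gb(platform: str, scale_factor: float) -> float:
    return SF1_MEMORY_GB[platform] * scale_factor

print(estimated_memory_gb("pandas", 10))  # ~60 GB, near Pandas' recommended limit
print(estimated_memory_gb("polars", 10))  # ~40 GB before streaming is considered
```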
Query Implementation Details¶
BenchBox implements all 22 TPC-H queries for DataFrame platforms:
Expression Family Example (Q1)¶
from datetime import date
from typing import Any

from benchbox.core.dataframe import DataFrameContext


def q1_expression_impl(ctx: DataFrameContext) -> Any:
    """TPC-H Q1: Pricing Summary Report."""
    lineitem = ctx.get_table("lineitem")
    col, lit = ctx.col, ctx.lit
    cutoff_date = date(1998, 9, 2)

    result = (
        lineitem.filter(col("l_shipdate") <= lit(cutoff_date))
        .group_by("l_returnflag", "l_linestatus")
        .agg(
            col("l_quantity").sum().alias("sum_qty"),
            col("l_extendedprice").sum().alias("sum_base_price"),
            (col("l_extendedprice") * (lit(1) - col("l_discount")))
            .sum()
            .alias("sum_disc_price"),
        )
        .sort("l_returnflag", "l_linestatus")
    )
    return result
Pandas Family Example (Q1)¶
from typing import Any

import pandas as pd

from benchbox.core.dataframe import DataFrameContext


def q1_pandas_impl(ctx: DataFrameContext) -> Any:
    """TPC-H Q1: Pricing Summary Report."""
    lineitem = ctx.get_table("lineitem")
    cutoff = pd.to_datetime("1998-12-01") - pd.Timedelta(days=90)

    filtered = lineitem[lineitem["l_shipdate"] <= cutoff]
    # Named aggregations keep flat column labels so the final sort works
    result = (
        filtered
        .groupby(["l_returnflag", "l_linestatus"], as_index=False)
        .agg(
            sum_qty=("l_quantity", "sum"),
            avg_qty=("l_quantity", "mean"),
            sum_base_price=("l_extendedprice", "sum"),
            avg_price=("l_extendedprice", "mean"),
        )
        .sort_values(["l_returnflag", "l_linestatus"])
    )
    return result
Execution Differences: SQL vs DataFrame¶
| Aspect | SQL Mode | DataFrame Mode |
|---|---|---|
| Query representation | SQL strings | Native API calls |
| Optimization | Database query planner | Library-specific (lazy if available) |
| Type checking | Runtime (query execution) | Some compile-time (IDE support) |
| Composability | CTEs, subqueries | Method chaining, intermediate variables |
| Debugging | EXPLAIN plans | Step-through execution |
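The debugging row above is easiest to see with a concrete example. The sketch below uses plain Polars (the path is a placeholder, not BenchBox code): a lazy query can print its optimized plan, much like SQL EXPLAIN, while eager DataFrame code lets you inspect every intermediate step.

```python
import polars as pl

# Lazy/declarative: ask the library for its plan (analogous to EXPLAIN)
lazy = (
    pl.scan_parquet("lineitem.parquet")        # placeholder path
    .filter(pl.col("l_quantity") > 25)
    .group_by("l_returnflag")
    .agg(pl.col("l_extendedprice").sum())
)
print(lazy.explain())

# Eager/imperative: step through and inspect intermediate results directly
step1 = pl.read_parquet("lineitem.parquet")
step2 = step1.filter(pl.col("l_quantity") > 25)
print(step2.shape)
```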
Performance Characteristics¶
Polars DataFrame¶
Strengths:
Lazy evaluation with query optimization
Excellent memory efficiency
Parallel execution by default
Fast file scanning with predicate pushdown (see the sketch below)
Best for:
Medium to large datasets (up to ~100GB)
Complex analytical queries
Memory-constrained environments
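A minimal sketch of the pattern behind these strengths, assuming Parquet input files (the path is a placeholder): the filter is applied during the scan rather than after loading, and streaming keeps peak memory bounded.

```python
import datetime as dt
import polars as pl

result = (
    pl.scan_parquet("lineitem/*.parquet")                  # lazy scan: nothing read yet
    .filter(pl.col("l_shipdate") >= dt.date(1998, 1, 1))   # pushed down into the scan
    .select("l_orderkey", "l_extendedprice")
    .collect(streaming=True)  # newer Polars spells this collect(engine="streaming")
)
```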
Pandas DataFrame¶
Strengths:
Familiar API for Python developers
Extensive ecosystem
Good for prototyping
Best for:
Smaller datasets (up to ~10GB)
Quick iteration
Compatibility with existing codebases
Checking Platform Availability¶
# Check which DataFrame platforms are installed
benchbox profile
# Detailed platform status
python -c "from benchbox.platforms.dataframe import format_platform_status_table; print(format_platform_status_table())"
Example output:
DataFrame Platform Status
============================================================
Platform Family Available Version
------------------------------------------------------------
Pandas pandas ✓ 2.1.4
Polars expression ✓ 1.15.0
Modin pandas ✗ N/A
Dask pandas ✗ N/A
PySpark expression ✗ N/A
DataFusion expression ✓ 43.0.0
------------------------------------------------------------
Available: 3/6 platforms
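Independently of BenchBox, you can also check which of these libraries are importable in the current environment with plain Python; this is just a quick sanity check, not a replacement for the commands above.

```python
import importlib

# Try importing each DataFrame library and report its version if present
for name in ("pandas", "polars", "modin", "dask", "pyspark", "datafusion"):
    try:
        module = importlib.import_module(name)
        print(f"{name:<12} available ({getattr(module, '__version__', 'unknown')})")
    except ImportError:
        print(f"{name:<12} not installed")
```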
DataFrame Tuning¶
BenchBox provides a comprehensive tuning system for DataFrame platforms that allows you to optimize runtime performance based on your system profile and workload characteristics.
Quick Start with Tuning¶
# Use auto-detected optimal settings based on your system
benchbox run --platform polars-df --benchmark tpch --scale 1 --tuning auto
# Use a custom tuning configuration file
benchbox run --platform polars-df --benchmark tpch --tuning ./my_tuning.yaml
CLI Commands¶
# View recommended settings for your system
benchbox df-tuning show-defaults --platform polars
# Create a sample tuning configuration
benchbox df-tuning create-sample --platform polars --output polars_tuning.yaml
# Validate a configuration file
benchbox df-tuning validate polars_tuning.yaml --platform polars
# List supported platforms
benchbox df-tuning list-platforms
Tuning Configuration Options¶
DataFrame tuning is organized into configuration categories:
Parallelism Settings¶
| Setting | Type | Default | Applicable Platforms | Description |
|---|---|---|---|---|
| `thread_count` | int | auto | Polars, Modin | Number of threads |
| `worker_count` | int | auto | Dask, Modin | Number of worker processes |
| `threads_per_worker` | int | auto | Dask | Threads per worker process |
Memory Settings¶
| Setting | Type | Default | Applicable Platforms | Description |
|---|---|---|---|---|
| `memory_limit` | str | None | Dask | Memory limit per worker (e.g., `4GB`) |
| `chunk_size` | int | None | Polars, Pandas, Dask | Size of chunks for streaming/batched operations |
| `spill_to_disk` | bool | false | Dask, cuDF | Enable spilling to disk when memory is exhausted |
| `spill_directory` | str | None | Dask | Directory for spill files (None = temp directory) |
| `rechunk_after_filter` | bool | true | Polars | Rechunk data after filter operations for better memory layout |
Execution Settings¶
| Setting | Type | Default | Applicable Platforms | Description |
|---|---|---|---|---|
| `streaming_mode` | bool | false | Polars | Enable streaming execution for memory efficiency |
| `engine_affinity` | str | None | Polars, Modin | Preferred execution engine (Polars: `streaming` or `in-memory`) |
| `lazy_evaluation` | bool | true | Polars, Dask | Enable lazy evaluation where supported |
|  | int | None | All lazy platforms | Maximum seconds for collect/compute operations |
Data Type Settings¶
| Setting | Type | Default | Applicable Platforms | Description |
|---|---|---|---|---|
| `dtype_backend` | str |  | Pandas, Dask, Modin | Backend for nullable dtypes: `numpy_nullable` or `pyarrow` |
|  | bool | false | Polars, Pandas | Enable global string caching for categoricals |
| `auto_categorize_strings` | bool | false | Pandas, Modin | Auto-convert low-cardinality strings to categoricals |
| `categorical_threshold` | float | 0.5 | Pandas, Modin | Unique ratio threshold for auto-categorization (0.0-1.0) |
I/O Settings¶
| Setting | Type | Default | Applicable Platforms | Description |
|---|---|---|---|---|
| `memory_pool` | str |  | All | Memory allocator for Arrow (e.g., `jemalloc`) |
| `memory_map` | bool | false | Pandas, Dask, Modin | Use memory-mapped files for reading |
| `pre_buffer` | bool | true | Pandas, Dask | Pre-buffer data during file reads |
|  | int | None | Polars, Pandas, cuDF | Row group size for Parquet writing |
GPU Settings (cuDF)¶
Setting |
Type |
Default |
Applicable Platforms |
Description |
|---|---|---|---|---|
|
bool |
false |
cuDF |
Enable GPU acceleration |
|
int |
0 |
cuDF |
CUDA device ID to use (0-indexed) |
|
bool |
true |
cuDF |
Spill GPU memory to host RAM when exhausted |
|
str |
|
cuDF |
RMM memory pool type: |
Example Tuning Configurations¶
Polars - Large Dataset Optimization¶
_metadata:
  version: "1.0"
  platform: polars
  description: "Optimized settings for large TPC-H workloads"
parallelism:
  thread_count: 8
execution:
  streaming_mode: true
  engine_affinity: streaming
  lazy_evaluation: true
memory:
  chunk_size: 100000
  rechunk_after_filter: true
io:
  memory_pool: jemalloc
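For intuition, these settings correspond roughly to standard Polars knobs. The mapping below is an assumption about how such a configuration could be applied, not BenchBox's actual implementation.

```python
import os

# Thread count must be set before polars is imported
os.environ["POLARS_MAX_THREADS"] = "8"

import polars as pl

lf = pl.scan_parquet("lineitem/*.parquet")  # placeholder path; lazy_evaluation=true
df = lf.collect(streaming=True)             # streaming_mode / engine_affinity=streaming
df = df.rechunk()                           # contiguous memory layout after collection
```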
Pandas - Memory-Efficient Processing¶
_metadata:
  version: "1.0"
  platform: pandas
  description: "Memory-efficient Pandas configuration"
data_types:
  dtype_backend: pyarrow
  auto_categorize_strings: true
  categorical_threshold: 0.3
io:
  memory_map: true
  pre_buffer: false
Dask - Distributed Workload¶
_metadata:
  version: "1.0"
  platform: dask
  description: "Distributed Dask configuration"
parallelism:
  worker_count: 4
  threads_per_worker: 2
memory:
  memory_limit: "4GB"
  spill_to_disk: true
  spill_directory: /tmp/dask-spill
execution:
  lazy_evaluation: true
cuDF - GPU Acceleration¶
_metadata:
  version: "1.0"
  platform: cudf
  description: "GPU-accelerated cuDF configuration"
gpu:
  enabled: true
  device_id: 0
  spill_to_host: true
  pool_type: managed
Smart Defaults¶
When you use --tuning auto, BenchBox detects your system profile and applies appropriate settings:
| System Profile | Memory | Typical Settings Applied |
|---|---|---|
| Very Low | <4GB | Streaming mode, small chunks (10K), spill to disk |
| Low | 4-8GB | Streaming mode, moderate chunks (50K), memory-mapped I/O |
| Medium | 8-32GB | Lazy evaluation, moderate chunks (100K), pyarrow backend |
| High | >32GB | In-memory processing, no streaming, large chunks |
| GPU Available | N/A | GPU enabled, managed pool, spill to host |
System Detection¶
BenchBox automatically detects the following (a minimal detection sketch follows this list):
CPU cores: Used to set thread/worker counts
Available RAM: Determines memory category and chunk sizes
GPU presence: Enables CUDA-based acceleration for cuDF
GPU memory: Sets spill thresholds and pool types
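A minimal sketch of this kind of profiling, assuming psutil is available; BenchBox's actual detection logic and thresholds may differ.

```python
import os
import psutil  # assumed dependency, used here only for illustration

cpu_cores = os.cpu_count() or 1
ram_gb = psutil.virtual_memory().total / 1024**3

if ram_gb < 4:
    profile = "very_low"   # streaming mode, small chunks, spill to disk
elif ram_gb < 8:
    profile = "low"        # streaming mode, moderate chunks, memory-mapped I/O
elif ram_gb <= 32:
    profile = "medium"     # lazy evaluation, moderate chunks, pyarrow backend
else:
    profile = "high"       # in-memory processing, large chunks

print(f"{cpu_cores} cores, {ram_gb:.0f} GB RAM -> profile: {profile}")
```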
Platform-Specific Smart Defaults¶
Polars:
- Low memory: `streaming_mode=true`, `chunk_size=50000`
- High memory: `engine_affinity="in-memory"`, `lazy_evaluation=true`
Pandas:
- Low memory: `dtype_backend="numpy_nullable"`, `memory_map=true`
- Medium+ memory: `dtype_backend="pyarrow"` (better performance)
Dask:
- Workers set to CPU cores / 2
- Threads per worker: 2
- Memory limit per worker: available RAM / workers
cuDF:
- Small GPU (<8GB): `pool_type="managed"`, `spill_to_host=true`
- Large GPU (≥8GB): `pool_type="pool"`, `spill_to_host=false`
Tuning Validation¶
BenchBox validates your tuning configuration and reports issues at three levels:
| Level | Description | Example |
|---|---|---|
| ERROR | Invalid configuration that will fail | Invalid value for a setting |
| WARNING | Suboptimal or conflicting settings | `streaming_mode` enabled without `chunk_size` |
| INFO | Suggestions for improvement | Suggestion to enable `lazy_evaluation` |
Validate your configuration before running:
# Validate a configuration file
benchbox df-tuning validate my_config.yaml --platform polars
# Output includes issues and suggestions
✓ Configuration valid
⚠ WARNING: streaming_mode enabled without chunk_size - consider setting chunk_size
ℹ INFO: Consider enabling lazy_evaluation for better performance
Troubleshooting¶
Platform Not Available¶
ValueError: Unknown DataFrame platform: polars-df
Solution: Ensure you’re using a supported platform name with the -df suffix.
Memory Errors with Pandas¶
MemoryError: Unable to allocate array
Solutions:
- Reduce scale factor: `--scale 0.1`
- Switch to Polars: `--platform polars-df`
- Use PyArrow backend: `--platform-option dtype_backend=pyarrow`
Slow Performance with Large Datasets¶
For scale factors > 10:
- Polars: Enable streaming mode with `--platform-option streaming=true`
- Pandas: Consider switching to Polars or Dask
API Reference¶
DataFrameContext Protocol¶
The context provides table access and expression helpers:
from benchbox.core.dataframe import DataFrameContext
# Table access
df = ctx.get_table("lineitem")
# Expression builders (expression family)
col = ctx.col # Column reference
lit = ctx.lit # Literal value
DataFrameQuery Class¶
Query definition with dual-family implementations:
from benchbox.core.dataframe import DataFrameQuery, QueryCategory
query = DataFrameQuery(
    query_id="Q1",
    query_name="Pricing Summary Report",
    category=QueryCategory.TPCH,
    expression_impl=q1_expression_impl,
    pandas_impl=q1_pandas_impl,
)
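A hypothetical sketch of how a runner might pick the right implementation for a platform's family; the `family` string and `run_query` helper are illustrative, not part of the documented API.

```python
def run_query(query: DataFrameQuery, ctx, family: str):
    # Dispatch to the family-specific implementation and execute it against
    # the platform's context (ctx provides get_table, and col/lit for the
    # expression family).
    impl = query.expression_impl if family == "expression" else query.pandas_impl
    return impl(ctx)

result = run_query(query, ctx, family="expression")
```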