DataFrame Platforms¶
BenchBox supports native DataFrame benchmarking alongside traditional SQL database benchmarking. DataFrame platforms execute benchmark queries using native DataFrame APIs rather than SQL, enabling direct performance comparison between different DataFrame libraries.
Supported Benchmarks for DataFrame Execution:
| Benchmark | DataFrame Support | Query Count |
|---|---|---|
| TPC-H | Full | 22 queries |
| TPC-DS | Full | 99 queries |
| Read Primitives | Full | 149 queries (both families) |
| Write Primitives | Full | INSERT, UPDATE, DELETE, MERGE, BULK_LOAD |
Why DataFrame Benchmarking Matters¶
The Challenge¶
Modern data engineering teams face a choice between two paradigms:
- SQL-based platforms (DuckDB, BigQuery, Snowflake) - query using SQL strings
- DataFrame libraries (Pandas, Polars, PySpark) - query using native APIs
Traditionally, comparing these approaches required:
- Writing queries twice (SQL + DataFrame)
- Custom benchmarking infrastructure
- Non-standardized measurement methodologies
The BenchBox Solution¶
BenchBox provides unified TPC-H benchmarking across both paradigms:
# SQL mode - queries executed via SQL
benchbox run --platform duckdb --benchmark tpch --scale 1
# DataFrame mode - queries executed via native DataFrame API
benchbox run --platform polars-df --benchmark tpch --scale 1
Same benchmark. Same scale factor. Same metrics. Different execution paradigm.
Key Benefits¶
| Benefit | Description |
|---|---|
| Apples-to-apples comparison | Compare SQL vs DataFrame on identical workloads |
| Library evaluation | Benchmark Polars vs Pandas vs PySpark on real analytics queries |
| Migration assessment | Measure performance impact of paradigm switches |
| Optimization validation | Test DataFrame API optimizations against SQL baselines |
Supported Platforms¶
BenchBox supports 4 production-ready DataFrame platforms with 3 more in development.
Expression Family (Production-Ready)¶
Libraries using expression objects and declarative, lazy-evaluation style:
| Platform | CLI Name | Status | Best For |
|---|---|---|---|
| Polars | `polars-df` | Production-ready | High-performance single-node analytics (recommended) |
| DataFusion | `datafusion-df` | Production-ready | Arrow-native workflows, Rust performance |
| PySpark | `pyspark-df` | Production-ready | Distributed big data processing |
Pandas Family (Production-Ready)¶
Libraries using string-based column access and imperative style:
| Platform | CLI Name | Status | Best For |
|---|---|---|---|
| Pandas | `pandas-df` | Production-ready | Data science prototyping, ecosystem compatibility |
Coming Soon¶
Infrastructure is in place for these platforms:
| Platform | CLI Name | Family | Status | Notes |
|---|---|---|---|---|
| Modin | `modin-df` | Pandas | Infrastructure ready | Ray/Dask backends |
| Dask | `dask-df` | Pandas | Infrastructure ready | Parallel computing |
| cuDF | `cudf-df` | Pandas | Infrastructure ready | NVIDIA GPU acceleration |
Quick Start¶
Installation¶
# Polars DataFrame (recommended - core dependency)
# Already included in base BenchBox installation
# Pandas DataFrame
uv add benchbox --extra pandas
# PySpark DataFrame
uv add benchbox --extra pyspark
# Install all DataFrame platforms
uv add benchbox --extra dataframe-all
Running Your First DataFrame Benchmark¶
# Run TPC-H on Polars DataFrame
benchbox run --platform polars-df --benchmark tpch --scale 0.01
# Run TPC-H on Pandas DataFrame
benchbox run --platform pandas-df --benchmark tpch --scale 0.01
# Run TPC-H on PySpark DataFrame (local mode)
benchbox run --platform pyspark-df --benchmark tpch --scale 0.01
Comparing SQL vs DataFrame¶
# SQL mode (Polars SQL interface)
benchbox run --platform polars --benchmark tpch --scale 0.1
# DataFrame mode (Polars expression API)
benchbox run --platform polars-df --benchmark tpch --scale 0.1
Architecture: Family-Based Design¶
BenchBox uses a family-based architecture that enables 95%+ code reuse across DataFrame libraries.
The Two Families¶
Python DataFrame libraries cluster into two syntactic families based on API design:
Pandas Family¶
Libraries using string-based column access and imperative style:
# Pandas-style syntax
df = df[df['l_shipdate'] <= cutoff]
result = df.groupby(['l_returnflag', 'l_linestatus']).agg({
    'l_quantity': 'sum',
    'l_extendedprice': 'sum'
})
Members: Pandas, Modin, cuDF, Dask, Vaex
Expression Family¶
Libraries using expression objects and declarative style:
# Expression-style syntax
result = (
    df.filter(col('l_shipdate') <= lit(cutoff))
    .group_by('l_returnflag', 'l_linestatus')
    .agg(
        col('l_quantity').sum().alias('sum_qty'),
        col('l_extendedprice').sum().alias('sum_base_price')
    )
)
Members: Polars, PySpark, DataFusion
Why Family-Based Works¶
Libraries within each family are intentionally API-compatible:
- Modin, cuDF, Dask - designed as Pandas drop-in replacements
- PySpark, DataFusion - share the expression-based conceptual model with Polars
Result: Write query once per family, run on multiple platforms.
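The routing behind this can be pictured as a small dispatch table: each query carries one implementation per family, and a platform resolves to whichever implementation matches its family. A simplified sketch (the names and data structures here are illustrative, not BenchBox internals):

```python
# Sketch of family-based dispatch: a platform maps to a family, and a
# query is a pair of per-family implementations. Illustrative only.
FAMILY = {
    "polars-df": "expression",
    "pyspark-df": "expression",
    "pandas-df": "pandas",
    "modin-df": "pandas",
}

def make_query(expression_impl, pandas_impl):
    """Bundle one implementation per family, keyed by family name."""
    return {"expression": expression_impl, "pandas": pandas_impl}

def run(query, platform, ctx):
    """Route the platform to the implementation for its family."""
    return query[FAMILY[platform]](ctx)

# Stub implementations just report which family executed them.
q1 = make_query(lambda ctx: ("expression", ctx), lambda ctx: ("pandas", ctx))
assert run(q1, "polars-df", None)[0] == "expression"
assert run(q1, "pandas-df", None)[0] == "pandas"
```

Adding a new family member (say, cuDF) then only requires a new `FAMILY` entry, not a new query implementation.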
Platform Configuration¶
Polars DataFrame (polars-df)¶
benchbox run --platform polars-df --benchmark tpch --scale 1 \
--platform-option streaming=true \
--platform-option rechunk=true
| Option | Default | Description |
|---|---|---|
| `streaming` | `false` | Enable streaming mode for large datasets |
| `rechunk` | `true` | Rechunk data for better memory layout |
| `n_rows` | - | Limit rows to read (for testing) |
Pandas DataFrame (pandas-df)¶
benchbox run --platform pandas-df --benchmark tpch --scale 1 \
--platform-option dtype_backend=pyarrow
| Option | Default | Description |
|---|---|---|
| `dtype_backend` | `numpy_nullable` | Backend for nullable dtypes (`numpy_nullable` or `pyarrow`) |
PySpark DataFrame (pyspark-df)¶
benchbox run --platform pyspark-df --benchmark tpch --scale 1 \
--platform-option driver_memory=8g \
--platform-option shuffle_partitions=8
| Option | Default | Description |
|---|---|---|
| `master` | `local[*]` | Spark master URL |
| `driver_memory` | `4g` | Memory for driver process |
| `shuffle_partitions` | CPU count | Partitions for shuffle operations |
| `adaptive_enabled` | `true` | Enable Adaptive Query Execution |
See PySpark DataFrame Platform for detailed configuration options.
Scale Factor Guidelines¶
DataFrame platforms have memory constraints. Recommended scale factors:
| Platform | Max Recommended SF | Memory Required (SF=1) | Notes |
|---|---|---|---|
| Pandas | 10 | ~6 GB | Eager evaluation, high memory |
| Polars | 100 | ~4 GB | Lazy evaluation, efficient |
| Modin | 10 | ~6 GB + overhead | Distributed overhead |
| Dask | 100+ | Configurable | Disk spillover supported |
| cuDF | 1-10 | GPU VRAM | Limited by GPU memory |
| PySpark | 1000+ | Cluster memory | Distributed processing |
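Since the memory figures in the table scale roughly linearly with scale factor, a back-of-envelope sizing check is just a multiplication. A small helper using the table's per-SF figures (illustrative multipliers taken from the table above, not measured values):

```python
# Rough working-set estimate per platform, extrapolated linearly from the
# table's SF=1 figures (~6 GB/SF for Pandas and Modin, ~4 GB/SF for Polars).
# Real usage varies by query; this is a sketch, not a BenchBox API.
GB_PER_SF = {"pandas": 6.0, "polars": 4.0, "modin": 6.0}

def estimated_memory_gb(platform: str, scale_factor: float) -> float:
    """Linear extrapolation of required memory for a given scale factor."""
    return GB_PER_SF[platform] * scale_factor

assert estimated_memory_gb("pandas", 1) == 6.0
assert estimated_memory_gb("polars", 10) == 40.0
```

A quick check like this explains the table's caps: SF=100 on Pandas would imply roughly 600 GB of RAM, which is why Polars or a distributed platform is recommended there.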
Supported Benchmarks¶
TPC-H (22 queries)¶
All 22 TPC-H queries have DataFrame implementations for both expression and pandas families.
Read Primitives (149 queries)¶
The Read Primitives benchmark provides comprehensive DataFrame support:
- Expression Family: 149 query implementations (Polars, PySpark, DataFusion)
- Pandas Family: 149 query implementations (Pandas, Modin, Dask, cuDF)
- Categories: aggregation, filter, groupby, orderby, window, qualify, broadcast, string, and more
# Run Read Primitives on Polars DataFrame
benchbox run --platform polars-df --benchmark read-primitives --scale 0.01
See Read Primitives for full documentation.
Write Primitives (DataFrame Write Operations)¶
Write operations on DataFrame platforms are supported via the DataFrameWriteOperationsManager:
| Platform | INSERT | UPDATE | DELETE | MERGE | BULK_LOAD |
|---|---|---|---|---|---|
| Polars | ✅ | ✅ | ✅ | ✅ | ✅ |
| Pandas | ✅ | ❌ | ❌ | ❌ | ✅ |
| PySpark | ✅ | ✅* | ✅* | ✅* | ✅ |
*PySpark requires Delta Lake table format for row-level operations.
See Write Primitives for full documentation.
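Row-level operations on immutable DataFrames are typically rewritten as whole-table transformations; a MERGE (upsert), for instance, reduces to "update matched keys, append the rest". A schematic pure-Python version over lists of dicts (this sketches the semantics only, not the actual DataFrameWriteOperationsManager API):

```python
def merge_upsert(target, source, key):
    """MERGE as a DataFrame-style transformation: matched rows are
    replaced by their source counterparts, unmatched source rows are
    inserted. Rows are dicts; `key` is the join key."""
    updates = {row[key]: row for row in source}
    # Keep each target row, swapping in the source row when keys match.
    merged = [updates.pop(row[key], row) for row in target]
    merged.extend(updates.values())  # remaining source rows are inserts
    return merged

target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
source = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
result = merge_upsert(target, source, "id")
assert sorted((r["id"], r["v"]) for r in result) == [(1, "a"), (2, "B"), (3, "c")]
```

The same shape explains the PySpark footnote: without a table format like Delta Lake that supports rewrites, row-level MERGE/UPDATE/DELETE cannot be expressed efficiently.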
Query Implementation Details¶
BenchBox implements all 22 TPC-H queries for DataFrame platforms:
Expression Family Example (Q1)¶
from datetime import date

def q1_expression_impl(ctx: DataFrameContext) -> Any:
    """TPC-H Q1: Pricing Summary Report."""
    lineitem = ctx.get_table("lineitem")
    col, lit = ctx.col, ctx.lit
    cutoff_date = date(1998, 9, 2)
    result = (
        lineitem.filter(col("l_shipdate") <= lit(cutoff_date))
        .group_by("l_returnflag", "l_linestatus")
        .agg(
            col("l_quantity").sum().alias("sum_qty"),
            col("l_extendedprice").sum().alias("sum_base_price"),
            (col("l_extendedprice") * (lit(1) - col("l_discount")))
            .sum()
            .alias("sum_disc_price"),
        )
        .sort("l_returnflag", "l_linestatus")
    )
    return result
Pandas Family Example (Q1)¶
import pandas as pd

def q1_pandas_impl(ctx: DataFrameContext) -> Any:
    """TPC-H Q1: Pricing Summary Report."""
    lineitem = ctx.get_table("lineitem")
    cutoff = pd.to_datetime("1998-12-01") - pd.Timedelta(days=90)
    filtered = lineitem[lineitem["l_shipdate"] <= cutoff]
    result = (
        filtered
        .groupby(["l_returnflag", "l_linestatus"], as_index=False)
        # Named aggregations keep flat column labels (matching the
        # expression impl's aliases), so sort_values works directly.
        .agg(
            sum_qty=("l_quantity", "sum"),
            sum_base_price=("l_extendedprice", "sum"),
        )
        .sort_values(["l_returnflag", "l_linestatus"])
    )
    return result
Execution Differences: SQL vs DataFrame¶
| Aspect | SQL Mode | DataFrame Mode |
|---|---|---|
| Query representation | SQL strings | Native API calls |
| Optimization | Database query planner | Library-specific (lazy if available) |
| Type checking | Runtime (query execution) | Some compile-time (IDE support) |
| Composability | CTEs, subqueries | Method chaining, intermediate variables |
| Debugging | EXPLAIN plans | Step-through execution |
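The contrast is easiest to see on a toy workload: the same grouped sum expressed once as a SQL string (via the stdlib `sqlite3` module) and once as composable native calls (plain Python stands in for a DataFrame API here):

```python
import sqlite3
from itertools import groupby

rows = [("A", 10), ("A", 5), ("B", 7)]

# SQL mode: the query is an opaque string, planned by the engine.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (flag TEXT, qty INT)")
con.executemany("INSERT INTO t VALUES (?, ?)", rows)
sql_result = con.execute(
    "SELECT flag, SUM(qty) FROM t GROUP BY flag ORDER BY flag"
).fetchall()

# DataFrame-style mode: the query is composed from inspectable steps,
# each of which can be bound to a variable and debugged in isolation.
keyed = sorted(rows)  # sort by flag so groupby sees contiguous groups
df_result = [
    (flag, sum(q for _, q in grp))
    for flag, grp in groupby(keyed, key=lambda r: r[0])
]

assert sql_result == df_result == [("A", 15), ("B", 7)]
```

Both paradigms compute the same answer; the difference BenchBox measures is how each engine plans and executes that work at benchmark scale.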
Performance Characteristics¶
Polars DataFrame¶
Strengths:

- Lazy evaluation with query optimization
- Excellent memory efficiency
- Parallel execution by default
- Fast file scanning with predicate pushdown

Best for:

- Medium to large datasets (up to ~100GB)
- Complex analytical queries
- Memory-constrained environments
Pandas DataFrame¶
Strengths:

- Familiar API for Python developers
- Extensive ecosystem
- Good for prototyping

Best for:

- Smaller datasets (up to ~10GB)
- Quick iteration
- Compatibility with existing codebases
Checking Platform Availability¶
# Check which DataFrame platforms are installed
benchbox profile
# Detailed platform status
python -c "from benchbox.platforms.dataframe import format_platform_status_table; print(format_platform_status_table())"
Example output:
DataFrame Platform Status
============================================================
Platform Family Available Version
------------------------------------------------------------
Pandas pandas ✓ 2.1.4
Polars expression ✓ 1.15.0
Modin pandas ✗ N/A
Dask pandas ✗ N/A
PySpark expression ✗ N/A
DataFusion expression ✓ 43.0.0
------------------------------------------------------------
Available: 3/6 platforms
DataFrame Tuning¶
BenchBox provides a comprehensive tuning system for DataFrame platforms that allows you to optimize runtime performance based on your system profile and workload characteristics.
Quick Start with Tuning¶
# Use auto-detected optimal settings based on your system
benchbox run --platform polars-df --benchmark tpch --scale 1 --tuning auto
# Use a custom tuning configuration file
benchbox run --platform polars-df --benchmark tpch --tuning ./my_tuning.yaml
CLI Commands¶
# View recommended settings for your system
benchbox tuning show-defaults --platform polars
# Create a sample tuning configuration
benchbox tuning create-sample --platform polars --output polars_tuning.yaml
# Validate a configuration file
benchbox tuning validate polars_tuning.yaml --platform polars
# List supported platforms
benchbox tuning list-platforms
Tuning Configuration Options¶
DataFrame tuning is organized into configuration categories:
Parallelism Settings¶
| Setting | Type | Default | Applicable Platforms | Description |
|---|---|---|---|---|
| `thread_count` | int | auto | Polars, Modin | Number of threads (Polars: `POLARS_MAX_THREADS`) |
| `worker_count` | int | auto | Dask, Modin | Number of worker processes |
| `threads_per_worker` | int | auto | Dask | Threads per worker process |
Memory Settings¶
| Setting | Type | Default | Applicable Platforms | Description |
|---|---|---|---|---|
| `memory_limit` | str | None | Dask | Memory limit per worker (e.g., `"4GB"`) |
| `chunk_size` | int | None | Polars, Pandas, Dask | Size of chunks for streaming/batched operations |
| `spill_to_disk` | bool | false | Dask, cuDF | Enable spilling to disk when memory is exhausted |
| `spill_directory` | str | None | Dask | Directory for spill files (None = temp directory) |
| `rechunk_after_filter` | bool | true | Polars | Rechunk data after filter operations for better memory layout |
Execution Settings¶
| Setting | Type | Default | Applicable Platforms | Description |
|---|---|---|---|---|
| `streaming_mode` | bool | false | Polars | Enable streaming execution for memory efficiency |
| `engine_affinity` | str | None | Polars, Modin | Preferred execution engine. Polars: `streaming` or `in-memory` |
| `lazy_evaluation` | bool | true | Polars, Dask | Enable lazy evaluation where supported |
| `collect_timeout` | int | None | All lazy platforms | Maximum seconds for collect/compute operations |
Data Type Settings¶
| Setting | Type | Default | Applicable Platforms | Description |
|---|---|---|---|---|
| `dtype_backend` | str | `numpy_nullable` | Pandas, Dask, Modin | Backend for nullable dtypes: `numpy_nullable` or `pyarrow` |
| `string_cache` | bool | false | Polars, Pandas | Enable global string caching for categoricals |
| `auto_categorize_strings` | bool | false | Pandas, Modin | Auto-convert low-cardinality strings to categoricals |
| `categorical_threshold` | float | 0.5 | Pandas, Modin | Unique ratio threshold for auto-categorization (0.0-1.0) |
I/O Settings¶
| Setting | Type | Default | Applicable Platforms | Description |
|---|---|---|---|---|
| `memory_pool` | str | `default` | All | Memory allocator for Arrow: `default`, `jemalloc`, or `mimalloc` |
| `memory_map` | bool | false | Pandas, Dask, Modin | Use memory-mapped files for reading |
| `pre_buffer` | bool | true | Pandas, Dask | Pre-buffer data during file reads |
| `row_group_size` | int | None | Polars, Pandas, cuDF | Row group size for Parquet writing |
Write Settings¶
| Setting | Type | Default | Applicable Platforms | Description |
|---|---|---|---|---|
| `sort_by` | list[object\|str] | None | All DataFrame platforms | Sort columns before write for better compression and scan locality |
| `partition_by` | list[object\|str] | None | Dask, PySpark | Hive-style partitioned output directories |
| `row_group_size` | int | None | Polars, Pandas, cuDF | Parquet row group sizing for write path |
| `target_file_size` | int | None | Platform-dependent | Preferred output file size target |
| `file_count` | int | None | Dask, PySpark | Number of output files/partitions |
| `compression` | str | `snappy` | All DataFrame platforms | Parquet compression codec (e.g. `snappy`, `zstd`, `gzip`) |
| `compression_level` | int | None | Codec-dependent | Compression level for codecs that support levels |
| `dictionary_columns` | list[str] | None | All DataFrame platforms | Force dictionary encoding on selected columns |
| `no_dictionary_columns` | list[str] | None | All DataFrame platforms | Disable dictionary encoding for selected columns |
| `data_page_version` | str | `1.0` | All Parquet write paths | Parquet data page serialization version |
GPU Settings (cuDF)¶
| Setting | Type | Default | Applicable Platforms | Description |
|---|---|---|---|---|
| `enabled` | bool | false | cuDF | Enable GPU acceleration |
| `device_id` | int | 0 | cuDF | CUDA device ID to use (0-indexed) |
| `spill_to_host` | bool | true | cuDF | Spill GPU memory to host RAM when exhausted |
| `pool_type` | str | `managed` | cuDF | RMM memory pool type: `managed` or `pool` |
Example Tuning Configurations¶
Polars - SF10+ Dataset Optimization¶
_metadata:
  version: "1.0"
  platform: polars
  description: "Optimized settings for large TPC-H workloads"
parallelism:
  thread_count: 8
execution:
  streaming_mode: true
  engine_affinity: streaming
  lazy_evaluation: true
memory:
  chunk_size: 100000
  rechunk_after_filter: true
io:
  memory_pool: jemalloc
Pandas - Memory-Efficient Processing¶
_metadata:
  version: "1.0"
  platform: pandas
  description: "Memory-efficient Pandas configuration"
data_types:
  dtype_backend: pyarrow
  auto_categorize_strings: true
  categorical_threshold: 0.3
io:
  memory_map: true
  pre_buffer: false
Dask - Distributed Workload¶
_metadata:
  version: "1.0"
  platform: dask
  description: "Distributed Dask configuration"
parallelism:
  worker_count: 4
  threads_per_worker: 2
memory:
  memory_limit: "4GB"
  spill_to_disk: true
  spill_directory: /tmp/dask-spill
execution:
  lazy_evaluation: true
cuDF - GPU Acceleration¶
_metadata:
  version: "1.0"
  platform: cudf
  description: "GPU-accelerated cuDF configuration"
gpu:
  enabled: true
  device_id: 0
  spill_to_host: true
  pool_type: managed
Smart Defaults¶
When you use --tuning auto, BenchBox detects your system profile and applies appropriate settings:
| System Profile | Memory | Typical Settings Applied |
|---|---|---|
| Very Low | <4GB | Streaming mode, small chunks (10K), spill to disk |
| Low | 4-8GB | Streaming mode, moderate chunks (50K), memory-mapped I/O |
| Medium | 8-32GB | Lazy evaluation, moderate chunks (100K), pyarrow backend |
| High | >32GB | In-memory processing, no streaming, large chunks |
| GPU Available | N/A | GPU enabled, managed pool, spill to host |
System Detection¶
BenchBox automatically detects:
- CPU cores: used to set thread/worker counts
- Available RAM: determines memory category and chunk sizes
- GPU presence: enables CUDA-based acceleration for cuDF
- GPU memory: sets spill thresholds and pool types
Platform-Specific Smart Defaults¶
Polars:

- Low memory: `streaming_mode=true`, `chunk_size=50000`
- High memory: `engine_affinity="in-memory"`, `lazy_evaluation=true`

Pandas:

- Low memory: `dtype_backend="numpy_nullable"`, `memory_map=true`
- Medium+ memory: `dtype_backend="pyarrow"` (2-5x faster for aggregations)
Dask:

- Workers set to CPU cores / 2
- Threads per worker: 2
- Memory limit per worker: available RAM / workers
cuDF:

- Small GPU (<8GB): `pool_type="managed"`, `spill_to_host=true`
- Large GPU (≥8GB): `pool_type="pool"`, `spill_to_host=false`
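The selection logic amounts to mapping detected resources to a settings dictionary. A simplified sketch with thresholds mirroring the Smart Defaults table above (the function and key names are illustrative, not BenchBox internals):

```python
# Sketch of smart-default selection: detected RAM and GPU presence map to
# a tuning settings dict. Thresholds follow the table above; illustrative only.
def smart_defaults(ram_gb: float, has_gpu: bool = False) -> dict:
    """Return tuning settings for a detected system profile."""
    if has_gpu:
        return {"gpu": {"enabled": True, "pool_type": "managed",
                        "spill_to_host": True}}
    if ram_gb < 4:   # very low: stream everything, spill aggressively
        return {"execution": {"streaming_mode": True},
                "memory": {"chunk_size": 10_000, "spill_to_disk": True}}
    if ram_gb < 8:   # low: streaming with memory-mapped I/O
        return {"execution": {"streaming_mode": True},
                "memory": {"chunk_size": 50_000},
                "io": {"memory_map": True}}
    if ram_gb < 32:  # medium: lazy evaluation, pyarrow backend
        return {"execution": {"lazy_evaluation": True},
                "memory": {"chunk_size": 100_000},
                "data_types": {"dtype_backend": "pyarrow"}}
    # high: keep everything in memory
    return {"execution": {"streaming_mode": False, "lazy_evaluation": True}}

# Thread/worker counts would come from detected CPU cores, e.g. os.cpu_count().
assert smart_defaults(16)["data_types"]["dtype_backend"] == "pyarrow"
assert smart_defaults(2)["memory"]["spill_to_disk"] is True
```

The real detector also folds in GPU memory; the shape of the logic stays the same.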
Tuning Validation¶
BenchBox validates your tuning configuration and reports issues at three levels:
| Level | Description | Example |
|---|---|---|
| ERROR | Invalid configuration that will fail | Invalid `pool_type` value |
| WARNING | Suboptimal or conflicting settings | `streaming_mode` enabled without `chunk_size` |
| INFO | Suggestions for improvement | Streaming mode without `lazy_evaluation` enabled |
Validate your configuration before running:
# Validate a configuration file
benchbox tuning validate my_config.yaml --platform polars
# Output includes issues and suggestions
✓ Configuration valid
⚠ WARNING: streaming_mode enabled without chunk_size - consider setting chunk_size
ℹ INFO: Consider enabling lazy_evaluation for improved efficiency
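A validator of this shape simply walks the config and collects (level, message) findings. A minimal sketch of checks like the ones shown above (not the actual BenchBox validator):

```python
# Sketch of three-level tuning validation over a plain config dict.
# The specific checks mirror the sample output above; illustrative only.
def validate_tuning(config: dict):
    """Return a list of (level, message) findings for a tuning config."""
    issues = []
    execution = config.get("execution", {})
    memory = config.get("memory", {})
    chunk = memory.get("chunk_size")
    # ERROR: values that would make the run fail outright
    if chunk is not None and (not isinstance(chunk, int) or chunk <= 0):
        issues.append(("ERROR", "chunk_size must be a positive integer"))
    # WARNING: conflicting or suboptimal combinations
    if execution.get("streaming_mode") and chunk is None:
        issues.append(("WARNING", "streaming_mode enabled without chunk_size"))
    # INFO: optional improvements
    if not execution.get("lazy_evaluation", False):
        issues.append(("INFO", "consider enabling lazy_evaluation"))
    return issues

findings = validate_tuning({"execution": {"streaming_mode": True}})
assert ("WARNING", "streaming_mode enabled without chunk_size") in findings
```

Running the check before a long benchmark is cheap insurance against a misconfigured multi-hour run.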
Troubleshooting¶
Platform Not Available¶
ValueError: Unknown DataFrame platform: polars-df
Solution: Ensure you’re using a supported platform name with the -df suffix.
Memory Errors with Pandas¶
MemoryError: Unable to allocate array
Solutions:

- Reduce scale factor: `--scale 0.1`
- Switch to Polars: `--platform polars-df`
- Use PyArrow backend: `--platform-option dtype_backend=pyarrow`
Slow Performance with Large Datasets¶
For scale factors > 10:
- Polars: enable streaming mode with `--platform-option streaming=true`
- Pandas: consider switching to Polars or Dask
API Reference¶
DataFrameContext Protocol¶
The context provides table access and expression helpers:
from benchbox.core.dataframe import DataFrameContext
# Table access
df = ctx.get_table("lineitem")
# Expression builders (expression family)
col = ctx.col # Column reference
lit = ctx.lit # Literal value
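Structurally the context is a duck-typed interface, which can be sketched with `typing.Protocol`. A hedged approximation (the real class lives in `benchbox.core.dataframe` and may differ):

```python
from typing import Any, Protocol

class DataFrameContextSketch(Protocol):
    """Structural sketch of the query context: table access plus
    expression builders (used only by the expression family)."""
    col: Any  # column-reference builder, e.g. a polars.col-style callable
    lit: Any  # literal builder, e.g. a polars.lit-style callable

    def get_table(self, name: str) -> Any: ...

class DictContext:
    """Toy implementation backed by a dict of in-memory tables,
    enough to unit-test a query function without a real platform."""
    def __init__(self, tables: dict):
        self._tables = tables
        self.col = None  # expression builders unused in this pandas-family toy
        self.lit = None

    def get_table(self, name: str) -> Any:
        return self._tables[name]

ctx = DictContext({"lineitem": [1, 2, 3]})
assert ctx.get_table("lineitem") == [1, 2, 3]
```

Because the contract is structural, any object exposing `get_table`, `col`, and `lit` can drive a query implementation, which is what makes the stub contexts used in tests possible.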
DataFrameQuery Class¶
Query definition with dual-family implementations:
from benchbox.core.dataframe import DataFrameQuery, QueryCategory
query = DataFrameQuery(
    query_id="Q1",
    query_name="Pricing Summary Report",
    category=QueryCategory.TPCH,
    expression_impl=q1_expression_impl,
    pandas_impl=q1_pandas_impl,
)