Platform Selection Guide¶
This guide helps you choose the optimal database platform for your BenchBox benchmarking needs based on your specific requirements, constraints, and objectives. Start by running uv run benchbox check-deps --matrix to see which optional connectors are installed locally, then use benchbox platforms install <name> and benchbox platforms enable <name> to activate the adapters that match your shortlist.
Curious about engines that are being evaluated next? Check the potential future platforms reference for research notes and roadmap context.
Quick Start: Choose Your Path¶
I want to start immediately with minimal setup¶
Included by default: DuckDB
No additional dependencies or configuration required
In-process analytical database with columnar storage
Suitable for learning and experimentation
I need the most cost-effective solution¶
Open-source options: DuckDB (free) or ClickHouse (self-hosted)
Open source with no licensing costs
Run on your own hardware
Minimal operational overhead
I need maximum performance¶
High-performance options: ClickHouse or Snowflake with large warehouses
Column-oriented architecture optimized for analytical workloads
Scalable compute resources
Advanced query optimization features
I want fully managed cloud service¶
Cloud platform options: BigQuery, Snowflake, or Databricks
No infrastructure management required
Auto-scaling capabilities
Enterprise support and SLAs available
I need enterprise features and governance¶
Enterprise platform options: Snowflake or Databricks
Advanced security and compliance features
Role-based access control
Audit logging and governance capabilities
I want to benchmark DataFrame libraries (not SQL)¶
DataFrame platforms: Polars-df, Pandas-df, PySpark-df, DataFusion-df
Native DataFrame API benchmarking (no SQL)
Compare Polars vs Pandas vs PySpark on identical workloads
Measure performance of data science workflows
See DataFrame Platforms Guide for details
# Quick start with DataFrame platforms
benchbox run --platform polars-df --benchmark tpch --scale 0.1 # Expression API
benchbox run --platform pandas-df --benchmark tpch --scale 0.1 # Pandas API
Platform Extras and Installation Matrix¶
BenchBox uses optional dependency extras to keep the core installation lightweight. Install just the connectors you need, or combine extras for broader coverage. Always wrap the extras specification in quotes when using shells like zsh.
| Extra | What it Enables | Install Command |
|---|---|---|
|  | DuckDB + SQLite (local development) |  |
|  | Cloudpathlib helpers for S3, GCS, Azure paths |  |
|  | Databricks, BigQuery, Redshift, Snowflake connectors |  |
| databricks | Databricks SQL Warehouses + Unity Catalog tooling | uv add "benchbox[databricks]" |
| bigquery | BigQuery client libraries + Google Cloud Storage support | uv add "benchbox[bigquery]" |
| redshift | Redshift connector, boto3, and cloud storage helpers | uv add "benchbox[redshift]" |
| snowflake | Snowflake connector + cloud storage helpers | uv add "benchbox[snowflake]" |
| trino | Trino/Starburst distributed SQL connector | uv add "benchbox[trino]" |
| presto | PrestoDB (Meta) distributed SQL connector | uv add "benchbox[presto]" |
| clickhouse | ClickHouse native driver | uv add "benchbox[clickhouse]" |
| all | Every connector and helper listed above | uv add "benchbox[all]" |
Need to double-check your environment? Run benchbox check-deps --matrix for the full installation matrix or benchbox check-deps --platform <name> for targeted guidance.
Presto Lineage: Trino vs PrestoDB vs Athena¶
Trino: Recommended default for new distributed deployments and Starburst; uses the trino Python client and X-Trino-* headers.
PrestoDB: Choose for legacy Presto clusters or Meta ecosystem compatibility; uses presto-python-client with X-Presto-* headers.
Athena: AWS managed service built on the Presto/Trino lineage; choose when you want serverless execution without running your own coordinator.
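In BenchBox terms, the choice mostly determines which client stack the adapter drives. The sketch below is illustrative only: the adapter class names, module paths, and constructor arguments are assumptions extrapolated from the other platform adapters in this guide, not confirmed API; check the platform docs for the exact names.
# Hypothetical adapter names and arguments, following the benchbox.platforms.<name> pattern
from benchbox.platforms.trino import TrinoAdapter    # assumed module path
from benchbox.platforms.presto import PrestoAdapter  # assumed module path

# Trino/Starburst cluster: speaks X-Trino-* headers via the trino client
adapter = TrinoAdapter(host="trino.example.com", port=8080, catalog="hive")

# Legacy PrestoDB cluster: speaks X-Presto-* headers via presto-python-client
adapter = PrestoAdapter(host="presto.example.com", port=8080, catalog="hive")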
SQL vs DataFrame: When to Choose Each¶
Before diving into platform specifics, consider whether SQL or DataFrame APIs better match your benchmarking needs.
When to Choose SQL Platforms¶
Choose SQL platforms (DuckDB, BigQuery, Snowflake, etc.) when:
You’re benchmarking database systems - Comparing cloud data warehouses or analytical databases
Your production workloads use SQL - Testing the same queries your applications execute
You need distributed execution - Cloud platforms automatically scale compute
Your team thinks in SQL - Familiar with SQL query patterns and optimization
# SQL platforms execute queries as SQL strings
benchbox run --platform duckdb --benchmark tpch --scale 1
benchbox run --platform snowflake --benchmark tpch --scale 100
When to Choose DataFrame Platforms¶
Choose DataFrame platforms (Polars-df, Pandas-df, PySpark-df, etc.) when:
You’re benchmarking DataFrame libraries - Comparing Polars vs Pandas vs PySpark
Your production code uses DataFrame APIs - Data science workflows, ML pipelines
You want to compare paradigms - SQL vs native DataFrame performance
You need lazy evaluation control - Fine-grained control over execution plans
# DataFrame platforms execute using native APIs
benchbox run --platform polars-df --benchmark tpch --scale 1
benchbox run --platform pandas-df --benchmark tpch --scale 1
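To see the paradigm difference concretely, here is the same toy aggregation written both ways using DuckDB and Polars directly (outside BenchBox); the data and column names are invented for illustration:
import duckdb
import polars as pl

df = pl.DataFrame({"flag": ["A", "B", "A"], "qty": [10, 5, 7]})

# SQL paradigm: the query is a string handed to an engine
# (DuckDB can scan the in-scope Polars frame df by name)
sql_out = duckdb.sql("SELECT flag, SUM(qty) AS total FROM df GROUP BY flag").pl()

# DataFrame paradigm: the query is composed by method chaining and optimized lazily
api_out = df.lazy().group_by("flag").agg(pl.col("qty").sum().alias("total")).collect()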
Side-by-Side Comparison¶
| Aspect | SQL Platforms | DataFrame Platforms |
|---|---|---|
| Query Format | SQL strings | Native API calls (method chaining) |
| Execution | Database engine | Library (in-process or distributed) |
| Scale Limit | Petabyte+ (cloud) | Varies: SF 10-100 (single-node), SF 1000+ (PySpark) |
| Use Case | Production analytics, data warehousing | Data science, ML pipelines, library evaluation |
| Debugging | EXPLAIN plans | Step-through execution, lazy plan inspection |
| Type Safety | Runtime (SQL parsing) | Some compile-time (IDE support) |
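The Debugging row is easy to demonstrate: SQL engines expose EXPLAIN output, while lazy DataFrame libraries can print their optimized plan before executing. A minimal sketch with DuckDB and Polars:
import duckdb
import polars as pl

# SQL side: ask the engine for its query plan
duckdb.sql("EXPLAIN SELECT 1 + 1").show()

# DataFrame side: inspect the optimized lazy plan before collecting
lf = pl.LazyFrame({"x": [1, 2, 3]}).filter(pl.col("x") > 1)
print(lf.explain())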
DataFrame Platform Quick Reference¶
| Platform | CLI Name | Best For | Max Scale Factor |
|---|---|---|---|
| Polars | polars-df | High-performance single-node analytics | SF 100 (lazy eval) |
| Pandas | pandas-df | Data science prototyping, compatibility | SF 10 (memory-bound) |
| PySpark | pyspark-df | Distributed processing, big data | SF 1000+ (cluster) |
| DataFusion | datafusion-df | Arrow-native workflows, Rust performance | SF 50+ |
For detailed DataFrame platform documentation, see the DataFrame Platforms Guide.
Detailed Selection Criteria¶
1. Data Scale Requirements¶
Small Scale (< 1GB)¶
Platform Options: DuckDB, DataFusion, SQLite
# Commonly used for development and testing
from benchbox import TPCH
from benchbox.platforms.duckdb import DuckDBAdapter
benchmark = TPCH(scale_factor=0.1) # ~100MB
adapter = DuckDBAdapter() # In-memory processing
results = adapter.run_benchmark(benchmark)
# Or use DataFusion for PyArrow-based workflows
from benchbox.platforms.datafusion import DataFusionAdapter
adapter = DataFusionAdapter(
memory_limit="4G",
data_format="parquet"
)
results = adapter.run_benchmark(benchmark)
Characteristics:
No additional setup or configuration required
Embedded databases suitable for small datasets
Free and lightweight
Commonly used in CI/CD pipelines
DataFusion offers PyArrow integration for Arrow-based workflows
Medium Scale (1GB - 100GB)¶
Platform Options: DuckDB, DataFusion, ClickHouse, BigQuery
# DuckDB with persistent storage
adapter = DuckDBAdapter(
database_path="analytics.duckdb",
memory_limit="16GB"
)
# Or ClickHouse for distributed processing
from benchbox.platforms.clickhouse import ClickHouseAdapter
adapter = ClickHouseAdapter(
host="clickhouse-cluster",
max_memory_usage="32GB",
max_threads=16
)
Characteristics:
Balance of performance and cost
Support for complex analytical queries
Moderate resource requirements
Large Scale (100GB+)¶
Platform Options: BigQuery, Snowflake, Databricks, Redshift
# Serverless architecture with BigQuery
from benchbox.platforms.bigquery import BigQueryAdapter
adapter = BigQueryAdapter(
project_id="analytics-project",
dataset_id="large_scale_benchmarks",
location="US",
job_priority="INTERACTIVE"
)
Characteristics:
Support for petabyte-scale analytics
Auto-scaling infrastructure
Managed cloud data warehouse platforms
2. Budget and Cost Considerations¶
Free/Open Source Solutions¶
| Platform | Cost Structure | Ongoing Costs |
|---|---|---|
| DuckDB | Completely free | Hardware only |
| DataFusion | Completely free | Hardware only |
| SQLite | Completely free | Hardware only |
| ClickHouse | Open source | Infrastructure costs |
Example Setup:
# Completely free solution with DuckDB
adapter = DuckDBAdapter(
database_path="cost_effective.duckdb",
memory_limit="8GB" # Use available RAM efficiently
)
# Or DataFusion for in-memory analytics
from benchbox.platforms.datafusion import DataFusionAdapter
adapter = DataFusionAdapter(
memory_limit="8G",
data_format="parquet" # Better performance
)
Pay-per-Use Cloud Solutions¶
| Platform | Pricing Model | Cost Control |
|---|---|---|
| BigQuery | $5/TB queried | Query limits, slots |
| Databricks | DBU consumption | Auto-termination |
Cost-Optimized Configuration:
# BigQuery with cost controls
adapter = BigQueryAdapter(
maximum_bytes_billed=1000000000, # 1GB limit per query
job_priority="BATCH", # Lower cost
query_cache=True # Reuse results
)
Provisioned Cloud Solutions¶
| Platform | Pricing Model | Cost Optimization |
|---|---|---|
| Snowflake | Warehouse time | Auto-suspend/resume |
| Redshift | Cluster hours | Reserved instances |
Cost-Optimized Configuration:
# Snowflake with aggressive auto-suspend
adapter = SnowflakeAdapter(
warehouse_size="MEDIUM",
auto_suspend=60, # 1 minute
auto_resume=True
)
3. Performance Requirements¶
Latency-Sensitive Applications¶
Platform Options: DuckDB, ClickHouse
# DuckDB: In-process query execution
adapter = DuckDBAdapter(
memory_limit="32GB", # Keep data in memory
thread_limit=None # Use all CPU cores
)
# ClickHouse: Column-oriented analytical database
adapter = ClickHouseAdapter(
max_threads=32,
max_memory_usage="64GB",
compression=True # Reduce network I/O
)
Characteristics:
In-process or local execution reduces network overhead
Column-oriented storage optimized for analytical queries
Query latency depends on data volume, query complexity, and hardware resources
Throughput-Optimized Workloads¶
Platform Options: BigQuery, Snowflake, Databricks
# BigQuery: Serverless auto-scaling
adapter = BigQueryAdapter(
job_priority="INTERACTIVE",
query_cache=False, # Always compute fresh results
location="US" # Multi-region deployment
)
# Snowflake: Multi-cluster scaling
adapter = SnowflakeAdapter(
warehouse_size="X-LARGE",
multi_cluster_warehouse=True,
auto_suspend=600 # 10 minutes for sustained workloads
)
Performance Characteristics:
Concurrent user support: Hundreds to thousands of users
Auto-scaling: Automatic resource adjustment available
Query queuing: Workload management capabilities
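If you want to measure throughput rather than single-query latency, a rough approach is to issue many queries concurrently and divide by wall-clock time. The sketch below uses a local DuckDB connection purely as a stand-in for any platform; the query, worker count, and query count are arbitrary:
import time
from concurrent.futures import ThreadPoolExecutor

import duckdb

con = duckdb.connect()

def worker(_):
    # Each thread gets its own cursor on the shared connection
    con.cursor().execute("SELECT count(*) FROM range(1000000)").fetchall()

n_queries = 64
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(worker, range(n_queries)))
elapsed = time.perf_counter() - start
print(f"{n_queries / elapsed:.1f} queries/sec")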
4. Operational Requirements¶
Minimal Operations (Fully Managed)¶
Platform Options: BigQuery, Snowflake
# Fully managed serverless architecture
adapter = BigQueryAdapter(
# No infrastructure to manage
# Automatic optimization
# Built-in monitoring
)
Characteristics:
No database administration required
Automatic updates and maintenance
Built-in monitoring and alerting
High uptime SLAs (typically 99.9%+)
Some Operations Acceptable¶
Platform Options: Databricks, Redshift, ClickHouse (managed)
# Managed service with configuration options
adapter = DatabricksAdapter(
cluster_size="large",
auto_terminate_minutes=30, # Cost control
runtime_engine="PHOTON" # Vectorized query engine
)
Characteristics:
Managed infrastructure with user control
Configurable performance settings
Access to advanced features
Professional support available
Full Control Required¶
Platform Options: ClickHouse (self-hosted), DuckDB
# Self-managed deployment
adapter = ClickHouseAdapter(
host="your-optimized-cluster.company.com",
# Custom cluster configuration
# User-managed monitoring and alerting
# Complete data control
)
Characteristics:
Complete infrastructure control
Custom optimization opportunities
Data locality and privacy
No vendor dependencies
5. Team and Organizational Factors¶
Individual Developer/Researcher¶
Platform Options: DuckDB
# No additional setup or configuration required
adapter = DuckDBAdapter()
benchmark = TPCH(scale_factor=1)
results = adapter.run_benchmark(benchmark)
# Included by default with BenchBox
Characteristics:
No database administration knowledge required
Comprehensive documentation and community
Reproducible results across environments
Focus on analysis rather than infrastructure
Small Team (2-10 people)¶
Platform Options: DuckDB, BigQuery, or ClickHouse
# Shared cloud project access
adapter = BigQueryAdapter(
project_id="team-analytics-project",
dataset_id="shared_benchmarks",
# Shared Google Cloud project
# Collaboration features available
# Cost visibility and controls
)
Characteristics:
Shared access and collaboration capabilities
Cost transparency and controls
Minimal administrative overhead
Scalable as team grows
Medium Team (10-50 people)¶
Platform Options: Snowflake, Databricks
# Enterprise features for growing teams
adapter = SnowflakeAdapter(
# Role-based access control
# Multiple warehouses for different workloads
# Query result sharing
# Usage monitoring and governance
)
Characteristics:
Advanced security and governance
Workload isolation capabilities
Resource management features
Enterprise support available
Large Enterprise (50+ people)¶
Platform Options: Snowflake, Databricks, Redshift
# Enterprise deployment
adapter = DatabricksAdapter(
# Unity Catalog for governance
# Multiple workspaces
# Advanced security features
# Enterprise support and SLAs
)
Characteristics:
Enterprise security and compliance features
Advanced governance capabilities
Professional services and support
Integration with enterprise tools
6. Technical Ecosystem¶
Python-Centric Environment¶
All platforms supported. DuckDB and Databricks offer native Python integration:
# DuckDB: Native Python integration
import duckdb
import pandas as pd
# Direct DataFrame integration
df = pd.read_csv("data.csv")
result = duckdb.query("SELECT * FROM df WHERE value > 100").to_df()
# Databricks: Spark/Python integration
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("BenchBox").getOrCreate()
SQL-Heavy Teams¶
Platform Options: BigQuery, Snowflake, Redshift
# ANSI SQL support with platform-specific extensions
adapter = BigQueryAdapter()
# ANSI SQL dialect support
# Platform-managed optimization
# Standard SQL interface
Multi-Language Requirements¶
Platform Options: ClickHouse, Databricks
# ClickHouse: Multiple client libraries available
# - Python, Java, Go, C++, JavaScript
# - HTTP API for any language
# - JDBC/ODBC drivers
# Databricks: Spark ecosystem language support
# - Scala, Python, Java, R, SQL
# - REST APIs
# - Notebook interface
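As one concrete example of that language-agnostic surface, ClickHouse's HTTP interface (default port 8123) can be queried from anything that speaks HTTP. A minimal sketch; the host and credentials are placeholders:
import requests

resp = requests.get(
    "http://clickhouse.example.com:8123",
    params={"query": "SELECT version()"},
    auth=("default", ""),  # placeholder credentials
    timeout=10,
)
print(resp.text)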
7. Cloud Strategy¶
Google Cloud Platform¶
Native platform: BigQuery
Alternative options: ClickHouse (GKE), DuckDB
# GCP-native integration
adapter = BigQueryAdapter(
project_id="gcp-project",
location="US", # Or EU for data residency
# Integration with:
# - Cloud Storage
# - Cloud Functions
# - Dataflow
# - AI/ML services
)
Amazon Web Services¶
Native platform: Redshift
Alternative options: DuckDB, ClickHouse (EKS)
# AWS-native integration
adapter = RedshiftAdapter(
# Integration with:
# - S3 for data loading
# - IAM for security
# - CloudWatch for monitoring
# - Lambda for automation
)
Microsoft Azure¶
Multi-cloud platforms available on Azure: Snowflake, Databricks
Azure-native platforms (planned): Microsoft Fabric, Azure Synapse Analytics
Alternative options: DuckDB
Current support:
# Multi-cloud platforms available on Azure
adapter = SnowflakeAdapter(
# Available on Azure infrastructure
# Integration with Azure services
# Enterprise features
)
adapter = DatabricksAdapter(
# Available on Azure (and likewise on AWS and GCP)
# Unity Catalog governance
# Lakehouse architecture
)
Planned Azure-native platforms:
Microsoft Azure’s native analytics platforms are under evaluation for future BenchBox releases. See the Azure Platforms page for details on:
Microsoft Fabric - Unified SaaS data platform (OneLake, integrated workloads)
Azure Synapse Analytics - Enterprise data warehouse (dedicated/serverless SQL pools)
Azure Data Explorer - Real-time analytics engine (Kusto/KQL)
Azure users can currently benchmark using Databricks or Snowflake, both fully supported on Azure infrastructure.
Multi-Cloud or Cloud-Agnostic¶
Multi-cloud platforms: Snowflake, Databricks
Cloud-agnostic options: DuckDB, ClickHouse
# Multi-cloud deployment
adapter = SnowflakeAdapter(
# Available on AWS, Azure, GCP
# Consistent interface across clouds
# Reduced cloud vendor lock-in
)
On-Premises or Hybrid¶
Self-hosted options: ClickHouse, DuckDB
Hybrid option: Databricks (private cloud)
# On-premises deployment
adapter = ClickHouseAdapter(
host="on-prem-cluster.company.com",
# Complete data control
# No cloud dependencies
# Custom security policies
)
Selection Criteria Summary¶
By Primary Goal¶
Learning/Research
├── Small data (< 1GB): DuckDB (included by default)
└── Large data (> 1GB): DuckDB or BigQuery (has free tier)
Development/Testing
├── Local development: DuckDB (no dependencies)
├── CI/CD pipeline: SQLite or DuckDB (embedded)
└── Integration testing: DuckDB (fast setup)
Production Analytics
├── Open-source: ClickHouse (self-hosted)
└── Cloud-native: BigQuery, Snowflake, or Databricks (managed)
Performance Benchmarking
├── Cross-platform comparison: Multiple platforms
└── Single-platform optimization: Platform-specific testing
By Key Decision Factors¶
1. Data scale considerations:
├── < 1GB → DuckDB or SQLite (embedded databases)
├── 1-100GB → DuckDB, ClickHouse, or cloud platforms
└── > 100GB → Cloud platforms (BigQuery, Snowflake, Databricks, Redshift)
2. Budget model:
├── No cost → DuckDB, SQLite, ClickHouse (self-hosted)
├── Pay-per-use → BigQuery, Databricks (usage-based)
└── Predictable costs → Snowflake, Redshift (provisioned)
3. Team size:
├── Individual → DuckDB (minimal setup)
├── Small team → DuckDB, BigQuery (shared access)
├── Medium team → Snowflake, Databricks (governance features)
└── Large enterprise → Snowflake, Databricks, Redshift (enterprise features)
4. Cloud environment:
├── Google Cloud → BigQuery (GCP-native)
├── AWS → Redshift (AWS-native)
├── Azure → Snowflake, Databricks (available on Azure)
├── Multi-cloud → Snowflake, Databricks (multi-cloud support)
└── On-premises → ClickHouse, DuckDB (self-hosted)
5. Architecture characteristics:
├── In-process columnar → DuckDB, ClickHouse (no network overhead)
├── Serverless/auto-scaling → BigQuery, Snowflake (elastic compute)
└── Managed clusters → Various cloud platforms available
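If it helps to have the first branches as executable logic, a toy helper might look like the following; the thresholds and platform strings are illustrative and not part of any BenchBox API:
def choose_platform(data_gb: float, self_hosted_ok: bool = False) -> str:
    """Toy encoding of the data-scale and budget branches above."""
    if data_gb < 1:
        return "duckdb"  # embedded, zero setup
    if data_gb <= 100:
        return "clickhouse" if self_hosted_ok else "bigquery"
    return "bigquery"  # or snowflake / databricks / redshift

print(choose_platform(0.5))  # duckdb
print(choose_platform(500))  # bigquery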
Platform Combination Strategies¶
Development to Production Pipeline¶
# Development: Fast, local, free
dev_adapter = DuckDBAdapter()
# Testing: Automated, reproducible
test_adapter = DuckDBAdapter(database_path="test.duckdb")
# Production: Scalable, managed
prod_adapter = SnowflakeAdapter(
warehouse_size="LARGE",
database="PRODUCTION"
)
# Use same benchmark code across all environments
benchmark = TPCH(scale_factor=0.1) # Small for dev/test
prod_benchmark = TPCH(scale_factor=100) # Large for prod
Multi-Platform Validation¶
# Validate results across platforms
platforms = [
("DuckDB", DuckDBAdapter()),
("ClickHouse", ClickHouseAdapter(host="test-cluster")),
("BigQuery", BigQueryAdapter(project_id="validation-project")),
]
benchmark = TPCH(scale_factor=1)
results = {}
for name, adapter in platforms:
results[name] = adapter.run_benchmark(benchmark)
print(f"{name}: {results[name].total_time:.2f}s")
# Compare query results for consistency
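A small follow-up normalizes each platform's wall-clock time against the DuckDB baseline. It relies only on the total_time attribute already used above; comparing per-query results for correctness would require the per-query result API:
# Continues the loop above: relative timings per platform
baseline = results["DuckDB"].total_time
for name, result in results.items():
    print(f"{name}: {result.total_time / baseline:.2f}x relative to DuckDB")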
Cost-Performance Optimization¶
# Start with cost-effective option
adapter = DuckDBAdapter(memory_limit="16GB")
results = adapter.run_benchmark(benchmark)
target_sla = 60.0  # example SLA threshold in seconds
if results.total_time > target_sla:
# Upgrade to cloud platform
adapter = SnowflakeAdapter(warehouse_size="LARGE")
results = adapter.run_benchmark(benchmark)
# Monitor costs and performance over time
Migration Strategies¶
From SQLite to Production¶
# Phase 1: Validate on SQLite (testing only)
sqlite_adapter = SQLiteAdapter()
small_benchmark = ReadPrimitivesBenchmark(scale_factor=0.001)
sqlite_results = sqlite_adapter.run_benchmark(small_benchmark)
# Phase 2: Scale testing with DuckDB
duckdb_adapter = DuckDBAdapter()
medium_benchmark = TPCH(scale_factor=1)
duckdb_results = duckdb_adapter.run_benchmark(medium_benchmark)
# Phase 3: Production deployment
prod_adapter = BigQueryAdapter(project_id="production")
prod_benchmark = TPCH(scale_factor=100)
prod_results = prod_adapter.run_benchmark(prod_benchmark)
From DuckDB to Cloud¶
# Development on DuckDB
dev_config = {"scale_factor": 0.1}
dev_adapter = DuckDBAdapter()
# Migration validation
migration_config = {"scale_factor": 1}
cloud_adapter = BigQueryAdapter(
project_id="migration-test",
maximum_bytes_billed=1000000000 # Cost protection
)
# Production deployment
prod_config = {"scale_factor": 100}
prod_adapter = BigQueryAdapter(
project_id="production",
job_priority="INTERACTIVE"
)
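One way to tie the stages together is to drive each stage's adapter with its config. This sketch continues the block above and uses only the TPCH and run_benchmark calls shown earlier:
# Continues the block above: run each migration stage in order
for config, adapter in [
    (dev_config, dev_adapter),
    (migration_config, cloud_adapter),
    (prod_config, prod_adapter),
]:
    benchmark = TPCH(scale_factor=config["scale_factor"])
    results = adapter.run_benchmark(benchmark)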
Common Anti-Patterns¶
❌ Wrong Platform Choices¶
Using SQLite for large datasets
# DON'T: SQLite with TPC-H SF=10
adapter = SQLiteAdapter()
benchmark = TPCH(scale_factor=10)  # Will be extremely slow
Using expensive platforms for development
# DON'T: Snowflake 4X-Large for development
adapter = SnowflakeAdapter(warehouse_size="4X-LARGE")  # $$$
benchmark = TPCH(scale_factor=0.01)  # Tiny dataset
Ignoring cost controls
# DON'T: No cost limits on pay-per-query platforms
adapter = BigQueryAdapter()  # No maximum_bytes_billed
# Risk: Unexpected large bills
✅ Better Alternatives¶
Progressive scaling approach
# Start small, scale up as needed
if dataset_size < 1_000_000:
    adapter = DuckDBAdapter()
elif dataset_size < 100_000_000:
    adapter = ClickHouseAdapter()
else:
    adapter = BigQueryAdapter()
Cost-aware cloud usage
# Always set cost controls
adapter = BigQueryAdapter(
    maximum_bytes_billed=5_000_000_000,  # 5GB limit
    job_priority="BATCH",  # Lower cost
)
Environment-specific configurations
import os

if os.getenv("ENVIRONMENT") == "development":
    adapter = DuckDBAdapter()
elif os.getenv("ENVIRONMENT") == "testing":
    adapter = DuckDBAdapter(database_path="test.duckdb")
else:  # production
    adapter = SnowflakeAdapter(warehouse_size="LARGE")
Getting Started¶
First-Time Users¶
DuckDB is included by default
uv add benchbox
from benchbox import TPCH
from benchbox.platforms.duckdb import DuckDBAdapter

benchmark = TPCH(scale_factor=0.1)
adapter = DuckDBAdapter()
results = adapter.run_benchmark(benchmark)
Scale factor options
0.001 (1MB): Rapid iteration
0.01 (10MB): Quick testing
0.1 (100MB): Small dataset testing
1.0 (1GB): Medium-scale testing
Available benchmarks
TPC-H: Industry-standard analytical benchmark
ClickBench: Real-world analytical queries
Primitives: Fundamental operation testing
Moving to Production¶
Evaluate cloud platforms with test datasets
Configure cost controls and monitoring
Validate with realistic data volumes
Implement security and governance policies
Plan for capacity and growth
Expert Tips¶
Performance Optimization¶
Match platform to workload characteristics
OLAP-heavy: ClickHouse, BigQuery
Mixed workloads: Snowflake, Databricks
Development: DuckDB
Use appropriate scale factors for testing
Development: 0.001-0.01
Integration testing: 0.1-1
Performance testing: 10-100+
Leverage platform-specific optimizations
DuckDB: Memory sizing, threading (see the sketch after this list)
ClickHouse: Partitioning, ordering
BigQuery: Partitioning, clustering
Snowflake: Warehouse sizing, clustering
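For the DuckDB bullet, the same knobs can also be set directly on a DuckDB connection; memory_limit and threads are standard DuckDB settings, and the adapter's memory_limit/thread_limit options shown earlier presumably map to them:
import duckdb

con = duckdb.connect("analytics.duckdb")
con.execute("SET memory_limit = '16GB'")  # memory sizing
con.execute("SET threads = 8")            # threading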
Cost Management¶
Implement cost controls early
Monitor usage patterns and optimize
Use development/production separation
Consider reserved capacity for predictable workloads
Operational Excellence¶
Automate deployments and scaling
Implement proper monitoring and alerting
Plan for disaster recovery and backup
Document configuration and procedures
Summary¶
Platform selection depends on your specific requirements. Common platform choices by use case:
SQL Platforms¶
Initial setup: DuckDB (included by default, no dependencies)
No-cost options: DuckDB, SQLite, or ClickHouse (self-hosted)
In-process execution: DuckDB or ClickHouse (columnar architecture, no network overhead)
Enterprise teams: Snowflake or Databricks (governance features)
Cloud-native deployments: BigQuery, Snowflake, or Databricks (managed services)
AWS environments: Redshift (AWS-native integration)
Self-hosted deployments: ClickHouse (full control)
DataFrame Platforms¶
Single-node analytics: Polars-df (multi-threaded, lazy evaluation)
Data science prototyping: Pandas-df (familiar API, ecosystem compatibility)
Distributed big data: PySpark-df (cluster-scale processing)
Arrow-native workflows: DataFusion-df (Rust-based, Arrow integration)
GPU acceleration: cuDF-df (NVIDIA RAPIDS, coming soon)
DuckDB is included with BenchBox by default for SQL benchmarking. Polars is included for DataFrame benchmarking. Cloud platforms and additional DataFrame libraries can be added as requirements grow.
See Also¶
Platform Documentation¶
DataFrame Platforms Guide - Native DataFrame API benchmarking (Polars, Pandas, PySpark, DataFusion)
Platform Quick Reference - Quick setup guides for all platforms
Platform Comparison Matrix - Feature and performance comparison
ClickHouse Local Mode - Running ClickHouse locally
Future Platforms - Upcoming platform support
Cloud Storage - Cloud storage integration (S3, GCS, Azure)
Compression - Data compression strategies
Getting Started¶
Getting Started Guide - Run your first benchmark in 5 minutes
Installation Guide - Installation and setup
CLI Quick Reference - Command-line usage
Examples - Code snippets and patterns
Understanding BenchBox¶
Architecture Overview - How BenchBox works
Workflow Patterns - Common workflows (dev, cloud, CI/CD)
Data Model - Result schemas and analysis
Glossary - Platform and benchmark terminology
Benchmarks¶
Benchmark Catalog - Available benchmarks
TPC-H Benchmark - Standard analytical benchmark
TPC-DS Benchmark - Complex decision support
ClickBench - Real-world analytics
Advanced Topics¶
Performance Guide - Performance tuning and optimization
Custom Benchmarks - Creating custom benchmarks
Configuration Handbook - Advanced configuration options
Dry Run Mode - Preview queries without execution