Multi-Platform Database Support¶
BenchBox supports running benchmarks across multiple database platforms through its platform adapter architecture. This allows you to compare performance, validate query compatibility, and test your applications across different database systems.
Looking ahead? See potential future platforms for research notes on the engines we plan to evaluate next.
Supported Platforms¶
SQL Platforms¶
| Platform | Status | Description | Installation |
|---|---|---|---|
| DuckDB | ✅ Built-in | In-process analytical database | (built-in) |
| DataFusion | ✅ Available | In-memory query engine (Apache Arrow) | |
| ClickHouse | ✅ Available | Column-oriented OLAP database | uv add 'benchbox[clickhouse]' |
| Databricks | ✅ Available | Data Intelligence Platform (lakehouse) | uv add 'benchbox[databricks]' |
| BigQuery | ✅ Available | Serverless data warehouse (Google Cloud) | |
| Redshift | ✅ Available | Cloud data warehouse (AWS) | |
| Snowflake | ✅ Available | Data Cloud / Multi-cloud data warehouse | |
| Trino | ✅ Available | Distributed SQL (Trino/Starburst) | |
| PrestoDB | ✅ Available | Distributed SQL (Meta’s Presto) | |
| SQLite | ✅ Built-in | Embedded transactional database | (built-in) |
| Azure Platforms | 🔄 Planned | Microsoft Fabric, Azure Synapse, Azure Data Explorer | See Azure Platforms |
DataFrame Platforms (Native API)¶
BenchBox supports benchmarking DataFrame libraries using their native APIs instead of SQL. This enables direct performance comparison between SQL and DataFrame paradigms on identical workloads. See DataFrame Platforms for full details.
| Platform | CLI Name | Status | Family | Description | Installation |
|---|---|---|---|---|---|
| Polars | polars-df | ✅ Production | Expression | Fast Rust-based DataFrame library with lazy evaluation | (core dependency) |
| Pandas | pandas-df | ✅ Production | Pandas | Reference Pandas implementation | |
| PySpark | pyspark-df | ✅ Production | Expression | Apache Spark DataFrame API (distributed) | |
| DataFusion | datafusion-df | ✅ Production | Expression | Arrow-native query engine | |
| Modin | | 🔄 Planned | Pandas | Distributed Pandas replacement | |
| Dask | | 🔄 Planned | Pandas | Parallel computing DataFrames | |
| cuDF | | 🔄 Planned | Pandas | NVIDIA GPU-accelerated DataFrames | |
Quick Start:
# Run TPC-H on DataFrame platforms
benchbox run --platform polars-df --benchmark tpch --scale 0.1
benchbox run --platform pandas-df --benchmark tpch --scale 0.1
benchbox run --platform pyspark-df --benchmark tpch --scale 0.1
benchbox run --platform datafusion-df --benchmark tpch --scale 0.1
Quick Start¶
1. Install Dependencies¶
Install all cloud platforms at once:
uv add 'benchbox[cloud]'
Or install individual platforms:
# ClickHouse
uv add 'benchbox[clickhouse]'

# Databricks
uv add 'benchbox[databricks]'
2. Use the Platform Management CLI¶
BenchBox includes a dedicated CLI for managing database platforms, which simplifies installation, configuration, and validation.
List all available platforms and their status:
benchbox platforms list
Check the status of a specific platform (e.g., Databricks):
benchbox platforms status databricks
Install missing libraries for a platform (guided):
benchbox platforms install bigquery
Enable or disable a platform:
benchbox platforms enable snowflake
benchbox platforms disable sqlite
Run an interactive setup wizard:
benchbox platforms setup
3. Run Multi-Platform Benchmark¶
from benchbox.platforms import get_platform_adapter
from benchbox import TPCH

# Create benchmark
benchmark = TPCH(scale_factor=0.1)

# Test on multiple platforms
platforms = ["duckdb", "clickhouse"]
for platform_name in platforms:
    print(f"Running on {platform_name}...")
    try:
        adapter = get_platform_adapter(platform_name)
        results = adapter.run_benchmark(benchmark)
        print(f"Completed in {results.duration_seconds:.2f}s")
        print(f"Average query time: {results.average_query_time:.3f}s")
    except Exception as e:
        print(f"Could not run on {platform_name}: {e}")
DuckDB¶
Type: In-process analytical database
Common Use Cases: Development, testing, small to medium-scale analytics workloads
Configuration¶
from benchbox.platforms.duckdb import DuckDBAdapter
# In-memory database (default)
adapter = DuckDBAdapter()
# Persistent file database
adapter = DuckDBAdapter(database_path="benchmark.duckdb")
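Either adapter can then execute a benchmark directly. A short sketch using the TPCH benchmark and the run_benchmark call from the Quick Start:

from benchbox import TPCH
from benchbox.platforms.duckdb import DuckDBAdapter

# Persistent file database (see configuration above)
adapter = DuckDBAdapter(database_path="benchmark.duckdb")
results = adapter.run_benchmark(TPCH(scale_factor=0.1))
print(f"Completed in {results.duration_seconds:.2f}s")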
Apache DataFusion¶
Type: In-memory query engine (Apache Arrow-based)
Common Use Cases: In-process analytics, rapid prototyping, PyArrow workflows, memory-constrained OLAP
Configuration¶
from benchbox.platforms.datafusion import DataFusionAdapter

# In-memory analytics with Parquet (recommended)
adapter = DataFusionAdapter(
    working_dir="./datafusion_working",
    memory_limit="16G",
    data_format="parquet"  # or "csv" for lower memory
)

# Memory-constrained configuration
adapter = DataFusionAdapter(
    memory_limit="4G",
    data_format="csv",
    target_partitions=4
)
ClickHouse¶
Type: Column-oriented OLAP database
Common Use Cases: Analytical workloads, OLAP queries, real-time analytics
Configuration¶
from benchbox.platforms.clickhouse import ClickHouseAdapter
adapter = ClickHouseAdapter(
    host="localhost",
    port=9000,
    database="benchmark",
    username="default",
    password=""
)
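Before a long run, it can be worth a quick connectivity check. A sketch, assuming the ClickHouse adapter exposes the same test_connection helper shown for Databricks under Troubleshooting:

from benchbox.platforms.clickhouse import ClickHouseAdapter

adapter = ClickHouseAdapter(host="localhost", port=9000)
try:
    adapter.test_connection()  # assumed helper; see Troubleshooting below
    print("ClickHouse connection OK")
except Exception as e:
    print(f"Connection failed: {e}")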
Databricks¶
Type: Data Intelligence Platform (lakehouse architecture)
Common Use Cases: SQL analytics, ML/data science workflows, lakehouse deployments
Configuration¶
from benchbox.platforms.databricks import DatabricksAdapter
adapter = DatabricksAdapter(
    server_hostname="dbc-12345678-abcd.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abcd1234efgh5678",
    access_token="dapi1234567890abcdef",
    catalog="hive_metastore",
    schema="default"
)
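Rather than hard-coding a token, credentials can be read from the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables used under Troubleshooting. A sketch under that assumption; DATABRICKS_HTTP_PATH is a hypothetical variable name chosen for this example:

import os

from benchbox.platforms.databricks import DatabricksAdapter

# DATABRICKS_HOST is a full URL (https://...); the adapter's
# server_hostname parameter takes the bare hostname.
host = os.environ["DATABRICKS_HOST"].removeprefix("https://")

adapter = DatabricksAdapter(
    server_hostname=host,
    http_path=os.environ["DATABRICKS_HTTP_PATH"],  # hypothetical env var
    access_token=os.environ["DATABRICKS_TOKEN"],
)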
BigQuery¶
Type: Serverless data warehouse (Google Cloud)
Common Use Cases: Large-scale analytics, petabyte-scale datasets, GCP-native applications
Configuration¶
from benchbox.platforms.bigquery import BigQueryAdapter
adapter = BigQueryAdapter(
    project_id="my-benchbox-project",
    dataset_id="benchbox_test",
    credentials_path="/path/to/service-account-key.json",
    location="US"
)
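If you prefer not to pass credentials_path, the Google Cloud client libraries fall back to Application Default Credentials (set via GOOGLE_APPLICATION_CREDENTIALS or gcloud auth application-default login; see Troubleshooting). A sketch assuming the adapter defers to that lookup when the path is omitted:

from benchbox.platforms.bigquery import BigQueryAdapter

# No credentials_path: assumes Application Default Credentials
# are already configured in the environment.
adapter = BigQueryAdapter(
    project_id="my-benchbox-project",
    dataset_id="benchbox_test",
    location="US",
)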
Redshift¶
Type: Cloud data warehouse (AWS)
Common Use Cases: AWS-native analytics, variable workloads, serverless or provisioned deployments
Configuration¶
from benchbox.platforms.redshift import RedshiftAdapter
adapter = RedshiftAdapter(
    host="benchbox-workgroup.123456.us-east-1.redshift-serverless.amazonaws.com",
    port=5439,
    database="benchbox",
    username="admin",
    password="SecurePassword123",
    is_serverless=True,
    workgroup_name="benchbox-workgroup"
)
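For a provisioned (non-serverless) cluster, the same parameters should apply minus the workgroup settings. A sketch under that assumption, with a placeholder cluster endpoint:

from benchbox.platforms.redshift import RedshiftAdapter

# Provisioned cluster: point at the cluster endpoint
# and skip the serverless workgroup options.
adapter = RedshiftAdapter(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    database="benchbox",
    username="admin",
    password="SecurePassword123",
    is_serverless=False,
)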
Snowflake¶
Type: Data Cloud (multi-cloud data warehouse)
Common Use Cases: Enterprise analytics, multi-cloud deployments, elastic scaling workloads
Configuration¶
from benchbox.platforms.snowflake import SnowflakeAdapter
adapter = SnowflakeAdapter(
    account="xy12345.us-east-1",
    username="benchbox_user",
    password="secure_password_123",
    warehouse="COMPUTE_WH",
    database="BENCHBOX",
    schema="PUBLIC"
)
SQLite¶
Type: Embedded transactional database
Common Use Cases: Testing, development, small datasets, CI/CD validation
Configuration¶
from benchbox.platforms.sqlite import SQLiteAdapter
# In-memory database
adapter = SQLiteAdapter()
# File-based database
adapter = SQLiteAdapter(database_path="benchmark.db")
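Because SQLite runs in-process with no server to manage, the in-memory adapter makes a cheap smoke test for CI/CD. A minimal sketch reusing TPCH and run_benchmark from the Quick Start:

from benchbox import TPCH
from benchbox.platforms.sqlite import SQLiteAdapter

# In-memory database: nothing to provision or clean up,
# which suits CI/CD validation runs.
adapter = SQLiteAdapter()
results = adapter.run_benchmark(TPCH(scale_factor=0.01))
print(f"Average query time: {results.average_query_time:.3f}s")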
Troubleshooting¶
Common Issues Across Platforms¶
Connection Errors¶
Problem: Unable to connect to database
Solutions by Platform:
DuckDB:
# Check file permissions
import os
print(os.access("benchmark.duckdb", os.W_OK))  # True if the file is writable

# Use an absolute path
adapter = DuckDBAdapter(database_path="/full/path/to/benchmark.duckdb")
ClickHouse:
# Verify the server is running
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
result = sock.connect_ex(('localhost', 9000))
if result == 0:
    print("ClickHouse is running")
sock.close()

# Check credentials
from benchbox.platforms.clickhouse import ClickHouseAdapter

adapter = ClickHouseAdapter(
    host="localhost",
    port=9000,
    username="default",
    password=""  # Empty password for the default user
)
Cloud Platforms (Databricks, BigQuery, Snowflake, Redshift):
# Verify environment variables
import os
print(f"DATABRICKS_TOKEN: {'SET' if os.getenv('DATABRICKS_TOKEN') else 'NOT SET'}")
print(f"DATABRICKS_HOST: {os.getenv('DATABRICKS_HOST')}")

# Test connection before running benchmark
from benchbox.platforms.databricks import DatabricksAdapter

adapter = DatabricksAdapter()
try:
    adapter.test_connection()
    print("Connection successful")
except Exception as e:
    print(f"Connection failed: {e}")
Authentication Issues¶
BigQuery:
# Set credentials
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
# Or use application default credentials
gcloud auth application-default login
Databricks:
# Personal access token
export DATABRICKS_TOKEN="dapi..."
export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
# Or use Databricks CLI config
databricks configure --token
Snowflake:
# Use key-pair authentication
from benchbox.platforms.snowflake import SnowflakeAdapter

adapter = SnowflakeAdapter(
    account="xy12345",
    username="user",
    private_key_path="/path/to/rsa_key.p8",
    private_key_passphrase="passphrase"
)
Out of Memory Errors¶
DuckDB:
# Set memory limit
adapter = DuckDBAdapter(memory_limit="4GB")
# Use persistent database for large datasets
adapter = DuckDBAdapter(
    database_path="large_dataset.duckdb",
    memory_limit="8GB"
)
ClickHouse:
# Increase memory limits
adapter = ClickHouseAdapter(
    host="localhost",
    settings={
        "max_memory_usage": "10000000000",  # 10GB
        "max_bytes_before_external_sort": "5000000000"
    }
)
Cloud Platforms:
# BigQuery: Use query cache
from benchbox.platforms.bigquery import BigQueryAdapter

adapter = BigQueryAdapter(
    maximum_bytes_billed=10000000000,  # 10GB limit
    use_query_cache=True
)

# Snowflake: Increase warehouse size
from benchbox.platforms.snowflake import SnowflakeAdapter

adapter = SnowflakeAdapter(
    warehouse="LARGE_WH",  # or X-LARGE, 2X-LARGE
    warehouse_size="LARGE"
)
Slow Query Performance¶
General Debugging:
# Enable verbose logging
import logging
logging.basicConfig(level=logging.DEBUG)
# Run with profiling
adapter = DuckDBAdapter(enable_profiling=True)
# Test with smaller scale factor first
benchmark = TPCH(scale_factor=0.01) # Start small
Platform-Specific Optimizations:
DuckDB:
# Increase thread count
adapter = DuckDBAdapter(thread_limit=8)
# Use persistent database
adapter = DuckDBAdapter(database_path="cached.duckdb")
ClickHouse:
# Enable query optimizations
adapter = ClickHouseAdapter(
    host="localhost",
    settings={
        "max_threads": 8,
        "optimize_read_in_order": 1,
        "enable_filesystem_cache": 1
    }
)
Cloud Platforms:
# Databricks: Use larger cluster
adapter = DatabricksAdapter(
    http_path="/sql/1.0/warehouses/large-warehouse"
)

# BigQuery: Use batch priority for cost savings
adapter = BigQueryAdapter(
    job_priority="BATCH",  # Slower but cheaper
    use_legacy_sql=False
)
Data Loading Failures¶
Check file format:
# DuckDB supports multiple formats
import duckdb

conn = duckdb.connect()
conn.execute("""
    CREATE TABLE test AS
    SELECT * FROM read_parquet('data/*.parquet')
""")

# Or CSV with explicit options
conn.execute("""
    CREATE TABLE test AS
    SELECT * FROM read_csv('data/*.csv',
        delim='|',
        header=false,
        auto_detect=true)
""")
Verify file paths:
from pathlib import Path

data_dir = Path("./tpch_data")
if not data_dir.exists():
    print(f"Data directory not found: {data_dir}")
else:
    files = list(data_dir.glob("*.parquet"))
    print(f"Found {len(files)} data files")
Platform-Specific Issues¶
DuckDB¶
Issue: Database file is locked
# Solution: Ensure no other process is using the file
# Or use separate database files
adapter1 = DuckDBAdapter(database_path="db1.duckdb")
adapter2 = DuckDBAdapter(database_path="db2.duckdb")
ClickHouse¶
Issue: “Memory limit exceeded” errors
# Solution: Increase limits or enable external operations
adapter = ClickHouseAdapter(
    settings={
        "max_memory_usage": "20000000000",
        "max_bytes_before_external_group_by": "10000000000",
        "max_bytes_before_external_sort": "10000000000"
    }
)
Databricks¶
Issue: “Cluster not found” or “Warehouse not available”
# Solution: Verify HTTP path
from benchbox.platforms.databricks import DatabricksAdapter
# List available warehouses
adapter = DatabricksAdapter()
warehouses = adapter.list_warehouses() # If implemented
print(f"Available warehouses: {warehouses}")
# Use correct HTTP path format
adapter = DatabricksAdapter(
    http_path="/sql/1.0/warehouses/abc123def456"
)
BigQuery¶
Issue: “Exceeded quota” or billing errors
# Solution: Set cost controls
from benchbox.platforms.bigquery import BigQueryAdapter
adapter = BigQueryAdapter(
    maximum_bytes_billed=5000000000,  # 5GB limit
    job_priority="BATCH",  # Lower cost
    use_query_cache=True,  # Reuse cached results
    dry_run=True  # Test without execution first
)
Snowflake¶
Issue: Warehouse auto-suspended
# Solution: Configure auto-resume
from benchbox.platforms.snowflake import SnowflakeAdapter
adapter = SnowflakeAdapter(
    warehouse="COMPUTE_WH",
    auto_resume=True,
    auto_suspend=300  # 5 minutes
)
Getting Help¶
If you encounter issues not covered here:

1. Check logs: Enable verbose logging with the --verbose flag
2. Test connection: Use the platform's native client to verify connectivity
3. Review documentation: See the platform-specific guides below
4. Check GitHub issues: Search for similar problems
5. Create an issue: Report bugs with reproducible examples
See Also¶
Platform Documentation¶
DataFrame Platforms Guide - Native DataFrame API benchmarking (Polars, Pandas, PySpark, DataFusion)
Platform Selection Guide - Comprehensive platform comparison and selection criteria
Platform Comparison Matrix - Feature and performance comparison table
ClickHouse Local Mode - Running ClickHouse locally for development
Future Platforms - Upcoming platform support
API Reference¶
Python API Overview - Complete Python API documentation
DuckDB Adapter API - DuckDB adapter reference
Base Benchmark API - Core benchmark interface
Getting Started¶
Getting Started Guide - Run your first benchmark in 5 minutes
Installation Guide - Installation and setup instructions
CLI Quick Reference - Command-line usage guide
Configuration Handbook - Advanced configuration options
Benchmarks¶
Benchmark Catalog - Available benchmarks overview
TPC-H Benchmark - Standard analytical benchmark
TPC-DS Benchmark - Complex decision support queries
ClickBench - Real-world analytics benchmark
Advanced Topics¶
Performance Guide - Performance tuning and optimization
Cloud Storage - S3, GCS, Azure Blob Storage integration
Compression - Data compression strategies
Dry Run Mode - Preview queries without execution
Troubleshooting Guide - Comprehensive troubleshooting