Multi-Platform Database Support

Tags: beginner, reference

BenchBox supports running benchmarks across multiple database platforms through its platform adapter architecture. This allows you to compare performance, validate query compatibility, and test your applications across different database systems.
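At a glance, every platform is driven through the same adapter interface. A minimal sketch using the API shown in the Quick Start below:

from benchbox import TPCH
from benchbox.platforms import get_platform_adapter

# Any platform name from the tables below works here
adapter = get_platform_adapter("duckdb")
results = adapter.run_benchmark(TPCH(scale_factor=0.01))
print(f"Completed in {results.duration_seconds:.2f}s")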

Looking ahead? See potential future platforms for research notes on the engines we plan to evaluate next.

Supported Platforms

SQL Platforms

| Platform | Status | Description | Installation |
| --- | --- | --- | --- |
| DuckDB | ✅ Built-in | In-process analytical database | `uv add duckdb` |
| DataFusion | ✅ Available | In-memory query engine (Apache Arrow) | `uv add datafusion` |
| ClickHouse | ✅ Available | Column-oriented OLAP database | `uv add clickhouse-driver` |
| Databricks | ✅ Available | Data Intelligence Platform (lakehouse) | `uv add databricks-sql-connector` |
| BigQuery | ✅ Available | Serverless data warehouse (Google Cloud) | `uv add google-cloud-bigquery google-cloud-storage` |
| Redshift | ✅ Available | Cloud data warehouse (AWS) | `uv add redshift-connector boto3` |
| Snowflake | ✅ Available | Data Cloud / Multi-cloud data warehouse | `uv add snowflake-connector-python` |
| Trino | ✅ Available | Distributed SQL (Trino/Starburst) | `uv add benchbox[trino]` |
| PrestoDB | ✅ Available | Distributed SQL (Meta’s Presto) | `uv add benchbox[presto]` |
| SQLite | ✅ Built-in | Embedded transactional database | (built-in) |
| Azure Platforms | 🔄 Planned | Microsoft Fabric, Azure Synapse, Azure Data Explorer | See Azure Platforms |

DataFrame Platforms (Native API)

BenchBox supports benchmarking DataFrame libraries using their native APIs instead of SQL. This enables direct performance comparison between SQL and DataFrame paradigms on identical workloads. See DataFrame Platforms for full details.

| Platform | CLI Name | Status | Family | Description | Installation |
| --- | --- | --- | --- | --- | --- |
| Polars | `polars-df` | ✅ Production | Expression | Fast Rust-based DataFrame library with lazy evaluation | (core dependency) |
| Pandas | `pandas-df` | ✅ Production | Pandas | Reference Pandas implementation | `uv add benchbox --extra dataframe-pandas` |
| PySpark | `pyspark-df` | ✅ Production | Expression | Apache Spark DataFrame API (distributed) | `uv add benchbox --extra dataframe-pyspark` |
| DataFusion | `datafusion-df` | ✅ Production | Expression | Arrow-native query engine | `uv add benchbox --extra dataframe-datafusion` |
| Modin | `modin-df` | 🔄 Planned | Pandas | Distributed Pandas replacement | `uv add benchbox --extra dataframe-modin` |
| Dask | `dask-df` | 🔄 Planned | Pandas | Parallel computing DataFrames | `uv add benchbox --extra dataframe-dask` |
| cuDF | `cudf-df` | 🔄 Planned | Pandas | NVIDIA GPU-accelerated DataFrames | `uv add benchbox --extra dataframe-cudf` |

Quick Start:

# Run TPC-H on DataFrame platforms
benchbox run --platform polars-df --benchmark tpch --scale 0.1
benchbox run --platform pandas-df --benchmark tpch --scale 0.1
benchbox run --platform pyspark-df --benchmark tpch --scale 0.1
benchbox run --platform datafusion-df --benchmark tpch --scale 0.1
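
The same comparisons can be scripted from Python. A minimal sketch, assuming get_platform_adapter (see the Quick Start below) accepts the DataFrame CLI names:

from benchbox import TPCH
from benchbox.platforms import get_platform_adapter

benchmark = TPCH(scale_factor=0.1)
for name in ["polars-df", "pandas-df"]:
    # Assumption: the CLI platform names resolve through the Python API as well
    adapter = get_platform_adapter(name)
    results = adapter.run_benchmark(benchmark)
    print(f"{name}: {results.duration_seconds:.2f}s total")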

Quick Start

1. Install Dependencies

Install all cloud platforms at once:

uv add benchbox[cloud]

Or install individual platforms:

# ClickHouse
uv add benchbox[clickhouse]

# Databricks
uv add benchbox[databricks]

2. Use the Platform Management CLI

BenchBox includes a dedicated CLI for managing database platforms, which simplifies installation, configuration, and validation.

List all available platforms and their status:

benchbox platforms list

Check the status of a specific platform (e.g., Databricks):

benchbox platforms status databricks

Install missing libraries for a platform (guided):

benchbox platforms install bigquery

Enable or disable a platform:

benchbox platforms enable snowflake
benchbox platforms disable sqlite

Run an interactive setup wizard:

benchbox platforms setup
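
These commands also script well as a CI preflight check. A sketch using Python's subprocess module, assuming the CLI exits non-zero when a platform is not ready:

import subprocess

# Fail fast if the platform's driver libraries are missing or misconfigured
check = subprocess.run(
    ["benchbox", "platforms", "status", "databricks"],
    capture_output=True,
    text=True,
)
if check.returncode != 0:
    raise SystemExit(f"Databricks is not ready:\n{check.stdout}{check.stderr}")
print("Databricks platform is ready")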

3. Run Multi-Platform Benchmark

from benchbox.platforms import get_platform_adapter
from benchbox import TPCH

# Create benchmark
benchmark = TPCH(scale_factor=0.1)

# Test on multiple platforms
platforms = ["duckdb", "clickhouse"]

for platform_name in platforms:
    print(f"Running on {platform_name}...")
    
    try:
        adapter = get_platform_adapter(platform_name)
        results = adapter.run_benchmark(benchmark)
        
        print(f"Completed in {results.duration_seconds:.2f}s")
        print(f"Average query time: {results.average_query_time:.3f}s")
    except Exception as e:
        print(f"Could not run on {platform_name}: {e}")

DuckDB

Type: In-process analytical database
Common Use Cases: Development, testing, small to medium-scale analytics workloads

Configuration

from benchbox.platforms.duckdb import DuckDBAdapter

# In-memory database (default)
adapter = DuckDBAdapter()

# Persistent file database
adapter = DuckDBAdapter(database_path="benchmark.duckdb")
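
Once constructed, the adapter plugs straight into the benchmark flow from the Quick Start; a short usage sketch:

from benchbox import TPCH
from benchbox.platforms.duckdb import DuckDBAdapter

adapter = DuckDBAdapter(database_path="benchmark.duckdb")
results = adapter.run_benchmark(TPCH(scale_factor=0.1))
print(f"Average query time: {results.average_query_time:.3f}s")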

Apache DataFusion

Type: In-memory query engine (Apache Arrow-based)
Common Use Cases: In-process analytics, rapid prototyping, PyArrow workflows, memory-constrained OLAP

Configuration

from benchbox.platforms.datafusion import DataFusionAdapter

# In-memory analytics with Parquet (recommended)
adapter = DataFusionAdapter(
    working_dir="./datafusion_working",
    memory_limit="16G",
    data_format="parquet"  # or "csv" for lower memory
)

# Memory-constrained configuration
adapter = DataFusionAdapter(
    memory_limit="4G",
    data_format="csv",
    target_partitions=4
)

ClickHouse

Type: Column-oriented OLAP database
Common Use Cases: Analytical workloads, OLAP queries, real-time analytics

Configuration

from benchbox.platforms.clickhouse import ClickHouseAdapter

adapter = ClickHouseAdapter(
    host="localhost",
    port=9000,
    database="benchmark",
    username="default",
    password=""
)

Databricks

Type: Data Intelligence Platform (lakehouse architecture)
Common Use Cases: SQL analytics, ML/data science workflows, lakehouse deployments

Configuration

from benchbox.platforms.databricks import DatabricksAdapter

adapter = DatabricksAdapter(
    server_hostname="dbc-12345678-abcd.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abcd1234efgh5678",
    access_token="dapi1234567890abcdef",
    catalog="hive_metastore",
    schema="default"
)

BigQuery

Type: Serverless data warehouse (Google Cloud)
Common Use Cases: Large-scale analytics, petabyte-scale datasets, GCP-native applications

Configuration

from benchbox.platforms.bigquery import BigQueryAdapter

adapter = BigQueryAdapter(
    project_id="my-benchbox-project",
    dataset_id="benchbox_test",
    credentials_path="/path/to/service-account-key.json",
    location="US"
)

Redshift

Type: Cloud data warehouse (AWS)
Common Use Cases: AWS-native analytics, variable workloads, serverless or provisioned deployments

Configuration

from benchbox.platforms.redshift import RedshiftAdapter

adapter = RedshiftAdapter(
    host="benchbox-workgroup.123456.us-east-1.redshift-serverless.amazonaws.com",
    port=5439,
    database="benchbox",
    username="admin",
    password="SecurePassword123",
    is_serverless=True,
    workgroup_name="benchbox-workgroup"
)

Snowflake

Type: Data Cloud (multi-cloud data warehouse)
Common Use Cases: Enterprise analytics, multi-cloud deployments, elastic scaling workloads

Configuration

from benchbox.platforms.snowflake import SnowflakeAdapter

adapter = SnowflakeAdapter(
    account="xy12345.us-east-1",
    username="benchbox_user",
    password="secure_password_123",
    warehouse="COMPUTE_WH",
    database="BENCHBOX",
    schema="PUBLIC"
)

SQLite

Type: Embedded transactional database
Common Use Cases: Testing, development, small datasets, CI/CD validation

Configuration

from benchbox.platforms.sqlite import SQLiteAdapter

# In-memory database
adapter = SQLiteAdapter()

# File-based database
adapter = SQLiteAdapter(database_path="benchmark.db")

Troubleshooting

Common Issues Across Platforms

Connection Errors

Problem: Unable to connect to database

Solutions by Platform:

DuckDB:

# Check that the database file is writable
import os
if not os.access("benchmark.duckdb", os.W_OK):
    print("No write permission for benchmark.duckdb")

# Use absolute path
adapter = DuckDBAdapter(database_path="/full/path/to/benchmark.duckdb")

ClickHouse:

# Verify the server is reachable on the native protocol port
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
result = sock.connect_ex(('localhost', 9000))
sock.close()
if result == 0:
    print("ClickHouse is running")
else:
    print("Cannot reach ClickHouse on port 9000")

# Check credentials
adapter = ClickHouseAdapter(
    host="localhost",
    port=9000,
    username="default",
    password=""  # Empty password for default user
)

Cloud Platforms (Databricks, BigQuery, Snowflake, Redshift):

# Verify environment variables
import os
print(f"DATABRICKS_TOKEN: {'SET' if os.getenv('DATABRICKS_TOKEN') else 'NOT SET'}")
print(f"DATABRICKS_HOST: {os.getenv('DATABRICKS_HOST')}")

# Test connection before running benchmark
from benchbox.platforms.databricks import DatabricksAdapter
adapter = DatabricksAdapter()
try:
    adapter.test_connection()
    print("Connection successful")
except Exception as e:
    print(f"Connection failed: {e}")

Authentication Issues

BigQuery:

# Set credentials
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"

# Or use application default credentials
gcloud auth application-default login

Databricks:

# Personal access token
export DATABRICKS_TOKEN="dapi..."
export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"

# Or use Databricks CLI config
databricks configure --token

Snowflake:

# Use key-pair authentication
from benchbox.platforms.snowflake import SnowflakeAdapter
adapter = SnowflakeAdapter(
    account="xy12345",
    username="user",
    private_key_path="/path/to/rsa_key.p8",
    private_key_passphrase="passphrase"
)

Out of Memory Errors

DuckDB:

# Set memory limit
adapter = DuckDBAdapter(memory_limit="4GB")

# Use persistent database for large datasets
adapter = DuckDBAdapter(
    database_path="large_dataset.duckdb",
    memory_limit="8GB"
)

ClickHouse:

# Increase memory limits
adapter = ClickHouseAdapter(
    host="localhost",
    settings={
        "max_memory_usage": "10000000000",  # 10GB
        "max_bytes_before_external_sort": "5000000000"
    }
)

Cloud Platforms:

# BigQuery: Use query cache
from benchbox.platforms.bigquery import BigQueryAdapter
adapter = BigQueryAdapter(
    maximum_bytes_billed=10000000000,  # 10GB limit
    use_query_cache=True
)

# Snowflake: Increase warehouse size
from benchbox.platforms.snowflake import SnowflakeAdapter
adapter = SnowflakeAdapter(
    warehouse="LARGE_WH",  # or X-LARGE, 2X-LARGE
    warehouse_size="LARGE"
)

Slow Query Performance

General Debugging:

# Enable verbose logging
import logging
logging.basicConfig(level=logging.DEBUG)

# Run with profiling
from benchbox.platforms.duckdb import DuckDBAdapter
adapter = DuckDBAdapter(enable_profiling=True)

# Test with a smaller scale factor first
from benchbox import TPCH
benchmark = TPCH(scale_factor=0.01)  # Start small

Platform-Specific Optimizations:

DuckDB:

# Increase thread count
adapter = DuckDBAdapter(thread_limit=8)

# Use persistent database
adapter = DuckDBAdapter(database_path="cached.duckdb")

ClickHouse:

# Enable query optimizations
adapter = ClickHouseAdapter(
    host="localhost",
    settings={
        "max_threads": 8,
        "optimize_read_in_order": 1,
        "enable_filesystem_cache": 1
    }
)

Cloud Platforms:

# Databricks: Use larger cluster
adapter = DatabricksAdapter(
    http_path="/sql/1.0/warehouses/large-warehouse"
)

# BigQuery: Use batch priority for cost savings
adapter = BigQueryAdapter(
    job_priority="BATCH",  # Slower but cheaper
    use_legacy_sql=False
)

Data Loading Failures

Check file format:

# DuckDB supports multiple formats
import duckdb
conn = duckdb.connect("benchmark.duckdb")

conn.execute("""
    CREATE TABLE test AS
    SELECT * FROM read_parquet('data/*.parquet')
""")

# Or CSV with explicit parsing options
conn.execute("""
    CREATE TABLE test AS
    SELECT * FROM read_csv('data/*.csv',
                          delim='|',
                          header=false,
                          auto_detect=true)
""")

Verify file paths:

from pathlib import Path

data_dir = Path("./tpch_data")
if not data_dir.exists():
    print(f"Data directory not found: {data_dir}")
else:
    files = list(data_dir.glob("*.parquet"))
    print(f"Found {len(files)} data files")

Platform-Specific Issues

DuckDB

Issue: Database file is locked

# Solution: Ensure no other process is using the file
# Or use separate database files
adapter1 = DuckDBAdapter(database_path="db1.duckdb")
adapter2 = DuckDBAdapter(database_path="db2.duckdb")

ClickHouse

Issue: “Memory limit exceeded” errors

# Solution: Increase limits or enable external operations
adapter = ClickHouseAdapter(
    settings={
        "max_memory_usage": "20000000000",
        "max_bytes_before_external_group_by": "10000000000",
        "max_bytes_before_external_sort": "10000000000"
    }
)

Databricks

Issue: “Cluster not found” or “Warehouse not available”

# Solution: Verify HTTP path
from benchbox.platforms.databricks import DatabricksAdapter

# List available warehouses
adapter = DatabricksAdapter()
warehouses = adapter.list_warehouses()  # If implemented
print(f"Available warehouses: {warehouses}")

# Use correct HTTP path format
adapter = DatabricksAdapter(
    http_path="/sql/1.0/warehouses/abc123def456"
)

BigQuery

Issue: “Exceeded quota” or billing errors

# Solution: Set cost controls
from benchbox.platforms.bigquery import BigQueryAdapter

adapter = BigQueryAdapter(
    maximum_bytes_billed=5000000000,  # 5GB limit
    job_priority="BATCH",  # Lower cost
    use_query_cache=True,  # Reuse cached results
    dry_run=True  # Test without execution first
)

Snowflake

Issue: Warehouse auto-suspended

# Solution: Configure auto-resume
from benchbox.platforms.snowflake import SnowflakeAdapter

adapter = SnowflakeAdapter(
    warehouse="COMPUTE_WH",
    auto_resume=True,
    auto_suspend=300  # 5 minutes
)

Getting Help

If you encounter issues not covered here:

  1. Check logs: Enable verbose logging with --verbose flag

  2. Test connection: Use the platform’s native client to verify connectivity

  3. Review documentation: See platform-specific guides below

  4. Check GitHub issues: Search for similar problems

  5. Create an issue: Report bugs with reproducible examples

See Also

Platform Documentation

API Reference

Getting Started

Benchmarks

Advanced Topics