Benchmarking Workflows

Tags: concept, intermediate

Common patterns and workflows for running benchmarks with BenchBox.

Quick Start Workflow

The simplest path from installation to results:

# 1. Install
uv pip install benchbox

# 2. Run benchmark
benchbox run --benchmark tpch --platform duckdb --scale 0.1

# 3. View results
cat benchmark_runs/tpch_0.1_duckdb_*/results.json

What happens:

  1. BenchBox generates TPC-H data at scale factor 0.1 (~100MB)

  2. Loads data into DuckDB (an in-process analytical database)

  3. Executes all 22 TPC-H queries

  4. Saves timing results to benchmark_runs/ directory

Time to complete: ~2-3 minutes
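
The exact layout of results.json depends on your BenchBox version. Assuming it contains a "queries" list with a query id and a duration per entry (an assumption to verify against your own file), a few lines of Python are enough to eyeball the numbers:

import glob
import json

# The directory pattern matches the run created above; adjust it to the
# actual name BenchBox produced.
for path in glob.glob("benchmark_runs/tpch_0.1_duckdb_*/results.json"):
    with open(path) as f:
        results = json.load(f)
    # Field names ("queries", "query_id", "duration_seconds") are assumptions
    # about the results schema -- adapt them to what you see in the file.
    for q in results.get("queries", []):
        print(f'{q["query_id"]:>4}: {q["duration_seconds"]:.3f}s')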

See: Getting Started Guide

Development Workflow

Typical workflow for developers testing query changes or experimenting:

# 1. Generate data once
benchbox datagen --benchmark tpch --scale 0.01 --output ./data/tpch_0.01

# 2. Run specific queries during development
benchbox run --benchmark tpch --platform duckdb --scale 0.01 \
  --data-dir ./data/tpch_0.01 \
  --query-subset q1,q3,q7 \
  --verbose

# 3. Run full suite when ready
benchbox run --benchmark tpch --platform duckdb --scale 0.01 \
  --data-dir ./data/tpch_0.01

Benefits:

  • Data generation happens once (can be slow for large scale factors)

  • Iterate quickly on specific queries

  • Full validation before committing changes

Use Cases:

  • Testing query modifications

  • Debugging platform adapter issues

  • Developing custom benchmarks
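
When iterating on a single query it can also help to bypass the CLI entirely and time the SQL directly with the duckdb Python package. The sketch below assumes datagen produced one Parquet file per table under ./data/tpch_0.01 and that the query under test lives in a local .sql file; both are assumptions to adapt to your layout:

import time
import duckdb

con = duckdb.connect()  # fresh in-memory DuckDB database

# Expose the generated files as views (paths and file names are assumptions).
for table in ["lineitem", "orders", "customer"]:
    con.execute(
        f"CREATE VIEW {table} AS SELECT * FROM './data/tpch_0.01/{table}.parquet'"
    )

sql = open("q1_variant.sql").read()  # hypothetical query file under test
start = time.perf_counter()
con.execute(sql).fetchall()
print(f"q1 variant: {time.perf_counter() - start:.3f}s")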

Multi-Platform Comparison Workflow

Compare performance across different database platforms:

# Generate data once
benchbox datagen --benchmark tpch --scale 1 --output ./data/tpch_1

# Run on multiple platforms
for platform in duckdb clickhouse; do
  benchbox run --benchmark tpch --platform $platform --scale 1 \
    --data-dir ./data/tpch_1 \
    --output benchmark_runs/tpch_1_${platform}
done

# Compare results
benchbox compare \
  benchmark_runs/tpch_1_duckdb/results.json \
  benchmark_runs/tpch_1_clickhouse/results.json

Output:

  • Side-by-side query timing comparison

  • Geometric mean calculations

  • Performance regression detection
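
If you prefer to post-process the JSON yourself rather than rely on benchbox compare, the geometric-mean comparison reduces to a few lines of Python. The results schema assumed below (a "queries" list with "query_id" and "duration_seconds" fields) is the same assumption as in the Quick Start section:

import json
from statistics import geometric_mean

def per_query_times(path):
    # Assumed schema: {"queries": [{"query_id": ..., "duration_seconds": ...}]}
    with open(path) as f:
        return {q["query_id"]: q["duration_seconds"] for q in json.load(f)["queries"]}

duck = per_query_times("benchmark_runs/tpch_1_duckdb/results.json")
click = per_query_times("benchmark_runs/tpch_1_clickhouse/results.json")

common = sorted(set(duck) & set(click))
for qid in common:
    print(f"{qid:>4}: duckdb={duck[qid]:.3f}s  clickhouse={click[qid]:.3f}s")

print("geomean duckdb:     ", geometric_mean([duck[q] for q in common]))
print("geomean clickhouse: ", geometric_mean([click[q] for q in common]))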

See: Platform Comparison Matrix

Cloud Platform Workflow

Running benchmarks on cloud platforms (BigQuery, Snowflake, Databricks):

# 1. Set cloud credentials
export DATABRICKS_TOKEN="dapi..."
export DATABRICKS_HOST="https://....cloud.databricks.com"

# 2. Preview queries without execution (to avoid costs)
benchbox run --benchmark tpcds --platform databricks --scale 1 \
  --dry-run ./preview

# Review generated queries in ./preview/queries/

# 3. Run benchmark (queries will execute and incur costs)
benchbox run --benchmark tpcds --platform databricks --scale 1 \
  --catalog hive_metastore \
  --schema benchbox_test

# 4. Results saved with cloud execution metadata
cat benchmark_runs/tpcds_1_databricks_*/results.json

Cost Control:

  • Use --dry-run to preview queries before execution

  • Start with small scale factors (0.1, 1)

  • Set platform-specific cost limits (e.g., BigQuery maximum_bytes_billed; see the sketch below)

  • Use auto-suspend/auto-resume for warehouse platforms
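
As a concrete example of the cost-limit bullet above, BigQuery lets you cap bytes billed per job. The sketch below uses the official google-cloud-bigquery client directly; how BenchBox passes this setting through is not shown here, and the table name is a placeholder:

from google.cloud import bigquery

client = bigquery.Client()

# Refuse to run any query that would bill more than ~10 GB.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024**3)

job = client.query(
    "SELECT COUNT(*) FROM `my_project.benchbox_test.store_sales`",  # placeholder table
    job_config=job_config,
)
print(list(job.result()))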

See: Platform Selection Guide

Dry Run Workflow

Preview benchmark execution without running queries:

# Generate queries and configuration
benchbox run --benchmark tpcds --platform bigquery --scale 10 \
  --dry-run ./preview

# Inspect outputs
tree ./preview
# preview/
# ├── queries/
# │   ├── q1.sql
# │   ├── q2.sql
# │   └── ...
# ├── schema/
# │   ├── store_sales.ddl
# │   └── ...
# └── summary.json

# Review summary
cat ./preview/summary.json

Use Cases:

  • Query validation before cloud execution

  • Cost estimation (query complexity, data scanned; see the sketch below)

  • Debugging query generation logic

  • Sharing queries with team members
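
For cost estimation on BigQuery specifically, the service reports bytes scanned without executing anything. A sketch using the google-cloud-bigquery client against one of the generated query files (the file path follows the preview layout shown above):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

sql = open("preview/queries/q1.sql").read()
job = client.query(sql, job_config=job_config)  # dry run: nothing executes
print(f"q1 would scan ~{job.total_bytes_processed / 1e9:.2f} GB")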

See: Dry Run Guide

CI/CD Workflow

Automated benchmarking in continuous integration:

# .github/workflows/benchmarks.yml
name: Benchmark Tests

on: [pull_request]

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install uv
        uses: astral-sh/setup-uv@v3

      - name: Install BenchBox
        run: uv pip install --system benchbox

      - name: Run regression test
        run: |
          benchbox run --benchmark tpch --platform duckdb --scale 0.01 \
            --output results/

      - name: Compare against baseline
        run: |
          benchbox compare \
            baseline/tpch_0.01_duckdb.json \
            results/results.json \
            --fail-on-regression 10%
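
The regression gate itself is simple to reproduce in a plain script if you prefer an explicit check in CI. The sketch below uses the same assumed results schema as earlier and fails the build on any query that is more than 10% slower than baseline:

import json
import sys

def times(path):
    with open(path) as f:
        return {q["query_id"]: q["duration_seconds"] for q in json.load(f)["queries"]}

baseline = times("baseline/tpch_0.01_duckdb.json")
current = times("results/results.json")

# Flag any query that slowed down by more than 10% versus baseline.
regressions = {q: current[q] / baseline[q] - 1
               for q in baseline
               if q in current and current[q] > baseline[q] * 1.10}
if regressions:
    for q, pct in sorted(regressions.items()):
        print(f"REGRESSION {q}: +{pct:.0%}")
    sys.exit(1)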

Benefits:

  • Catch performance regressions before merge

  • Validate query compatibility across platforms

  • Track performance trends over time

See: Testing Guide

Compliance Workflow (Official TPC Benchmarks)

Running benchmarks according to TPC specifications for official results:

# 1. Power Test (single query stream)
benchbox run-official tpch --platform snowflake --scale 100 \
  --test-type power \
  --seed 42 \
  --output results/power/

# 2. Throughput Test (concurrent streams)
benchbox run-official tpch --platform snowflake --scale 100 \
  --test-type throughput \
  --streams 4 \
  --seed 42 \
  --output results/throughput/

# 3. Calculate composite metric
benchbox calculate-qphh \
  --power-results results/power/results.json \
  --throughput-results results/throughput/results.json

Requirements:

  • Specific scale factors (1, 10, 30, 100, 300, 1000, 3000, 10000, 30000, 100000)

  • Random seed management

  • Refresh function execution

  • Full validation
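
The composite step combines the two tests as defined by the TPC-H specification: QphH@Size is the geometric mean of the Power and Throughput metrics. A sketch of that final calculation (the input values are placeholders; how they are named inside the results files is an assumption to verify):

import math

# TPC-H composite metric: QphH@Size = sqrt(Power@Size * Throughput@Size)
power_at_size = 145_000.0       # placeholder value from the power run
throughput_at_size = 130_000.0  # placeholder value from the throughput run

qphh = math.sqrt(power_at_size * throughput_at_size)
print(f"QphH@Size = {qphh:,.0f}")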

See:

Performance Tuning Workflow

Optimizing query performance with platform-specific tunings:

# 1. Baseline run (no tunings)
benchbox run --benchmark tpcds --platform clickhouse --scale 10 \
  --output baseline/

# 2. Apply tunings (partitioning, sorting, indexes)
benchbox run --benchmark tpcds --platform clickhouse --scale 10 \
  --tuning-config tunings/clickhouse_tpcds.yaml \
  --output tuned/

# 3. Compare results
benchbox compare baseline/results.json tuned/results.json

Example Tuning Config (tunings/clickhouse_tpcds.yaml):

tables:
  store_sales:
    partition_by: ss_sold_date_sk
    order_by: [ss_customer_sk, ss_item_sk]
    primary_key: [ss_item_sk, ss_ticket_number]

  customer:
    order_by: c_customer_sk
    primary_key: c_customer_sk

See: Performance Guide

Data Generation Workflow

Generating benchmark data separately from execution:

# Generate multiple scale factors
for sf in 0.01 0.1 1 10; do
  benchbox datagen --benchmark tpch --scale $sf \
    --output ./data/tpch_${sf} \
    --format parquet
done

# Compress for storage
tar -czf tpch_data.tar.gz ./data/

# Later: use pre-generated data
benchbox run --benchmark tpch --platform duckdb --scale 1 \
  --data-dir ./data/tpch_1

Benefits:

  • Generate data once, use many times

  • Share data across team members

  • Version control data generation parameters

  • Faster iteration on query/platform testing
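
Before sharing or reusing pre-generated data, a quick sanity check is worthwhile. A sketch using the duckdb Python package; the assumption that datagen writes one Parquet file per table under the output directory should be checked against your actual layout:

import glob
import duckdb

# Count rows in every generated Parquet file for the SF 1 dataset.
for path in sorted(glob.glob("./data/tpch_1/*.parquet")):
    rows = duckdb.sql(f"SELECT COUNT(*) FROM '{path}'").fetchone()[0]
    print(f"{path}: {rows:,} rows")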

See: Data Generation Guide

Validation Workflow

Verifying benchmark results for correctness:

# Run with validation enabled
benchbox run --benchmark tpch --platform duckdb --scale 0.1 \
  --validate-results \
  --validation-mode strict

# Validation checks:
# - Row count verification
# - Result checksum validation
# - Data type compliance
# - Constraint satisfaction

Validation Modes:

  • none: No validation (fastest)

  • basic: Row count checks only

  • strict: Full result validation (slowest, most thorough)
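
As a concrete illustration of the row-count level of checking: the TPC-H specification fixes table cardinalities per scale factor, so they can be verified directly against the loaded database. The sketch below uses the duckdb Python package with the exact counts for scale factor 1 (the database file name is an assumption):

import duckdb

# Exact TPC-H row counts at scale factor 1, per the TPC-H specification.
EXPECTED_SF1 = {
    "region": 5, "nation": 25, "supplier": 10_000, "customer": 150_000,
    "part": 200_000, "partsupp": 800_000, "orders": 1_500_000,
    "lineitem": 6_001_215,
}

con = duckdb.connect("benchmark.duckdb")  # file name is an assumption
for table, expected in EXPECTED_SF1.items():
    actual = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    status = "OK" if actual == expected else f"MISMATCH (expected {expected:,})"
    print(f"{table:>10}: {actual:,} {status}")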

See: TPC Validation Guide

Monitoring Workflow

Tracking benchmark performance over time:

# Run benchmark daily
benchbox run --benchmark tpch --platform duckdb --scale 1 \
  --output benchmark_runs/$(date +%Y%m%d)/

# Aggregate results
benchbox aggregate \
  --input-dir benchmark_runs/ \
  --output-file performance_trends.csv

# Visualize trends
benchbox plot performance_trends.csv \
  --output performance_chart.png
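
If your BenchBox version does not provide the aggregate and plot subcommands, the aggregation step is straightforward with the standard library; the sketch below uses the same assumed results schema as earlier:

import csv
import glob
import json
from pathlib import Path
from statistics import geometric_mean

rows = []
for path in sorted(glob.glob("benchmark_runs/*/results.json")):
    with open(path) as f:
        data = json.load(f)
    times = [q["duration_seconds"] for q in data["queries"]]  # assumed schema
    rows.append({"run": Path(path).parent.name,  # e.g. the YYYYMMDD directory
                 "geomean_s": geometric_mean(times),
                 "total_s": sum(times)})

with open("performance_trends.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["run", "geomean_s", "total_s"])
    writer.writeheader()
    writer.writerows(rows)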

Metrics Tracked:

  • Query execution times (p50, p95, p99)

  • Geometric mean

  • Total execution time

  • Data loading time

  • Memory usage

  • Failure rates

See: Performance Monitoring

Debugging Workflow

Troubleshooting benchmark issues:

# 1. Enable verbose logging
benchbox run --benchmark tpch --platform duckdb --scale 0.01 \
  --verbose \
  --log-file debug.log

# 2. Run single query with maximum detail
benchbox run --benchmark tpch --platform duckdb --scale 0.01 \
  --query-subset q1 \
  --verbose \
  --explain  # Shows query plan

# 3. Inspect database state
benchbox shell --platform duckdb --database benchmark.duckdb
# Interactive SQL shell opens
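
If the shell subcommand is unavailable, the DuckDB file can be inspected directly with the duckdb Python package (the database file name mirrors the example above):

import duckdb

con = duckdb.connect("benchmark.duckdb")
print(con.execute("SHOW TABLES").fetchall())

# Inspect the plan DuckDB chooses for a suspect query.
print(con.execute("EXPLAIN SELECT COUNT(*) FROM lineitem").fetchall())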

Common Issues:

  • Data generation failures → Check disk space, permissions

  • Query failures → Check SQL dialect compatibility

  • Performance issues → Check scale factor vs. available memory

  • Connection errors → Verify credentials, network access

See: Troubleshooting Guide

Custom Benchmark Workflow

Creating and running a custom benchmark:

# 1. Define benchmark class
from benchbox.base import BaseBenchmark

class MyBenchmark(BaseBenchmark):
    def generate_data(self):
        # Custom data generation logic
        ...

    def get_queries(self):
        return {
            "q1": "SELECT ...",
            "q2": "SELECT ...",
        }

    def get_query(self, query_id, params=None):
        queries = self.get_queries()
        return queries[query_id]

# 2. Run benchmark
from benchbox.platforms.duckdb import DuckDBAdapter

benchmark = MyBenchmark(scale_factor=0.1)
adapter = DuckDBAdapter()
results = benchmark.run_with_platform(adapter)

print(f"Completed in {results.duration_seconds:.2f}s")

See: Custom Benchmarks Guide

Best Practices

General Recommendations

  1. Start Small: Begin with a scale factor of 0.01 or 0.1 for testing

  2. Use Dry Run: Preview queries before expensive cloud execution

  3. Generate Once: Reuse generated data across multiple runs

  4. Version Control: Track benchmark configurations and results

  5. Monitor Costs: Set budget alerts for cloud platforms

Performance Optimization

  1. Match Platform to Workload: Use DuckDB for development, cloud platforms for production

  2. Optimize Data Loading: Use platform-specific bulk loading (COPY, external tables)

  3. Apply Tunings: Leverage partitioning, clustering, indexes for large datasets

  4. Measure Baselines: Establish baseline performance before optimizations

Compliance and Validation

  1. Follow TPC Specs: Use official runners for compliance testing

  2. Validate Results: Enable validation for correctness verification

  3. Document Configuration: Record all benchmark parameters and system info

  4. Reproduce Results: Use fixed seeds and version-locked dependencies