TPC-Havoc Benchmark

Tags: advanced concept, tpc-havoc, tpc-h, custom-benchmark

Overview

TPC-Havoc is an experimental benchmark designed to stress-test query optimizer robustness through syntactic variations. It generates 10 semantically equivalent but syntactically different versions of each TPC-H query, creating 220 total query variants that should produce identical results but may exhibit different execution plans and performance characteristics.

Status: Experimental - Under active development

Purpose and Motivation

The Optimizer Consistency Problem

Modern query optimizers are expected to recognize semantically equivalent queries and produce similar execution plans regardless of how the SQL is written. However, in practice, optimizers often exhibit significant performance variations based on syntax choices that shouldn’t matter.

Example: These three queries are semantically identical but may perform very differently:

-- Variant 1: Scalar subquery
SELECT c_name, c_acctbal
FROM customer
WHERE c_acctbal > (SELECT AVG(c_acctbal) FROM customer);

-- Variant 2: WITH clause
WITH avg_balance AS (SELECT AVG(c_acctbal) AS avg_bal FROM customer)
SELECT c_name, c_acctbal
FROM customer, avg_balance
WHERE c_acctbal > avg_bal;

-- Variant 3: Window function
SELECT c_name, c_acctbal
FROM (
    SELECT c_name, c_acctbal, AVG(c_acctbal) OVER () AS avg_bal
    FROM customer
) AS t
WHERE c_acctbal > avg_bal;

A mature optimizer should recognize these equivalences and produce similar plans. TPC-Havoc systematically tests whether this holds across hundreds of query variations.

Why This Matters

  1. Optimizer Validation: Identifies weaknesses in query optimization logic

  2. Performance Predictability: Reveals when syntax choices cause unexpected slowdowns

  3. SQL Portability: Tests whether queries perform consistently across syntax styles

  4. Developer Guidance: Shows which SQL patterns are well-optimized vs problematic

  5. Regression Detection: Catches optimizer regressions across database versions

How It Works

Transformation Patterns

TPC-Havoc applies 10 universal SQL transformation patterns to each TPC-H query (a concrete example of pattern 9 follows this list):

  1. Scalar Subqueries: Convert aggregations to correlated scalar subqueries in SELECT

  2. HAVING Clauses: Add post-aggregation filtering even when logically redundant

  3. Explicit JOINs: Convert implicit joins to explicit INNER/LEFT JOIN syntax

  4. UNION Operations: Split complex logic into separate queries combined via UNION

  5. Common Table Expressions: Use WITH clauses to pre-compute intermediate results

  6. Derived Tables: Wrap queries in inline views for step-by-step computation

  7. Computed Columns: Pre-calculate complex expressions in CTEs

  8. WHERE IN/EXISTS: Convert between IN lists, EXISTS, and JOIN patterns

  9. Window Functions: Replace GROUP BY aggregations with window functions

  10. CASE Expressions: Add conditional processing throughout the query
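For example, pattern 9 (Window Functions) rewrites a GROUP BY aggregation as a window function over a distinct projection. A minimal sketch against the TPC-H customer table (illustrative only, not an actual TPC-Havoc variant):

-- Original: GROUP BY aggregation
SELECT c_nationkey, SUM(c_acctbal) AS total_bal
FROM customer
GROUP BY c_nationkey;

-- Pattern 9: equivalent window-function form
SELECT DISTINCT c_nationkey,
       SUM(c_acctbal) OVER (PARTITION BY c_nationkey) AS total_bal
FROM customer;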

Variant Generation

Each TPC-H query undergoes systematic transformation:

TPC-H Query 1 (Pricing summary report over lineitem)
    ├── Q1_V1: Scalar subqueries variant
    ├── Q1_V2: HAVING clause variant
    ├── Q1_V3: Explicit JOINs variant
    ├── ...
    └── Q1_V10: CASE expressions variant

Result: 10 variants that must produce identical results

This process repeats for all 22 TPC-H queries, yielding 220 total query variants.
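Because variant IDs follow the fixed Q{n}_V{m} scheme above, the full catalog can be enumerated with plain string construction (a sketch; no BenchBox calls required):

# Enumerate all 220 variant IDs (22 TPC-H queries x 10 variants each)
variant_ids = [
    f"Q{q}_V{v}"
    for q in range(1, 23)   # queries Q1-Q22
    for v in range(1, 11)   # variants V1-V10
]
assert len(variant_ids) == 220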

Quick Start

from benchbox import TPCHavoc

# Initialize benchmark (inherits TPC-H data and schema)
benchmark = TPCHavoc(scale_factor=1.0)

# Generate TPC-H data (same as TPC-H benchmark)
benchmark.generate_data()

# Load to database
import duckdb
conn = duckdb.connect(":memory:")
benchmark.load_data_to_database(conn)

# Get original TPC-H Query 1
original_q1 = benchmark.get_query("Q1")

# Get all 10 variants of Query 1
variants_q1 = [benchmark.get_query(f"Q1_V{i}") for i in range(1, 11)]

# Run and compare performance
for i, variant in enumerate(variants_q1, 1):
    result = conn.execute(variant).fetchdf()
    print(f"Q1_V{i}: {len(result)} rows")

Use Cases

1. Optimizer Robustness Testing

Test whether your database optimizer produces consistent plans across syntax variations:

from benchbox import TPCHavoc
import time

benchmark = TPCHavoc(scale_factor=1.0)
conn = setup_database()  # Your database connection

# Test Query 8 variants (complex 8-way join)
results = {}
for variant_id in range(1, 11):
    query = benchmark.get_query(f"Q8_V{variant_id}")

    start = time.time()
    result = conn.execute(query).fetchdf()
    elapsed = time.time() - start

    results[f"Q8_V{variant_id}"] = elapsed

# Analyze variance
times = list(results.values())
print(f"Min: {min(times):.2f}s, Max: {max(times):.2f}s")
print(f"Variance: {max(times) / min(times):.2f}x")

Expected Result: Low variance (< 2x) indicates a robust optimizer.
Problem Indicator: High variance (> 5x) reveals optimizer weaknesses.

2. Performance Regression Detection

Catch optimizer regressions between database versions:

# Baseline (current version)
benchbox run tpchavoc --platform postgres --scale-factor 1.0 --output baseline.json

# After upgrade (new version)
benchbox run tpchavoc --platform postgres --scale-factor 1.0 --output new.json

# Compare results
benchbox compare baseline.json new.json --threshold 1.5
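If you prefer to script the comparison yourself, a minimal sketch follows. It assumes each results file maps query IDs to elapsed seconds; the actual benchbox output schema may differ:

import json

# Flag variants that slowed down beyond the threshold between runs
with open("baseline.json") as f:
    baseline = json.load(f)
with open("new.json") as f:
    current = json.load(f)

THRESHOLD = 1.5  # same cutoff as --threshold 1.5 above
for query_id, base_time in baseline.items():
    ratio = current[query_id] / base_time
    if ratio > THRESHOLD:
        print(f"{query_id}: {ratio:.2f}x slower - possible regression")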

3. Cross-Database Optimizer Comparison

Compare optimizer maturity across database systems:

from benchbox import TPCHavoc

benchmark = TPCHavoc(scale_factor=0.1)
benchmark.generate_data()

# Pre-established connections (placeholders; create each with its client library)
databases = {
    "DuckDB": duckdb_connection,
    "ClickHouse": clickhouse_connection,
    "Snowflake": snowflake_connection,
}

# Run same variants on different databases
for db_name, conn in databases.items():
    benchmark.load_data_to_database(conn)

    variances = []
    for query_num in range(1, 23):
        times = []
        for variant in range(1, 11):
            query = benchmark.get_query(f"Q{query_num}_V{variant}")
            elapsed = measure_query_time(conn, query)  # user-supplied timing helper
            times.append(elapsed)

        variance = max(times) / min(times)
        variances.append(variance)

    avg_variance = sum(variances) / len(variances)
    print(f"{db_name}: Average variance {avg_variance:.2f}x")

4. SQL Pattern Analysis

Identify which SQL patterns are well-optimized:

from collections import defaultdict

pattern_performance = defaultdict(list)

for query_num in range(1, 23):
    for variant_num in range(1, 11):
        query_id = f"Q{query_num}_V{variant_num}"
        pattern = benchmark.get_variant_description(query_id)  # e.g., "CTE", "Window Function"

        elapsed = measure_query(query_id)  # user-supplied timing helper
        pattern_performance[pattern].append(elapsed)

# Analyze which patterns are fastest/slowest
for pattern, times in pattern_performance.items():
    avg_time = sum(times) / len(times)
    print(f"{pattern}: {avg_time:.3f}s average")

Key Features

Result Equivalence Guarantees

Every variant is validated to produce identical results to the original TPC-H query (a spot-check sketch follows this list):

  • Exact row counts match

  • Column values match within floating-point tolerance

  • Sort order preserved where specified

  • Automated validation in test suite
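The validate_variant method (see the API reference below) lets you spot-check this guarantee directly. A minimal sketch, reusing the benchmark and conn objects from Quick Start:

# Confirm every Q6 variant returns the same result set as the original Q6
for v in range(1, 11):
    query_id = f"Q6_V{v}"
    assert benchmark.validate_variant(query_id, conn), f"{query_id} diverged"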

SQL Standard Compliance

All transformations use SQL:2016 standard features:

  • No database-specific extensions

  • No vendor-specific optimizations

  • Portable across DuckDB, ClickHouse, Snowflake, BigQuery, Redshift, and other SQL:2016 compliant databases

  • No proprietary hints or directives

TPC-H Integration

TPC-Havoc inherits all TPC-H infrastructure:

  • Uses identical data and schema as TPC-H

  • Same scale factors (0.001 to 1000+)

  • Same data generation tools

  • No additional data preparation required

Query Catalog

Simple Aggregation Queries

  • Q1, Q6: Aggregation syntax variations, window functions, CTEs

Join-Heavy Queries

  • Q3, Q5, Q7, Q8, Q9, Q10, Q12, Q14: JOIN syntax variations, subquery conversions

Subquery-Intensive Queries

  • Q2, Q4, Q11, Q16, Q17, Q18, Q20, Q21, Q22: Subquery transformations, EXISTS ↔ IN conversions

Complex Aggregation Queries

  • Q13, Q15: Multi-level aggregation patterns, derived tables

Set Operation Queries

  • Q19: Complex OR conditions, UNION splits, CASE expressions

CLI Usage

# Run all TPC-Havoc queries
benchbox run tpchavoc --platform duckdb --scale-factor 1.0

# Run specific query variants
benchbox run tpchavoc --platform clickhouse --queries Q1_V1,Q1_V2,Q1_V3

# Run all variants of specific queries
benchbox run tpchavoc --platform duckdb --query-pattern "Q8_V*"

# Compare variance across queries
benchbox run tpchavoc --platform snowflake --analyze-variance

Performance Characteristics

Expected Query Time Variance

Optimizer Quality    Typical Variance    Description
Excellent            1.0-1.5x            Recognizes most equivalences
Good                 1.5-3.0x            Handles common patterns well
Fair                 3.0-5.0x            Struggles with some syntax
Poor                 5.0x+               Significant optimization gaps
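A small helper for bucketing a measured max/min ratio into these tiers might look like this (hypothetical; not part of the BenchBox API):

def classify_variance(ratio: float) -> str:
    """Map a max/min query-time ratio to the tiers in the table above."""
    if ratio < 1.5:
        return "Excellent"
    if ratio < 3.0:
        return "Good"
    if ratio < 5.0:
        return "Fair"
    return "Poor"

print(classify_variance(2.1))  # "Good"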

Scale Factor Recommendations

  • 0.01: Development, fast iteration (< 1 minute per variant)

  • 0.1: Integration testing (< 5 minutes per variant)

  • 1.0: Standard benchmarking (< 30 minutes per variant)

  • 10.0: Stress testing (hours per variant)

Limitations and Considerations

Experimental Status

TPC-Havoc is under active development. Current limitations:

  • Not all 220 variants implemented yet (progressive rollout)

  • Variant coverage varies by query complexity

  • Some transformations may not be applicable to all queries

  • Performance baselines still being established

Interpretation Guidelines

High variance doesn’t always mean a bad optimizer:

  1. Expected variance: Some syntax genuinely requires different execution strategies

  2. Hardware effects: I/O, caching, and parallelism can introduce variance

  3. Data characteristics: Skewed data may cause legitimate plan differences

  4. Query complexity: Complex queries naturally show more variance

Use TPC-Havoc to:

  • Identify problematic patterns for further investigation

  • Track optimizer improvements over time

  • Compare relative maturity across databases

Don’t use TPC-Havoc to:

  • Make absolute “Database X is better” claims

  • Optimize individual queries (use TPC-H for that)

  • Replace comprehensive performance testing

API Reference

TPCHavoc Class

class TPCHavoc(BaseBenchmark):
    """TPC-Havoc optimizer stress testing benchmark."""

    def __init__(self, scale_factor: float = 1.0, output_dir: Optional[str] = None):
        """Initialize TPC-Havoc benchmark (inherits from TPC-H)."""

    def get_query(self, query_id: str, *, params: Optional[dict] = None) -> str:
        """Get a query variant by ID (e.g., 'Q1', 'Q1_V5', 'Q8_V10')."""

    def get_all_variants(self, query_num: int) -> dict[str, str]:
        """Get all 10 variants for a specific query number (1-22)."""

    def get_variant_description(self, query_id: str) -> str:
        """Get description of the transformation pattern used."""

    def validate_variant(self, query_id: str, connection) -> bool:
        """Validate that variant produces results identical to original."""

Research and References

TPC-Havoc is inspired by research on query optimizer testing:

  • Equivalence testing: Ensuring semantically equivalent queries produce same results

  • Performance variance analysis: Measuring optimizer sensitivity to syntax

  • SQL transformation patterns: Cataloging equivalent query forms

Future Development

Planned enhancements:

  • Complete implementation of all 220 variants

  • Automated variance analysis and reporting

  • Query plan comparison tools

  • Extended pattern library beyond 10 universal patterns

  • Integration with query plan visualization

License

Copyright 2026 Joe Harris / BenchBox Project

Licensed under the MIT License.