TPC-H Skew Benchmark¶

Tags advanced concept tpch-skew tpc-h custom-benchmark

Overview¶

The TPC-H Skew benchmark extends the standard TPC-H benchmark with configurable data skew patterns to test database performance under realistic, non-uniform data distributions. Based on the research paper “Introducing Skew into the TPC-H Benchmark”, this benchmark helps evaluate how query optimizers handle skewed data distributions that are common in real-world scenarios.

The benchmark is ideal for testing cardinality estimation accuracy, join strategy selection, and overall optimizer robustness when data doesn’t follow uniform distributions.

Key Features¶

Configurable skew levels - From uniform (standard TPC-H) to extreme Zipfian distributions
Multiple distribution types - Zipfian, normal, and exponential patterns
Preset configurations - Ready-to-use skew profiles for common scenarios
Same 22 queries - Uses standard TPC-H queries for direct comparison
Concurrent stream support - Full TPC-H compliance testing with skewed data
Research-backed methodology - Based on TPC-TC 2011 research

Skew Presets¶

BenchBox provides predefined skew configurations for common testing scenarios:

Preset	Skew Factor (z)	Description	Use Case
`none`	0.0	Uniform distribution (standard TPC-H)	Baseline comparison
`light`	0.2	Light skew	Initial optimizer stress testing
`moderate`	0.5	Moderate skew (default)	Balanced skew testing
`heavy`	0.8	Heavy skew	Advanced optimizer testing
`extreme`	1.0	Zipf’s law distribution	Edge case testing
`realistic`	varies	E-commerce patterns	Real-world simulation

Data Model¶

The TPC-H Skew benchmark uses the same schema as standard TPC-H:

Tables¶

Table	Purpose	Approximate Rows (SF 1)
lineitem	Order line items (largest table)	~6,000,000
orders	Customer orders	~1,500,000
partsupp	Part-supplier relationships	~800,000
customer	Customer information	~150,000
part	Part catalog	~200,000
supplier	Supplier information	~10,000
nation	Nation reference	25
region	Region reference	5

Skew Applied To¶

The skew distribution is applied to key columns that affect join cardinality and selectivity:

o_custkey (orders → customer join)
l_partkey (lineitem → part join)
l_suppkey (lineitem → supplier join)
ps_partkey (partsupp → part join)
ps_suppkey (partsupp → supplier join)

Query Categories¶

The benchmark uses the standard 22 TPC-H queries, unchanged from the base benchmark. This allows direct comparison between uniform and skewed data performance:

Scan-Heavy Queries¶

Q1, Q6: Large aggregations on lineitem
Most affected by data volume, less by skew

Join-Heavy Queries¶

Q3, Q5, Q7, Q8, Q9, Q10, Q21: Multi-table joins
Most affected by skew - cardinality estimation errors compound

Subquery Queries¶

Q2, Q4, Q11, Q15, Q17, Q18, Q20, Q22: Correlated subqueries
Affected by both skew and optimizer strategy selection

Complex Analytics¶

Q13, Q14, Q16, Q19: Aggregation with complex predicates
Variable impact depending on selectivity

Usage Examples¶

Basic Usage with Preset¶

from benchbox import TPCHSkew

# Use moderate skew (default)
benchmark = TPCHSkew(scale_factor=1.0)
data_files = benchmark.generate_data()

# Get queries (same as standard TPC-H)
queries = benchmark.get_queries()

# Check skew configuration
print(benchmark.get_skew_info())

Using Different Presets¶

from benchbox import TPCHSkew

# Light skew for initial testing
light_benchmark = TPCHSkew(scale_factor=0.1, skew_preset="light")

# Heavy skew for optimizer stress testing
heavy_benchmark = TPCHSkew(scale_factor=1.0, skew_preset="heavy")

# Extreme skew for edge cases
extreme_benchmark = TPCHSkew(scale_factor=1.0, skew_preset="extreme")

# Realistic e-commerce patterns
realistic_benchmark = TPCHSkew(scale_factor=1.0, skew_preset="realistic")

Custom Skew Configuration¶

from benchbox import TPCHSkew
from benchbox.core.tpch_skew import SkewConfiguration

# Create custom configuration
custom_config = SkewConfiguration(
    skew_factor=0.7,
    distribution_type="zipfian",
    attribute_skew_enabled=True,
    join_skew_enabled=True,
    temporal_skew_enabled=False,
)

benchmark = TPCHSkew(
    scale_factor=1.0,
    skew_config=custom_config
)
data_files = benchmark.generate_data()

DuckDB Integration¶

import duckdb
from benchbox import TPCHSkew
from benchbox.platforms.duckdb import DuckDBAdapter

# Initialize benchmark with heavy skew
benchmark = TPCHSkew(scale_factor=0.1, skew_preset="heavy")

# Generate skewed data
benchmark.generate_data()

# Use DuckDB adapter
adapter = DuckDBAdapter(database=":memory:")
adapter.load_benchmark(benchmark)

# Run queries and compare with baseline
results = adapter.run_benchmark(benchmark)

# Analyze query performance
for query_id, result in results.items():
    print(f"Q{query_id}: {result['execution_time']:.3f}s")

Comparing Uniform vs Skewed Performance¶

from benchbox import TPCH, TPCHSkew
from benchbox.platforms.duckdb import DuckDBAdapter

# Baseline: Standard TPC-H (uniform)
uniform = TPCH(scale_factor=1.0)
uniform.generate_data()

# Test: TPC-H with heavy skew
skewed = TPCHSkew(scale_factor=1.0, skew_preset="heavy")
skewed.generate_data()

# Compare performance
for benchmark, name in [(uniform, "Uniform"), (skewed, "Skewed")]:
    adapter = DuckDBAdapter()
    adapter.load_benchmark(benchmark)

    # Run a join-heavy query (most affected by skew)
    query = benchmark.get_query(3)  # Q3: Shipping Priority
    result = adapter.execute_query(query)
    print(f"{name} Q3: {result['execution_time']:.3f}s")

Scale Factor Guidelines¶

Scale Factor	Data Size	Memory Usage	Use Case
0.01	~10 MB	< 100 MB	Quick testing
0.1	~100 MB	< 500 MB	Development
1.0	~1 GB	< 4 GB	Standard benchmark
10.0	~10 GB	< 20 GB	Performance testing

Performance Characteristics¶

Impact of Skew on Query Performance¶

Low Impact:

Q1, Q6: Simple scans and aggregations
Performance dominated by I/O, not join strategies

Medium Impact:

Q2, Q4, Q11, Q15: Subqueries with moderate join complexity
Cardinality estimation errors affect plan selection

High Impact:

Q3, Q5, Q7, Q8, Q9, Q10, Q21: Complex multi-way joins
Skew causes significant cardinality estimation errors
Wrong join order selection common with naive optimizers

Common Optimizer Failures¶

Hash join overflow: Skewed build side exceeds memory estimates
Wrong join order: Estimated cardinalities differ significantly from actual
Suboptimal parallelism: Work imbalance due to skewed partitions
Index selection errors: Skewed data changes selectivity assumptions

Best Practices¶

Testing Methodology¶

Establish baseline - Run standard TPC-H first for comparison
Start with moderate skew - Use moderate preset initially
Progress to heavy skew - Identify optimizer breaking points
Monitor query plans - Compare plans between uniform and skewed runs

Interpreting Results¶

Calculate slowdown ratio - skewed_time / uniform_time
Identify outliers - Queries with >2x slowdown indicate optimizer issues
Examine plans - Look for join order changes, hash spills
Check cardinality estimates - Compare estimated vs actual row counts

Optimization Recommendations¶

Update statistics frequently - Histogram-based statistics help
Consider adaptive execution - Runtime join strategy switching
Use skew hints if available - Some databases support skew-aware hints
Test with realistic skew - Use realistic preset for production planning

External Resources¶

TPC-H Official Specification - TPC-H standard
Introducing Skew into TPC-H - Research methodology
Cardinality Estimation in Query Optimizers - Related research