TPC-H Skew Benchmark¶
Overview¶
The TPC-H Skew benchmark extends the standard TPC-H benchmark with configurable data skew patterns to test database performance under realistic, non-uniform data distributions. Based on the research paper “Introducing Skew into the TPC-H Benchmark”, this benchmark helps evaluate how query optimizers handle skewed data distributions that are common in real-world scenarios.
The benchmark is ideal for testing cardinality estimation accuracy, join strategy selection, and overall optimizer robustness when data doesn’t follow uniform distributions.
Key Features¶
- **Configurable skew levels** - From uniform (standard TPC-H) to extreme Zipfian distributions
- **Multiple distribution types** - Zipfian, normal, and exponential patterns
- **Preset configurations** - Ready-to-use skew profiles for common scenarios
- **Same 22 queries** - Uses standard TPC-H queries for direct comparison
- **Concurrent stream support** - Full TPC-H compliance testing with skewed data
- **Research-backed methodology** - Based on TPC-TC 2011 research
Skew Presets¶
BenchBox provides predefined skew configurations for common testing scenarios:
| Preset | Skew Factor (z) | Description | Use Case |
|---|---|---|---|
| `uniform` | 0.0 | Uniform distribution (standard TPC-H) | Baseline comparison |
| `light` | 0.2 | Light skew | Initial optimizer stress testing |
| `moderate` | 0.5 | Moderate skew (default) | Balanced skew testing |
| `heavy` | 0.8 | Heavy skew | Advanced optimizer testing |
| `extreme` | 1.0 | Zipf's law distribution | Edge case testing |
| `realistic` | varies | E-commerce patterns | Real-world simulation |
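To make the skew factor concrete, here is a standalone sketch (plain Python, not part of the BenchBox API) of the Zipf probability mass p(k) ∝ 1/k^z, showing how much of the data the single hottest key captures at each preset level; the 150,000-key domain mirrors the SF 1 customer table:

```python
# Standalone illustration (plain Python, not part of the BenchBox API):
# share of rows captured by the single most frequent key under a Zipf
# distribution p(k) proportional to 1/k**z, over a 150,000-key domain
# (the SF 1 customer count).
def top_key_share(z: float, n_keys: int = 150_000) -> float:
    weights = [1.0 / (k ** z) for k in range(1, n_keys + 1)]
    return weights[0] / sum(weights)

for z in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"z={z}: hottest key holds {top_key_share(z):.4%} of rows")
```

At z = 0.0 every key is equally likely; as z approaches 1.0 the hottest key absorbs a rapidly growing fraction of all rows, which is exactly what stresses an optimizer's uniformity assumptions.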
Data Model¶
The TPC-H Skew benchmark uses the same schema as standard TPC-H:
Tables¶
| Table | Purpose | Approximate Rows (SF 1) |
|---|---|---|
| `lineitem` | Order line items (largest table) | ~6,000,000 |
| `orders` | Customer orders | ~1,500,000 |
| `partsupp` | Part-supplier relationships | ~800,000 |
| `customer` | Customer information | ~150,000 |
| `part` | Part catalog | ~200,000 |
| `supplier` | Supplier information | ~10,000 |
| `nation` | Nation reference | 25 |
| `region` | Region reference | 5 |
Skew Applied To¶
The skew distribution is applied to key columns that affect join cardinality and selectivity:
- `o_custkey` (orders → customer join)
- `l_partkey` (lineitem → part join)
- `l_suppkey` (lineitem → supplier join)
- `ps_partkey` (partsupp → part join)
- `ps_suppkey` (partsupp → supplier join)
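As a rough illustration of what skewing a join-key column means in practice, the following standalone sketch (stdlib only; `zipf_key_sampler` is a hypothetical helper, not BenchBox's actual generator) draws foreign-key values with a Zipf(z) distribution via inverse-CDF lookup:

```python
import bisect
import random

def zipf_key_sampler(n_keys: int, z: float, seed: int = 42):
    """Return a function that draws keys 1..n_keys with Zipf(z) skew.

    A sketch of skewed foreign-key generation (not BenchBox's internal
    generator): build the cumulative distribution once, then sample by
    inverse-CDF lookup with bisect.
    """
    rng = random.Random(seed)
    weights = [1.0 / (k ** z) for k in range(1, n_keys + 1)]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    cdf[-1] = 1.0  # guard against floating-point round-off
    return lambda: bisect.bisect_left(cdf, rng.random()) + 1

# Draw 10,000 o_custkey-style values over 1,000 keys at z = 0.8:
draw = zipf_key_sampler(n_keys=1_000, z=0.8)
keys = [draw() for _ in range(10_000)]
print("distinct keys hit:", len(set(keys)))
```

Sampling this way costs O(log n) per row after a one-time O(n) CDF build, and with z = 0.0 it degenerates to a uniform draw.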
Query Categories¶
The benchmark uses the standard 22 TPC-H queries, unchanged from the base benchmark. This allows direct comparison between uniform and skewed data performance:
Scan-Heavy Queries¶
- Q1, Q6: Large aggregations on lineitem
- Most affected by data volume, less by skew

Join-Heavy Queries¶
- Q3, Q5, Q7, Q8, Q9, Q10, Q21: Multi-table joins
- Most affected by skew - cardinality estimation errors compound

Subquery Queries¶
- Q2, Q4, Q11, Q15, Q17, Q18, Q20, Q22: Correlated subqueries
- Affected by both skew and optimizer strategy selection

Complex Analytics¶
- Q13, Q14, Q16, Q19: Aggregation with complex predicates
- Variable impact depending on selectivity
Usage Examples¶
Basic Usage with Preset¶
```python
from benchbox import TPCHSkew

# Use moderate skew (default)
benchmark = TPCHSkew(scale_factor=1.0)
data_files = benchmark.generate_data()

# Get queries (same as standard TPC-H)
queries = benchmark.get_queries()

# Check skew configuration
print(benchmark.get_skew_info())
```
Using Different Presets¶
```python
from benchbox import TPCHSkew

# Light skew for initial testing
light_benchmark = TPCHSkew(scale_factor=0.1, skew_preset="light")

# Heavy skew for optimizer stress testing
heavy_benchmark = TPCHSkew(scale_factor=1.0, skew_preset="heavy")

# Extreme skew for edge cases
extreme_benchmark = TPCHSkew(scale_factor=1.0, skew_preset="extreme")

# Realistic e-commerce patterns
realistic_benchmark = TPCHSkew(scale_factor=1.0, skew_preset="realistic")
```
Custom Skew Configuration¶
```python
from benchbox import TPCHSkew
from benchbox.core.tpch_skew import SkewConfiguration

# Create custom configuration
custom_config = SkewConfiguration(
    skew_factor=0.7,
    distribution_type="zipfian",
    attribute_skew_enabled=True,
    join_skew_enabled=True,
    temporal_skew_enabled=False,
)

benchmark = TPCHSkew(
    scale_factor=1.0,
    skew_config=custom_config,
)
data_files = benchmark.generate_data()
```
DuckDB Integration¶
```python
import duckdb
from benchbox import TPCHSkew
from benchbox.platforms.duckdb import DuckDBAdapter

# Initialize benchmark with heavy skew
benchmark = TPCHSkew(scale_factor=0.1, skew_preset="heavy")

# Generate skewed data
benchmark.generate_data()

# Use DuckDB adapter
adapter = DuckDBAdapter(database=":memory:")
adapter.load_benchmark(benchmark)

# Run queries and compare with baseline
results = adapter.run_benchmark(benchmark)

# Analyze query performance
for query_id, result in results.items():
    print(f"Q{query_id}: {result['execution_time']:.3f}s")
```
Comparing Uniform vs Skewed Performance¶
```python
from benchbox import TPCH, TPCHSkew
from benchbox.platforms.duckdb import DuckDBAdapter

# Baseline: Standard TPC-H (uniform)
uniform = TPCH(scale_factor=1.0)
uniform.generate_data()

# Test: TPC-H with heavy skew
skewed = TPCHSkew(scale_factor=1.0, skew_preset="heavy")
skewed.generate_data()

# Compare performance
for benchmark, name in [(uniform, "Uniform"), (skewed, "Skewed")]:
    adapter = DuckDBAdapter()
    adapter.load_benchmark(benchmark)

    # Run a join-heavy query (most affected by skew)
    query = benchmark.get_query(3)  # Q3: Shipping Priority
    result = adapter.execute_query(query)
    print(f"{name} Q3: {result['execution_time']:.3f}s")
```
Scale Factor Guidelines¶
| Scale Factor | Data Size | Memory Usage | Use Case |
|---|---|---|---|
| 0.01 | ~10 MB | < 100 MB | Quick testing |
| 0.1 | ~100 MB | < 500 MB | Development |
| 1.0 | ~1 GB | < 4 GB | Standard benchmark |
| 10.0 | ~10 GB | < 20 GB | Performance testing |
Performance Characteristics¶
Impact of Skew on Query Performance¶
**Low Impact:**
- Q1, Q6: Simple scans and aggregations
- Performance dominated by I/O, not join strategies

**Medium Impact:**
- Q2, Q4, Q11, Q15: Subqueries with moderate join complexity
- Cardinality estimation errors affect plan selection

**High Impact:**
- Q3, Q5, Q7, Q8, Q9, Q10, Q21: Complex multi-way joins
- Skew causes significant cardinality estimation errors
- Wrong join order selection is common with naive optimizers
Common Optimizer Failures¶
- **Hash join overflow**: Skewed build side exceeds memory estimates
- **Wrong join order**: Estimated cardinalities differ significantly from actual
- **Suboptimal parallelism**: Work imbalance due to skewed partitions
- **Index selection errors**: Skewed data changes selectivity assumptions
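The root cause of most of these failures is cardinality misestimation, which can be reproduced in a few lines. This standalone sketch (illustrative numbers, not BenchBox output) compares the classic uniform-assumption estimate `N / NDV` against the actual frequency of the hottest key under Zipf skew:

```python
# Standalone sketch (illustrative numbers, not BenchBox output): the
# uniform-assumption estimate N / NDV versus the actual row count for
# the hottest key of a Zipf-skewed column.
def zipf_counts(n_rows: int, n_keys: int, z: float) -> list:
    """Per-key row counts for a Zipf(z) distribution over n_keys keys."""
    weights = [1.0 / (k ** z) for k in range(1, n_keys + 1)]
    total = sum(weights)
    return [round(n_rows * w / total) for w in weights]

n_rows, n_keys, z = 6_000_000, 200_000, 1.0   # lineitem rows, part keys, z = 1.0
actual = zipf_counts(n_rows, n_keys, z)[0]    # rows holding the hottest l_partkey
estimate = n_rows // n_keys                   # optimizer's uniform assumption
print(f"estimated {estimate} rows for the hottest key, actual {actual}")
```

A hash join that budgets memory for the uniform estimate will spill on the actual hot-key run, which is the first failure mode in the list above.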
Best Practices¶
Testing Methodology¶
1. **Establish baseline** - Run standard TPC-H first for comparison
2. **Start with moderate skew** - Use the `moderate` preset initially
3. **Progress to heavy skew** - Identify optimizer breaking points
4. **Monitor query plans** - Compare plans between uniform and skewed runs
Interpreting Results¶
1. **Calculate slowdown ratio** - `skewed_time / uniform_time`
2. **Identify outliers** - Queries with >2x slowdown indicate optimizer issues
3. **Examine plans** - Look for join order changes and hash spills
4. **Check cardinality estimates** - Compare estimated vs. actual row counts
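The slowdown analysis above can be scripted directly; in this sketch the per-query timings are hypothetical placeholders standing in for measured `execution_time` values:

```python
# Sketch of the slowdown calculation; these timings are hypothetical
# placeholders, not measured results.
uniform_times = {1: 0.82, 3: 1.10, 6: 0.45, 9: 2.30}  # seconds on uniform data
skewed_times = {1: 0.85, 3: 3.40, 6: 0.47, 9: 7.90}   # seconds on skewed data

for qid in sorted(uniform_times):
    ratio = skewed_times[qid] / uniform_times[qid]
    flag = "  <-- inspect plan and estimates" if ratio > 2.0 else ""
    print(f"Q{qid}: {ratio:.2f}x slowdown{flag}")
```

In this fabricated example the scan-heavy Q1 and Q6 stay near 1x while the join-heavy Q3 and Q9 cross the 2x threshold, matching the impact categories above.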
Optimization Recommendations¶
- **Update statistics frequently** - Histogram-based statistics help
- **Consider adaptive execution** - Runtime join strategy switching
- **Use skew hints if available** - Some databases support skew-aware hints
- **Test with realistic skew** - Use the `realistic` preset for production planning
External Resources¶
- TPC-H Official Specification - TPC-H standard
- Introducing Skew into TPC-H - Research methodology
- Cardinality Estimation in Query Optimizers - Related research