# TPC-H Benchmark

## Overview
TPC-H tests data warehouse performance using 22 analytical queries.
The queries simulate business intelligence workloads and test complex SQL operations on large datasets.
## Key Features

- 22 analytical queries
- 8-table wholesale supplier schema
- Parameterized queries for reproducible tests
- Multiple scale factors (SF 1 ≈ 1 GB)
- OLAP operation coverage
- SQL dialect support
- Stream processing for throughput tests
## Schema Description

TPC-H uses 8 tables modeling a wholesale supplier business:

### Core Tables
| Table | Purpose | Approximate Rows (SF 1) |
|---|---|---|
| REGION | Geographic regions | 5 |
| NATION | Countries within regions | 25 |
| SUPPLIER | Supplier information | 10,000 |
| CUSTOMER | Customer information | 150,000 |
| PART | Parts catalog | 200,000 |
| PARTSUPP | Part-supplier relationships | 800,000 |
| ORDERS | Order header information | 1,500,000 |
| LINEITEM | Order line items (fact table) | 6,000,000 |
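To inspect the exact column definitions behind these row counts, the DDL for all eight tables can be printed with `get_create_tables_sql()` (the same method used to create the schema in the DuckDB example later on this page):

```python
from benchbox import TPCH

# Print the CREATE TABLE statements for all 8 TPC-H tables
tpch = TPCH(scale_factor=0.01, output_dir="tpch_schema_check")
print(tpch.get_create_tables_sql())
```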
### Schema Relationships

```mermaid
erDiagram
    REGION ||--o{ NATION : contains
    NATION ||--o{ CUSTOMER : located_in
    NATION ||--o{ SUPPLIER : located_in
    CUSTOMER ||--o{ ORDERS : places
    SUPPLIER ||--o{ PARTSUPP : supplies
    PART ||--o{ PARTSUPP : supplied_as
    ORDERS ||--o{ LINEITEM : contains
    PART ||--o{ LINEITEM : ordered_as
    SUPPLIER ||--o{ LINEITEM : supplied_by
```
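The foreign-key chain from REGION down to LINEITEM supports rollups such as revenue by region. The following SQL is an illustrative sketch, not one of the 22 official queries, and assumes the tables and `conn` have been set up as in the DuckDB example below:

```python
# Walks the chain REGION -> NATION -> CUSTOMER -> ORDERS -> LINEITEM
revenue_by_region = """
SELECT r_name,
       SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM region
JOIN nation   ON n_regionkey = r_regionkey
JOIN customer ON c_nationkey = n_nationkey
JOIN orders   ON o_custkey   = c_custkey
JOIN lineitem ON l_orderkey  = o_orderkey
GROUP BY r_name
ORDER BY revenue DESC
"""
print(conn.execute(revenue_by_region).fetchall())
```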
### Key Columns

**LINEITEM (Fact Table):**

- `L_ORDERKEY`, `L_LINENUMBER` (composite primary key)
- `L_PARTKEY`, `L_SUPPKEY` (foreign keys)
- `L_QUANTITY`, `L_EXTENDEDPRICE`, `L_DISCOUNT`, `L_TAX` (measures)
- `L_SHIPDATE`, `L_COMMITDATE`, `L_RECEIPTDATE` (temporal)

**ORDERS (Order Header):**

- `O_ORDERKEY` (primary key)
- `O_CUSTKEY` (foreign key to CUSTOMER)
- `O_ORDERDATE`, `O_TOTALPRICE`, `O_ORDERSTATUS`
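The measure columns combine in standard ways; Q1, for example, derives discounted price and total charge from `L_EXTENDEDPRICE`, `L_DISCOUNT`, and `L_TAX`. The sketch below is abridged to Q1's aggregate expressions:

```python
# Q1-style aggregates over the LINEITEM measure columns (abridged from Q1)
q1_measures = """
SELECT l_returnflag, l_linestatus,
       SUM(l_quantity)                                       AS sum_qty,
       SUM(l_extendedprice)                                  AS sum_base_price,
       SUM(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
       SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge
FROM lineitem
GROUP BY l_returnflag, l_linestatus
"""
```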
## Query Characteristics

The 22 TPC-H queries test different aspects of query processing:

### Query Categories

| Query | Type | Key Features |
|---|---|---|
| Q1 | Pricing Summary Report | Aggregation, GROUP BY, no joins |
| Q2 | Minimum Cost Supplier | Nested subqueries, correlated queries |
| Q3 | Shipping Priority | 3-table join, ORDER BY, LIMIT |
| Q4 | Order Priority Checking | EXISTS subquery, temporal predicates |
| Q5 | Local Supplier Volume | 6-table join, regional aggregation |
| Q6 | Forecasting Revenue | Simple filter, aggregation |
| Q7 | Volume Shipping | Nation-to-nation shipping analysis |
| Q8 | National Market Share | Complex aggregation, CASE expressions |
| Q9 | Product Type Profit | Multi-year profit analysis |
| Q10 | Returned Item Reporting | Customer analysis with joins |
| Q11 | Important Stock | Subquery with aggregation |
| Q12 | Shipping Modes | Ship mode analysis, CASE aggregation |
| Q13 | Customer Distribution | Outer join, customer classification |
| Q14 | Promotion Effect | Revenue analysis with CASE |
| Q15 | Top Supplier | VIEW creation, supplier ranking |
| Q16 | Parts/Supplier Relationship | Complex filtering, NOT IN |
| Q17 | Small-Quantity-Order Revenue | Correlated subquery with aggregation |
| Q18 | Large Volume Customer | Customer ranking by volume |
| Q19 | Discounted Revenue | Complex OR conditions |
| Q20 | Potential Part Promotion | Multi-level subqueries |
| Q21 | Suppliers Who Kept Orders Waiting | Multiple EXISTS/NOT EXISTS |
| Q22 | Global Sales Opportunity | Geographic analysis |
### Complexity Patterns

- **Simple (Q1, Q6):** single-table aggregation
- **Medium (Q3, Q5, Q10):** multi-table joins with aggregation
- **Complex (Q2, Q15, Q20, Q21):** nested subqueries, correlated queries
- **Analytical (Q7, Q8, Q9):** time-series and trend analysis
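These tiers are a convenient way to build up a test run incrementally. A minimal sketch, assuming a `tpch` instance as created in the usage examples below:

```python
# Fetch query SQL tier by tier, simplest first
complexity_tiers = {
    "simple":     [1, 6],
    "medium":     [3, 5, 10],
    "complex":    [2, 15, 20, 21],
    "analytical": [7, 8, 9],
}
for tier, query_ids in complexity_tiers.items():
    for query_id in query_ids:
        sql = tpch.get_query(query_id, seed=42)
        print(f"{tier}: Q{query_id} ({len(sql)} chars)")
```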
## Usage Examples

### Basic Query Generation

```python
from benchbox import TPCH

# Initialize TPC-H benchmark
tpch = TPCH(scale_factor=1.0, output_dir="tpch_data")

# Generate data
data_files = tpch.generate_data()

# Get all queries
queries = tpch.get_queries()
print(f"Generated {len(queries)} queries")

# Get a specific query with reproducible parameters
query_1 = tpch.get_query(1, seed=42)
print(query_1)
```
### Data Generation with Custom Scale

```python
# Generate a smaller dataset for testing
tpch_small = TPCH(scale_factor=0.1, output_dir="tpch_small")
data_files = tpch_small.generate_data()

# Check generated files
for file_path in data_files:
    table_name = file_path.stem  # filename without extension
    print(f"{table_name}: {file_path}")
```
### SQL Dialect Translation

```python
# Get a query in different SQL dialects
query_postgres = tpch.get_query(1, dialect="postgres")
query_duckdb = tpch.get_query(1, dialect="duckdb")
query_mysql = tpch.get_query(1, dialect="mysql")
# Queries are automatically translated via sqlglot
```
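The same transpile step can be reproduced with sqlglot directly. This sketch illustrates the mechanism; benchbox's internal call may differ in detail:

```python
import sqlglot

# Translate a PostgreSQL-flavored statement to DuckDB syntax
sql = "SELECT l_returnflag, SUM(l_quantity) FROM lineitem GROUP BY l_returnflag"
print(sqlglot.transpile(sql, read="postgres", write="duckdb")[0])
```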
### DuckDB Integration Example

```python
import duckdb
from benchbox import TPCH

# Initialize benchmark and generate data
tpch = TPCH(scale_factor=0.01, output_dir="tpch_tiny")
data_files = tpch.generate_data()

# Create DuckDB connection and schema
conn = duckdb.connect("tpch.duckdb")
schema_sql = tpch.get_create_tables_sql()
conn.execute(schema_sql)

# Load data (simplified example)
for table_name in tpch.get_available_tables():
    file_path = tpch.tables[table_name.upper()]
    # Load the pipe-delimited TBL file into the matching table
    conn.execute(f"""
        INSERT INTO {table_name}
        SELECT * FROM read_csv('{file_path}', delim='|', header=false)
    """)

# Run queries
for query_id in range(1, 23):
    query_sql = tpch.get_query(query_id, dialect="duckdb")
    result = conn.execute(query_sql).fetchall()
    print(f"Q{query_id}: {len(result)} rows")
```
### Stream Processing

```python
# Generate query streams for throughput testing
stream_files = tpch.generate_streams(
    num_streams=4,
    rng_seed=42,
    streams_output_dir="streams"
)

# Get stream information
for i, stream_info in enumerate(tpch.get_all_streams_info()):
    print(f"Stream {i}: {stream_info['query_count']} queries")
    print(f"  Permutation: {stream_info['permutation']}")
    print(f"  Output file: {stream_info['output_file']}")
```
## Performance Characteristics

### Query Complexity Categories
**Simple queries (low complexity):**

- Q1, Q6: single-table aggregation with simple predicates
- Q14: simple join with a CASE expression

**Moderate queries (medium complexity):**

- Q3, Q5, Q7, Q10, Q12: multi-table joins of moderate complexity
- Q4, Q13: EXISTS/outer-join patterns

**Complex queries (high complexity):**

- Q2, Q15, Q17, Q20, Q21: complex nested subqueries
- Q8, Q9: multi-year analytical queries
- Q16, Q18, Q19, Q22: complex filtering and ranking

**Note:** Actual execution times vary significantly based on platform, hardware, configuration, and data characteristics. Run benchmarks to measure performance for your specific environment.
### Resource Utilization

| Query Type | CPU Usage | Memory Usage | I/O Pattern |
|---|---|---|---|
| Aggregation | High | Medium | Sequential |
| Multi-join | Medium | High | Random |
| Subquery | High | High | Mixed |
| Analytical | Medium | Medium | Sequential |
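As a rough probe of these patterns, per-process CPU and memory can be sampled around a query with psutil (not part of benchbox; `tpch` and `conn` are assumed from the DuckDB example above):

```python
import psutil

proc = psutil.Process()
proc.cpu_percent()  # prime the counter; the first call returns 0.0
start_rss = proc.memory_info().rss

conn.execute(tpch.get_query(9, dialect="duckdb")).fetchall()

print(f"CPU since prime: {proc.cpu_percent():.1f}%")
print(f"RSS delta: {(proc.memory_info().rss - start_rss) / 1e6:.1f} MB")
```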
## Configuration Options

### Scale Factor Guidelines

| Scale Factor | Data Size | Use Case |
|---|---|---|
| 0.01 | ~10 MB | Development, unit testing |
| 0.1 | ~100 MB | Integration testing |
| 1 | ~1 GB | Standard benchmark |
| 10 | ~10 GB | Performance testing |
| 100+ | ~100+ GB | Production simulation |
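One dataset per tier can be generated up front and reused across test runs, for example:

```python
from benchbox import TPCH

# Generate one dataset per testing tier from the table above
for sf, label in [(0.01, "dev"), (0.1, "integration"), (1.0, "benchmark")]:
    TPCH(scale_factor=sf, output_dir=f"tpch_{label}").generate_data()
```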
### Advanced Configuration

```python
tpch = TPCH(
    scale_factor=1.0,
    output_dir="tpch_data",
    verbose=True,  # Enable detailed logging
    parallel=4     # Parallel data generation
)

# Custom query parameters
query = tpch.get_query(
    query_id=1,
    seed=42,           # Reproducible parameters
    scale_factor=1.0,  # Override scale factor
    dialect="duckdb"   # Target dialect
)
```
## Integration Examples

### Databricks Integration

```python
from benchbox import TPCH

# Generate TPC-H data for Databricks
tpch = TPCH(scale_factor=10, output_dir="/dbfs/tpch_sf10")
data_files = tpch.generate_data()

# Create external tables over the pipe-delimited TBL files
for table_name in tpch.get_available_tables():
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS {table_name}
        USING CSV
        LOCATION '/dbfs/tpch_sf10/{table_name.lower()}.tbl'
        OPTIONS ('delimiter' '|', 'header' 'false')
    """)

# Run queries with Spark SQL
for query_id in range(1, 23):
    query_sql = tpch.get_query(query_id, dialect="spark")
    df = spark.sql(query_sql)
    df.show()
```
### Performance Testing Framework

```python
import time
from statistics import mean, median

def benchmark_queries(tpch, connection, query_list=None):
    """Run TPC-H performance benchmark."""
    if query_list is None:
        query_list = range(1, 23)
    results = {}
    for query_id in query_list:
        times = []
        # Run each query 3 times
        for run in range(3):
            query_sql = tpch.get_query(query_id, dialect="duckdb")
            start_time = time.time()
            result = connection.execute(query_sql).fetchall()
            execution_time = time.time() - start_time
            times.append(execution_time)
        results[query_id] = {
            'times': times,
            'mean': mean(times),
            'median': median(times),
            'rows': len(result),  # row count from the last run
        }
    return results

# Usage
results = benchmark_queries(tpch, conn)
for query_id, stats in results.items():
    print(f"Q{query_id}: {stats['median']:.2f}s ({stats['rows']} rows)")
```
## Best Practices

### Data Generation

- Use appropriate scale factors for your testing needs
- Generate data once and reuse it across multiple test runs
- Use parallel generation for large scale factors (SF ≥ 10)
- Validate data integrity after generation (see the sketch after this list)
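A minimal integrity check compares row counts against the SF 1 figures from the schema table above (LINEITEM is omitted here because its count is approximate; for other scale factors, scale the non-fixed tables linearly):

```python
import duckdb

# Expected row counts at SF 1; REGION and NATION are fixed-size
EXPECTED_SF1 = {
    "region": 5, "nation": 25, "supplier": 10_000, "customer": 150_000,
    "part": 200_000, "partsupp": 800_000, "orders": 1_500_000,
}
conn = duckdb.connect("tpch.duckdb")
for table, expected in EXPECTED_SF1.items():
    actual = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    status = "OK" if actual == expected else "MISMATCH"
    print(f"{table}: {actual} rows (expected {expected}) {status}")
```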
### Query Execution

- Warm up queries before timing measurements (see the sketch after this list)
- Use consistent seeds for reproducible parameter generation
- Test incremental complexity (start with simple queries)
- Monitor resource usage during execution
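For example, a discarded warm-up run before the timed runs, with a fixed seed so every run sees the same parameters:

```python
import time

query_sql = tpch.get_query(6, seed=42, dialect="duckdb")
conn.execute(query_sql).fetchall()  # warm-up run, timing discarded

timings = []
for _ in range(3):
    start = time.time()
    conn.execute(query_sql).fetchall()
    timings.append(time.time() - start)
print("Q6 timings after warm-up:", [f"{t:.3f}s" for t in timings])
```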
### Performance Testing

- Run multiple iterations to account for variance
- Clear caches between runs to measure cold performance
- Use representative scale factors for your use case
- Profile both power and throughput metrics (see the sketch after this list)
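The official Power@Size and Throughput@Size metrics are defined in the TPC-H specification. As an illustrative summary statistic only, a geometric mean over the results from `benchmark_queries()` above looks like this:

```python
from math import exp, log

def geometric_mean(values):
    # Geometric mean via logs, numerically stable for small times
    return exp(sum(log(v) for v in values) / len(values))

medians = [stats["median"] for stats in results.values()]
print(f"Geometric mean of median query times: {geometric_mean(medians):.3f}s")
```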
## Common Issues and Solutions

### Data Generation Issues

**Issue: Out of memory during generation**

```python
# Solution: use a smaller scale factor or more parallel processes
tpch = TPCH(scale_factor=1.0, parallel=8)
```

**Issue: Slow data generation**

```python
# Solution: check disk I/O and use SSD storage
tpch = TPCH(output_dir="/fast/ssd/path")
```

### Query Execution Issues

**Issue: Query timeout on large scale factors**

```python
# Solution: start with a smaller scale factor and optimize queries
tpch = TPCH(scale_factor=0.1)  # Start small
query = tpch.get_query(21, dialect="duckdb")  # Use the configured dialect
```

**Issue: Parameter substitution errors**

```python
# Solution: use explicit seeds for reproducible parameters
query = tpch.get_query(1, seed=42)  # Reproducible
```
## See Also

### Understanding BenchBox

- Architecture Overview - How BenchBox works
- Workflow Patterns - Common benchmarking workflows
- Data Model - Result schema and analysis
- Glossary - Benchmark terminology

### Practical Guides

- Getting Started - Run your first benchmark
- CLI Reference - Complete command documentation
- Platform Selection Guide - Choose the right database
- Platform Quick Reference - Setup for each platform
- Data Generation Guide - Advanced generation options
- Dry Run Mode - Preview queries before execution

### Official TPC-H Resources

- Official TPC-H Benchmark Guide - Compliance testing with BenchBox
- TPC Validation Guide - Result validation and compliance
- TPC Patterns Usage - Parameter generation and stream patterns

### External Resources

- TPC-H Specification - Official specification document
- TPC-H Published Results - Vendor-submitted results
- TPC White Papers - Research using TPC benchmarks