TPC-H Benchmark

Tags: intermediate, concept, tpc-h

Overview

TPC-H tests data warehouse performance using 22 analytical queries.

The queries simulate business intelligence workloads and test complex SQL operations on large datasets.

Key Features

  • 22 analytical queries

  • 8-table wholesale supplier schema

  • Parameterized queries for reproducible tests

  • Multiple scale factors (SF 1 ≈ 1 GB)

  • OLAP operation coverage

  • SQL dialect support

  • Query stream generation for throughput tests

Schema Description

TPC-H uses 8 tables modeling a wholesale supplier business:

Core Tables

| Table | Purpose | Approximate Rows (SF 1) |
|----------|------------------------------|-------------------------|
| REGION | Geographic regions | 5 |
| NATION | Countries within regions | 25 |
| SUPPLIER | Supplier information | 10,000 |
| CUSTOMER | Customer information | 150,000 |
| PART | Parts catalog | 200,000 |
| PARTSUPP | Part-supplier relationships | 800,000 |
| ORDERS | Order header information | 1,500,000 |
| LINEITEM | Order line items (fact table) | 6,000,000 |

Schema Relationships

erDiagram
    REGION ||--o{ NATION : contains
    NATION ||--o{ CUSTOMER : located_in
    NATION ||--o{ SUPPLIER : located_in
    CUSTOMER ||--o{ ORDERS : places
    SUPPLIER ||--o{ PARTSUPP : supplies
    PART ||--o{ PARTSUPP : supplied_as
    ORDERS ||--o{ LINEITEM : contains
    PART ||--o{ LINEITEM : ordered_as
    SUPPLIER ||--o{ LINEITEM : supplied_by

Key Columns

LINEITEM (Fact Table):

  • L_ORDERKEY, L_LINENUMBER (Composite Primary Key)

  • L_PARTKEY, L_SUPPKEY (Foreign Keys)

  • L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX (Measures)

  • L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE (Temporal)

ORDERS (Order Header):

  • O_ORDERKEY (Primary Key)

  • O_CUSTKEY (Foreign Key to Customer)

  • O_ORDERDATE, O_TOTALPRICE, O_ORDERSTATUS
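The LINEITEM measures combine into the derived expressions most of the queries build on: revenue is typically L_EXTENDEDPRICE * (1 - L_DISCOUNT), and Q1's charge adds tax on top. A minimal sketch of a Q1-style pricing summary in DuckDB (simplified from the official query; assumes a lineitem table has already been loaded, as in the integration example below):

import duckdb

conn = duckdb.connect("tpch.duckdb")  # assumes lineitem is already loaded

# Q1-style pricing summary: the Measures columns combine into the
# discounted-price and charge expressions used throughout the suite.
result = conn.execute("""
    SELECT
        l_returnflag,
        l_linestatus,
        SUM(l_quantity)                                       AS sum_qty,
        SUM(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
        SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge
    FROM lineitem
    WHERE l_shipdate <= DATE '1998-12-01' - INTERVAL 90 DAY
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus
""").fetchall()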

Query Characteristics

The 22 TPC-H queries test different aspects of query processing:

Query Categories

| Query | Name | Key Features |
|-------|------|--------------|
| Q1 | Pricing Summary Report | Aggregation, GROUP BY, no joins |
| Q2 | Minimum Cost Supplier | Nested, correlated subqueries |
| Q3 | Shipping Priority | 3-table join, ORDER BY, LIMIT |
| Q4 | Order Priority Checking | EXISTS subquery, temporal predicates |
| Q5 | Local Supplier Volume | 6-table join, regional aggregation |
| Q6 | Forecasting Revenue Change | Simple filter, aggregation |
| Q7 | Volume Shipping | Nation-to-nation shipping analysis |
| Q8 | National Market Share | Complex aggregation, CASE expressions |
| Q9 | Product Type Profit Measure | Multi-year profit analysis |
| Q10 | Returned Item Reporting | Customer analysis with joins |
| Q11 | Important Stock Identification | Subquery with aggregation |
| Q12 | Shipping Modes and Order Priority | Ship mode and order priority analysis |
| Q13 | Customer Distribution | Outer join, customer classification |
| Q14 | Promotion Effect | Revenue analysis with CASE |
| Q15 | Top Supplier | VIEW creation, supplier ranking |
| Q16 | Parts/Supplier Relationship | Complex filtering, NOT IN |
| Q17 | Small-Quantity-Order Revenue | Correlated subquery with aggregation |
| Q18 | Large Volume Customer | Customer ranking by volume |
| Q19 | Discounted Revenue | Complex OR conditions |
| Q20 | Potential Part Promotion | Multi-level subqueries |
| Q21 | Suppliers Who Kept Orders Waiting | Multiple EXISTS/NOT EXISTS |
| Q22 | Global Sales Opportunity | Geographic analysis |

Complexity Patterns

  • Simple Queries (Q1, Q6): Single table aggregation

  • Medium Queries (Q3, Q5, Q10): Multi-table joins with aggregation

  • Complex Queries (Q2, Q15, Q20, Q21): Nested subqueries, correlated queries

  • Analytical Queries (Q7, Q8, Q9): Time-series and trend analysis
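These tiers make a convenient driver for incremental testing, running simple queries first before moving to complex ones. A minimal sketch (the groupings simply restate the list above; the TPCH construction mirrors the usage examples below):

from benchbox import TPCH

tpch = TPCH(scale_factor=0.1, output_dir="tpch_small")

# Query tiers mirroring the complexity patterns above; useful for
# driving an incremental benchmark from simple to complex queries.
COMPLEXITY_TIERS = {
    "simple": [1, 6],
    "medium": [3, 5, 10],
    "complex": [2, 15, 20, 21],
    "analytical": [7, 8, 9],
}

for tier, query_ids in COMPLEXITY_TIERS.items():
    for query_id in query_ids:
        sql = tpch.get_query(query_id, seed=42)  # consistent seed
        print(f"[{tier}] Q{query_id}: {len(sql.splitlines())} lines of SQL")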

Usage Examples

Basic Query Generation

from benchbox import TPCH

# Initialize TPC-H benchmark
tpch = TPCH(scale_factor=1.0, output_dir="tpch_data")

# Generate data
data_files = tpch.generate_data()

# Get all queries
queries = tpch.get_queries()
print(f"Generated {len(queries)} queries")

# Get specific query with parameters
query_1 = tpch.get_query(1, seed=42)
print(query_1)

Data Generation with Custom Scale

# Generate smaller dataset for testing
tpch_small = TPCH(scale_factor=0.1, output_dir="tpch_small")
data_files = tpch_small.generate_data()

# Check generated files
for file_path in data_files:
    table_name = file_path.stem  # Get filename without extension
    print(f"{table_name}: {file_path}")

SQL Dialect Translation

# Get query in different SQL dialects
query_postgres = tpch.get_query(1, dialect="postgres")
query_duckdb = tpch.get_query(1, dialect="duckdb") 
query_mysql = tpch.get_query(1, dialect="mysql")

# Queries are automatically translated via sqlglot

DuckDB Integration Example

import duckdb
from benchbox import TPCH

# Initialize benchmark and generate data
tpch = TPCH(scale_factor=0.01, output_dir="tpch_tiny")
data_files = tpch.generate_data()

# Create DuckDB connection and schema
conn = duckdb.connect("tpch.duckdb")
schema_sql = tpch.get_create_tables_sql()
conn.execute(schema_sql)

# Load data (simplified example)
for table_name in tpch.get_available_tables():
    file_path = tpch.tables[table_name.upper()]
    table_name_lower = table_name.lower()

    # Load the pipe-delimited TBL file into the matching table
    conn.execute(f"""
        INSERT INTO {table_name_lower}
        SELECT * FROM read_csv('{file_path}',
                               delim='|',
                               header=false)
    """)

# Run queries
for query_id in range(1, 23):
    query_sql = tpch.get_query(query_id, dialect="duckdb")
    result = conn.execute(query_sql).fetchall()
    print(f"Q{query_id}: {len(result)} rows")

Query Stream Generation

# Generate query streams for throughput testing
stream_files = tpch.generate_streams(
    num_streams=4,
    rng_seed=42,
    streams_output_dir="streams"
)

# Get stream information
for i, stream_info in enumerate(tpch.get_all_streams_info()):
    print(f"Stream {i}: {stream_info['query_count']} queries")
    print(f"  Permutation: {stream_info['permutation']}")
    print(f"  Output file: {stream_info['output_file']}")

Performance Characteristics

Query Complexity Categories

Simple Queries (low complexity):

  • Q1, Q6: Single-table aggregation with simple predicates

  • Q14: Simple join with CASE expression

Moderate Queries (medium complexity):

  • Q3, Q5, Q7, Q10, Q12: Multi-table joins with moderate complexity

  • Q4, Q13: EXISTS/outer join patterns

Complex Queries (high complexity):

  • Q2, Q15, Q17, Q20, Q21: Complex nested subqueries

  • Q8, Q9: Multi-year analytical queries

  • Q16, Q18, Q19, Q22: Complex filtering and ranking

Note: Actual execution times vary significantly based on platform, hardware, configuration, and data characteristics. Run benchmarks to measure performance for your specific environment.

Resource Utilization

| Query Type | CPU Usage | Memory Usage | I/O Pattern |
|-------------|-----------|--------------|-------------|
| Aggregation | High | Medium | Sequential |
| Multi-join | Medium | High | Random |
| Subquery | High | High | Mixed |
| Analytical | Medium | Medium | Sequential |
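These profiles are approximate and worth verifying empirically on your own platform. A minimal monitoring sketch using psutil (run_with_stats is an illustrative name; assumes tables were loaded as in the DuckDB example above, and treats the RSS delta as only a rough memory indicator):

import time

import duckdb
import psutil

from benchbox import TPCH

tpch = TPCH(scale_factor=0.01, output_dir="tpch_tiny")
conn = duckdb.connect("tpch.duckdb")  # assumes tables already loaded
proc = psutil.Process()               # the current benchmark process

def run_with_stats(connection, sql):
    """Run a query and report wall time, CPU time, and RSS growth."""
    rss_before = proc.memory_info().rss
    cpu_before = proc.cpu_times()
    start = time.perf_counter()
    rows = connection.execute(sql).fetchall()
    elapsed = time.perf_counter() - start
    cpu = proc.cpu_times()
    cpu_used = (cpu.user - cpu_before.user) + (cpu.system - cpu_before.system)
    rss_delta = (proc.memory_info().rss - rss_before) / 2**20
    print(f"{elapsed:.2f}s wall, {cpu_used:.2f}s CPU, "
          f"{rss_delta:+.1f} MiB RSS, {len(rows)} rows")
    return rows

run_with_stats(conn, tpch.get_query(1, dialect="duckdb"))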

Configuration Options

Scale Factor Guidelines

| Scale Factor | Data Size | Use Case |
|--------------|-----------|----------------------------|
| 0.01 | ~10 MB | Development, unit testing |
| 0.1 | ~100 MB | Integration testing |
| 1 | ~1 GB | Standard benchmark |
| 10 | ~10 GB | Performance testing |
| 100+ | ~100+ GB | Production simulation |
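Data size and row counts both scale roughly linearly with the scale factor, with NATION and REGION fixed by the specification. A small sketch that estimates per-table row counts from the SF 1 numbers in the schema table above:

# Approximate row counts at SF 1, taken from the schema table above.
# NATION and REGION are fixed by the spec; the rest scale linearly.
SF1_ROWS = {
    "REGION": 5, "NATION": 25,
    "SUPPLIER": 10_000, "CUSTOMER": 150_000,
    "PART": 200_000, "PARTSUPP": 800_000,
    "ORDERS": 1_500_000, "LINEITEM": 6_000_000,
}
FIXED_TABLES = {"REGION", "NATION"}

def estimate_rows(scale_factor):
    """Estimate per-table row counts for a given scale factor."""
    return {
        table: rows if table in FIXED_TABLES else int(rows * scale_factor)
        for table, rows in SF1_ROWS.items()
    }

print(estimate_rows(0.1)["LINEITEM"])  # ~600,000 line items at SF 0.1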

Advanced Configuration

tpch = TPCH(
    scale_factor=1.0,
    output_dir="tpch_data",
    verbose=True,          # Enable detailed logging
    parallel=4             # Parallel data generation
)

# Custom query parameters
query = tpch.get_query(
    query_id=1,
    seed=42,               # Reproducible parameters
    scale_factor=1.0,      # Override scale factor
    dialect="duckdb"       # Target dialect
)

Integration Examples

Databricks Integration

from benchbox import TPCH

# Generate TPC-H data for Databricks
tpch = TPCH(scale_factor=10, output_dir="/dbfs/tpch_sf10")
data_files = tpch.generate_data()

# Create external tables over the pipe-delimited TBL files
for table_name in tpch.get_available_tables():
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS {table_name}
        USING CSV
        LOCATION '/dbfs/tpch_sf10/{table_name.lower()}.tbl'
        OPTIONS (
            'delimiter' '|',
            'header' 'false'
        )
    """)

# Run queries with Spark SQL
for query_id in range(1, 23):
    query_sql = tpch.get_query(query_id, dialect="spark")
    df = spark.sql(query_sql)
    df.show()

Performance Testing Framework

import time
from statistics import mean, median

def benchmark_queries(tpch, connection, query_list=None):
    """Run TPC-H performance benchmark."""
    if query_list is None:
        query_list = range(1, 23)

    results = {}

    for query_id in query_list:
        # Generate the query once so every run times identical SQL
        query_sql = tpch.get_query(query_id, dialect="duckdb")
        times = []
        row_count = 0

        # Run each query 3 times
        for run in range(3):
            start_time = time.perf_counter()
            result = connection.execute(query_sql).fetchall()
            times.append(time.perf_counter() - start_time)
            row_count = len(result)

        results[query_id] = {
            'times': times,
            'mean': mean(times),
            'median': median(times),
            'rows': row_count
        }

    return results

# Usage
results = benchmark_queries(tpch, conn)
for query_id, stats in results.items():
    print(f"Q{query_id}: {stats['median']:.2f}s ({stats['rows']} rows)")

Best Practices

Data Generation

  1. Use appropriate scale factors for your testing needs

  2. Generate data once and reuse it across test runs (see the sketch after this list)

  3. Use parallel generation for large scale factors (SF ≥ 10)

  4. Validate data integrity after generation
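A minimal sketch of the generate-once-and-validate pattern (assumes the generator writes pipe-delimited .tbl files, one row per line, into the output directory, as in the DuckDB example above; real validation would also check checksums or column-level constraints):

from pathlib import Path

from benchbox import TPCH

output_dir = Path("tpch_data")
tpch = TPCH(scale_factor=1.0, output_dir=str(output_dir))

# Skip generation when output already exists, then sanity-check row counts
if not any(output_dir.glob("*.tbl")):
    tpch.generate_data()

for tbl in sorted(output_dir.glob("*.tbl")):
    with open(tbl) as f:
        row_count = sum(1 for _ in f)  # one row per line in TBL files
    print(f"{tbl.stem}: {row_count} rows")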

Query Execution

  1. Warm up queries before timing measurements (see the sketch after this list)

  2. Use consistent seeds for reproducible parameter generation

  3. Test incremental complexity (start with simple queries)

  4. Monitor resource usage during execution
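A minimal warm-up-then-time sketch (timed_run, warmups, and runs are illustrative names; assumes the conn and tpch objects from the DuckDB integration example):

import time

def timed_run(connection, sql, warmups=1, runs=3):
    """Execute untimed warm-up runs, then return wall times for timed runs."""
    for _ in range(warmups):
        connection.execute(sql).fetchall()  # prime caches, untimed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        connection.execute(sql).fetchall()
        timings.append(time.perf_counter() - start)
    return timings

sql = tpch.get_query(3, seed=42, dialect="duckdb")  # fixed seed for reproducibility
print(timed_run(conn, sql))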

Performance Testing

  1. Run multiple iterations to account for variance

  2. Clear caches between runs for cold performance

  3. Use representative scale factors for your use case

  4. Profile both power and throughput metrics (see the sketch after this list)
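A hedged sketch of both metrics: the throughput formula below follows the standard TPC-H definition (streams × 22 × 3600 / elapsed seconds, scaled by SF), while the power calculation is a simplification that takes the geometric mean over query times only, omitting the official refresh functions:

from math import prod

def power_metric(query_times, scale_factor):
    """Simplified Power@Size: 3600 * SF over the geometric mean of the
    22 query times (the official metric also includes the two refresh
    functions, omitted here)."""
    geo_mean = prod(query_times) ** (1 / len(query_times))
    return 3600 * scale_factor / geo_mean

def throughput_metric(num_streams, elapsed_seconds, scale_factor):
    """TPC-H Throughput@Size: total queries per hour, scaled by SF."""
    return (num_streams * 22 * 3600 / elapsed_seconds) * scale_factor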

Common Issues and Solutions

Data Generation Issues

Issue: Out of memory during generation

# Solution: Use a smaller scale factor or fewer parallel workers
tpch = TPCH(scale_factor=0.1, parallel=2)  # Less data, lower peak memory

Issue: Slow data generation

# Solution: Check disk I/O and use SSD storage
tpch = TPCH(output_dir="/fast/ssd/path")

Query Execution Issues

Issue: Query timeout on large scale factors

# Solution: Start with a smaller scale factor and work up
tpch = TPCH(scale_factor=0.1)  # Start small
query = tpch.get_query(21, dialect="duckdb")  # Q21 is among the heaviest queries

Issue: Parameter substitution errors

# Solution: Use explicit seeds for reproducible parameters
query = tpch.get_query(1, seed=42)  # Reproducible

See Also

  • Understanding BenchBox

  • Practical Guides

  • Official TPC-H Resources

  • External Resources