TPC-DS Benchmark

Tags: intermediate, concept, tpc-ds

Overview

The TPC-DS (Transaction Processing Performance Council Decision Support) benchmark is one of the most comprehensive and complex decision support benchmarks available today. TPC-DS models a retail data warehouse environment with sophisticated analytical queries that exercise advanced database features including window functions, multi-level common table expressions, OLAP operations, and complex join patterns.

TPC-DS represents the evolution of decision support benchmarks, designed to address the limitations of earlier benchmarks by incorporating modern data warehouse patterns, complex analytical queries, and realistic data distributions that challenge all aspects of a database system’s query processing capabilities.

Key Features

  • 99 analytical queries with varying complexity and selectivity

  • 24-table schema representing a complete retail data warehouse

  • Advanced SQL features including window functions, multi-level CTEs, and ROLLUP/CUBE

  • Complex query patterns with nested subqueries and multiple aggregation levels

  • Realistic data distributions based on real-world retail scenarios

  • Query stream support for throughput and concurrency testing

  • Parameter substitution with sophisticated business logic

  • Multiple scale factors (SF 1 = ~1GB, SF 100 = ~100GB, etc.)

  • SQL dialect support via query translation

Schema Description

The TPC-DS schema models a complete retail data warehouse with three main sales channels: Store Sales, Catalog Sales, and Web Sales. The schema includes fact tables, dimension tables, and return tables that capture the complexity of modern retail operations.
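
As a quick illustration of the star-schema layout, the snippet below runs a typical fact-to-dimension join in DuckDB once the data is loaded. This is a minimal sketch using standard TPC-DS column names, not one of the 99 official benchmark queries:

import duckdb

conn = duckdb.connect("tpcds.duckdb")

# Star join: the store_sales fact table joined to two shared dimensions
sales_by_category = conn.execute("""
    SELECT i.i_category, d.d_year, SUM(ss.ss_ext_sales_price) AS total_sales
    FROM store_sales ss
    JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
    JOIN item i ON ss.ss_item_sk = i.i_item_sk
    GROUP BY i.i_category, d.d_year
    ORDER BY total_sales DESC
""").fetchall()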

Fact Tables (Core Business Events)

| Table | Purpose | Approximate Rows (SF 1) |
|-------|---------|-------------------------|
| STORE_SALES | In-store retail transactions | 2,880,404 |
| CATALOG_SALES | Mail-order catalog sales | 1,441,548 |
| WEB_SALES | Online web sales | 719,384 |
| STORE_RETURNS | In-store returns | 287,514 |
| CATALOG_RETURNS | Catalog returns | 144,067 |
| WEB_RETURNS | Web returns | 71,763 |
| INVENTORY | Daily inventory snapshots | 11,745,000 |

Dimension Tables (Reference Data)

| Table | Purpose | Approximate Rows (SF 1) |
|-------|---------|-------------------------|
| DATE_DIM | Calendar and date attributes | 73,049 |
| TIME_DIM | Time of day breakdown | 86,400 |
| ITEM | Product catalog | 18,000 |
| CUSTOMER | Customer information | 100,000 |
| CUSTOMER_ADDRESS | Customer addresses | 50,000 |
| CUSTOMER_DEMOGRAPHICS | Customer demographic profiles | 1,920,800 |
| HOUSEHOLD_DEMOGRAPHICS | Household characteristics | 7,200 |
| STORE | Store location details | 12 |
| CATALOG_PAGE | Catalog page information | 11,718 |
| WEB_SITE | Website information | 30 |
| WEB_PAGE | Web page details | 60 |
| WAREHOUSE | Distribution centers | 5 |
| PROMOTION | Marketing promotions | 300 |
| INCOME_BAND | Income brackets | 20 |
| REASON | Return/refund reasons | 35 |
| CALL_CENTER | Customer service centers | 6 |
| SHIP_MODE | Shipping methods | 20 |

Schema Relationships

erDiagram
    DATE_DIM ||--o{ STORE_SALES : sold_on
    DATE_DIM ||--o{ CATALOG_SALES : sold_on
    DATE_DIM ||--o{ WEB_SALES : sold_on
    TIME_DIM ||--o{ STORE_SALES : sold_at
    ITEM ||--o{ STORE_SALES : item_sold
    ITEM ||--o{ CATALOG_SALES : item_sold
    ITEM ||--o{ WEB_SALES : item_sold
    CUSTOMER ||--o{ STORE_SALES : purchased_by
    CUSTOMER ||--o{ CATALOG_SALES : purchased_by
    CUSTOMER ||--o{ WEB_SALES : purchased_by
    STORE ||--o{ STORE_SALES : sold_at
    CATALOG_PAGE ||--o{ CATALOG_SALES : from_page
    WEB_SITE ||--o{ WEB_SALES : from_site
    WEB_PAGE ||--o{ WEB_SALES : from_page
    PROMOTION ||--o{ STORE_SALES : promoted_by
    STORE_SALES ||--o{ STORE_RETURNS : returned_from
    CATALOG_SALES ||--o{ CATALOG_RETURNS : returned_from
    WEB_SALES ||--o{ WEB_RETURNS : returned_from

Query Characteristics

The 99 TPC-DS queries are organized into different complexity categories and test various aspects of analytical query processing:

Query Categories by Complexity

Reporting Queries (Simple Aggregation):

  • Q1, Q6, Q11, Q13: Basic aggregation with simple grouping

  • Focus on fundamental aggregation performance

Interactive Queries (Medium Complexity):

  • Q3, Q7, Q19, Q27, Q42: Multi-table joins with moderate complexity

  • Window functions and ranking operations

  • Typical business intelligence patterns

Deep Analytics (High Complexity):

  • Q14a/14b, Q23a/23b, Q24a/24b: Advanced analytical patterns

  • Multi-part CTEs and complex mathematical calculations

  • Multi-level aggregation and correlation analysis

Data Mining Queries (Very High Complexity):

  • Q64, Q67, Q70, Q86: Statistical analysis and data mining patterns

  • Customer segmentation and behavior analysis

  • Advanced window functions and analytical operations
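
These categories can drive targeted test runs. A minimal sketch, assuming a `tpcds` instance created as in the Usage Examples below; the category-to-query mapping simply mirrors the lists above:

categories = {
    "reporting": [1, 6, 11, 13],
    "interactive": [3, 7, 19, 27, 42],
    "deep_analytics": ["14a", "14b", "23a", "23b", "24a", "24b"],
    "data_mining": [64, 67, 70, 86],
}

# Fetch parameterized SQL per category using the get_query API shown later
for category, query_ids in categories.items():
    sql_by_id = {qid: tpcds.get_query(qid, seed=42) for qid in query_ids}
    print(f"{category}: {len(sql_by_id)} queries")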

Key Query Features

| Feature | Example Queries | Description |
|---------|-----------------|-------------|
| Window Functions | Q17, Q25, Q47, Q51 | ROW_NUMBER(), RANK(), LAG/LEAD |
| Multi-part CTEs | Q14a, Q23a | Multiple reusable WITH subqueries |
| ROLLUP/CUBE | Q5, Q18, Q77 | OLAP aggregation operations |
| Complex Joins | Q64, Q78, Q95 | 6+ table joins with multiple predicates |
| Correlation Analysis | Q35, Q38, Q87 | Cross-channel correlation patterns |
| Time Series | Q12, Q20, Q21 | Year-over-year and temporal analysis |
| Ranking/Top-N | Q37, Q40, Q52 | Customer and product ranking |
| Statistical Functions | Q9, Q46, Q98 | STDDEV, VARIANCE, CORR |
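
For example, the ranking pattern exercised by queries like Q47 combines grouped aggregation with window functions. The snippet below is a simplified illustration of that shape, not the official query text; the column names assume the standard TPC-DS schema:

# Rank each category's months by total sales (Q47-style pattern)
monthly_rank_sql = """
    SELECT i_category, d_year, d_moy,
           SUM(ss_sales_price) AS monthly_sales,
           RANK() OVER (
               PARTITION BY i_category
               ORDER BY SUM(ss_sales_price) DESC
           ) AS sales_rank
    FROM store_sales
    JOIN date_dim ON ss_sold_date_sk = d_date_sk
    JOIN item ON ss_item_sk = i_item_sk
    GROUP BY i_category, d_year, d_moy
"""
rows = conn.execute(monthly_rank_sql).fetchall()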

Query Variants

Some queries have variants (e.g., Q14a/Q14b) that test slightly different patterns:

  • Query 14a/14b: Different approaches to cross-channel analysis

  • Query 23a/23b: Alternative CTE-based query patterns

  • Query 24a/24b: Different aggregation strategies

Usage Examples

Basic Query Generation

from benchbox import TPCDS

# Initialize TPC-DS benchmark
tpcds = TPCDS(scale_factor=1.0, output_dir="tpcds_data")

# Generate data using C tools
data_files = tpcds.generate_data()

# Get all queries (1-99)
queries = tpcds.get_queries()
print(f"Generated {len(queries)} queries")

# Get specific query with parameters
query_1 = tpcds.get_query(1, seed=42)
print(query_1)

# Query variants
query_14a = tpcds.get_query("14a", seed=42)
query_14b = tpcds.get_query("14b", seed=42)

Data Generation with C Tools

# TPC-DS requires the official C tools (dsdgen/dsqgen)
tpcds = TPCDS(scale_factor=1.0, output_dir="tpcds_sf1", verbose=True)

# Generate all tables using dsdgen
data_files = tpcds.generate_data()

# Check generated files 
for table_name in tpcds.get_available_tables():
    table_path = tpcds.output_dir / f"{table_name.lower()}.dat"
    print(f"{table_name}: {table_path} ({table_path.stat().st_size} bytes)")

Stream Generation and Testing

# Generate query streams for throughput testing
tpcds = TPCDS(scale_factor=1.0, output_dir="tpcds_data")

# Generate multiple concurrent streams
streams = tpcds.generate_streams(
    num_streams=4,
    rng_seed=42,
    streams_output_dir="streams"
)

# Each stream contains a random permutation of queries
for i, stream_info in enumerate(tpcds.get_all_streams_info()):
    print(f"Stream {i}:")
    print(f"  Queries: {stream_info['query_count']}")
    print(f"  File: {stream_info['output_file']}")

DuckDB Integration Example

import duckdb
from benchbox import TPCDS

# Initialize and generate data
tpcds = TPCDS(scale_factor=0.1, output_dir="tpcds_small")
data_files = tpcds.generate_data()

# Create DuckDB connection and schema
conn = duckdb.connect("tpcds.duckdb")
schema_sql = tpcds.get_create_tables_sql()
conn.execute(schema_sql)

# Load data from DAT files
table_mappings = {
    'call_center': 'call_center.dat',
    'catalog_page': 'catalog_page.dat',
    'catalog_returns': 'catalog_returns.dat',
    'catalog_sales': 'catalog_sales.dat',
    'customer': 'customer.dat',
    'customer_address': 'customer_address.dat',
    'customer_demographics': 'customer_demographics.dat',
    'date_dim': 'date_dim.dat',
    'household_demographics': 'household_demographics.dat',
    'income_band': 'income_band.dat',
    'inventory': 'inventory.dat',
    'item': 'item.dat',
    'promotion': 'promotion.dat',
    'reason': 'reason.dat',
    'ship_mode': 'ship_mode.dat',
    'store': 'store.dat',
    'store_returns': 'store_returns.dat',
    'store_sales': 'store_sales.dat',
    'time_dim': 'time_dim.dat',
    'warehouse': 'warehouse.dat',
    'web_page': 'web_page.dat',
    'web_returns': 'web_returns.dat',
    'web_sales': 'web_sales.dat',
    'web_site': 'web_site.dat'
}

for table_name, file_name in table_mappings.items():
    file_path = tpcds.output_dir / file_name
    if file_path.exists():
        conn.execute(f"""
            INSERT INTO {table_name}
            SELECT * FROM read_csv('{file_path}', 
                                  delim='|', 
                                  header=false,
                                  nullstr='',
                                  ignore_errors=true)
        """)
        print(f"Loaded {table_name}")

# Run sample queries
sample_queries = [1, 3, 7, 19, 27, 42]
for query_id in sample_queries:
    try:
        query_sql = tpcds.get_query(query_id, dialect="duckdb")
        result = conn.execute(query_sql).fetchall()
        print(f"Q{query_id}: {len(result)} rows")
    except Exception as e:
        print(f"Q{query_id}: Error - {str(e)[:100]}...")

Advanced Parameter Management

# Use specific seeds for reproducible results
query_reproducible = tpcds.get_query(42, seed=12345)

# Scale-aware parameter generation
query_large = tpcds.get_query(42, 
                              seed=12345, 
                              scale_factor=100.0)

# Stream-based parameter generation
query_stream = tpcds.get_query(42,
                               seed=12345,
                               stream_id=0)

Performance Characteristics

Query Complexity Categories

Simple Queries (low complexity):

  • Q1, Q6, Q11, Q13: Simple aggregation queries

  • Q16, Q21, Q32: Basic joins with simple predicates

Moderate Queries (medium complexity):

  • Q3, Q7, Q19, Q27: Multi-table joins with moderate complexity

  • Q25, Q29, Q43: Window functions and ranking

  • Q52, Q55, Q59: Customer analysis patterns

Complex Queries (high complexity):

  • Q14a/b, Q23a/b, Q24a/b: Advanced analytical patterns

  • Q64, Q67, Q78: Complex joins and correlations

  • Q95, Q98: Statistical and data mining operations

Very Complex Queries (highest complexity):

  • Q70, Q86: Extremely complex analytical patterns

  • Queries with large intermediate result sets

  • Multi-level aggregation with complex predicates

Note: Actual execution times vary significantly based on platform, hardware, configuration, and data characteristics. Run benchmarks to measure performance for your specific environment.

Resource Utilization Patterns

| Query Type | CPU Usage | Memory Usage | I/O Pattern | Optimization Challenges |
|------------|-----------|--------------|-------------|-------------------------|
| Simple Aggregation | Medium | Low | Sequential | Index selection |
| Multi-Join | High | Medium | Random | Join order optimization |
| Window Functions | High | High | Mixed | Partition management |
| Multi-part CTEs | Very High | Very High | Complex | CTE plan optimization |
| Data Mining | Very High | Very High | Mixed | Statistical computation |
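
Tracking memory and CPU during execution helps attribute these patterns to specific queries. A minimal sketch using the third-party psutil package (an assumption; any process-level monitor works):

import time
import psutil

def run_with_monitoring(conn, sql: str):
    """Execute a query and report elapsed time and process memory delta."""
    process = psutil.Process()
    rss_before = process.memory_info().rss
    start = time.time()
    result = conn.execute(sql).fetchall()
    elapsed = time.time() - start
    rss_delta = process.memory_info().rss - rss_before
    print(f"{elapsed:.2f}s, {rss_delta / 1e6:.1f} MB RSS delta, {len(result)} rows")
    return result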

Configuration Options

Scale Factor Guidelines

| Scale Factor | Data Size | Use Case |
|--------------|-----------|----------|
| 0.01 | ~10 MB | Unit testing, development |
| 0.1 | ~100 MB | Integration testing |
| 1 | ~1 GB | Standard benchmark |
| 10 | ~10 GB | Performance testing |
| 100 | ~100 GB | Enterprise simulation |
| 1000+ | ~1+ TB | Large-scale testing |

Note: Query execution times depend on platform, hardware, query complexity, and configuration. Run benchmarks to establish baselines for your environment.
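
Since raw data is roughly 1 GB per unit of scale factor, a quick pre-flight disk check avoids failed generation runs. A stdlib-only sketch; the 1.5x headroom factor is an arbitrary safety margin:

import shutil

def has_room_for_scale(scale_factor: float, output_dir: str, headroom: float = 1.5) -> bool:
    """Rough check: TPC-DS raw data is ~1 GB per unit of scale factor.

    output_dir must already exist for disk_usage to resolve the filesystem.
    """
    needed_bytes = scale_factor * 1e9 * headroom
    free_bytes = shutil.disk_usage(output_dir).free
    return free_bytes >= needed_bytes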

Advanced Configuration

tpcds = TPCDS(
    scale_factor=1.0,
    output_dir="tpcds_data",
    verbose=True,          # Enable detailed logging
    parallel=8,            # Parallel data generation
    use_c_tools=True       # Use official TPC-DS C tools
)

# Access to internal components
c_tools = tpcds.c_tools           # Direct C tool access
query_manager = tpcds.queries     # Query management
data_generator = tpcds.generator  # Data generation

Integration Examples

Distributed Database Testing

# Large scale data generation for distributed systems
tpcds_large = TPCDS(scale_factor=100, output_dir="/shared/tpcds_sf100")
data_files = tpcds_large.generate_data()

# Partition-aware data loading (illustrative example for Spark)
for table_name in tpcds_large.get_available_tables():
    table_path = f"/shared/tpcds_sf100/{table_name.lower()}.dat"

    # Register the raw pipe-delimited file as a temporary view
    (spark.read
          .option("delimiter", "|")
          .option("header", "false")
          .csv(table_path)
          .createOrReplaceTempView(f"{table_name}_raw"))

    # Create partitioned tables; replace <key_column> with a suitable
    # key for the table (e.g., ss_item_sk for store_sales)
    spark.sql(f"""
        CREATE TABLE {table_name}
        USING DELTA
        PARTITIONED BY (partition_col)
        LOCATION '/delta/{table_name}'
        AS SELECT *,
               CASE
                   WHEN <key_column> % 100 < 10 THEN 'p0'
                   WHEN <key_column> % 100 < 20 THEN 'p1'
                   -- ... more partitions
                   ELSE 'p9'
               END AS partition_col
        FROM {table_name}_raw
    """)

Performance Benchmarking Framework

import time
from statistics import mean, median
from typing import Dict, List

class TPCDSBenchmark:
    def __init__(self, tpcds: TPCDS, connection):
        self.tpcds = tpcds
        self.connection = connection
        
    def benchmark_query_set(self, query_ids: List[int], 
                           iterations: int = 3) -> Dict:
        """Benchmark a set of TPC-DS queries."""
        results = {}
        
        for query_id in query_ids:
            print(f"Benchmarking Q{query_id}...")
            query_results = []
            
            for iteration in range(iterations):
                query_sql = self.tpcds.get_query(query_id, 
                                                seed=42 + iteration,
                                                dialect="duckdb")
                
                start_time = time.time()
                try:
                    result = self.connection.execute(query_sql).fetchall()
                    execution_time = time.time() - start_time
                    
                    query_results.append({
                        'iteration': iteration,
                        'time': execution_time,
                        'rows': len(result),
                        'status': 'success'
                    })
                except Exception as e:
                    execution_time = time.time() - start_time
                    query_results.append({
                        'iteration': iteration, 
                        'time': execution_time,
                        'rows': 0,
                        'status': 'error',
                        'error': str(e)
                    })
            
            # Calculate statistics
            successful_runs = [r for r in query_results if r['status'] == 'success']
            if successful_runs:
                times = [r['time'] for r in successful_runs]
                results[query_id] = {
                    'mean_time': mean(times),
                    'median_time': median(times),
                    'min_time': min(times),
                    'max_time': max(times),
                    'success_rate': len(successful_runs) / len(query_results),
                    'avg_rows': mean([r['rows'] for r in successful_runs]),
                    'runs': query_results
                }
            else:
                results[query_id] = {
                    'mean_time': 0,
                    'median_time': 0,
                    'success_rate': 0,
                    'runs': query_results
                }
        
        return results

# Usage
benchmark = TPCDSBenchmark(tpcds, conn)

# Test different query categories
fast_queries = [1, 6, 11, 13, 16, 21]
medium_queries = [3, 7, 19, 27, 42, 52]
complex_queries = ["14a", "23a", "24a", 64, 78]  # variant queries use string IDs

fast_results = benchmark.benchmark_query_set(fast_queries)
medium_results = benchmark.benchmark_query_set(medium_queries)
complex_results = benchmark.benchmark_query_set(complex_queries)

Best Practices

Data Generation

  1. Use official C tools - TPC-DS requires dsdgen/dsqgen for compliance

  2. Parallel generation - Use multiple processes for large scale factors

  3. Validate completeness - Ensure all 24 tables are generated (see the sketch after this list)

  4. Check file formats - Verify pipe-delimited format and encoding
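
For point 3, a quick completeness check can reuse the table listing and output directory attributes shown earlier. A sketch; it only verifies that every .dat file exists and is non-empty:

def find_missing_tables(tpcds) -> list:
    """Return tables whose generated .dat file is missing or empty."""
    missing = []
    for table_name in tpcds.get_available_tables():
        path = tpcds.output_dir / f"{table_name.lower()}.dat"
        if not path.exists() or path.stat().st_size == 0:
            missing.append(table_name)
    return missing

missing = find_missing_tables(tpcds)
assert not missing, f"Incomplete generation: {missing}"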

Query Execution

  1. Start with simple queries - Test Q1, Q6, Q11 first

  2. Use appropriate timeouts - Complex queries can run for hours

  3. Monitor resource usage - Watch memory consumption for large queries

  4. Test query variants - Include both Q14a and Q14b in testing

Performance Testing

  1. Warm-up runs - Execute queries multiple times for stable timings

  2. Resource monitoring - Track CPU, memory, and I/O during execution

  3. Incremental complexity - Start with simple queries, progress to complex

  4. Statistical analysis - Use the geometric mean for overall performance scores (see the sketch after this list)
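
For point 4, the geometric mean can be computed directly from the results produced by the TPCDSBenchmark class above. A sketch that uses successful queries only, since a zero time would make the product degenerate:

from math import exp, log

def geometric_mean(values):
    """Geometric mean via log-space summation to avoid overflow."""
    return exp(sum(log(v) for v in values) / len(values))

times = [r["mean_time"] for r in fast_results.values() if r["mean_time"] > 0]
print(f"Geometric mean query time: {geometric_mean(times):.3f}s")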

Common Issues and Solutions

Data Generation Issues

Issue: dsdgen/dsqgen not found

# Ensure TPC-DS C tools are compiled and available
cd _sources/tpc-ds/tools
make

Issue: Out of disk space during generation

# Use smaller scale factor or external storage
tpcds = TPCDS(scale_factor=0.1, output_dir="/external/storage")

Query Execution Issues

Issue: Query timeout on complex queries

# Increase the timeout (the setting name varies by engine) and test at a smaller scale factor
conn.execute("SET query_timeout='1800s'")  # 30 minutes
tpcds = TPCDS(scale_factor=0.1)  # Smaller dataset

Issue: Out of memory on analytical queries

# Configure memory limits (in DuckDB, max_memory is an alias for memory_limit)
conn.execute("SET memory_limit='8GB'")

Issue: Parameter substitution errors

# Use explicit seeds and validate query IDs
query = tpcds.get_query(42, seed=12345)  # Reproducible parameters

Streaming Data Generation

BenchBox includes patched TPC-DS binaries that support streaming data generation via stdout. This enables efficient workflows that reduce disk I/O and memory usage.

Using the FILTER Flag

The -FILTER Y flag outputs generated data to stdout instead of creating files:

# Generate ship_mode table to stdout (can be piped to compression)
dsdgen -TABLE ship_mode -SCALE 1 -FILTER Y | zstd > ship_mode.dat.zst

# Generate with fixed seed for reproducibility
dsdgen -TABLE date_dim -SCALE 1 -FILTER Y -RNGSEED 12345 | gzip > date_dim.dat.gz

# Direct loading into database (example with DuckDB)
dsdgen -TABLE customer -SCALE 1 -FILTER Y | \
  duckdb -c "COPY customer FROM '/dev/stdin' (DELIMITER '|')"

Benefits

  • Reduced disk I/O: Data streams directly to compression or database

  • Lower memory usage: No intermediate files needed

  • Faster pipelines: Eliminates write-then-read overhead

  • Parallel compression: Each table can be compressed independently

Streaming in Python

import subprocess
import gzip
from pathlib import Path

def generate_table_compressed(table_name: str, scale: int, output_path: Path):
    """Generate TPC-DS table directly to compressed file."""
    dsdgen = Path("_binaries/tpc-ds/darwin-arm64/dsdgen")
    tools_dir = Path("_sources/tpc-ds/tools")

    with gzip.open(output_path, "wt") as f:
        proc = subprocess.Popen(
            [str(dsdgen), "-TABLE", table_name, "-SCALE", str(scale),
             "-FILTER", "Y", "-RNGSEED", "1", "-TERMINATE", "N"],
            cwd=tools_dir,
            stdout=subprocess.PIPE,
            text=True
        )
        for line in proc.stdout:
            f.write(line)
        proc.wait()
        if proc.returncode != 0:
            raise RuntimeError(f"dsdgen exited with {proc.returncode} for {table_name}")

# Generate compressed data
generate_table_compressed("ship_mode", 1, Path("ship_mode.dat.gz"))

Source Patches

The streaming capability requires patches to the TPC-DS source that fix two bugs in the official TPC-DS tools. See _sources/tpc-ds/PATCHES.md for details.

These patches were originally developed by Greg Rahn in the tpcds-kit repository.

External Resources