TPC-DS Benchmark

Tags: intermediate, concept, tpc-ds

Overview

The TPC-DS (Transaction Processing Performance Council Decision Support) benchmark is one of the most comprehensive and complex decision support benchmarks available today. TPC-DS models a retail data warehouse environment with sophisticated analytical queries that exercise advanced database features including window functions, multi-level common table expressions, OLAP operations, and complex join patterns.

TPC-DS represents the evolution of decision support benchmarks, designed to address the limitations of earlier benchmarks by incorporating modern data warehouse patterns, complex analytical queries, and realistic data distributions that challenge all aspects of a database system’s query processing capabilities.

Key Features

  • 99 analytical queries with varying complexity and selectivity

  • 24-table schema representing a complete retail data warehouse

  • Advanced SQL features including window functions, multi-level CTEs, and ROLLUP/CUBE

  • Complex query patterns with nested subqueries and multiple aggregation levels

  • Realistic data distributions based on real-world retail scenarios

  • Query stream support for throughput and concurrency testing

  • Parameter substitution with sophisticated business logic

  • Multiple scale factors (SF 1 = ~1GB, SF 100 = ~100GB, etc.)

  • SQL dialect support via query translation

Schema Description

The TPC-DS schema models a complete retail data warehouse with three main sales channels: Store Sales, Catalog Sales, and Web Sales. The schema includes fact tables, dimension tables, and return tables that capture the complexity of modern retail operations.
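
As a quick illustration of the star-schema layout, the snippet below runs a typical fact-to-dimension join in DuckDB once the data is loaded. This is a minimal sketch using standard TPC-DS column names, not one of the 99 official benchmark queries:

import duckdb

conn = duckdb.connect("tpcds.duckdb")

# Star join: the store_sales fact table joined to two shared dimensions
sales_by_category = conn.execute("""
    SELECT i.i_category, d.d_year, SUM(ss.ss_ext_sales_price) AS total_sales
    FROM store_sales ss
    JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
    JOIN item i ON ss.ss_item_sk = i.i_item_sk
    GROUP BY i.i_category, d.d_year
    ORDER BY total_sales DESC
""").fetchall()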

Fact Tables (Core Business Events)

| Table | Purpose | Approximate Rows (SF 1) |
|-------|---------|-------------------------|
| STORE_SALES | In-store retail transactions | 2,880,404 |
| CATALOG_SALES | Mail-order catalog sales | 1,441,548 |
| WEB_SALES | Online web sales | 719,384 |
| STORE_RETURNS | In-store returns | 287,514 |
| CATALOG_RETURNS | Catalog returns | 144,067 |
| WEB_RETURNS | Web returns | 71,763 |
| INVENTORY | Daily inventory snapshots | 11,745,000 |

Dimension Tables (Reference Data)

| Table | Purpose | Approximate Rows (SF 1) |
|-------|---------|-------------------------|
| DATE_DIM | Calendar and date attributes | 73,049 |
| TIME_DIM | Time of day breakdown | 86,400 |
| ITEM | Product catalog | 18,000 |
| CUSTOMER | Customer information | 100,000 |
| CUSTOMER_ADDRESS | Customer addresses | 50,000 |
| CUSTOMER_DEMOGRAPHICS | Customer demographic profiles | 1,920,800 |
| HOUSEHOLD_DEMOGRAPHICS | Household characteristics | 7,200 |
| STORE | Store location details | 12 |
| CATALOG_PAGE | Catalog page information | 11,718 |
| WEB_SITE | Website information | 30 |
| WEB_PAGE | Web page details | 60 |
| WAREHOUSE | Distribution centers | 5 |
| PROMOTION | Marketing promotions | 300 |
| INCOME_BAND | Income brackets | 20 |
| REASON | Return/refund reasons | 35 |
| CALL_CENTER | Customer service centers | 6 |
| SHIP_MODE | Shipping methods | 20 |

Schema Relationships

erDiagram
    DATE_DIM ||--o{ STORE_SALES : sold_on
    DATE_DIM ||--o{ CATALOG_SALES : sold_on
    DATE_DIM ||--o{ WEB_SALES : sold_on
    TIME_DIM ||--o{ STORE_SALES : sold_at
    ITEM ||--o{ STORE_SALES : item_sold
    ITEM ||--o{ CATALOG_SALES : item_sold
    ITEM ||--o{ WEB_SALES : item_sold
    CUSTOMER ||--o{ STORE_SALES : purchased_by
    CUSTOMER ||--o{ CATALOG_SALES : purchased_by
    CUSTOMER ||--o{ WEB_SALES : purchased_by
    STORE ||--o{ STORE_SALES : sold_at
    CATALOG_PAGE ||--o{ CATALOG_SALES : from_page
    WEB_SITE ||--o{ WEB_SALES : from_site
    WEB_PAGE ||--o{ WEB_SALES : from_page
    PROMOTION ||--o{ STORE_SALES : promoted_by
    STORE_SALES ||--o{ STORE_RETURNS : returned_from
    CATALOG_SALES ||--o{ CATALOG_RETURNS : returned_from
    WEB_SALES ||--o{ WEB_RETURNS : returned_from

Query Characteristics

The 99 TPC-DS queries are organized into different complexity categories and test various aspects of analytical query processing:

Query Categories by Complexity

Reporting Queries (Simple Aggregation):

  • Q1, Q6, Q11, Q13: Basic aggregation with simple grouping

  • Focus on fundamental aggregation performance

Interactive Queries (Medium Complexity):

  • Q3, Q7, Q19, Q27, Q42: Multi-table joins with moderate complexity

  • Window functions and ranking operations

  • Typical business intelligence patterns

Deep Analytics (High Complexity):

  • Q14a/14b, Q23a/23b, Q24a/24b: Advanced analytical patterns

  • Multi-part CTEs and complex mathematical calculations

  • Multi-level aggregation and correlation analysis

Data Mining Queries (Very High Complexity):

  • Q64, Q67, Q70, Q86: Statistical analysis and data mining patterns

  • Customer segmentation and behavior analysis

  • Advanced window functions and analytical operations
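
These categories can drive targeted test runs. A minimal sketch, assuming a `tpcds` instance created as in the Usage Examples below; the category-to-query mapping simply mirrors the lists above:

categories = {
    "reporting": [1, 6, 11, 13],
    "interactive": [3, 7, 19, 27, 42],
    "deep_analytics": ["14a", "14b", "23a", "23b", "24a", "24b"],
    "data_mining": [64, 67, 70, 86],
}

# Fetch parameterized SQL per category using the get_query API shown later
for category, query_ids in categories.items():
    sql_by_id = {qid: tpcds.get_query(qid, seed=42) for qid in query_ids}
    print(f"{category}: {len(sql_by_id)} queries")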

Key Query Features

| Feature | Example Queries | Description |
|---------|-----------------|-------------|
| Window Functions | Q17, Q25, Q47, Q51 | ROW_NUMBER(), RANK(), LAG/LEAD |
| Multi-part CTEs | Q14a, Q23a | Multiple reusable WITH subqueries |
| ROLLUP/CUBE | Q5, Q18, Q77 | OLAP aggregation operations |
| Complex Joins | Q64, Q78, Q95 | 6+ table joins with multiple predicates |
| Correlation Analysis | Q35, Q38, Q87 | Cross-channel correlation patterns |
| Time Series | Q12, Q20, Q21 | Year-over-year and temporal analysis |
| Ranking/Top-N | Q37, Q40, Q52 | Customer and product ranking |
| Statistical Functions | Q9, Q46, Q98 | STDDEV, VARIANCE, CORR |
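
For example, the ranking pattern exercised by queries like Q47 combines grouped aggregation with window functions. The snippet below is a simplified illustration of that shape, not the official query text; the column names assume the standard TPC-DS schema:

# Rank each category's months by total sales (Q47-style pattern)
monthly_rank_sql = """
    SELECT i_category, d_year, d_moy,
           SUM(ss_sales_price) AS monthly_sales,
           RANK() OVER (
               PARTITION BY i_category
               ORDER BY SUM(ss_sales_price) DESC
           ) AS sales_rank
    FROM store_sales
    JOIN date_dim ON ss_sold_date_sk = d_date_sk
    JOIN item ON ss_item_sk = i_item_sk
    GROUP BY i_category, d_year, d_moy
"""
rows = conn.execute(monthly_rank_sql).fetchall()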

Query Variants

Some queries have variants (e.g., Q14a/Q14b) that test slightly different patterns:

  • Query 14a/14b: Different approaches to cross-channel analysis

  • Query 23a/23b: Alternative CTE-based query patterns

  • Query 24a/24b: Different aggregation strategies

Usage Examples

Basic Query Generation

from benchbox import TPCDS

# Initialize TPC-DS benchmark
tpcds = TPCDS(scale_factor=1.0, output_dir="tpcds_data")

# Generate data using C tools
data_files = tpcds.generate_data()

# Get all queries (1-99)
queries = tpcds.get_queries()
print(f"Generated {len(queries)} queries")

# Get specific query with parameters
query_1 = tpcds.get_query(1, seed=42)
print(query_1)

# Query variants
query_14a = tpcds.get_query("14a", seed=42)
query_14b = tpcds.get_query("14b", seed=42)

Data Generation with C Tools

# TPC-DS requires the official C tools (dsdgen/dsqgen)
tpcds = TPCDS(scale_factor=1.0, output_dir="tpcds_sf1", verbose=True)

# Generate all tables using dsdgen
data_files = tpcds.generate_data()

# Check generated files 
for table_name in tpcds.get_available_tables():
    table_path = tpcds.output_dir / f"{table_name.lower()}.dat"
    print(f"{table_name}: {table_path} ({table_path.stat().st_size} bytes)")

Stream Generation and Testing

# Generate query streams for throughput testing
tpcds = TPCDS(scale_factor=1.0, output_dir="tpcds_data")

# Generate multiple concurrent streams
streams = tpcds.generate_streams(
    num_streams=4,
    rng_seed=42,
    streams_output_dir="streams"
)

# Each stream contains a random permutation of queries
for i, stream_info in enumerate(tpcds.get_all_streams_info()):
    print(f"Stream {i}:")
    print(f"  Queries: {stream_info['query_count']}")
    print(f"  File: {stream_info['output_file']}")

DuckDB Integration Example

import duckdb
from benchbox import TPCDS

# Initialize and generate data
tpcds = TPCDS(scale_factor=0.1, output_dir="tpcds_small")
data_files = tpcds.generate_data()

# Create DuckDB connection and schema
conn = duckdb.connect("tpcds.duckdb")
schema_sql = tpcds.get_create_tables_sql()
conn.execute(schema_sql)

# Load data from DAT files
table_mappings = {
    'call_center': 'call_center.dat',
    'catalog_page': 'catalog_page.dat',
    'catalog_returns': 'catalog_returns.dat',
    'catalog_sales': 'catalog_sales.dat',
    'customer': 'customer.dat',
    'customer_address': 'customer_address.dat',
    'customer_demographics': 'customer_demographics.dat',
    'date_dim': 'date_dim.dat',
    'household_demographics': 'household_demographics.dat',
    'income_band': 'income_band.dat',
    'inventory': 'inventory.dat',
    'item': 'item.dat',
    'promotion': 'promotion.dat',
    'reason': 'reason.dat',
    'ship_mode': 'ship_mode.dat',
    'store': 'store.dat',
    'store_returns': 'store_returns.dat',
    'store_sales': 'store_sales.dat',
    'time_dim': 'time_dim.dat',
    'warehouse': 'warehouse.dat',
    'web_page': 'web_page.dat',
    'web_returns': 'web_returns.dat',
    'web_sales': 'web_sales.dat',
    'web_site': 'web_site.dat'
}

for table_name, file_name in table_mappings.items():
    file_path = tpcds.output_dir / file_name
    if file_path.exists():
        conn.execute(f"""
            INSERT INTO {table_name}
            SELECT * FROM read_csv('{file_path}', 
                                  delim='|', 
                                  header=false,
                                  nullstr='',
                                  ignore_errors=true)
        """)
        print(f"Loaded {table_name}")

# Run sample queries
sample_queries = [1, 3, 7, 19, 27, 42]
for query_id in sample_queries:
    try:
        query_sql = tpcds.get_query(query_id, dialect="duckdb")
        result = conn.execute(query_sql).fetchall()
        print(f"Q{query_id}: {len(result)} rows")
    except Exception as e:
        print(f"Q{query_id}: Error - {str(e)[:100]}...")

Advanced Parameter Management

# Use specific seeds for reproducible results
query_reproducible = tpcds.get_query(42, seed=12345)

# Scale-aware parameter generation
query_large = tpcds.get_query(42, 
                              seed=12345, 
                              scale_factor=100.0)

# Stream-based parameter generation
query_stream = tpcds.get_query(42,
                               seed=12345,
                               stream_id=0)

Performance Characteristics

Query Complexity Categories

Simple Queries (low complexity):

  • Q1, Q6, Q11, Q13: Simple aggregation queries

  • Q16, Q21, Q32: Basic joins with simple predicates

Moderate Queries (medium complexity):

  • Q3, Q7, Q19, Q27: Multi-table joins with moderate complexity

  • Q25, Q29, Q43: Window functions and ranking

  • Q52, Q55, Q59: Customer analysis patterns

Complex Queries (high complexity):

  • Q14a/b, Q23a/b, Q24a/b: Advanced analytical patterns

  • Q64, Q67, Q78: Complex joins and correlations

  • Q95, Q98: Statistical and data mining operations

Very Complex Queries (highest complexity):

  • Q70, Q86: Extremely complex analytical patterns

  • Queries with large intermediate result sets

  • Multi-level aggregation with complex predicates

Note: Actual execution times vary significantly based on platform, hardware, configuration, and data characteristics. Run benchmarks to measure performance for your specific environment.

Resource Utilization Patterns

| Query Type | CPU Usage | Memory Usage | I/O Pattern | Optimization Challenges |
|------------|-----------|--------------|-------------|-------------------------|
| Simple Aggregation | Medium | Low | Sequential | Index selection |
| Multi-Join | High | Medium | Random | Join order optimization |
| Window Functions | High | High | Mixed | Partition management |
| Multi-part CTEs | Very High | Very High | Complex | CTE plan optimization |
| Data Mining | Very High | Very High | Mixed | Statistical computation |
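
Tracking memory and CPU during execution helps attribute these patterns to specific queries. A minimal sketch using the third-party psutil package (an assumption; any process-level monitor works):

import time
import psutil

def run_with_monitoring(conn, sql: str):
    """Execute a query and report elapsed time and process memory delta."""
    process = psutil.Process()
    rss_before = process.memory_info().rss
    start = time.time()
    result = conn.execute(sql).fetchall()
    elapsed = time.time() - start
    rss_delta = process.memory_info().rss - rss_before
    print(f"{elapsed:.2f}s, {rss_delta / 1e6:.1f} MB RSS delta, {len(result)} rows")
    return result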

Configuration Options

Scale Factor Guidelines

| Scale Factor | Data Size | Use Case |
|--------------|-----------|----------|
| 0.01 | ~10 MB | Unit testing, development |
| 0.1 | ~100 MB | Integration testing |
| 1 | ~1 GB | Standard benchmark |
| 10 | ~10 GB | Performance testing |
| 100 | ~100 GB | Enterprise simulation |
| 1000+ | ~1+ TB | Large-scale testing |

Note: Query execution times depend on platform, hardware, query complexity, and configuration. Run benchmarks to establish baselines for your environment.
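
Since raw data is roughly 1 GB per unit of scale factor, a quick pre-flight disk check avoids failed generation runs. A stdlib-only sketch; the 1.5x headroom factor is an arbitrary safety margin:

import shutil

def has_room_for_scale(scale_factor: float, output_dir: str, headroom: float = 1.5) -> bool:
    """Rough check: TPC-DS raw data is ~1 GB per unit of scale factor.

    output_dir must already exist for disk_usage to resolve the filesystem.
    """
    needed_bytes = scale_factor * 1e9 * headroom
    free_bytes = shutil.disk_usage(output_dir).free
    return free_bytes >= needed_bytes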

Advanced Configuration

tpcds = TPCDS(
    scale_factor=1.0,
    output_dir="tpcds_data",
    verbose=True,          # Enable detailed logging
    parallel=8,            # Parallel data generation
    use_c_tools=True       # Use official TPC-DS C tools
)

# Access to internal components
c_tools = tpcds.c_tools           # Direct C tool access
query_manager = tpcds.queries     # Query management
data_generator = tpcds.generator  # Data generation

Integration Examples

Distributed Database Testing

# Large scale data generation for distributed systems
tpcds_large = TPCDS(scale_factor=100, output_dir="/shared/tpcds_sf100")
data_files = tpcds_large.generate_data()

# Partition-aware data loading (illustrative example for Spark)
for table_name in tpcds_large.get_available_tables():
    table_path = f"/shared/tpcds_sf100/{table_name.lower()}.dat"

    # Register the raw pipe-delimited file as a temporary view
    (spark.read
          .option("delimiter", "|")
          .option("header", "false")
          .csv(table_path)
          .createOrReplaceTempView(f"{table_name}_raw"))

    # Create partitioned tables; replace <key_column> with a suitable
    # key for the table (e.g., ss_item_sk for store_sales)
    spark.sql(f"""
        CREATE TABLE {table_name}
        USING DELTA
        PARTITIONED BY (partition_col)
        LOCATION '/delta/{table_name}'
        AS SELECT *,
               CASE
                   WHEN <key_column> % 100 < 10 THEN 'p0'
                   WHEN <key_column> % 100 < 20 THEN 'p1'
                   -- ... more partitions
                   ELSE 'p9'
               END AS partition_col
        FROM {table_name}_raw
    """)

Performance Benchmarking Framework

import time
from statistics import mean, median
from typing import Dict, List

class TPCDSBenchmark:
    def __init__(self, tpcds: TPCDS, connection):
        self.tpcds = tpcds
        self.connection = connection
        
    def benchmark_query_set(self, query_ids: List[int], 
                           iterations: int = 3) -> Dict:
        """Benchmark a set of TPC-DS queries."""
        results = {}
        
        for query_id in query_ids:
            print(f"Benchmarking Q{query_id}...")
            query_results = []
            
            for iteration in range(iterations):
                query_sql = self.tpcds.get_query(query_id, 
                                                seed=42 + iteration,
                                                dialect="duckdb")
                
                start_time = time.time()
                try:
                    result = self.connection.execute(query_sql).fetchall()
                    execution_time = time.time() - start_time
                    
                    query_results.append({
                        'iteration': iteration,
                        'time': execution_time,
                        'rows': len(result),
                        'status': 'success'
                    })
                except Exception as e:
                    execution_time = time.time() - start_time
                    query_results.append({
                        'iteration': iteration, 
                        'time': execution_time,
                        'rows': 0,
                        'status': 'error',
                        'error': str(e)
                    })
            
            # Calculate statistics
            successful_runs = [r for r in query_results if r['status'] == 'success']
            if successful_runs:
                times = [r['time'] for r in successful_runs]
                results[query_id] = {
                    'mean_time': mean(times),
                    'median_time': median(times),
                    'min_time': min(times),
                    'max_time': max(times),
                    'success_rate': len(successful_runs) / len(query_results),
                    'avg_rows': mean([r['rows'] for r in successful_runs]),
                    'runs': query_results
                }
            else:
                results[query_id] = {
                    'mean_time': 0,
                    'median_time': 0,
                    'success_rate': 0,
                    'runs': query_results
                }
        
        return results

# Usage
benchmark = TPCDSBenchmark(tpcds, conn)

# Test different query categories
fast_queries = [1, 6, 11, 13, 16, 21]
medium_queries = [3, 7, 19, 27, 42, 52]
complex_queries = ["14a", "23a", "24a", 64, 78]  # variant queries use string IDs

fast_results = benchmark.benchmark_query_set(fast_queries)
medium_results = benchmark.benchmark_query_set(medium_queries)
complex_results = benchmark.benchmark_query_set(complex_queries)

Best Practices

Data Generation

  1. Use official C tools - TPC-DS requires dsdgen/dsqgen for compliance

  2. Parallel generation - Use multiple processes for large scale factors

  3. Validate completeness - Ensure all 24 tables are generated (see the sketch after this list)

  4. Check file formats - Verify pipe-delimited format and encoding
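
For point 3, a quick completeness check can reuse the table listing and output directory attributes shown earlier. A sketch; it only verifies that every .dat file exists and is non-empty:

def find_missing_tables(tpcds) -> list:
    """Return tables whose generated .dat file is missing or empty."""
    missing = []
    for table_name in tpcds.get_available_tables():
        path = tpcds.output_dir / f"{table_name.lower()}.dat"
        if not path.exists() or path.stat().st_size == 0:
            missing.append(table_name)
    return missing

missing = find_missing_tables(tpcds)
assert not missing, f"Incomplete generation: {missing}"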

Query Execution

  1. Start with simple queries - Test Q1, Q6, Q11 first

  2. Use appropriate timeouts - Complex queries can run for hours

  3. Monitor resource usage - Watch memory consumption for large queries

  4. Test query variants - Include both Q14a and Q14b in testing

Performance Testing

  1. Warm-up runs - Execute queries multiple times for stable timings

  2. Resource monitoring - Track CPU, memory, and I/O during execution

  3. Incremental complexity - Start with simple queries, progress to complex

  4. Statistical analysis - Use the geometric mean for overall performance scores (see the sketch after this list)
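
For point 4, the geometric mean can be computed directly from the results produced by the TPCDSBenchmark class above. A sketch that uses successful queries only, since a zero time would make the product degenerate:

from math import exp, log

def geometric_mean(values):
    """Geometric mean via log-space summation to avoid overflow."""
    return exp(sum(log(v) for v in values) / len(values))

times = [r["mean_time"] for r in fast_results.values() if r["mean_time"] > 0]
print(f"Geometric mean query time: {geometric_mean(times):.3f}s")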

Common Issues and Solutions

Data Generation Issues

Issue: dsdgen/dsqgen not found

# Ensure TPC-DS C tools are compiled and available
cd _sources/tpc-ds/tools
make

Issue: Out of disk space during generation

# Use smaller scale factor or external storage
tpcds = TPCDS(scale_factor=0.1, output_dir="/external/storage")

Query Execution Issues

Issue: Query timeout on complex queries

# Increase the timeout (the setting name varies by engine) and test at a smaller scale factor
conn.execute("SET query_timeout='1800s'")  # 30 minutes
tpcds = TPCDS(scale_factor=0.1)  # Smaller dataset

Issue: Out of memory on analytical queries

# Configure memory limits (in DuckDB, max_memory is an alias for memory_limit)
conn.execute("SET memory_limit='8GB'")

Issue: Parameter substitution errors

# Use explicit seeds and validate query IDs
query = tpcds.get_query(42, seed=12345)  # Reproducible parameters

Streaming Data Generation

BenchBox includes patched TPC-DS binaries that support streaming data generation via stdout. This enables efficient workflows that reduce disk I/O and memory usage.

Using the FILTER Flag

The -FILTER Y flag outputs generated data to stdout instead of creating files:

# Generate ship_mode table to stdout (can be piped to compression)
dsdgen -TABLE ship_mode -SCALE 1 -FILTER Y | zstd > ship_mode.dat.zst

# Generate with fixed seed for reproducibility
dsdgen -TABLE date_dim -SCALE 1 -FILTER Y -RNGSEED 12345 | gzip > date_dim.dat.gz

# Direct loading into database (example with DuckDB)
dsdgen -TABLE customer -SCALE 1 -FILTER Y | \
  duckdb -c "COPY customer FROM '/dev/stdin' (DELIMITER '|')"

Benefits

  • Reduced disk I/O: Data streams directly to compression or database

  • Lower memory usage: No intermediate files needed

  • Faster pipelines: Eliminates write-then-read overhead

  • Parallel compression: Each table can be compressed independently

Streaming in Python

import subprocess
import gzip
from pathlib import Path

def generate_table_compressed(table_name: str, scale: int, output_path: Path):
    """Generate TPC-DS table directly to compressed file."""
    dsdgen = Path("_binaries/tpc-ds/darwin-arm64/dsdgen")
    tools_dir = Path("_sources/tpc-ds/tools")

    with gzip.open(output_path, "wt") as f:
        proc = subprocess.Popen(
            [str(dsdgen), "-TABLE", table_name, "-SCALE", str(scale),
             "-FILTER", "Y", "-RNGSEED", "1", "-TERMINATE", "N"],
            cwd=tools_dir,
            stdout=subprocess.PIPE,
            text=True
        )
        for line in proc.stdout:
            f.write(line)
        proc.wait()
        if proc.returncode != 0:
            raise RuntimeError(f"dsdgen exited with {proc.returncode} for {table_name}")

# Generate compressed data
generate_table_compressed("ship_mode", 1, Path("ship_mode.dat.gz"))

Source Patches

The streaming capability requires patches to the TPC-DS source that fix two bugs in the official TPC-DS tools. See _sources/tpc-ds/PATCHES.md for details.

These patches were originally developed by Greg Rahn in the tpcds-kit repository.

External Resources