TPC-DS Benchmark¶
Overview¶
The TPC-DS (Transaction Processing Performance Council - Decision Support) benchmark is one of the most comprehensive and complex decision support benchmarks available today. TPC-DS models a retail data warehouse environment with sophisticated analytical queries that exercise advanced database features, including window functions, multi-part common table expressions (CTEs), OLAP operations, and complex join patterns.
TPC-DS represents the evolution of decision support benchmarks, designed to address the limitations of earlier benchmarks by incorporating modern data warehouse patterns, complex analytical queries, and realistic data distributions that challenge all aspects of a database system’s query processing capabilities.
Key Features¶
99 analytical queries with varying complexity and selectivity
24-table schema representing a realistic retail data warehouse
Advanced SQL features including window functions, complex CTEs, ROLLUP/CUBE
Complex query patterns with nested subqueries and multiple aggregation levels
Realistic data distributions based on real-world retail scenarios
Stream processing support for throughput and concurrency testing
Parameter substitution with sophisticated business logic
Multiple scale factors (SF 1 = ~1GB, SF 100 = ~100GB, etc.)
SQL dialect support via translation capabilities
Schema Description¶
The TPC-DS schema models a realistic retail data warehouse with three main sales channels: Store Sales, Catalog Sales, and Web Sales. The schema includes fact tables, dimension tables, and return tables that capture the complexity of modern retail operations.
Fact Tables (Core Business Events)¶
| Table | Purpose | Approximate Rows (SF 1) |
|---|---|---|
| STORE_SALES | In-store retail transactions | 2,880,404 |
| CATALOG_SALES | Mail-order catalog sales | 1,441,548 |
| WEB_SALES | Online web sales | 719,384 |
| STORE_RETURNS | In-store returns | 287,514 |
| CATALOG_RETURNS | Catalog returns | 144,067 |
| WEB_RETURNS | Web returns | 71,763 |
| INVENTORY | Weekly inventory snapshots | 11,745,000 |
Dimension Tables (Reference Data)¶
| Table | Purpose | Approximate Rows (SF 1) |
|---|---|---|
| DATE_DIM | Calendar and date attributes | 73,049 |
| TIME_DIM | Time of day breakdown | 86,400 |
| ITEM | Product catalog | 18,000 |
| CUSTOMER | Customer information | 100,000 |
| CUSTOMER_ADDRESS | Customer addresses | 50,000 |
| CUSTOMER_DEMOGRAPHICS | Customer demographic profiles | 1,920,800 |
| HOUSEHOLD_DEMOGRAPHICS | Household characteristics | 7,200 |
| STORE | Store location details | 12 |
| CATALOG_PAGE | Catalog page information | 11,718 |
| WEB_SITE | Website information | 30 |
| WEB_PAGE | Web page details | 60 |
| WAREHOUSE | Distribution centers | 5 |
| PROMOTION | Marketing promotions | 300 |
| INCOME_BAND | Income brackets | 20 |
| REASON | Return/refund reasons | 35 |
| CALL_CENTER | Customer service centers | 6 |
| SHIP_MODE | Shipping methods | 20 |
Schema Relationships¶
```mermaid
erDiagram
    DATE_DIM ||--o{ STORE_SALES : sold_on
    DATE_DIM ||--o{ CATALOG_SALES : sold_on
    DATE_DIM ||--o{ WEB_SALES : sold_on
    TIME_DIM ||--o{ STORE_SALES : sold_at
    ITEM ||--o{ STORE_SALES : item_sold
    ITEM ||--o{ CATALOG_SALES : item_sold
    ITEM ||--o{ WEB_SALES : item_sold
    CUSTOMER ||--o{ STORE_SALES : purchased_by
    CUSTOMER ||--o{ CATALOG_SALES : purchased_by
    CUSTOMER ||--o{ WEB_SALES : purchased_by
    STORE ||--o{ STORE_SALES : sold_at
    CATALOG_PAGE ||--o{ CATALOG_SALES : from_page
    WEB_SITE ||--o{ WEB_SALES : from_site
    WEB_PAGE ||--o{ WEB_SALES : from_page
    PROMOTION ||--o{ STORE_SALES : promoted_by
    STORE_SALES ||--o{ STORE_RETURNS : returned_from
    CATALOG_SALES ||--o{ CATALOG_RETURNS : returned_from
    WEB_SALES ||--o{ WEB_RETURNS : returned_from
```
Query Characteristics¶
The 99 TPC-DS queries are organized into different complexity categories and test various aspects of analytical query processing:
Query Categories by Complexity¶
Reporting Queries (Simple Aggregation):
Q1, Q6, Q11, Q13: Basic aggregation with simple grouping
Focus on fundamental aggregation performance
Interactive Queries (Medium Complexity):
Q3, Q7, Q19, Q27, Q42: Multi-table joins with moderate complexity
Window functions and ranking operations
Typical business intelligence patterns
Deep Analytics (High Complexity):
Q14a/14b, Q23a/23b, Q24a/24b: Advanced analytical patterns
Multi-part CTEs and complex mathematical calculations
Multi-level aggregation and correlation analysis
Data Mining Queries (Very High Complexity):
Q64, Q67, Q70, Q86: Statistical analysis and data mining patterns
Customer segmentation and behavior analysis
Advanced window functions and analytical operations
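The tiers above can be captured as a simple lookup for batch testing. A minimal sketch; the `QUERY_TIERS` mapping is an illustrative grouping taken from this page, not an official TPC-DS classification:

```python
# Query tiers as described above; IDs mirror the categories on this page.
QUERY_TIERS = {
    "reporting": [1, 6, 11, 13],
    "interactive": [3, 7, 19, 27, 42],
    "deep_analytics": ["14a", "14b", "23a", "23b", "24a", "24b"],
    "data_mining": [64, 67, 70, 86],
}

def queries_for_tier(tier: str) -> list:
    """Return the query IDs for one complexity tier."""
    return QUERY_TIERS[tier]

# Typical use with the API shown later on this page:
# for qid in queries_for_tier("interactive"):
#     sql = tpcds.get_query(qid, seed=42)
```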
Key Query Features¶
| Feature | Example Queries | Description |
|---|---|---|
| Window Functions | Q17, Q25, Q47, Q51 | ROW_NUMBER(), RANK(), LAG/LEAD |
| Multi-part CTEs | Q14a, Q23a | Complex WITH clauses feeding multiple subqueries |
| ROLLUP/CUBE | Q5, Q18, Q77 | OLAP aggregation operations |
| Complex Joins | Q64, Q78, Q95 | 6+ table joins with multiple predicates |
| Correlation Analysis | Q35, Q38, Q87 | Cross-channel correlation patterns |
| Time Series | Q12, Q20, Q21 | Year-over-year and temporal analysis |
| Ranking/Top-N | Q37, Q40, Q52 | Customer and product ranking |
| Statistical Functions | Q9, Q46, Q98 | STDDEV, VARIANCE, CORR |
Query Variants¶
Some queries have variants (e.g., Q14a/Q14b) that test slightly different patterns:
Query 14a/14b: Different approaches to cross-channel analysis
Query 23a/23b: Alternative CTE-based query patterns
Query 24a/24b: Different aggregation strategies
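When building query lists it is easy to forget that the variant queries have no bare-numbered form. A small hypothetical helper, limited to the pairs listed above, that expands a base number into both variants:

```python
# The variant pairs called out on this page (a/b forms of one business question).
VARIANT_PAIRS = [("14a", "14b"), ("23a", "23b"), ("24a", "24b")]

def expand_variants(query_ids):
    """Replace a bare variant number (e.g. 14) with both of its a/b forms."""
    variant_bases = {pair[0][:-1] for pair in VARIANT_PAIRS}  # {"14", "23", "24"}
    expanded = []
    for qid in query_ids:
        if str(qid) in variant_bases:
            expanded.extend([f"{qid}a", f"{qid}b"])
        else:
            expanded.append(qid)
    return expanded
```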
Usage Examples¶
Basic Query Generation¶
```python
from benchbox import TPCDS

# Initialize TPC-DS benchmark
tpcds = TPCDS(scale_factor=1.0, output_dir="tpcds_data")

# Generate data using C tools
data_files = tpcds.generate_data()

# Get all queries (1-99)
queries = tpcds.get_queries()
print(f"Generated {len(queries)} queries")

# Get specific query with parameters
query_1 = tpcds.get_query(1, seed=42)
print(query_1)

# Query variants
query_14a = tpcds.get_query("14a", seed=42)
query_14b = tpcds.get_query("14b", seed=42)
```
Data Generation with C Tools¶
```python
from benchbox import TPCDS

# TPC-DS requires the official C tools (dsdgen/dsqgen)
tpcds = TPCDS(scale_factor=1.0, output_dir="tpcds_sf1", verbose=True)

# Generate all tables using dsdgen
data_files = tpcds.generate_data()

# Check generated files
for table_name in tpcds.get_available_tables():
    table_path = tpcds.output_dir / f"{table_name.lower()}.dat"
    print(f"{table_name}: {table_path} ({table_path.stat().st_size} bytes)")
```
Stream Generation and Testing¶
```python
from benchbox import TPCDS

# Generate query streams for throughput testing
tpcds = TPCDS(scale_factor=1.0, output_dir="tpcds_data")

# Generate multiple concurrent streams
streams = tpcds.generate_streams(
    num_streams=4,
    rng_seed=42,
    streams_output_dir="streams"
)

# Each stream contains a random permutation of queries
for i, stream_info in enumerate(tpcds.get_all_streams_info()):
    print(f"Stream {i}:")
    print(f"  Queries: {stream_info['query_count']}")
    print(f"  File: {stream_info['output_file']}")
```
DuckDB Integration Example¶
```python
import duckdb
from benchbox import TPCDS

# Initialize and generate data
tpcds = TPCDS(scale_factor=0.1, output_dir="tpcds_small")
data_files = tpcds.generate_data()

# Create DuckDB connection and schema
conn = duckdb.connect("tpcds.duckdb")
schema_sql = tpcds.get_create_tables_sql()
conn.execute(schema_sql)

# Load data from DAT files (table name -> dsdgen output file)
table_mappings = {
    'call_center': 'call_center.dat',
    'catalog_page': 'catalog_page.dat',
    'catalog_returns': 'catalog_returns.dat',
    'catalog_sales': 'catalog_sales.dat',
    'customer': 'customer.dat',
    'customer_address': 'customer_address.dat',
    'customer_demographics': 'customer_demographics.dat',
    'date_dim': 'date_dim.dat',
    'household_demographics': 'household_demographics.dat',
    'income_band': 'income_band.dat',
    'inventory': 'inventory.dat',
    'item': 'item.dat',
    'promotion': 'promotion.dat',
    'reason': 'reason.dat',
    'ship_mode': 'ship_mode.dat',
    'store': 'store.dat',
    'store_returns': 'store_returns.dat',
    'store_sales': 'store_sales.dat',
    'time_dim': 'time_dim.dat',
    'warehouse': 'warehouse.dat',
    'web_page': 'web_page.dat',
    'web_returns': 'web_returns.dat',
    'web_sales': 'web_sales.dat',
    'web_site': 'web_site.dat'
}

for table_name, file_name in table_mappings.items():
    file_path = tpcds.output_dir / file_name
    if file_path.exists():
        conn.execute(f"""
            INSERT INTO {table_name}
            SELECT * FROM read_csv('{file_path}',
                                   delim='|',
                                   header=false,
                                   nullstr='',
                                   ignore_errors=true)
        """)
        print(f"Loaded {table_name}")

# Run sample queries
sample_queries = [1, 3, 7, 19, 27, 42]
for query_id in sample_queries:
    try:
        query_sql = tpcds.get_query(query_id, dialect="duckdb")
        result = conn.execute(query_sql).fetchall()
        print(f"Q{query_id}: {len(result)} rows")
    except Exception as e:
        print(f"Q{query_id}: Error - {str(e)[:100]}...")
```
Advanced Parameter Management¶
```python
# Use specific seeds for reproducible results
query_reproducible = tpcds.get_query(42, seed=12345)

# Scale-aware parameter generation
query_large = tpcds.get_query(42, seed=12345, scale_factor=100.0)

# Stream-based parameter generation
query_stream = tpcds.get_query(42, seed=12345, stream_id=0)
```
Performance Characteristics¶
Query Complexity Categories¶
Simple Queries (low complexity):
Q1, Q6, Q11, Q13: Simple aggregation queries
Q16, Q21, Q32: Basic joins with simple predicates
Moderate Queries (medium complexity):
Q3, Q7, Q19, Q27: Multi-table joins with moderate complexity
Q25, Q29, Q43: Window functions and ranking
Q52, Q55, Q59: Customer analysis patterns
Complex Queries (high complexity):
Q14a/b, Q23a/b, Q24a/b: Advanced analytical patterns
Q64, Q67, Q78: Complex joins and correlations
Q95, Q98: Statistical and data mining operations
Very Complex Queries (highest complexity):
Q70, Q86: Extremely complex analytical patterns
Queries with large intermediate result sets
Multi-level aggregation with complex predicates
Note: Actual execution times vary significantly based on platform, hardware, configuration, and data characteristics. Run benchmarks to measure performance for your specific environment.
Resource Utilization Patterns¶
| Query Type | CPU Usage | Memory Usage | I/O Pattern | Optimization Challenges |
|---|---|---|---|---|
| Simple Aggregation | Medium | Low | Sequential | Index selection |
| Multi-Join | High | Medium | Random | Join order optimization |
| Window Functions | High | High | Mixed | Partition management |
| Complex CTEs | Very High | Very High | Complex | CTE plan optimization |
| Data Mining | Very High | Very High | Mixed | Statistical computation |
Configuration Options¶
Scale Factor Guidelines¶
| Scale Factor | Data Size | Use Case |
|---|---|---|
| 0.01 | ~10 MB | Unit testing, development |
| 0.1 | ~100 MB | Integration testing |
| 1 | ~1 GB | Standard benchmark |
| 10 | ~10 GB | Performance testing |
| 100 | ~100 GB | Enterprise simulation |
| 1000+ | ~1+ TB | Large-scale testing |
Note: Query execution times depend on platform, hardware, query complexity, and configuration. Run benchmarks to establish baselines for your environment.
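As a rough planning aid, raw TPC-DS data is about 1 GB per unit of scale factor, so a disk budget implies a maximum usable scale factor. The sketch below picks the largest standard scale factor that fits; the 3x overhead multiplier (staging files plus a loaded database copy) is a rule of thumb, not a TPC-DS figure:

```python
def approx_data_size_gb(scale_factor: float) -> float:
    """Raw TPC-DS data is roughly 1 GB per unit of scale factor."""
    return scale_factor * 1.0

def max_scale_for_budget(budget_gb: float, overhead: float = 3.0) -> float:
    """Largest standard scale factor whose raw data, times an assumed
    overhead multiplier for staging and loaded copies, fits the budget."""
    standard = [0.01, 0.1, 1, 10, 100, 1000]
    fitting = [sf for sf in standard
               if approx_data_size_gb(sf) * overhead <= budget_gb]
    return max(fitting) if fitting else standard[0]
```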
Advanced Configuration¶
```python
from benchbox import TPCDS

tpcds = TPCDS(
    scale_factor=1.0,
    output_dir="tpcds_data",
    verbose=True,       # Enable detailed logging
    parallel=8,         # Parallel data generation
    use_c_tools=True    # Use official TPC-DS C tools
)

# Access to internal components
c_tools = tpcds.c_tools           # Direct C tool access
query_manager = tpcds.queries     # Query management
data_generator = tpcds.generator  # Data generation
```
Integration Examples¶
Distributed Database Testing¶
```python
from benchbox import TPCDS

# Large-scale data generation for distributed systems
tpcds_large = TPCDS(scale_factor=100, output_dir="/shared/tpcds_sf100")
data_files = tpcds_large.generate_data()

# Partition-aware data loading (illustrative Spark sketch; assumes an
# active SparkSession named `spark`)
for table_name in tpcds_large.get_available_tables():
    table_path = f"/shared/tpcds_sf100/{table_name.lower()}.dat"
    # Register the raw pipe-delimited file as a temporary view
    df = spark.read.csv(table_path, sep="|", header=False)
    df.createOrReplaceTempView(f"{table_name}_raw")
    # Create a partitioned Delta table, bucketing rows into coarse
    # partitions p0..p9 by hashing the first column (surrogate keys use
    # per-table prefixes such as ss_item_sk, so a generic expression is
    # used here)
    spark.sql(f"""
        CREATE TABLE {table_name}
        USING DELTA
        PARTITIONED BY (partition_col)
        LOCATION '/delta/{table_name}'
        AS SELECT *,
               CONCAT('p', CAST(ABS(HASH(_c0)) % 10 AS STRING)) AS partition_col
        FROM {table_name}_raw
    """)
```
Performance Benchmarking Framework¶
```python
import time
from statistics import mean, median
from typing import Dict, List

from benchbox import TPCDS

class TPCDSBenchmark:
    def __init__(self, tpcds: TPCDS, connection):
        self.tpcds = tpcds
        self.connection = connection

    def benchmark_query_set(self, query_ids: List[int],
                            iterations: int = 3) -> Dict:
        """Benchmark a set of TPC-DS queries."""
        results = {}
        for query_id in query_ids:
            print(f"Benchmarking Q{query_id}...")
            query_results = []
            for iteration in range(iterations):
                query_sql = self.tpcds.get_query(query_id,
                                                 seed=42 + iteration,
                                                 dialect="duckdb")
                start_time = time.time()
                try:
                    result = self.connection.execute(query_sql).fetchall()
                    execution_time = time.time() - start_time
                    query_results.append({
                        'iteration': iteration,
                        'time': execution_time,
                        'rows': len(result),
                        'status': 'success'
                    })
                except Exception as e:
                    execution_time = time.time() - start_time
                    query_results.append({
                        'iteration': iteration,
                        'time': execution_time,
                        'rows': 0,
                        'status': 'error',
                        'error': str(e)
                    })

            # Calculate statistics over successful runs
            successful_runs = [r for r in query_results if r['status'] == 'success']
            if successful_runs:
                times = [r['time'] for r in successful_runs]
                results[query_id] = {
                    'mean_time': mean(times),
                    'median_time': median(times),
                    'min_time': min(times),
                    'max_time': max(times),
                    'success_rate': len(successful_runs) / len(query_results),
                    'avg_rows': mean([r['rows'] for r in successful_runs]),
                    'runs': query_results
                }
            else:
                results[query_id] = {
                    'mean_time': 0,
                    'median_time': 0,
                    'success_rate': 0,
                    'runs': query_results
                }
        return results

# Usage
benchmark = TPCDSBenchmark(tpcds, conn)

# Test different query categories (use explicit IDs for the a/b variants)
fast_queries = [1, 6, 11, 13, 16, 21]
medium_queries = [3, 7, 19, 27, 42, 52]
complex_queries = ["14a", "23a", "24a", 64, 78]

fast_results = benchmark.benchmark_query_set(fast_queries)
medium_results = benchmark.benchmark_query_set(medium_queries)
complex_results = benchmark.benchmark_query_set(complex_queries)
```
Best Practices¶
Data Generation¶
Use official C tools - TPC-DS requires dsdgen/dsqgen for compliance
Parallel generation - Use multiple processes for large scale factors
Validate completeness - Ensure all 24 tables are generated
Check file formats - Verify pipe-delimited format and encoding
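The completeness and format checks above can be automated. A minimal sketch, assuming dsdgen's default `.dat` naming in a flat output directory:

```python
from pathlib import Path

EXPECTED_TABLES = 24  # 7 fact tables + 17 dimension tables

def validate_generation(output_dir: str) -> list:
    """Return a list of problems: missing tables, or files whose first
    line is not pipe-delimited as dsdgen emits."""
    problems = []
    dat_files = sorted(Path(output_dir).glob("*.dat"))
    if len(dat_files) != EXPECTED_TABLES:
        problems.append(
            f"expected {EXPECTED_TABLES} tables, found {len(dat_files)}")
    for path in dat_files:
        with open(path, encoding="utf-8", errors="replace") as f:
            first = f.readline()
        if first and "|" not in first:
            problems.append(f"{path.name}: first line is not pipe-delimited")
    return problems
```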
Query Execution¶
Start with simple queries - Test Q1, Q6, Q11 first
Use appropriate timeouts - Complex queries can run for hours
Monitor resource usage - Watch memory consumption for large queries
Test query variants - Include both Q14a and Q14b in testing
Performance Testing¶
Warm-up runs - Execute queries multiple times for stable timings
Resource monitoring - Track CPU, memory, and I/O during execution
Incremental complexity - Start with simple queries, progress to complex
Statistical analysis - Use geometric mean for overall performance scores
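The geometric mean recommended above damps the influence of a few very slow queries, which an arithmetic mean would otherwise dominate. A self-contained helper:

```python
import math

def geometric_mean(times):
    """Geometric mean of per-query times, the conventional way to
    summarize a TPC-DS run into a single score."""
    if not times or any(t <= 0 for t in times):
        raise ValueError("times must be non-empty and positive")
    return math.exp(sum(math.log(t) for t in times) / len(times))
```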
Common Issues and Solutions¶
Data Generation Issues¶
Issue: dsdgen/dsqgen not found
```bash
# Ensure TPC-DS C tools are compiled and available
cd _sources/tpc-ds/tools
make
```
Issue: Out of disk space during generation
```python
# Use smaller scale factor or external storage
tpcds = TPCDS(scale_factor=0.1, output_dir="/external/storage")
```
Query Execution Issues¶
Issue: Query timeout on complex queries
```python
# Apply a statement timeout where the engine supports one; the setting
# name varies by database (PostgreSQL syntax shown)
conn.execute("SET statement_timeout = '1800s'")  # 30 minutes
tpcds = TPCDS(scale_factor=0.1)  # Smaller dataset for testing
```
Issue: Out of memory on analytical queries
```python
# Configure memory limits (DuckDB syntax; max_memory is an alias
# for memory_limit, so one setting suffices)
conn.execute("SET memory_limit='8GB'")
```
Issue: Parameter substitution errors
```python
# Use explicit seeds and validate query IDs
query = tpcds.get_query(42, seed=12345)  # Reproducible parameters
```
Streaming Data Generation¶
BenchBox includes patched TPC-DS binaries that support streaming data generation via stdout. This enables efficient workflows that reduce disk I/O and memory usage.
Using the FILTER Flag¶
The -FILTER Y flag outputs generated data to stdout instead of creating files:
```bash
# Generate ship_mode table to stdout (can be piped to compression)
dsdgen -TABLE ship_mode -SCALE 1 -FILTER Y | zstd > ship_mode.dat.zst

# Generate with a fixed seed for reproducibility
dsdgen -TABLE date_dim -SCALE 1 -FILTER Y -RNGSEED 12345 | gzip > date_dim.dat.gz

# Direct loading into a database (example with DuckDB)
dsdgen -TABLE customer -SCALE 1 -FILTER Y | \
  duckdb -c "COPY customer FROM '/dev/stdin' (DELIMITER '|')"
```
Benefits¶
Reduced disk I/O: Data streams directly to compression or database
No intermediate files: Data never lands on disk in uncompressed form
Faster pipelines: Eliminates write-then-read overhead
Parallel compression: Each table can be compressed independently
Streaming in Python¶
```python
import subprocess
import gzip
from pathlib import Path

def generate_table_compressed(table_name: str, scale: int, output_path: Path):
    """Generate a TPC-DS table directly to a compressed file."""
    # Resolve to an absolute path, since dsdgen runs with cwd=tools_dir
    dsdgen = Path("_binaries/tpc-ds/darwin-arm64/dsdgen").resolve()
    tools_dir = Path("_sources/tpc-ds/tools")
    with gzip.open(output_path, "wt") as f:
        proc = subprocess.Popen(
            [str(dsdgen), "-TABLE", table_name, "-SCALE", str(scale),
             "-FILTER", "Y", "-RNGSEED", "1", "-TERMINATE", "N"],
            cwd=tools_dir,
            stdout=subprocess.PIPE,
            text=True,
        )
        for line in proc.stdout:
            f.write(line)
        proc.wait()

# Generate compressed data
generate_table_compressed("ship_mode", 1, Path("ship_mode.dat.gz"))
```
Source Patches¶
The streaming capability requires patches to the TPC-DS source that fix two bugs
in the official TPC-DS tools. See _sources/tpc-ds/PATCHES.md for details.
These patches were originally developed by Greg Rahn in the tpcds-kit repository.
External Resources¶
TPC-DS Specification - Official TPC-DS documentation
TPC-DS Results - Published benchmark results
TPC-DS Tools - Official C tools download
Query Analysis Papers - Academic research using TPC-DS