Star Schema Benchmark (SSB)¶

Overview¶

The Star Schema Benchmark (SSB) is a simplified variant of TPC-H specifically designed for testing OLAP (Online Analytical Processing) systems and data warehouses. Created by O’Neil et al., SSB transforms the normalized TPC-H schema into a denormalized star schema that better represents typical data warehouse designs and enables more focused testing of analytical query performance.

SSB is particularly valuable for testing columnar databases, in-memory analytics systems, and OLAP engines because it eliminates complex join patterns in favor of simple star schema joins that are more representative of real-world data warehouse queries.

Key Features¶

Star schema design - Single fact table with dimension tables
13 analytical queries organized into 4 logical flights
Simplified query patterns - Focus on aggregation and filtering performance
OLAP-oriented - Tests typical data warehouse query patterns
Parameterized queries - Configurable selectivity and date ranges
Performance-focused - Designed to stress aggregation and scan performance
Multiple scale factors - Scalable from small test datasets to large benchmarks

Schema Description¶

The SSB schema consists of a single fact table (LINEORDER) surrounded by four dimension tables in a classic star schema pattern. This design eliminates the complex join patterns found in TPC-H’s normalized schema.

Fact Table¶

Table	Purpose	Approximate Rows (SF 1)
LINEORDER	Sales transactions (fact table)	~6,000,000

Dimension Tables¶

Table	Purpose	Approximate Rows (SF 1)
DATE	Calendar information	~2,556 (7 years)
CUSTOMER	Customer master data	~30,000
SUPPLIER	Supplier information	~2,000
PART	Product catalog	~200,000

Schema Transformation from TPC-H¶

SSB transforms TPC-H’s normalized schema by:

Combining ORDERS and LINEITEM → LINEORDER fact table
Denormalizing CUSTOMER → Includes region and nation information
Denormalizing SUPPLIER → Includes region and nation information
Flattening PART hierarchy → Simplified part categorization
Removing PARTSUPP → Part-supplier relationships embedded in fact table
Simplified DATE → Calendar dimension without complex hierarchies

Schema Relationships¶

        erDiagram
    DATE ||--o{ LINEORDER : lo_orderdate
    CUSTOMER ||--o{ LINEORDER : lo_custkey
    SUPPLIER ||--o{ LINEORDER : lo_suppkey
    PART ||--o{ LINEORDER : lo_partkey
    
    DATE {
        int d_datekey PK
        string d_date
        string d_dayofweek
        string d_month
        int d_year
        int d_yearmonthnum
        string d_yearmonth
        int d_weeknuminyear
        string d_sellingseason
        int d_holidayfl
        int d_weekdayfl
    }
    
    CUSTOMER {
        int c_custkey PK
        string c_name
        string c_address
        string c_city
        string c_nation
        string c_region
        string c_phone
        string c_mktsegment
    }
    
    SUPPLIER {
        int s_suppkey PK
        string s_name
        string s_address
        string s_city
        string s_nation
        string s_region
        string s_phone
    }
    
    PART {
        int p_partkey PK
        string p_name
        string p_mfgr
        string p_category
        string p_brand1
        string p_color
        string p_type
        int p_size
        string p_container
    }
    
    LINEORDER {
        int lo_orderkey
        int lo_linenumber
        int lo_custkey FK
        int lo_partkey FK
        int lo_suppkey FK
        int lo_orderdate FK
        string lo_orderpriority
        int lo_shippriority
        int lo_quantity
        int lo_extendedprice
        int lo_ordtotalprice
        int lo_discount
        int lo_revenue
        int lo_supplycost
        int lo_tax
        int lo_commitdate
        string lo_shipmode
    }

Query Characteristics¶

The 13 SSB queries are organized into 4 flights that test different aspects of analytical query performance:

Flight 1: Simple Aggregation Queries (Q1.1 - Q1.3)¶

Purpose: Test basic aggregation and filtering performance on the fact table.

Query	Focus	Key Features
Q1.1	Year-based aggregation	Simple date filter, discount range, quantity filter
Q1.2	Month-based aggregation	Month granularity, quantity range filter
Q1.3	Week-based aggregation	Week granularity, refined quantity range

Example Query (Q1.1):

SELECT sum(lo_extendedprice*lo_discount) as revenue
FROM lineorder, date
WHERE lo_orderdate = d_datekey
  AND d_year = 1993
  AND lo_discount between 1 and 3
  AND lo_quantity < 25;

Flight 2: Customer-Supplier Analysis (Q2.1 - Q2.3)¶

Purpose: Test queries with dimension table joins and drill-down analysis.

Query	Focus	Key Features
Q2.1	Brand category analysis	Part category filter, supplier region
Q2.2	Brand range analysis	Brand range filter, supplier region
Q2.3	Specific brand analysis	Single brand focus, supplier region

Example Query (Q2.1):

SELECT sum(lo_revenue), d_year, p_brand1
FROM lineorder, date, part, supplier
WHERE lo_orderdate = d_datekey
  AND lo_partkey = p_partkey
  AND lo_suppkey = s_suppkey
  AND p_category = 'MFGR#12'
  AND s_region = 'AMERICA'
GROUP BY d_year, p_brand1
ORDER BY d_year, p_brand1;

Flight 3: Customer Behavior Analysis (Q3.1 - Q3.4)¶

Purpose: Test complex multi-dimension analysis with customer geography.

Query	Focus	Key Features
Q3.1	Customer nation analysis	Customer and supplier nations
Q3.2	City-level analysis	Customer city, supplier city
Q3.3	Specific city analysis	Single customer city focus
Q3.4	Regional analysis	Customer and supplier regions

Example Query (Q3.1):

SELECT c_nation, s_nation, d_year, sum(lo_revenue) as revenue
FROM customer, lineorder, supplier, date
WHERE lo_custkey = c_custkey
  AND lo_suppkey = s_suppkey
  AND lo_orderdate = d_datekey
  AND c_region = 'ASIA'
  AND s_region = 'ASIA'
  AND d_year >= 1992 AND d_year <= 1997
GROUP BY c_nation, s_nation, d_year
ORDER BY d_year asc, revenue desc;

Flight 4: Profit Analysis (Q4.1 - Q4.3)¶

Purpose: Test complex aggregation with profit calculations and temporal analysis.

Query	Focus	Key Features
Q4.1	Regional profit analysis	Customer and supplier regions, profit calculation
Q4.2	Nation-specific profit	Single customer nation, profit trends
Q4.3	City-specific profit	Customer city, part category, profit analysis

Example Query (Q4.1):

SELECT d_year, c_nation, sum(lo_revenue - lo_supplycost) as profit
FROM date, customer, supplier, part, lineorder
WHERE lo_custkey = c_custkey
  AND lo_suppkey = s_suppkey
  AND lo_partkey = p_partkey
  AND lo_orderdate = d_datekey
  AND c_region = 'AMERICA'
  AND s_region = 'AMERICA'
  AND (p_mfgr = 'MFGR#1' OR p_mfgr = 'MFGR#2')
GROUP BY d_year, c_nation
ORDER BY d_year, c_nation;

Usage Examples¶

Basic Query Generation¶

from benchbox import SSB

# Initialize SSB benchmark
ssb = SSB(scale_factor=1.0, output_dir="ssb_data")

# Generate data
data_files = ssb.generate_data()

# Get all queries
queries = ssb.get_queries()
print(f"Generated {len(queries)} SSB queries")

# Get specific query with parameters
query_1_1 = ssb.get_query("Q1.1", params={
    'year': 1993,
    'discount_min': 1,
    'discount_max': 3,
    'quantity': 25
})
print(query_1_1)

Data Generation and Loading¶

# Generate SSB data at different scales
ssb_small = SSB(scale_factor=0.1, output_dir="ssb_small")
data_files = ssb_small.generate_data()

# Check generated tables
available_tables = ssb_small.get_available_tables()
print(f"Available tables: {available_tables}")

# Get schema information
schema = ssb_small.get_schema()
for table in schema:
    print(f"Table {table['name']}: {len(table['columns'])} columns")

DuckDB Integration Example¶

import duckdb
from benchbox import SSB

# Initialize and generate data
ssb = SSB(scale_factor=0.01, output_dir="ssb_tiny")
data_files = ssb.generate_data()

# Create DuckDB connection and schema
conn = duckdb.connect("ssb.duckdb")
schema_sql = ssb.get_create_tables_sql()
conn.execute(schema_sql)

# Load SSB tables
table_mappings = {
    'date': 'date.csv',
    'customer': 'customer.csv', 
    'supplier': 'supplier.csv',
    'part': 'part.csv',
    'lineorder': 'lineorder.csv'
}

for table_name, file_name in table_mappings.items():
    file_path = ssb.output_dir / file_name
    if file_path.exists():
        conn.execute(f"""
            INSERT INTO {table_name}
            SELECT * FROM read_csv('{file_path}', 
                                  header=true,
                                  auto_detect=true)
        """)
        print(f"Loaded {table_name}")

# Run SSB query flights
flight_1_queries = ["Q1.1", "Q1.2", "Q1.3"]
for query_id in flight_1_queries:
    query_sql = ssb.get_query(query_id, params={
        'year': 1993,
        'year_month': 199401,
        'week': 6,
        'discount_min': 1,
        'discount_max': 3,
        'quantity': 25,
        'quantity_min': 26,
        'quantity_max': 35
    })
    
    result = conn.execute(query_sql).fetchall()
    print(f"{query_id}: {len(result)} rows, Revenue: {result[0][0] if result else 0}")

Performance Testing Framework¶

import time
from typing import Dict, List

class SSBPerformanceTester:
    def __init__(self, ssb: SSB, connection):
        self.ssb = ssb
        self.connection = connection
        
    def run_flight_benchmark(self, flight_number: int, iterations: int = 3) -> Dict:
        """Benchmark a specific SSB flight."""
        flight_queries = {
            1: ["Q1.1", "Q1.2", "Q1.3"],
            2: ["Q2.1", "Q2.2", "Q2.3"],
            3: ["Q3.1", "Q3.2", "Q3.3", "Q3.4"],
            4: ["Q4.1", "Q4.2", "Q4.3"]
        }
        
        if flight_number not in flight_queries:
            raise ValueError(f"Invalid flight number: {flight_number}")
            
        query_ids = flight_queries[flight_number]
        results = {}
        
        for query_id in query_ids:
            print(f"Benchmarking {query_id}...")
            
            times = []
            for iteration in range(iterations):
                # Get parameterized query
                query_sql = self.ssb.get_query(query_id, seed=42)
                
                start_time = time.time()
                result = self.connection.execute(query_sql).fetchall()
                execution_time = time.time() - start_time
                
                times.append(execution_time)
            
            results[query_id] = {
                'avg_time': sum(times) / len(times),
                'min_time': min(times),
                'max_time': max(times),
                'rows_returned': len(result),
                'times': times
            }
        
        return results
    
    def _get_query_with_custom_params(self, query_id: str) -> str:
        """Example of getting query with custom parameters."""
        # Example custom parameters for each query type
        custom_params = {
            'year': 1993,
            'year_month': 199401,
            'week': 6,
            'discount_min': 1,
            'discount_max': 3,
            'quantity': 25,
            'quantity_min': 26,
            'quantity_max': 35,
            'category': 'MFGR#12',
            'region': 'AMERICA',
            'brand_min': 'MFGR#2221',
            'brand_max': 'MFGR#2228',
            'brand': 'MFGR#2221',
            'nation1': 'UNITED STATES',
            'nation2': 'JAPAN',
            'city': 'UNITED KI1'
        }
        
        return self.ssb.get_query(query_id, params=custom_params)
    
    def run_complete_benchmark(self) -> Dict:
        """Run all SSB flights and return systematic results."""
        complete_results = {}
        
        for flight_num in range(1, 5):
            print(f"Running Flight {flight_num}...")
            flight_results = self.run_flight_benchmark(flight_num)
            complete_results[f"Flight_{flight_num}"] = flight_results
        
        # Calculate summary statistics
        all_times = []
        for flight_data in complete_results.values():
            for query_data in flight_data.values():
                all_times.extend(query_data['times'])
        
        complete_results['summary'] = {
            'total_queries': sum(len(flight) for flight in complete_results.values() if isinstance(flight, dict)),
            'total_avg_time': sum(all_times) / len(all_times),
            'total_min_time': min(all_times),
            'total_max_time': max(all_times)
        }
        
        return complete_results

# Usage
performance_tester = SSBPerformanceTester(ssb, conn)

# Test individual flights
flight_1_results = performance_tester.run_flight_benchmark(1)
print("Flight 1 Results:")
for query_id, stats in flight_1_results.items():
    print(f"  {query_id}: {stats['avg_time']:.3f}s avg, {stats['rows_returned']} rows")

# Run complete benchmark
complete_results = performance_tester.run_complete_benchmark()

Columnar Database Optimization¶

# SSB is particularly well-suited for columnar databases
# Example optimization for columnar systems

def optimize_for_columnar(ssb: SSB, connection):
    """Optimize SSB for columnar database performance."""
    
    # Create column-store configured tables
    optimization_sql = """
    -- Optimize fact table for columnar access
    CREATE TABLE lineorder_configured AS
    SELECT 
        lo_orderdate,
        lo_custkey, 
        lo_partkey,
        lo_suppkey,
        lo_quantity,
        lo_extendedprice,
        lo_discount,
        lo_revenue,
        lo_supplycost,
        lo_tax
    FROM lineorder
    ORDER BY lo_orderdate, lo_custkey;
    
    -- Create projection indices for common query patterns
    CREATE INDEX idx_lineorder_date ON lineorder_configured(lo_orderdate);
    CREATE INDEX idx_lineorder_cust ON lineorder_configured(lo_custkey);
    CREATE INDEX idx_lineorder_part ON lineorder_configured(lo_partkey);
    CREATE INDEX idx_lineorder_supp ON lineorder_configured(lo_suppkey);
    
    -- Optimize dimension tables
    CREATE INDEX idx_date_year ON date(d_year);
    CREATE INDEX idx_customer_region ON customer(c_region);
    CREATE INDEX idx_supplier_region ON supplier(s_region);
    CREATE INDEX idx_part_category ON part(p_category);
    """
    
    connection.execute(optimization_sql)
    print("Applied columnar optimizations")

# Apply optimizations
optimize_for_columnar(ssb, conn)

Performance Characteristics¶

Query Performance Patterns¶

Flight 1 (Simple Aggregation):

Fastest queries - Single table scans with aggregation
Performance depends on: Fact table scan speed, aggregation efficiency
Typical runtime: Milliseconds to seconds

Flight 2 (Basic Star Joins):

Medium complexity - 2-3 table joins with aggregation
Performance depends on: Join algorithm efficiency, dimension table size
Typical runtime: Seconds

Flight 3 (Complex Star Joins):

Higher complexity - 4 table joins with grouping
Performance depends on: Join order optimization, hash table construction
Typical runtime: Seconds to tens of seconds

Flight 4 (Analytical Queries):

Most complex - Multi-table joins with calculations
Performance depends on: Query optimization, parallel execution
Typical runtime: Seconds to minutes

System Optimization Opportunities¶

System Type	Optimization Focus	SSB Flight Emphasis
Columnar Stores	Column scanning, compression	Flight 1 (aggregation)
In-Memory Systems	Join algorithms, parallel execution	Flight 3 (multi-joins)
MPP Systems	Data distribution, parallel aggregation	Flight 4 (complex analytics)
OLAP Engines	Cube pre-computation, drill-down	All flights

Configuration Options¶

Scale Factor Guidelines¶

Scale Factor	Fact Table Rows	Data Size	Use Case
0.01	~60,000	~5 MB	Development, unit testing
0.1	~600,000	~50 MB	Integration testing
1	~6,000,000	~500 MB	Standard benchmark
10	~60,000,000	~5 GB	Performance testing
100	~600,000,000	~50 GB	Large-scale testing

Advanced-level Configuration¶

ssb = SSB(
    scale_factor=1.0,
    output_dir="ssb_data",
    # SSB-specific configuration
    date_range_years=7,      # Default date range
    enable_compression=True,  # Compress output files
    partition_fact_table=True, # Partition LINEORDER by date
    generate_indices=True    # Generate recommended indices
)

# Access query parameterization
query_params = {
    'year': 1995,           # Specific year for temporal queries
    'region': 'ASIA',       # Geographic focus
    'category': 'MFGR#13',  # Product category
    'discount_range': (2, 4) # Discount selectivity
}

query = ssb.get_query("Q2.1", params=query_params)

Integration Examples¶

Apache Spark Integration¶

from pyspark.sql import SparkSession
from benchbox import SSB

# Initialize Spark and SSB
spark = SparkSession.builder.appName("SSB-Benchmark").getOrCreate()
ssb = SSB(scale_factor=10, output_dir="/data/ssb_sf10")

# Generate and load data
data_files = ssb.generate_data()

# Create Spark tables
tables = ['date', 'customer', 'supplier', 'part', 'lineorder']
for table_name in tables:
    file_path = f"/data/ssb_sf10/{table_name}.csv"
    
    df = spark.read.csv(file_path, header=True, inferSchema=True)
    df.createOrReplaceTempView(table_name)
    
    # Cache dimension tables for better join performance
    if table_name != 'lineorder':
        df.cache()

# Run SSB queries with Spark SQL
flight_2_queries = ["Q2.1", "Q2.2", "Q2.3"]
for query_id in flight_2_queries:
    query_sql = ssb.get_query(query_id, params={
        'category': 'MFGR#12',
        'region': 'AMERICA',
        'brand_min': 'MFGR#2221',
        'brand_max': 'MFGR#2228',
        'brand': 'MFGR#2221'
    })
    
    result_df = spark.sql(query_sql)
    result_df.show(10)
    print(f"{query_id} execution plan:")
    result_df.explain(True)

ClickHouse Integration¶

import clickhouse_connect
from benchbox import SSB

# Initialize ClickHouse client and SSB
client = clickhouse_connect.get_client(host='localhost', port=8123)
ssb = SSB(scale_factor=1.0, output_dir="ssb_data")

# Generate data
data_files = ssb.generate_data()

# Create ClickHouse tables configured for analytics
create_tables_sql = """
-- Optimized fact table for ClickHouse
CREATE TABLE lineorder (
    lo_orderkey UInt32,
    lo_linenumber UInt8,
    lo_custkey UInt32,
    lo_partkey UInt32,
    lo_suppkey UInt32,
    lo_orderdate Date,
    lo_orderpriority String,
    lo_shippriority UInt8,
    lo_quantity UInt8,
    lo_extendedprice UInt32,
    lo_ordtotalprice UInt32,
    lo_discount UInt8,
    lo_revenue UInt32,
    lo_supplycost UInt32,
    lo_tax UInt8,
    lo_commitdate Date,
    lo_shipmode String
) ENGINE = MergeTree()
PARTITION BY toYear(lo_orderdate)
ORDER BY (lo_orderdate, lo_custkey, lo_partkey);

-- Dimension tables
CREATE TABLE customer (
    c_custkey UInt32,
    c_name String,
    c_address String,
    c_city String,
    c_nation String,
    c_region String,
    c_phone String,
    c_mktsegment String
) ENGINE = MergeTree()
ORDER BY c_custkey;

-- Add other dimension tables...
"""

client.execute(create_tables_sql)

# Load data using ClickHouse CSV import
for table_name in ['customer', 'supplier', 'part', 'date', 'lineorder']:
    file_path = ssb.output_dir / f"{table_name}.csv"
    
    # Use ClickHouse's configured CSV import
    with open(file_path, 'rb') as f:
        client.insert_file(table_name, f, fmt='CSV')

# Run configured SSB queries
flight_1_results = {}
for query_id in ["Q1.1", "Q1.2", "Q1.3"]:
    query_sql = ssb.get_query(query_id, params={
        'year': 1993,
        'year_month': 199401,
        'week': 6,
        'discount_min': 1,
        'discount_max': 3,
        'quantity': 25,
        'quantity_min': 26,
        'quantity_max': 35
    })
    
    result = client.query(query_sql)
    flight_1_results[query_id] = result.result_rows
    print(f"{query_id}: {len(result.result_rows)} rows")

Best Practices¶

Data Generation¶

Use appropriate scale factors - Start small for development, scale up for performance testing
Consider data distribution - SSB data follows specific distribution patterns
Optimize for target system - Generate data formats suited to your database
Partition fact tables - Consider date-based partitioning for large scale factors

Query Optimization¶

Start with Flight 1 - Test basic aggregation performance first
Optimize join orders - Ensure smallest dimension tables are accessed first
Use column indices - Create indices on commonly filtered columns
Test with different parameters - Vary selectivity to understand performance characteristics

Performance Testing¶

Warm-up queries - Run queries multiple times to account for caching
Monitor resource usage - Track CPU, memory, and I/O during execution
Test incremental complexity - Progress through flights methodically
Compare with baselines - Establish performance baselines for regression testing

Common Issues and Solutions¶

Performance Issues¶

Issue: Slow fact table scans in Flight 1

-- Solution: Ensure proper indexing and partitioning
CREATE INDEX idx_lineorder_orderdate ON lineorder(lo_orderdate);
CREATE INDEX idx_lineorder_discount ON lineorder(lo_discount);
CREATE INDEX idx_lineorder_quantity ON lineorder(lo_quantity);

Issue: Inefficient joins in Flight 3

-- Solution: Optimize join order (smallest tables first)
SELECT c_nation, s_nation, d_year, sum(lo_revenue) as revenue
FROM date, customer, supplier, lineorder  -- Small to large table order
WHERE lo_orderdate = d_datekey
  AND lo_custkey = c_custkey
  AND lo_suppkey = s_suppkey
  -- Other conditions...

Data Loading Issues¶

Issue: Memory issues with large scale factors

# Solution: Use streaming data generation
ssb = SSB(
    scale_factor=100,
    output_dir="/data/ssb_large",
    stream_generation=True,  # Generate in chunks
    chunk_size=1000000       # 1M rows per chunk
)

Issue: Incorrect query results

# Solution: Validate data generation and parameterization
validation_queries = [
    "SELECT COUNT(*) FROM lineorder",
    "SELECT MIN(d_year), MAX(d_year) FROM date",
    "SELECT COUNT(DISTINCT c_region) FROM customer"
]

for vq in validation_queries:
    result = conn.execute(vq).fetchone()
    print(f"Validation: {result}")

External Resources¶

Original SSB Paper - “Star Schema Benchmark” by O’Neil et al.
SSB Query Details - Complete query specifications
OLAP Performance Testing - Star schema design principles
Data Warehouse Benchmarking - Industry benchmarking practices