Join Order Benchmark Implementation¶
Overview¶
The Join Order Benchmark (JOB) is implemented in BenchBox as a test suite for evaluating a query optimizer's join order selection. Based on the paper “How Good Are Query Optimizers, Really?” by Viktor Leis et al. (VLDB 2015), the benchmark uses a complex IMDB-like schema to stress-test database query optimizers.
Key Features¶
Schema Complexity¶
21 interconnected tables modeling a movie database (IMDB-like)
Complex relationships between movies, people, companies, and metadata
Multiple join patterns: star joins, chain joins, and complex multi-table joins
Realistic foreign key relationships creating challenging optimization scenarios
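The table set can be listed programmatically; the following is a minimal sketch that uses only the get_table_names() method shown in the examples later in this page:
from benchbox import JoinOrder

joinorder = JoinOrder(scale_factor=0.001)
tables = joinorder.get_table_names()
print(len(tables))   # expected: 21 tables
print(sorted(tables))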
Query Characteristics¶
13 core queries representing different join optimization challenges
Multi-table joins (3-8 tables per query)
Varying selectivity patterns to test cost-based optimization
Real-world query patterns based on actual IMDB data analysis
Synthetic Data Generation¶
Scalable data generation (scale factor 0.001 to 1.0+)
Realistic data distributions preserving join characteristics
Configurable sizes for testing and production use
CSV format output compatible with all major databases
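For a quick smoke test, the smallest scale factor can be generated and the resulting CSV files inspected on disk. This is a minimal sketch using only methods shown in the examples below; the return value of generate_data() is not relied on here:
from benchbox import JoinOrder

joinorder = JoinOrder(scale_factor=0.001)
joinorder.generate_data()
# One CSV file per table is written to the benchmark's output directory
for table_name in joinorder.get_table_names():
    csv_path = joinorder.output_dir / f"{table_name}.csv"
    print(table_name, csv_path.stat().st_size, "bytes")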
Implementation Architecture¶
Core Components¶
JoinOrderSchema (benchbox/core/joinorder/schema.py)
Complete IMDB schema definition with 21 tables
Foreign key relationships and constraints
Multi-dialect SQL generation (SQLite, PostgreSQL, MySQL, DuckDB)
JoinOrderQueryManager (benchbox/core/joinorder/queries.py)
13 embedded core queries based on the original benchmark
Query complexity categorization (simple, medium, complex)
Join pattern analysis (star, chain, complex)
Support for loading original benchmark query files
JoinOrderGenerator (benchbox/core/joinorder/generator.py)
Synthetic data generation preserving join characteristics
Configurable scale factors
Realistic data distributions for movies, people, companies
CSV output format
JoinOrderBenchmark (benchbox/core/joinorder/benchmark.py)
Main benchmark interface
Data generation and schema management
Query execution and performance measurement
Integration with BenchBox ecosystem
Usage Examples¶
Basic Usage¶
from benchbox import JoinOrder
# Initialize benchmark (uses default path: benchmark_runs/datagen/joinorder_sf01)
joinorder = JoinOrder(scale_factor=0.1)
# Generate data (compressed with zstd by default when using CLI)
data_files = joinorder.generate_data()
# Get schema
schema_sql = joinorder.get_create_tables_sql()
# Get queries
queries = joinorder.get_queries()
query_1a = joinorder.get_query("1a")
# Analyze query complexity
complexity_dist = joinorder.get_queries_by_complexity()
pattern_dist = joinorder.get_queries_by_pattern()
DuckDB Integration¶
import duckdb
from benchbox import JoinOrder
# Initialize and generate data (default path: benchmark_runs/datagen/joinorder_sf001)
joinorder = JoinOrder(scale_factor=0.01)
data_files = joinorder.generate_data()
# Create DuckDB database
conn = duckdb.connect("joinorder_test.duckdb")
# Create schema
for statement in joinorder.get_create_tables_sql().split(';'):
    if statement.strip():
        conn.execute(statement)
# Load data from generated files
for table_name in joinorder.get_table_names():
    # Use the actual output directory path
    csv_path = joinorder.output_dir / f"{table_name}.csv"
    conn.execute(f"""
        INSERT INTO {table_name}
        SELECT * FROM read_csv('{csv_path}', header=false)
    """)
# Run queries and analyze performance
for query_id in joinorder.get_query_ids():
    query_sql = joinorder.get_query(query_id)
    result = conn.execute(query_sql).fetchall()
    print(f"Query {query_id}: {len(result)} rows")
Original Benchmark Query Files¶
# Load queries from original Join Order Benchmark repository
joinorder = JoinOrder(queries_dir="path/to/join-order-benchmark")
# This will load all .sql files from the directory
# providing access to the full 113-query benchmark
Database Schema¶
Main Tables¶
Dimension Tables:
title (2.5M rows @ SF=1.0) - Movies and TV shows
name (4M rows @ SF=1.0) - People (actors, directors, etc.)
company_name (300K rows @ SF=1.0) - Production companies
keyword (120K rows @ SF=1.0) - Movie keywords
char_name (3M rows @ SF=1.0) - Character names
Relationship Tables:
cast_info (35M rows @ SF=1.0) - Person-movie relationships
movie_companies (2.6M rows @ SF=1.0) - Movie-company relationships
movie_info (15M rows @ SF=1.0) - Movie metadata
movie_keyword (5M rows @ SF=1.0) - Movie-keyword relationships
Lookup Tables:
kind_type, company_type, info_type, role_type, etc.
Key Relationships¶
The schema creates a complex web of relationships (a sample join is sketched after this list):
Movies connect to people through cast_info
Movies connect to companies through movie_companies
Movies connect to keywords through movie_keyword
Multiple metadata tables (movie_info, movie_info_idx)
Alternative name tables (aka_name, aka_title)
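To make this web concrete, the following illustrative query (not one of the benchmark's 13 queries) forms a star join around title through cast_info and movie_companies; the column names follow the original IMDB/JOB schema and are assumed to match the synthetic data:
# Illustrative star join around title; not part of the benchmark query set
example_star_join = """
SELECT t.title, n.name, cn.name AS company
FROM title AS t
JOIN cast_info AS ci ON ci.movie_id = t.id
JOIN name AS n ON n.id = ci.person_id
JOIN movie_companies AS mc ON mc.movie_id = t.id
JOIN company_name AS cn ON cn.id = mc.company_id
"""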
Query Analysis¶
Query Complexity Distribution¶
Simple Queries (3-4 tables):
Focus on basic join patterns
Test predicate pushdown
Minimal optimization challenges
Medium Queries (5-6 tables):
Multi-table star joins
Moderate selectivity
Cost-based optimization important
Complex Queries (7+ tables):
Multiple join types
High optimization potential
Stress-test advanced optimizers
Join Pattern Analysis¶
Star Joins:
Central table (usually title or cast_info)
Multiple dimension table joins
Common in movie database queries
Chain Joins:
Sequential table connections
Foreign key following patterns
Test join reordering algorithms
Complex Joins:
Mixed star and chain patterns
Multiple relationship paths
The most demanding optimization scenarios
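Both categorizations are available programmatically. The sketch below assumes get_queries_by_complexity() and get_queries_by_pattern() return mappings from category to query IDs; that return shape is an assumption, not documented API:
from benchbox import JoinOrder

joinorder = JoinOrder(scale_factor=0.001)
# Assumed shape: {"simple": ["1a", ...], "medium": [...], "complex": [...]}
for complexity, query_ids in joinorder.get_queries_by_complexity().items():
    print(complexity, len(query_ids), "queries")
# Assumed shape: {"star": [...], "chain": [...], "complex": [...]}
for pattern, query_ids in joinorder.get_queries_by_pattern().items():
    print(pattern, len(query_ids), "queries")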
Performance Characteristics¶
Scale Factor Impact¶
| Scale Factor | Data Size | title rows | cast_info rows | Query Time (avg) |
|---|---|---|---|---|
| 0.001 | ~700 KB | 2,500 | 35,000 | 1-5 ms |
| 0.01 | ~17 MB | 25,000 | 350,000 | 2-10 ms |
| 0.1 | ~170 MB | 250,000 | 3,500,000 | 10-100 ms |
| 1.0 | ~1.7 GB | 2,500,000 | 35,000,000 | 100 ms-10 s |
Optimization Opportunities¶
The benchmark tests several key optimization patterns (a DuckDB EXPLAIN sketch follows the list):
Join Reordering
Optimal join order selection based on cardinality
Cost model accuracy testing
Bushy vs. left-deep plan comparison
Predicate Pushdown
Filter placement optimization
Selectivity estimation accuracy
Join input size reduction
Index Usage
Foreign key index utilization
Covering index opportunities
Index intersection strategies
Runtime Filters
Bloom filter generation
Hash filter pushdown
Dynamic optimization
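These behaviors can be observed directly on an engine. The sketch below inspects the join order DuckDB chooses for query 1a using EXPLAIN; it assumes the schema and data were already loaded into joinorder_test.duckdb as in the DuckDB integration example above:
import duckdb
from benchbox import JoinOrder

joinorder = JoinOrder(scale_factor=0.01)
conn = duckdb.connect("joinorder_test.duckdb")

query_sql = joinorder.get_query("1a")
# DuckDB's EXPLAIN returns (explain_key, explain_value) rows; the value holds the plan tree
for key, value in conn.execute("EXPLAIN " + query_sql).fetchall():
    print(key)
    print(value)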
Integration with BenchBox¶
Ecosystem Compatibility¶
The JOB implementation follows BenchBox patterns:
Compatible with existing analysis tools
SQL dialect translation via SQLGlot (see the sketch after this list)
Consistent API with TPC-H, TPC-DS, etc.
Pluggable data generation
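For example, a query's SQL text can be transpiled with SQLGlot directly. Whether BenchBox exposes a dialect option on its query API is not assumed here, and treating the embedded query text as Postgres-compatible SQL is an assumption:
import sqlglot
from benchbox import JoinOrder

joinorder = JoinOrder(scale_factor=0.001)
query_sql = joinorder.get_query("1a")
# Transpile the (assumed Postgres-compatible) query text into DuckDB syntax
print(sqlglot.transpile(query_sql, read="postgres", write="duckdb")[0])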
Complementary Benchmarks¶
Join Order Benchmark complements other BenchBox benchmarks:
TPC-H: Analytical query performance
TPC-DS: Complex OLAP workloads
Read Primitives: Individual operation testing
JoinOrder: Join order optimization focus
Validation and Testing¶
Test Coverage¶
✅ Schema creation across dialects
✅ Data generation at multiple scale factors
✅ Query execution validation
✅ DuckDB integration example
✅ Performance measurement
✅ Error handling and edge cases
Performance Validation¶
Tested against DuckDB with scale factor 0.01:
All 13 queries execute successfully
Average query time: 2.1 ms
Data generation: 904 ms for 17 MB
Schema creation: <50 ms
Future Enhancements¶
Additional Queries¶
Load full 113-query set from original benchmark
Add query variants with different selectivity
Include UPDATE/DELETE queries for optimizer testing
Features¶
Query plan analysis integration
Cost model validation tools
Cardinality estimation accuracy testing
Index recommendation based on workload
Extended Schema¶
Temporal tables for time-based optimization
Partitioning support for large-scale testing
Materialized views for complex query optimization
References¶
Paper: “How Good Are Query Optimizers, Really?” - VLDB 2015
Authors: Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, Thomas Neumann
Original Repository: https://github.com/gregrahn/join-order-benchmark
IMDB Dataset: http://www.imdb.com/interfaces
File Structure¶
benchbox/core/joinorder/
├── __init__.py # Module exports
├── benchmark.py # Main JoinOrderBenchmark class
├── generator.py # Synthetic data generation
├── queries.py # Query management and categorization
└── schema.py # IMDB schema definition
benchbox/
└── joinorder.py # Top-level JoinOrder export
examples/
└── duckdb_joinorder.py # Complete DuckDB example
The Join Order Benchmark implementation extends BenchBox's suite of database benchmarks with a focused test of join order optimization, a critical component of modern query optimizer performance.