Join Order Benchmark API¶
Complete Python API reference for the Join Order Benchmark (JOB).
Overview¶
The Join Order Benchmark tests a query optimizer's join order selection using a complex schema of 21 interconnected tables based on the Internet Movie Database (IMDB). Its queries deliberately stress the optimizer with join-ordering challenges that expose weaknesses in cardinality estimation and cost models.
Key Features:
Complex Schema: 21 interconnected tables modeling a movie database
Join Optimization Focus: Queries designed to test join order selection
13 Core Queries: Representative join patterns and complexity levels
Synthetic Data Generation: Realistic data preserving join characteristics
Multi-Dialect Support: Compatible with all major SQL databases
Scalable: From development (SF 0.001) to production (SF 1.0+)
Reference: Viktor Leis et al. “How Good Are Query Optimizers, Really?” (VLDB 2015)
Quick Start¶
A simple example to get started with the Join Order Benchmark:
from benchbox.joinorder import JoinOrder
from benchbox.platforms.duckdb import DuckDBAdapter
# Create benchmark with small scale for testing
benchmark = JoinOrder(scale_factor=0.01)
# Generate data
data_files = benchmark.generate_data()
# Run with DuckDB
adapter = DuckDBAdapter()
results = adapter.run_benchmark(benchmark)
print(f"Queries executed: {results.total_queries}")
print(f"Average time: {results.average_query_time:.3f}s")
API Reference¶
JoinOrder Class¶
- class JoinOrder(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Bases: BaseBenchmark
Join Order Benchmark implementation.
This class provides an implementation of the Join Order Benchmark, including data generation and access to complex join queries for cardinality estimation and join order optimization testing.
Reference: Viktor Leis et al. “How Good Are Query Optimizers, Really?”
- __init__(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Initialize a Join Order Benchmark instance.
- Parameters:
scale_factor (float) – Scale factor for the benchmark (1.0 ≈ 1.7GB)
output_dir (str | Path | None) – Directory to output generated data files
**kwargs (Any) – Additional implementation-specific options
- generate_data()[source]¶
Generate Join Order Benchmark data.
- Returns:
A list of paths to the generated data files
- Return type:
list[Path]
- get_queries()[source]¶
Get all Join Order Benchmark queries.
- Returns:
A dictionary mapping query IDs to query strings
- Return type:
dict[str, str]
- get_query(query_id, *, params=None)[source]¶
Get a specific Join Order Benchmark query.
- Parameters:
query_id (int | str) – The ID of the query to retrieve
params (dict[str, Any] | None) – Optional parameters to customize the query
- Returns:
The query string
- Raises:
ValueError – If the query_id is invalid
- Return type:
str
- get_schema(dialect='sqlite')[source]¶
Get the Join Order Benchmark schema DDL.
- Parameters:
dialect (str) – Target SQL dialect
- Returns:
DDL statements for creating all tables
- Return type:
str
- get_create_tables_sql(dialect='standard', tuning_config=None)[source]¶
Get SQL to create all Join Order Benchmark tables.
- Parameters:
dialect (str) – SQL dialect to use
tuning_config (Any) – Unified tuning configuration for constraint settings
- Returns:
SQL script for creating all tables
- Return type:
str
- apply_verbosity(settings)¶
Apply verbosity settings to the mixin consumer.
- property benchmark_name: str¶
Get the human-readable benchmark name.
- create_enhanced_benchmark_result(platform, query_results, execution_metadata=None, phases=None, resource_utilization=None, performance_characteristics=None, **kwargs)¶
Create a BenchmarkResults object with standardized fields.
This centralizes the logic for creating benchmark results that was previously duplicated across platform adapters and CLI orchestrator.
- Parameters:
platform (str) – Platform name (e.g., “DuckDB”, “ClickHouse”)
query_results (list[dict[str, Any]]) – List of query execution results
execution_metadata (dict[str, Any] | None) – Optional execution metadata
phases (dict[str, dict[str, Any]] | None) – Optional phase tracking information
resource_utilization (dict[str, Any] | None) – Optional resource usage metrics
performance_characteristics (dict[str, Any] | None) – Optional performance analysis
**kwargs (Any) – Additional fields to override defaults
- Returns:
Fully configured BenchmarkResults object
- Return type:
BenchmarkResults
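Example (a hedged sketch, not the canonical usage: it assumes an open DatabaseConnection named conn and reuses the per-query dictionaries documented for run_query()):
# Assemble per-query results into a standardized BenchmarkResults object.
query_results = [
    benchmark.run_query(query_id, conn) for query_id in ["1a", "2a"]
]
results = benchmark.create_enhanced_benchmark_result(
    platform="DuckDB",
    query_results=query_results,
    execution_metadata={"scale_factor": benchmark.scale_factor},
)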
- format_results(benchmark_result)¶
Format benchmark results for display.
- Parameters:
benchmark_result (dict[str, Any]) – Result dictionary from run_benchmark()
- Returns:
Formatted string representation of the results
- Return type:
str
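Example (hedged; assumes an open DatabaseConnection named conn):
# Run a couple of queries, then print a human-readable summary.
raw = benchmark.run_benchmark(conn, query_ids=["1a", "1b"])
print(benchmark.format_results(raw))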
- get_data_source_benchmark()¶
Return the canonical source benchmark when data is shared.
Benchmarks that reuse data generated by another benchmark (for example,
Primitives reusing TPC-H datasets) should override this method and return the lower-case identifier of the source benchmark. Benchmarks that produce their own data should return None (default).
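Example (purely illustrative; the subclass shown is hypothetical and only demonstrates the override mechanics):
# Hypothetical benchmark that reuses TPC-H data instead of generating its own.
class TPCHBackedBenchmark(JoinOrder):
    def get_data_source_benchmark(self):
        return "tpch"  # lower-case identifier of the source benchmark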
- log_debug_info(context='Debug')¶
Log comprehensive debug information including version details.
- log_error_with_debug_info(error, context='Error')¶
Log an error with comprehensive debug information.
- log_operation_complete(operation, duration=None, details='')¶
- log_operation_start(operation, details='')¶
- log_verbose(message)¶
- log_version_warning()¶
Log version consistency warnings if any exist.
- log_very_verbose(message)¶
- property logger: Logger¶
Return the logger configured for the verbosity mixin consumer.
- quiet: bool = False¶
- run_benchmark(connection, query_ids=None, fetch_results=False, setup_database=True)¶
Run the complete benchmark suite.
- Parameters:
connection (DatabaseConnection) – Database connection to execute queries on
query_ids (list[int | str] | None) – Optional list of specific query IDs to run (defaults to all)
fetch_results (bool) – Whether to fetch and return query results
setup_database (bool) – Whether to set up the database first
- Returns:
A dictionary containing:
benchmark_name: Name of the benchmark
total_execution_time: Total time for all queries
total_queries: Number of queries executed
successful_queries: Number of queries that succeeded
failed_queries: Number of queries that failed
query_results: List of individual query results
setup_time: Time taken for database setup (if performed)
- Return type:
dict[str, Any]
- Raises:
Exception – If benchmark execution fails
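Example (hedged; conn is assumed to be an open DatabaseConnection, e.g. created by a platform adapter):
# Run a subset of queries directly against an existing connection.
result = benchmark.run_benchmark(
    conn,
    query_ids=["1a", "2a", "3a"],
    fetch_results=False,
    setup_database=True,
)
print(f"Succeeded: {result['successful_queries']}/{result['total_queries']}")
print(f"Total time: {result['total_execution_time']:.2f}s")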
- run_query(query_id, connection, params=None, fetch_results=False)¶
Execute a single query and return its timing and results.
- Parameters:
query_id (int | str) – ID of the query to execute
connection (DatabaseConnection) – Database connection to execute query on
params (dict[str, Any] | None) – Optional parameters for query customization
fetch_results (bool) – Whether to fetch and return query results
- Returns:
A dictionary containing:
query_id: Executed query ID
execution_time: Time taken to execute the query in seconds
query_text: Executed query text
results: Query results if fetch_results=True, otherwise None
row_count: Number of rows returned (if results fetched)
- Return type:
dict[str, Any]
- Raises:
ValueError – If query_id is invalid
Exception – If query execution fails
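Example (hedged; assumes the database has already been set up on conn):
single = benchmark.run_query("1a", conn, fetch_results=True)
print(f"Query {single['query_id']}: {single['execution_time']:.3f}s, "
      f"{single['row_count']} rows")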
- run_with_platform(platform_adapter, **run_config)¶
Run complete benchmark using platform-specific optimizations.
This method provides a unified interface for running benchmarks using database platform adapters that handle connection management, data loading optimizations, and query execution.
This is the standard method that all benchmarks should support for integration with the CLI and other orchestration tools.
- Parameters:
platform_adapter – Platform adapter instance (e.g., DuckDBAdapter)
**run_config – Configuration options:
- categories: List of query categories to run (if benchmark supports)
- query_subset: List of specific query IDs to run
- connection: Connection configuration
- benchmark_type: Type hint for optimizations (‘olap’, ‘oltp’, etc.)
- Returns:
BenchmarkResults object with execution results
Example
from benchbox.platforms import DuckDBAdapter
benchmark = SomeBenchmark(scale_factor=0.1)
adapter = DuckDBAdapter()
results = benchmark.run_with_platform(adapter)
- setup_database(connection)¶
Set up database with schema and data.
Creates necessary database schema and loads benchmark data into the database.
- Parameters:
connection (DatabaseConnection) – Database connection to set up
- Raises:
ValueError – If data generation fails
Exception – If database setup fails
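Example (hedged sketch; conn is an open DatabaseConnection):
# Create the schema and load data once, then run queries without repeating setup.
benchmark.setup_database(conn)
result = benchmark.run_benchmark(conn, setup_database=False)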
- translate_query(query_id, dialect)¶
Translate a query to a specific SQL dialect.
- Parameters:
query_id (int | str) – The ID of the query to translate
dialect (str) – The target SQL dialect
- Returns:
The translated query string
- Raises:
ValueError – If the query_id is invalid
ImportError – If sqlglot is not installed
ValueError – If the dialect is not supported
- Return type:
str
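Example (hedged; requires sqlglot, and the listed dialects are assumed to be supported):
# Translate query 1a into other dialects; an unsupported dialect raises ValueError.
for dialect in ["duckdb", "postgres"]:
    sql = benchmark.translate_query("1a", dialect)
    print(f"--- {dialect} ---")
    print(sql.splitlines()[0])  # first line of the translated query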
- verbose: bool = False¶
- verbose_enabled: bool = False¶
- verbose_level: int = 0¶
- property verbosity_settings: VerbositySettings¶
Return the current verbosity settings.
- very_verbose: bool = False¶
Main interface for the Join Order Benchmark with complete functionality.
Constructor:
JoinOrder(
scale_factor: float = 1.0,
output_dir: Optional[Union[str, Path]] = None,
queries_dir: Optional[str] = None,
**kwargs
)
Parameters:
scale_factor (float): Data size multiplier (1.0 ≈ 1.7GB, default: 1.0)
output_dir (str | Path | None): Directory for generated data files
queries_dir (str | None): Directory with Join Order Benchmark query files (loads full 113-query set if provided)
verbose (bool | int): Enable verbose output
parallel (int): Number of parallel generation threads
force_regenerate (bool): Force data regeneration even if files exist
Attributes:
scale_factor (float): Current scale factor
output_dir (Path): Data file output directory
queries_dir (str | None): External queries directory
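Example (a hedged construction sketch combining the options above; the output path is a placeholder, and the size comment follows the estimates used later on this page):
from pathlib import Path
from benchbox.joinorder import JoinOrder

benchmark = JoinOrder(
    scale_factor=0.01,                       # ~17 MB of generated data
    output_dir=Path("joinorder_data/sf001"),
    verbose=True,
    force_regenerate=False,
)
print(benchmark.scale_factor, benchmark.output_dir)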
Data Generation Methods¶
- generate_data() → list[Path]¶
Generate Join Order Benchmark dataset with 21 tables.
Returns: list[Path] - Paths to generated CSV files
Example:
from benchbox.joinorder import JoinOrder

benchmark = JoinOrder(scale_factor=0.1)
data_files = benchmark.generate_data()

print(f"Generated {len(data_files)} data files")
for file in data_files:
    size_mb = file.stat().st_size / (1024 * 1024)
    print(f" {file.name}: {size_mb:.2f} MB")

# Output:
# Generated 21 data files
#  title.csv: 45.2 MB
#  name.csv: 62.3 MB
#  cast_info.csv: 580.1 MB
#  ...
Schema Methods¶
- get_create_tables_sql(dialect='standard', tuning_config=None) → str¶
Get SQL to create all Join Order Benchmark tables.
Parameters:
dialect (str): SQL dialect (sqlite, postgres, mysql, duckdb)
tuning_config (UnifiedTuningConfiguration | None): Optional tuning configuration
Returns: str - DDL SQL statements
Example:
benchmark = JoinOrder(scale_factor=0.01)

# Get schema for DuckDB
ddl = benchmark.get_create_tables_sql(dialect="duckdb")

# Create tables
import duckdb
conn = duckdb.connect(":memory:")
conn.execute(ddl)
print("Created 21 tables")
- get_schema() → list[dict]¶
Get detailed schema information for all tables.
Returns: list[dict] - Table schema definitions
Example:
schema = benchmark.get_schema()
for table in schema[:3]:  # Show first 3 tables
    print(f"\nTable: {table['name']}")
    print(f" Columns: {len(table['columns'])}")
    print(f" Primary Key: {table.get('primary_key', 'None')}")
    print(f" Foreign Keys: {len(table.get('foreign_keys', []))}")
- get_table_names() → list[str]¶
Get list of all table names in the schema.
Returns: list[str] - Table names
Example:
tables = benchmark.get_table_names()
print(f"Total tables: {len(tables)}")
print(f"Tables: {', '.join(sorted(tables))}")

# Output:
# Total tables: 21
# Tables: aka_name, aka_title, cast_info, char_name, ...
- get_table_info(table_name) → dict¶
Get detailed information about a specific table.
Parameters:
table_name (str): Name of the table
Returns: dict - Table information (columns, constraints, row count estimate)
Example:
# Get info for cast_info table
info = benchmark.get_table_info("cast_info")

print(f"Table: {info['name']}")
print(f"Columns: {len(info['columns'])}")
print(f"Estimated rows (SF=0.1): {info['estimated_rows']:,}")
print(f"Primary Key: {info.get('primary_key')}")

# Show column details
for col in info['columns'][:5]:
    print(f" - {col['name']}: {col['type']}")
- get_relationship_tables() → list[str]¶
Get list of relationship/junction tables.
Returns: list[str] - Relationship table names
Example:
rel_tables = benchmark.get_relationship_tables()
print(f"Relationship tables: {', '.join(rel_tables)}")

# Output:
# Relationship tables: cast_info, movie_companies, movie_keyword, movie_info, ...
- get_dimension_tables() → list[str]¶
Get list of main dimension tables.
Returns: list[str] - Dimension table names
Example:
dim_tables = benchmark.get_dimension_tables()
print(f"Dimension tables: {', '.join(dim_tables)}")

# Output:
# Dimension tables: title, name, company_name, keyword, char_name
- get_table_row_count(table_name) → int¶
Get expected row count for a table at current scale factor.
Parameters:
table_name (str): Table name
Returns: int - Expected row count
Example:
# Check row counts at SF=0.1
tables = ["title", "name", "cast_info", "movie_companies"]
for table in tables:
    count = benchmark.get_table_row_count(table)
    print(f"{table:20s}: {count:>10,} rows")

# Output:
# title               :    250,000 rows
# name                :    400,000 rows
# cast_info           :  3,500,000 rows
# movie_companies     :    260,000 rows
Query Methods¶
- get_query(query_id, *, params=None) → str¶
Get a specific Join Order Benchmark query.
Parameters:
query_id (str): Query identifier (e.g., “1a”, “2b”, “3c”)
params (dict | None): Not supported for JoinOrder (raises ValueError if provided)
Returns: str - SQL query text
Raises: ValueError if params provided
Example:
# Get query 1a
query = benchmark.get_query("1a")
print(f"Query 1a:")
print(query)

# Count number of joins
join_count = query.upper().count("JOIN")
print(f"\nJoins: {join_count}")
- get_queries() → dict[str, str]¶
Get all Join Order Benchmark queries.
Returns: dict - Mapping of query IDs to SQL text
Example:
queries = benchmark.get_queries()
print(f"Total queries: {len(queries)}")

# Show query IDs and join counts
for query_id, sql in sorted(queries.items()):
    join_count = sql.upper().count("JOIN")
    print(f" {query_id}: {join_count} joins")
- get_query_ids() → list[str]¶
Get list of all query IDs.
Returns: list[str] - Query identifiers
Example:
query_ids = benchmark.get_query_ids()
print(f"Available queries: {', '.join(sorted(query_ids))}")

# Output:
# Available queries: 1a, 1b, 1c, 1d, 2a, 2b, 2c, 2d, 3a, 3b, 3c, 4a, 4b
- get_query_count() → int¶
Get total number of queries available.
Returns: int - Query count
Example:
count = benchmark.get_query_count()
print(f"Total queries: {count}")

# If loaded from directory
benchmark_full = JoinOrder(
    scale_factor=0.01,
    queries_dir="/path/to/join-order-benchmark/queries"
)
full_count = benchmark_full.get_query_count()
print(f"Full benchmark queries: {full_count}")  # 113 queries
Query Analysis Methods¶
- get_queries_by_complexity() → dict[str, list[str]]¶
Get queries categorized by complexity level.
Returns: dict - Mapping of complexity levels to query IDs
Example:
by_complexity = benchmark.get_queries_by_complexity()
for complexity, queries in sorted(by_complexity.items()):
    print(f"\n{complexity.capitalize()} queries ({len(queries)}):")
    print(f" {', '.join(queries)}")

# Output:
# Simple queries (4):
#  1a, 1b, 1c, 1d
# Medium queries (5):
#  2a, 2b, 2c, 2d, 3a
# Complex queries (4):
#  3b, 3c, 4a, 4b
- get_queries_by_pattern() → dict[str, list[str]]¶
Get queries categorized by join pattern type.
Returns: dict - Mapping of join patterns to query IDs
Example:
by_pattern = benchmark.get_queries_by_pattern()
for pattern, queries in sorted(by_pattern.items()):
    print(f"\n{pattern.replace('_', ' ').title()} ({len(queries)} queries):")
    for query_id in queries:
        query = benchmark.get_query(query_id)
        tables = query.upper().count("FROM") + query.upper().count("JOIN")
        print(f" {query_id}: ~{tables} tables")
Utility Methods¶
- validate_query(query_id) → bool¶
Validate that a query is syntactically correct.
Parameters:
query_id (str): Query identifier
Returns: bool - True if query is valid
Example:
for query_id in benchmark.get_query_ids():
    is_valid = benchmark.validate_query(query_id)
    if not is_valid:
        print(f"⚠️ Query {query_id} validation failed")
- get_estimated_data_size() → int¶
Get estimated data size in bytes at current scale factor.
Returns: int - Estimated total size in bytes
Example:
size_bytes = benchmark.get_estimated_data_size()
size_mb = size_bytes / (1024 * 1024)
size_gb = size_bytes / (1024 * 1024 * 1024)

print(f"Estimated data size:")
print(f" {size_mb:.1f} MB")
print(f" {size_gb:.2f} GB")
- get_benchmark_info() → dict¶
Get comprehensive benchmark information and metadata.
Returns: dict - Benchmark metadata
Example:
info = benchmark.get_benchmark_info()

print(f"Benchmark: {info['benchmark_name']}")
print(f"Description: {info['description']}")
print(f"Scale Factor: {info['scale_factor']}")
print(f"Total Queries: {info['total_queries']}")
print(f"Total Tables: {info['total_tables']}")
print(f"Estimated Size: {info['estimated_size_bytes'] / 1024 / 1024:.1f} MB")
print(f"\nReference: {info['reference_paper']}")
print(f"Authors: {info['authors']}")
- load_queries_from_directory(queries_dir) → None¶
Load queries from original Join Order Benchmark repository.
Parameters:
queries_dir (str): Path to directory containing .sql query files
Example:
benchmark = JoinOrder(scale_factor=0.01)

# Initially has 13 core queries
print(f"Core queries: {benchmark.get_query_count()}")

# Load full 113-query set
benchmark.load_queries_from_directory("/path/to/join-order-benchmark/queries")
print(f"After loading: {benchmark.get_query_count()} queries")

# Output:
# Core queries: 13
# After loading: 113 queries
Usage Examples¶
Basic Benchmark Execution¶
Complete workflow from data generation to query execution:
from benchbox.joinorder import JoinOrder
from benchbox.platforms.duckdb import DuckDBAdapter
# Initialize benchmark with small scale
benchmark = JoinOrder(scale_factor=0.01, verbose=True)
# Generate data
print("Generating data...")
data_files = benchmark.generate_data()
print(f"Generated {len(data_files)} files")
# Run with DuckDB
print("\nRunning benchmark...")
adapter = DuckDBAdapter(memory_limit="2GB")
results = adapter.run_benchmark(benchmark)
# Analyze results
print(f"\nResults:")
print(f" Total queries: {results.total_queries}")
print(f" Successful: {results.successful_queries}")
print(f" Failed: {results.failed_queries}")
print(f" Average time: {results.average_query_time:.3f}s")
print(f" Total time: {results.total_execution_time:.2f}s")
Query Complexity Analysis¶
Analyze query complexity distribution:
from benchbox.joinorder import JoinOrder
from benchbox.platforms.duckdb import DuckDBAdapter
import time
benchmark = JoinOrder(scale_factor=0.01)
benchmark.generate_data()
# Analyze by complexity
by_complexity = benchmark.get_queries_by_complexity()
adapter = DuckDBAdapter()
conn = adapter.create_connection()
adapter.create_schema(benchmark, conn)
adapter.load_data(benchmark, conn, benchmark.output_dir)
# Run queries by complexity level
complexity_times = {}
for complexity, query_ids in sorted(by_complexity.items()):
    print(f"\n{complexity.upper()} Queries ({len(query_ids)}):")
    times = []
    for query_id in query_ids:
        query = benchmark.get_query(query_id)
        start = time.time()
        result = conn.execute(query).fetchall()
        elapsed = time.time() - start
        times.append(elapsed)
        print(f" {query_id}: {elapsed*1000:.1f} ms ({len(result)} rows)")
    avg_time = sum(times) / len(times)
    complexity_times[complexity] = avg_time
# Summary
print(f"\n{'='*50}")
print("Complexity Performance Summary:")
for complexity, avg_time in sorted(complexity_times.items()):
    print(f" {complexity:10s}: {avg_time*1000:.1f} ms average")
Join Pattern Analysis¶
Analyze different join patterns:
from benchbox.joinorder import JoinOrder
from benchbox.platforms.duckdb import DuckDBAdapter
benchmark = JoinOrder(scale_factor=0.01)
benchmark.generate_data()
# Get queries by join pattern
by_pattern = benchmark.get_queries_by_pattern()
print("Join Pattern Distribution:")
for pattern, query_ids in sorted(by_pattern.items()):
    print(f"\n{pattern.replace('_', ' ').title()}:")
    print(f" Queries: {', '.join(query_ids)}")
    print(f" Count: {len(query_ids)}")

    # Analyze a sample query
    sample_query = benchmark.get_query(query_ids[0])
    join_count = sample_query.upper().count("JOIN")
    from_count = sample_query.upper().count("FROM")
    print(f" Example ({query_ids[0]}): {join_count} joins, {from_count} FROM clauses")
Optimizer Testing¶
Test query optimizer with different query variations:
from benchbox.joinorder import JoinOrder
from benchbox.platforms.duckdb import DuckDBAdapter
import time
benchmark = JoinOrder(scale_factor=0.1)
benchmark.generate_data()
adapter = DuckDBAdapter()
conn = adapter.create_connection()
adapter.create_schema(benchmark, conn)
adapter.load_data(benchmark, conn, benchmark.output_dir)
# Test optimizer on complex queries
complex_queries = benchmark.get_queries_by_complexity()["complex"]
print("Optimizer Analysis:")
for query_id in complex_queries:
    query = benchmark.get_query(query_id)

    # Enable query profiling
    conn.execute("PRAGMA enable_profiling='query_tree'")

    # Run query
    start = time.time()
    result = conn.execute(query).fetchall()
    elapsed = time.time() - start

    print(f"\n{query_id}:")
    print(f" Execution time: {elapsed*1000:.1f} ms")
    print(f" Rows returned: {len(result)}")
    print(f" Tables joined: ~{query.upper().count('JOIN')}")

    # Get query plan (DuckDB specific)
    try:
        plan = conn.execute("EXPLAIN " + query).fetchall()
        print(f" Plan nodes: {len(plan)}")
    except Exception:
        pass
Schema Exploration¶
Explore the complex schema structure:
from benchbox.joinorder import JoinOrder
benchmark = JoinOrder(scale_factor=0.1)
# Get all tables categorized
dim_tables = benchmark.get_dimension_tables()
rel_tables = benchmark.get_relationship_tables()
print("Schema Structure:")
print(f"\nDimension Tables ({len(dim_tables)}):")
for table in dim_tables:
    info = benchmark.get_table_info(table)
    row_count = benchmark.get_table_row_count(table)
    print(f" {table:20s}: {row_count:>10,} rows, {len(info['columns'])} columns")
print(f"\nRelationship Tables ({len(rel_tables)}):")
for table in rel_tables:
    info = benchmark.get_table_info(table)
    row_count = benchmark.get_table_row_count(table)
    fk_count = len(info.get('foreign_keys', []))
    print(f" {table:20s}: {row_count:>10,} rows, {fk_count} foreign keys")
# Estimate total size
total_size = benchmark.get_estimated_data_size()
print(f"\nEstimated Total Size: {total_size / 1024 / 1024:.1f} MB")
Custom Query Execution¶
Run specific queries with custom analysis:
from benchbox.joinorder import JoinOrder
from benchbox.platforms.duckdb import DuckDBAdapter
from benchbox.core.results.timing import TimingCollector, TimingAnalyzer
benchmark = JoinOrder(scale_factor=0.01)
benchmark.generate_data()
adapter = DuckDBAdapter()
conn = adapter.create_connection()
adapter.create_schema(benchmark, conn)
adapter.load_data(benchmark, conn, benchmark.output_dir)
# Collect detailed timing
collector = TimingCollector(enable_detailed_timing=True)
# Run subset of queries
target_queries = ["1a", "2a", "3a", "4a"]
for query_id in target_queries:
    query = benchmark.get_query(query_id)
    with collector.time_query(query_id, f"Query {query_id}") as timing:
        with collector.time_phase(query_id, "execute"):
            result = conn.execute(query).fetchall()
        collector.record_metric(query_id, "rows_returned", len(result))
        collector.record_metric(query_id, "tables_joined", query.upper().count("JOIN"))
# Analyze timing
timings = collector.get_completed_timings()
analyzer = TimingAnalyzer(timings)
analysis = analyzer.analyze_query_performance()
print("\nPerformance Analysis:")
print(f" Mean: {analysis['basic_stats']['mean']*1000:.1f} ms")
print(f" Median: {analysis['basic_stats']['median']*1000:.1f} ms")
print(f" Min: {analysis['basic_stats']['min']*1000:.1f} ms")
print(f" Max: {analysis['basic_stats']['max']*1000:.1f} ms")
Best Practices¶
Start with Small Scale Factors
Begin with small scale factors for development and testing:
# Development - very fast
dev_benchmark = JoinOrder(scale_factor=0.001)   # ~700 KB

# Testing - reasonable size
test_benchmark = JoinOrder(scale_factor=0.01)   # ~17 MB

# Production - realistic workload
prod_benchmark = JoinOrder(scale_factor=0.1)    # ~170 MB
Analyze Query Complexity
Use complexity categorization to understand workload:
by_complexity = benchmark.get_queries_by_complexity()

# Start with simple queries
for query_id in by_complexity["simple"]:
    # Test basic join performance
    pass

# Then move to complex queries
for query_id in by_complexity["complex"]:
    # Test advanced optimizer capabilities
    pass
Use Join Pattern Analysis
Focus on specific join patterns for targeted testing:
by_pattern = benchmark.get_queries_by_pattern()

# Test star joins
star_queries = by_pattern.get("star_join", [])

# Test chain joins
chain_queries = by_pattern.get("chain_join", [])

# Test complex patterns
complex_queries = by_pattern.get("complex_join", [])
Monitor Optimizer Behavior
Enable query profiling to understand optimizer decisions:
# DuckDB example
conn.execute("PRAGMA enable_profiling='query_tree'")

# Run query
result = conn.execute(query).fetchall()

# Analyze plan
plan = conn.execute("EXPLAIN " + query).fetchall()
Cache Generated Data
Reuse generated data across runs:
from pathlib import Path

data_dir = Path("joinorder_cache/sf01")

if data_dir.exists():
    # Reuse existing data
    benchmark = JoinOrder(scale_factor=0.1, output_dir=data_dir)
else:
    # Generate once
    benchmark = JoinOrder(scale_factor=0.1, output_dir=data_dir)
    benchmark.generate_data()
Common Issues¶
Large Data Size¶
Problem: Scale factor 1.0 generates ~1.7GB of data
Solution: Use appropriate scale factors:
# For CI/CD - use minimal scale
ci_benchmark = JoinOrder(scale_factor=0.001)
# For local testing - use small scale
test_benchmark = JoinOrder(scale_factor=0.01)
# For performance testing - use realistic scale
perf_benchmark = JoinOrder(scale_factor=0.1)
Slow Query Execution¶
Problem: Complex queries take too long
Solution: Start with simpler queries or smaller scale:
# Filter to simple queries only
simple_queries = benchmark.get_queries_by_complexity()["simple"]
for query_id in simple_queries:
    query = benchmark.get_query(query_id)
    # Execute fast queries first
    result = conn.execute(query).fetchall()
Memory Issues¶
Problem: Large datasets cause out-of-memory errors
Solution: Use file-based database or limit memory:
from benchbox.platforms.duckdb import DuckDBAdapter
# Use file-based database
adapter = DuckDBAdapter(
database_path="joinorder.duckdb",
memory_limit="2GB"
)
See Also¶
Join Order Benchmark Implementation - Join Order Benchmark guide
TPC-H Benchmark API - TPC-H benchmark API
TPC-DS Benchmark API - TPC-DS benchmark API
Base Benchmark API - Base benchmark interface
DuckDB Platform Adapter - DuckDB adapter
Advanced Performance Optimization Guide - Performance optimization guide
External Resources¶
Join Order Benchmark Paper - Original VLDB 2015 paper
GitHub Repository - Original benchmark repository
IMDB Interfaces - IMDB dataset information