Join Order Benchmark API

Tags: reference, python-api, custom-benchmark

Complete Python API reference for the Join Order Benchmark (JOB).

Overview

The Join Order Benchmark tests a query optimizer's join order selection using a complex schema of 21 interconnected tables based on the Internet Movie Database (IMDB). Its queries pose join-ordering problems that expose weaknesses in cardinality estimation and cost models.

Key Features:

  • Complex Schema: 21 interconnected tables modeling a movie database

  • Join Optimization Focus: Queries designed to test join order selection

  • 13 Core Queries: Representative join patterns and complexity levels

  • Synthetic Data Generation: Realistic data preserving join characteristics

  • Multi-Dialect Support: Compatible with all major SQL databases

  • Scalable: From development (SF 0.001) to production (SF 1.0+)

Reference: Viktor Leis et al. “How Good Are Query Optimizers, Really?” (PVLDB 2015)

Quick Start

A simple example to get started with the Join Order Benchmark:

from benchbox.joinorder import JoinOrder
from benchbox.platforms.duckdb import DuckDBAdapter

# Create benchmark with small scale for testing
benchmark = JoinOrder(scale_factor=0.01)

# Generate data
data_files = benchmark.generate_data()

# Run with DuckDB
adapter = DuckDBAdapter()
results = adapter.run_benchmark(benchmark)

print(f"Queries executed: {results.total_queries}")
print(f"Average time: {results.average_query_time:.3f}s")

API Reference

JoinOrder Class

class JoinOrder(scale_factor=1.0, output_dir=None, **kwargs)[source]

Bases: BaseBenchmark

Join Order Benchmark implementation.

This class provides an implementation of the Join Order Benchmark, including data generation and access to complex join queries for cardinality estimation and join order optimization testing.

Reference: Viktor Leis et al. “How Good Are Query Optimizers, Really?”

__init__(scale_factor=1.0, output_dir=None, **kwargs)[source]

Initialize a Join Order Benchmark instance.

Parameters:
  • scale_factor (float) – Scale factor for the benchmark (1.0 ≈ 1.7GB)

  • output_dir (str | Path | None) – Directory to output generated data files

  • **kwargs (Any) – Additional implementation-specific options

generate_data()[source]

Generate Join Order Benchmark data.

Returns:

A list of paths to the generated data files

Return type:

list[Path]

get_queries()[source]

Get all Join Order Benchmark queries.

Returns:

A dictionary mapping query IDs to query strings

Return type:

dict[str, str]

get_query(query_id, *, params=None)[source]

Get a specific Join Order Benchmark query.

Parameters:
  • query_id (int | str) – The ID of the query to retrieve

  • params (dict[str, Any] | None) – Optional parameters to customize the query

Returns:

The query string

Raises:

ValueError – If the query_id is invalid

Return type:

str

get_schema(dialect='sqlite')[source]

Get the Join Order Benchmark schema DDL.

Parameters:

dialect (str) – Target SQL dialect

Returns:

DDL statements for creating all tables

Return type:

str

get_create_tables_sql(dialect='standard', tuning_config=None)[source]

Get SQL to create all Join Order Benchmark tables.

Parameters:
  • dialect (str) – SQL dialect to use

  • tuning_config (Any) – Unified tuning configuration for constraint settings

Returns:

SQL script for creating all tables

Return type:

str

apply_verbosity(settings)

Apply verbosity settings to the mixin consumer.

property benchmark_name: str

Get the human-readable benchmark name.

create_enhanced_benchmark_result(platform, query_results, execution_metadata=None, phases=None, resource_utilization=None, performance_characteristics=None, **kwargs)

Create a BenchmarkResults object with standardized fields.

This centralizes the logic for creating benchmark results that was previously duplicated across platform adapters and the CLI orchestrator.

Parameters:
  • platform (str) – Platform name (e.g., “DuckDB”, “ClickHouse”)

  • query_results (list[dict[str, Any]]) – List of query execution results

  • execution_metadata (dict[str, Any] | None) – Optional execution metadata

  • phases (dict[str, dict[str, Any]] | None) – Optional phase tracking information

  • resource_utilization (dict[str, Any] | None) – Optional resource usage metrics

  • performance_characteristics (dict[str, Any] | None) – Optional performance analysis

  • **kwargs (Any) – Additional fields to override defaults

Returns:

Fully configured BenchmarkResults object

Return type:

BenchmarkResults
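
A sketch of manual assembly (it assumes benchmark is a JoinOrder instance, conn is an open DatabaseConnection, and that the per-query dictionaries returned by run_query() are acceptable as query_results entries):

# Hypothetical illustration: collect per-query dictionaries, then wrap them
# in a standardized BenchmarkResults object.
query_results = [benchmark.run_query(qid, conn) for qid in ["1a", "2a"]]

results = benchmark.create_enhanced_benchmark_result(
    platform="DuckDB",
    query_results=query_results,
    execution_metadata={"scale_factor": benchmark.scale_factor},
)
print(type(results).__name__)  # BenchmarkResults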

format_results(benchmark_result)

Format benchmark results for display.

Parameters:

benchmark_result (dict[str, Any]) – Result dictionary from run_benchmark()

Returns:

Formatted string representation of the results

Return type:

str
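
For example (assuming benchmark is a JoinOrder instance and conn is an open DatabaseConnection), the dictionary returned by run_benchmark() can be passed directly to format_results():

result = benchmark.run_benchmark(conn, query_ids=["1a", "1b"])
print(benchmark.format_results(result))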

get_data_source_benchmark()

Return the canonical source benchmark when data is shared.

Benchmarks that reuse data generated by another benchmark (for example, Primitives reusing TPC-H datasets) should override this method and return the lower-case identifier of the source benchmark. Benchmarks that produce their own data should return None (default).
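
A short sketch of this contract (the override shown is purely illustrative); JoinOrder generates its own data, so it keeps the default:

from benchbox.joinorder import JoinOrder

# JoinOrder produces its own data files, so the default implementation applies.
print(JoinOrder(scale_factor=0.01).get_data_source_benchmark())  # None

# A hypothetical data-sharing benchmark would instead override the method:
#
#     def get_data_source_benchmark(self):
#         return "tpch"  # lower-case identifier of the source benchmark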

log_debug_info(context='Debug')

Log comprehensive debug information including version details.

log_error_with_debug_info(error, context='Error')

Log an error with comprehensive debug information.

log_operation_complete(operation, duration=None, details='')
log_operation_start(operation, details='')
log_verbose(message)
log_version_warning()

Log version consistency warnings if any exist.

log_very_verbose(message)
property logger: Logger

Return the logger configured for the verbosity mixin consumer.

quiet: bool = False
run_benchmark(connection, query_ids=None, fetch_results=False, setup_database=True)

Run the complete benchmark suite.

Parameters:
  • connection (DatabaseConnection) – Database connection to execute queries on

  • query_ids (list[int | str] | None) – Optional list of specific query IDs to run (defaults to all)

  • fetch_results (bool) – Whether to fetch and return query results

  • setup_database (bool) – Whether to set up the database first

Returns:

A dictionary containing:

  • benchmark_name: Name of the benchmark

  • total_execution_time: Total time for all queries

  • total_queries: Number of queries executed

  • successful_queries: Number of queries that succeeded

  • failed_queries: Number of queries that failed

  • query_results: List of individual query results

  • setup_time: Time taken for database setup (if performed)

Return type:

dict[str, Any]

Raises:

Exception – If benchmark execution fails
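
A minimal sketch (assuming benchmark is a JoinOrder instance and conn is an open DatabaseConnection) that runs two queries and reads the documented keys:

result = benchmark.run_benchmark(conn, query_ids=["1a", "2a"])

print(f"Benchmark: {result['benchmark_name']}")
print(f"Succeeded: {result['successful_queries']}/{result['total_queries']}")
print(f"Total time: {result['total_execution_time']:.2f}s")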

run_query(query_id, connection, params=None, fetch_results=False)

Execute a single query and return its timing and results.

Parameters:
  • query_id (int | str) – ID of the query to execute

  • connection (DatabaseConnection) – Database connection to execute query on

  • params (dict[str, Any] | None) – Optional parameters for query customization

  • fetch_results (bool) – Whether to fetch and return query results

Returns:

A dictionary containing:

  • query_id: Executed query ID

  • execution_time: Time taken to execute query in seconds

  • query_text: Executed query text

  • results: Query results if fetch_results=True, otherwise None

  • row_count: Number of rows returned (if results fetched)

Return type:

dict[str, Any]

Raises:
  • ValueError – If query_id is invalid

  • Exception – If query execution fails
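
For example (again assuming an open DatabaseConnection):

timing = benchmark.run_query("1a", conn, fetch_results=True)

print(f"Query {timing['query_id']}: {timing['execution_time']:.3f}s")
print(f"Rows returned: {timing['row_count']}")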

run_with_platform(platform_adapter, **run_config)

Run complete benchmark using platform-specific optimizations.

This method provides a unified interface for running benchmarks using database platform adapters that handle connection management, data loading optimizations, and query execution.

This is the standard method that all benchmarks should support for integration with the CLI and other orchestration tools.

Parameters:
  • platform_adapter – Platform adapter instance (e.g., DuckDBAdapter)

  • **run_config – Configuration options:
      - categories: List of query categories to run (if the benchmark supports categories)
      - query_subset: List of specific query IDs to run
      - connection: Connection configuration
      - benchmark_type: Type hint for optimizations (‘olap’, ‘oltp’, etc.)

Returns:

BenchmarkResults object with execution results

Example

from benchbox.platforms import DuckDBAdapter

benchmark = SomeBenchmark(scale_factor=0.1)
adapter = DuckDBAdapter()
results = benchmark.run_with_platform(adapter)

setup_database(connection)

Set up database with schema and data.

Creates necessary database schema and loads benchmark data into the database.

Parameters:

connection (DatabaseConnection) – Database connection to set up

Raises:
  • ValueError – If data generation fails

  • Exception – If database setup fails
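
A short sketch, assuming the connection object returned by the DuckDB adapter satisfies the DatabaseConnection interface:

from benchbox.platforms.duckdb import DuckDBAdapter

adapter = DuckDBAdapter()
conn = adapter.create_connection()

benchmark.setup_database(conn)  # creates the schema and loads generated data
result = benchmark.run_benchmark(conn, setup_database=False)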

translate_query(query_id, dialect)

Translate a query to a specific SQL dialect.

Parameters:
  • query_id (int | str) – The ID of the query to translate

  • dialect (str) – The target SQL dialect

Returns:

The translated query string

Raises:
  • ValueError – If the query_id is invalid

  • ImportError – If sqlglot is not installed

  • ValueError – If the dialect is not supported

Return type:

str
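
For example, translating a core query to PostgreSQL syntax (requires the optional sqlglot dependency):

pg_sql = benchmark.translate_query("1a", "postgres")
print(pg_sql[:200])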

verbose: bool = False
verbose_enabled: bool = False
verbose_level: int = 0
property verbosity_settings: VerbositySettings

Return the current verbosity settings.

very_verbose: bool = False

JoinOrder is the main interface to the Join Order Benchmark. The sections below describe its constructor, data generation, schema, query, and analysis methods in detail.

Constructor:

JoinOrder(
    scale_factor: float = 1.0,
    output_dir: Optional[Union[str, Path]] = None,
    queries_dir: Optional[str] = None,
    **kwargs
)

Parameters:

  • scale_factor (float): Data size multiplier (1.0 ≈ 1.7GB, default: 1.0)

  • output_dir (str | Path | None): Directory for generated data files

  • queries_dir (str | None): Directory with Join Order Benchmark query files (loads full 113-query set if provided)

  • verbose (bool | int): Enable verbose output

  • parallel (int): Number of parallel generation threads

  • force_regenerate (bool): Force data regeneration even if files exist

Attributes:

  • scale_factor (float): Current scale factor

  • output_dir (Path): Data file output directory

  • queries_dir (str | None): External queries directory
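
For example, the generation-related keyword arguments can be combined as follows (the output directory shown is a placeholder):

from benchbox.joinorder import JoinOrder

benchmark = JoinOrder(
    scale_factor=0.1,
    output_dir="data/joinorder_sf01",
    verbose=True,
    parallel=4,
    force_regenerate=False,
)

print(benchmark.scale_factor)  # 0.1
print(benchmark.output_dir)    # Path to the data directory
print(benchmark.queries_dir)   # None (built-in 13-query core set)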

Data Generation Methods

generate_data() → list[Path]

Generate Join Order Benchmark dataset with 21 tables.

Returns: list[Path] - Paths to generated CSV files

Example:

from benchbox.joinorder import JoinOrder

benchmark = JoinOrder(scale_factor=0.1)
data_files = benchmark.generate_data()

print(f"Generated {len(data_files)} data files")
for file in data_files:
    size_mb = file.stat().st_size / (1024 * 1024)
    print(f"  {file.name}: {size_mb:.2f} MB")

# Output:
# Generated 21 data files
#   title.csv: 45.2 MB
#   name.csv: 62.3 MB
#   cast_info.csv: 580.1 MB
#   ...

Schema Methods

get_create_tables_sql(dialect='standard', tuning_config=None) → str

Get SQL to create all Join Order Benchmark tables.

Parameters:

  • dialect (str): SQL dialect (sqlite, postgres, mysql, duckdb)

  • tuning_config (UnifiedTuningConfiguration | None): Optional tuning configuration

Returns: str - DDL SQL statements

Example:

benchmark = JoinOrder(scale_factor=0.01)

# Get schema for DuckDB
ddl = benchmark.get_create_tables_sql(dialect="duckdb")

# Create tables
import duckdb
conn = duckdb.connect(":memory:")
conn.execute(ddl)

print("Created 21 tables")

get_schema() → list[dict]

Get detailed schema information for all tables.

Returns: list[dict] - Table schema definitions

Example:

schema = benchmark.get_schema()

for table in schema[:3]:  # Show first 3 tables
    print(f"\nTable: {table['name']}")
    print(f"  Columns: {len(table['columns'])}")
    print(f"  Primary Key: {table.get('primary_key', 'None')}")
    print(f"  Foreign Keys: {len(table.get('foreign_keys', []))}")

get_table_names() → list[str]

Get list of all table names in the schema.

Returns: list[str] - Table names

Example:

tables = benchmark.get_table_names()
print(f"Total tables: {len(tables)}")
print(f"Tables: {', '.join(sorted(tables))}")

# Output:
# Total tables: 21
# Tables: aka_name, aka_title, cast_info, char_name, ...

get_table_info(table_name) → dict

Get detailed information about a specific table.

Parameters:

  • table_name (str): Name of the table

Returns: dict - Table information (columns, constraints, row count estimate)

Example:

# Get info for cast_info table
info = benchmark.get_table_info("cast_info")

print(f"Table: {info['name']}")
print(f"Columns: {len(info['columns'])}")
print(f"Estimated rows (SF=0.1): {info['estimated_rows']:,}")
print(f"Primary Key: {info.get('primary_key')}")

# Show column details
for col in info['columns'][:5]:
    print(f"  - {col['name']}: {col['type']}")

get_relationship_tables() → list[str]

Get list of relationship/junction tables.

Returns: list[str] - Relationship table names

Example:

rel_tables = benchmark.get_relationship_tables()
print(f"Relationship tables: {', '.join(rel_tables)}")

# Output:
# Relationship tables: cast_info, movie_companies, movie_keyword, movie_info, ...

get_dimension_tables() → list[str]

Get list of main dimension tables.

Returns: list[str] - Dimension table names

Example:

dim_tables = benchmark.get_dimension_tables()
print(f"Dimension tables: {', '.join(dim_tables)}")

# Output:
# Dimension tables: title, name, company_name, keyword, char_name

get_table_row_count(table_name) → int

Get expected row count for a table at current scale factor.

Parameters:

  • table_name (str): Table name

Returns: int - Expected row count

Example:

# Check row counts at SF=0.1
tables = ["title", "name", "cast_info", "movie_companies"]

for table in tables:
    count = benchmark.get_table_row_count(table)
    print(f"{table:20s}: {count:>10,} rows")

# Output:
# title              :    250,000 rows
# name               :    400,000 rows
# cast_info          :  3,500,000 rows
# movie_companies    :    260,000 rows

Query Methods

get_query(query_id, *, params=None) → str

Get a specific Join Order Benchmark query.

Parameters:

  • query_id (str): Query identifier (e.g., “1a”, “2b”, “3c”)

  • params (dict | None): Not supported for JoinOrder (raises ValueError if provided)

Returns: str - SQL query text

Raises: ValueError if params provided

Example:

# Get query 1a
query = benchmark.get_query("1a")

print(f"Query 1a:")
print(query)

# Count number of joins
join_count = query.upper().count("JOIN")
print(f"\nJoins: {join_count}")

get_queries() → dict[str, str]

Get all Join Order Benchmark queries.

Returns: dict - Mapping of query IDs to SQL text

Example:

queries = benchmark.get_queries()

print(f"Total queries: {len(queries)}")

# Show query IDs and join counts
for query_id, sql in sorted(queries.items()):
    join_count = sql.upper().count("JOIN")
    print(f"  {query_id}: {join_count} joins")

get_query_ids() → list[str]

Get list of all query IDs.

Returns: list[str] - Query identifiers

Example:

query_ids = benchmark.get_query_ids()
print(f"Available queries: {', '.join(sorted(query_ids))}")

# Output:
# Available queries: 1a, 1b, 1c, 1d, 2a, 2b, 2c, 2d, 3a, 3b, 3c, 4a, 4b

get_query_count() → int

Get total number of queries available.

Returns: int - Query count

Example:

count = benchmark.get_query_count()
print(f"Total queries: {count}")

# If loaded from directory
benchmark_full = JoinOrder(
    scale_factor=0.01,
    queries_dir="/path/to/join-order-benchmark/queries"
)
full_count = benchmark_full.get_query_count()
print(f"Full benchmark queries: {full_count}")  # 113 queries

Query Analysis Methods

get_queries_by_complexity() → dict[str, list[str]]

Get queries categorized by complexity level.

Returns: dict - Mapping of complexity levels to query IDs

Example:

by_complexity = benchmark.get_queries_by_complexity()

for complexity, queries in sorted(by_complexity.items()):
    print(f"\n{complexity.capitalize()} queries ({len(queries)}):")
    print(f"  {', '.join(queries)}")

# Output:
# Simple queries (4):
#   1a, 1b, 1c, 1d
# Medium queries (5):
#   2a, 2b, 2c, 2d, 3a
# Complex queries (4):
#   3b, 3c, 4a, 4b

get_queries_by_pattern() → dict[str, list[str]]

Get queries categorized by join pattern type.

Returns: dict - Mapping of join patterns to query IDs

Example:

by_pattern = benchmark.get_queries_by_pattern()

for pattern, queries in sorted(by_pattern.items()):
    print(f"\n{pattern.replace('_', ' ').title()} ({len(queries)} queries):")
    for query_id in queries:
        query = benchmark.get_query(query_id)
        tables = query.upper().count("FROM") + query.upper().count("JOIN")
        print(f"  {query_id}: ~{tables} tables")

Utility Methods

validate_query(query_id) → bool

Validate that a query is syntactically correct.

Parameters:

  • query_id (str): Query identifier

Returns: bool - True if query is valid

Example:

for query_id in benchmark.get_query_ids():
    is_valid = benchmark.validate_query(query_id)
    if not is_valid:
        print(f"⚠️ Query {query_id} validation failed")

get_estimated_data_size() → int

Get estimated data size in bytes at current scale factor.

Returns: int - Estimated total size in bytes

Example:

size_bytes = benchmark.get_estimated_data_size()
size_mb = size_bytes / (1024 * 1024)
size_gb = size_bytes / (1024 * 1024 * 1024)

print(f"Estimated data size:")
print(f"  {size_mb:.1f} MB")
print(f"  {size_gb:.2f} GB")

get_benchmark_info() → dict

Get comprehensive benchmark information and metadata.

Returns: dict - Benchmark metadata

Example:

info = benchmark.get_benchmark_info()

print(f"Benchmark: {info['benchmark_name']}")
print(f"Description: {info['description']}")
print(f"Scale Factor: {info['scale_factor']}")
print(f"Total Queries: {info['total_queries']}")
print(f"Total Tables: {info['total_tables']}")
print(f"Estimated Size: {info['estimated_size_bytes'] / 1024 / 1024:.1f} MB")
print(f"\nReference: {info['reference_paper']}")
print(f"Authors: {info['authors']}")

load_queries_from_directory(queries_dir) → None

Load queries from original Join Order Benchmark repository.

Parameters:

  • queries_dir (str): Path to directory containing .sql query files

Example:

benchmark = JoinOrder(scale_factor=0.01)

# Initially has 13 core queries
print(f"Core queries: {benchmark.get_query_count()}")

# Load full 113-query set
benchmark.load_queries_from_directory("/path/to/join-order-benchmark/queries")

print(f"After loading: {benchmark.get_query_count()} queries")

# Output:
# Core queries: 13
# After loading: 113 queries

Usage Examples

Basic Benchmark Execution

Complete workflow from data generation to query execution:

from benchbox.joinorder import JoinOrder
from benchbox.platforms.duckdb import DuckDBAdapter

# Initialize benchmark with small scale
benchmark = JoinOrder(scale_factor=0.01, verbose=True)

# Generate data
print("Generating data...")
data_files = benchmark.generate_data()
print(f"Generated {len(data_files)} files")

# Run with DuckDB
print("\nRunning benchmark...")
adapter = DuckDBAdapter(memory_limit="2GB")
results = adapter.run_benchmark(benchmark)

# Analyze results
print(f"\nResults:")
print(f"  Total queries: {results.total_queries}")
print(f"  Successful: {results.successful_queries}")
print(f"  Failed: {results.failed_queries}")
print(f"  Average time: {results.average_query_time:.3f}s")
print(f"  Total time: {results.total_execution_time:.2f}s")

Query Complexity Analysis

Analyze query complexity distribution:

from benchbox.joinorder import JoinOrder
from benchbox.platforms.duckdb import DuckDBAdapter
import time

benchmark = JoinOrder(scale_factor=0.01)
benchmark.generate_data()

# Analyze by complexity
by_complexity = benchmark.get_queries_by_complexity()

adapter = DuckDBAdapter()
conn = adapter.create_connection()
adapter.create_schema(benchmark, conn)
adapter.load_data(benchmark, conn, benchmark.output_dir)

# Run queries by complexity level
complexity_times = {}

for complexity, query_ids in sorted(by_complexity.items()):
    print(f"\n{complexity.upper()} Queries ({len(query_ids)}):")

    times = []
    for query_id in query_ids:
        query = benchmark.get_query(query_id)

        start = time.time()
        result = conn.execute(query).fetchall()
        elapsed = time.time() - start

        times.append(elapsed)
        print(f"  {query_id}: {elapsed*1000:.1f} ms ({len(result)} rows)")

    avg_time = sum(times) / len(times)
    complexity_times[complexity] = avg_time

# Summary
print(f"\n{'='*50}")
print("Complexity Performance Summary:")
for complexity, avg_time in sorted(complexity_times.items()):
    print(f"  {complexity:10s}: {avg_time*1000:.1f} ms average")

Join Pattern Analysis

Analyze different join patterns:

from benchbox.joinorder import JoinOrder
from benchbox.platforms.duckdb import DuckDBAdapter

benchmark = JoinOrder(scale_factor=0.01)
benchmark.generate_data()

# Get queries by join pattern
by_pattern = benchmark.get_queries_by_pattern()

print("Join Pattern Distribution:")
for pattern, query_ids in sorted(by_pattern.items()):
    print(f"\n{pattern.replace('_', ' ').title()}:")
    print(f"  Queries: {', '.join(query_ids)}")
    print(f"  Count: {len(query_ids)}")

    # Analyze a sample query
    sample_query = benchmark.get_query(query_ids[0])
    join_count = sample_query.upper().count("JOIN")
    from_count = sample_query.upper().count("FROM")

    print(f"  Example ({query_ids[0]}): {join_count} joins, {from_count} FROM clauses")

Optimizer Testing

Test query optimizer with different query variations:

from benchbox.joinorder import JoinOrder
from benchbox.platforms.duckdb import DuckDBAdapter
import time

benchmark = JoinOrder(scale_factor=0.1)
benchmark.generate_data()

adapter = DuckDBAdapter()
conn = adapter.create_connection()
adapter.create_schema(benchmark, conn)
adapter.load_data(benchmark, conn, benchmark.output_dir)

# Test optimizer on complex queries
complex_queries = benchmark.get_queries_by_complexity()["complex"]

print("Optimizer Analysis:")
for query_id in complex_queries:
    query = benchmark.get_query(query_id)

    # Enable query profiling
    conn.execute("PRAGMA enable_profiling='query_tree'")

    # Run query
    start = time.time()
    result = conn.execute(query).fetchall()
    elapsed = time.time() - start

    print(f"\n{query_id}:")
    print(f"  Execution time: {elapsed*1000:.1f} ms")
    print(f"  Rows returned: {len(result)}")
    print(f"  Tables joined: ~{query.upper().count('JOIN')}")

    # Get query plan (DuckDB specific)
    try:
        plan = conn.execute("EXPLAIN " + query).fetchall()
        print(f"  Plan nodes: {len(plan)}")
    except Exception:
        pass  # EXPLAIN output may not be available on every platform

Schema Exploration

Explore the complex schema structure:

from benchbox.joinorder import JoinOrder

benchmark = JoinOrder(scale_factor=0.1)

# Get all tables categorized
dim_tables = benchmark.get_dimension_tables()
rel_tables = benchmark.get_relationship_tables()

print("Schema Structure:")
print(f"\nDimension Tables ({len(dim_tables)}):")
for table in dim_tables:
    info = benchmark.get_table_info(table)
    row_count = benchmark.get_table_row_count(table)
    print(f"  {table:20s}: {row_count:>10,} rows, {len(info['columns'])} columns")

print(f"\nRelationship Tables ({len(rel_tables)}):")
for table in rel_tables:
    info = benchmark.get_table_info(table)
    row_count = benchmark.get_table_row_count(table)
    fk_count = len(info.get('foreign_keys', []))
    print(f"  {table:20s}: {row_count:>10,} rows, {fk_count} foreign keys")

# Estimate total size
total_size = benchmark.get_estimated_data_size()
print(f"\nEstimated Total Size: {total_size / 1024 / 1024:.1f} MB")

Custom Query Execution

Run specific queries with custom analysis:

from benchbox.joinorder import JoinOrder
from benchbox.platforms.duckdb import DuckDBAdapter
from benchbox.core.results.timing import TimingCollector, TimingAnalyzer

benchmark = JoinOrder(scale_factor=0.01)
benchmark.generate_data()

adapter = DuckDBAdapter()
conn = adapter.create_connection()
adapter.create_schema(benchmark, conn)
adapter.load_data(benchmark, conn, benchmark.output_dir)

# Collect detailed timing
collector = TimingCollector(enable_detailed_timing=True)

# Run subset of queries
target_queries = ["1a", "2a", "3a", "4a"]

for query_id in target_queries:
    query = benchmark.get_query(query_id)

    with collector.time_query(query_id, f"Query {query_id}") as timing:
        with collector.time_phase(query_id, "execute"):
            result = conn.execute(query).fetchall()

        collector.record_metric(query_id, "rows_returned", len(result))
        collector.record_metric(query_id, "tables_joined", query.upper().count("JOIN"))

# Analyze timing
timings = collector.get_completed_timings()
analyzer = TimingAnalyzer(timings)

analysis = analyzer.analyze_query_performance()

print("\nPerformance Analysis:")
print(f"  Mean: {analysis['basic_stats']['mean']*1000:.1f} ms")
print(f"  Median: {analysis['basic_stats']['median']*1000:.1f} ms")
print(f"  Min: {analysis['basic_stats']['min']*1000:.1f} ms")
print(f"  Max: {analysis['basic_stats']['max']*1000:.1f} ms")

Best Practices

  1. Start with Small Scale Factors

    Begin with small scale factors for development and testing:

    # Development - very fast
    dev_benchmark = JoinOrder(scale_factor=0.001)  # ~700 KB
    
    # Testing - reasonable size
    test_benchmark = JoinOrder(scale_factor=0.01)  # ~17 MB
    
    # Production - realistic workload
    prod_benchmark = JoinOrder(scale_factor=0.1)   # ~170 MB
    
  2. Analyze Query Complexity

    Use complexity categorization to understand workload:

    by_complexity = benchmark.get_queries_by_complexity()
    
    # Start with simple queries
    for query_id in by_complexity["simple"]:
        # Test basic join performance
        pass
    
    # Then move to complex queries
    for query_id in by_complexity["complex"]:
        # Test advanced optimizer capabilities
        pass
    
  3. Use Join Pattern Analysis

    Focus on specific join patterns for targeted testing:

    by_pattern = benchmark.get_queries_by_pattern()
    
    # Test star joins
    star_queries = by_pattern.get("star_join", [])
    
    # Test chain joins
    chain_queries = by_pattern.get("chain_join", [])
    
    # Test complex patterns
    complex_queries = by_pattern.get("complex_join", [])
    
  4. Monitor Optimizer Behavior

    Enable query profiling to understand optimizer decisions:

    # DuckDB example
    conn.execute("PRAGMA enable_profiling='query_tree'")
    
    # Run query
    result = conn.execute(query).fetchall()
    
    # Analyze plan
    plan = conn.execute("EXPLAIN " + query).fetchall()
    
  5. Cache Generated Data

    Reuse generated data across runs:

    from pathlib import Path
    
    data_dir = Path("joinorder_cache/sf01")
    
    if data_dir.exists():
        # Reuse existing data
        benchmark = JoinOrder(scale_factor=0.1, output_dir=data_dir)
    else:
        # Generate once
        benchmark = JoinOrder(scale_factor=0.1, output_dir=data_dir)
        benchmark.generate_data()
    

Common Issues

Large Data Size

Problem: Scale factor 1.0 generates ~1.7GB of data

Solution: Use appropriate scale factors:

# For CI/CD - use minimal scale
ci_benchmark = JoinOrder(scale_factor=0.001)

# For local testing - use small scale
test_benchmark = JoinOrder(scale_factor=0.01)

# For performance testing - use realistic scale
perf_benchmark = JoinOrder(scale_factor=0.1)

Slow Query Execution

Problem: Complex queries take too long

Solution: Start with simpler queries or smaller scale:

# Filter to simple queries only
simple_queries = benchmark.get_queries_by_complexity()["simple"]

for query_id in simple_queries:
    query = benchmark.get_query(query_id)
    # Execute fast queries first
    result = conn.execute(query).fetchall()

Memory Issues

Problem: Large datasets cause out-of-memory errors

Solution: Use file-based database or limit memory:

from benchbox.platforms.duckdb import DuckDBAdapter

# Use file-based database
adapter = DuckDBAdapter(
    database_path="joinorder.duckdb",
    memory_limit="2GB"
)
