API Reference

Tags: reference, python-api

AMPLab Big Data Benchmark implementation.

Copyright 2026 Joe Harris / BenchBox Project

Licensed under the MIT License. See LICENSE file in the project root for details.

class AMPLab(scale_factor=1.0, output_dir=None, **kwargs)[source]

Bases: BaseBenchmark

AMPLab Big Data Benchmark implementation.

Provides AMPLab Big Data Benchmark implementation, including data generation and access to scan, join, and analytical queries for web analytics data.

Reference: AMPLab Big Data Benchmark - https://amplab.cs.berkeley.edu/benchmark/

__init__(scale_factor=1.0, output_dir=None, **kwargs)[source]

Initialize AMPLab Big Data Benchmark instance.

Parameters:
  • scale_factor (float) – Scale factor for the benchmark (1.0 = ~1GB)

  • output_dir (str | Path | None) – Directory to output generated data files

  • **kwargs – Additional implementation-specific options

generate_data()[source]

Generate AMPLab Big Data Benchmark data.

Returns:

A list of paths to the generated data files

Return type:

list[str | Path]

get_queries(dialect=None)[source]

Get all AMPLab Big Data Benchmark queries.

Parameters:

dialect (str | None) – Target SQL dialect for query translation. If None, returns original queries.

Returns:

Dictionary mapping query IDs to query strings

Return type:

dict[str, str]

get_query(query_id, *, params=None)[source]

Get specific AMPLab Big Data Benchmark query.

Parameters:
  • query_id (int | str) – ID of the query to retrieve (1-5)

  • params (dict[str, Any] | None) – Optional parameters to customize the query

Returns:

Query string

Raises:

ValueError – If query_id is invalid

Return type:

str

get_schema()[source]

Get AMPLab Big Data Benchmark schema.

Returns:

List of dictionaries describing the tables in the schema

Return type:

list[dict]

get_create_tables_sql(dialect='standard', tuning_config=None)[source]

Get SQL to create all AMPLab Big Data Benchmark tables.

Parameters:
  • dialect (str) – SQL dialect to use

  • tuning_config – Unified tuning configuration for constraint settings

Returns:

SQL script for creating all tables

Return type:

str
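A minimal end-to-end sketch of the class above. The `from benchbox import AMPLab` import path and the output directory are assumptions; the method names and parameters follow the signatures documented here.

from benchbox import AMPLab  # import path assumed

bench = AMPLab(scale_factor=0.1, output_dir="./amplab_data")

data_files = bench.generate_data()                    # paths to the generated files
ddl = bench.get_create_tables_sql(dialect="duckdb")   # CREATE TABLE script
queries = bench.get_queries(dialect="duckdb")         # query IDs mapped to SQL strings
q1 = bench.get_query(1)                               # a single query (IDs 1-5)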

Base class for all benchmarks.

Copyright 2026 Joe Harris / BenchBox Project

Licensed under the MIT License. See LICENSE file in the project root for details.

class BaseBenchmark(scale_factor=1.0, output_dir=None, **kwargs)[source]

Bases: VerbosityMixin, ABC

Base class for all benchmarks.

All benchmarks inherit from this class.

__init__(scale_factor=1.0, output_dir=None, **kwargs)[source]

Initialize a benchmark.

Parameters:
  • scale_factor (float) – Scale factor (1.0 = standard size)

  • output_dir (str | Path | None) – Data output directory

  • **kwargs (Any) – Additional options

get_data_source_benchmark()[source]

Return the canonical source benchmark when data is shared.

Benchmarks that reuse data generated by another benchmark (for example, Primitives reusing TPC-H datasets) should override this method and return the lower-case identifier of the source benchmark. Benchmarks that produce their own data should return None (default).

abstractmethod generate_data()[source]

Generate benchmark data.

Returns:

List of data file paths

Return type:

list[str | Path]

abstractmethod get_queries()[source]

Get all benchmark queries.

Returns:

Dictionary mapping query IDs to query strings

Return type:

dict[str, str]

abstractmethod get_query(query_id, *, params=None)[source]

Get a benchmark query.

Parameters:
  • query_id (int | str) – Query ID

  • params (dict[str, Any] | None) – Optional parameters

Returns:

Query string with parameters resolved

Raises:

ValueError – If query_id is invalid

Return type:

str
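Because generate_data, get_queries, and get_query are the only abstract members, a new benchmark can be defined by implementing just those three methods and then inheriting setup_database, run_query, and run_benchmark below. A minimal sketch, assuming BaseBenchmark is importable from the benchbox package:

from pathlib import Path
from typing import Any

from benchbox import BaseBenchmark  # import path assumed


class TinyBenchmark(BaseBenchmark):
    """Toy benchmark with a single hard-coded query and no data files."""

    _QUERIES = {"Q1": "SELECT 1 AS answer"}

    def generate_data(self) -> list[str | Path]:
        return []  # nothing to generate for this toy example

    def get_queries(self) -> dict[str, str]:
        return dict(self._QUERIES)

    def get_query(self, query_id: int | str, *, params: dict[str, Any] | None = None) -> str:
        key = str(query_id)
        if key not in self._QUERIES:
            raise ValueError(f"Unknown query_id: {query_id}")
        return self._QUERIES[key]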

setup_database(connection)[source]

Set up database with schema and data.

Creates necessary database schema and loads benchmark data into the database.

Parameters:

connection (DatabaseConnection) – Database connection to set up

Raises:
  • ValueError – If data generation fails

  • Exception – If database setup fails

run_query(query_id, connection, params=None, fetch_results=False)[source]

Execute single query and return timing and results.

Parameters:
  • query_id (int | str) – ID of the query to execute

  • connection (DatabaseConnection) – Database connection to execute query on

  • params (dict[str, Any] | None) – Optional parameters for query customization

  • fetch_results (bool) – Whether to fetch and return query results

Returns:

Dictionary containing:

  • query_id: Executed query ID

  • execution_time: Time taken to execute query in seconds

  • query_text: Executed query text

  • results: Query results if fetch_results=True, otherwise None

  • row_count: Number of rows returned (if results fetched)

Return type:

dict[str, Any]

Raises:
  • ValueError – If query_id is invalid

  • Exception – If query execution fails

run_benchmark(connection, query_ids=None, fetch_results=False, setup_database=True)[source]

Run the complete benchmark suite.

Parameters:
  • connection (DatabaseConnection) – Database connection to execute queries on

  • query_ids (list[int | str] | None) – Optional list of specific query IDs to run (defaults to all)

  • fetch_results (bool) – Whether to fetch and return query results

  • setup_database (bool) – Whether to set up the database first

Returns:

Dictionary containing:

  • benchmark_name: Name of the benchmark

  • total_execution_time: Total time for all queries

  • total_queries: Number of queries executed

  • successful_queries: Number of queries that succeeded

  • failed_queries: Number of queries that failed

  • query_results: List of individual query results

  • setup_time: Time taken for database setup (if performed)

Return type:

dict[str, Any]

Raises:

Exception – If benchmark execution fails
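A sketch of driving these methods directly. The connection object is deliberately left abstract: conn is assumed to be a DatabaseConnection for the target database, obtained in a platform-specific way.

def run_subset(benchmark, conn):
    """benchmark is any BaseBenchmark subclass; conn is assumed to be a DatabaseConnection."""
    benchmark.setup_database(conn)  # create schema and load generated data

    # Time a single query and read the documented result fields
    single = benchmark.run_query(1, conn, fetch_results=True)
    print(single["query_id"], single["execution_time"], single["row_count"])

    # Run a subset of the suite without repeating database setup
    suite = benchmark.run_benchmark(conn, query_ids=[1, 6], setup_database=False)
    print(f"{suite['successful_queries']}/{suite['total_queries']} queries succeeded")
    return suite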

run_with_platform(platform_adapter, **run_config)[source]

Run complete benchmark using platform-specific optimizations.

This method provides a unified interface for running benchmarks using database platform adapters that handle connection management, data loading optimizations, and query execution.

This is the standard method that all benchmarks should support for integration with the CLI and other orchestration tools.

Parameters:
  • platform_adapter – Platform adapter instance (e.g., DuckDBAdapter)

  • **run_config – Configuration options:

    - categories: List of query categories to run (if benchmark supports)

    - query_subset: List of specific query IDs to run

    - connection: Connection configuration

    - benchmark_type: Type hint for optimizations (‘olap’, ‘oltp’, etc.)

Returns:

BenchmarkResults object with execution results

Example

from benchbox.platforms import DuckDBAdapter

benchmark = SomeBenchmark(scale_factor=0.1)
adapter = DuckDBAdapter()
results = benchmark.run_with_platform(adapter)

format_results(benchmark_result)[source]

Format benchmark results for display.

Parameters:

benchmark_result (dict[str, Any]) – Result dictionary from run_benchmark()

Returns:

Formatted string representation of the results

Return type:

str

translate_query(query_id, dialect)[source]

Translate a query to a specific SQL dialect.

Parameters:
  • query_id (int | str) – The ID of the query to translate

  • dialect (str) – The target SQL dialect

Returns:

The translated query string

Raises:
  • ValueError – If the query_id is invalid

  • ImportError – If sqlglot is not installed

  • ValueError – If the dialect is not supported

Return type:

str
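A short sketch of dialect translation with the sqlglot dependency handled explicitly. TPC-H (documented later in this reference) is used as the concrete benchmark; the import path is an assumption.

from benchbox import TPCH  # import path assumed

tpch = TPCH(scale_factor=0.1)
try:
    duckdb_sql = tpch.translate_query(6, "duckdb")   # translate Q6 into DuckDB syntax
except ImportError:
    duckdb_sql = tpch.get_query(6)                   # sqlglot missing: fall back to the original text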

property benchmark_name: str

Get the human-readable benchmark name.

create_enhanced_benchmark_result(platform, query_results, execution_metadata=None, phases=None, resource_utilization=None, performance_characteristics=None, **kwargs)[source]

Create a BenchmarkResults object with standardized fields.

This centralizes the logic for creating benchmark results that was previously duplicated across platform adapters and CLI orchestrator.

Parameters:
  • platform (str) – Platform name (e.g., “DuckDB”, “ClickHouse”)

  • query_results (list[dict[str, Any]]) – List of query execution results

  • execution_metadata (dict[str, Any] | None) – Optional execution metadata

  • phases (dict[str, dict[str, Any]] | None) – Optional phase tracking information

  • resource_utilization (dict[str, Any] | None) – Optional resource usage metrics

  • performance_characteristics (dict[str, Any] | None) – Optional performance analysis

  • **kwargs (Any) – Additional fields to override defaults

Returns:

Fully configured BenchmarkResults object

Return type:

BenchmarkResults

ClickBench (ClickHouse Analytics Benchmark) implementation.

Copyright 2026 Joe Harris / BenchBox Project

Licensed under the MIT License. See LICENSE file in the project root for details.

class ClickBench(scale_factor=1.0, output_dir=None, **kwargs)[source]

Bases: BaseBenchmark

ClickBench (ClickHouse Analytics Benchmark) implementation.

Provides ClickBench benchmark implementation, including data generation and access to the 43 benchmark queries designed for testing analytical database performance with web analytics data.

Official specification: https://github.com/ClickHouse/ClickBench

Results dashboard: https://benchmark.clickhouse.com/

__init__(scale_factor=1.0, output_dir=None, **kwargs)[source]

Initialize ClickBench benchmark instance.

Parameters:
  • scale_factor (float) – Scale factor for the benchmark (1.0 = ~1M records for testing)

  • output_dir (str | Path | None) – Directory to output generated data files

  • **kwargs – Additional implementation-specific options

generate_data()[source]

Generate ClickBench benchmark data.

Returns:

A list of paths to the generated data files

Return type:

list[str | Path]

get_queries(dialect=None)[source]

Get all ClickBench benchmark queries.

Parameters:

dialect (str | None) – Target SQL dialect for translation (e.g., ‘duckdb’, ‘bigquery’, ‘snowflake’). If None, returns queries in their original format.

Returns:

Dictionary mapping query IDs (Q1-Q43) to query strings

Return type:

dict[str, str]

get_query(query_id, *, params=None)[source]

Get specific ClickBench benchmark query.

Parameters:
  • query_id (int | str) – ID of the query to retrieve (Q1-Q43)

  • params (dict[str, Any] | None) – Optional parameters to customize the query

Returns:

Query string

Raises:

ValueError – If query_id is invalid

Return type:

str

get_schema()[source]

Get ClickBench schema.

Returns:

List of dictionaries describing the tables in the schema

Return type:

list[dict]

get_create_tables_sql(dialect='standard', tuning_config=None)[source]

Get SQL to create all ClickBench tables.

Parameters:
  • dialect (str) – SQL dialect to use

  • tuning_config – Unified tuning configuration for constraint settings

Returns:

SQL script for creating all tables

Return type:

str

translate_query(query_id, dialect)[source]

Translate a ClickBench query to a different SQL dialect.

Parameters:
  • query_id (str) – The ID of the query to translate (Q1-Q43)

  • dialect (str) – The target SQL dialect (postgres, mysql, bigquery, etc.)

Returns:

The translated query string

Raises:
  • ValueError – If the query_id is invalid

  • ImportError – If sqlglot is not installed

  • ValueError – If the dialect is not supported

Return type:

str

get_query_categories()[source]

Get ClickBench queries organized by category.

Returns:

Dictionary mapping category names to lists of query IDs

Return type:

dict[str, list[str]]
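A sketch of category-based selection using the methods above. The import path is an assumption; query IDs follow the Q1-Q43 convention documented here.

from benchbox import ClickBench  # import path assumed

cb = ClickBench(scale_factor=0.1)

categories = cb.get_query_categories()        # {category name: [query IDs]}
queries = cb.get_queries(dialect="duckdb")    # all 43 queries, translated
q1 = cb.get_query("Q1")                       # a single query by ID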

Public entrypoint for the reference-aligned CoffeeShop benchmark.

class CoffeeShop(scale_factor=1.0, output_dir=None, **kwargs)[source]

Bases: BaseBenchmark

High-level wrapper for the CoffeeShop benchmark.

The rewritten benchmark mirrors the public reference generator and now exposes a compact star schema consisting of:

  • dim_locations: geographic metadata and regional weights

  • dim_products: canonical product catalog with seasonal availability

  • order_lines: exploded fact table (1-5 lines per order) with temporal, regional, and pricing dynamics

The query suite (SA*, PR*, TR*, TM*, QC*) focuses on sales analysis, product behaviour, trend analysis, and quality checks for the new schema.

__init__(scale_factor=1.0, output_dir=None, **kwargs)[source]

Initialize a CoffeeShop benchmark instance.

generate_data()[source]

Generate Coffee Shop benchmark data.

Returns:

A list of paths to the generated data files

Return type:

list[str | Path]

get_queries(dialect=None)[source]

Get all Coffee Shop benchmark queries.

Parameters:

dialect (str | None) – Target SQL dialect for query translation. If None, returns original queries.

Returns:

A dictionary mapping query IDs to query strings

Return type:

dict[str, str]

get_query(query_id, *, params=None)[source]

Return a single CoffeeShop analytics query.

Query identifiers follow the updated naming convention (e.g. SA1 for sales analysis, PR1 for product mix, TR1 for trend review).

get_schema()[source]

Get the Coffee Shop benchmark schema.

Returns:

A list of dictionaries describing the tables in the schema

Return type:

list[dict]

get_create_tables_sql(dialect='standard', tuning_config=None)[source]

Get SQL to create all Coffee Shop benchmark tables.

Parameters:
  • dialect (str) – SQL dialect to use

  • tuning_config – Unified tuning configuration for constraint settings

Returns:

SQL script for creating all tables

Return type:

str
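A usage sketch for the star schema and query naming described above; the import path and output directory are assumptions.

from benchbox import CoffeeShop  # import path assumed

shop = CoffeeShop(scale_factor=0.5, output_dir="./coffeeshop_data")
data_files = shop.generate_data()

sales_query = shop.get_query("SA1")                  # SA*/PR*/TR*/TM*/QC* naming convention
schema = shop.get_schema()                           # dim_locations, dim_products, order_lines
ddl = shop.get_create_tables_sql(dialect="duckdb")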

H2O Database Benchmark implementation.

Copyright 2026 Joe Harris / BenchBox Project

Licensed under the MIT License. See LICENSE file in the project root for details.

class H2ODB(scale_factor=1.0, output_dir=None, **kwargs)[source]

Bases: BaseBenchmark

H2O Database Benchmark implementation.

This class provides an implementation of the H2O Database Benchmark, including data generation and access to analytical queries for taxi trip data.

Reference: H2O.ai benchmarking suite for analytical workloads

__init__(scale_factor=1.0, output_dir=None, **kwargs)[source]

Initialize an H2O Database Benchmark instance.

Parameters:
  • scale_factor (float) – Scale factor for the benchmark (1.0 = ~1GB)

  • output_dir (str | Path | None) – Directory to output generated data files

  • **kwargs – Additional implementation-specific options

generate_data()[source]

Generate H2O Database Benchmark data.

Returns:

A list of paths to the generated data files

Return type:

list[str | Path]

get_queries(dialect=None)[source]

Get all H2O Database Benchmark queries.

Parameters:

dialect (str | None) – Target SQL dialect for query translation. If None, returns original queries.

Returns:

A dictionary mapping query IDs to query strings

Return type:

dict[str, str]

get_query(query_id, *, params=None)[source]

Get a specific H2O Database Benchmark query.

Parameters:
  • query_id (int | str) – The ID of the query to retrieve (Q1-Q10)

  • params (dict[str, Any] | None) – Optional parameters to customize the query

Returns:

The query string

Raises:

ValueError – If the query_id is invalid

Return type:

str

get_schema()[source]

Get the H2O Database Benchmark schema.

Returns:

A list of dictionaries describing the tables in the schema

Return type:

list[dict]

get_create_tables_sql(dialect='standard', tuning_config=None)[source]

Get SQL to create all H2O Database Benchmark tables.

Parameters:
  • dialect (str) – SQL dialect to use

  • tuning_config – Unified tuning configuration for constraint settings

Returns:

SQL script for creating all tables

Return type:

str

Join Order Benchmark top-level interface.

Copyright 2026 Joe Harris / BenchBox Project

Licensed under the MIT License. See LICENSE file in the project root for details.

class JoinOrder(scale_factor=1.0, output_dir=None, **kwargs)[source]

Bases: BaseBenchmark

Join Order Benchmark implementation.

This class provides an implementation of the Join Order Benchmark, including data generation and access to complex join queries for cardinality estimation and join order optimization testing.

Reference: Viktor Leis et al. “How Good Are Query Optimizers, Really?”

__init__(scale_factor=1.0, output_dir=None, **kwargs)[source]

Initialize a Join Order Benchmark instance.

Parameters:
  • scale_factor (float) – Scale factor for the benchmark (1.0 = ~1GB)

  • output_dir (str | Path | None) – Directory to output generated data files

  • **kwargs (Any) – Additional implementation-specific options

generate_data()[source]

Generate Join Order Benchmark data.

Returns:

A list of paths to the generated data files

Return type:

list[Path]

get_queries()[source]

Get all Join Order Benchmark queries.

Returns:

A dictionary mapping query IDs to query strings

Return type:

dict[str, str]

get_query(query_id, *, params=None)[source]

Get a specific Join Order Benchmark query.

Parameters:
  • query_id (int | str) – The ID of the query to retrieve

  • params (dict[str, Any] | None) – Optional parameters to customize the query

Returns:

The query string

Raises:

ValueError – If the query_id is invalid

Return type:

str

get_schema(dialect='sqlite')[source]

Get the Join Order Benchmark schema DDL.

Parameters:

dialect (str) – Target SQL dialect

Returns:

DDL statements for creating all tables

Return type:

str

get_create_tables_sql(dialect='standard', tuning_config=None)[source]

Get SQL to create all Join Order Benchmark tables.

Parameters:
  • dialect (str) – SQL dialect to use

  • tuning_config (Any) – Unified tuning configuration for constraint settings

Returns:

SQL script for creating all tables

Return type:

str

Read Primitives benchmark implementation.

Copyright 2026 Joe Harris / BenchBox Project

This benchmark combines queries from multiple sources:

  1. Apache Impala targeted-perf workload (https://github.com/apache/impala/tree/master/testdata/workloads/targeted-perf). Apache License 2.0, Copyright Apache Software Foundation.

  2. Optimizer sniff test concepts by Justin Jaffray (https://buttondown.com/jaffray/archive/a-sniff-test-for-some-query-optimizers/)

Data generation uses the TPC-H schema (TPC Benchmark H, Copyright Transaction Processing Performance Council).

Licensed under the MIT License. See LICENSE file in the project root for details.

class ReadPrimitives(scale_factor=1.0, output_dir=None, **kwargs)[source]

Bases: BaseBenchmark

Read Primitives benchmark implementation.

Provides Read Primitives benchmark implementation, including data generation and access to 80+ primitive read operation queries that test fundamental database capabilities using the TPC-H schema.

The benchmark covers:

  • Aggregation, joins, filters, window functions

  • OLAP operations, statistical functions

  • JSON operations, full-text search

  • Time series analysis, array operations

  • Graph operations, temporal queries

__init__(scale_factor=1.0, output_dir=None, **kwargs)[source]

Initialize Read Primitives benchmark instance.

Parameters:
  • scale_factor (float) – Scale factor for the benchmark (1.0 = ~6M lineitem rows)

  • output_dir (str | Path | None) – Directory to output generated data files

  • **kwargs (Any) – Additional implementation-specific options

generate_data(tables=None)[source]

Generate Read Primitives benchmark data.

Parameters:

tables (list[str] | None) – Optional list of table names to generate. If None, generates all.

Returns:

A dictionary mapping table names to file paths

Return type:

dict[str, str]

get_queries(dialect=None)[source]

Get all Read Primitives benchmark queries.

Parameters:

dialect (str | None) – Target SQL dialect for query translation. If None, returns original queries.

Returns:

A dictionary mapping query IDs to query strings

Return type:

dict[str, str]

get_query(query_id, *, params=None)[source]

Get a specific Read Primitives benchmark query.

Parameters:
  • query_id (int | str) – The ID of the query to retrieve (e.g., ‘aggregation_simple’)

  • params (dict[str, Any] | None) – Optional parameters to customize the query

Returns:

The query string

Raises:

ValueError – If the query_id is invalid

Return type:

str

get_queries_by_category(category)[source]

Get queries filtered by category.

Parameters:

category (str) – Category name (e.g., ‘aggregation’, ‘window’, ‘join’)

Returns:

Dictionary mapping query IDs to SQL text for the category

Return type:

dict[str, str]

get_query_categories()[source]

Get list of available query categories.

Returns:

List of category names

Return type:

list[str]
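A sketch of category-driven query selection. The import path is assumed; the category name and query ID are the ones used as examples in the parameter descriptions above.

from benchbox import ReadPrimitives  # import path assumed

prims = ReadPrimitives(scale_factor=0.1)

for category in prims.get_query_categories():        # e.g. 'aggregation', 'window', 'join'
    print(category)

agg_queries = prims.get_queries_by_category("aggregation")
simple_sql = prims.get_query("aggregation_simple")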

get_schema()[source]

Get the Read Primitives benchmark schema (TPC-H).

Returns:

A dictionary mapping table names to their schema definitions

Return type:

dict[str, dict]

get_create_tables_sql(dialect='standard', tuning_config=None)[source]

Get SQL to create all Read Primitives benchmark tables.

Parameters:
  • dialect (str) – SQL dialect to use

  • tuning_config – Unified tuning configuration for constraint settings

Returns:

SQL script for creating all tables

Return type:

str

load_data_to_database(connection, tables=None)[source]

Load generated data into a database.

Parameters:
  • connection (Any) – Database connection

  • tables (list[str] | None) – Optional list of tables to load. If None, loads all.

Raises:

ValueError – If data hasn’t been generated yet

execute_query(query_id, connection, params=None)[source]

Execute a Read Primitives query on the given database connection.

Parameters:
  • query_id (str) – Query identifier (e.g., ‘aggregation_simple’)

  • connection (Any) – Database connection to use for execution

  • params (dict[str, Any] | None) – Optional parameters to use in the query

Returns:

Query results from the database

Raises:

ValueError – If the query_id is not valid

Return type:

Any

run_benchmark(connection, queries=None, iterations=1, categories=None)[source]

Run the complete Read Primitives benchmark.

Parameters:
  • connection (Any) – Database connection to use

  • queries (list[str] | None) – Optional list of query IDs to run. If None, runs all.

  • iterations (int) – Number of times to run each query

  • categories (list[str] | None) – Optional list of categories to run. If specified, overrides queries.

Returns:

Dictionary containing benchmark results

Return type:

dict[str, Any]

run_category_benchmark(connection, category, iterations=1)[source]

Run benchmark for a specific query category.

Parameters:
  • connection (Any) – Database connection to use

  • category (str) – Category name to run (e.g., ‘aggregation’, ‘window’, ‘join’)

  • iterations (int) – Number of times to run each query

Returns:

Dictionary containing benchmark results for the category

Return type:

dict[str, Any]

get_benchmark_info()[source]

Get information about the benchmark.

Returns:

Dictionary containing benchmark metadata

Return type:

dict[str, Any]

Star Schema Benchmark implementation.

Copyright 2026 Joe Harris / BenchBox Project

This implementation is derived from TPC Benchmark™ H (TPC-H) - Copyright © Transaction Processing Performance Council

Licensed under the MIT License. See LICENSE file in the project root for details.

class SSB(scale_factor=1.0, output_dir=None, **kwargs)[source]

Bases: BaseBenchmark

Star Schema Benchmark implementation.

This class provides an implementation of the Star Schema Benchmark, including data generation and access to the 13 benchmark queries organized in 4 flights.

Reference: O’Neil et al. “The Star Schema Benchmark and Augmented Fact Table Indexing”

__init__(scale_factor=1.0, output_dir=None, **kwargs)[source]

Initialize a Star Schema Benchmark instance.

Parameters:
  • scale_factor (float) – Scale factor for the benchmark (1.0 = ~1GB)

  • output_dir (str | Path | None) – Directory to output generated data files

  • **kwargs (Any) – Additional implementation-specific options

generate_data()[source]

Generate Star Schema Benchmark data.

Returns:

A list of paths to the generated data files

Return type:

list[str | Path]

get_queries(dialect=None)[source]

Get all Star Schema Benchmark queries.

Parameters:

dialect (str | None) – Target SQL dialect for query translation. If None, returns original queries.

Returns:

A dictionary mapping query IDs to query strings

Return type:

dict[str, str]

get_query(query_id, *, params=None)[source]

Get a specific Star Schema Benchmark query.

Parameters:
  • query_id (str) – The ID of the query to retrieve (Q1.1-Q4.3)

  • params (dict[str, Any] | None) – Optional parameters to customize the query

Returns:

The query string

Raises:

ValueError – If the query_id is invalid

Return type:

str

get_schema()[source]

Get the Star Schema Benchmark schema.

Returns:

A list of dictionaries describing the tables in the schema

Return type:

list[dict]

get_create_tables_sql(dialect='standard', tuning_config=None)[source]

Get SQL to create all Star Schema Benchmark tables.

Parameters:
  • dialect (str) – SQL dialect to use

  • tuning_config – Unified tuning configuration for constraint settings

Returns:

SQL script for creating all tables

Return type:

str

TPC-DI (Data Integration) benchmark implementation.

Copyright 2026 Joe Harris / BenchBox Project

TPC Benchmark™ DI (TPC-DI) - Copyright © Transaction Processing Performance Council. This implementation is based on the TPC-DI specification.

Licensed under the MIT License. See LICENSE file in the project root for details.

class TPCDI(scale_factor=1.0, output_dir=None, **kwargs)[source]

Bases: BaseBenchmark

TPC-DI benchmark implementation.

This class provides an implementation of the TPC-DI benchmark, including data generation and access to validation and analytical queries.

Official specification: http://www.tpc.org/tpcdi

__init__(scale_factor=1.0, output_dir=None, **kwargs)[source]

Initialize a TPC-DI benchmark instance.

Parameters:
  • scale_factor (float) – Scale factor for the benchmark (1.0 = ~1GB)

  • output_dir (str | Path | None) – Directory to output generated data files

  • **kwargs – Additional implementation-specific options

generate_data()[source]

Generate TPC-DI benchmark data.

Returns:

A list of paths to the generated data files

Return type:

list[str | Path]

get_queries(dialect=None)[source]

Get all TPC-DI benchmark queries.

Parameters:

dialect (str | None) – Target SQL dialect for query translation. If None, returns original queries.

Returns:

A dictionary mapping query IDs to query strings

Return type:

dict[str, str]

get_query(query_id, *, params=None)[source]

Get a specific TPC-DI benchmark query.

Parameters:
  • query_id (int | str) – The ID of the query to retrieve

  • params (dict[str, Any] | None) – Optional parameters to customize the query

Returns:

The query string

Raises:

ValueError – If the query_id is invalid

Return type:

str

get_schema(dialect='standard')[source]

Get the TPC-DI schema.

Parameters:

dialect (str) – Target SQL dialect

Returns:

A dictionary mapping table names to table definitions

Return type:

dict[str, dict[str, Any]]

get_create_tables_sql(dialect='standard', tuning_config=None)[source]

Get SQL to create all TPC-DI tables.

Parameters:
  • dialect (str) – SQL dialect to use

  • tuning_config – Unified tuning configuration for constraint settings

Returns:

SQL script for creating all tables

Return type:

str

generate_source_data(formats=None, batch_types=None)[source]

Generate source data in various formats for ETL processing.

Parameters:
  • formats (list[str] | None) – List of data formats to generate (csv, xml, fixed_width, json)

  • batch_types (list[str] | None) – List of batch types to generate (historical, incremental, scd)

Returns:

Dictionary mapping formats to lists of generated file paths

Return type:

dict[str, list[str]]

run_etl_pipeline(connection, batch_type='historical', validate_data=True)[source]

Run the complete ETL pipeline for TPC-DI.

Parameters:
  • connection (Any) – Database connection for target warehouse

  • batch_type (str) – Type of batch to process (historical, incremental, scd)

  • validate_data (bool) – Whether to run data validation after ETL

Returns:

Dictionary containing ETL execution results and metrics

Return type:

dict[str, Any]

validate_etl_results(connection)[source]

Validate ETL results using data quality checks.

Parameters:

connection (Any) – Database connection to validate against

Returns:

Dictionary containing validation results and data quality metrics

Return type:

dict[str, Any]
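A sketch of the ETL-oriented workflow built from the three methods above. The connection object is left abstract, and the chosen formats and batch type are illustrative.

from benchbox import TPCDI  # import path assumed


def run_historical_load(connection):
    """connection is assumed to be a live connection to the target warehouse."""
    di = TPCDI(scale_factor=0.1)

    # Produce CSV and XML source batches for the historical load
    sources = di.generate_source_data(formats=["csv", "xml"], batch_types=["historical"])

    # Run the ETL pipeline, then validate the loaded data
    etl_result = di.run_etl_pipeline(connection, batch_type="historical", validate_data=True)
    validation = di.validate_etl_results(connection)
    return sources, etl_result, validation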

get_etl_status()[source]

Get current ETL processing status and metrics.

Returns:

Dictionary containing ETL status, metrics, and batch information

Return type:

dict[str, Any]

property etl_mode: bool

Check if ETL mode is enabled.

Returns:

Always True as TPC-DI is now a pure ETL benchmark

load_data_to_database(connection, tables=None)[source]

Load generated data into a database.

Parameters:
  • connection (Any) – Database connection

  • tables (list[str] | None) – Optional list of tables to load. If None, loads all.

Raises:

ValueError – If data hasn’t been generated yet

run_benchmark(connection, queries=None, iterations=1)[source]

Run the complete TPC-DI benchmark.

Parameters:
  • connection (Any) – Database connection to use

  • queries (list[str] | None) – Optional list of query IDs to run. If None, runs all.

  • iterations (int) – Number of times to run each query

Returns:

Dictionary containing benchmark results

Return type:

dict[str, Any]

execute_query(query_id, connection, params=None)[source]

Execute a TPC-DI query on the given database connection.

Parameters:
  • query_id (int | str) – Query identifier (e.g., “V1”, “V2”, “A1”, etc.)

  • connection (Any) – Database connection to use for execution

  • params (dict[str, Any] | None) – Optional parameters to use in the query

Returns:

Query results from the database

Raises:

ValueError – If the query_id is not valid

Return type:

Any

create_schema(connection, dialect='duckdb')[source]

Create TPC-DI schema using the schema manager.

Parameters:
  • connection (Any) – Database connection

  • dialect (str) – Target SQL dialect

run_full_benchmark(connection, dialect='duckdb')[source]

Run the complete TPC-DI benchmark with all phases.

This is the main entry point for running a complete TPC-DI benchmark including schema creation, data loading, ETL processing, validation, and metrics calculation.

Parameters:
  • connection (Any) – Database connection

  • dialect (str) – SQL dialect for the target database

Returns:

Complete benchmark results with all metrics

Return type:

dict[str, Any]

run_etl_benchmark(connection, dialect='duckdb')[source]

Run the ETL benchmark pipeline.

Parameters:
  • connection (Any) – Database connection

  • dialect (str) – SQL dialect

Returns:

ETL execution results

Return type:

Any

run_data_validation(connection)[source]

Run data quality validation.

Parameters:

connection (Any) – Database connection

Returns:

Data quality validation results

Return type:

Any

calculate_official_metrics(etl_result, validation_result)[source]

Calculate official TPC-DI metrics.

Parameters:
  • etl_result (Any) – ETL execution results

  • validation_result (Any) – Data validation results

Returns:

Official TPC-DI benchmark metrics

Return type:

Any

optimize_database(connection)[source]

Optimize database performance for TPC-DI queries.

Parameters:

connection (Any) – Database connection

Returns:

Optimization results

Return type:

dict[str, Any]

property validator: Any

Get the TPC-DI validator instance.

Returns:

TPCDIValidator instance

property schema_manager: Any

Get the TPC-DI schema manager instance.

Returns:

TPCDISchemaManager instance

property metrics_calculator: Any

Get the TPC-DI metrics calculator instance.

Returns:

TPCDIMetrics instance

TPC-DS benchmark implementation.

Copyright 2026 Joe Harris / BenchBox Project

TPC Benchmark™ DS (TPC-DS) - Copyright © Transaction Processing Performance Council. This implementation is based on the TPC-DS specification.

Licensed under the MIT License. See LICENSE file in the project root for details.

class TPCDS(scale_factor=1.0, output_dir=None, **kwargs)[source]

Bases: BaseBenchmark

TPC-DS benchmark implementation.

Provides TPC-DS benchmark implementation, including data generation and access to the benchmark queries.

Official specification: http://www.tpc.org/tpcds

__init__(scale_factor=1.0, output_dir=None, **kwargs)[source]

Initialize TPC-DS benchmark instance.

Parameters:
  • scale_factor (float) – Scale factor for the benchmark (1.0 = ~1GB)

  • output_dir (str | Path | None) – Directory to output generated data files

  • **kwargs (Any) – Additional implementation-specific options

Raises:
  • ValueError – If scale_factor is not positive

  • TypeError – If scale_factor is not a number

generate_data()[source]

Generate TPC-DS benchmark data.

Returns:

A list of paths to the generated data files

Return type:

list[str | Path]

get_queries(dialect=None, base_dialect=None)[source]

Get all TPC-DS benchmark queries.

Parameters:
  • dialect (str | None) – Target SQL dialect for translation (e.g., ‘duckdb’, ‘postgres’)

  • base_dialect (str | None) – Source SQL dialect of the stored queries

Returns:

A dictionary mapping query IDs to query strings

Return type:

dict[str, str]

get_query(query_id, *, params=None, seed=None, scale_factor=None, dialect=None, **kwargs)[source]

Get a specific TPC-DS benchmark query.

Parameters:
  • query_id (int) – The ID of the query to retrieve (1-99)

  • params (dict[str, Any] | None) – Optional parameters to customize the query (legacy parameter, mostly ignored)

  • seed (int | None) – Random number generator seed for parameter generation

  • scale_factor (float | None) – Scale factor for parameter calculations

  • dialect (str | None) – Target SQL dialect

  • **kwargs – Additional parameters

Returns:

The query string

Raises:
  • ValueError – If the query_id is invalid

  • TypeError – If query_id is not an integer

Return type:

str
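A sketch of parameterized query retrieval. The import path is assumed, as is the expectation that reusing the same seed reproduces the same substituted parameters.

from benchbox import TPCDS  # import path assumed

ds = TPCDS(scale_factor=1.0)

# Q7 with generated parameters, translated to DuckDB syntax; the fixed seed
# is assumed to make the parameter substitution repeatable
q7 = ds.get_query(7, seed=42, scale_factor=1.0, dialect="duckdb")

available = ds.get_available_queries()  # query IDs 1-99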

property queries: TPCDSQueryManager

Access to the query manager.

Returns:

The underlying query manager instance

property generator: TPCDSDataGenerator

Access to the data generator.

Returns:

The underlying data generator instance

get_available_tables()[source]

Get list of available tables.

Returns:

List of table names

Return type:

list[str]

get_available_queries()[source]

Get list of available query IDs.

Returns:

List of query IDs (1-99)

Return type:

list[int]

generate_table_data(table_name, output_dir=None)[source]

Generate data for a specific table.

Parameters:
  • table_name (str) – Name of the table to generate data for

  • output_dir (str | None) – Optional output directory for generated data

Returns:

Iterator of data rows for the table

Return type:

str

get_schema()[source]

Get the TPC-DS schema.

Returns:

A list of dictionaries describing the tables in the schema

Return type:

list[dict]

get_create_tables_sql(dialect='standard', tuning_config=None)[source]

Get SQL to create all TPC-DS tables.

Parameters:
  • dialect (str) – SQL dialect to use (currently ignored, TPC-DS uses standard SQL)

  • tuning_config – Unified tuning configuration for constraint settings

Returns:

SQL script for creating all tables

Return type:

str

generate_streams(num_streams=1, rng_seed=None, streams_output_dir=None)[source]

Generate TPC-DS query streams.

Parameters:
  • num_streams (int) – Number of concurrent streams to generate

  • rng_seed (int | None) – Random number generator seed for parameter generation

  • streams_output_dir (str | Path | None) – Directory to output stream files

Returns:

List of paths to generated stream files

Return type:

list[Path]
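A sketch of stream generation for multi-stream (throughput-style) runs; the output directory is illustrative.

from benchbox import TPCDS  # import path assumed

ds = TPCDS(scale_factor=1.0)

# Write four query streams with a fixed RNG seed
stream_files = ds.generate_streams(num_streams=4, rng_seed=7, streams_output_dir="./streams")

for info in ds.get_all_streams_info():
    print(info)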

get_stream_info(stream_id)[source]

Get information about a specific stream.

Parameters:

stream_id (int) – Stream identifier

Returns:

Dictionary containing stream information

Return type:

dict[str, Any]

get_all_streams_info()[source]

Get information about all streams.

Returns:

List of dictionaries containing stream information

Return type:

list[dict[str, Any]]

get_benchmark_info()[source]

Get benchmark information.

Returns:

Dictionary with benchmark information including name, scale factor, available tables, queries, and C tools info

Return type:

dict[str, Any]

TPC-H benchmark implementation.

Copyright 2026 Joe Harris / BenchBox Project

TPC Benchmark™ H (TPC-H) - Copyright © Transaction Processing Performance Council. This implementation is based on the TPC-H specification.

Licensed under the MIT License. See LICENSE file in the project root for details.

class TPCH(scale_factor=1.0, output_dir=None, **kwargs)[source]

Bases: BaseBenchmark

TPC-H benchmark implementation.

Provides TPC-H benchmark implementation, including data generation and access to the 22 benchmark queries.

Official specification: http://www.tpc.org/tpch

__init__(scale_factor=1.0, output_dir=None, **kwargs)[source]

Initialize TPC-H benchmark instance.

Parameters:
  • scale_factor (float) – Scale factor for the benchmark (1.0 = ~1GB)

  • output_dir (str | Path | None) – Directory to output generated data files

  • **kwargs (Any) – Additional implementation-specific options

Raises:
  • ValueError – If scale_factor is not positive

  • TypeError – If scale_factor is not a number

generate_data()[source]

Generate TPC-H benchmark data.

Returns:

A list of paths to the generated data files

Return type:

list[str | Path]

get_queries(dialect=None, base_dialect=None)[source]

Get all TPC-H benchmark queries.

Parameters:
  • dialect (str | None) – Target SQL dialect for translation (e.g., ‘duckdb’, ‘bigquery’, ‘snowflake’). If None, returns queries in their original format.

  • base_dialect (str | None) – Source SQL dialect of the stored queries

Returns:

A dictionary mapping query IDs (1-22) to query strings

Return type:

dict[str, str]

get_query(query_id, *, params=None, seed=None, scale_factor=None, dialect=None, base_dialect=None, **kwargs)[source]

Get a specific TPC-H benchmark query.

Parameters:
  • query_id (int) – The ID of the query to retrieve (1-22)

  • params (dict[str, Any] | None) – Optional parameters to customize the query (legacy parameter, mostly ignored)

  • seed (int | None) – Random number generator seed for parameter generation

  • scale_factor (float | None) – Scale factor for parameter calculations

  • dialect (str | None) – Target SQL dialect

  • base_dialect (str | None) – Source SQL dialect (default: netezza)

  • **kwargs – Additional parameters

Returns:

The query string

Raises:
  • ValueError – If the query_id is invalid

  • TypeError – If query_id is not an integer

Return type:

str
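A sketch of retrieving TPC-H queries with generated parameters and dialect translation; the import path is an assumption.

from benchbox import TPCH  # import path assumed

tpch = TPCH(scale_factor=1.0)

q1 = tpch.get_query(1, seed=2024, dialect="duckdb")  # Q1 with substituted parameters
all_queries = tpch.get_queries(dialect="duckdb")     # the full set of 22 queries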

get_schema()[source]

Get the TPC-H schema.

Returns:

A list of dictionaries describing the tables in the schema

Return type:

list[dict]

get_create_tables_sql(dialect='standard', tuning_config=None)[source]

Get SQL to create all TPC-H tables.

Parameters:
  • dialect (str) – SQL dialect to use (currently ignored, TPC-H uses standard SQL)

  • tuning_config – Unified tuning configuration for constraint settings

Returns:

SQL script for creating all tables

Return type:

str

generate_streams(num_streams=1, rng_seed=None, streams_output_dir=None)[source]

Generate TPC-H query streams.

Parameters:
  • num_streams (int) – Number of concurrent streams to generate

  • rng_seed (int | None) – Random number generator seed for parameter generation

  • streams_output_dir (str | Path | None) – Directory to output stream files

Returns:

List of paths to generated stream files

Return type:

list[Path]

get_stream_info(stream_id)[source]

Get information about a specific stream.

Parameters:

stream_id (int) – Stream identifier

Returns:

Dictionary containing stream information

Return type:

dict[str, Any]

get_all_streams_info()[source]

Get information about all streams.

Returns:

List of dictionaries containing stream information

Return type:

list[dict[str, Any]]

property tables: dict[str, Path]

Get the mapping of table names to data file paths.

Returns:

Dictionary mapping table names to paths of generated data files

run_official_benchmark(connection_factory, config=None)[source]

Run the official TPC-H benchmark.

This method provides compatibility for official benchmark examples.

Parameters:
  • connection_factory – Factory function or connection object

  • config – Optional configuration parameters

Returns:

Dictionary with benchmark results

run_power_test(connection_factory, config=None)[source]

Run the TPC-H power test.

This method provides compatibility for power test examples.

Parameters:
  • connection_factory – Factory function or connection object

  • config – Optional configuration parameters

Returns:

Dictionary with power test results

run_maintenance_test(connection_factory, config=None)[source]

Run the TPC-H maintenance test.

This method provides compatibility for maintenance test examples.

Parameters:
  • connection_factory – Factory function or connection object

  • config – Optional configuration parameters

Returns:

Dictionary with maintenance test results

TPC-Havoc benchmark implementation.

Copyright 2026 Joe Harris / BenchBox Project

This implementation is derived from TPC Benchmark™ H (TPC-H) - Copyright © Transaction Processing Performance Council

Licensed under the MIT License. See LICENSE file in the project root for details.

class TPCHavoc(scale_factor=1.0, output_dir=None, **kwargs)[source]

Bases: BaseBenchmark

TPC-Havoc benchmark implementation.

Generates TPC-H query variants to stress query optimizers while maintaining result equivalence.

TPC-Havoc provides 10 structural variants for each TPC-H query (1-22). Each variant is semantically equivalent but uses different SQL constructs to stress different optimizer components.

Example

>>> from benchbox import TPCHavoc
>>> from benchbox.platforms.duckdb import DuckDBAdapter
>>>
>>> # Initialize benchmark and platform
>>> benchmark = TPCHavoc(scale_factor=1.0)
>>> adapter = DuckDBAdapter(database=":memory:")
>>>
>>> # Load data
>>> adapter.load_benchmark(benchmark)
>>>
>>> # Get and execute query variant
>>> variant_query = benchmark.get_query_variant(query_id=1, variant_id=1)
>>> results = adapter.execute_query(variant_query)
>>>
>>> # Get variant description
>>> desc = benchmark.get_variant_description(query_id=1, variant_id=1)
>>> print(desc)  # "Join order permutation: customers first"
>>>
>>> # Export all variants
>>> benchmark.export_variant_queries(output_dir="./queries")

Note

Query execution must be performed through platform adapters (DuckDBAdapter, SnowflakeAdapter, etc.). Direct execution methods are not provided to maintain architectural consistency.

__init__(scale_factor=1.0, output_dir=None, **kwargs)[source]

Initialize TPC-Havoc benchmark instance.

Parameters:
  • scale_factor (float) – Scale factor (1.0 = ~1GB)

  • output_dir (str | Path | None) – Data output directory

  • **kwargs (Any) – Additional options

Raises:
  • ValueError – If scale_factor is not positive

  • TypeError – If scale_factor is not a number

generate_data()[source]

Generate TPC-Havoc benchmark data (same as TPC-H).

Returns:

A list of paths to the generated data files

Return type:

list[str | Path]

get_queries(dialect=None)[source]

Get all TPC-Havoc benchmark queries (base TPC-H queries).

Parameters:

dialect (str | None) – Target SQL dialect for query translation. If None, returns original queries.

Returns:

A dictionary mapping query IDs (1-22) to base query strings

Return type:

dict[str, str]

get_query(query_id, *, params=None, seed=None, scale_factor=None, dialect=None, **kwargs)[source]

Get a specific TPC-Havoc benchmark query.

Parameters:
  • query_id – The ID of the query to retrieve (1-22 for base queries, or “1_v1” format for variants)

  • params (dict[str, Any] | None) – Optional parameters to customize the query

  • seed (int | None) – Random number generator seed for parameter generation

  • scale_factor (float | None) – Scale factor for parameter calculations

  • dialect (str | None) – Target SQL dialect

  • **kwargs – Additional parameters

Returns:

The query string

Raises:
  • ValueError – If the query_id is invalid

  • TypeError – If query_id is not in a valid format

Return type:

str

get_query_variant(query_id, variant_id, params=None)[source]

Get a specific TPC-Havoc query variant.

Parameters:
  • query_id (int) – The ID of the query to retrieve (1-22)

  • variant_id (int) – The ID of the variant to retrieve (1-10)

  • params (dict[str, Any] | None) – Optional parameter values to use

Returns:

The variant query string

Raises:
  • ValueError – If the query_id or variant_id is invalid

  • TypeError – If query_id or variant_id is not an integer

Return type:

str

get_all_variants(query_id)[source]

Get all variants for a specific query.

Parameters:

query_id (int) – The ID of the query to retrieve variants for (1-22)

Returns:

A dictionary mapping variant IDs to query strings

Raises:
  • ValueError – If the query_id is invalid or not implemented

  • TypeError – If query_id is not an integer

Return type:

dict[int, str]

get_variant_description(query_id, variant_id)[source]

Get description of a specific variant.

Parameters:
  • query_id (int) – The ID of the query (1-22)

  • variant_id (int) – The ID of the variant (1-10)

Returns:

Human-readable description of the variant

Raises:
  • ValueError – If the query_id or variant_id is invalid

  • TypeError – If query_id or variant_id is not an integer

Return type:

str

get_implemented_queries()[source]

Get list of query IDs that have variants implemented.

Returns:

List of query IDs with implemented variants

Return type:

list[int]

get_all_variants_info(query_id)[source]

Get information about all variants for a specific query.

Parameters:

query_id (int) – The ID of the query (1-22)

Returns:

Dictionary mapping variant IDs to variant info

Raises:
  • ValueError – If the query_id is invalid or not implemented

  • TypeError – If query_id is not an integer

Return type:

dict[int, dict[str, str]]

get_schema()[source]

Get the TPC-Havoc schema (same as TPC-H).

Returns:

A dictionary mapping table names to table definitions

Return type:

dict[str, dict[str, Any]]

get_create_tables_sql(dialect='standard', tuning_config=None)[source]

Get SQL to create all TPC-Havoc tables (same as TPC-H).

Parameters:
  • dialect (str) – SQL dialect to use

  • tuning_config – Unified tuning configuration for constraint settings

Returns:

SQL script for creating all tables

Return type:

str

get_benchmark_info()[source]

Get information about the TPC-Havoc benchmark.

Returns:

Dictionary containing benchmark metadata

Return type:

dict[str, Any]

export_variant_queries(output_dir=None, format='sql')[source]

Export all variant queries to files.

Parameters:
  • output_dir (str | Path | None) – Directory to export queries to (default: self.output_dir/queries)

  • format (str) – Export format (“sql”, “json”)

Returns:

Dictionary mapping query identifiers to file paths

Raises:

ValueError – If format is unsupported

Return type:

dict[str, Path]
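A sketch of working with variants in bulk, complementing the per-variant example at the top of this class; the output path is illustrative.

from benchbox import TPCHavoc

havoc = TPCHavoc(scale_factor=1.0)

implemented = havoc.get_implemented_queries()                          # query IDs that have variants
q1_variants = havoc.get_all_variants(1)                                # {variant_id: SQL string}
exported = havoc.export_variant_queries(output_dir="./queries", format="json")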

load_data_to_database(connection_string, dialect='standard', schema=None, drop_existing=False)[source]

Load generated data into a database (same as TPC-H).

Parameters:
  • connection_string (str) – Database connection string

  • dialect (str) – SQL dialect (standard, postgres, mysql, etc.)

  • schema (str | None) – Optional database schema to use

  • drop_existing (bool) – Whether to drop existing tables before creating new ones

Raises:
  • ValueError – If data hasn’t been generated yet

  • ImportError – If required database driver is not installed

run_query(query_id, connection_string, params=None, dialect='standard')[source]

Run a TPC-Havoc base query against a database.

Parameters:
  • query_id (int) – The ID of the query to run (1-22)

  • connection_string (str) – Database connection string

  • params (dict[str, Any] | None) – Optional parameter values to use

  • dialect (str) – SQL dialect (standard, postgres, mysql, etc.)

Returns:

Dictionary with query results and timing information

Raises:
  • ValueError – If the query_id is invalid

  • TypeError – If query_id is not an integer

  • ImportError – If required database driver is not installed

Return type:

dict[str, Any]

run_benchmark(connection_string, queries=None, iterations=1, dialect='standard', schema=None)[source]

Run the TPC-Havoc benchmark using base queries.

Parameters:
  • connection_string (str) – Database connection string

  • queries (list[int] | None) – Optional list of query IDs to run (default: all implemented)

  • iterations (int) – Number of times to run each query

  • dialect (str) – SQL dialect (standard, postgres, mysql, etc.)

  • schema (str | None) – Optional database schema to use

Returns:

Dictionary with benchmark results and timing information

Raises:
  • ValueError – If any query_id is invalid or iterations is not positive

  • TypeError – If query_ids are not integers

  • ImportError – If required database driver is not installed

Return type:

dict[str, Any]