API Reference¶
AMPLab Big Data Benchmark implementation.
Copyright 2026 Joe Harris / BenchBox Project
Licensed under the MIT License. See LICENSE file in the project root for details.
- class AMPLab(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Bases: BaseBenchmark
AMPLab Big Data Benchmark implementation.
Provides AMPLab Big Data Benchmark implementation, including data generation and access to scan, join, and analytical queries for web analytics data.
Reference: AMPLab Big Data Benchmark - https://amplab.cs.berkeley.edu/benchmark/
- __init__(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Initialize AMPLab Big Data Benchmark instance.
- Parameters:
scale_factor (float) – Scale factor for the benchmark (1.0 = ~1GB)
output_dir (str | Path | None) – Directory to output generated data files
**kwargs – Additional implementation-specific options
- generate_data()[source]¶
Generate AMPLab Big Data Benchmark data.
- Returns:
A list of paths to the generated data files
- Return type:
list[str | Path]
- get_queries(dialect=None)[source]¶
Get all AMPLab Big Data Benchmark queries.
- Parameters:
dialect (str | None) – Target SQL dialect for query translation. If None, returns original queries.
- Returns:
Dictionary mapping query IDs to query strings
- Return type:
dict[str, str]
- get_query(query_id, *, params=None)[source]¶
Get specific AMPLab Big Data Benchmark query.
- Parameters:
query_id (int | str) – ID of the query to retrieve (1-5)
params (dict[str, Any] | None) – Optional parameters to customize the query
- Returns:
Query string
- Raises:
ValueError – If query_id is invalid
- Return type:
str
- get_schema()[source]¶
Get AMPLab Big Data Benchmark schema.
- Returns:
List of dictionaries describing the tables in the schema
- Return type:
list[dict]
- get_create_tables_sql(dialect='standard', tuning_config=None)[source]¶
Get SQL to create all AMPLab Big Data Benchmark tables.
- Parameters:
dialect (str) – SQL dialect to use
tuning_config – Unified tuning configuration for constraint settings
- Returns:
SQL script for creating all tables
- Return type:
str
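A minimal usage sketch (the top-level import path is assumed, mirroring the TPCHavoc example later in this reference):
>>> from benchbox import AMPLab  # import path assumed
>>> bench = AMPLab(scale_factor=0.1, output_dir="./amplab_data")
>>> files = bench.generate_data()  # list of generated data file paths
>>> ddl = bench.get_create_tables_sql(dialect="standard")
>>> q1 = bench.get_query(1)  # valid query IDs are 1-5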
Base class for all benchmarks.
Copyright 2026 Joe Harris / BenchBox Project
Licensed under the MIT License. See LICENSE file in the project root for details.
- class BaseBenchmark(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Bases: VerbosityMixin, ABC
Base class for all benchmarks.
All benchmarks inherit from this class.
- __init__(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Initialize a benchmark.
- Parameters:
scale_factor (float) – Scale factor (1.0 = standard size)
output_dir (str | Path | None) – Data output directory
**kwargs (Any) – Additional options
- get_data_source_benchmark()[source]¶
Return the canonical source benchmark when data is shared.
Benchmarks that reuse data generated by another benchmark (for example, Primitives reusing TPC-H datasets) should override this method and return the lower-case identifier of the source benchmark. Benchmarks that produce their own data should return None (default).
- abstractmethod generate_data()[source]¶
Generate benchmark data.
- Returns:
List of data file paths
- Return type:
list[str | Path]
- abstractmethod get_queries()[source]¶
Get all benchmark queries.
- Returns:
Dictionary mapping query IDs to query strings
- Return type:
dict[str, str]
- abstractmethod get_query(query_id, *, params=None)[source]¶
Get a benchmark query.
- Parameters:
query_id (int | str) – Query ID
params (dict[str, Any] | None) – Optional parameters
- Returns:
Query string with parameters resolved
- Raises:
ValueError – If query_id is invalid
- Return type:
str
- setup_database(connection)[source]¶
Set up database with schema and data.
Creates necessary database schema and loads benchmark data into the database.
- Parameters:
connection (DatabaseConnection) – Database connection to set up
- Raises:
ValueError – If data generation fails
Exception – If database setup fails
- run_query(query_id, connection, params=None, fetch_results=False)[source]¶
Execute a single query and return timing and results.
- Parameters:
query_id (int | str) – ID of the query to execute
connection (DatabaseConnection) – Database connection to execute query on
params (dict[str, Any] | None) – Optional parameters for query customization
fetch_results (bool) – Whether to fetch and return query results
- Returns:
Dictionary containing:
query_id: Executed query ID
execution_time: Time taken to execute query in seconds
query_text: Executed query text
results: Query results if fetch_results=True, otherwise None
row_count: Number of rows returned (if results fetched)
- Return type:
dict[str, Any]
- Raises:
ValueError – If query_id is invalid
Exception – If query execution fails
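A hedged sketch of single-query execution, assuming benchmark is an instance of a concrete subclass and conn is an established DatabaseConnection:
>>> # benchmark: concrete BaseBenchmark subclass; conn: DatabaseConnection (assumed)
>>> result = benchmark.run_query(1, conn, fetch_results=True)
>>> print(result["execution_time"], result["row_count"])
>>> result["results"]  # populated only because fetch_results=True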
- run_benchmark(connection, query_ids=None, fetch_results=False, setup_database=True)[source]¶
Run the complete benchmark suite.
- Parameters:
connection (DatabaseConnection) – Database connection to execute queries on
query_ids (list[int | str] | None) – Optional list of specific query IDs to run (defaults to all)
fetch_results (bool) – Whether to fetch and return query results
setup_database (bool) – Whether to set up the database first
- Returns:
Dictionary containing:
benchmark_name: Name of the benchmark
total_execution_time: Total time for all queries
total_queries: Number of queries executed
successful_queries: Number of queries that succeeded
failed_queries: Number of queries that failed
query_results: List of individual query results
setup_time: Time taken for database setup (if performed)
- Return type:
dict[str, Any]
- Raises:
Exception – If benchmark execution fails
- run_with_platform(platform_adapter, **run_config)[source]¶
Run complete benchmark using platform-specific optimizations.
This method provides a unified interface for running benchmarks using database platform adapters that handle connection management, data loading optimizations, and query execution.
This is the standard method that all benchmarks should support for integration with the CLI and other orchestration tools.
- Parameters:
platform_adapter – Platform adapter instance (e.g., DuckDBAdapter)
**run_config – Configuration options:
categories: List of query categories to run (if the benchmark supports categories)
query_subset: List of specific query IDs to run
connection: Connection configuration
benchmark_type: Type hint for optimizations (‘olap’, ‘oltp’, etc.)
- Returns:
BenchmarkResults object with execution results
Example
from benchbox.platforms import DuckDBAdapter
benchmark = SomeBenchmark(scale_factor=0.1)
adapter = DuckDBAdapter()
results = benchmark.run_with_platform(adapter)
- format_results(benchmark_result)[source]¶
Format benchmark results for display.
- Parameters:
benchmark_result (dict[str, Any]) – Result dictionary from run_benchmark()
- Returns:
Formatted string representation of the results
- Return type:
str
- translate_query(query_id, dialect)[source]¶
Translate a query to a specific SQL dialect.
- Parameters:
query_id (int | str) – The ID of the query to translate
dialect (str) – The target SQL dialect
- Returns:
The translated query string
- Raises:
ValueError – If the query_id is invalid
ImportError – If sqlglot is not installed
ValueError – If the dialect is not supported
- Return type:
str
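A brief sketch, assuming benchmark is a concrete instance and the optional sqlglot dependency is installed:
>>> # Raises ImportError if sqlglot is not installed
>>> postgres_sql = benchmark.translate_query(1, "postgres")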
- property benchmark_name: str¶
Get the human-readable benchmark name.
- create_enhanced_benchmark_result(platform, query_results, execution_metadata=None, phases=None, resource_utilization=None, performance_characteristics=None, **kwargs)[source]¶
Create a BenchmarkResults object with standardized fields.
This centralizes the logic for creating benchmark results that was previously duplicated across platform adapters and CLI orchestrator.
- Parameters:
platform (str) – Platform name (e.g., “DuckDB”, “ClickHouse”)
query_results (list[dict[str, Any]]) – List of query execution results
execution_metadata (dict[str, Any] | None) – Optional execution metadata
phases (dict[str, dict[str, Any]] | None) – Optional phase tracking information
resource_utilization (dict[str, Any] | None) – Optional resource usage metrics
performance_characteristics (dict[str, Any] | None) – Optional performance analysis
**kwargs (Any) – Additional fields to override defaults
- Returns:
Fully configured BenchmarkResults object
- Return type:
BenchmarkResults
ClickBench (ClickHouse Analytics Benchmark) implementation.
Copyright 2026 Joe Harris / BenchBox Project
Licensed under the MIT License. See LICENSE file in the project root for details.
- class ClickBench(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Bases: BaseBenchmark
ClickBench (ClickHouse Analytics Benchmark) implementation.
Provides ClickBench benchmark implementation, including data generation and access to the 43 benchmark queries designed for testing analytical database performance with web analytics data.
Official specification: https://github.com/ClickHouse/ClickBench
Results dashboard: https://benchmark.clickhouse.com/
- __init__(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Initialize ClickBench benchmark instance.
- Parameters:
scale_factor (float) – Scale factor for the benchmark (1.0 = ~1M records for testing)
output_dir (str | Path | None) – Directory to output generated data files
**kwargs – Additional implementation-specific options
- generate_data()[source]¶
Generate ClickBench benchmark data.
- Returns:
A list of paths to the generated data files
- Return type:
list[str | Path]
- get_queries(dialect=None)[source]¶
Get all ClickBench benchmark queries.
- Parameters:
dialect (str | None) – Target SQL dialect for translation (e.g., ‘duckdb’, ‘bigquery’, ‘snowflake’). If None, returns queries in their original format.
- Returns:
Dictionary mapping query IDs (Q1-Q43) to query strings
- Return type:
dict[str, str]
- get_query(query_id, *, params=None)[source]¶
Get specific ClickBench benchmark query.
- Parameters:
query_id (int | str) – ID of the query to retrieve (Q1-Q43)
params (dict[str, Any] | None) – Optional parameters to customize the query
- Returns:
Query string
- Raises:
ValueError – If query_id is invalid
- Return type:
str
- get_schema()[source]¶
Get ClickBench schema.
- Returns:
List of dictionaries describing the tables in the schema
- Return type:
list[dict]
- get_create_tables_sql(dialect='standard', tuning_config=None)[source]¶
Get SQL to create all ClickBench tables.
- Parameters:
dialect (str) – SQL dialect to use
tuning_config – Unified tuning configuration for constraint settings
- Returns:
SQL script for creating all tables
- Return type:
str
- translate_query(query_id, dialect)[source]¶
Translate a ClickBench query to a different SQL dialect.
- Parameters:
query_id (str) – The ID of the query to translate (Q1-Q43)
dialect (str) – The target SQL dialect (postgres, mysql, bigquery, etc.)
- Returns:
The translated query string
- Raises:
ValueError – If the query_id is invalid
ImportError – If sqlglot is not installed
ValueError – If the dialect is not supported
- Return type:
str
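A minimal usage sketch (import path assumed, as above):
>>> from benchbox import ClickBench  # import path assumed
>>> cb = ClickBench(scale_factor=1.0)
>>> queries = cb.get_queries(dialect="duckdb")  # maps "Q1".."Q43" to SQL
>>> q1 = cb.get_query("Q1")
>>> duckdb_q1 = cb.translate_query("Q1", "duckdb")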
Public entrypoint for the reference-aligned CoffeeShop benchmark.
- class CoffeeShop(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Bases: BaseBenchmark
High-level wrapper for the CoffeeShop benchmark.
The rewritten benchmark mirrors the public reference generator and now exposes a compact star schema consisting of:
dim_locations: geographic metadata and regional weights
dim_products: canonical product catalog with seasonal availability
order_lines: exploded fact table (1-5 lines per order) with temporal, regional, and pricing dynamics
The query suite (SA*, PR*, TR*, TM*, QC*) focuses on sales analysis, product behaviour, trend analysis, and quality checks for the new schema.
- __init__(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Initialise a CoffeeShop benchmark instance.
- generate_data()[source]¶
Generate Coffee Shop benchmark data.
- Returns:
A list of paths to the generated data files
- Return type:
list[str | Path]
- get_queries(dialect=None)[source]¶
Get all Coffee Shop benchmark queries.
- Parameters:
dialect (str | None) – Target SQL dialect for query translation. If None, returns original queries.
- Returns:
A dictionary mapping query IDs to query strings
- Return type:
dict[str, str]
- get_query(query_id, *, params=None)[source]¶
Return a single CoffeeShop analytics query.
Query identifiers follow the updated naming convention (e.g. SA1 for sales analysis, PR1 for product mix, TR1 for trend review).
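A short sketch of the naming convention in use (import path assumed):
>>> from benchbox import CoffeeShop  # import path assumed
>>> shop = CoffeeShop(scale_factor=0.5)
>>> sa1 = shop.get_query("SA1")  # sales analysis
>>> pr1 = shop.get_query("PR1")  # product mix
>>> tr1 = shop.get_query("TR1")  # trend review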
H2O Database Benchmark implementation.
Copyright 2026 Joe Harris / BenchBox Project
Licensed under the MIT License. See LICENSE file in the project root for details.
- class H2ODB(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Bases: BaseBenchmark
H2O Database Benchmark implementation.
This class provides an implementation of the H2O Database Benchmark, including data generation and access to analytical queries for taxi trip data.
Reference: H2O.ai benchmarking suite for analytical workloads
- __init__(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Initialize an H2O Database Benchmark instance.
- Parameters:
scale_factor (float) – Scale factor for the benchmark (1.0 = ~1GB)
output_dir (str | Path | None) – Directory to output generated data files
**kwargs – Additional implementation-specific options
- generate_data()[source]¶
Generate H2O Database Benchmark data.
- Returns:
A list of paths to the generated data files
- Return type:
list[str | Path]
- get_queries(dialect=None)[source]¶
Get all H2O Database Benchmark queries.
- Parameters:
dialect (str | None) – Target SQL dialect for query translation. If None, returns original queries.
- Returns:
A dictionary mapping query IDs to query strings
- Return type:
dict[str, str]
- get_query(query_id, *, params=None)[source]¶
Get a specific H2O Database Benchmark query.
- Parameters:
query_id (int | str) – The ID of the query to retrieve (Q1-Q10)
params (dict[str, Any] | None) – Optional parameters to customize the query
- Returns:
The query string
- Raises:
ValueError – If the query_id is invalid
- Return type:
str
- get_schema()[source]¶
Get the H2O Database Benchmark schema.
- Returns:
A list of dictionaries describing the tables in the schema
- Return type:
list[dict]
- get_create_tables_sql(dialect='standard', tuning_config=None)[source]¶
Get SQL to create all H2O Database Benchmark tables.
- Parameters:
dialect (str) – SQL dialect to use
tuning_config – Unified tuning configuration for constraint settings
- Returns:
SQL script for creating all tables
- Return type:
str
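A minimal usage sketch (import path assumed):
>>> from benchbox import H2ODB  # import path assumed
>>> h2o = H2ODB(scale_factor=1.0)
>>> queries = h2o.get_queries()  # Q1-Q10
>>> ddl = h2o.get_create_tables_sql(dialect="standard")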
Join Order Benchmark top-level interface.
Copyright 2026 Joe Harris / BenchBox Project
Licensed under the MIT License. See LICENSE file in the project root for details.
- class JoinOrder(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Bases: BaseBenchmark
Join Order Benchmark implementation.
This class provides an implementation of the Join Order Benchmark, including data generation and access to complex join queries for cardinality estimation and join order optimization testing.
Reference: Viktor Leis et al. “How Good Are Query Optimizers, Really?”
- __init__(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Initialize a Join Order Benchmark instance.
- Parameters:
scale_factor (float) – Scale factor for the benchmark (1.0 = ~1GB)
output_dir (str | Path | None) – Directory to output generated data files
**kwargs (Any) – Additional implementation-specific options
- generate_data()[source]¶
Generate Join Order Benchmark data.
- Returns:
A list of paths to the generated data files
- Return type:
list[Path]
- get_queries()[source]¶
Get all Join Order Benchmark queries.
- Returns:
A dictionary mapping query IDs to query strings
- Return type:
dict[str, str]
- get_query(query_id, *, params=None)[source]¶
Get a specific Join Order Benchmark query.
- Parameters:
query_id (int | str) – The ID of the query to retrieve
params (dict[str, Any] | None) – Optional parameters to customize the query
- Returns:
The query string
- Raises:
ValueError – If the query_id is invalid
- Return type:
str
- get_schema(dialect='sqlite')[source]¶
Get the Join Order Benchmark schema DDL.
- Parameters:
dialect (str) – Target SQL dialect
- Returns:
DDL statements for creating all tables
- Return type:
str
- get_create_tables_sql(dialect='standard', tuning_config=None)[source]¶
Get SQL to create all Join Order Benchmark tables.
- Parameters:
dialect (str) – SQL dialect to use
tuning_config (Any) – Unified tuning configuration for constraint settings
- Returns:
SQL script for creating all tables
- Return type:
str
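A minimal usage sketch (import path assumed):
>>> from benchbox import JoinOrder  # import path assumed
>>> job = JoinOrder(scale_factor=1.0)
>>> ddl = job.get_schema(dialect="sqlite")  # DDL for all tables
>>> queries = job.get_queries()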
Read Primitives benchmark implementation.
Copyright 2026 Joe Harris / BenchBox Project
This benchmark combines queries from multiple sources:
Apache Impala targeted-perf workload (https://github.com/apache/impala/tree/master/testdata/workloads/targeted-perf) Apache License 2.0, Copyright Apache Software Foundation
Optimizer sniff test concepts by Justin Jaffray (https://buttondown.com/jaffray/archive/a-sniff-test-for-some-query-optimizers/)
Data generation uses the TPC-H schema (TPC Benchmark H, Copyright Transaction Processing Performance Council).
Licensed under the MIT License. See LICENSE file in the project root for details.
- class ReadPrimitives(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Bases: BaseBenchmark
Read Primitives benchmark implementation.
Provides Read Primitives benchmark implementation, including data generation and access to 80+ primitive read operation queries that test fundamental database capabilities using the TPC-H schema.
The benchmark covers:
Aggregation, joins, filters, window functions
OLAP operations, statistical functions
JSON operations, full-text search
Time series analysis, array operations
Graph operations, temporal queries
- __init__(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Initialize Read Primitives benchmark instance.
- Parameters:
scale_factor (float) – Scale factor for the benchmark (1.0 = ~6M lineitem rows)
output_dir (str | Path | None) – Directory to output generated data files
**kwargs (Any) – Additional implementation-specific options
- generate_data(tables=None)[source]¶
Generate Read Primitives benchmark data.
- Parameters:
tables (list[str] | None) – Optional list of table names to generate. If None, generates all.
- Returns:
A dictionary mapping table names to file paths
- Return type:
dict[str, str]
- get_queries(dialect=None)[source]¶
Get all Read Primitives benchmark queries.
- Parameters:
dialect (str | None) – Target SQL dialect for query translation. If None, returns original queries.
- Returns:
A dictionary mapping query IDs to query strings
- Return type:
dict[str, str]
- get_query(query_id, *, params=None)[source]¶
Get a specific Read Primitives benchmark query.
- Parameters:
query_id (int | str) – The ID of the query to retrieve (e.g., ‘aggregation_simple’)
params (dict[str, Any] | None) – Optional parameters to customize the query
- Returns:
The query string
- Raises:
ValueError – If the query_id is invalid
- Return type:
str
- get_queries_by_category(category)[source]¶
Get queries filtered by category.
- Parameters:
category (str) – Category name (e.g., ‘aggregation’, ‘window’, ‘join’)
- Returns:
Dictionary mapping query IDs to SQL text for the category
- Return type:
dict[str, str]
- get_query_categories()[source]¶
Get list of available query categories.
- Returns:
List of category names
- Return type:
list[str]
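A sketch of category-based query selection (import path assumed):
>>> from benchbox import ReadPrimitives  # import path assumed
>>> rp = ReadPrimitives(scale_factor=0.1)
>>> rp.get_query_categories()  # e.g. includes 'aggregation', 'join', 'window'
>>> agg_queries = rp.get_queries_by_category("aggregation")
>>> sql = rp.get_query("aggregation_simple")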
- get_schema()[source]¶
Get the Read Primitives benchmark schema (TPC-H).
- Returns:
A dictionary mapping table names to their schema definitions
- Return type:
dict[str, dict]
- get_create_tables_sql(dialect='standard', tuning_config=None)[source]¶
Get SQL to create all Read Primitives benchmark tables.
- Parameters:
dialect (str) – SQL dialect to use
tuning_config – Unified tuning configuration for constraint settings
- Returns:
SQL script for creating all tables
- Return type:
str
- load_data_to_database(connection, tables=None)[source]¶
Load generated data into a database.
- Parameters:
connection (Any) – Database connection
tables (list[str] | None) – Optional list of tables to load. If None, loads all.
- Raises:
ValueError – If data hasn’t been generated yet
- execute_query(query_id, connection, params=None)[source]¶
Execute a Read Primitives query on the given database connection.
- Parameters:
query_id (str) – Query identifier (e.g., ‘aggregation_simple’)
connection (Any) – Database connection to use for execution
params (dict[str, Any] | None) – Optional parameters to use in the query
- Returns:
Query results from the database
- Raises:
ValueError – If the query_id is not valid
- Return type:
Any
- run_benchmark(connection, queries=None, iterations=1, categories=None)[source]¶
Run the complete Read Primitives benchmark.
- Parameters:
connection (Any) – Database connection to use
queries (list[str] | None) – Optional list of query IDs to run. If None, runs all.
iterations (int) – Number of times to run each query
categories (list[str] | None) – Optional list of categories to run. If specified, overrides queries.
- Returns:
Dictionary containing benchmark results
- Return type:
dict[str, Any]
- run_category_benchmark(connection, category, iterations=1)[source]¶
Run benchmark for a specific query category.
- Parameters:
connection (Any) – Database connection to use
category (str) – Category name to run (e.g., ‘aggregation’, ‘window’, ‘join’)
iterations (int) – Number of times to run each query
- Returns:
Dictionary containing benchmark results for the category
- Return type:
dict[str, Any]
Star Schema Benchmark implementation.
Copyright 2026 Joe Harris / BenchBox Project
This implementation is derived from TPC Benchmark™ H (TPC-H) - Copyright © Transaction Processing Performance Council
Licensed under the MIT License. See LICENSE file in the project root for details.
- class SSB(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Bases: BaseBenchmark
Star Schema Benchmark implementation.
This class provides an implementation of the Star Schema Benchmark, including data generation and access to the 13 benchmark queries organized in 4 flights.
Reference: O’Neil et al. “The Star Schema Benchmark and Augmented Fact Table Indexing”
- __init__(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Initialize a Star Schema Benchmark instance.
- Parameters:
scale_factor (float) – Scale factor for the benchmark (1.0 = ~1GB)
output_dir (str | Path | None) – Directory to output generated data files
**kwargs (Any) – Additional implementation-specific options
- generate_data()[source]¶
Generate Star Schema Benchmark data.
- Returns:
A list of paths to the generated data files
- Return type:
list[str | Path]
- get_queries(dialect=None)[source]¶
Get all Star Schema Benchmark queries.
- Parameters:
dialect (str | None) – Target SQL dialect for query translation. If None, returns original queries.
- Returns:
A dictionary mapping query IDs to query strings
- Return type:
dict[str, str]
- get_query(query_id, *, params=None)[source]¶
Get a specific Star Schema Benchmark query.
- Parameters:
query_id (str) – The ID of the query to retrieve (Q1.1-Q4.3)
params (dict[str, Any] | None) – Optional parameters to customize the query
- Returns:
The query string
- Raises:
ValueError – If the query_id is invalid
- Return type:
str
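A minimal usage sketch (import path assumed):
>>> from benchbox import SSB  # import path assumed
>>> ssb = SSB(scale_factor=1.0)
>>> q11 = ssb.get_query("Q1.1")  # first query of flight 1
>>> all_queries = ssb.get_queries()  # 13 queries across 4 flights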
TPC-DI (Data Integration) benchmark implementation.
Copyright 2026 Joe Harris / BenchBox Project
TPC Benchmark™ DI (TPC-DI) - Copyright © Transaction Processing Performance Council. This implementation is based on the TPC-DI specification.
Licensed under the MIT License. See LICENSE file in the project root for details.
- class TPCDI(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Bases: BaseBenchmark
TPC-DI benchmark implementation.
This class provides an implementation of the TPC-DI benchmark, including data generation and access to validation and analytical queries.
Official specification: http://www.tpc.org/tpcdi
- __init__(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Initialize a TPC-DI benchmark instance.
- Parameters:
scale_factor (float) – Scale factor for the benchmark (1.0 = ~1GB)
output_dir (str | Path | None) – Directory to output generated data files
**kwargs – Additional implementation-specific options
- generate_data()[source]¶
Generate TPC-DI benchmark data.
- Returns:
A list of paths to the generated data files
- Return type:
list[str | Path]
- get_queries(dialect=None)[source]¶
Get all TPC-DI benchmark queries.
- Parameters:
dialect (str | None) – Target SQL dialect for query translation. If None, returns original queries.
- Returns:
A dictionary mapping query IDs to query strings
- Return type:
dict[str, str]
- get_query(query_id, *, params=None)[source]¶
Get a specific TPC-DI benchmark query.
- Parameters:
query_id (int | str) – The ID of the query to retrieve
params (dict[str, Any] | None) – Optional parameters to customize the query
- Returns:
The query string
- Raises:
ValueError – If the query_id is invalid
- Return type:
str
- get_schema(dialect='standard')[source]¶
Get the TPC-DI schema.
- Parameters:
dialect (str) – Target SQL dialect
- Returns:
A dictionary mapping table names to table definitions
- Return type:
dict[str, dict[str, Any]]
- get_create_tables_sql(dialect='standard', tuning_config=None)[source]¶
Get SQL to create all TPC-DI tables.
- Parameters:
dialect (str) – SQL dialect to use
tuning_config – Unified tuning configuration for constraint settings
- Returns:
SQL script for creating all tables
- Return type:
str
- generate_source_data(formats=None, batch_types=None)[source]¶
Generate source data in various formats for ETL processing.
- Parameters:
formats (list[str] | None) – List of data formats to generate (csv, xml, fixed_width, json)
batch_types (list[str] | None) – List of batch types to generate (historical, incremental, scd)
- Returns:
Dictionary mapping formats to lists of generated file paths
- Return type:
dict[str, list[str]]
- run_etl_pipeline(connection, batch_type='historical', validate_data=True)[source]¶
Run the complete ETL pipeline for TPC-DI.
- Parameters:
connection (Any) – Database connection for target warehouse
batch_type (str) – Type of batch to process (historical, incremental, scd)
validate_data (bool) – Whether to run data validation after ETL
- Returns:
Dictionary containing ETL execution results and metrics
- Return type:
dict[str, Any]
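A hedged sketch of the ETL flow, assuming conn is a connection to the target warehouse (import path assumed):
>>> from benchbox import TPCDI  # import path assumed
>>> di = TPCDI(scale_factor=0.1)
>>> sources = di.generate_source_data(formats=["csv"], batch_types=["historical"])
>>> # conn: target warehouse connection (assumed)
>>> etl = di.run_etl_pipeline(conn, batch_type="historical", validate_data=True)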
- validate_etl_results(connection)[source]¶
Validate ETL results using data quality checks.
- Parameters:
connection (Any) – Database connection to validate against
- Returns:
Dictionary containing validation results and data quality metrics
- Return type:
dict[str, Any]
- get_etl_status()[source]¶
Get current ETL processing status and metrics.
- Returns:
Dictionary containing ETL status, metrics, and batch information
- Return type:
dict[str, Any]
- property etl_mode: bool¶
Check if ETL mode is enabled.
- Returns:
Always True as TPC-DI is now a pure ETL benchmark
- load_data_to_database(connection, tables=None)[source]¶
Load generated data into a database.
- Parameters:
connection (Any) – Database connection
tables (list[str] | None) – Optional list of tables to load. If None, loads all.
- Raises:
ValueError – If data hasn’t been generated yet
- run_benchmark(connection, queries=None, iterations=1)[source]¶
Run the complete TPC-DI benchmark.
- Parameters:
connection (Any) – Database connection to use
queries (list[str] | None) – Optional list of query IDs to run. If None, runs all.
iterations (int) – Number of times to run each query
- Returns:
Dictionary containing benchmark results
- Return type:
dict[str, Any]
- execute_query(query_id, connection, params=None)[source]¶
Execute a TPC-DI query on the given database connection.
- Parameters:
query_id (int | str) – Query identifier (e.g., “V1”, “V2”, “A1”, etc.)
connection (Any) – Database connection to use for execution
params (dict[str, Any] | None) – Optional parameters to use in the query
- Returns:
Query results from the database
- Raises:
ValueError – If the query_id is not valid
- Return type:
Any
- create_schema(connection, dialect='duckdb')[source]¶
Create TPC-DI schema using the schema manager.
- Parameters:
connection (Any) – Database connection
dialect (str) – Target SQL dialect
- run_full_benchmark(connection, dialect='duckdb')[source]¶
Run the complete TPC-DI benchmark with all phases.
This is the main entry point for running a complete TPC-DI benchmark including schema creation, data loading, ETL processing, validation, and metrics calculation.
- Parameters:
connection (Any) – Database connection
dialect (str) – SQL dialect for the target database
- Returns:
Complete benchmark results with all metrics
- Return type:
dict[str, Any]
- run_etl_benchmark(connection, dialect='duckdb')[source]¶
Run the ETL benchmark pipeline.
- Parameters:
connection (Any) – Database connection
dialect (str) – SQL dialect
- Returns:
ETL execution results
- Return type:
Any
- run_data_validation(connection)[source]¶
Run data quality validation.
- Parameters:
connection (Any) – Database connection
- Returns:
Data quality validation results
- Return type:
Any
- calculate_official_metrics(etl_result, validation_result)[source]¶
Calculate official TPC-DI metrics.
- Parameters:
etl_result (Any) – ETL execution results
validation_result (Any) – Data validation results
- Returns:
Official TPC-DI benchmark metrics
- Return type:
Any
- optimize_database(connection)[source]¶
Optimize database performance for TPC-DI queries.
- Parameters:
connection (Any) – Database connection
- Returns:
Optimization results
- Return type:
dict[str, Any]
- property validator: Any¶
Get the TPC-DI validator instance.
- Returns:
TPCDIValidator instance
- property schema_manager: Any¶
Get the TPC-DI schema manager instance.
- Returns:
TPCDISchemaManager instance
- property metrics_calculator: Any¶
Get the TPC-DI metrics calculator instance.
- Returns:
TPCDIMetrics instance
TPC-DS benchmark implementation.
Copyright 2026 Joe Harris / BenchBox Project
TPC Benchmark™ DS (TPC-DS) - Copyright © Transaction Processing Performance Council. This implementation is based on the TPC-DS specification.
Licensed under the MIT License. See LICENSE file in the project root for details.
- class TPCDS(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Bases: BaseBenchmark
TPC-DS benchmark implementation.
Provides TPC-DS benchmark implementation, including data generation and access to the benchmark queries.
Official specification: http://www.tpc.org/tpcds
- __init__(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Initialize TPC-DS benchmark instance.
- Parameters:
scale_factor (float) – Scale factor for the benchmark (1.0 = ~1GB)
output_dir (str | Path | None) – Directory to output generated data files
**kwargs (Any) – Additional implementation-specific options
- Raises:
ValueError – If scale_factor is not positive
TypeError – If scale_factor is not a number
- generate_data()[source]¶
Generate TPC-DS benchmark data.
- Returns:
A list of paths to the generated data files
- Return type:
list[str | Path]
- get_queries(dialect=None, base_dialect=None)[source]¶
Get all TPC-DS benchmark queries.
- Parameters:
dialect (str | None) – Target SQL dialect for translation (e.g., ‘duckdb’, ‘postgres’). If None, returns queries in their original format.
base_dialect (str | None) – Source SQL dialect of the stored queries
- Returns:
A dictionary mapping query IDs to query strings
- Return type:
dict[str, str]
- get_query(query_id, *, params=None, seed=None, scale_factor=None, dialect=None, **kwargs)[source]¶
Get a specific TPC-DS benchmark query.
- Parameters:
query_id (int) – The ID of the query to retrieve (1-99)
params (dict[str, Any] | None) – Optional parameters to customize the query (legacy parameter, mostly ignored)
seed (int | None) – Random number generator seed for parameter generation
scale_factor (float | None) – Scale factor for parameter calculations
dialect (str | None) – Target SQL dialect
**kwargs – Additional parameters
- Returns:
The query string
- Raises:
ValueError – If the query_id is invalid
TypeError – If query_id is not an integer
- Return type:
str
- property queries: TPCDSQueryManager¶
Access to the query manager.
- Returns:
The underlying query manager instance
- property generator: TPCDSDataGenerator¶
Access to the data generator.
- Returns:
The underlying data generator instance
- get_available_tables()[source]¶
Get list of available tables.
- Returns:
List of table names
- Return type:
list[str]
- get_available_queries()[source]¶
Get list of available query IDs.
- Returns:
List of query IDs (1-99)
- Return type:
list[int]
- generate_table_data(table_name, output_dir=None)[source]¶
Generate data for a specific table.
- Parameters:
table_name (str) – Name of the table to generate data for
output_dir (str | None) – Optional output directory for generated data
- Returns:
Iterator of data rows for the table
- Return type:
str
- get_schema()[source]¶
Get the TPC-DS schema.
- Returns:
A list of dictionaries describing the tables in the schema
- Return type:
list[dict]
- get_create_tables_sql(dialect='standard', tuning_config=None)[source]¶
Get SQL to create all TPC-DS tables.
- Parameters:
dialect (str) – SQL dialect to use (currently ignored, TPC-DS uses standard SQL)
tuning_config – Unified tuning configuration for constraint settings
- Returns:
SQL script for creating all tables
- Return type:
str
- generate_streams(num_streams=1, rng_seed=None, streams_output_dir=None)[source]¶
Generate TPC-DS query streams.
- Parameters:
num_streams (int) – Number of concurrent streams to generate
rng_seed (int | None) – Random number generator seed for parameter generation
streams_output_dir (str | Path | None) – Directory to output stream files
- Returns:
List of paths to generated stream files
- Return type:
list[Path]
- get_stream_info(stream_id)[source]¶
Get information about a specific stream.
- Parameters:
stream_id (int) – Stream identifier
- Returns:
Dictionary containing stream information
- Return type:
dict[str, Any]
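A brief sketch of stream generation (import path assumed):
>>> from benchbox import TPCDS  # import path assumed
>>> ds = TPCDS(scale_factor=1.0)
>>> streams = ds.generate_streams(num_streams=2, rng_seed=42)
>>> info = ds.get_stream_info(0)  # stream identifier; indexing assumed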
TPC-H benchmark implementation.
Copyright 2026 Joe Harris / BenchBox Project
TPC Benchmark™ H (TPC-H) - Copyright © Transaction Processing Performance Council. This implementation is based on the TPC-H specification.
Licensed under the MIT License. See LICENSE file in the project root for details.
- class TPCH(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Bases: BaseBenchmark
TPC-H benchmark implementation.
Provides TPC-H benchmark implementation, including data generation and access to the 22 benchmark queries.
Official specification: http://www.tpc.org/tpch
- __init__(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Initialize TPC-H benchmark instance.
- Parameters:
scale_factor (float) – Scale factor for the benchmark (1.0 = ~1GB)
output_dir (str | Path | None) – Directory to output generated data files
**kwargs (Any) – Additional implementation-specific options
- Raises:
ValueError – If scale_factor is not positive
TypeError – If scale_factor is not a number
- generate_data()[source]¶
Generate TPC-H benchmark data.
- Returns:
A list of paths to the generated data files
- Return type:
list[str | Path]
- get_queries(dialect=None, base_dialect=None)[source]¶
Get all TPC-H benchmark queries.
- Parameters:
dialect (str | None) – Target SQL dialect for translation (e.g., ‘duckdb’, ‘bigquery’, ‘snowflake’). If None, returns queries in their original format.
base_dialect (str | None) – Source SQL dialect (default: netezza)
- Returns:
A dictionary mapping query IDs (1-22) to query strings
- Return type:
dict[str, str]
- get_query(query_id, *, params=None, seed=None, scale_factor=None, dialect=None, base_dialect=None, **kwargs)[source]¶
Get a specific TPC-H benchmark query.
- Parameters:
query_id (int) – The ID of the query to retrieve (1-22)
params (dict[str, Any] | None) – Optional parameters to customize the query (legacy parameter, mostly ignored)
seed (int | None) – Random number generator seed for parameter generation
scale_factor (float | None) – Scale factor for parameter calculations
dialect (str | None) – Target SQL dialect
base_dialect (str | None) – Source SQL dialect (default: netezza)
**kwargs – Additional parameters
- Returns:
The query string
- Raises:
ValueError – If the query_id is invalid
TypeError – If query_id is not an integer
- Return type:
str
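A short sketch of parameterized query retrieval (import path assumed):
>>> from benchbox import TPCH  # import path assumed
>>> tpch = TPCH(scale_factor=1.0)
>>> q1 = tpch.get_query(1, seed=42, dialect="duckdb")
>>> q1_orig = tpch.get_query(1)  # original text in the default (netezza) base dialect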
- get_schema()[source]¶
Get the TPC-H schema.
- Returns:
A list of dictionaries describing the tables in the schema
- Return type:
list[dict]
- get_create_tables_sql(dialect='standard', tuning_config=None)[source]¶
Get SQL to create all TPC-H tables.
- Parameters:
dialect (str) – SQL dialect to use (currently ignored, TPC-H uses standard SQL)
tuning_config – Unified tuning configuration for constraint settings
- Returns:
SQL script for creating all tables
- Return type:
str
- generate_streams(num_streams=1, rng_seed=None, streams_output_dir=None)[source]¶
Generate TPC-H query streams.
- Parameters:
num_streams (int) – Number of concurrent streams to generate
rng_seed (int | None) – Random number generator seed for parameter generation
streams_output_dir (str | Path | None) – Directory to output stream files
- Returns:
List of paths to generated stream files
- Return type:
list[Path]
- get_stream_info(stream_id)[source]¶
Get information about a specific stream.
- Parameters:
stream_id (int) – Stream identifier
- Returns:
Dictionary containing stream information
- Return type:
dict[str, Any]
- get_all_streams_info()[source]¶
Get information about all streams.
- Returns:
List of dictionaries containing stream information
- Return type:
list[dict[str, Any]]
- property tables: dict[str, Path]¶
Get the mapping of table names to data file paths.
- Returns:
Dictionary mapping table names to paths of generated data files
- run_official_benchmark(connection_factory, config=None)[source]¶
Run the official TPC-H benchmark.
This method provides compatibility for official benchmark examples.
- Parameters:
connection_factory – Factory function or connection object
config – Optional configuration parameters
- Returns:
Dictionary with benchmark results
- run_power_test(connection_factory, config=None)[source]¶
Run the TPC-H power test.
This method provides compatibility for power test examples.
- Parameters:
connection_factory – Factory function or connection object
config – Optional configuration parameters
- Returns:
Dictionary with power test results
- run_maintenance_test(connection_factory, config=None)[source]¶
Run the TPC-H maintenance test.
This method provides compatibility for maintenance test examples.
- Parameters:
connection_factory – Factory function or connection object
config – Optional configuration parameters
- Returns:
Dictionary with maintenance test results
TPC-Havoc benchmark implementation.
Copyright 2026 Joe Harris / BenchBox Project
This implementation is derived from TPC Benchmark™ H (TPC-H) - Copyright © Transaction Processing Performance Council
Licensed under the MIT License. See LICENSE file in the project root for details.
- class TPCHavoc(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Bases: BaseBenchmark
TPC-Havoc benchmark implementation.
Generates TPC-H query variants to stress query optimizers while maintaining result equivalence.
TPC-Havoc provides 10 structural variants for each TPC-H query (1-22). Each variant is semantically equivalent but uses different SQL constructs to stress different optimizer components.
Example
>>> from benchbox import TPCHavoc
>>> from benchbox.platforms.duckdb import DuckDBAdapter
>>>
>>> # Initialize benchmark and platform
>>> benchmark = TPCHavoc(scale_factor=1.0)
>>> adapter = DuckDBAdapter(database=":memory:")
>>>
>>> # Load data
>>> adapter.load_benchmark(benchmark)
>>>
>>> # Get and execute query variant
>>> variant_query = benchmark.get_query_variant(query_id=1, variant_id=1)
>>> results = adapter.execute_query(variant_query)
>>>
>>> # Get variant description
>>> desc = benchmark.get_variant_description(query_id=1, variant_id=1)
>>> print(desc)  # "Join order permutation: customers first"
>>>
>>> # Export all variants
>>> benchmark.export_variant_queries(output_dir="./queries")
Note
Query execution must be performed through platform adapters (DuckDBAdapter, SnowflakeAdapter, etc.). Direct execution methods are not provided to maintain architectural consistency.
- __init__(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Initialize TPC-Havoc benchmark instance.
- Parameters:
scale_factor (float) – Scale factor (1.0 = ~1GB)
output_dir (str | Path | None) – Data output directory
**kwargs (Any) – Additional options
- Raises:
ValueError – If scale_factor is not positive
TypeError – If scale_factor is not a number
- generate_data()[source]¶
Generate TPC-Havoc benchmark data (same as TPC-H).
- Returns:
A list of paths to the generated data files
- Return type:
list[str | Path]
- get_queries(dialect=None)[source]¶
Get all TPC-Havoc benchmark queries (base TPC-H queries).
- Parameters:
dialect (str | None) – Target SQL dialect for query translation. If None, returns original queries.
- Returns:
A dictionary mapping query IDs (1-22) to base query strings
- Return type:
dict[str, str]
- get_query(query_id, *, params=None, seed=None, scale_factor=None, dialect=None, **kwargs)[source]¶
Get a specific TPC-Havoc benchmark query.
- Parameters:
query_id – The ID of the query to retrieve (1-22 for base queries, or “1_v1” format for variants)
params (dict[str, Any] | None) – Optional parameters to customize the query
seed (int | None) – Random number generator seed for parameter generation
scale_factor (float | None) – Scale factor for parameter calculations
dialect (str | None) – Target SQL dialect
**kwargs – Additional parameters
- Returns:
The query string
- Raises:
ValueError – If the query_id is invalid
TypeError – If query_id is not in a valid format
- Return type:
str
- get_query_variant(query_id, variant_id, params=None)[source]¶
Get a specific TPC-Havoc query variant.
- Parameters:
query_id (int) – The ID of the query to retrieve (1-22)
variant_id (int) – The ID of the variant to retrieve (1-10)
params (dict[str, Any] | None) – Optional parameter values to use
- Returns:
The variant query string
- Raises:
ValueError – If the query_id or variant_id is invalid
TypeError – If query_id or variant_id is not an integer
- Return type:
str
- get_all_variants(query_id)[source]¶
Get all variants for a specific query.
- Parameters:
query_id (int) – The ID of the query to retrieve variants for (1-22)
- Returns:
A dictionary mapping variant IDs to query strings
- Raises:
ValueError – If the query_id is invalid or not implemented
TypeError – If query_id is not an integer
- Return type:
dict[int, str]
- get_variant_description(query_id, variant_id)[source]¶
Get description of a specific variant.
- Parameters:
query_id (int) – The ID of the query (1-22)
variant_id (int) – The ID of the variant (1-10)
- Returns:
Human-readable description of the variant
- Raises:
ValueError – If the query_id or variant_id is invalid
TypeError – If query_id or variant_id is not an integer
- Return type:
str
- get_implemented_queries()[source]¶
Get list of query IDs that have variants implemented.
- Returns:
List of query IDs with implemented variants
- Return type:
list[int]
- get_all_variants_info(query_id)[source]¶
Get information about all variants for a specific query.
- Parameters:
query_id (int) – The ID of the query (1-22)
- Returns:
Dictionary mapping variant IDs to variant info
- Raises:
ValueError – If the query_id is invalid or not implemented
TypeError – If query_id is not an integer
- Return type:
dict[int, dict[str, str]]
- get_schema()[source]¶
Get the TPC-Havoc schema (same as TPC-H).
- Returns:
A dictionary mapping table names to table definitions
- Return type:
dict[str, dict[str, Any]]
- get_create_tables_sql(dialect='standard', tuning_config=None)[source]¶
Get SQL to create all TPC-Havoc tables (same as TPC-H).
- Parameters:
dialect (str) – SQL dialect to use
tuning_config – Unified tuning configuration for constraint settings
- Returns:
SQL script for creating all tables
- Return type:
str
- get_benchmark_info()[source]¶
Get information about the TPC-Havoc benchmark.
- Returns:
Dictionary containing benchmark metadata
- Return type:
dict[str, Any]
- export_variant_queries(output_dir=None, format='sql')[source]¶
Export all variant queries to files.
- Parameters:
output_dir (str | Path | None) – Directory to export queries to (default: self.output_dir/queries)
format (str) – Export format (“sql”, “json”)
- Returns:
Dictionary mapping query identifiers to file paths
- Raises:
ValueError – If format is unsupported
- Return type:
dict[str, Path]
- load_data_to_database(connection_string, dialect='standard', schema=None, drop_existing=False)[source]¶
Load generated data into a database (same as TPC-H).
- Parameters:
connection_string (str) – Database connection string
dialect (str) – SQL dialect (standard, postgres, mysql, etc.)
schema (str | None) – Optional database schema to use
drop_existing (bool) – Whether to drop existing tables before creating new ones
- Raises:
ValueError – If data hasn’t been generated yet
ImportError – If required database driver is not installed
- run_query(query_id, connection_string, params=None, dialect='standard')[source]¶
Run a TPC-Havoc base query against a database.
- Parameters:
query_id (int) – The ID of the query to run (1-22)
connection_string (str) – Database connection string
params (dict[str, Any] | None) – Optional parameter values to use
dialect (str) – SQL dialect (standard, postgres, mysql, etc.)
- Returns:
Dictionary with query results and timing information
- Raises:
ValueError – If the query_id is invalid
TypeError – If query_id is not an integer
ImportError – If required database driver is not installed
- Return type:
dict[str, Any]
- run_benchmark(connection_string, queries=None, iterations=1, dialect='standard', schema=None)[source]¶
Run the TPC-Havoc benchmark using base queries.
- Parameters:
connection_string (str) – Database connection string
queries (list[int] | None) – Optional list of query IDs to run (default: all implemented)
iterations (int) – Number of times to run each query
dialect (str) – SQL dialect (standard, postgres, mysql, etc.)
schema (str | None) – Optional database schema to use
- Returns:
Dictionary with benchmark results and timing information
- Raises:
ValueError – If any query_id is invalid or iterations is not positive
TypeError – If query_ids are not integers
ImportError – If required database driver is not installed
- Return type:
dict[str, Any]