API Reference

API documentation for BenchBox classes and methods. This page covers the most common entry points; the full autodoc catalog lives under the python-api/ section of the Sphinx build.

Core Classes

BaseBenchmark

Abstract base class that all benchmarks inherit from.

from benchbox.base import BaseBenchmark

class BaseBenchmark(ABC):
    def __init__(self, scale_factor: float = 1.0, output_dir: Optional[Path] = None):
        """Initialize benchmark with scale factor and output directory."""

Methods

generate_data() -> List[Path]

Generate benchmark data files.

Returns: List of Path objects pointing to generated data files.

Example:

from benchbox import TPCH

tpch = TPCH(scale_factor=0.1)
data_files = tpch.generate_data()
# Returns: [Path("customer.tbl"), Path("orders.tbl"), Path("lineitem.tbl"), ...]
for file_path in data_files:
    table_name = file_path.stem
    print(f"Generated {table_name} at {file_path}")

get_queries() -> Dict[Union[int, str], str]

Get all queries for this benchmark.

Returns: Dictionary mapping query IDs to SQL query strings.

Example:

queries = tpch.get_queries()
# Returns: {1: "SELECT l_returnflag...", 2: "SELECT s_acctbal...", ...}

get_query(query_id: Union[int, str]) -> str

Get a specific query by ID.

Parameters:

  • query_id: Query identifier (integer or string)

Returns: SQL query string.

Example:

# Basic query
query_1 = tpch.get_query(1)

# String-based query ID (for some benchmarks)
primitives_query = primitives.get_query("aggregation_basic")

translate_query(query_id: Union[int, str], dialect: str) -> str

Translate query to specific SQL dialect.

Parameters:

  • query_id: Query identifier

  • dialect: Target SQL dialect (“postgres”, “mysql”, “sqlite”, “duckdb”, etc.)

Returns: Translated SQL query string.

Example:

postgres_query = tpch.translate_query(1, "postgres")
mysql_query = tpch.translate_query(1, "mysql")
duckdb_query = tpch.translate_query(1, "duckdb")  # Recommended default

get_create_tables_sql() -> str

Get DDL statements to create benchmark tables.

Returns: DDL statements as string in standard SQL format.

Example:

ddl = tpch.get_create_tables_sql()
# Use with DuckDB (recommended)
import duckdb
conn = duckdb.connect(":memory:")
conn.execute(ddl)

get_schema() -> List[Dict[str, Any]]

Get benchmark schema information.

Returns: List of dictionaries describing tables and columns.

Example:

schema = tpch.get_schema()
for table in schema:
    print(f"Table: {table['name']}")
    for column in table['columns']:
        print(f"  {column['name']}: {column['type']}")
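
The list-of-dictionaries shape shown above is easy to reshape for tooling such as column checks after a load. The following is a minimal sketch, assuming each table dictionary has `name` and `columns` keys and each column dictionary has `name` and `type`, as in the example. The sample schema and the helper `columns_by_table` are hypothetical and not part of BenchBox.

```python
from typing import Any, Dict, List

# Hypothetical sample mirroring the shape printed in the example above;
# real output from get_schema() may carry additional keys.
sample_schema: List[Dict[str, Any]] = [
    {
        "name": "region",
        "columns": [
            {"name": "r_regionkey", "type": "INTEGER"},
            {"name": "r_name", "type": "VARCHAR(25)"},
        ],
    },
]


def columns_by_table(schema: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each table name to its column names, preserving column order."""
    return {
        table["name"]: [column["name"] for column in table["columns"]]
        for table in schema
    }


lookup = columns_by_table(sample_schema)
for table_name, column_names in lookup.items():
    print(f"{table_name}: {', '.join(column_names)}")
```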

Benchmark Classes

TPCH

TPC-H Decision Support Benchmark implementation.

from benchbox import TPCH

tpch = TPCH(scale_factor=1.0, output_dir=None)

Properties

  • Query Count: 22 analytical queries (Q1-Q22)

  • Tables: 8 tables (customer, orders, lineitem, part, partsupp, supplier, nation, region)

  • Scale Factors: 0.001 to 1000+

  • Recommended Database: DuckDB

Example Usage

# Recommended DuckDB integration
import duckdb
from benchbox import TPCH

conn = duckdb.connect(":memory:")
tpch = TPCH(scale_factor=0.1)

# Generate and load data
data_files = tpch.generate_data()
for file_path in data_files:
    table_name = file_path.stem
    conn.execute(f"""
        CREATE TABLE {table_name} AS
        SELECT * FROM read_csv('{file_path}', delimiter='|', header=false)
    """)

# Run queries
result = conn.execute(tpch.get_query(1)).fetchall()
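
The same load step can be done without DuckDB. Below is a minimal sketch for SQLite using only the standard library, assuming the generated files are pipe-delimited with no header row, as in the example above. The sample file, the untyped column layout, and the helper `load_pipe_delimited` are hypothetical; in practice the DDL from get_create_tables_sql() would define the table.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def load_pipe_delimited(conn: sqlite3.Connection, file_path: Path, num_columns: int) -> int:
    """Load a headerless pipe-delimited file into a table named after the file stem.

    Creates untyped columns c0, c1, ... and returns the number of rows inserted.
    """
    table_name = file_path.stem
    column_list = ", ".join(f"c{i}" for i in range(num_columns))
    placeholders = ", ".join("?" for _ in range(num_columns))
    conn.execute(f'CREATE TABLE "{table_name}" ({column_list})')
    with file_path.open(newline="") as handle:
        reader = csv.reader(handle, delimiter="|")
        # Keep only the first num_columns fields so a trailing delimiter is tolerated.
        rows = [row[:num_columns] for row in reader if row]
    conn.executemany(f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows)
    return len(rows)


# Hypothetical two-row sample standing in for a generated data file.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "region.tbl"
    sample.write_text("0|AFRICA|first comment|\n1|AMERICA|second comment|\n")
    conn = sqlite3.connect(":memory:")
    inserted = load_pipe_delimited(conn, sample, num_columns=3)
    names = [row[0] for row in conn.execute("SELECT c1 FROM region ORDER BY c0")]
    print(inserted, names)
```

The table name is taken from the file stem, matching the convention used in the DuckDB example.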

TPCDS

TPC-DS Decision Support Benchmark implementation.

from benchbox import TPCDS

tpcds = TPCDS(scale_factor=1.0, output_dir=None)

Properties

  • Query Count: 99 complex analytical queries

  • Tables: 24 tables (retail data warehouse schema)

  • Features: Window functions, CTEs, complex joins

  • Recommended Database: DuckDB

Primitives

Database Primitives Benchmark for focused operation testing.

from benchbox import ReadPrimitives

primitives = ReadPrimitives(scale_factor=1.0, output_dir=None)

Additional Methods

get_query_categories() -> List[str]

Get list of available query categories.

Returns: List of category names.

Example:

categories = primitives.get_query_categories()
# Returns: ["aggregation", "join", "filter", "sort", ...]

get_queries_by_category(category: str) -> Dict[str, str]

Get queries filtered by category.

Parameters:

  • category: Category name (“aggregation”, “join”, “filter”, etc.)

Returns: Dictionary mapping query IDs to SQL strings.

Example:

agg_queries = primitives.get_queries_by_category("aggregation")
join_queries = primitives.get_queries_by_category("join")

Other Benchmarks

  • SSB: Star Schema Benchmark

  • AMPLab: Big data benchmark

  • H2ODB: Data science benchmark

  • ClickBench: Analytical benchmark

  • JoinOrder: Join order optimization benchmark

  • TPCDI: Data integration benchmark

  • WritePrimitives: Write operation benchmark (INSERT, UPDATE, DELETE, MERGE, etc.)

  • TPCHavoc: TPC-H syntax variants

All benchmarks follow the same basic interface as shown above.


Configuration

Constructor Parameters

All benchmarks accept these parameters:

Parameter      Type             Default   Description
scale_factor   float            1.0       Dataset size multiplier
output_dir     Optional[Path]   None      Data output directory

Scale Factor Guidelines

Scale Factor   Data Size   Use Case
0.001          ~1 MB       Unit tests
0.01           ~10 MB      Development
0.1            ~100 MB     Integration tests
1.0            ~1 GB       Benchmarking
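
The figures in the table scale linearly, at roughly 1 GB per unit of scale factor. As a rough sketch under that assumption, the helpers below estimate a dataset size and pick a scale factor that fits a storage budget. Both function names are hypothetical, and actual sizes will vary by benchmark and file format.

```python
from typing import Sequence


def estimate_size_mb(scale_factor: float) -> float:
    """Approximate dataset size in MB, assuming ~1 GB (1000 MB) at scale factor 1.0."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return scale_factor * 1000.0


def largest_scale_factor_within(
    budget_mb: float,
    candidates: Sequence[float] = (0.001, 0.01, 0.1, 1.0),
) -> float:
    """Pick the largest candidate scale factor whose estimated size fits the budget."""
    fitting = [sf for sf in candidates if estimate_size_mb(sf) <= budget_mb]
    if not fitting:
        raise ValueError("no candidate scale factor fits the budget")
    return max(fitting)


for sf in (0.001, 0.01, 0.1, 1.0):
    print(f"scale_factor={sf}: ~{estimate_size_mb(sf):g} MB")
print(largest_scale_factor_within(250))
```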

Example Configuration

from pathlib import Path
from benchbox import TPCH

# Development configuration
tpch_dev = TPCH(
    scale_factor=0.01,
    output_dir=Path("./benchmark_data")
)

# Production configuration
tpch_prod = TPCH(
    scale_factor=1.0,
    output_dir=Path("/var/lib/benchbox")
)

Database Integration

Other Databases

For other databases, use SQL dialect translation:

# PostgreSQL (listed among the supported platforms)
postgres_query = tpch.translate_query(1, "postgres")

# MySQL (dialect translation only - adapter not yet available)
mysql_query = tpch.translate_query(1, "mysql")

# SQLite (fully supported)
sqlite_query = tpch.translate_query(1, "sqlite")

Note on Dialect Translation: BenchBox can translate queries to many SQL dialects via SQLGlot, but translation support does not imply that a platform adapter exists for connecting to that database. Currently supported platforms include DuckDB, SQLite, PostgreSQL, ClickHouse, Databricks SQL, BigQuery, Redshift, Snowflake, Trino, Presto, Amazon Athena, Firebolt, Azure Synapse Analytics, and Microsoft Fabric, among others. See the Platform Documentation for the full list.


Common Patterns

Basic Benchmark Execution

from benchbox import TPCH
import time

# Initialize
tpch = TPCH(scale_factor=0.1)

# Generate data
data_files = tpch.generate_data()

# Get and run queries
queries = tpch.get_queries()
for query_id, query_sql in list(queries.items())[:3]:  # First 3 queries
    start_time = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    # Execute query_sql here with your database connection
    execution_time = time.perf_counter() - start_time
    print(f"Query {query_id}: {execution_time:.3f}s")
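
The placeholder in the loop above can be filled in with any DB-API style connection. Below is a minimal sketch using the standard-library `sqlite3` module and `time.perf_counter`. The `queries` dictionary is a hypothetical stand-in for the result of get_queries(), and `time_queries` is not a BenchBox function.

```python
import sqlite3
import time
from typing import Dict, Union

QueryId = Union[int, str]


def time_queries(conn: sqlite3.Connection, queries: Dict[QueryId, str]) -> Dict[QueryId, float]:
    """Run each query once and return wall-clock seconds per query ID."""
    timings: Dict[QueryId, float] = {}
    for query_id, query_sql in queries.items():
        start = time.perf_counter()
        conn.execute(query_sql).fetchall()  # fetch so the full result is materialized
        timings[query_id] = time.perf_counter() - start
    return timings


# Hypothetical stand-in data and queries; real SQL would come from get_queries().
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
queries = {1: "SELECT COUNT(*) FROM t", "sum_x": "SELECT SUM(x) FROM t"}

timings = time_queries(conn, queries)
for query_id, seconds in timings.items():
    print(f"Query {query_id}: {seconds:.3f}s")
```

A single run per query is noisy; repeating each query and reporting the median is a common refinement.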

Multi-Database Testing

from benchbox import TPCH

tpch = TPCH(scale_factor=0.01)
data_files = tpch.generate_data()

# Test different SQL dialects (note: translation != platform adapter)
dialects = ["duckdb", "clickhouse", "sqlite"]  # Use supported platforms
for dialect in dialects:
    translated_query = tpch.translate_query(1, dialect)
    print(f"{dialect}: {len(translated_query)} chars")

This API reference covers the core BenchBox functionality. For implementation details, see the source code documentation.