API Reference¶
Complete API documentation for BenchBox classes and methods. This page highlights the most common entry points; the full autodoc catalog lives under the python-api/ section of the Sphinx build.
Core Classes¶
BaseBenchmark¶
Abstract base class that all benchmarks inherit from.
from benchbox.base import BaseBenchmark
class BaseBenchmark(ABC):
    def __init__(self, scale_factor: float = 1.0, output_dir: Optional[Path] = None):
        """Initialize benchmark with scale factor and output directory."""
Methods¶
generate_data() -> List[Path]¶
Generate benchmark data files.
Returns: List of Path objects pointing to generated data files.
Example:
from benchbox import TPCH
tpch = TPCH(scale_factor=0.1)
data_files = tpch.generate_data()
# Returns: [Path("customer.tbl"), Path("orders.tbl"), Path("lineitem.tbl"), ...]
for file_path in data_files:
    table_name = file_path.stem
    print(f"Generated {table_name} at {file_path}")
get_queries() -> Dict[Union[int, str], str]¶
Get all queries for this benchmark.
Returns: Dictionary mapping query IDs to SQL query strings.
Example:
queries = tpch.get_queries()
# Returns: {1: "SELECT l_returnflag...", 2: "SELECT s_acctbal...", ...}
get_query(query_id: Union[int, str]) -> str¶
Get a specific query by ID.
Parameters:
query_id: Query identifier (integer or string)
Returns: SQL query string.
Example:
# Basic query
query_1 = tpch.get_query(1)
# String-based query ID (for some benchmarks)
primitives_query = primitives.get_query("aggregation_basic")
translate_query(query_id: Union[int, str], dialect: str) -> str¶
Translate query to specific SQL dialect.
Parameters:
query_id: Query identifier
dialect: Target SQL dialect ("postgres", "mysql", "sqlite", "duckdb", etc.)
Returns: Translated SQL query string.
Example:
postgres_query = tpch.translate_query(1, "postgres")
mysql_query = tpch.translate_query(1, "mysql")
duckdb_query = tpch.translate_query(1, "duckdb") # Recommended default
get_create_tables_sql() -> str¶
Get DDL statements to create benchmark tables.
Returns: DDL statements as a string in standard SQL format.
Example:
ddl = tpch.get_create_tables_sql()
# Use with DuckDB (recommended)
import duckdb
conn = duckdb.connect(":memory:")
conn.execute(ddl)
get_schema() -> List[Dict[str, Any]]¶
Get benchmark schema information.
Returns: List of dictionaries describing tables and columns.
Example:
schema = tpch.get_schema()
for table in schema:
    print(f"Table: {table['name']}")
    for column in table['columns']:
        print(f"  {column['name']}: {column['type']}")
Benchmark Classes¶
TPCH¶
TPC-H Decision Support Benchmark implementation.
from benchbox import TPCH
tpch = TPCH(scale_factor=1.0, output_dir=None)
Properties¶
Query Count: 22 analytical queries (Q1-Q22)
Tables: 8 tables (customer, orders, lineitem, part, partsupp, supplier, nation, region)
Scale Factors: 0.001 to 1000+
Recommended Database: DuckDB
Example Usage¶
# Recommended DuckDB integration
import duckdb
from benchbox import TPCH
conn = duckdb.connect(":memory:")
tpch = TPCH(scale_factor=0.1)
# Generate and load data
data_files = tpch.generate_data()
for file_path in data_files:
    table_name = file_path.stem
    conn.execute(f"""
        CREATE TABLE {table_name} AS
        SELECT * FROM read_csv('{file_path}', delimiter='|', header=false)
    """)
# Run queries
result = conn.execute(tpch.get_query(1)).fetchall()
TPCDS¶
TPC-DS Decision Support Benchmark implementation.
from benchbox import TPCDS
tpcds = TPCDS(scale_factor=1.0, output_dir=None)
Properties¶
Query Count: 99 complex analytical queries
Tables: 24 tables (retail data warehouse schema)
Features: Window functions, CTEs, complex joins
Recommended Database: DuckDB
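TPCDS exposes the same interface as TPCH. A minimal DuckDB sketch, following the TPC-H pattern above (it assumes TPC-DS data files are likewise pipe-delimited, which is not confirmed here):
import duckdb
from benchbox import TPCDS

conn = duckdb.connect(":memory:")
tpcds = TPCDS(scale_factor=0.01)

# Create tables, then load each generated file into the table named after it
conn.execute(tpcds.get_create_tables_sql())
for file_path in tpcds.generate_data():
    conn.execute(f"""
        INSERT INTO {file_path.stem}
        SELECT * FROM read_csv('{file_path}', delimiter='|', header=false)
    """)

result = conn.execute(tpcds.get_query(1)).fetchall()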
Primitives¶
Database Primitives Benchmark for focused operation testing.
from benchbox import ReadPrimitives
primitives = ReadPrimitives(scale_factor=1.0, output_dir=None)
Additional Methods¶
get_query_categories() -> List[str]¶
Get list of available query categories.
Returns: List of category names.
Example:
categories = primitives.get_query_categories()
# Returns: ["aggregation", "join", "filter", "sort", ...]
get_queries_by_category(category: str) -> Dict[str, str]¶
Get queries filtered by category.
Parameters:
category: Category name ("aggregation", "join", "filter", etc.)
Returns: Dictionary mapping query IDs to SQL strings.
Example:
agg_queries = primitives.get_queries_by_category("aggregation")
join_queries = primitives.get_queries_by_category("join")
Other Benchmarks¶
SSB: Star Schema Benchmark
AMPLab: Big data benchmark
H2ODB: Data science benchmark
ClickBench: Analytical benchmark
JoinOrder: Join order optimization benchmark
TPCDI: Data integration benchmark
WritePrimitives: Write operation benchmark (INSERT, UPDATE, DELETE, MERGE, etc.)
TPCHavoc: TPC-H syntax variants
All benchmarks follow the same basic interface as shown above.
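For example, assuming these classes are exported from the top-level benchbox package like TPCH and TPCDS (the import path is an assumption), they can be driven uniformly:
from benchbox import SSB, ClickBench  # assumed top-level exports

for benchmark_cls in (SSB, ClickBench):
    bench = benchmark_cls(scale_factor=0.01)
    print(f"{benchmark_cls.__name__}: {len(bench.get_queries())} queries")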
Configuration¶
Constructor Parameters¶
All benchmarks accept these parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| scale_factor | float | 1.0 | Dataset size multiplier |
| output_dir | Optional[Path] | None | Data output directory |
Scale Factor Guidelines¶
| Scale Factor | Data Size | Use Case |
|---|---|---|
| 0.001 | ~1MB | Unit tests |
| 0.01 | ~10MB | Development |
| 0.1 | ~100MB | Integration tests |
| 1.0 | ~1GB | Benchmarking |
Example Configuration¶
from pathlib import Path
from benchbox import TPCH
# Development configuration
tpch_dev = TPCH(
    scale_factor=0.01,
    output_dir=Path("./benchmark_data")
)
# Production configuration
tpch_prod = TPCH(
    scale_factor=1.0,
    output_dir=Path("/var/lib/benchbox")
)
Database Integration¶
DuckDB (Recommended)¶
DuckDB is the recommended database for BenchBox:
import duckdb
from benchbox import TPCH
# Setup
conn = duckdb.connect(":memory:")
tpch = TPCH(scale_factor=0.1)
# Load data
data_files = tpch.generate_data()
ddl = tpch.get_create_tables_sql()
conn.execute(ddl)
for file_path in data_files:
    table_name = file_path.stem
    conn.execute(f"""
        INSERT INTO {table_name}
        SELECT * FROM read_csv('{file_path}', delimiter='|', header=false)
    """)
# Run queries (DuckDB uses ANSI SQL by default)
for query_id in range(1, 6):  # First 5 queries
    result = conn.execute(tpch.get_query(query_id)).fetchall()
    print(f"Query {query_id}: {len(result)} rows")
Other Databases¶
For other databases, use SQL dialect translation:
# PostgreSQL (dialect translation only - adapter not yet available)
postgres_query = tpch.translate_query(1, "postgres")
# MySQL (dialect translation only - adapter not yet available)
mysql_query = tpch.translate_query(1, "mysql")
# SQLite (fully supported)
sqlite_query = tpch.translate_query(1, "sqlite")
Note on Dialect Translation: BenchBox can translate queries to many SQL dialects via SQLGlot, but this doesn’t mean platform adapters exist for connecting to those databases. Currently supported platforms: DuckDB, ClickHouse, Databricks, BigQuery, Redshift, Snowflake, SQLite. See Future Platforms for roadmap.
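For a platform without an adapter, one workable pattern is to translate the full query set and run the SQL through your own client. A minimal sketch using only the documented methods:
from pathlib import Path

out_dir = Path("translated_queries")
out_dir.mkdir(exist_ok=True)
for query_id in tpch.get_queries():
    sql = tpch.translate_query(query_id, "postgres")
    # Execute with your own PostgreSQL client (e.g. psycopg); no BenchBox adapter yet
    (out_dir / f"q{query_id}.sql").write_text(sql)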
Common Patterns¶
Basic Benchmark Execution¶
import time

import duckdb
from benchbox import TPCH

# Initialize and generate data
tpch = TPCH(scale_factor=0.1)
data_files = tpch.generate_data()

# Load into DuckDB, the recommended engine (see Database Integration above)
conn = duckdb.connect(":memory:")
conn.execute(tpch.get_create_tables_sql())
for file_path in data_files:
    conn.execute(f"""
        INSERT INTO {file_path.stem}
        SELECT * FROM read_csv('{file_path}', delimiter='|', header=false)
    """)

# Get and time queries
queries = tpch.get_queries()
for query_id, query_sql in list(queries.items())[:3]:  # First 3 queries
    start_time = time.time()
    conn.execute(query_sql).fetchall()
    execution_time = time.time() - start_time
    print(f"Query {query_id}: {execution_time:.3f}s")
Multi-Database Testing¶
from benchbox import TPCH
tpch = TPCH(scale_factor=0.01)
data_files = tpch.generate_data()
# Test different SQL dialects (note: translation != platform adapter)
dialects = ["duckdb", "clickhouse", "sqlite"] # Use supported platforms
for dialect in dialects:
    translated_query = tpch.translate_query(1, dialect)
    print(f"{dialect}: {len(translated_query)} chars")
See Also¶
Getting Started - Basic usage tutorial
Examples - Practical code examples
Configuration - Configuration guide
Benchmarks - Individual benchmark documentation
This API reference covers the core BenchBox functionality. For implementation details, see the source code documentation.