API Reference

Tags: reference, python-api

API documentation for the core BenchBox classes and methods. This page highlights the most common entry points; the full autodoc catalog lives under the python-api/ section of the Sphinx build.

Core Classes

BaseBenchmark

Base abstract class that all benchmarks inherit from.

from abc import ABC
from pathlib import Path
from typing import Optional

from benchbox.base import BaseBenchmark

# Abridged signature for reference:
class BaseBenchmark(ABC):
    def __init__(self, scale_factor: float = 1.0, output_dir: Optional[Path] = None):
        """Initialize benchmark with scale factor and output directory."""

Methods

generate_data() -> List[Path]

Generate benchmark data files.

Returns: List of Path objects pointing to generated data files.

Example:

from benchbox import TPCH

tpch = TPCH(scale_factor=0.1)
data_files = tpch.generate_data()
# Returns: [Path("customer.tbl"), Path("orders.tbl"), Path("lineitem.tbl"), ...]
for file_path in data_files:
    table_name = file_path.stem
    print(f"Generated {table_name} at {file_path}")

get_queries() -> Dict[Union[int, str], str]

Get all queries for this benchmark.

Returns: Dictionary mapping query IDs to SQL query strings.

Example:

queries = tpch.get_queries()
# Returns: {1: "SELECT l_returnflag...", 2: "SELECT s_acctbal...", ...}

get_query(query_id: Union[int, str]) -> str

Get a specific query by ID.

Parameters:

  • query_id: Query identifier (integer or string)

Returns: SQL query string.

Example:

# Basic query
query_1 = tpch.get_query(1)

# String-based query IDs are used by some benchmarks (e.g. a ReadPrimitives instance)
primitives_query = primitives.get_query("aggregation_basic")

translate_query(query_id: Union[int, str], dialect: str) -> str

Translate query to specific SQL dialect.

Parameters:

  • query_id: Query identifier

  • dialect: Target SQL dialect ("postgres", "mysql", "sqlite", "duckdb", etc.)

Returns: Translated SQL query string.

Example:

postgres_query = tpch.translate_query(1, "postgres")
mysql_query = tpch.translate_query(1, "mysql")
duckdb_query = tpch.translate_query(1, "duckdb")  # Recommended default

get_create_tables_sql() -> str

Get DDL statements to create benchmark tables.

Returns: DDL statements as string in standard SQL format.

Example:

ddl = tpch.get_create_tables_sql()
# Use with DuckDB (recommended)
import duckdb
conn = duckdb.connect(":memory:")
conn.execute(ddl)

get_schema() -> List[Dict[str, Any]]

Get benchmark schema information.

Returns: List of dictionaries describing tables and columns.

Example:

schema = tpch.get_schema()
for table in schema:
    print(f"Table: {table['name']}")
    for column in table['columns']:
        print(f"  {column['name']}: {column['type']}")

Benchmark Classes

TPCH

TPC-H Decision Support Benchmark implementation.

from benchbox import TPCH

tpch = TPCH(scale_factor=1.0, output_dir=None)

Properties

  • Query Count: 22 analytical queries (Q1-Q22)

  • Tables: 8 tables (customer, orders, lineitem, part, partsupp, supplier, nation, region)

  • Scale Factors: 0.001 to 1000+

  • Recommended Database: DuckDB

Example Usage

# Recommended DuckDB integration
import duckdb
from benchbox import TPCH

conn = duckdb.connect(":memory:")
tpch = TPCH(scale_factor=0.1)

# Generate and load data
data_files = tpch.generate_data()
for file_path in data_files:
    table_name = file_path.stem
    conn.execute(f"""
        CREATE TABLE {table_name} AS
        SELECT * FROM read_csv('{file_path}', delimiter='|', header=false)
    """)

# Run queries
result = conn.execute(tpch.get_query(1)).fetchall()

TPCDS

TPC-DS Decision Support Benchmark implementation.

from benchbox import TPCDS

tpcds = TPCDS(scale_factor=1.0, output_dir=None)

Properties

  • Query Count: 99 complex analytical queries

  • Tables: 24 tables (retail data warehouse schema)

  • Features: Window functions, CTEs, complex joins

  • Recommended Database: DuckDB

Primitives

Database Primitives Benchmark for focused operation testing.

from benchbox import ReadPrimitives

primitives = ReadPrimitives(scale_factor=1.0, output_dir=None)

Additional Methods

get_query_categories() -> List[str]

Get list of available query categories.

Returns: List of category names.

Example:

categories = primitives.get_query_categories()
# Returns: ["aggregation", "join", "filter", "sort", ...]

get_queries_by_category(category: str) -> Dict[str, str]

Get queries filtered by category.

Parameters:

  • category: Category name ("aggregation", "join", "filter", etc.)

Returns: Dictionary mapping query IDs to SQL strings.

Example:

agg_queries = primitives.get_queries_by_category("aggregation")
join_queries = primitives.get_queries_by_category("join")

Other Benchmarks

  • SSB: Star Schema Benchmark

  • AMPLab: Big data benchmark

  • H2ODB: Data science benchmark

  • ClickBench: Analytical benchmark

  • JoinOrder: Join order optimization benchmark

  • TPCDI: Data integration benchmark

  • WritePrimitives: Write operation benchmark (INSERT, UPDATE, DELETE, MERGE, etc.)

  • TPCHavoc: TPC-H syntax variants

All benchmarks follow the same basic interface as shown above.


Configuration

Constructor Parameters

All benchmarks accept these parameters:

Parameter      Type             Default   Description
scale_factor   float            1.0       Dataset size multiplier
output_dir     Optional[Path]   None      Data output directory

Scale Factor Guidelines

Scale Factor   Data Size   Use Case
0.001          ~1MB        Unit tests
0.01           ~10MB       Development
0.1            ~100MB      Integration tests
1.0            ~1GB        Benchmarking

Example Configuration

from pathlib import Path
from benchbox import TPCH

# Development configuration
tpch_dev = TPCH(
    scale_factor=0.01,
    output_dir=Path("./benchmark_data")
)

# Production configuration
tpch_prod = TPCH(
    scale_factor=1.0,
    output_dir=Path("/var/lib/benchbox")
)

Database Integration

Other Databases

For databases without a native platform adapter, use SQL dialect translation:

# PostgreSQL (dialect translation only - adapter not yet available)
postgres_query = tpch.translate_query(1, "postgres")

# MySQL (dialect translation only - adapter not yet available)
mysql_query = tpch.translate_query(1, "mysql")

# SQLite (fully supported)
sqlite_query = tpch.translate_query(1, "sqlite")

Note on Dialect Translation: BenchBox can translate queries to many SQL dialects via SQLGlot, but this doesn’t mean platform adapters exist for connecting to those databases. Currently supported platforms: DuckDB, ClickHouse, Databricks, BigQuery, Redshift, Snowflake, SQLite. See Future Platforms for roadmap.


Common Patterns

Basic Benchmark Execution

from benchbox import TPCH
import time

# Initialize
tpch = TPCH(scale_factor=0.1)

# Generate data
data_files = tpch.generate_data()

# Get and run queries
queries = tpch.get_queries()
for query_id, query_sql in list(queries.items())[:3]:  # First 3 queries
    start_time = time.time()
    # Execute query with your database connection
    execution_time = time.time() - start_time
    print(f"Query {query_id}: {execution_time:.3f}s")

Multi-Database Testing

from benchbox import TPCH

tpch = TPCH(scale_factor=0.01)
data_files = tpch.generate_data()

# Test different SQL dialects (note: translation != platform adapter)
dialects = ["duckdb", "clickhouse", "sqlite"]  # Use supported platforms
for dialect in dialects:
    translated_query = tpch.translate_query(1, dialect)
    print(f"{dialect}: {len(translated_query)} chars")


This API reference covers the core BenchBox functionality. For implementation details, see the source code documentation.