API Reference¶
API documentation for BenchBox classes and methods. This page covers the most common entry points; the full autodoc catalog lives under the python-api/ section of the Sphinx build.
Core Classes¶
BaseBenchmark¶
Abstract base class that all benchmarks inherit from.
from benchbox.base import BaseBenchmark

class BaseBenchmark(ABC):
    def __init__(self, scale_factor: float = 1.0, output_dir: Optional[Path] = None):
        """Initialize benchmark with scale factor and output directory."""
Methods¶
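As a rough illustration of how a benchmark built on this interface might fit together, the sketch below defines a stand-in base class with the documented constructor signature and two of the documented methods. `SketchBenchmark` and `ToyBenchmark` are hypothetical names invented for this example and are not part of BenchBox; the default for `output_dir` and the `KeyError` behaviour are assumptions as well.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Optional, Union


class SketchBenchmark(ABC):
    """Hypothetical stand-in that mirrors the documented constructor."""

    def __init__(self, scale_factor: float = 1.0, output_dir: Optional[Path] = None):
        self.scale_factor = scale_factor
        # Assumption: fall back to the current directory when no path is given.
        self.output_dir = Path(output_dir) if output_dir is not None else Path.cwd()

    @abstractmethod
    def get_queries(self) -> Dict[Union[int, str], str]:
        """Return every query keyed by its ID."""

    def get_query(self, query_id: Union[int, str]) -> str:
        # Look up a single query; an unknown ID raises KeyError in this sketch.
        return self.get_queries()[query_id]


class ToyBenchmark(SketchBenchmark):
    """Toy subclass with two placeholder queries."""

    def get_queries(self) -> Dict[Union[int, str], str]:
        return {1: "SELECT 1", "two": "SELECT 2"}


toy = ToyBenchmark(scale_factor=0.1)
print(toy.get_query(1))
print(toy.get_query("two"))
```

Because `get_queries` is abstract, the stand-in base class cannot be instantiated directly; only concrete subclasses such as the toy one can.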
generate_data() -> List[Path]¶
Generate benchmark data files.
Returns: List of Path objects pointing to generated data files.
Example:
from benchbox import TPCH
tpch = TPCH(scale_factor=0.1)
data_files = tpch.generate_data()
# Returns: [Path("customer.tbl"), Path("orders.tbl"), Path("lineitem.tbl"), ...]
for file_path in data_files:
    table_name = file_path.stem
    print(f"Generated {table_name} at {file_path}")
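The loop above derives table names with `Path.stem`, which removes only the final suffix. A minimal sketch of a lookup table keyed by table name follows; the file list, the compressed `.tbl.gz` entry, and the `table_name` helper are hypothetical and only illustrate what happens if a file carries more than one suffix.

```python
from pathlib import Path

# Hypothetical file list standing in for the return value of generate_data().
data_files = [
    Path("out/customer.tbl"),
    Path("out/orders.tbl"),
    Path("out/lineitem.tbl.gz"),  # assumed compressed variant, for illustration only
]


def table_name(path: Path) -> str:
    """Return the file name up to its first dot."""
    return path.name.split(".", 1)[0]


# Path.stem would yield "lineitem.tbl" for the last entry, so the helper is used instead.
tables = {table_name(path): path for path in data_files}

for name, path in sorted(tables.items()):
    print(f"{name}: {path}")
```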
get_queries() -> Dict[Union[int, str], str]¶
Get all queries for this benchmark.
Returns: Dictionary mapping query IDs to SQL query strings.
Example:
queries = tpch.get_queries()
# Returns: {1: "SELECT l_returnflag...", 2: "SELECT s_acctbal...", ...}
get_query(query_id: Union[int, str]) -> str¶
Get a specific query by ID.
Parameters:
query_id: Query identifier (integer or string)
Returns: SQL query string.
Example:
# Basic query
query_1 = tpch.get_query(1)
# String-based query ID (for some benchmarks)
primitives_query = primitives.get_query("aggregation_basic")
translate_query(query_id: Union[int, str], dialect: str) -> str¶
Translate query to specific SQL dialect.
Parameters:
query_id: Query identifier
dialect: Target SQL dialect (“postgres”, “mysql”, “sqlite”, “duckdb”, etc.)
Returns: Translated SQL query string.
Example:
postgres_query = tpch.translate_query(1, "postgres")
mysql_query = tpch.translate_query(1, "mysql")
duckdb_query = tpch.translate_query(1, "duckdb") # Recommended default
get_create_tables_sql() -> str¶
Get DDL statements to create benchmark tables.
Returns: DDL statements as string in standard SQL format.
Example:
ddl = tpch.get_create_tables_sql()
# Use with DuckDB (recommended)
import duckdb
conn = duckdb.connect(":memory:")
conn.execute(ddl)
get_schema() -> List[Dict[str, Any]]¶
Get benchmark schema information.
Returns: List of dictionaries describing tables and columns.
Example:
schema = tpch.get_schema()
for table in schema:
    print(f"Table: {table['name']}")
    for column in table['columns']:
        print(f"  {column['name']}: {column['type']}")
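To show how a schema in this shape could be consumed, here is a small sketch that turns a list of table dictionaries into `CREATE TABLE` statements and runs them against the standard-library `sqlite3` module. The sample schema and the `schema_to_ddl` helper are hypothetical; only the `name`, `columns`, and `type` keys are taken from the example above.

```python
import sqlite3
from typing import Any, Dict, List

# Hypothetical schema in the shape iterated over in the example above.
schema: List[Dict[str, Any]] = [
    {
        "name": "region",
        "columns": [
            {"name": "r_regionkey", "type": "INTEGER"},
            {"name": "r_name", "type": "VARCHAR(25)"},
        ],
    },
]


def schema_to_ddl(tables: List[Dict[str, Any]]) -> str:
    """Build one CREATE TABLE statement per table description."""
    statements = []
    for table in tables:
        columns = ", ".join(f"{col['name']} {col['type']}" for col in table["columns"])
        statements.append(f"CREATE TABLE {table['name']} ({columns});")
    return "\n".join(statements)


ddl = schema_to_ddl(schema)
conn = sqlite3.connect(":memory:")
conn.executescript(ddl)  # executescript accepts several statements at once
created = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print(ddl)
```

In practice `get_create_tables_sql()` already returns DDL, so a helper like this is mainly useful for inspecting or adapting the schema.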
Benchmark Classes¶
TPCH¶
TPC-H Decision Support Benchmark implementation.
from benchbox import TPCH
tpch = TPCH(scale_factor=1.0, output_dir=None)
Properties¶
Query Count: 22 analytical queries (Q1-Q22)
Tables: 8 tables (customer, orders, lineitem, part, partsupp, supplier, nation, region)
Scale Factors: 0.001 to 1000+
Recommended Database: DuckDB
Example Usage¶
# Recommended DuckDB integration
import duckdb
from benchbox import TPCH
conn = duckdb.connect(":memory:")
tpch = TPCH(scale_factor=0.1)
# Generate and load data
data_files = tpch.generate_data()
for file_path in data_files:
    table_name = file_path.stem
    conn.execute(f"""
        CREATE TABLE {table_name} AS
        SELECT * FROM read_csv('{file_path}', delimiter='|', header=false)
    """)
# Run queries
result = conn.execute(tpch.get_query(1)).fetchall()
TPCDS¶
TPC-DS Decision Support Benchmark implementation.
from benchbox import TPCDS
tpcds = TPCDS(scale_factor=1.0, output_dir=None)
Properties¶
Query Count: 99 complex analytical queries
Tables: 24 tables (retail data warehouse schema)
Features: Window functions, CTEs, complex joins
Recommended Database: DuckDB
Primitives¶
Database Primitives Benchmark for focused operation testing.
from benchbox import ReadPrimitives
primitives = ReadPrimitives(scale_factor=1.0, output_dir=None)
Additional Methods¶
get_query_categories() -> List[str]¶
Get list of available query categories.
Returns: List of category names.
Example:
categories = primitives.get_query_categories()
# Returns: ["aggregation", "join", "filter", "sort", ...]
get_queries_by_category(category: str) -> Dict[str, str]¶
Get queries filtered by category.
Parameters:
category: Category name (“aggregation”, “join”, “filter”, etc.)
Returns: Dictionary mapping query IDs to SQL strings.
Example:
agg_queries = primitives.get_queries_by_category("aggregation")
join_queries = primitives.get_queries_by_category("join")
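For intuition, category filtering can be pictured as grouping a query dictionary by category. The sketch below assumes, purely for illustration, that each query ID starts with its category name followed by an underscore (as in "aggregation_basic"); the sample queries are placeholders and the actual BenchBox implementation may organize categories differently.

```python
from collections import defaultdict
from typing import Dict

# Placeholder queries; the ID naming scheme is an assumption of this sketch.
queries: Dict[str, str] = {
    "aggregation_basic": "SELECT COUNT(*) FROM t",
    "aggregation_grouped": "SELECT k, COUNT(*) FROM t GROUP BY k",
    "join_inner": "SELECT * FROM a JOIN b ON a.id = b.id",
}

by_category: Dict[str, Dict[str, str]] = defaultdict(dict)
for query_id, sql in queries.items():
    # Everything before the first underscore is treated as the category.
    category = query_id.split("_", 1)[0]
    by_category[category][query_id] = sql

print(sorted(by_category))
print(sorted(by_category["aggregation"]))
```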
Other Benchmarks¶
SSB: Star Schema Benchmark
AMPLab: Big data benchmark
H2ODB: Data science benchmark
ClickBench: Analytical benchmark
JoinOrder: Join order optimization benchmark
TPCDI: Data integration benchmark
WritePrimitives: Write operation benchmark (INSERT, UPDATE, DELETE, MERGE, etc.)
TPCHavoc: TPC-H syntax variants
All benchmarks follow the same basic interface as shown above.
Configuration¶
Constructor Parameters¶
All benchmarks accept these parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| scale_factor | float | 1.0 | Dataset size multiplier |
| output_dir | Optional[Path] | None | Data output directory |
Scale Factor Guidelines¶
| Scale Factor | Data Size | Use Case |
|---|---|---|
| 0.001 | ~1 MB | Unit tests |
| 0.01 | ~10 MB | Development |
| 0.1 | ~100 MB | Integration tests |
| 1.0 | ~1 GB | Benchmarking |
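The sizes in the table grow linearly with the scale factor, at roughly 1 GB per unit. A back-of-the-envelope estimate can be sketched as follows; `estimate_size_mb` is a hypothetical helper, and the 1000 MB figure is only the approximation implied by the table, not a measured value.

```python
def estimate_size_mb(scale_factor: float, mb_per_unit: float = 1000.0) -> float:
    """Rough dataset size in MB, assuming linear growth with the scale factor."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return scale_factor * mb_per_unit


for sf in (0.001, 0.01, 0.1, 1.0):
    print(f"scale_factor={sf}: ~{estimate_size_mb(sf):.0f} MB")
```

Actual sizes depend on the benchmark and the output format, so an estimate like this is best used only for coarse disk-space planning.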
Example Configuration¶
from pathlib import Path
from benchbox import TPCH
# Development configuration
tpch_dev = TPCH(
    scale_factor=0.01,
    output_dir=Path("./benchmark_data")
)
# Production configuration
tpch_prod = TPCH(
    scale_factor=1.0,
    output_dir=Path("/var/lib/benchbox")
)
Database Integration¶
DuckDB (Recommended)¶
DuckDB is the recommended database for BenchBox:
import duckdb
from benchbox import TPCH
# Setup
conn = duckdb.connect(":memory:")
tpch = TPCH(scale_factor=0.1)
# Load data
data_files = tpch.generate_data()
ddl = tpch.get_create_tables_sql()
conn.execute(ddl)
for file_path in data_files:
    table_name = file_path.stem
    conn.execute(f"""
        INSERT INTO {table_name}
        SELECT * FROM read_csv('{file_path}', delimiter='|', header=false)
    """)
# Run queries (DuckDB uses ANSI SQL by default)
for query_id in range(1, 6):  # First 5 queries
    result = conn.execute(tpch.get_query(query_id)).fetchall()
    print(f"Query {query_id}: {len(result)} rows")
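The same load step can also be sketched with only the standard library, which may help when experimenting without DuckDB installed. The example below writes a tiny pipe-delimited sample file and loads it into an in-memory SQLite database. The sample rows, the table definition, and the trailing "|" on each line are assumptions modelled on classic TPC-H dbgen output; the files BenchBox generates may differ.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Write a tiny pipe-delimited sample file (hypothetical data).
tmp_dir = Path(tempfile.mkdtemp())
sample = tmp_dir / "region.tbl"
sample.write_text("0|AFRICA|\n1|AMERICA|\n", encoding="utf-8")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE region (r_regionkey INTEGER, r_name TEXT)")

with sample.open(newline="", encoding="utf-8") as handle:
    # The trailing delimiter yields an empty third field, so keep the first two only.
    rows = [row[:2] for row in csv.reader(handle, delimiter="|")]

conn.executemany("INSERT INTO region VALUES (?, ?)", rows)
loaded = conn.execute(
    "SELECT r_regionkey, r_name FROM region ORDER BY r_regionkey"
).fetchall()
print(loaded)
```

Parameter binding with `executemany` keeps the file contents out of the SQL text, which is why the values are not interpolated into the statement.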
Other Databases¶
For other databases, use SQL dialect translation:
# PostgreSQL (listed as a supported platform in the note below)
postgres_query = tpch.translate_query(1, "postgres")
# MySQL (dialect translation only - adapter not yet available)
mysql_query = tpch.translate_query(1, "mysql")
# SQLite (fully supported)
sqlite_query = tpch.translate_query(1, "sqlite")
Note on Dialect Translation: BenchBox can translate queries to many SQL dialects via SQLGlot, but this doesn’t mean platform adapters exist for connecting to those databases. Currently supported platforms include: DuckDB, SQLite, PostgreSQL, ClickHouse, Databricks SQL, BigQuery, Redshift, Snowflake, Trino, Presto, Amazon Athena, Firebolt, Azure Synapse Analytics, Microsoft Fabric, and more. See the Platform Documentation for the full list.
Common Patterns¶
Basic Benchmark Execution¶
from benchbox import TPCH
import time
# Initialize
tpch = TPCH(scale_factor=0.1)
# Generate data
data_files = tpch.generate_data()
# Get and run queries
queries = tpch.get_queries()
for query_id, query_sql in list(queries.items())[:3]:  # First 3 queries
    start_time = time.perf_counter()  # monotonic clock, suited to measuring intervals
    # Execute query_sql with your database connection here
    execution_time = time.perf_counter() - start_time
    print(f"Query {query_id}: {execution_time:.3f}s")
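The timing loop above leaves the execution step open. One way to fill it in is sketched below with an in-memory SQLite database from the standard library; the table `t` and the two queries are placeholders standing in for generated data and for the dictionary returned by `get_queries()`.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])

# Placeholder queries; a real run would iterate over the benchmark's own queries.
queries = {1: "SELECT COUNT(*) FROM t", 2: "SELECT SUM(x) FROM t"}

results = {}
timings = {}
for query_id, query_sql in queries.items():
    start = time.perf_counter()
    # fetchall() is inside the timed region so result materialization is included.
    results[query_id] = conn.execute(query_sql).fetchall()
    timings[query_id] = time.perf_counter() - start
    print(f"Query {query_id}: {timings[query_id]:.6f}s, {len(results[query_id])} rows")
```

A single run per query is noisy; repeating each query several times and reporting the median is a common refinement.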
Multi-Database Testing¶
from benchbox import TPCH
tpch = TPCH(scale_factor=0.01)
data_files = tpch.generate_data()
# Test different SQL dialects (note: translation != platform adapter)
dialects = ["duckdb", "clickhouse", "sqlite"] # Use supported platforms
for dialect in dialects:
    translated_query = tpch.translate_query(1, dialect)
    print(f"{dialect}: {len(translated_query)} chars")
See Also¶
Getting Started - Basic usage tutorial
Examples - Practical code examples
Configuration - Configuration guide
Benchmarks - Individual benchmark documentation
This API reference covers the core BenchBox functionality. For implementation details, see the source code documentation.