API Reference¶
Complete API documentation for BenchBox classes and methods. This page highlights the most common entry points; the full autodoc catalog lives under the python-api/ section of the Sphinx build.
Core Classes¶
BaseBenchmark¶
Abstract base class that all benchmarks inherit from.
from benchbox.base import BaseBenchmark
class BaseBenchmark(ABC):
    def __init__(self, scale_factor: float = 1.0, output_dir: Optional[Path] = None):
        """Initialize benchmark with scale factor and output directory."""
Methods¶
generate_data() -> List[Path]¶
Generate benchmark data files.
Returns: List of Path objects pointing to generated data files.
Example:
from benchbox import TPCH
tpch = TPCH(scale_factor=0.1)
data_files = tpch.generate_data()
# Returns: [Path("customer.tbl"), Path("orders.tbl"), Path("lineitem.tbl"), ...]
for file_path in data_files:
    table_name = file_path.stem
    print(f"Generated {table_name} at {file_path}")
get_queries() -> Dict[Union[int, str], str]¶
Get all queries for this benchmark.
Returns: Dictionary mapping query IDs to SQL query strings.
Example:
queries = tpch.get_queries()
# Returns: {1: "SELECT l_returnflag...", 2: "SELECT s_acctbal...", ...}
get_query(query_id: Union[int, str]) -> str¶
Get a specific query by ID.
Parameters:
query_id: Query identifier (integer or string)
Returns: SQL query string.
Example:
# Basic query
query_1 = tpch.get_query(1)
# String-based query ID (for some benchmarks)
primitives_query = primitives.get_query("aggregation_basic")
translate_query(query_id: Union[int, str], dialect: str) -> str¶
Translate query to specific SQL dialect.
Parameters:
query_id: Query identifier
dialect: Target SQL dialect ("postgres", "mysql", "sqlite", "duckdb", etc.)
Returns: Translated SQL query string.
Example:
postgres_query = tpch.translate_query(1, "postgres")
mysql_query = tpch.translate_query(1, "mysql")
duckdb_query = tpch.translate_query(1, "duckdb") # Recommended default
get_create_tables_sql() -> str¶
Get DDL statements to create benchmark tables.
Returns: DDL statements as a string in standard SQL format.
Example:
ddl = tpch.get_create_tables_sql()
# Use with DuckDB (recommended)
import duckdb
conn = duckdb.connect(":memory:")
conn.execute(ddl)
get_schema() -> List[Dict[str, Any]]¶
Get benchmark schema information.
Returns: List of dictionaries describing tables and columns.
Example:
schema = tpch.get_schema()
for table in schema:
    print(f"Table: {table['name']}")
    for column in table['columns']:
        print(f"  {column['name']}: {column['type']}")
Benchmark Classes¶
TPCH¶
TPC-H Decision Support Benchmark implementation.
from benchbox import TPCH
tpch = TPCH(scale_factor=1.0, output_dir=None)
Properties¶
Query Count: 22 analytical queries (Q1-Q22)
Tables: 8 tables (customer, orders, lineitem, part, partsupp, supplier, nation, region)
Scale Factors: 0.001 to 1000+
Recommended Database: DuckDB
Example Usage¶
# Recommended DuckDB integration
import duckdb
from benchbox import TPCH
conn = duckdb.connect(":memory:")
tpch = TPCH(scale_factor=0.1)
# Generate and load data
data_files = tpch.generate_data()
for file_path in data_files:
    table_name = file_path.stem
    conn.execute(f"""
        CREATE TABLE {table_name} AS
        SELECT * FROM read_csv('{file_path}', delimiter='|', header=false)
    """)
# Run queries
result = conn.execute(tpch.get_query(1)).fetchall()
TPCDS¶
TPC-DS Decision Support Benchmark implementation.
from benchbox import TPCDS
tpcds = TPCDS(scale_factor=1.0, output_dir=None)
Properties¶
Query Count: 99 complex analytical queries
Tables: 24 tables (retail data warehouse schema)
Features: Window functions, CTEs, complex joins
Recommended Database: DuckDB
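TPCDS exposes the same interface as TPCH. A minimal DuckDB sketch, following the TPC-H pattern above (it assumes TPC-DS data files are likewise pipe-delimited, which is not confirmed here):
import duckdb
from benchbox import TPCDS

conn = duckdb.connect(":memory:")
tpcds = TPCDS(scale_factor=0.01)

# Create tables, then load each generated file into the table named after it
conn.execute(tpcds.get_create_tables_sql())
for file_path in tpcds.generate_data():
    conn.execute(f"""
        INSERT INTO {file_path.stem}
        SELECT * FROM read_csv('{file_path}', delimiter='|', header=false)
    """)

result = conn.execute(tpcds.get_query(1)).fetchall()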
Primitives¶
Database Primitives Benchmark for focused operation testing.
from benchbox import ReadPrimitives
primitives = ReadPrimitives(scale_factor=1.0, output_dir=None)
Additional Methods¶
get_query_categories() -> List[str]¶
Get list of available query categories.
Returns: List of category names.
Example:
categories = primitives.get_query_categories()
# Returns: ["aggregation", "join", "filter", "sort", ...]
get_queries_by_category(category: str) -> Dict[str, str]¶
Get queries filtered by category.
Parameters:
category: Category name ("aggregation", "join", "filter", etc.)
Returns: Dictionary mapping query IDs to SQL strings.
Example:
agg_queries = primitives.get_queries_by_category("aggregation")
join_queries = primitives.get_queries_by_category("join")
Other Benchmarks¶
SSB: Star Schema Benchmark
AMPLab: Big data benchmark
H2ODB: Data science benchmark
ClickBench: Analytical benchmark
JoinOrder: Join order optimization benchmark
TPCDI: Data integration benchmark
WritePrimitives: Write operation benchmark (INSERT, UPDATE, DELETE, MERGE, etc.)
TPCHavoc: TPC-H syntax variants
All benchmarks follow the same basic interface as shown above.
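For example, assuming these classes are exported from the top-level benchbox package like TPCH and TPCDS (the import path is an assumption), they can be driven uniformly:
from benchbox import SSB, ClickBench  # assumed top-level exports

for benchmark_cls in (SSB, ClickBench):
    bench = benchmark_cls(scale_factor=0.01)
    print(f"{benchmark_cls.__name__}: {len(bench.get_queries())} queries")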
Configuration¶
Constructor Parameters¶
All benchmarks accept these parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| scale_factor | float | 1.0 | Dataset size multiplier |
| output_dir | Optional[Path] | None | Data output directory |
Scale Factor Guidelines¶
| Scale Factor | Data Size | Use Case |
|---|---|---|
| 0.001 | ~1MB | Unit tests |
| 0.01 | ~10MB | Development |
| 0.1 | ~100MB | Integration tests |
| 1.0 | ~1GB | Benchmarking |
Example Configuration¶
from pathlib import Path
from benchbox import TPCH
# Development configuration
tpch_dev = TPCH(
    scale_factor=0.01,
    output_dir=Path("./benchmark_data")
)
# Production configuration
tpch_prod = TPCH(
    scale_factor=1.0,
    output_dir=Path("/var/lib/benchbox")
)
Database Integration¶
DuckDB (Recommended)¶
DuckDB is the recommended database for BenchBox:
import duckdb
from benchbox import TPCH
# Setup
conn = duckdb.connect(":memory:")
tpch = TPCH(scale_factor=0.1)
# Load data
data_files = tpch.generate_data()
ddl = tpch.get_create_tables_sql()
conn.execute(ddl)
for file_path in data_files:
    table_name = file_path.stem
    conn.execute(f"""
        INSERT INTO {table_name}
        SELECT * FROM read_csv('{file_path}', delimiter='|', header=false)
    """)
# Run queries (DuckDB uses ANSI SQL by default)
for query_id in range(1, 6):  # First 5 queries
    result = conn.execute(tpch.get_query(query_id)).fetchall()
    print(f"Query {query_id}: {len(result)} rows")
Other Databases¶
For other databases, use SQL dialect translation:
# PostgreSQL (dialect translation only - adapter not yet available)
postgres_query = tpch.translate_query(1, "postgres")
# MySQL (dialect translation only - adapter not yet available)
mysql_query = tpch.translate_query(1, "mysql")
# SQLite (fully supported)
sqlite_query = tpch.translate_query(1, "sqlite")
Note on Dialect Translation: BenchBox can translate queries to many SQL dialects via SQLGlot, but this doesn’t mean platform adapters exist for connecting to those databases. Currently supported platforms: DuckDB, ClickHouse, Databricks, BigQuery, Redshift, Snowflake, SQLite. See Future Platforms for roadmap.
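For a platform without an adapter, one workable pattern is to translate the full query set and run the SQL through your own client. A minimal sketch using only the documented methods:
from pathlib import Path

out_dir = Path("translated_queries")
out_dir.mkdir(exist_ok=True)
for query_id in tpch.get_queries():
    sql = tpch.translate_query(query_id, "postgres")
    # Execute with your own PostgreSQL client (e.g. psycopg); no BenchBox adapter yet
    (out_dir / f"q{query_id}.sql").write_text(sql)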
Common Patterns¶
Basic Benchmark Execution¶
import time

import duckdb
from benchbox import TPCH

# Initialize and generate data
tpch = TPCH(scale_factor=0.1)
data_files = tpch.generate_data()

# Load into DuckDB, the recommended engine (see Database Integration above)
conn = duckdb.connect(":memory:")
conn.execute(tpch.get_create_tables_sql())
for file_path in data_files:
    conn.execute(f"""
        INSERT INTO {file_path.stem}
        SELECT * FROM read_csv('{file_path}', delimiter='|', header=false)
    """)

# Get and time queries
queries = tpch.get_queries()
for query_id, query_sql in list(queries.items())[:3]:  # First 3 queries
    start_time = time.time()
    conn.execute(query_sql).fetchall()
    execution_time = time.time() - start_time
    print(f"Query {query_id}: {execution_time:.3f}s")
Multi-Database Testing¶
from benchbox import TPCH
tpch = TPCH(scale_factor=0.01)
data_files = tpch.generate_data()
# Test different SQL dialects (note: translation != platform adapter)
dialects = ["duckdb", "clickhouse", "sqlite"] # Use supported platforms
for dialect in dialects:
    translated_query = tpch.translate_query(1, dialect)
    print(f"{dialect}: {len(translated_query)} chars")
See Also¶
Getting Started - Basic usage tutorial
Examples - Practical code examples
Configuration - Configuration guide
Benchmarks - Individual benchmark documentation
This API reference covers the core BenchBox functionality. For implementation details, see the source code documentation.