BenchBox Architecture Design Document
1. High-Level Architecture
BenchBox is designed as a modular, extensible library for embedding benchmark datasets and queries for database evaluation. The architecture follows these key principles:
Modularity: Clear separation between different components
Extensibility: Easy to add new benchmark types
Self-contained: Minimal external dependencies
Cross-Database Compatibility: Support for multiple database systems
1.2 Component Responsibilities
Core Framework: Provides base interfaces, abstract classes, and common utilities
Benchmarks: Concrete implementations of specific benchmarks (TPC-H, TPC-DS, etc.)
Data Generator: Generates benchmark data according to specifications
SQL Manager: Stores and translates SQL queries for different database dialects
2. Core Interfaces and Abstract Classes
2.1 BaseBenchmark Abstract Class
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Union
import pathlib


class BaseBenchmark(ABC):
    """Base abstract class for all benchmark implementations."""

    def __init__(self, scale_factor: float = 1.0, output_dir: Optional[pathlib.Path] = None):
        """
        Initialize a benchmark instance.

        Args:
            scale_factor: Size of the generated dataset
            output_dir: Directory to store generated data files
        """
        self.scale_factor = scale_factor
        self.output_dir = output_dir or pathlib.Path("./data")
        self._initialize()

    @abstractmethod
    def _initialize(self) -> None:
        """Initialize benchmark-specific components."""
        pass

    @abstractmethod
    def get_schema(self) -> Dict[str, Dict]:
        """
        Get the schema definition for this benchmark.

        Returns:
            Dictionary mapping table names to their schema definitions
        """
        pass

    @abstractmethod
    def generate_data(self) -> Dict[str, pathlib.Path]:
        """
        Generate benchmark data.

        Returns:
            Dictionary mapping table names to generated data file paths
        """
        pass

    @abstractmethod
    def get_query(self, query_id: Union[int, str]) -> str:
        """
        Get a specific query by ID.

        Args:
            query_id: Identifier for the query

        Returns:
            SQL query string
        """
        pass

    @abstractmethod
    def get_queries(self) -> Dict[Union[int, str], str]:
        """
        Get all queries for this benchmark.

        Returns:
            Dictionary mapping query IDs to SQL query strings
        """
        pass

    def translate_query(self, query_id: Union[int, str], dialect: str) -> str:
        """
        Translate a query to a specific SQL dialect.

        Args:
            query_id: Identifier for the query
            dialect: Target SQL dialect

        Returns:
            Translated SQL query string
        """
        query = self.get_query(query_id)
        return self._translate_sql(query, dialect)

    def _translate_sql(self, sql: str, dialect: str) -> str:
        """
        Translate SQL from the benchmark's native dialect to the target dialect.

        Args:
            sql: SQL query string
            dialect: Target SQL dialect

        Returns:
            Translated SQL query string
        """
        # Use sqlglot for translation; read=None selects sqlglot's default
        # (generic) dialect as the source dialect.
        import sqlglot
        return sqlglot.transpile(sql, read=None, write=dialect)[0]
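To make the contract above concrete, the following is a minimal sketch of a subclass that fills in the abstract methods. It assumes a trimmed stand-in for BaseBenchmark so the example runs on its own; ToyBenchmark, its one-table schema, and the "10 rows per unit of scale factor" rule are hypothetical and not part of BenchBox.

```python
import csv
import pathlib
import tempfile
from abc import ABC, abstractmethod
from typing import Dict, Optional, Union


class BaseBenchmark(ABC):
    """Trimmed stand-in for the BaseBenchmark class described above."""

    def __init__(self, scale_factor: float = 1.0,
                 output_dir: Optional[pathlib.Path] = None):
        self.scale_factor = scale_factor
        self.output_dir = output_dir or pathlib.Path("./data")
        self._initialize()

    @abstractmethod
    def _initialize(self) -> None: ...

    @abstractmethod
    def get_schema(self) -> Dict[str, Dict]: ...

    @abstractmethod
    def generate_data(self) -> Dict[str, pathlib.Path]: ...

    @abstractmethod
    def get_query(self, query_id: Union[int, str]) -> str: ...

    @abstractmethod
    def get_queries(self) -> Dict[Union[int, str], str]: ...


class ToyBenchmark(BaseBenchmark):
    """Hypothetical one-table benchmark, for illustration only."""

    def _initialize(self) -> None:
        self._schema = {"numbers": {"columns": {"n": "INTEGER"}}}
        self._queries = {1: "SELECT COUNT(*) FROM numbers",
                         2: "SELECT SUM(n) FROM numbers"}

    def get_schema(self) -> Dict[str, Dict]:
        return self._schema

    def generate_data(self) -> Dict[str, pathlib.Path]:
        self.output_dir.mkdir(parents=True, exist_ok=True)
        path = self.output_dir / "numbers.csv"
        rows = int(10 * self.scale_factor)  # assumed: 10 rows per unit of scale
        with path.open("w", newline="") as fh:
            writer = csv.writer(fh)
            for n in range(rows):
                writer.writerow([n])
        return {"numbers": path}

    def get_query(self, query_id: Union[int, str]) -> str:
        try:
            return self._queries[query_id]
        except KeyError:
            raise ValueError(f"Unknown query ID: {query_id!r}") from None

    def get_queries(self) -> Dict[Union[int, str], str]:
        return dict(self._queries)


with tempfile.TemporaryDirectory() as tmp:
    bench = ToyBenchmark(scale_factor=2.0, output_dir=pathlib.Path(tmp))
    files = bench.generate_data()
    row_count = len(files["numbers"].read_text().splitlines())
```

Because `_initialize()` is called from the base constructor, a subclass sets up its state there rather than overriding `__init__`.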
2.2 DataGenerator Interface
from abc import ABC, abstractmethod
from typing import Dict, Optional, List
import pathlib


class DataGenerator(ABC):
    """Interface for benchmark data generators."""

    @abstractmethod
    def generate_table(self, table_name: str, schema: Dict, scale_factor: float,
                       output_path: pathlib.Path) -> pathlib.Path:
        """
        Generate data for a specific table.

        Args:
            table_name: Name of the table
            schema: Schema definition for the table
            scale_factor: Size multiplier for the generated data
            output_path: Path to write the generated data

        Returns:
            Path to the generated data file
        """
        pass

    @abstractmethod
    def generate_all(self, schemas: Dict[str, Dict], scale_factor: float,
                     output_dir: pathlib.Path) -> Dict[str, pathlib.Path]:
        """
        Generate data for all tables in the benchmark.

        Args:
            schemas: Dictionary mapping table names to their schema definitions
            scale_factor: Size multiplier for the generated data
            output_dir: Directory to write the generated data files

        Returns:
            Dictionary mapping table names to generated data file paths
        """
        pass
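One possible implementation of this interface is sketched below. It is not the BenchBox generator: CsvDataGenerator, the schema shape (`columns` mapping names to `INTEGER` or `TEXT`, plus a `base_rows` count), and the value ranges are all assumptions made for illustration. The interface is repeated in trimmed form so the example is self-contained.

```python
import csv
import pathlib
import random
import tempfile
from abc import ABC, abstractmethod
from typing import Dict


class DataGenerator(ABC):
    """Trimmed stand-in for the DataGenerator interface."""

    @abstractmethod
    def generate_table(self, table_name: str, schema: Dict, scale_factor: float,
                       output_path: pathlib.Path) -> pathlib.Path: ...

    @abstractmethod
    def generate_all(self, schemas: Dict[str, Dict], scale_factor: float,
                     output_dir: pathlib.Path) -> Dict[str, pathlib.Path]: ...


class CsvDataGenerator(DataGenerator):
    """Hypothetical generator that writes one seeded CSV file per table."""

    def __init__(self, seed: int = 42):
        self.seed = seed

    def generate_table(self, table_name, schema, scale_factor, output_path):
        # Seed per table so each file is reproducible on its own.
        rng = random.Random(f"{self.seed}:{table_name}")
        columns = schema["columns"]  # assumed shape: {"col": "INTEGER" | "TEXT"}
        rows = max(1, int(schema.get("base_rows", 10) * scale_factor))
        with output_path.open("w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(list(columns))  # header row
            for _ in range(rows):
                writer.writerow([self._value(rng, kind) for kind in columns.values()])
        return output_path

    def generate_all(self, schemas, scale_factor, output_dir):
        output_dir.mkdir(parents=True, exist_ok=True)
        return {
            name: self.generate_table(name, schema, scale_factor,
                                      output_dir / f"{name}.csv")
            for name, schema in schemas.items()
        }

    @staticmethod
    def _value(rng: random.Random, kind: str):
        if kind == "INTEGER":
            return rng.randint(0, 9999)
        return f"v{rng.randint(0, 9999)}"  # everything else is treated as text


SCHEMAS = {"customer": {"columns": {"id": "INTEGER", "name": "TEXT"}, "base_rows": 5}}

with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp)
    first = CsvDataGenerator(seed=7).generate_all(SCHEMAS, 2.0, out / "a")
    second = CsvDataGenerator(seed=7).generate_all(SCHEMAS, 2.0, out / "b")
    first_text = first["customer"].read_text()
    second_text = second["customer"].read_text()
```

Two generators built with the same seed write identical files, which is the property a benchmark needs for repeatable runs.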
2.3 QueryManager Interface
from abc import ABC, abstractmethod
from typing import Dict, Union


class QueryManager(ABC):
    """Interface for managing benchmark queries."""

    @abstractmethod
    def get_query(self, query_id: Union[int, str]) -> str:
        """
        Get a specific query by ID.

        Args:
            query_id: Identifier for the query

        Returns:
            SQL query string
        """
        pass

    @abstractmethod
    def get_all_queries(self) -> Dict[Union[int, str], str]:
        """
        Get all queries managed by this instance.

        Returns:
            Dictionary mapping query IDs to SQL query strings
        """
        pass

    @abstractmethod
    def translate_query(self, query_id: Union[int, str], dialect: str) -> str:
        """
        Translate a query to a specific SQL dialect.

        Args:
            query_id: Identifier for the query
            dialect: Target SQL dialect

        Returns:
            Translated SQL query string
        """
        pass
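As an illustration of how the interface might be satisfied, here is a dictionary-backed manager that takes its translation function as a constructor argument. InMemoryQueryManager and tagging_translator are hypothetical names; the stub translator only tags the SQL with the dialect and stands in for whatever real translation layer is used.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Mapping, Union

QueryId = Union[int, str]


class QueryManager(ABC):
    """Trimmed stand-in for the QueryManager interface."""

    @abstractmethod
    def get_query(self, query_id: QueryId) -> str: ...

    @abstractmethod
    def get_all_queries(self) -> Dict[QueryId, str]: ...

    @abstractmethod
    def translate_query(self, query_id: QueryId, dialect: str) -> str: ...


class InMemoryQueryManager(QueryManager):
    """Hypothetical manager that keeps queries in a plain dictionary."""

    def __init__(self, queries: Mapping[QueryId, str],
                 translator: Callable[[str, str], str]):
        self._queries = dict(queries)
        self._translator = translator

    def get_query(self, query_id: QueryId) -> str:
        if query_id not in self._queries:
            raise KeyError(f"Unknown query ID: {query_id!r}")
        return self._queries[query_id]

    def get_all_queries(self) -> Dict[QueryId, str]:
        return dict(self._queries)  # copy so callers cannot mutate internal state

    def translate_query(self, query_id: QueryId, dialect: str) -> str:
        return self._translator(self.get_query(query_id), dialect)


def tagging_translator(sql: str, dialect: str) -> str:
    """Stub translator for the demo; a real one could delegate to sqlglot."""
    return f"/* {dialect} */ {sql}"


manager = InMemoryQueryManager({"q1": "SELECT 1", 2: "SELECT 2"}, tagging_translator)
translated = manager.translate_query("q1", "duckdb")
```

Injecting the translator keeps the manager testable without a SQL parsing dependency.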
3. Module Structure and Dependencies
3.1 Module Organization
benchbox/
│
├── __init__.py                 # Package exports
├── core/                       # Core framework components
│   ├── __init__.py
│   ├── base.py                 # Base abstract classes
│   ├── data/                   # Data generation framework
│   │   ├── __init__.py
│   │   ├── generator.py        # Data generator interfaces
│   │   ├── schema.py           # Schema definition utilities
│   │   └── random.py           # Random data generation utilities
│   ├── sql/                    # SQL management framework
│   │   ├── __init__.py
│   │   ├── manager.py          # Query manager interfaces
│   │   └── translator.py       # SQL translation utilities
│   └── utils/                  # Common utilities
│       ├── __init__.py
│       ├── file.py             # File handling utilities
│       └── validation.py       # Validation utilities
│
├── benchmarks/                 # Benchmark implementations
│   ├── __init__.py
│   ├── tpch/                   # TPC-H benchmark
│   │   ├── __init__.py
│   │   ├── benchmark.py        # TPC-H implementation
│   │   ├── schema.py           # TPC-H schema definitions
│   │   ├── generator.py        # TPC-H data generator
│   │   └── queries/            # TPC-H query templates
│   │       ├── __init__.py
│   │       ├── q1.sql
│   │       └── ...
│   ├── tpcds/                  # TPC-DS benchmark
│   │   ├── ...
│   ├── ssb/                    # Star Schema Benchmark
│   │   ├── ...
│   └── ...                     # Other benchmarks
│
└── cli/                        # Command-line interface
    ├── __init__.py
    └── main.py                 # CLI entry point
3.2 Dependency Relationships
┌─────────────────┐     ┌───────────────┐     ┌─────────────────┐
│                 │     │               │     │                 │
│ Benchmark Impl  │────▶│ Core Framework│◀────│ CLI Application │
│                 │     │               │     │                 │
└─────────────────┘     └───────────────┘     └─────────────────┘
         │                      │                      │
         │                      │                      │
         │                      ▼                      │
         │              ┌───────────────┐              │
         └─────────────▶│ Data Generator│◀─────────────┘
                        │               │
                        └───────────────┘
                                │
                                │
                                ▼
                        ┌───────────────┐
                        │               │
                        │  SQL Manager  │
                        │               │
                        └───────────────┘
4. Design Patterns and Extensibility
4.1 Strategy Pattern for Data Generation
The architecture employs the Strategy pattern for data generation, allowing different generation algorithms to be plugged in based on the benchmark type.
from abc import ABC, abstractmethod
import pathlib
from typing import Dict


class GenerationStrategy(ABC):
    """Strategy interface for data generation algorithms."""

    @abstractmethod
    def generate(self, schema: Dict, scale_factor: float, output_path: pathlib.Path) -> pathlib.Path:
        """Generate data according to the strategy."""
        pass


class RandomDataStrategy(GenerationStrategy):
    """Generate random data based on schema constraints."""

    def generate(self, schema: Dict, scale_factor: float, output_path: pathlib.Path) -> pathlib.Path:
        # Implementation for random data generation
        pass


class DeterministicDataStrategy(GenerationStrategy):
    """Generate deterministic data based on benchmark specifications."""

    def generate(self, schema: Dict, scale_factor: float, output_path: pathlib.Path) -> pathlib.Path:
        # Implementation for deterministic data generation
        pass
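The strategies above are left as placeholders. A minimal sketch of how they could be filled in and swapped at runtime follows, assuming a schema that carries a `base_rows` key and a hypothetical TableGenerator context class; neither is defined by the original design.

```python
import pathlib
import random
import tempfile
from abc import ABC, abstractmethod
from typing import Dict, Optional


class GenerationStrategy(ABC):
    """Strategy interface, repeated here so the sketch is self-contained."""

    @abstractmethod
    def generate(self, schema: Dict, scale_factor: float,
                 output_path: pathlib.Path) -> pathlib.Path: ...


def _row_count(schema: Dict, scale_factor: float) -> int:
    # "base_rows" is an assumed schema key, not part of the original design.
    return int(schema["base_rows"] * scale_factor)


class RandomDataStrategy(GenerationStrategy):
    """Writes one random integer per row."""

    def __init__(self, seed: Optional[int] = None):
        self._rng = random.Random(seed)

    def generate(self, schema, scale_factor, output_path):
        rows = _row_count(schema, scale_factor)
        values = [self._rng.randint(0, 99) for _ in range(rows)]
        output_path.write_text("".join(f"{v}\n" for v in values))
        return output_path


class DeterministicDataStrategy(GenerationStrategy):
    """Writes the row index on each row, so output depends only on the inputs."""

    def generate(self, schema, scale_factor, output_path):
        rows = _row_count(schema, scale_factor)
        output_path.write_text("".join(f"{i}\n" for i in range(rows)))
        return output_path


class TableGenerator:
    """Context object: holds a strategy and delegates generation to it."""

    def __init__(self, strategy: GenerationStrategy):
        self.strategy = strategy

    def run(self, schema: Dict, scale_factor: float,
            output_path: pathlib.Path) -> pathlib.Path:
        return self.strategy.generate(schema, scale_factor, output_path)


schema = {"base_rows": 2}
with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp)
    generator = TableGenerator(DeterministicDataStrategy())
    deterministic = generator.run(schema, 2.0, out / "det.txt").read_text()
    generator.strategy = RandomDataStrategy(seed=1)  # swap the algorithm at runtime
    randomized = generator.run(schema, 2.0, out / "rand.txt").read_text()
```

The caller only ever talks to `TableGenerator.run`, so choosing a different algorithm does not change any calling code.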
4.2 Factory Method for Benchmark Creation
class BenchmarkFactory:
    """Factory for creating benchmark instances."""

    @staticmethod
    def create_benchmark(benchmark_type: str, scale_factor: float = 1.0, **kwargs):
        """
        Create a benchmark instance of the specified type.

        Args:
            benchmark_type: Type of benchmark ('tpch', 'tpcds', 'ssb', etc.)
            scale_factor: Size of the generated dataset
            **kwargs: Additional benchmark-specific parameters

        Returns:
            BaseBenchmark instance
        """
        benchmark_type = benchmark_type.lower()
        if benchmark_type == 'tpch':
            from benchbox import TPCH
            return TPCH(scale_factor=scale_factor, **kwargs)
        elif benchmark_type == 'tpcds':
            from benchbox import TPCDS
            return TPCDS(scale_factor=scale_factor, **kwargs)
        elif benchmark_type == 'ssb':
            from benchbox import SSB
            return SSB(scale_factor=scale_factor, **kwargs)
        # Add more benchmark types as they are implemented
        else:
            raise ValueError(f"Unsupported benchmark type: {benchmark_type}")
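The if/elif chain above has to be edited for each new benchmark. One alternative, shown here purely as a sketch and not as the BenchBox design, is a registry that benchmark classes add themselves to with a decorator. The names `register_benchmark`, `_REGISTRY`, and ToyBenchmark are hypothetical.

```python
from typing import Callable, Dict

# Maps lower-cased benchmark names to the classes that implement them.
_REGISTRY: Dict[str, type] = {}


def register_benchmark(name: str) -> Callable[[type], type]:
    """Class decorator that records a benchmark class under `name`."""
    def decorator(cls: type) -> type:
        _REGISTRY[name.lower()] = cls
        return cls
    return decorator


def create_benchmark(benchmark_type: str, scale_factor: float = 1.0, **kwargs):
    """Look up the registered class and instantiate it."""
    try:
        cls = _REGISTRY[benchmark_type.lower()]
    except KeyError:
        raise ValueError(f"Unsupported benchmark type: {benchmark_type}") from None
    return cls(scale_factor=scale_factor, **kwargs)


@register_benchmark("toy")
class ToyBenchmark:
    """Placeholder class standing in for a real benchmark implementation."""

    def __init__(self, scale_factor: float = 1.0, **kwargs):
        self.scale_factor = scale_factor
        self.options = kwargs


bench = create_benchmark("TOY", scale_factor=0.1, verbose=True)
```

The trade-off is that a registry only knows about classes whose modules have been imported, whereas the explicit chain with local imports loads each benchmark on demand.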
4.3 Template Method for Benchmark Execution
The BaseBenchmark class uses the Template Method pattern to define the skeleton of the benchmark execution process, with specific steps implemented by subclasses.
from abc import ABC, abstractmethod


class BaseBenchmark(ABC):
    # ... other methods ...

    def run(self, connection, query_ids=None):
        """
        Run benchmark queries against a database connection.

        Args:
            connection: Database connection object
            query_ids: List of query IDs to run (runs all if None)

        Returns:
            Dictionary mapping query IDs to execution results
        """
        # 1. Prepare benchmark
        self._prepare_benchmark(connection)

        # 2. Determine which queries to run
        if query_ids is None:
            queries = self.get_queries()
        else:
            queries = {qid: self.get_query(qid) for qid in query_ids}

        # 3. Execute queries and collect results
        results = {}
        for qid, query in queries.items():
            results[qid] = self._execute_query(connection, query)

        # 4. Post-process results
        return self._post_process_results(results)

    @abstractmethod
    def _prepare_benchmark(self, connection):
        """Prepare the database for benchmark execution."""
        pass

    @abstractmethod
    def _execute_query(self, connection, query):
        """Execute a query and return its result."""
        pass

    @abstractmethod
    def _post_process_results(self, results):
        """Post-process benchmark results."""
        pass
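To show the template method end to end, the sketch below implements the three hooks against an in-memory SQLite database. The base class is reduced to `run()` and its hooks; SqliteToyBenchmark, its `numbers` table, and the result layout (`rows` plus `seconds`) are assumptions for the example.

```python
import sqlite3
import time
from abc import ABC, abstractmethod


class BaseBenchmark(ABC):
    """Reduced to the template method and the hooks it calls."""

    def run(self, connection, query_ids=None):
        self._prepare_benchmark(connection)
        if query_ids is None:
            queries = self.get_queries()
        else:
            queries = {qid: self.get_query(qid) for qid in query_ids}
        results = {qid: self._execute_query(connection, sql)
                   for qid, sql in queries.items()}
        return self._post_process_results(results)

    @abstractmethod
    def get_query(self, query_id): ...

    @abstractmethod
    def get_queries(self): ...

    @abstractmethod
    def _prepare_benchmark(self, connection): ...

    @abstractmethod
    def _execute_query(self, connection, query): ...

    @abstractmethod
    def _post_process_results(self, results): ...


class SqliteToyBenchmark(BaseBenchmark):
    """Hypothetical benchmark with two queries over a ten-row table."""

    QUERIES = {1: "SELECT COUNT(*) FROM numbers",
               2: "SELECT SUM(n) FROM numbers"}

    def get_query(self, query_id):
        return self.QUERIES[query_id]

    def get_queries(self):
        return dict(self.QUERIES)

    def _prepare_benchmark(self, connection):
        # Recreate the table contents so repeated runs see the same data.
        connection.execute("CREATE TABLE IF NOT EXISTS numbers (n INTEGER)")
        connection.execute("DELETE FROM numbers")
        connection.executemany("INSERT INTO numbers VALUES (?)",
                               [(i,) for i in range(10)])

    def _execute_query(self, connection, query):
        start = time.perf_counter()
        rows = connection.execute(query).fetchall()
        return {"rows": rows, "seconds": time.perf_counter() - start}

    def _post_process_results(self, results):
        return results  # nothing to aggregate in this toy example


connection = sqlite3.connect(":memory:")
results = SqliteToyBenchmark().run(connection)
subset = SqliteToyBenchmark().run(connection, query_ids=[2])
connection.close()
```

The subclass never overrides `run()`, so the order of the four steps stays fixed while each step remains replaceable.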
5. Key Design Decisions
5.1 Data Generation Strategy
Decision: Use official TPC tools (dbgen for TPC-H, dsdgen for TPC-DS) for data generation.
Rationale:
Official tools ensure specification compliance
Template-based query generation works independently of data generation
Implementation Strategy:
Use external TPC tools for official compliance
Use pseudo-random number generators with fixed seeds for deterministic parameter generation
Support both full data generation and query-only usage patterns
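The fixed-seed approach to parameter generation can be sketched as follows. The query text, the parameter names, and their ranges are illustrative only and do not reproduce the substitution rules of the TPC-H specification; `generate_parameters` and `render_query` are hypothetical helpers.

```python
import random
from string import Template
from typing import Dict

# Illustrative template; the placeholders are filled from a seeded generator.
QUERY_TEMPLATE = Template(
    "SELECT COUNT(*) FROM lineitem "
    "WHERE l_quantity < $quantity AND l_discount >= $discount"
)


def generate_parameters(query_id: int, seed: int = 0) -> Dict[str, object]:
    """Derive parameters from (seed, query_id) so reruns give the same values."""
    rng = random.Random(f"{seed}:{query_id}")
    return {
        "quantity": rng.randint(1, 50),                 # assumed range
        "discount": f"{rng.randint(0, 10) / 100:.2f}",  # assumed range 0.00-0.10
    }


def render_query(query_id: int, seed: int = 0) -> str:
    """Fill the template with the deterministic parameters."""
    return QUERY_TEMPLATE.substitute(generate_parameters(query_id, seed))


first = render_query(6, seed=1)
second = render_query(6, seed=1)
```

Because the parameters depend only on the seed and the query ID, queries can be rendered without generating any data, which matches the query-only usage pattern.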
5.2 Embedded Query Storage
Decision: Embed SQL queries directly in the library code rather than loading from external files.
Rationale:
Simplifies distribution and packaging
Eliminates file system dependencies
Allows for programmatic query manipulation and introspection
Implementation Strategy:
Store queries as string constants or templates in Python modules
Organize queries by benchmark and query ID
Provide interface for retrieving and customizing queries
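A possible layout for embedded queries is sketched below: templates live in a module-level dictionary keyed by benchmark and query ID, and each carries default parameters that callers may override. EmbeddedQuery, the `toy` benchmark key, and the example SQL are hypothetical.

```python
from dataclasses import dataclass, field
from string import Template
from typing import Dict, Mapping


@dataclass(frozen=True)
class EmbeddedQuery:
    """A query template stored in code, with default parameter values."""

    template: str
    defaults: Mapping[str, object] = field(default_factory=dict)

    def render(self, **overrides: object) -> str:
        # Caller-supplied values take precedence over the stored defaults.
        params = {**self.defaults, **overrides}
        return Template(self.template).substitute(params)


# Queries organized by benchmark name, then by query ID.
QUERIES: Dict[str, Dict[int, EmbeddedQuery]] = {
    "toy": {
        1: EmbeddedQuery("SELECT COUNT(*) FROM orders"),
        2: EmbeddedQuery(
            "SELECT * FROM orders WHERE o_totalprice > $min_price LIMIT $limit",
            {"min_price": 1000, "limit": 10},
        ),
    }
}


def get_query(benchmark: str, query_id: int, **overrides: object) -> str:
    """Retrieve a query and optionally customize its parameters."""
    return QUERIES[benchmark][query_id].render(**overrides)


default_sql = get_query("toy", 2)
custom_sql = get_query("toy", 2, limit=5)
```

Keeping the templates as Python objects means they can be listed, inspected, and modified programmatically without touching the file system.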
5.3 SQL Dialect Translation
Decision: Use sqlglot for SQL translation between different database dialects.
Rationale:
Leverages an established SQL parsing and translation library
Supports a wide range of SQL dialects
Provides a clean abstraction for SQL manipulation
Implementation Strategy:
Wrap sqlglot functionality in a simple interface
Implement benchmark-specific SQL transformations when needed
Cache translated queries for performance
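The wrap-and-cache idea can be illustrated with a small wrapper around any translation callable. CachingTranslator is a hypothetical name, and the demo uses a stub translator so that it runs without sqlglot installed; the comment shows how sqlglot might be plugged in.

```python
from typing import Callable, Dict, Tuple


class CachingTranslator:
    """Wraps a translate(sql, dialect) callable and memoizes its results."""

    def __init__(self, translate: Callable[[str, str], str]):
        self._translate = translate
        self._cache: Dict[Tuple[str, str], str] = {}
        self.misses = 0  # number of times the wrapped callable was invoked

    def translate(self, sql: str, dialect: str) -> str:
        key = (sql, dialect.lower())  # normalize so "DuckDB" and "duckdb" share an entry
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._translate(sql, key[1])
        return self._cache[key]


# With sqlglot installed, the wrapped callable could be:
#   lambda sql, dialect: sqlglot.transpile(sql, write=dialect)[0]
def stub_translate(sql: str, dialect: str) -> str:
    """Stand-in translator that only prefixes a dialect comment."""
    return f"-- {dialect}\n{sql}"


translator = CachingTranslator(stub_translate)
first = translator.translate("SELECT 1", "DuckDB")
second = translator.translate("SELECT 1", "duckdb")
third = translator.translate("SELECT 1", "postgres")
```

Keying the cache on the SQL text and the dialect keeps benchmark-specific rewrites safe, provided they are applied before the text reaches the wrapper.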
5.4 Extensibility Model
Decision: Use abstract base classes and interfaces to define extension points.
Rationale:
Provides clear contracts for implementing new benchmarks
Ensures consistency across different benchmark implementations
Simplifies the process of adding new benchmark types
Implementation Strategy:
Define core interfaces for key components
Implement concrete benchmark classes for each supported benchmark
Document extension patterns and provide examples
6. Class/Interface Diagram
┌────────────────────┐     ┌────────────────────┐     ┌────────────────────┐
│   BaseBenchmark    │     │   DataGenerator    │     │    QueryManager    │
│     (Abstract)     │     │    (Interface)     │     │    (Interface)     │
├────────────────────┤     ├────────────────────┤     ├────────────────────┤
│ - scale_factor     │     │                    │     │                    │
│ - output_dir       │     │                    │     │                    │
├────────────────────┤     ├────────────────────┤     ├────────────────────┤
│ + get_schema()     │     │ + generate_table() │     │ + get_query()      │
│ + generate_data()  │◄────┤ + generate_all()   │     │ + get_all_queries()│
│ + get_query()      │     │                    │     │ + translate_query()│
│ + get_queries()    │◄────┼────────────────────┘     │                    │
│ + translate_query()│     │                          └────────────────────┘
└────────────────────┘     │                                    ▲
          ▲                │                                    │
          │                │                                    │
┌─────────┴──────────┐     │                          ┌─────────┴──────────┐
│                    │     │                          │                    │
│   TPCHBenchmark    │     │                          │  TPCHQueryManager  │
│                    │     │                          │                    │
├────────────────────┤     │                          ├────────────────────┤
│ - tables           │     │                          │ - queries          │
│ - data_generator   │─────┘                          │                    │
│ - query_manager    │────────────────────────────────│                    │
├────────────────────┤                                ├────────────────────┤
│ + _initialize()    │                                │ + get_query()      │
│ + get_schema()     │                                │ + get_all_queries()│
│ + generate_data()  │                                │ + translate_query()│
│ + get_query()      │                                │                    │
│ + get_queries()    │                                │                    │
└────────────────────┘                                └────────────────────┘
7. Implementation Considerations
7.1 Performance Considerations
Use lazy loading for queries and other resources to minimize startup time
Implement incremental data generation for large datasets
Consider using Rust components for performance-critical data generation paths
Cache translated queries to avoid redundant translation
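Lazy loading of queries could look like the sketch below, which defers the load until the first lookup and then reuses the result. LazyQueryCatalog and `load_queries` are hypothetical; in practice the loader might import a module of query constants or read packaged resources.

```python
from functools import cached_property
from typing import Callable, Dict


class LazyQueryCatalog:
    """Loads its queries on first access instead of at construction time."""

    def __init__(self, loader: Callable[[], Dict[int, str]]):
        self._loader = loader
        self.load_count = 0  # exposed only to make the laziness observable

    @cached_property
    def queries(self) -> Dict[int, str]:
        # Runs once; cached_property stores the result on the instance.
        self.load_count += 1
        return self._loader()

    def get_query(self, query_id: int) -> str:
        return self.queries[query_id]


def load_queries() -> Dict[int, str]:
    """Stand-in for an expensive load step."""
    return {1: "SELECT 1", 2: "SELECT 2"}


catalog = LazyQueryCatalog(load_queries)
loads_before = catalog.load_count
first = catalog.get_query(1)
second = catalog.get_query(2)
```

Constructing the catalog costs nothing, so importing the library stays fast even when a benchmark ships many queries.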
7.2 Memory Management
Use generators and iterators for large data generation to minimize memory usage
Implement stream-based data writing for large tables
Consider chunked processing for very large benchmark datasets
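The three points above can be combined in a short sketch: a generator produces rows one at a time, a helper groups them into fixed-size chunks, and a writer streams each chunk to disk. The function names, the row layout, and the chunk size are assumptions for illustration.

```python
import csv
import itertools
import pathlib
import tempfile
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, str]


def row_stream(total_rows: int) -> Iterator[Row]:
    """Yield rows lazily so the full table never sits in memory."""
    for i in range(total_rows):
        yield (i, f"name_{i}")


def chunked(rows: Iterable, size: int) -> Iterator[List]:
    """Group an iterable into lists of at most `size` items."""
    iterator = iter(rows)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk


def write_table(path: pathlib.Path, rows: Iterable[Row], chunk_size: int = 1000) -> int:
    """Stream rows to a CSV file chunk by chunk; return the number written."""
    written = 0
    with path.open("w", newline="") as fh:
        writer = csv.writer(fh)
        for chunk in chunked(rows, chunk_size):
            writer.writerows(chunk)
            written += len(chunk)
    return written


with tempfile.TemporaryDirectory() as tmp:
    target = pathlib.Path(tmp) / "customer.csv"
    written = write_table(target, row_stream(2500), chunk_size=1000)
    line_count = len(target.read_text().splitlines())
```

Peak memory is bounded by one chunk rather than by the table size, so the same code path works for small and large scale factors.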
7.3 Testing Strategy
Unit test each component separately
Integration test benchmarks against small-scale data
Validate generated data against benchmark specifications
Test SQL translation across multiple dialects
7.4 Documentation Standards
Document public APIs with detailed docstrings
Provide examples for common use cases
Include benchmark-specific documentation (schema, query specifications, etc.)
Document extension points and patterns
8. Roadmap and Future Extensions
8.1 Phase 1: Core Framework and TPC-H
Implement BaseBenchmark and core interfaces
Develop data generation framework
Implement TPC-H benchmark
Add sqlglot integration for SQL translation
8.2 Phase 2: Additional Benchmarks
Implement TPC-DS benchmark
Implement Star Schema Benchmark (SSB)
Add H2O/db-benchmark
Add ClickBench
8.3 Phase 3: Features
Add support for custom benchmarks
Implement benchmark execution framework
Add result analysis and visualization tools
Optimize performance for large-scale benchmarks
9. Conclusion
The proposed architecture gives BenchBox a clear separation of concerns, well-defined interfaces, and a modular structure. The design prioritizes extensibility: a new benchmark is added by subclassing BaseBenchmark and adding a branch to BenchmarkFactory, while the library itself stays self-contained.
The architecture relies on three established patterns (Strategy for data generation, Factory Method for benchmark creation, and Template Method for benchmark execution) to keep the abstractions flexible and the implementation straightforward, so that BenchBox remains maintainable and extensible as it evolves.