BenchBox Architecture Design Document

Tags: contributor, concept

1. High-Level Architecture

BenchBox is designed as a modular, extensible library for embedding benchmark datasets and queries for database evaluation.

1.1 Design Principles

The architecture follows these key principles:

  • Modularity: Clear separation between the core framework, benchmark implementations, data generation, and SQL management

  • Extensibility: Easy to add new benchmark types

  • Self-Contained: Operates with minimal external dependencies

  • Cross-Database Compatibility: Support for multiple database systems

1.2 Component Responsibilities

  1. Core Framework: Provides base interfaces, abstract classes, and common utilities

  2. Benchmarks: Concrete implementations of specific benchmarks (TPC-H, TPC-DS, etc.)

  3. Data Generator: Generates benchmark data according to specifications

  4. SQL Manager: Stores and translates SQL queries for different database dialects

2. Core Interfaces and Abstract Classes

2.1 BaseBenchmark Abstract Class

from abc import ABC, abstractmethod
from typing import Dict, Optional, Union
import pathlib

class BaseBenchmark(ABC):
    """Base abstract class for all benchmark implementations."""

    def __init__(self, scale_factor: float = 1.0, output_dir: Optional[pathlib.Path] = None):
        """
        Initialize a benchmark instance.

        Args:
            scale_factor: Size of the generated dataset
            output_dir: Directory to store generated data files
        """
        self.scale_factor = scale_factor
        self.output_dir = output_dir or pathlib.Path("./data")
        self._initialize()

    @abstractmethod
    def _initialize(self) -> None:
        """Initialize benchmark-specific components."""
        pass

    @abstractmethod
    def get_schema(self) -> Dict[str, Dict]:
        """
        Get the schema definition for this benchmark.

        Returns:
            Dictionary mapping table names to their schema definitions
        """
        pass

    @abstractmethod
    def generate_data(self) -> Dict[str, pathlib.Path]:
        """
        Generate benchmark data.

        Returns:
            Dictionary mapping table names to generated data file paths
        """
        pass

    @abstractmethod
    def get_query(self, query_id: Union[int, str]) -> str:
        """
        Get a specific query by ID.

        Args:
            query_id: Identifier for the query

        Returns:
            SQL query string
        """
        pass

    @abstractmethod
    def get_queries(self) -> Dict[Union[int, str], str]:
        """
        Get all queries for this benchmark.

        Returns:
            Dictionary mapping query IDs to SQL query strings
        """
        pass

    def translate_query(self, query_id: Union[int, str], dialect: str) -> str:
        """
        Translate a query to a specific SQL dialect.

        Args:
            query_id: Identifier for the query
            dialect: Target SQL dialect

        Returns:
            Translated SQL query string
        """
        query = self.get_query(query_id)
        return self._translate_sql(query, dialect)

    def _translate_sql(self, sql: str, dialect: str) -> str:
        """
        Translate SQL from the benchmark's native dialect to the target dialect.

        Args:
            sql: SQL query string
            dialect: Target SQL dialect

        Returns:
            Translated SQL query string
        """
        # Use sqlglot for translation; imported lazily so it remains an
        # optional dependency until a translation is actually requested.
        import sqlglot
        # read=None parses the input with sqlglot's generic dialect.
        return sqlglot.transpile(sql, read=None, write=dialect)[0]
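
For illustration, the translation step behaves like the following sketch (assuming sqlglot is installed; the lineitem query is a hypothetical example):

import sqlglot

# A query using LIMIT, which T-SQL does not support directly.
sql = "SELECT l_orderkey, l_quantity FROM lineitem ORDER BY l_orderkey LIMIT 10"

# sqlglot rewrites the clause into the target dialect's syntax,
# e.g. a SELECT TOP 10 ... form for T-SQL.
print(sqlglot.transpile(sql, write="tsql")[0])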

2.2 DataGenerator Interface

from abc import ABC, abstractmethod
from typing import Dict
import pathlib

class DataGenerator(ABC):
    """Interface for benchmark data generators."""

    @abstractmethod
    def generate_table(self, table_name: str, schema: Dict, scale_factor: float,
                       output_path: pathlib.Path) -> pathlib.Path:
        """
        Generate data for a specific table.

        Args:
            table_name: Name of the table
            schema: Schema definition for the table
            scale_factor: Size multiplier for the generated data
            output_path: Path to write the generated data

        Returns:
            Path to the generated data file
        """
        pass

    @abstractmethod
    def generate_all(self, schemas: Dict[str, Dict], scale_factor: float,
                    output_dir: pathlib.Path) -> Dict[str, pathlib.Path]:
        """
        Generate data for all tables in the benchmark.

        Args:
            schemas: Dictionary mapping table names to their schema definitions
            scale_factor: Size multiplier for the generated data
            output_dir: Directory to write the generated data files

        Returns:
            Dictionary mapping table names to generated data file paths
        """
        pass
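
A concrete generator can be quite small. The following sketch writes CSV files; the CSVDataGenerator name and the schema format (a mapping of column names to type strings) are illustrative assumptions, not part of the interface:

import csv
import pathlib
from typing import Dict

class CSVDataGenerator(DataGenerator):
    """Illustrative generator that writes one CSV file per table."""

    def generate_table(self, table_name: str, schema: Dict, scale_factor: float,
                       output_path: pathlib.Path) -> pathlib.Path:
        output_path.parent.mkdir(parents=True, exist_ok=True)
        with output_path.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(schema.keys())  # header row from column names
            # Row production is benchmark-specific and omitted in this sketch.
        return output_path

    def generate_all(self, schemas: Dict[str, Dict], scale_factor: float,
                     output_dir: pathlib.Path) -> Dict[str, pathlib.Path]:
        return {
            name: self.generate_table(name, schema, scale_factor,
                                      output_dir / f"{name}.csv")
            for name, schema in schemas.items()
        }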

2.3 QueryManager Interface

from abc import ABC, abstractmethod
from typing import Dict, Union

class QueryManager(ABC):
    """Interface for managing benchmark queries."""

    @abstractmethod
    def get_query(self, query_id: Union[int, str]) -> str:
        """
        Get a specific query by ID.

        Args:
            query_id: Identifier for the query

        Returns:
            SQL query string
        """
        pass

    @abstractmethod
    def get_all_queries(self) -> Dict[Union[int, str], str]:
        """
        Get all queries managed by this instance.

        Returns:
            Dictionary mapping query IDs to SQL query strings
        """
        pass

    @abstractmethod
    def translate_query(self, query_id: Union[int, str], dialect: str) -> str:
        """
        Translate a query to a specific SQL dialect.

        Args:
            query_id: Identifier for the query
            dialect: Target SQL dialect

        Returns:
            Translated SQL query string
        """
        pass
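
A concrete implementation can be as simple as a dictionary of embedded query strings. The following sketch (the InMemoryQueryManager name is an assumption, not part of the design) also shows how translate_query can delegate to sqlglot:

import sqlglot
from typing import Dict, Union

class InMemoryQueryManager(QueryManager):
    """Query manager backed by an in-memory dictionary (sketch)."""

    def __init__(self, queries: Dict[Union[int, str], str]):
        self._queries = dict(queries)

    def get_query(self, query_id: Union[int, str]) -> str:
        try:
            return self._queries[query_id]
        except KeyError:
            raise KeyError(f"Unknown query ID: {query_id}") from None

    def get_all_queries(self) -> Dict[Union[int, str], str]:
        return dict(self._queries)  # defensive copy

    def translate_query(self, query_id: Union[int, str], dialect: str) -> str:
        return sqlglot.transpile(self.get_query(query_id), write=dialect)[0]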

3. Module Structure and Dependencies

3.1 Module Organization

benchbox/
│
├── __init__.py                  # Package exports
├── core/                        # Core framework components
│   ├── __init__.py
│   ├── base.py                  # Base abstract classes
│   ├── data/                    # Data generation framework
│   │   ├── __init__.py
│   │   ├── generator.py         # Data generator interfaces
│   │   ├── schema.py            # Schema definition utilities
│   │   └── random.py            # Random data generation utilities
│   ├── sql/                     # SQL management framework
│   │   ├── __init__.py
│   │   ├── manager.py           # Query manager interfaces
│   │   └── translator.py        # SQL translation utilities
│   └── utils/                   # Common utilities
│       ├── __init__.py
│       ├── file.py              # File handling utilities
│       └── validation.py        # Validation utilities
│
├── benchmarks/                  # Benchmark implementations
│   ├── __init__.py
│   ├── tpch/                    # TPC-H benchmark
│   │   ├── __init__.py
│   │   ├── benchmark.py         # TPC-H implementation
│   │   ├── schema.py            # TPC-H schema definitions
│   │   ├── generator.py         # TPC-H data generator
│   │   └── queries/             # TPC-H query templates
│   │       ├── __init__.py
│   │       ├── q1.sql
│   │       └── ...
│   ├── tpcds/                   # TPC-DS benchmark
│   │   └── ...
│   ├── ssb/                     # Star Schema Benchmark
│   │   └── ...
│   └── ...                      # Other benchmarks
│
└── cli/                         # Command-line interface
    ├── __init__.py
    └── main.py                  # CLI entry point

3.2 Dependency Relationships

┌─────────────────┐     ┌────────────────┐     ┌─────────────────┐
│                 │     │                │     │                 │
│ Benchmark Impl  │────▶│ Core Framework │◀────│ CLI Application │
│                 │     │                │     │                 │
└─────────────────┘     └────────────────┘     └─────────────────┘
         │                      │                       │
         │                      ▼                       │
         │              ┌────────────────┐              │
         └─────────────▶│ Data Generator │◀─────────────┘
                        └────────────────┘
                                │
                                ▼
                        ┌────────────────┐
                        │  SQL Manager   │
                        └────────────────┘

4. Design Patterns and Extensibility

4.1 Strategy Pattern for Data Generation

The architecture employs the Strategy pattern for data generation, allowing different generation algorithms to be plugged in based on the benchmark type.

from abc import ABC, abstractmethod
import pathlib
from typing import Dict

class GenerationStrategy(ABC):
    """Strategy interface for data generation algorithms."""

    @abstractmethod
    def generate(self, schema: Dict, scale_factor: float, output_path: pathlib.Path) -> pathlib.Path:
        """Generate data according to the strategy."""
        pass

class RandomDataStrategy(GenerationStrategy):
    """Generate random data based on schema constraints."""

    def generate(self, schema: Dict, scale_factor: float, output_path: pathlib.Path) -> pathlib.Path:
        # Implementation for random data generation
        pass

class DeterministicDataStrategy(GenerationStrategy):
    """Generate deterministic data based on benchmark specifications."""

    def generate(self, schema: Dict, scale_factor: float, output_path: pathlib.Path) -> pathlib.Path:
        # Implementation for deterministic data generation
        pass
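
A generator then accepts a strategy at construction time and delegates to it, so algorithms can be swapped without touching the generator. A minimal sketch (the StrategyDataGenerator name is illustrative):

import pathlib
from typing import Dict

class StrategyDataGenerator:
    """Data generator parameterized by a pluggable generation strategy."""

    def __init__(self, strategy: GenerationStrategy):
        self.strategy = strategy

    def generate_table(self, schema: Dict, scale_factor: float,
                       output_path: pathlib.Path) -> pathlib.Path:
        # The generation algorithm itself lives in the strategy object.
        return self.strategy.generate(schema, scale_factor, output_path)

# Swapping algorithms is a one-line change:
generator = StrategyDataGenerator(DeterministicDataStrategy())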

4.2 Factory Method for Benchmark Creation

class BenchmarkFactory:
    """Factory for creating benchmark instances."""

    @staticmethod
    def create_benchmark(benchmark_type: str, scale_factor: float = 1.0, **kwargs):
        """
        Create a benchmark instance of the specified type.

        Args:
            benchmark_type: Type of benchmark ('tpch', 'tpcds', 'ssb', etc.)
            scale_factor: Size of the generated dataset
            **kwargs: Additional benchmark-specific parameters

        Returns:
            BaseBenchmark instance
        """
        benchmark_type = benchmark_type.lower()

        if benchmark_type == 'tpch':
            from benchbox import TPCH
            return TPCH(scale_factor=scale_factor, **kwargs)
        elif benchmark_type == 'tpcds':
            from benchbox import TPCDS
            return TPCDS(scale_factor=scale_factor, **kwargs)
        elif benchmark_type == 'ssb':
            from benchbox import SSB
            return SSB(scale_factor=scale_factor, **kwargs)
        # Add more benchmark types as they are implemented
        else:
            raise ValueError(f"Unsupported benchmark type: {benchmark_type}")
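
Usage is then uniform across benchmark types (assuming the TPCH implementation is available):

# Create a small TPC-H instance; extra kwargs pass through to the implementation.
benchmark = BenchmarkFactory.create_benchmark("tpch", scale_factor=0.1)
data_files = benchmark.generate_data()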

4.3 Template Method for Benchmark Execution

The BaseBenchmark class uses the Template Method pattern to define the skeleton of the benchmark execution process, with specific steps implemented by subclasses.

class BaseBenchmark(ABC):
    # ... other methods ...

    def run(self, connection, query_ids=None):
        """
        Run benchmark queries against a database connection.

        Args:
            connection: Database connection object
            query_ids: List of query IDs to run (runs all if None)

        Returns:
            Dictionary mapping query IDs to execution results
        """
        # 1. Prepare benchmark
        self._prepare_benchmark(connection)

        # 2. Determine which queries to run
        if query_ids is None:
            queries = self.get_queries()
        else:
            queries = {qid: self.get_query(qid) for qid in query_ids}

        # 3. Execute queries and collect results
        results = {}
        for qid, query in queries.items():
            results[qid] = self._execute_query(connection, query)

        # 4. Post-process results
        return self._post_process_results(results)

    @abstractmethod
    def _prepare_benchmark(self, connection):
        """Prepare the database for benchmark execution."""
        pass

    @abstractmethod
    def _execute_query(self, connection, query):
        """Execute a query and return its result."""
        pass

    @abstractmethod
    def _post_process_results(self, results):
        """Post-process benchmark results."""
        pass
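
A concrete subclass only fills in these hooks. The following sketch targets a DB-API-style connection; the timing logic and result shape are illustrative, not prescribed by the design:

import time

class SimpleExecutionMixin:
    """Illustrative implementations of the template-method hooks."""

    def _prepare_benchmark(self, connection):
        # No warm-up or schema setup in this sketch.
        pass

    def _execute_query(self, connection, query):
        cursor = connection.cursor()
        start = time.perf_counter()
        cursor.execute(query)
        rows = cursor.fetchall()
        return {"rows": len(rows), "seconds": time.perf_counter() - start}

    def _post_process_results(self, results):
        # Results are already in their final shape in this sketch.
        return results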

5. Key Design Decisions

5.1 Data Generation Strategy

Decision: Use official TPC tools (dbgen for TPC-H, dsdgen for TPC-DS) for data generation.

Rationale:

  • Official tools ensure specification compliance

  • Template-based query generation works independently of data generation

Implementation Strategy:

  • Use external TPC tools for official compliance

  • Use pseudo-random number generators with fixed seeds for deterministic parameter generation (see the sketch after this list)

  • Support both full data generation and query-only usage patterns
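
As a sketch of the fixed-seed approach (the template, placeholder, and seed are illustrative, not the official qgen substitution rules):

import random

def substitute_parameters(template: str, query_id: int, seed: int = 42) -> str:
    """Fill template placeholders deterministically from a seeded PRNG."""
    rng = random.Random(f"{seed}:{query_id}")  # same inputs -> same parameters
    return template.format(quantity=rng.randint(1, 50))

# Two calls with the same inputs always yield identical SQL.
template = "SELECT count(*) FROM lineitem WHERE l_quantity < {quantity}"
assert substitute_parameters(template, 1) == substitute_parameters(template, 1)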

5.2 Embedded Query Storage

Decision: Embed SQL queries directly in the library code rather than loading from external files.

Rationale:

  • Simplifies distribution and packaging

  • Eliminates file system dependencies

  • Allows for programmatic query manipulation and introspection

Implementation Strategy:

  • Store queries as string constants or templates in Python modules (illustrated after this list)

  • Organize queries by benchmark and query ID

  • Provide interface for retrieving and customizing queries
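
For example, a benchmark's query module might expose a plain dictionary of embedded strings (the module path and the abridged SQL are illustrative):

# benchbox/benchmarks/tpch/queries.py (illustrative layout)

Q1 = """
SELECT l_returnflag, l_linestatus, sum(l_quantity) AS sum_qty
FROM lineitem
WHERE l_shipdate <= DATE '1998-12-01'
GROUP BY l_returnflag, l_linestatus
"""  # abridged version of TPC-H Q1, for illustration only

# Query IDs map directly to embedded strings; no file I/O at runtime.
QUERIES = {1: Q1}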

5.3 SQL Dialect Translation

Decision: Use sqlglot for SQL translation between different database dialects.

Rationale:

  • Leverages an established SQL parsing and translation library

  • Supports a wide range of SQL dialects

  • Provides a clean abstraction for SQL manipulation

Implementation Strategy:

  • Wrap sqlglot functionality in a simple interface

  • Implement benchmark-specific SQL transformations when needed

  • Cache translated queries for performance (see the sketch after this list)
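
Since translation inputs are plain strings, caching can be layered on with functools; a minimal sketch:

import functools
import sqlglot

@functools.lru_cache(maxsize=None)
def translate_cached(sql: str, dialect: str) -> str:
    """Translate SQL, memoizing on the (sql, dialect) pair."""
    return sqlglot.transpile(sql, write=dialect)[0]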

5.4 Extensibility Model

Decision: Use abstract base classes and interfaces to define extension points.

Rationale:

  • Provides clear contracts for implementing new benchmarks

  • Ensures consistency across different benchmark implementations

  • Simplifies the process of adding new benchmark types

Implementation Strategy:

  • Define core interfaces for key components

  • Implement concrete benchmark classes for each supported benchmark

  • Document extension patterns and provide examples (one such skeleton follows this list)
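
Adding a benchmark then amounts to implementing the abstract contract. A skeleton (names and bodies illustrative; the execution hooks from Section 4.3 are omitted):

import pathlib
from typing import Dict, Union

class MyBenchmark(BaseBenchmark):
    """Skeleton for a new benchmark type."""

    def _initialize(self) -> None:
        self._queries = {1: "SELECT 1"}  # populate with real queries

    def get_schema(self) -> Dict[str, Dict]:
        return {}  # table name -> schema definition

    def generate_data(self) -> Dict[str, pathlib.Path]:
        return {}  # table name -> generated data file path

    def get_query(self, query_id: Union[int, str]) -> str:
        return self._queries[query_id]

    def get_queries(self) -> Dict[Union[int, str], str]:
        return dict(self._queries)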

6. Class/Interface Diagram

┌─────────────────────┐     ┌─────────────────────┐     ┌─────────────────────┐
│    BaseBenchmark    │     │    DataGenerator    │     │     QueryManager    │
│     (Abstract)      │     │     (Interface)     │     │     (Interface)     │
├─────────────────────┤     ├─────────────────────┤     ├─────────────────────┤
│ - scale_factor      │     │ + generate_table()  │     │ + get_query()       │
│ - output_dir        │     │ + generate_all()    │     │ + get_all_queries() │
├─────────────────────┤     └─────────────────────┘     │ + translate_query() │
│ + get_schema()      │                ▲                └─────────────────────┘
│ + generate_data()   │                │                           ▲
│ + get_query()       │                │ uses                      │ implements
│ + get_queries()     │                │                           │
│ + translate_query() │                │                           │
└─────────────────────┘                │                           │
           ▲                           │                           │
           │ extends                   │                           │
┌──────────┴──────────┐                │                ┌──────────┴──────────┐
│    TPCHBenchmark    │────────────────┘                │  TPCHQueryManager   │
├─────────────────────┤                                 ├─────────────────────┤
│ - tables            │                                 │ - queries           │
│ - data_generator    │                                 ├─────────────────────┤
│ - query_manager     │────────────────────────────────▶│ + get_query()       │
├─────────────────────┤                                 │ + get_all_queries() │
│ + _initialize()     │                                 │ + translate_query() │
│ + get_schema()      │                                 └─────────────────────┘
│ + generate_data()   │
│ + get_query()       │
│ + get_queries()     │
└─────────────────────┘

7. Implementation Considerations

7.1 Performance Considerations

  • Use lazy loading for queries and other resources to minimize startup time (see the sketch after this list)

  • Implement incremental data generation for large datasets

  • Consider using Rust components for performance-critical data generation paths

  • Cache translated queries to avoid redundant translation
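
For the lazy-loading point, functools.cached_property defers the load until first access and caches it afterwards; a minimal sketch:

import functools

class LazyQueryMixin:
    """Defer query loading until first access, then cache the result."""

    @functools.cached_property
    def queries(self):
        # Stand-in for an expensive load (parsing templates, etc.); runs once.
        return {1: "SELECT 1"}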

7.2 Memory Management

  • Use generators and iterators for large data generation to minimize memory usage (see the streaming sketch after this list)

  • Implement stream-based data writing for large tables

  • Consider chunked processing for very large benchmark datasets
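
A minimal sketch of the generator-based streaming approach (row contents are placeholders):

import csv
import pathlib
from typing import Iterator, Tuple

def row_stream(num_rows: int) -> Iterator[Tuple[int, str]]:
    """Yield rows one at a time so the full table never resides in memory."""
    for i in range(num_rows):
        yield (i, f"value_{i}")

def write_streaming(path: pathlib.Path, num_rows: int) -> None:
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        for row in row_stream(num_rows):  # constant memory, any table size
            writer.writerow(row)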

7.3 Testing Strategy

  • Unit test each component separately

  • Integration test benchmarks against small-scale data

  • Validate generated data against benchmark specifications

  • Test SQL translation across multiple dialects (see the example test after this list)
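
A translation test can assert that every translated query still parses in its target dialect; a sketch using pytest:

import pytest
import sqlglot

@pytest.mark.parametrize("dialect", ["duckdb", "postgres", "tsql"])
def test_translation_is_parseable(dialect):
    sql = "SELECT l_orderkey FROM lineitem LIMIT 5"
    translated = sqlglot.transpile(sql, write=dialect)[0]
    # Round-trip: the translated query must parse in the target dialect.
    assert sqlglot.parse_one(translated, read=dialect) is not None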

7.4 Documentation Standards

  • Document public APIs with detailed docstrings

  • Provide examples for common use cases

  • Include benchmark-specific documentation (schema, query specifications, etc.)

  • Document extension points and patterns

8. Roadmap and Future Extensions

8.1 Phase 1: Core Framework and TPC-H

  1. Implement BaseBenchmark and core interfaces

  2. Develop data generation framework

  3. Implement TPC-H benchmark

  4. Add sqlglot integration for SQL translation

8.2 Phase 2: Additional Benchmarks

  1. Implement TPC-DS benchmark

  2. Implement Star Schema Benchmark (SSB)

  3. Add H2O/db-benchmark

  4. Add ClickBench

8.3 Phase 3: Features

  1. Add support for custom benchmarks

  2. Implement benchmark execution framework

  3. Add result analysis and visualization tools

  4. Optimize performance for large-scale benchmarks

9. Conclusion

The proposed architecture provides a solid foundation for the BenchBox library, with clear separation of concerns, well-defined interfaces, and a modular structure. The design prioritizes extensibility, allowing new benchmarks to be added with minimal effort, while keeping the library self-contained.

The architecture balances flexibility and simplicity, providing powerful abstractions while keeping the implementation straightforward. By leveraging established design patterns and following best practices, the design ensures that BenchBox will be maintainable and extensible as it evolves.