H2O.ai Database Benchmark API¶
Complete Python API reference for the H2O.ai Database Benchmark.
Overview¶
The H2O.ai Database Benchmark tests analytical database performance using real-world taxi trip data patterns. Developed by H2O.ai for their database benchmarking initiative, this benchmark focuses on fundamental analytical operations common in data science and machine learning workflows: aggregations, grouping, and time-series analysis.
Key Features:
Real-World Data: Based on NYC taxi trip data structure
Data Science Focus: Tests operations common in ML pipelines
Single-Table Design: Emphasizes aggregation performance
Time-Series Operations: Tests temporal aggregation patterns
Scalable: From 10K rows (SF 0.01) to 1B rows (SF 1000)
Analytics-Oriented: Focuses on data exploration patterns
Reference: https://h2oai.github.io/db-benchmark/
Quick Start¶
from benchbox.h2odb import H2ODB
from benchbox.platforms.duckdb import DuckDBAdapter
# Create benchmark
benchmark = H2ODB(scale_factor=0.1)
benchmark.generate_data()
# Run benchmark
adapter = DuckDBAdapter()
results = adapter.run_benchmark(benchmark)
API Reference¶
H2ODB Class¶
- class H2ODB(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Bases: BaseBenchmark
H2O Database Benchmark implementation.
This class provides an implementation of the H2O Database Benchmark, including data generation and access to analytical queries for taxi trip data.
Reference: H2O.ai benchmarking suite for analytical workloads
- __init__(scale_factor=1.0, output_dir=None, **kwargs)[source]¶
Initialize an H2O Database Benchmark instance.
- Parameters:
scale_factor (float) – Scale factor for the benchmark (1.0 ≈ 1M rows, ~100MB)
output_dir (str | Path | None) – Directory to output generated data files
**kwargs – Additional implementation-specific options
- generate_data()[source]¶
Generate H2O Database Benchmark data.
- Returns:
A list of paths to the generated data files
- Return type:
list[str | Path]
- get_queries(dialect=None)[source]¶
Get all H2O Database Benchmark queries.
- Parameters:
dialect (str | None) – Target SQL dialect for query translation. If None, returns original queries.
- Returns:
A dictionary mapping query IDs to query strings
- Return type:
dict[str, str]
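Example (a minimal sketch; "duckdb" as a dialect name is an assumption, and dialect translation requires sqlglot):
# Fetch all queries pre-translated for a target engine
queries_duckdb = benchmark.get_queries(dialect="duckdb")
for query_id, sql in queries_duckdb.items():
    print(f"{query_id}: {len(sql)} characters")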
- get_query(query_id, *, params=None)[source]¶
Get a specific H2O Database Benchmark query.
- Parameters:
query_id (int | str) – The ID of the query to retrieve (Q1-Q10)
params (dict[str, Any] | None) – Optional parameters to customize the query
- Returns:
The query string
- Raises:
ValueError – If the query_id is invalid
- Return type:
str
- get_schema()[source]¶
Get the H2O Database Benchmark schema.
- Returns:
A list of dictionaries describing the tables in the schema
- Return type:
list[dict]
- get_create_tables_sql(dialect='standard', tuning_config=None)[source]¶
Get SQL to create all H2O Database Benchmark tables.
- Parameters:
dialect (str) – SQL dialect to use
tuning_config – Unified tuning configuration for constraint settings
- Returns:
SQL script for creating all tables
- Return type:
str
- apply_verbosity(settings)¶
Apply verbosity settings to the mixin consumer.
- property benchmark_name: str¶
Get the human-readable benchmark name.
- create_enhanced_benchmark_result(platform, query_results, execution_metadata=None, phases=None, resource_utilization=None, performance_characteristics=None, **kwargs)¶
Create a BenchmarkResults object with standardized fields.
This centralizes the logic for creating benchmark results that was previously duplicated across platform adapters and the CLI orchestrator.
- Parameters:
platform (str) – Platform name (e.g., “DuckDB”, “ClickHouse”)
query_results (list[dict[str, Any]]) – List of query execution results
execution_metadata (dict[str, Any] | None) – Optional execution metadata
phases (dict[str, dict[str, Any]] | None) – Optional phase tracking information
resource_utilization (dict[str, Any] | None) – Optional resource usage metrics
performance_characteristics (dict[str, Any] | None) – Optional performance analysis
**kwargs (Any) – Additional fields to override defaults
- Returns:
Fully configured BenchmarkResults object
- Return type:
BenchmarkResults
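Example (a minimal sketch; the query_results entries are hypothetical and only illustrate the expected list-of-dicts shape):
# Hypothetical per-query results, shaped like the dictionaries returned by run_query()
query_results = [
    {"query_id": "Q1", "execution_time": 0.012, "row_count": 1},
    {"query_id": "Q2", "execution_time": 0.034, "row_count": 1},
]
enhanced = benchmark.create_enhanced_benchmark_result(
    platform="DuckDB",
    query_results=query_results,
    execution_metadata={"scale_factor": 0.1},  # optional metadata, illustrative values
)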
- format_results(benchmark_result)¶
Format benchmark results for display.
- Parameters:
benchmark_result (dict[str, Any]) – Result dictionary from run_benchmark()
- Returns:
Formatted string representation of the results
- Return type:
str
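Example (a sketch assuming conn is an open DatabaseConnection, e.g. obtained from a platform adapter):
summary = benchmark.run_benchmark(conn)
print(benchmark.format_results(summary))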
- get_data_source_benchmark()¶
Return the canonical source benchmark when data is shared.
Benchmarks that reuse data generated by another benchmark (for example,
Primitives reusing TPC-H datasets) should override this method and return the lower-case identifier of the source benchmark. Benchmarks that produce their own data should return None (default).
- log_debug_info(context='Debug')¶
Log comprehensive debug information including version details.
- log_error_with_debug_info(error, context='Error')¶
Log an error with comprehensive debug information.
- log_operation_complete(operation, duration=None, details='')¶
- log_operation_start(operation, details='')¶
- log_verbose(message)¶
- log_version_warning()¶
Log version consistency warnings if any exist.
- log_very_verbose(message)¶
- property logger: Logger¶
Return the logger configured for the verbosity mixin consumer.
- quiet: bool = False¶
- run_benchmark(connection, query_ids=None, fetch_results=False, setup_database=True)¶
Run the complete benchmark suite.
- Parameters:
connection (DatabaseConnection) – Database connection to execute queries on
query_ids (list[int | str] | None) – Optional list of specific query IDs to run (defaults to all)
fetch_results (bool) – Whether to fetch and return query results
setup_database (bool) – Whether to set up the database first
- Returns:
A dictionary containing:
benchmark_name: Name of the benchmark
total_execution_time: Total time for all queries
total_queries: Number of queries executed
successful_queries: Number of queries that succeeded
failed_queries: Number of queries that failed
query_results: List of individual query results
setup_time: Time taken for database setup (if performed)
- Return type:
dict
- Raises:
Exception – If benchmark execution fails
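Example (a minimal sketch; the connection is assumed to come from a platform adapter's create_connection()):
conn = adapter.create_connection()
summary = benchmark.run_benchmark(conn, query_ids=["Q1", "Q3"], fetch_results=False)
print(f"{summary['successful_queries']}/{summary['total_queries']} queries "
      f"in {summary['total_execution_time']:.2f}s")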
- run_query(query_id, connection, params=None, fetch_results=False)¶
Execute a single query and return timing and results.
- Parameters:
query_id (int | str) – ID of the query to execute
connection (DatabaseConnection) – Database connection to execute query on
params (dict[str, Any] | None) – Optional parameters for query customization
fetch_results (bool) – Whether to fetch and return query results
- Returns:
A dictionary containing:
query_id: Executed query ID
execution_time: Time taken to execute the query in seconds
query_text: Executed query text
results: Query results if fetch_results=True, otherwise None
row_count: Number of rows returned (if results fetched)
- Return type:
dict
- Raises:
ValueError – If query_id is invalid
Exception – If query execution fails
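Example (a sketch assuming conn is a DatabaseConnection with the TRIPS table already loaded):
result = benchmark.run_query("Q1", conn, fetch_results=True)
print(f"Q1: {result['execution_time']:.3f}s, {result['row_count']} rows")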
- run_with_platform(platform_adapter, **run_config)¶
Run complete benchmark using platform-specific optimizations.
This method provides a unified interface for running benchmarks using database platform adapters that handle connection management, data loading optimizations, and query execution.
This is the standard method that all benchmarks should support for integration with the CLI and other orchestration tools.
- Parameters:
platform_adapter – Platform adapter instance (e.g., DuckDBAdapter)
**run_config – Configuration options:
categories: List of query categories to run (if the benchmark supports categories)
query_subset: List of specific query IDs to run
connection: Connection configuration
benchmark_type: Type hint for optimizations ('olap', 'oltp', etc.)
- Returns:
BenchmarkResults object with execution results
Example
from benchbox.platforms import DuckDBAdapter
benchmark = SomeBenchmark(scale_factor=0.1)
adapter = DuckDBAdapter()
results = benchmark.run_with_platform(adapter)
- setup_database(connection)¶
Set up database with schema and data.
Creates necessary database schema and loads benchmark data into the database.
- Parameters:
connection (DatabaseConnection) – Database connection to set up
- Raises:
ValueError – If data generation fails
Exception – If database setup fails
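Example (a minimal sketch; the connection is assumed to come from a platform adapter):
conn = adapter.create_connection()
benchmark.setup_database(conn)  # create the TRIPS schema and load generated data
summary = benchmark.run_benchmark(conn, setup_database=False)  # avoid repeating setup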
- translate_query(query_id, dialect)¶
Translate a query to a specific SQL dialect.
- Parameters:
query_id (int | str) – The ID of the query to translate
dialect (str) – The target SQL dialect
- Returns:
The translated query string
- Raises:
ValueError – If the query_id is invalid
ImportError – If sqlglot is not installed
ValueError – If the dialect is not supported
- Return type:
str
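Example (a sketch; "clickhouse" as a dialect name is an assumption, and translation requires sqlglot):
try:
    translated = benchmark.translate_query("Q3", "clickhouse")
    print(translated)
except ImportError:
    print("Install sqlglot to enable dialect translation")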
- verbose: bool = False¶
- verbose_enabled: bool = False¶
- verbose_level: int = 0¶
- property verbosity_settings: VerbositySettings¶
Return the current verbosity settings.
- very_verbose: bool = False¶
Constructor:
H2ODB(
    scale_factor: float = 1.0,
    output_dir: Optional[Union[str, Path]] = None,
    **kwargs
)
Parameters:
scale_factor (float): Data size multiplier (1.0 ≈ 1M rows, ~100MB)
output_dir (str | Path | None): Output directory for generated data
Schema Methods¶
- get_create_tables_sql(dialect='standard', tuning_config=None) → str¶
Get SQL to create the TRIPS table with taxi trip data schema.
Example:
schema_sql = benchmark.get_create_tables_sql(dialect="duckdb")
- get_schema() → list[dict]¶
Get detailed schema information for the TRIPS table.
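Example (each entry is a table description dictionary; its exact keys are not assumed here):
for table in benchmark.get_schema():
    print(table)  # schema details for the TRIPS table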
Query Methods¶
- get_query(query_id, *, params=None) → str¶
Get specific H2O.ai query with optional parameters.
Parameters:
query_id (int | str): Query ID (Q1-Q10 or 1-10)
params (dict | None): Query parameters
Supported Parameters:
start_date (str): Start date for temporal filtering
end_date (str): End date for temporal filtering
min_fare (float): Minimum fare amount
max_fare (float): Maximum fare amount
Example:
# Basic aggregation query
count_query = benchmark.get_query("Q1")

# Temporal analysis with date range
temporal_query = benchmark.get_query("Q8", params={
    'start_date': '2020-01-01',
    'end_date': '2020-01-31'
})

# Statistical analysis with fare filter
stats_query = benchmark.get_query("Q9", params={
    'min_fare': 5.0,
    'max_fare': 100.0
})
- get_queries(dialect=None) → dict[str, str]¶
Get all H2O.ai queries (10 queries total).
Example:
queries = benchmark.get_queries()
print(f"Available queries: {list(queries.keys())}")
# Output: ['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10']
Query Categories¶
The H2O.ai benchmark queries are organized by analytical operation type:
- Basic Aggregation Queries (Q1-Q2):
Q1: Simple COUNT(*)
Q2: SUM and AVG aggregations
Performance focus: Sequential scan and basic aggregation speed
- Grouping Queries (Q3-Q6):
Q3: Single-column GROUP BY
Q4: Multi-aggregate grouping
Q5: Two-column GROUP BY
Q6: Complex multi-column grouping
Performance focus: Hash aggregation and grouping algorithms
- Temporal Analysis Queries (Q7-Q8):
Q7: Time-based grouping (hourly analysis)
Q8: Complex temporal aggregation with date filtering
Performance focus: Date/time function evaluation and temporal grouping
- Advanced Analytics Queries (Q9-Q10):
Q9: Statistical analysis (STDDEV, percentiles)
Q10: Complex multi-metric analytics
Performance focus: Statistical function computation and sorting
Usage Examples¶
Basic Benchmark Execution¶
from benchbox.h2odb import H2ODB
from benchbox.platforms.duckdb import DuckDBAdapter
# Initialize with testing scale
benchmark = H2ODB(scale_factor=0.1)
data_files = benchmark.generate_data()
# Run with DuckDB
adapter = DuckDBAdapter(memory_limit="4GB")
results = adapter.run_benchmark(benchmark)
print(f"Queries: {results.total_queries}")
print(f"Average time: {results.average_query_time:.3f}s")
Query Group Testing¶
from benchbox.h2odb import H2ODB
from benchbox.platforms.duckdb import DuckDBAdapter
import time
benchmark = H2ODB(scale_factor=0.01)
benchmark.generate_data()
adapter = DuckDBAdapter()
conn = adapter.create_connection()
adapter.create_schema(benchmark, conn)
adapter.load_data(benchmark, conn, benchmark.output_dir)
# Test query groups
query_groups = {
    'Basic': ['Q1', 'Q2'],
    'Grouping': ['Q3', 'Q4', 'Q5', 'Q6'],
    'Temporal': ['Q7', 'Q8'],
    'Advanced': ['Q9', 'Q10']
}
params = {
    'start_date': '2020-01-01',
    'end_date': '2020-01-31',
    'min_fare': 5.0,
    'max_fare': 100.0
}
for group_name, query_ids in query_groups.items():
    print(f"\n{group_name} Queries:")
    for query_id in query_ids:
        query = benchmark.get_query(query_id, params=params)
        start = time.time()
        result = conn.execute(query).fetchall()
        elapsed = time.time() - start
        print(f"  {query_id}: {elapsed*1000:.1f} ms ({len(result)} rows)")
Aggregation Performance Analysis¶
from benchbox.h2odb import H2ODB
from benchbox.platforms.duckdb import DuckDBAdapter
import time
from statistics import mean
# Test aggregation performance across scales
scale_factors = [0.01, 0.1, 1.0]
for sf in scale_factors:
    print(f"\n=== Scale Factor {sf} ===")
    benchmark = H2ODB(scale_factor=sf)
    benchmark.generate_data()
    adapter = DuckDBAdapter()
    conn = adapter.create_connection()
    adapter.create_schema(benchmark, conn)
    adapter.load_data(benchmark, conn, benchmark.output_dir)

    # Test different aggregation patterns
    aggregation_tests = [
        ('Q1', 'Simple count'),
        ('Q2', 'Sum and avg'),
        ('Q3', 'Single-column GROUP BY'),
        ('Q5', 'Two-column GROUP BY'),
        ('Q9', 'Statistical functions')
    ]
    for query_id, description in aggregation_tests:
        query = benchmark.get_query(query_id)
        # Run 3 times for stability
        times = []
        for _ in range(3):
            start = time.time()
            result = conn.execute(query).fetchall()
            times.append(time.time() - start)
        print(f"{query_id} ({description}): {mean(times)*1000:.1f} ms")
Data Science Workflow Integration¶
from benchbox.h2odb import H2ODB
from benchbox.platforms.duckdb import DuckDBAdapter
import pandas as pd
import time
# Generate data for ML preprocessing simulation
benchmark = H2ODB(scale_factor=1.0)
benchmark.generate_data()
# Load into pandas for data science operations
trips_df = pd.read_csv(benchmark.output_dir / "trips.csv")
print("Data Science Operations Performance:")
# Feature engineering
start = time.time()
trips_df['hour'] = pd.to_datetime(trips_df['pickup_datetime']).dt.hour
trips_df['day_of_week'] = pd.to_datetime(trips_df['pickup_datetime']).dt.dayofweek
trips_df['trip_duration'] = (
    pd.to_datetime(trips_df['dropoff_datetime']) -
    pd.to_datetime(trips_df['pickup_datetime'])
).dt.total_seconds()
print(f"Feature engineering: {time.time() - start:.3f}s")
# Aggregation operations (similar to H2O queries)
start = time.time()
hourly_stats = trips_df.groupby('hour').agg({
    'fare_amount': ['sum', 'mean', 'std', 'count'],
    'trip_distance': ['mean'],
    'passenger_count': ['mean']
})
print(f"GroupBy aggregation: {time.time() - start:.3f}s")
# Statistical analysis
start = time.time()
vendor_stats = trips_df.groupby('vendor_id').agg({
    'fare_amount': ['count', 'sum', 'mean', 'std', 'min', 'max'],
    'tip_amount': ['mean', 'std']
})
print(f"Statistical analysis: {time.time() - start:.3f}s")
Best Practices¶
Use Appropriate Scale Factors
# Development and testing
dev = H2ODB(scale_factor=0.01)   # ~10K rows, ~1 MB

# Standard benchmark
test = H2ODB(scale_factor=1.0)   # ~1M rows, ~100 MB

# Large-scale testing
prod = H2ODB(scale_factor=10.0)  # ~10M rows, ~1 GB
Test Query Groups Separately
# Test basic aggregation performance
basic_queries = ['Q1', 'Q2']

# Test grouping performance
grouping_queries = ['Q3', 'Q4', 'Q5', 'Q6']

# Test temporal analysis performance
temporal_queries = ['Q7', 'Q8']

# Test advanced analytics performance
advanced_queries = ['Q9', 'Q10']
Parameterize Temporal Queries
params = {
    'start_date': '2020-01-01',
    'end_date': '2020-01-31',
    'min_fare': 5.0,
    'max_fare': 100.0
}
query = benchmark.get_query("Q8", params=params)
Monitor Memory for Large Grouping Operations
# Large scale factors may require memory configuration
adapter = DuckDBAdapter(memory_limit="8GB")
Use Multiple Iterations for Timing
from statistics import mean

times = []
for _ in range(3):
    start = time.time()
    result = conn.execute(query).fetchall()
    times.append(time.time() - start)
avg_time = mean(times)
Common Issues¶
- Issue: Slow aggregation queries on large datasets
Solution: Use columnar storage and appropriate indices
Consider partitioning by date for temporal queries
Increase memory limits for large GROUP BY operations
- Issue: Memory errors with high scale factors
Solution: Start with smaller scale factors (0.01, 0.1)
Increase database memory limits
Use external aggregation if available
- Issue: Incorrect temporal analysis results
Solution: Ensure proper timezone handling
Filter out NULL or invalid dates
Use appropriate date truncation functions
See Also¶
H2O DB Benchmark - H2O.ai benchmark guide
ClickBench Benchmark API - ClickBench analytics benchmark
AMPLab Big Data Benchmark API - AMPLab big data benchmark
Base Benchmark API - Base benchmark interface
External Resources¶
H2O.ai DB Benchmark - Original specification
NYC Taxi Data - Source data format
Database Performance Analysis - Performance comparisons