Glossary

Comprehensive reference for benchmarking, database, and BenchBox-specific terminology.

A

Adapter : Platform-specific implementation that handles connection management, data loading, and query execution for a particular database system. See Platform Selection Guide.

AMPLab Big Data Benchmark : Benchmark comparing performance of analytical SQL engines using big data processing frameworks. Focuses on scan, aggregation, and join operations.

Analytical Database : Database system optimized for OLAP workloads, supporting complex queries over large datasets with aggregations, joins, and analytics. Examples: DuckDB, ClickHouse, Snowflake.

B

Benchmark : Standardized workload used to measure and compare database performance. BenchBox supports TPC-H, TPC-DS, TPC-DI, SSB, ClickBench, and others.

BigQuery : Google Cloud’s fully managed, serverless data warehouse for analytics. Supported via BigQueryAdapter.

Bulk Loading : Optimized method for loading large amounts of data into a database, typically faster than row-by-row insertion. Uses platform-specific features like COPY FROM or cloud storage staging.
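The difference between row-by-row insertion and a batched load can be sketched with Python's built-in `sqlite3` module. This is only a stand-in: it is not BenchBox adapter code, and real platforms would use their own mechanisms such as COPY FROM or cloud storage staging. The table name and helper functions are hypothetical.

```python
import sqlite3

# Hypothetical sample data; a real benchmark would read generated files.
rows = [(i, f"item-{i}") for i in range(1_000)]


def make_db():
    """Create an in-memory database with an empty target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
    return conn


def load_row_by_row(conn, rows):
    """Baseline: one INSERT statement and one commit per row."""
    for row in rows:
        conn.execute("INSERT INTO items VALUES (?, ?)", row)
        conn.commit()


def load_bulk(conn, rows):
    """Batched load: one executemany call inside a single transaction."""
    with conn:  # commits once when the block exits without error
        conn.executemany("INSERT INTO items VALUES (?, ?)", rows)


slow, fast = make_db(), make_db()
load_row_by_row(slow, rows)
load_bulk(fast, rows)

slow_count = slow.execute("SELECT COUNT(*) FROM items").fetchone()[0]
fast_count = fast.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Both paths end with the same table contents; the batched path simply avoids per-row statement and commit overhead.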

C

ClickBench : Real-world analytics benchmark based on anonymized web analytics data. Tests aggregation performance across 43 queries.

ClickHouse : Open-source column-oriented database management system optimized for OLAP workloads. Supported via ClickHouseAdapter.

CoffeeShop Benchmark : Minimal example benchmark in BenchBox for quick testing and demonstration. Uses a tiny dataset with simple queries.

Column Store : Database storage architecture that stores data by columns rather than rows, optimizing for analytical queries that read many rows but few columns.
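As a toy illustration of the two layouts (plain Python containers, not how any real engine stores data), the same three records can be arranged by row or by column. The field names are made up for the example.

```python
# Row-oriented layout: all fields of one record are kept together.
row_store = [
    {"order_id": 1, "customer": "acme", "amount": 120.0},
    {"order_id": 2, "customer": "globex", "amount": 75.5},
    {"order_id": 3, "customer": "acme", "amount": 30.0},
]

# Column-oriented layout: all values of one column are kept together.
column_store = {
    "order_id": [1, 2, 3],
    "customer": ["acme", "globex", "acme"],
    "amount": [120.0, 75.5, 30.0],
}

# An aggregate over one column has to visit every record in the row layout...
total_from_rows = sum(record["amount"] for record in row_store)

# ...but reads a single contiguous list in the column layout.
total_from_columns = sum(column_store["amount"])
```

Fetching one complete record is the opposite case: it is a single lookup in `row_store` but requires one read per column in `column_store`.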

Composite Score : Single metric combining results from multiple benchmark tests. For TPC-H, this is QphH@Size. For TPC-DS, QphDS@Size.

Concurrent Streams : Multiple query streams executing simultaneously to test throughput performance. Each stream runs a complete set of queries in parallel with other streams.

D

Databricks : Unified analytics platform built on Apache Spark. Supported via DatabricksAdapter for SQL Warehouse and cluster execution.

Data Generation : Process of creating synthetic benchmark data at specified scale factors. Uses tools like dbgen (TPC-H) or dsdgen (TPC-DS).

Data Loading : Process of importing generated data files into the target database. May involve schema creation, constraints, indexes, and validation.

Dialect : SQL syntax variant specific to a database system. BenchBox uses sqlglot for dialect translation (e.g., PostgreSQL → Snowflake).

Dry Run : Preview mode that generates queries and configuration without executing against a database. Useful for validation and cost estimation.

DuckDB : In-process analytical database with fast query performance. Default platform for BenchBox local testing.

E

ETL (Extract, Transform, Load) : Data pipeline pattern central to the TPC-DI benchmark. Involves extracting data from sources, transforming it, and loading it into a warehouse.

Execution Phase : Distinct stage of benchmark execution tracked separately: setup, data generation, schema creation, data loading, validation, and query execution.
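One way to track phases separately is a small timing context manager. The sketch below is an assumption about how such tracking could look, not BenchBox's implementation; the `phase` helper and the `phase_timings` dictionary are hypothetical names, and the sleep is a placeholder for real work.

```python
import time
from contextlib import contextmanager

# Maps phase name -> elapsed seconds, in the order the phases ran.
phase_timings = {}


@contextmanager
def phase(name):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_timings[name] = time.perf_counter() - start


PHASES = [
    "setup",
    "data_generation",
    "schema_creation",
    "data_loading",
    "validation",
    "query_execution",
]

for name in PHASES:
    with phase(name):
        time.sleep(0.001)  # placeholder for the real work of this phase
```

Keeping the timings per phase makes it straightforward to report query execution time without setup or loading time mixed in.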

Execution Time : Time taken to execute a query or complete a benchmark run, excluding setup and data loading unless otherwise specified.

G

Generator : Tool that creates benchmark data files. Examples: dbgen (TPC-H), dsdgen (TPC-DS), datagen (TPC-DI).

H

H2O.ai Benchmark : Benchmark comparing data manipulation tools (databases, dataframes, query engines) using groupby operations.

I

Incremental Loading : Loading data in batches over time, as tested in the TPC-DI benchmark through incremental update batches.

J

JoinOrder Benchmark : Real-world benchmark based on IMDB data, testing join query optimization with complex multi-table queries.

M

Maintenance Function : Data modification operations (INSERT, UPDATE, DELETE) executed in the Maintenance Test phase of TPC benchmarks. These operations permanently modify the database and require a database reload before running additional power or throughput tests. In TPC-H, these are called Refresh Functions (RF1 and RF2). See TPC-H Maintenance Test.

Measurement Interval : Time period during which performance metrics are collected for throughput test calculations.

Metric Tons : TPC-H scale factor 1 represents approximately 1 GB of data or 1 metric ton of goods in the business scenario.

O

OLAP (Online Analytical Processing) : Workload type characterized by complex read queries, aggregations, and analytics over large datasets. Most BenchBox benchmarks are OLAP-focused.

OLTP (Online Transaction Processing) : Workload type characterized by many short read/write transactions. Not the primary focus of BenchBox but relevant for mixed workloads.

P

Parquet : Columnar storage file format commonly used for analytical workloads. Default output format for BenchBox data generation.

Performance Run : Official benchmark execution following TPC rules for result reporting and comparison.

Platform : Target database system for benchmark execution. BenchBox supports DuckDB, ClickHouse, Databricks, Snowflake, BigQuery, and Redshift.

Power Test : Single-stream benchmark execution measuring query response time. Runs each query once in sequence. Produces a geometric mean of query times and, for TPC-H, the Power@Size metric that feeds into QphH@Size.

Read Primitives : Microbenchmark suite testing fundamental database operations (scans, filters, aggregations, joins) using TPC-H data.

Q

QphDS@Size : TPC-DS composite performance metric: “Queries per hour at database size”. Combines power and throughput test results.

QphH@Size : TPC-H composite performance metric: “Queries per hour at database size”. Formula: sqrt(Power@Size * Throughput@Size), the geometric mean of the power and throughput metrics, where Power@Size = 3600 * scale_factor / geomean_power_time.
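In the TPC-H specification, the composite is the geometric mean of Power@Size and Throughput@Size. The following is a minimal sketch of that arithmetic; the function names are hypothetical, the timings are made up, and this is not BenchBox's own calculation code.

```python
import math


def power_at_size(query_seconds, refresh_seconds, scale_factor):
    """Power@Size: 3600 * SF divided by the geometric mean of all timings.

    The geometric mean covers the 22 query times plus the 2 refresh
    function times from the power test.
    """
    timings = list(query_seconds) + list(refresh_seconds)
    # Average the logs instead of multiplying, to avoid overflow/underflow.
    log_mean = sum(math.log(t) for t in timings) / len(timings)
    return 3600.0 * scale_factor / math.exp(log_mean)


def throughput_at_size(num_streams, queries_per_stream, elapsed_seconds, scale_factor):
    """Throughput@Size: queries completed per hour across all streams, times SF."""
    return (num_streams * queries_per_stream * 3600.0 / elapsed_seconds) * scale_factor


def qphh_at_size(power, throughput):
    """QphH@Size: geometric mean of the power and throughput metrics."""
    return math.sqrt(power * throughput)


# Made-up timings: every query and refresh function takes 2 seconds,
# and 2 streams of 22 queries finish in 88 seconds at SF=1.
power = power_at_size([2.0] * 22, [2.0] * 2, scale_factor=1)
throughput = throughput_at_size(2, 22, elapsed_seconds=88.0, scale_factor=1)
qphh = qphh_at_size(power, throughput)
```

Because the composite is a geometric mean, a system cannot reach a high score by excelling at only one of the two tests.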

Query ID : Unique identifier for a benchmark query. Examples: "q1", "query42", "Q01".

Query Parameter : Variable value substituted into query template during execution. TPC benchmarks use random parameters for different runs.

Query Stream : Sequence of queries executed in a specific order. Throughput tests use multiple concurrent streams.

Query Substitution : Process of replacing parameter placeholders in query templates with actual values.

Query Template : Query with parameter placeholders that can be instantiated multiple times with different values.
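The relationship between a template, its parameters, and substitution can be sketched with Python's standard `string.Template`. The query text loosely follows the shape of TPC-H Q1, whose `delta` parameter ranges from 60 to 120 days; the `instantiate` helper and the seeding scheme are assumptions for illustration, not the official qgen behavior or BenchBox's code.

```python
import random
from string import Template

# A query template: $delta is a placeholder, not valid SQL on its own.
QUERY_TEMPLATE = Template(
    "SELECT COUNT(*) FROM lineitem "
    "WHERE l_shipdate <= DATE '1998-12-01' - INTERVAL '$delta' DAY"
)


def instantiate(template, seed):
    """Pick parameter values from a seeded RNG and substitute them in."""
    rng = random.Random(seed)  # same seed -> same parameters -> same query
    params = {"delta": rng.randint(60, 120)}
    return template.substitute(params), params


sql_a, params_a = instantiate(QUERY_TEMPLATE, seed=1)
sql_b, params_b = instantiate(QUERY_TEMPLATE, seed=1)
sql_c, params_c = instantiate(QUERY_TEMPLATE, seed=2)
```

Seeding the generator keeps runs reproducible while still letting different streams or runs draw different parameter values.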

R

Redshift : Amazon Web Services cloud data warehouse. Supported via RedshiftAdapter.

Refresh Function (RF) : Maintenance operations in TPC benchmarks that permanently modify data. RF1 typically inserts new data (e.g., new orders and lineitems in TPC-H), while RF2 deletes old data. Executed in the Maintenance Test phase, not during power/throughput tests. The database must be reloaded after executing refresh functions before running further power or throughput tests. See TPC-H Maintenance Test for details.

Result Schema : Standardized JSON format for BenchBox benchmark results. Includes query timings, metadata, system profile, and validation status.

Row Store : Database storage architecture that stores data by complete rows, optimizing for transactional workloads.

S

Scale Factor (SF) : Multiplier controlling benchmark data size. SF=1 is standard size (typically 1 GB for TPC-H). SF=0.01 is 1% size. SF=10 is 10x size.
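For TPC-H, most tables grow linearly with the scale factor while `nation` and `region` stay fixed. The sketch below estimates row counts from the SF=1 cardinalities; the `lineitem` figure is rounded (the actual count at SF=1 is slightly above 6 million and is not exactly linear), and `estimated_rows` is a hypothetical helper.

```python
# Approximate TPC-H row counts at SF=1 for tables that scale with SF.
TPCH_BASE_ROWS = {
    "supplier": 10_000,
    "part": 200_000,
    "partsupp": 800_000,
    "customer": 150_000,
    "orders": 1_500_000,
    "lineitem": 6_000_000,  # rounded approximation
}

# Tables whose size does not depend on the scale factor.
TPCH_FIXED_ROWS = {"nation": 25, "region": 5}


def estimated_rows(scale_factor):
    """Return an approximate row count per table for the given scale factor."""
    rows = {
        table: round(base * scale_factor)
        for table, base in TPCH_BASE_ROWS.items()
    }
    rows.update(TPCH_FIXED_ROWS)
    return rows


tiny = estimated_rows(0.01)   # 1% size, handy for smoke tests
standard = estimated_rows(1)  # roughly 1 GB of raw data
large = estimated_rows(10)    # 10x size
```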

Schema : Database table definitions including columns, data types, constraints, and relationships.

Snowflake : Cloud data warehouse platform. Supported via SnowflakeAdapter.

SQL Dialect : See Dialect.

SSB (Star Schema Benchmark) : Simplified benchmark derived from TPC-H, using a star schema design with a single fact table and four dimension tables.

Setup Phase : Pre-execution stage including data generation, schema creation, and data loading. Typically excluded from performance measurements.

T

Throughput Test : Multi-stream benchmark execution measuring sustained query throughput. Runs multiple query streams concurrently. Produces queries-per-hour metric.
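A multi-stream run can be sketched with a thread pool, where each worker executes one complete query stream. The `run_query` function is a hypothetical stand-in that only sleeps; a real test would execute SQL against the target platform, and the query list here is shortened for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

QUERIES = ["q1", "q2", "q3"]  # shortened, hypothetical query set


def run_query(query_id):
    """Stand-in for executing one query against a database."""
    time.sleep(0.001)


def run_stream(stream_id):
    """Run the complete query set once and return how many queries finished."""
    for query_id in QUERIES:
        run_query(query_id)
    return len(QUERIES)


def throughput_test(num_streams):
    """Run all streams concurrently and report total queries and queries/hour."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        completed = sum(pool.map(run_stream, range(num_streams)))
    elapsed = time.perf_counter() - start
    return completed, completed * 3600.0 / elapsed


completed, queries_per_hour = throughput_test(num_streams=4)
```

The measurement interval here is the wall-clock time from the first stream starting to the last stream finishing.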

TPC (Transaction Processing Performance Council) : Organization defining standard database benchmarks including TPC-H, TPC-DS, and TPC-DI.

TPC-C : OLTP benchmark measuring transaction processing performance. Not currently supported by BenchBox.

TPC-DI (Data Integration) : Benchmark simulating an end-to-end data integration scenario with ETL processes, incremental updates, and data quality validation.

TPC-DS (Decision Support) : Complex analytical benchmark with 99 queries covering advanced SQL features: multi-table joins, subqueries, window functions, rollups.

TPC-H (Ad-Hoc Query) : Widely-used analytical benchmark with 22 queries simulating business intelligence workloads. Focus on joins, aggregations, and sorting.

TPCHavoc : Modified version of TPC-H queries designed to test query optimizer robustness with challenging query patterns.

Tuning : Platform-specific optimizations applied to tables (partitioning, clustering, sorting, indexes) to improve query performance.

V

Validation : Process of verifying the correctness of benchmark results by checking row counts, result sets, or checksums against expected values.
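A row-count check is the simplest of these. The sketch below uses the built-in `sqlite3` module as a stand-in database; `validate_row_counts` is a hypothetical helper, and the `nation` table is deliberately loaded short to show a failing check.

```python
import sqlite3


def validate_row_counts(conn, expected):
    """Compare actual row counts with expected ones; return table -> passed.

    Table names are interpolated into the SQL, so they must come from a
    trusted, fixed list rather than from user input.
    """
    report = {}
    for table, expected_count in expected.items():
        actual = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        report[table] = actual == expected_count
    return report


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE region (r_regionkey INTEGER)")
conn.executemany("INSERT INTO region VALUES (?)", [(i,) for i in range(5)])
conn.execute("CREATE TABLE nation (n_nationkey INTEGER)")
# Load only 20 of the expected 25 rows to simulate an incomplete load.
conn.executemany("INSERT INTO nation VALUES (?)", [(i,) for i in range(20)])

report = validate_row_counts(conn, {"region": 5, "nation": 25})
```

Row counts catch incomplete loads cheaply; comparing result sets or checksums is needed to catch wrong values.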

Variant : TPC-DS query variation with different parameter seeds or substitution values. Used to test optimizer across diverse workload patterns.

W

Warehouse : Data warehouse system designed for analytical queries. Also refers to Databricks SQL Warehouse or Snowflake Warehouse compute resources.

Workload : Complete set of operations executed against a database, including queries and maintenance functions.
