Glossary

Tags beginner reference

Comprehensive reference for benchmarking, database, and BenchBox-specific terminology.

A

Adapter : Platform-specific implementation that handles connection management, data loading, and query execution for a particular database system. See Platform Selection Guide.

AMPLab Big Data Benchmark : Benchmark comparing performance of analytical SQL engines using big data processing frameworks. Focuses on scan, aggregation, and join operations.

Analytical Database : Database system optimized for OLAP workloads, supporting complex queries over large datasets with aggregations, joins, and analytics. Examples: DuckDB, ClickHouse, Snowflake.

B

Benchmark : Standardized workload used to measure and compare database performance. BenchBox supports TPC-H, TPC-DS, TPC-DI, SSB, ClickBench, and others.

BigQuery : Google Cloud’s fully managed, serverless data warehouse for analytics. Supported via BigQueryAdapter.

Bulk Loading : Optimized method for loading large amounts of data into a database, typically faster than row-by-row insertion. Uses platform-specific features like COPY FROM or cloud storage staging.

C

ClickBench : Real-world analytics benchmark based on anonymized web analytics data. Tests aggregation performance across 43 queries.

ClickHouse : Open-source column-oriented database management system optimized for OLAP workloads. Supported via ClickHouseAdapter.

CoffeeShop Benchmark : Minimal example benchmark in BenchBox for quick testing and demonstration. Uses tiny dataset with simple queries.

Column Store : Database storage architecture that stores data by columns rather than rows, optimizing for analytical queries that read many rows but few columns.

Composite Score : Single metric combining results from multiple benchmark tests. For TPC-H, this is QphH@Size. For TPC-DS, QphDS@Size.

Concurrent Streams : Multiple query streams executing simultaneously to test throughput performance. Each stream runs a complete set of queries in parallel with other streams.

D

Databricks : Unified analytics platform built on Apache Spark. Supported via DatabricksAdapter for SQL Warehouse and cluster execution.

Data Generation : Process of creating synthetic benchmark data at specified scale factors. Uses tools like dbgen (TPC-H) or dsdgen (TPC-DS).

Data Loading : Process of importing generated data files into the target database. May involve schema creation, constraints, indexes, and validation.

Dialect : SQL syntax variant specific to a database system. BenchBox uses sqlglot for dialect translation (e.g., PostgreSQL → Snowflake).

Dry Run : Preview mode that generates queries and configuration without executing against a database. Useful for validation and cost estimation.

DuckDB : In-process analytical database with fast query performance. Default platform for BenchBox local testing.

E

ETL (Extract, Transform, Load) : Data pipeline pattern central to TPC-DI benchmark. Involves extracting data from sources, transforming it, and loading into warehouse.

Execution Phase : Distinct stage of benchmark execution tracked separately: setup, data generation, schema creation, data loading, validation, and query execution.

Execution Time : Time taken to execute a query or complete benchmark run, excluding setup and data loading unless specified.

G

Generator : Tool that creates benchmark data files. Examples: dbgen (TPC-H), dsdgen (TPC-DS), datagen (TPC-DI).

H

H2O.ai Benchmark : Benchmark comparing data manipulation tools (databases, dataframes, query engines) using groupby operations.

I

Incremental Loading : Loading data in batches over time, as tested in TPC-DI benchmark through incremental update batches.

J

JoinOrder Benchmark : Real-world benchmark based on IMDB data, testing join query optimization with complex multi-table queries.

M

Maintenance Function : Data modification operations (INSERT, UPDATE, DELETE) executed in the Maintenance Test phase of TPC benchmarks. These operations permanently modify the database and require a database reload before running additional power or throughput tests. In TPC-H, these are called Refresh Functions (RF1 and RF2). See TPC-H Maintenance Test.

Measurement Interval : Time period during which performance metrics are collected for throughput test calculations.

Metric Tons : TPC-H scale factor 1 represents approximately 1 GB of data or 1 metric ton of goods in the business scenario.

O

OLAP (Online Analytical Processing) : Workload type characterized by complex read queries, aggregations, and analytics over large datasets. Most BenchBox benchmarks are OLAP-focused.

OLTP (Online Transaction Processing) : Workload type characterized by many short read/write transactions. Not the primary focus of BenchBox but relevant for mixed workloads.

P

Parquet : Columnar storage file format commonly used for analytical workloads. Default output format for BenchBox data generation.

Performance Run : Official benchmark execution following TPC rules for result reporting and comparison.

Platform : Target database system for benchmark execution. BenchBox supports DuckDB, ClickHouse, Databricks, Snowflake, BigQuery, Redshift.

Power Test : Single-stream benchmark execution measuring query response time. Runs each query once in sequence. Produces geometric mean or QphH metric.

Read Primitives : Microbenchmark suite testing fundamental database operations (scans, filters, aggregations, joins) using TPC-H data.

Q

QphDS@Size : TPC-DS composite performance metric: “Queries per hour at database size”. Combines power and throughput test results.

QphH@Size : TPC-H composite performance metric: “Queries per hour at database size”. Formula: (3600 / geomean_power_time) * throughput_factor / scale_factor.

Query ID : Unique identifier for a benchmark query. Examples: "q1", "query42", "Q01".

Query Parameter : Variable value substituted into query template during execution. TPC benchmarks use random parameters for different runs.

Query Stream : Sequence of queries executed in a specific order. Throughput tests use multiple concurrent streams.

Query Substitution : Process of replacing parameter placeholders in query templates with actual values.

Query Template : Query with parameter placeholders that can be instantiated multiple times with different values.

R

Redshift : Amazon Web Services cloud data warehouse. Supported via RedshiftAdapter.

Refresh Function (RF) : Maintenance operations in TPC benchmarks that permanently modify data. RF1 typically inserts new data (e.g., new orders and lineitems in TPC-H), while RF2 deletes old data. Database must be reloaded after executing refresh functions before running power or throughput tests. Executed in the Maintenance Test phase, not during power/throughput tests. See TPC-H Maintenance Test for details.

Result Schema : Standardized JSON format for BenchBox benchmark results. Includes query timings, metadata, system profile, and validation status.

Row Store : Database storage architecture that stores data by complete rows, optimizing for transactional workloads.

S

Scale Factor (SF) : Multiplier controlling benchmark data size. SF=1 is standard size (typically 1 GB for TPC-H). SF=0.01 is 1% size. SF=10 is 10x size.

Schema : Database table definitions including columns, data types, constraints, and relationships.

Snowflake : Cloud data warehouse platform. Supported via SnowflakeAdapter.

SQL Dialect : See Dialect.

SSB (Star Schema Benchmark) : Simplified benchmark derived from TPC-H, using a star schema design with a single fact table and four dimension tables.

Setup Phase : Pre-execution stage including data generation, schema creation, and data loading. Typically excluded from performance measurements.

T

Throughput Test : Multi-stream benchmark execution measuring sustained query throughput. Runs multiple query streams concurrently. Produces queries-per-hour metric.

TPC (Transaction Processing Performance Council) : Organization defining standard database benchmarks including TPC-H, TPC-DS, and TPC-DI.

TPC-C : OLTP benchmark measuring transaction processing performance. Not currently supported by BenchBox.

TPC-DI (Data Integration) : Benchmark simulating end-to-end data integration scenario with ETL processes, incremental updates, and data quality validation.

TPC-DS (Decision Support) : Complex analytical benchmark with 99 queries covering advanced SQL features: multi-table joins, subqueries, window functions, rollups.

TPC-H (Ad-Hoc Query) : Widely-used analytical benchmark with 22 queries simulating business intelligence workloads. Focus on joins, aggregations, and sorting.

TPCHavoc : Modified version of TPC-H queries designed to test query optimizer robustness with challenging query patterns.

Tuning : Platform-specific optimizations applied to tables (partitioning, clustering, sorting, indexes) to improve query performance.

V

Validation : Process of verifying benchmark results correctness by checking row counts, result sets, or checksums against expected values.

Variant : TPC-DS query variation with different parameter seeds or substitution values. Used to test optimizer across diverse workload patterns.

W

Warehouse : Data warehouse system designed for analytical queries. Also refers to Databricks SQL Warehouse or Snowflake Warehouse compute resources.

Workload : Complete set of operations executed against a database, including queries and maintenance functions.

See Also