Real-World Data Benchmarks

Benchmarks built on public real-world datasets rather than synthetic TPC or vendor data.

Why Real-World Data?

Synthetic benchmarks like TPC-H generate data that is statistically clean: uniform distributions, predictable cardinalities, well-behaved joins. Real-world data is messier - skewed, sparse, seasonal, and full of edge cases. BenchBox’s real-world benchmarks capture these characteristics using authoritative public datasets.

What real-world data exposes that synthetic data hides

Temporal skew - weekday vs. weekend, rush hour vs. overnight, holiday effects
Geographic skew - airport hubs, city centers, regional carriers
Categorical imbalance - a handful of carriers / zones / host classes dominate
Sparsity - not every combination of dimensions exists in the data
Dirty data - nulls, rare categories, outliers that force defensive SQL
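As a rough illustration of the "defensive SQL" point above, the sketch below loads a handful of made-up trip rows into an in-memory SQLite database and compares a naive fare-per-mile average with a guarded one. The `trips` table, its columns, and the 500 fare cutoff are hypothetical and chosen only for this example; they are not the BenchBox schema or queries.

```python
import sqlite3

# Hypothetical, hand-made rows; not taken from any BenchBox dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_id INTEGER, distance_miles REAL, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        (1, 2.0, 10.0),
        (2, 4.0, 18.0),
        (3, 0.0, 5.0),     # zero distance
        (4, None, 12.0),   # missing distance
        (5, 3.0, 9000.0),  # implausible fare (outlier)
    ],
)

# Naive query: no guards. SQLite returns NULL on division by zero,
# while some other engines raise an error, and the outlier dominates the mean.
naive = conn.execute(
    "SELECT AVG(fare / distance_miles) FROM trips"
).fetchone()[0]

# Defensive query: NULLIF makes the zero-distance case explicit on any engine,
# and the WHERE clause drops missing distances and out-of-range fares.
defensive = conn.execute(
    """
    SELECT AVG(fare / NULLIF(distance_miles, 0))
    FROM trips
    WHERE distance_miles IS NOT NULL
      AND fare BETWEEN 0 AND 500  -- arbitrary cutoff for this sketch
    """
).fetchone()[0]

print(f"naive fare per mile:     {naive:.2f}")
print(f"defensive fare per mile: {defensive:.2f}")
```

The two results differ by orders of magnitude on five rows, which is the kind of gap that cleanly generated synthetic data rarely produces.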

These characteristics exercise the query optimizer in ways TPC workloads do not: cardinality estimation, join ordering, partition pruning, and predicate pushdown all behave differently when the data is skewed and sparse.
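To make the cardinality-estimation point concrete, here is a minimal sketch, assuming a simplified estimator that divides the row count by the number of distinct values (a uniform-distribution assumption; real optimizers typically use histograms and other statistics as well). The carrier codes and the Zipf-like weights are invented for illustration and do not come from the BenchBox datasets.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

N_ROWS = 100_000
CARRIERS = [f"C{i:02d}" for i in range(20)]  # hypothetical carrier codes

# Synthetic-style column: every carrier is equally likely.
uniform_col = random.choices(CARRIERS, k=N_ROWS)

# Real-world-style column: Zipf-like weights, so a few carriers dominate.
zipf_weights = [1 / (rank + 1) for rank in range(len(CARRIERS))]
skewed_col = random.choices(CARRIERS, weights=zipf_weights, k=N_ROWS)


def estimate_vs_actual(column, value):
    """Compare a uniform-assumption row estimate with the true row count."""
    distinct = len(set(column))
    estimated = len(column) / distinct  # rows / number of distinct values
    actual = Counter(column)[value]
    return estimated, actual


for label, col in (("uniform", uniform_col), ("skewed", skewed_col)):
    for carrier in ("C00", "C19"):
        est, act = estimate_vs_actual(col, carrier)
        print(
            f"{label:8s} carrier={carrier} "
            f"estimated={est:8.0f} actual={act:6d} ratio={act / est:5.2f}"
        )
```

On the uniform column the estimate stays close to the actual count for every carrier. On the skewed column the same estimate is several times too low for the dominant carrier and too high for the rarest one, and errors of that size are what can push an optimizer toward a poor join order or join strategy.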

Real-World Benchmarks in BenchBox

Benchmark	Data Source	Focus
NYC Taxi	NYC TLC trip records	Geospatial + temporal OLAP, taxi zones dimension, 25 queries
Flight Data	US BTS On-Time Performance	Aviation analytics - delays, routes, carriers, seasonality, 20 queries

When to Use

NYC Taxi - test geospatial joins, pickup/dropoff zone rollups, and fare analytics
Flight Data - test temporal and categorical aggregation on a government dataset spanning 20+ years
