Result Integrity Validation
BenchBox includes a three-tier integrity validator for benchmark result JSON files. The validator checks structural correctness, completeness against benchmark specifications, and statistical believability of reported metrics.
Overview

| Property | Value |
|---|---|
| Module | |
| Specs | |
| CLI script | |
| MCP tool | |
| Tests | |
Design Principles

- Never raises exceptions on validation failure; always returns a structured `IntegrityReport`
- Composes `SchemaV2Validator` as the first structural gate; does not reimplement schema validation
- Specs are hardcoded from DuckDB SF1 reference runs, not dynamically imported from benchmark modules (avoids transitive import issues)
- Two severity levels: FAIL (mathematical impossibility or corruption) and WARN (suspicious but plausible)
- Skipped checks emit PASS with a “skipped” message so check counts remain consistent across benchmarks
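The last principle can be illustrated with a minimal sketch. The names below (`Status`, `Check`, `run_or_skip`, `validate`) are hypothetical stand-ins rather than BenchBox's actual classes; the sketch only shows how a check that does not apply can still produce a PASS entry, so that every report carries the same number of checks.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(Enum):  # hypothetical stand-in for the validator's status type
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


@dataclass
class Check:  # hypothetical stand-in for a single check result
    name: str
    status: Status
    message: str


def run_or_skip(name: str, applicable: bool, check: Callable[[], Check]) -> Check:
    """Run the check when it applies; otherwise record a PASS marked as skipped."""
    if not applicable:
        return Check(name, Status.PASS, "skipped: check does not apply to this result")
    return check()


def validate(scale_factor: float, load_seconds: float) -> list[Check]:
    """Two illustrative checks; the second only applies at scale factor 1."""
    return [
        run_or_skip(
            "load_time_positive",
            True,
            lambda: Check(
                "load_time_positive",
                Status.PASS if load_seconds > 0 else Status.WARN,
                f"data load time is {load_seconds}s",
            ),
        ),
        run_or_skip(
            "row_counts",
            scale_factor == 1,
            # Placeholder for a real comparison against SF1 reference counts.
            lambda: Check("row_counts", Status.PASS, "row counts compared against SF1 reference"),
        ),
    ]


sf1_checks = validate(scale_factor=1, load_seconds=12.5)
sf10_checks = validate(scale_factor=10, load_seconds=90.0)
```

Both lists contain two entries; in the SF10 case the second entry is a PASS whose message starts with "skipped".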
Three-Tier Validation Model

Tier 1: Structural (8 checks)

Validates that the result JSON is well-formed and internally consistent.

| Check | Severity | Description |
|---|---|---|
| | FAIL | Delegates to |
| | FAIL | Top-level keys present ( |
| | FAIL | |
| | FAIL | All timing fields >= 0 |
| | FAIL | p90 <= p95 <= p99 |
| | WARN | Phase statuses are recognized values |
| | FAIL | Each query has required fields (id, ms, status) |
| | FAIL | Per-query millisecond values >= 0 |
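Two of the structural checks are simple enough to sketch directly. The functions below are illustrative only: their names, the `(status, message)` return shape, and the timing field names used in the usage note are assumptions, not the validator's real interface.

```python
def check_percentile_order(p90: float, p95: float, p99: float) -> tuple[str, str]:
    """Return (status, message); FAIL when the percentiles are not non-decreasing."""
    if p90 <= p95 <= p99:
        return "PASS", "p90 <= p95 <= p99"
    return "FAIL", f"percentiles out of order: p90={p90}, p95={p95}, p99={p99}"


def check_non_negative(timings: dict[str, float]) -> tuple[str, str]:
    """Return (status, message); FAIL when any timing value is below zero."""
    negative = sorted(name for name, value in timings.items() if value < 0)
    if negative:
        return "FAIL", f"negative timing fields: {', '.join(negative)}"
    return "PASS", "all timing fields >= 0"
```

For example, `check_percentile_order(10, 9, 15)` yields a FAIL because p90 exceeds p95, and `check_non_negative({"total_ms": 5.0, "avg_ms": -1.0})` reports `avg_ms` as negative (both field names are made up for the example).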
Tier 2: Completeness (5 checks)

Validates that the result contains expected content for its benchmark type.

| Check | Severity | Description |
|---|---|---|
| | FAIL | Query IDs match benchmark spec (e.g., 22 for TPC-H) |
| | FAIL | At least one measurement query (`run_type='measurement'` or `iter > 0`) |
| | WARN | Power test phase ran successfully |
| | WARN | Tables object present when spec requires it |
| | WARN | TPC performance metrics present for TPC-H/DS/Havoc/Skew |
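The first two completeness checks can be sketched as set and predicate operations. This is a hedged illustration: the function names are hypothetical, and treating unexpected query IDs as a failure is an assumption (the text only says the IDs must match the spec). The `run_type` and `iter` keys come from the table above.

```python
def check_query_ids(seen_ids: set[str], expected_ids: frozenset[str]) -> tuple[str, dict]:
    """Compare the query IDs found in a result against the spec's expected set."""
    missing = sorted(expected_ids - seen_ids)
    unexpected = sorted(seen_ids - expected_ids)
    status = "FAIL" if (missing or unexpected) else "PASS"
    return status, {"missing": missing, "unexpected": unexpected}


def has_measurement_query(queries: list[dict]) -> bool:
    """True when at least one query is a measurement run or has iter > 0."""
    return any(
        q.get("run_type") == "measurement" or q.get("iter", 0) > 0
        for q in queries
    )
```

A result that only contains warm-up entries with `iter` equal to 0 would fail the second check, while a single entry with `iter` of 1 is enough to pass it.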
Tier 3: Believability (7 checks)

Validates that reported metrics are statistically plausible.

| Check | Severity | Description |
|---|---|---|
| | WARN | Average timing is between min and max |
| | WARN | Geometric mean is between min and max |
| | WARN | Success rate meets benchmark spec floor (e.g., 100% for TPC-H) |
| | WARN | Row counts within ±1% of spec (SF1 only; warns on missing tables) |
| | WARN | No individual query exceeds 30 minutes |
| | FAIL | No duplicate (id, iter, stream) tuples |
| | WARN | Data load time > 0 |
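As a rough sketch of two believability checks, the duplicate test can be expressed with a counter over `(id, iter, stream)` tuples, and the row count test as a relative tolerance. The function names and the dict-based query records are assumptions made for the example; only the tuple fields and the 1% tolerance come from the table above.

```python
from collections import Counter


def find_duplicate_runs(queries: list[dict]) -> list[tuple]:
    """Return the (id, iter, stream) tuples that occur more than once."""
    counts = Counter((q.get("id"), q.get("iter"), q.get("stream")) for q in queries)
    return [key for key, n in counts.items() if n > 1]


def row_count_within_tolerance(actual: int, expected: int, tolerance: float = 0.01) -> bool:
    """True when actual lies within +/- tolerance (1% by default) of expected."""
    return abs(actual - expected) <= tolerance * expected
```

With an expected count of 100,000 rows, an actual count of 100,500 is inside the tolerance and 102,000 is outside it.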
BenchmarkSpec System

Each benchmark has a `BenchmarkSpec` frozen dataclass defining its expected characteristics:

    from dataclasses import dataclass


    @dataclass(frozen=True)
    class BenchmarkSpec:
        benchmark_id: str
        unique_query_ids: frozenset[str]
        min_unique_queries: int = 0
        min_success_rate: float = 1.0
        high_failure_expected: bool = False
        requires_tables_object: bool = True
        sf1_row_counts: dict[str, int] | None = None
        sf1_power_at_size_range: tuple[float, float] | None = None

Key fields

- `unique_query_ids`: The complete set of expected query IDs for the benchmark
- `min_success_rate`: Floor for success rate checks (e.g., 1.0 for TPC-H, 0.95 for TPC-DS)
- `high_failure_expected`: When `True`, bypasses success rate floor (used by `transaction_primitives`)
- `requires_tables_object`: When `False`, skips the tables object check (used by `metadata_primitives`)
- `sf1_row_counts`: Expected table row counts at scale factor 1, used for believability checks
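How `min_success_rate` and `high_failure_expected` might interact can be sketched as follows. `check_success_rate` is a hypothetical helper, not the validator's real function; reporting the bypass as a skipped PASS and warning on an empty query list are assumptions. The WARN severity follows the Tier 3 table.

```python
def check_success_rate(
    successes: int,
    total: int,
    min_success_rate: float = 1.0,
    high_failure_expected: bool = False,
) -> tuple[str, str]:
    """Return (status, message) for the success rate floor."""
    if high_failure_expected:
        # Assumption: the bypass is reported like any other skipped check.
        return "PASS", "skipped: high failure rate is expected for this benchmark"
    if total == 0:
        # Assumption: an empty result cannot meet any floor.
        return "WARN", "no queries to compute a success rate from"
    rate = successes / total
    if rate < min_success_rate:
        return "WARN", f"success rate {rate:.1%} is below the floor of {min_success_rate:.1%}"
    return "PASS", f"success rate {rate:.1%} meets the floor"
```

With 96 of 100 queries succeeding, a floor of 0.95 passes and a floor of 1.0 warns; with `high_failure_expected=True` the floor is not consulted at all.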
Coverage
Specs exist for 20 of 22 registered benchmarks: tpch, tpcds, tpchavoc, tpch_skew, ssb, clickbench, nyctaxi, h2odb, amplab, joinorder, flightdata, datavault, coffeeshop, tpcdi, tsbs_devops, tpcds_obt, read_primitives, write_primitives, metadata_primitives, transaction_primitives. ai_primitives and vector_search do not yet have integrity specs.
8 legacy aliases are mapped automatically (e.g., star_schema -> ssb, amplab_big_data -> amplab).
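Alias mapping of this kind usually amounts to a dictionary lookup with a fallback. The sketch below assumes a plain dict and a hypothetical `resolve_benchmark_id` helper; only the two alias pairs are taken from the text, and the real `LEGACY_ALIASES` structure may differ.

```python
# Two alias pairs from the examples above; the dict shape is an assumption.
EXAMPLE_ALIASES = {
    "star_schema": "ssb",
    "amplab_big_data": "amplab",
}


def resolve_benchmark_id(benchmark_id: str) -> str:
    """Map a legacy alias to its canonical ID; canonical IDs pass through unchanged."""
    return EXAMPLE_ALIASES.get(benchmark_id, benchmark_id)
```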
Adding a spec for a new benchmark

1. Run the benchmark at SF1 on DuckDB to produce a reference result file.
2. Add a new entry to the `BENCHMARK_SPECS` dict in `benchmark_specs.py`:

       BENCHMARK_SPECS["my_benchmark"] = BenchmarkSpec(
           benchmark_id="my_benchmark",
           unique_query_ids=frozenset(["q1", "q2", "q3"]),
           min_success_rate=1.0,
           requires_tables_object=True,
           sf1_row_counts={
               "table_a": 100_000,
               "table_b": 50_000,
           },
       )

3. If the benchmark has legacy aliases, add them to `LEGACY_ALIASES`.
4. Run the test suite:

       uv run -- python -m pytest tests/unit/core/results/test_integrity_validator.py

IntegrityReport

The validator returns an `IntegrityReport` dataclass:

    @dataclass
    class IntegrityReport:
        file: str                    # File path validated
        benchmark_id: str            # Benchmark identifier
        platform: str                # Platform name
        scale_factor: float          # Scale factor
        overall_status: CheckStatus  # Worst status across all checks
        checks: list[CheckResult]    # Individual check results
        summary: dict[str, int]      # Counts by status (PASS, WARN, FAIL)

Helper methods:

- `report.passed()`: Returns `True` if overall status is PASS
- `report.has_warnings()`: Returns `True` if any check has WARN status

Each `CheckResult` contains:

    @dataclass
    class CheckResult:
        category: CheckCategory  # STRUCTURAL, COMPLETENESS, or BELIEVABILITY
        name: str                # Check identifier (e.g., "query_count_math")
        status: CheckStatus      # PASS, WARN, or FAIL
        message: str             # Human-readable description
        details: dict | None     # Optional structured details

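The `overall_status` and `summary` fields can be derived from the individual check statuses. The following is a sketch under stated assumptions: `Status` is a hypothetical ordered enum (the real `CheckStatus` may be defined differently), and including zero counts in the summary is a guess.

```python
from collections import Counter
from enum import IntEnum


class Status(IntEnum):  # hypothetical; ordered so that max() yields the worst status
    PASS = 0
    WARN = 1
    FAIL = 2


def summarize(statuses: list[Status]) -> tuple[Status, dict[str, int]]:
    """Return (worst status, counts by status name) for a list of check statuses."""
    overall = max(statuses, default=Status.PASS)
    counts = Counter(s.name for s in statuses)
    summary = {s.name: counts.get(s.name, 0) for s in Status}
    return overall, summary
```

Under this ordering, a report with two PASS checks and one WARN check has an overall status of WARN, so a `passed()`-style helper would return `False` while a `has_warnings()`-style helper would return `True`.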
Convenience Functions

The module provides two convenience functions for common use cases:

    from pathlib import Path

    from benchbox.core.results.integrity_validator import validate_file, validate_directory

    # Single file
    report = validate_file(Path("results/tpch_duckdb_sf1.json"))

    # Directory (returns list of reports)
    reports = validate_directory(Path("results/"), pattern="*.json")

Both functions handle invalid JSON gracefully, returning a FAIL report rather than raising an exception.
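The "return a FAIL instead of raising" behaviour can be approximated with a small wrapper around JSON loading. This is not BenchBox's implementation: `load_result_or_fail` and the dict it returns are hypothetical, and the sketch only shows the general pattern of converting read and parse errors into a structured failure.

```python
import json
import tempfile
from pathlib import Path


def load_result_or_fail(path: Path) -> dict:
    """Return parsed JSON under "data", or a FAIL record if the file cannot be read or parsed."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:  # json.JSONDecodeError subclasses ValueError
        return {"file": str(path), "status": "FAIL", "message": f"could not load result: {exc}"}
    return {"file": str(path), "status": "PASS", "data": data}


# Demonstration with a malformed file and a missing file.
with tempfile.TemporaryDirectory() as tmp:
    broken = Path(tmp) / "broken.json"
    broken.write_text("{not valid json", encoding="utf-8")
    broken_outcome = load_result_or_fail(broken)
    missing_outcome = load_result_or_fail(Path(tmp) / "missing.json")
```

Both outcomes carry a FAIL status and a message, and neither call raises.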
Access Points

| Interface | Usage |
|---|---|
| CLI script | |
| MCP tool | |
| Python API | |

See the CLI reference and MCP reference for usage details.
Special Cases

- `metadata_primitives`: Has `requires_tables_object=False` and is excluded from load-time checks (no data loading phase)
- `transaction_primitives`: Has `high_failure_expected=True`, which bypasses the success rate floor since failures are an expected part of transaction testing
- Non-SF1 results: Row count checks are skipped (emitted as PASS with a “skipped” message)
- Non-TPC benchmarks: TPC metric checks are skipped similarly