Result Integrity Validation

BenchBox includes a three-tier integrity validator for benchmark result JSON files. The validator checks structural correctness, completeness against benchmark specifications, and statistical believability of reported metrics.

Overview

Property	Value
Module	`benchbox.core.results.integrity_validator`
Specs	`benchbox.core.results.benchmark_specs`
CLI script	`_project/scripts/validate_results.py`
MCP tool	`validate_results` (in `benchbox.mcp.tools.analytics`)
Tests	`tests/unit/core/results/test_integrity_validator.py`

Design Principles

Never raises exceptions on validation failure - always returns a structured IntegrityReport
Composes SchemaV2Validator as the first structural gate - does not reimplement schema validation
Specs are hardcoded from DuckDB SF1 reference runs, not dynamically imported from benchmark modules (avoids transitive import issues)
Two severity levels: FAIL (mathematical impossibility / corruption) and WARN (suspicious but plausible)
Skipped checks emit PASS with a “skipped” message so check counts remain consistent across benchmarks

Three-Tier Validation Model

Tier 1: Structural (8 checks)

Validates that the result JSON is well-formed and internally consistent.

Check	Severity	Description
`schema_v2`	FAIL	Delegates to `SchemaV2Validator` for schema compliance
`required_keys`	FAIL	Top-level keys present (`benchmark`, `platform`, `summary`, etc.)
`query_count_math`	FAIL	`passed + failed = total` in summary
`timing_non_negative`	FAIL	All timing fields >= 0
`percentile_ordering`	FAIL	p90 <= p95 <= p99
`phase_status_values`	WARN	Phase statuses are recognized values
`query_entry_fields`	FAIL	Each query has required fields (id, ms, status)
`query_ms_non_negative`	FAIL	Per-query millisecond values >= 0
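
Two of the structural invariants above can be sketched as small pure functions. This is a minimal illustration, not the validator's actual code: the function names, the `(status, message)` return shape, and the dictionary keys (`passed`, `failed`, `total`, `p90`, `p95`, `p99`) are assumptions chosen for the example.

```python
from typing import Any


def check_query_count_math(summary: dict[str, Any]) -> tuple[str, str]:
    """Return (status, message) for the passed + failed = total invariant."""
    # Key names are assumed for illustration; the real schema may differ.
    passed, failed, total = summary["passed"], summary["failed"], summary["total"]
    if passed + failed == total:
        return "PASS", f"{passed} + {failed} = {total}"
    return "FAIL", f"{passed} + {failed} != {total}"


def check_percentile_ordering(timing: dict[str, float]) -> tuple[str, str]:
    """Return (status, message) for the p90 <= p95 <= p99 invariant."""
    p90, p95, p99 = timing["p90"], timing["p95"], timing["p99"]
    if p90 <= p95 <= p99:
        return "PASS", "percentiles are ordered"
    return "FAIL", f"expected p90 <= p95 <= p99, got {p90}, {p95}, {p99}"
```

Both checks are FAIL-severity in the table because a violation cannot be explained by a slow or noisy run; it indicates a corrupted or hand-edited file.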

Tier 2: Completeness (5 checks)

Validates that the result contains expected content for its benchmark type.

Check	Severity	Description
`expected_query_ids`	FAIL	Query IDs match benchmark spec (e.g., 22 for TPC-H)
`measurement_queries_present`	FAIL	At least one measurement query (`run_type='measurement'` or `iter > 0`)
`power_phase_completed`	WARN	Power test phase ran successfully
`tables_object`	WARN	Tables object present when spec requires it
`tpc_metrics`	WARN	TPC performance metrics present for TPC-H/DS/Havoc/Skew
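
The first two completeness checks reduce to set comparison and a simple predicate. The sketch below is illustrative only: the function names are hypothetical, treating unexpected IDs as a failure is an assumption, and the per-query field names `run_type` and `iter` are taken from the table above.

```python
from typing import Any


def check_expected_query_ids(
    observed: set[str], expected: frozenset[str]
) -> tuple[str, dict[str, list[str]]]:
    """Compare observed query IDs against the spec's expected set."""
    missing = sorted(expected - observed)
    # Assumption: IDs outside the spec also count as a mismatch.
    unexpected = sorted(observed - expected)
    status = "PASS" if not missing and not unexpected else "FAIL"
    return status, {"missing": missing, "unexpected": unexpected}


def has_measurement_query(queries: list[dict[str, Any]]) -> bool:
    """True if any entry is a measurement run (run_type or iter > 0)."""
    return any(
        q.get("run_type") == "measurement" or q.get("iter", 0) > 0
        for q in queries
    )
```

Returning the missing and unexpected IDs alongside the status gives a report something concrete to place in a check's `details` field.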

Tier 3: Believability (7 checks)

Validates that reported metrics are statistically plausible.

Check	Severity	Description
`avg_in_range`	WARN	Average timing is between min and max
`geomean_in_range`	WARN	Geometric mean is between min and max
`success_rate`	WARN	Success rate meets benchmark spec floor (e.g., 100% for TPC-H)
`sf1_row_counts`	WARN	Row counts within +/-1% of spec (SF1 only; warns on missing tables)
`timing_outliers`	WARN	No individual query exceeds 30 minutes
`no_duplicate_entries`	FAIL	No duplicate (id, iter, stream) tuples
`load_time_nonzero`	WARN	Data load time > 0
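
As a rough illustration of two believability checks, the sketch below tests a row count against the +/-1% band and finds duplicate `(id, iter, stream)` tuples. The helper names and the use of `dict.get` for optional fields are assumptions, not the module's actual implementation.

```python
from collections import Counter
from typing import Any


def row_count_within_tolerance(
    actual: int, expected: int, tolerance: float = 0.01
) -> bool:
    """True if actual is within +/- tolerance (default 1%) of expected."""
    return abs(actual - expected) <= tolerance * expected


def find_duplicate_entries(queries: list[dict[str, Any]]) -> list[tuple]:
    """Return every (id, iter, stream) tuple that appears more than once."""
    counts = Counter((q["id"], q.get("iter"), q.get("stream")) for q in queries)
    return [key for key, n in counts.items() if n > 1]
```

Duplicates are the only FAIL in this tier: a repeated tuple cannot be produced by a legitimate run, whereas an odd timing or row count might still have an innocent explanation.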

BenchmarkSpec System

Each benchmark has a BenchmarkSpec frozen dataclass defining its expected characteristics:

from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkSpec:
    benchmark_id: str
    unique_query_ids: frozenset[str]
    min_unique_queries: int = 0
    min_success_rate: float = 1.0
    high_failure_expected: bool = False
    requires_tables_object: bool = True
    sf1_row_counts: dict[str, int] | None = None
    sf1_power_at_size_range: tuple[float, float] | None = None

Key fields

unique_query_ids: The complete set of expected query IDs for the benchmark
min_success_rate: Floor for success rate checks (e.g., 1.0 for TPC-H, 0.95 for TPC-DS)
high_failure_expected: When True, bypasses success rate floor (used by transaction_primitives)
requires_tables_object: When False, skips the tables object check (used by metadata_primitives)
sf1_row_counts: Expected table row counts at scale factor 1, used for believability checks
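
One way `min_success_rate` and `high_failure_expected` could interact is sketched below. The function name, the handling of an empty result, and the exact messages are assumptions; only the floor-and-bypass behavior and the WARN severity come from the descriptions above.

```python
def check_success_rate(
    passed: int,
    total: int,
    min_success_rate: float,
    high_failure_expected: bool = False,
) -> tuple[str, str]:
    """Return (status, message) for the success-rate floor."""
    if high_failure_expected:
        # The spec opts out of the floor entirely.
        return "PASS", "skipped: failures are expected for this benchmark"
    if total == 0:
        # Assumption: an empty result is suspicious rather than impossible.
        return "WARN", "no queries to compute a success rate from"
    rate = passed / total
    if rate >= min_success_rate:
        return "PASS", f"success rate {rate:.1%} meets floor {min_success_rate:.1%}"
    return "WARN", f"success rate {rate:.1%} is below floor {min_success_rate:.1%}"
```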

Coverage

Specs exist for 20 of 22 registered benchmarks: tpch, tpcds, tpchavoc, tpch_skew, ssb, clickbench, nyctaxi, h2odb, amplab, joinorder, flightdata, datavault, coffeeshop, tpcdi, tsbs_devops, tpcds_obt, read_primitives, write_primitives, metadata_primitives, transaction_primitives. ai_primitives and vector_search do not yet have integrity specs.

8 legacy aliases are mapped automatically (e.g., star_schema -> ssb, amplab_big_data -> amplab).
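
Alias resolution can be pictured as a dictionary lookup that falls back to the name itself. In this sketch, `LEGACY_ALIASES` holds only the two examples named above, `SPEC_IDS` is a small hypothetical subset of the spec table, and `resolve_spec_id` is an illustrative helper rather than a function the module is known to expose.

```python
from typing import Optional

# Only the two aliases mentioned in the text; the real mapping has 8 entries.
LEGACY_ALIASES = {"star_schema": "ssb", "amplab_big_data": "amplab"}

# Hypothetical subset of benchmark IDs that have a spec.
SPEC_IDS = {"ssb", "amplab", "tpch"}


def resolve_spec_id(name: str) -> Optional[str]:
    """Map a possibly legacy name to a spec ID, or None if no spec exists."""
    canonical = LEGACY_ALIASES.get(name, name)
    return canonical if canonical in SPEC_IDS else None
```

A `None` result corresponds to benchmarks without an integrity spec, such as `vector_search`.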

Adding a spec for a new benchmark

Run the benchmark at SF1 on DuckDB to produce a reference result file
Add a new entry to the BENCHMARK_SPECS dict in benchmark_specs.py:

BENCHMARK_SPECS["my_benchmark"] = BenchmarkSpec(
    benchmark_id="my_benchmark",
    unique_query_ids=frozenset(["q1", "q2", "q3"]),
    min_success_rate=1.0,
    requires_tables_object=True,
    sf1_row_counts={
        "table_a": 100_000,
        "table_b": 50_000,
    },
)

If the benchmark has legacy aliases, add them to LEGACY_ALIASES
Run the test suite: uv run -- python -m pytest tests/unit/core/results/test_integrity_validator.py

IntegrityReport

The validator returns an IntegrityReport dataclass:

@dataclass
class IntegrityReport:
    file: str                           # File path validated
    benchmark_id: str                   # Benchmark identifier
    platform: str                       # Platform name
    scale_factor: float                 # Scale factor
    overall_status: CheckStatus         # Worst status across all checks
    checks: list[CheckResult]           # Individual check results
    summary: dict[str, int]             # Counts by status (PASS, WARN, FAIL)

Helper methods:

report.passed() - Returns True if overall status is PASS
report.has_warnings() - Returns True if any check has WARN status

Each CheckResult contains:

@dataclass
class CheckResult:
    category: CheckCategory   # STRUCTURAL, COMPLETENESS, or BELIEVABILITY
    name: str                 # Check identifier (e.g., "query_count_math")
    status: CheckStatus       # PASS, WARN, or FAIL
    message: str              # Human-readable description
    details: dict | None      # Optional structured details
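
The "worst status across all checks" rule and the per-status counts can be sketched with an ordered enum. `Status` here is a stand-in for the real `CheckStatus`, whose definition is not shown in this document, and `summarize` is a hypothetical helper.

```python
from collections import Counter
from enum import IntEnum


class Status(IntEnum):
    """Stand-in for CheckStatus, ordered from best to worst."""
    PASS = 0
    WARN = 1
    FAIL = 2


def summarize(statuses: list[Status]) -> tuple[Status, dict[str, int]]:
    """Return (overall_status, counts_by_status) for a list of check statuses."""
    # The overall status is the worst one seen; an empty list counts as PASS.
    overall = max(statuses, default=Status.PASS)
    counts = Counter(s.name for s in statuses)
    summary = {s.name: counts.get(s.name, 0) for s in Status}
    return overall, summary
```

Ordering the enum by severity keeps the aggregation to a single `max` call, so one FAIL outweighs any number of PASS or WARN results.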

Convenience Functions

The module provides two convenience functions for common use cases:

from pathlib import Path

from benchbox.core.results.integrity_validator import validate_file, validate_directory

# Single file
report = validate_file(Path("results/tpch_duckdb_sf1.json"))

# Directory (returns list of reports)
reports = validate_directory(Path("results/"), pattern="*.json")

Both functions handle invalid JSON gracefully: they return a FAIL report instead of raising an exception.
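
The "return a report instead of raising" behavior can be approximated by catching read and parse errors at the loading step. This is a sketch of the pattern only: `load_result_or_failure` is a hypothetical helper, and how the real functions turn the error into a FAIL report is not shown here.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_result_or_failure(path: Path) -> tuple[Optional[dict[str, Any]], Optional[str]]:
    """Return (data, None) on success or (None, error_message) on failure."""
    try:
        with path.open(encoding="utf-8") as f:
            return json.load(f), None
    except (OSError, json.JSONDecodeError) as exc:
        # The caller converts this message into a FAIL result
        # rather than letting the exception propagate.
        return None, f"could not read result file: {exc}"
```

Keeping the error as data means a directory scan can continue past one corrupt file and still report on the rest.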

Access Points

Interface	Usage
CLI script	`uv run _project/scripts/validate_results.py <path> [options]`
MCP tool	`validate_results(result_file="...", verbose=True)`
Python API	`from benchbox.core.results.integrity_validator import validate_file`

See the CLI reference and MCP reference for usage details.

Special Cases

metadata_primitives: Has requires_tables_object=False and is excluded from load-time checks (no data loading phase)
transaction_primitives: Has high_failure_expected=True - bypasses the success rate floor since failures are an expected part of transaction testing
Non-SF1 results: Row count checks are skipped (emitted as PASS with “skipped” message)
Non-TPC benchmarks: TPC metric checks are skipped similarly
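
The skip behavior for non-TPC benchmarks might look like the sketch below, which always returns exactly one result for the `tpc_metrics` check. The `TPC_BENCHMARKS` set is inferred from the Tier 2 table, while the `tpc_metrics` result key, the function name, and the message wording are assumptions.

```python
from typing import Any

# Benchmarks for which TPC metrics are expected (from the Tier 2 table).
TPC_BENCHMARKS = frozenset({"tpch", "tpcds", "tpchavoc", "tpch_skew"})


def check_tpc_metrics(benchmark_id: str, result: dict[str, Any]) -> tuple[str, str, str]:
    """Return (check_name, status, message); never omit the check."""
    if benchmark_id not in TPC_BENCHMARKS:
        # Skipped checks still emit a PASS so the check count stays constant.
        return "tpc_metrics", "PASS", "skipped: not a TPC benchmark"
    if result.get("tpc_metrics"):  # key name is an assumption
        return "tpc_metrics", "PASS", "TPC metrics present"
    return "tpc_metrics", "WARN", "TPC metrics missing"
```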