# Phase 2 Results Operations Runbook

Phase 2 is the PR-based community submission flow for the BenchBox public results
corpus. The product boundary and launch rationale live in
`docs/development/benchbox-results-platform-strategy.md`.
This runbook documents the current operating model only: contributor PRs target
`published-results`, CI validates them, maintainers review them, merges trigger the
static explorer rebuild, and no hosted API is involved.
## 1. Submission Lifecycle

1. Contributor runs `benchbox run ...` and `benchbox submit --output ./submission`.
2. Contributor copies the bundle files plus `submission-manifest.json` into `results-data/bundles/`.
3. Contributor regenerates `results-data/corpus-inventory.json`.
4. Contributor opens a PR against `published-results`.
5. `Validate Submission` checks schema, hash integrity, timing sanity, and inventory drift.
6. Maintainer reviews, requests fixes if needed, and merges.
7. The documentation workflow rebuilds the explorer from `results-data/`.
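The steps above can be sketched as a local pre-flight check a contributor might run before opening the PR. The manifest shape used here (`{"files": {name: sha256hex}}`) is an illustrative assumption, not the real schema-v2 layout; the authoritative checks live in `scripts/validate_submission.py`.

```python
import hashlib
import json
from pathlib import Path


def preflight(bundle_dir: str) -> list[str]:
    """Return a list of problems; an empty list means the bundle looks submittable.

    Assumes a hypothetical manifest shape: {"files": {"name": "sha256hex"}}.
    """
    problems: list[str] = []
    root = Path(bundle_dir)
    manifest_path = root / "submission-manifest.json"
    if not manifest_path.is_file():
        return ["missing submission-manifest.json sidecar"]
    manifest = json.loads(manifest_path.read_text())
    for name, expected in manifest.get("files", {}).items():
        path = root / name
        if not path.is_file():
            problems.append(f"listed file missing: {name}")
            continue
        # Recompute the digest so post-packaging edits are caught before CI.
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != expected:
            problems.append(f"hash mismatch: {name}")
    return problems
```

Running this before step 4 catches the two most common CI failures (missing files and hash drift) without a round trip through the PR.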
## 2. Maintainer Review Checklist

- Accept only complete benchmark runs with plausible metadata and timings.
- Reject bundles that fail CI, omit required schema-v2 fields, or obviously misstate environment details.
- Reject partial cohorts that would mislead the compare view.
- Confirm the bundle path and filenames are coherent with the existing corpus naming.
- Close stale contributor PRs after 14 days without response, with a short thank-you note.

Suggested review reply when fixes are needed:

> Thanks for the submission. CI found issues we need fixed before merge. Please address the failing checks, rerun the local validation commands from `docs/contributing-results.md`, and push an update to this PR.
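The partial-cohort rule can be mechanized with a small helper. The cohort definition (which benchmark ids the compare view expects) is an assumed input here; this runbook does not specify the real compare-view contract.

```python
def review_cohort(expected: list[str], submitted: list[str]) -> dict:
    """Compare a submission against the benchmark ids a cohort requires.

    `expected` and the "reject" verdict string are illustrative assumptions,
    not names taken from the BenchBox codebase.
    """
    missing = sorted(set(expected) - set(submitted))
    extra = sorted(set(submitted) - set(expected))
    verdict = "accept" if not missing else "reject: partial cohort"
    return {"verdict": verdict, "missing": missing, "unexpected": extra}
```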
## 3. CI Failure Triage

Quote the exact validator messages from `scripts/validate_submission.py` when replying so the
guidance matches the code.

- `Unsupported schema version`: the submitter exported an old result shape; ask them to rerun with a current BenchBox build.
- `Hash mismatch`: the bundle changed after packaging; ask them to rerun `benchbox submit`.
- `All query timings are 0ms`, or negative durations: reject until the benchmark is rerun.
- `Unknown benchmark id` or `Unknown platform name`: verify whether this is a legitimate new surface before merging.
- Inventory drift: ask the contributor to run `uv run -- python scripts/generate_corpus_inventory.py --write` and recommit.
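As a sketch of the timing-sanity rejections above (the duration field shape is an assumption; the authoritative logic is in `scripts/validate_submission.py`):

```python
def timing_sanity(durations_ms: list[float]) -> list[str]:
    """Flag the timing patterns the triage list rejects outright.

    Takes a flat list of per-query durations in milliseconds; the real
    validator reads these out of the bundle's schema-v2 result files.
    """
    problems: list[str] = []
    # A run where every query reports 0ms almost certainly never executed.
    if durations_ms and all(d == 0 for d in durations_ms):
        problems.append("all query timings are 0ms")
    # Negative durations indicate clock or serialization bugs, not slow queries.
    if any(d < 0 for d in durations_ms):
        problems.append("negative durations present")
    return problems
```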
Never bypass red validation checks and merge anyway. If the validator is wrong, fix the validator in a separate PR first.
## 4. Backfilling `corpus-inventory.json`

When a PR updates bundles but forgets the inventory:

```bash
uv run -- python scripts/generate_corpus_inventory.py --write
git add results-data/corpus-inventory.json
git commit -m "chore: refresh corpus inventory"
```
If you are fixing the contributor branch yourself, explain that in the PR before pushing.
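Conceptually, the drift check compares the committed inventory against a fresh generation. A minimal sketch, assuming a JSON-comparable inventory shape; the real check is `generate_corpus_inventory.py --check`:

```python
import json
from pathlib import Path


def inventory_drift(committed_path: str, regenerated: dict) -> bool:
    """True when the committed inventory no longer matches a fresh generation.

    `regenerated` stands in for whatever the generator would emit; the
    dict-equality comparison is an assumption about the inventory format.
    """
    committed = json.loads(Path(committed_path).read_text())
    return committed != regenerated
```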
## 5. Rolling Back a Bad Merge

Use a fresh branch off the affected target branch. Set `REMOTE` to the repo you are
operating against.

```bash
REMOTE=public
git fetch "$REMOTE" published-results
git switch -c rollback-results "$REMOTE/published-results"
# Most recent merge on the branch; confirm it is the merge you intend to revert.
MERGE_SHA="$(git log --merges --format=%H -n 1)"
echo "$MERGE_SHA"
git revert -m 1 "$MERGE_SHA"
git push "$REMOTE" HEAD:published-results
```
Then comment on the reverted PR explaining whether the bundle was broken or merely misleading, and whether a corrected resubmission is welcome.
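Before pushing, it is worth confirming the revert touches only the submission being rolled back. A hedged helper over `git diff --name-only HEAD^..HEAD` output; the allowance for the inventory file is an assumption about what a normal rollback should also change:

```python
def out_of_scope(changed_files: list[str], submission_prefix: str) -> list[str]:
    """Files touched by the revert that fall outside the submission's bundle path.

    The inventory is expected to change alongside any bundle removal, so it
    is allowed; anything else outside the prefix deserves a second look.
    """
    allowed = {"results-data/corpus-inventory.json"}
    return [
        f for f in changed_files
        if not f.startswith(submission_prefix) and f not in allowed
    ]
```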
## 6. Re-triggering an Explorer Rebuild

If the corpus is correct but the explorer build needs to rerun:

```bash
gh workflow run docs.yml --repo joeharris76/BenchBox
gh run watch --repo joeharris76/BenchBox
```
Use workflow_dispatch only after confirming there is no newer push already rebuilding the site.
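One way to make the "no newer push already rebuilding" check mechanical is to inspect recent run records, for example from `gh run list --workflow docs.yml --json status`; the field and status names below follow that JSON loosely and should be verified against actual `gh` output:

```python
def dispatch_is_redundant(runs: list[dict]) -> bool:
    """True when a docs build is already queued or running.

    `runs` mimics `gh run list --json status` records; the exact status
    vocabulary is an assumption about the GitHub CLI output.
    """
    return any(r.get("status") in {"queued", "in_progress"} for r in runs)
```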
## 7. Data Locations

- Target branch: `published-results`
- Corpus root: `results-data/`
- Bundles: `results-data/bundles/`
- Inventory: `results-data/corpus-inventory.json`
- Community sidecar: `submission-manifest.json`
- Generated explorer read model: `results-explorer/public/data/`
The explorer pipeline treats sidecar presence as the trust-label contract for community submissions.
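That contract can be illustrated as a pure function over a bundle's file listing; the label strings are hypothetical, since this runbook does not define the explorer's actual trust labels:

```python
def trust_label(bundle_files: set[str]) -> str:
    """Derive a provenance label from sidecar presence.

    "community" / "first-party" are illustrative names; the real labels
    live in the explorer pipeline's read-model generation.
    """
    return "community" if "submission-manifest.json" in bundle_files else "first-party"
```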
## 8. Code Locations

| Surface | Path |
|---|---|
| Submit CLI | |
| Inventory generator | `scripts/generate_corpus_inventory.py` |
| Submission validator | `scripts/validate_submission.py` |
| Validation workflow | |
| Explorer pipeline | |
| Contributor guide | `docs/contributing-results.md` |
## 9. Verification Commands

Run these locally before concluding the platform is healthy:

```bash
uv run -- python scripts/validate_submission.py results-data/bundles/
uv run -- python scripts/generate_corpus_inventory.py --check
uv run -- python results-data/validate_corpus.py
uv run -- python -m pytest tests/unit/scripts/test_validate_submission.py tests/unit/scripts/test_generate_corpus_inventory.py -q
cd results-explorer && npm run typecheck && npm run build
```
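If you want a single pass/fail summary, a small wrapper can run the commands above and collect exit codes. This is a convenience sketch, not a script that exists in the repo; note that the last command above changes directory, so it would need `cwd` handling not shown here.

```python
import subprocess


def run_checks(commands: list[list[str]]) -> dict[str, bool]:
    """Run each verification command and report pass/fail by exit code.

    Commands are argv lists (no shell features); output is captured so a
    failing check does not drown the summary.
    """
    results: dict[str, bool] = {}
    for cmd in commands:
        proc = subprocess.run(cmd, capture_output=True)
        results[" ".join(cmd)] = proc.returncode == 0
    return results
```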
## 10. Escalation

Escalate when any of the following are true:

- The validator or inventory generator appears wrong rather than the submission.
- The docs workflow rebuild fails after a clean merge.
- A trust-label or visibility bug would publish misleading provenance.
- A rollback would remove more than the intended submission.
- The failure depends on infrastructure or GitHub permissions rather than repo code.
When escalating, link the relevant PR, the failing workflow run, and the exact file or validator message that triggered the escalation.