BenchBox Results Platform - Operational Runbook
Created: 2026-04-01
Phase scope: Phase 3 (Hosted API at api.benchbox.dev)
Prerequisite reading: docs/reference/threat-model.md
Phase 1 and Phase 2 have no hosted services. This runbook applies exclusively to Phase 3. Until Phase 3 launches, file GitHub issues for any anomalies found in the static explorer or PR-based submission pipeline.
w4 - Auth Model
Token Type: API Keys
Phase 3 uses API keys, not OAuth, for initial launch.
Rationale: BenchBox submissions originate from the CLI (benchbox submit),
not a browser. OAuth requires a redirect flow that has no natural place in a
terminal workflow. API keys are simpler to implement, sufficient for the
use-case volume anticipated at Phase 3 launch, and straightforward to revoke.
OAuth can be added later if organization/team features require browser-based
identity.
Token Scopes
| Scope | Grants |
|---|---|
| | POST to |
| | Withdraw or update metadata on own results; cannot affect other actors’ results |
Admin operations (trust promotion, forced withdrawal, user management) require
a separate admin token issued only to maintainers. Admin tokens are not
distributable via benchbox setup.
Session Lifetime
API keys do not expire by default. However, a maximum key lifetime of 12 months
is recommended. After 12 months, the API returns 401 Unauthorized with
{"error": "token_expired", "message": "API key expired; run benchbox setup --service to provision a new token"}.
A renewal prompt is shown in the CLI 30 days before expiry. This directly
mitigates the Med-High spoofing/impersonation risk identified in the threat
model by limiting the window of exposure for compromised keys.
Admins can revoke any token at any time via
the admin CLI (benchbox admin revoke-token --actor-id X). Revocation takes
effect immediately - the token is removed from the validation store.
Actors are responsible for rotating their own tokens. The provisioning flow makes it easy to generate a replacement token without losing submission history.
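The lifetime rules above can be sketched as a small status check. This is only an illustration: the function name, the status strings, and the choice to approximate "12 months" as 365 days are assumptions, not part of the BenchBox API.

```python
from datetime import datetime, timedelta, timezone

# Values from the text above; names and the 365-day approximation are assumptions.
MAX_KEY_LIFETIME = timedelta(days=365)  # "12 months"
RENEWAL_WARNING = timedelta(days=30)    # CLI renewal prompt window


def key_status(issued_at: datetime, now: datetime) -> str:
    """Classify a key as 'active', 'renew_soon', or 'expired'."""
    expires_at = issued_at + MAX_KEY_LIFETIME
    if now >= expires_at:
        return "expired"      # the API would answer 401 with token_expired
    if now >= expires_at - RENEWAL_WARNING:
        return "renew_soon"   # the CLI would show the renewal prompt
    return "active"


issued = datetime(2026, 4, 1, tzinfo=timezone.utc)
status_early = key_status(issued, issued + timedelta(days=100))
status_late = key_status(issued, issued + timedelta(days=340))
status_after = key_status(issued, issued + timedelta(days=365))
print(status_early, status_late, status_after)
```

Keeping the expiry and warning windows as named constants makes it easy to adjust them if the recommended lifetime changes.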
First-Time Provisioning
benchbox setup --service
Prompts: “Paste your BenchBox API token:” (token obtained from benchbox.dev/account)
Validates the token against the API (single lightweight GET /me call)
Stores the token under ~/.benchbox/credentials.yaml:
    service:
      token: bbx_<token>
      endpoint: https://api.benchbox.dev
Prints: “Token stored. Ready to submit with benchbox submit.”
The credentials file must be chmod 600. benchbox setup enforces this on write
and warns if the file is world-readable on subsequent runs.
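One way to get the permission behavior described above is sketched below, assuming a POSIX filesystem. The helper names are hypothetical and this is not the actual benchbox setup implementation; the demo writes to a temporary directory rather than the real ~/.benchbox path.

```python
import os
import stat
import tempfile
from pathlib import Path


def write_credentials(path: Path, content: str) -> None:
    """Create or overwrite the file with owner-only permissions (0600)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Passing the mode to os.open avoids a window where a new file is readable by others.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(path, 0o600)  # also tightens a file that already existed


def is_world_readable(path: Path) -> bool:
    """True if the 'other' read bit is set on the file."""
    return bool(path.stat().st_mode & stat.S_IROTH)


# Demo in a temporary directory (placeholder token, not a real key).
demo_path = Path(tempfile.mkdtemp()) / ".benchbox" / "credentials.yaml"
write_credentials(
    demo_path,
    "service:\n  token: bbx_<token>\n  endpoint: https://api.benchbox.dev\n",
)
mode_after_write = stat.S_IMODE(demo_path.stat().st_mode)
print(oct(mode_after_write), is_world_readable(demo_path))
```

The explicit os.chmod call matters when the file already exists, because the mode argument to os.open only applies on creation.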
w5 - Rate Limiting and Quota Model
Per-Actor Limits
| Limit | Value | Scope |
|---|---|---|
| Burst rate | 10 submissions | Any rolling 60-second window |
| Daily cap | 50 submissions | Per actor, per calendar day (UTC) |
| Storage quota | 500 MB total | Raw bundles; cumulative across all submissions |
| Bundle size cap | 50 MB | Per single submission |
Grace Period for New Actors
Actors in their first 7 days after token issuance receive:
3× burst budget (30 submissions per 60-second window)
Elevated storage quota of 2 GB (instead of the standard 500 MB; facilitates initial corpus seeding while preventing unbounded abuse)
Same daily cap (50/day) - the grace period is about burst flexibility, not unlimited daily volume
After 7 days the standard limits apply automatically with no action required from the actor.
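The switch between grace-period and standard limits can be sketched as a lookup on token age. The class and function names are hypothetical, and treating 2 GB as 2000 MB is an assumption; the numbers themselves come from the limits described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(days=7)


@dataclass(frozen=True)
class Limits:
    burst_per_60s: int
    daily_cap: int
    storage_quota_mb: int
    bundle_cap_mb: int


STANDARD = Limits(burst_per_60s=10, daily_cap=50, storage_quota_mb=500, bundle_cap_mb=50)
# 3x burst and a 2 GB quota (assumed to mean 2000 MB); the daily cap is unchanged.
GRACE = Limits(burst_per_60s=30, daily_cap=50, storage_quota_mb=2000, bundle_cap_mb=50)


def effective_limits(token_issued_at: datetime, now: datetime) -> Limits:
    """Return grace limits during the first 7 days after issuance, else standard."""
    return GRACE if now - token_issued_at < GRACE_PERIOD else STANDARD


issued_at = datetime(2026, 4, 1, tzinfo=timezone.utc)
print(effective_limits(issued_at, issued_at + timedelta(days=3)))
print(effective_limits(issued_at, issued_at + timedelta(days=7)))
```

Because the limits are derived from the issuance timestamp on every request, no scheduled job is needed to end the grace period.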
Grace-period abuse prevention: Auto-revoke any API key provisioned from an IP address that has spawned ≥5 new accounts in 24 hours or ≥20 new accounts in 7 days. This check is enforced at token provisioning time, not retroactively.
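The provisioning-time rule above amounts to two counts over recent account creations from one IP address. A minimal sketch, assuming the caller supplies the creation timestamps for that IP (including the account being provisioned); the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def should_auto_revoke(provision_times, now):
    """True if the IP created >=5 accounts in 24 hours or >=20 accounts in 7 days."""
    in_24h = sum(1 for t in provision_times
                 if timedelta(0) <= now - t < timedelta(hours=24))
    in_7d = sum(1 for t in provision_times
                if timedelta(0) <= now - t < timedelta(days=7))
    return in_24h >= 5 or in_7d >= 20


now = datetime(2026, 4, 8, tzinfo=timezone.utc)
burst_ip = [now - timedelta(hours=h) for h in (1, 2, 3, 4, 5)]
slow_ip = [now - timedelta(hours=8 * i + 1) for i in range(20)]
quiet_ip = [now - timedelta(hours=30 * i + 1) for i in range(4)]
print(should_auto_revoke(burst_ip, now),
      should_auto_revoke(slow_ip, now),
      should_auto_revoke(quiet_ip, now))
```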
Limit Breach Responses
| Limit Hit | HTTP Status | Response Body |
|---|---|---|
| Burst exceeded | 429 Too Many Requests | |
| Daily cap hit | 429 Too Many Requests | |
| Storage quota exceeded | 413 Content Too Large | |
| Bundle size cap exceeded | 413 Content Too Large | |
The API does not queue submissions when limits are hit. Submissions are rejected immediately so actors can see the error and retry at the correct time.
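The reject-immediately behavior can be sketched as a single check that returns an HTTP status per attempt. This is an in-memory illustration only: the class name, the order in which limits are checked, and the use of 202 for an accepted submission are all assumptions, and a real service would persist this state.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


class ActorUsage:
    """Hypothetical in-memory usage record for one actor."""

    def __init__(self):
        self.recent = deque()  # accepted timestamps inside the rolling 60 s window
        self.daily = {}        # UTC date -> submissions accepted that day
        self.stored_mb = 0.0

    def check_and_record(self, now, bundle_mb, burst=10, daily_cap=50,
                         quota_mb=500, bundle_cap_mb=50):
        """Return 202 if accepted (assumed status), else 413 or 429; never queues."""
        if bundle_mb > bundle_cap_mb:
            return 413  # bundle size cap exceeded
        if self.stored_mb + bundle_mb > quota_mb:
            return 413  # storage quota exceeded
        while self.recent and now - self.recent[0] >= timedelta(seconds=60):
            self.recent.popleft()  # drop entries that left the rolling window
        if len(self.recent) >= burst:
            return 429  # burst exceeded
        day = now.astimezone(timezone.utc).date()
        if self.daily.get(day, 0) >= daily_cap:
            return 429  # daily cap hit
        self.recent.append(now)
        self.daily[day] = self.daily.get(day, 0) + 1
        self.stored_mb += bundle_mb
        return 202


usage = ActorUsage()
t0 = datetime(2026, 4, 1, 12, 0, tzinfo=timezone.utc)
statuses = [usage.check_and_record(t0 + timedelta(seconds=i), 1) for i in range(11)]
print(statuses)
```

Rejected attempts are not recorded, so a client that backs off until the window clears is not penalized for its earlier failures.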
w6 - Moderation Workflow
Trust Tiers
| Trust Label | Meaning | How Assigned |
|---|---|---|
| | Actor-submitted; not independently verified | Automatic on successful ingest |
| | Reviewed and approved by a maintainer | Manual promotion via admin CLI |
| | Failed validation or maintainer review | Set by ingest pipeline or admin |
| | Removed by actor or admin | Set via withdrawal API or admin CLI |
Trust Promotion Path (self-reported → curated)
Maintainer identifies a candidate submission (manual discovery or triggered by benchbox admin review --submission-id X).
Runs the validation suite:
Schema conformance: bundle matches schema-v2 specification
Bundle integrity: server-stored hash matches bundle content
Cohort compatibility: benchmark, scale factor, and execution mode are consistent with existing cohort members
Sanity checks: no implausible timings (e.g., sub-millisecond TPC-DS), valid platform identifier, valid query count
If all checks pass, maintainer approves:
benchbox admin promote --submission-id X --trust public-curated
This updates the trust label in the metadata DB and triggers a read model rebuild so the result appears in the curated index.
If any check fails, maintainer rejects:
benchbox admin reject --submission-id X --reason schema_invalid|cohort_mismatch|sanity_fail|manual_review
Status is set to rejected with the reason code. The actor receives an email notification (if contact on file). The bundle is retained for 30 days, then purged by lifecycle policy.
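Two of the validation checks above, bundle integrity and the timing sanity check, can be sketched in a few lines. The function names, the check names in the returned list, and the 1 ms plausibility floor are assumptions for illustration; schema and cohort checks are omitted because they depend on specifications not shown here.

```python
import hashlib


def bundle_hash(content: bytes) -> str:
    """Hash in the 'sha256:<hex>' form used by the audit log example."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


def failed_checks(content, stored_hash, query_timings_ms, expected_query_count,
                  min_plausible_ms=1.0):
    """Return the names of failing checks; an empty list means all passed."""
    failures = []
    if bundle_hash(content) != stored_hash:
        failures.append("bundle_integrity")
    if len(query_timings_ms) != expected_query_count:
        failures.append("query_count")
    if any(t < min_plausible_ms for t in query_timings_ms):
        failures.append("implausible_timing")  # e.g. a sub-millisecond query
    return failures


content = b"demo bundle"
stored = bundle_hash(content)
clean = failed_checks(content, stored, [120.5, 340.0], 2)
dirty = failed_checks(b"tampered", stored, [0.2, 340.0], 3)
print(clean, dirty)
```

Returning every failing check, rather than stopping at the first, gives the maintainer the full picture before choosing a rejection reason.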
Takedown Process
Actor-initiated withdrawal (own results only):
benchbox result withdraw --submission-id X
Sets status to withdrawn immediately
Removes the result from the public index within one read model rebuild cycle (target: < 15 minutes)
Bundle is retained for 90 days in case the actor wants to resubmit after correcting an error
Admin-forced withdrawal (any result; triggered by abuse, false data, or legal request):
benchbox admin withdraw --submission-id X --reason abuse|false_data|legal|policy_violation
Sets status to withdrawn with force-withdrawal flag and reason code
Triggers immediate removal from public index (synchronous rebuild)
Actor is notified by email within 24 hours with the reason code
Bundle is retained for 180 days for audit purposes, then purged
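The retention periods stated in this runbook (30 days after rejection, 90 days after actor withdrawal, 180 days after admin-forced withdrawal) can be sketched as a purge-date lookup. The dictionary keys and function name are hypothetical labels, not identifiers from the platform.

```python
from datetime import date, timedelta

# Retention periods from the runbook; the keys are hypothetical labels.
RETENTION_DAYS = {
    "rejected": 30,
    "actor_withdrawn": 90,
    "admin_withdrawn": 180,
}


def purge_date(event_date: date, kind: str) -> date:
    """Earliest date on which the lifecycle policy may delete the bundle."""
    return event_date + timedelta(days=RETENTION_DAYS[kind])


event_day = date(2026, 4, 1)
purge_dates = {kind: purge_date(event_day, kind) for kind in RETENTION_DAYS}
for kind, when in purge_dates.items():
    print(kind, when.isoformat())
```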
Audit Log Schema
Every state-changing event is appended to the audit log. The log is append-only; no entries may be modified or deleted. Stored outside the metadata DB (write-once object store or managed audit logging service).
{
"event_id": "evt_01J...",
"timestamp_utc": "2026-04-01T12:34:56.789Z",
"actor_id": "actor_abc123",
"action": "submitted",
"target_submission_id": "sub_xyz789",
"target_bundle_hash": "sha256:deadbeef...",
"reason_code": null,
"metadata": {}
}
Valid action values:
| Action | Triggered By |
|---|---|
| | Actor via submission API |
| | Ingest pipeline (automated) |
| | Ingest pipeline (result becomes public-self-reported) |
| | Ingest pipeline or admin |
| | Actor or admin |
| | Admin |
| | Admin |
| | Admin (sensitive field removed; bundle hash invalidated) |
| | Lifecycle policy or admin (bundle deleted from object store) |
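A record matching the schema above could be built and serialized as follows. This is a sketch: the helper names are hypothetical, the only action value used is "submitted" (taken from the example record), and writing to the actual write-once store is out of scope.

```python
import json
from datetime import datetime, timezone


def make_audit_event(event_id, actor_id, action, submission_id, bundle_hash,
                     when, reason_code=None, metadata=None):
    """Build one audit record with the fields shown in the schema above."""
    stamp = when.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    return {
        "event_id": event_id,
        "timestamp_utc": stamp.replace("+00:00", "Z"),
        "actor_id": actor_id,
        "action": action,
        "target_submission_id": submission_id,
        "target_bundle_hash": bundle_hash,
        "reason_code": reason_code,
        "metadata": metadata if metadata is not None else {},
    }


def to_log_line(event):
    """Serialize to a single JSON line, suitable for an append-only sink."""
    return json.dumps(event, sort_keys=True)


event = make_audit_event(
    "evt_demo", "actor_abc123", "submitted", "sub_xyz789",
    "sha256:" + "0" * 64,
    datetime(2026, 4, 1, 12, 34, 56, 789000, tzinfo=timezone.utc),
)
print(to_log_line(event))
```

One JSON object per line keeps the log easy to append to and to scan during an incident without parsing the whole file.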
w7 - Platform SLOs
| Metric | Target | Measurement Method | Alert Threshold |
|---|---|---|---|
| Submission API ACK latency p99 | < 5 seconds | Measured at API gateway (synthetic probe every 60s) | Page if p99 > 10s for any 5-minute window |
| Explorer page load p95 | < 500 ms | GitHub Pages / CDN; real-user monitoring via Performance API | Page if p95 > 1000ms for any 5-minute window |
| Explorer DuckDB-WASM init time | < 3 seconds (p95) | Real-user monitoring; sampled 1% of page loads | Page if p95 > 6s for any 5-minute window |
| CI build + deploy end-to-end | < 10 minutes | GitHub Actions job duration | Page if any build exceeds 20 minutes |
| Backup RPO | < 24 hours | Age of last successful metadata DB and object store backup | Page if last backup age > 24 hours |
| Backup RTO | < 4 hours | Restore drill results (run quarterly) | Flag if drill exceeds 4 hours; schedule remediation |
Alert routing:
All page-level alerts route to the on-call maintainer via PagerDuty or equivalent.
Backup RPO breach alerts go to all maintainers immediately (not only the on-call maintainer).
SLO breaches that persist > 30 minutes trigger a P1 incident (Class A or B, depending on nature).
Restore drills: Run quarterly against a staging environment. Document
restore duration in the audit log under metadata: {drill: true}. Drill
failures trigger a remediation ticket within one week.
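The percentile-based alert thresholds in the SLO table can be sketched with a nearest-rank percentile over the samples from one 5-minute window. The function names are hypothetical, and nearest-rank is only one of several percentile definitions; the monitoring system in use may compute percentiles differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def should_page(window_samples_s, pct=99, threshold_s=10.0):
    """True if the window's percentile exceeds the alert threshold."""
    return percentile(window_samples_s, pct) > threshold_s


healthy_window = [1.0] * 99 + [12.0]       # one slow request out of 100
degraded_window = [1.0] * 98 + [12.0] * 2  # two slow requests out of 100
print(should_page(healthy_window), should_page(degraded_window))
```

With 100 samples, a single outlier does not move the p99 under this definition, which is why the first window does not page and the second does.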
w8 - Incident Response Runbook
Incident Severity Levels
| Severity | Meaning | Response Target |
|---|---|---|
| P1 (Critical) | Data breach or service fully unavailable | Acknowledge < 15 minutes; mitigate < 1 hour |
| P2 (High) | Result integrity compromised; partial outage | Acknowledge < 1 hour; mitigate < 4 hours |
| P3 (Medium) | Abuse campaign; elevated error rate; SLO at risk | Acknowledge < 4 hours; mitigate < 24 hours |
| P4 (Low) | Degraded performance within SLO; cosmetic issue | Acknowledge < 24 hours; schedule fix |
Class A: Result Integrity Incident
Examples: Tampered bundle reaches the public index; hash verification fails post-publish; incorrect query timings attributed to wrong platform.
Severity: P2 (High)
Detection signals:
Automated bundle-hash re-verification job fails for a published result
Community report via GitHub issue or email to security@benchbox.dev
Ingest pipeline audit log shows unexpected validated → published transition for a result with a known-bad hash
Response steps:
Withdraw affected result(s) immediately (target: < 1 hour from detection)
benchbox admin withdraw --submission-id X --reason false_data
Confirm result is removed from public index and explorer rebuild is triggered.
Re-verify all results published in the same time window (± 2 hours around the affected result’s publish timestamp). Use:
benchbox admin verify-window --from <ts> --to <ts>
Any result that fails hash verification is withdrawn pending investigation.
Identify root cause. Three categories:
Storage tampering: compare bundle in object store against original ingest hash in audit log
Pipeline bug: compare ingest pipeline version at publish time against current version; check for known regressions
Bad bundle submitted by actor: verify actor’s submitted hash matches the stored bundle; if mismatch, the ingest server failed to reject
Notify affected submitters within 24 hours. Email the actor whose result was affected. Include: which result, what was found, what action was taken.
Post public notice if ≥10 results were affected OR if trust was materially misrepresented (e.g., a public-curated result was found to be tampered). Post to benchbox.dev/blog or GitHub Discussions. Include: what happened, which results were affected (by submission_id), and what was done.
File post-incident report within 72 hours covering: timeline, root cause, affected results, remediation, and prevention measures.
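Two decisions in the Class A steps above are mechanical enough to sketch: the ± 2 hour re-verification window and the public-notice rule. The function names are hypothetical, and reducing "trust was materially misrepresented" to a single boolean is a simplification; that judgment stays with the responding maintainer.

```python
from datetime import datetime, timedelta, timezone


def verification_window(published_at, margin=timedelta(hours=2)):
    """Return the (from, to) timestamps to re-verify around an affected result."""
    return published_at - margin, published_at + margin


def needs_public_notice(affected_count, trust_misrepresented):
    """True if >=10 results were affected or trust was materially misrepresented."""
    return affected_count >= 10 or trust_misrepresented


published = datetime(2026, 4, 1, 12, 0, tzinfo=timezone.utc)
window_from, window_to = verification_window(published)
print(window_from.isoformat(), window_to.isoformat())
print(needs_public_notice(3, False), needs_public_notice(3, True))
```

The two timestamps are the kind of values that would be supplied as the --from and --to arguments of the verify-window command.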
Class B: Data Breach
Examples: Private bundle or actor contact records exposed; API key database leaked; misconfigured bucket ACL made private bundles world-readable.
Severity: P1 (Critical)
Detection signals:
Anomalous access patterns in storage or metadata DB access logs
External security researcher report to security@benchbox.dev
Automated ACL drift alert (bucket policy changed unexpectedly)
Unexpected download spike for private-prefix objects
Response steps:
Triage: confirm breach with a second independent signal before mass revocation. Correlate the initial detection signal with at least one independent source (e.g., cross-reference anomalous access patterns with audit logs, verify the report with a second team member, or confirm via a different monitoring channel). This step must complete within 30 minutes of detection. If the second signal confirms the breach, proceed to step 2. If the breach cannot be confirmed but also cannot be ruled out, proceed anyway and treat the incident as real until proven otherwise.
Revoke all API keys and rotate service credentials (target: < 1 hour from detection). This is a broad action - err on the side of revoking too many rather than too few.
benchbox admin revoke-all-tokens --reason security_incident
Rotate object store credentials, metadata DB credentials, and any service account keys. Re-deploy services with new credentials before re-enabling access.
Rollback for false positives: If post-triage investigation determines the incident was a false positive, restore actor access promptly: re-enable revoked API keys (or direct affected actors to re-provision via benchbox setup --service), post a brief explanation to affected actors, and file an internal report documenting the false positive to improve detection accuracy.
Audit access logs to determine exposure scope. For each access log entry in the affected time window, identify: which resources were accessed, by which IP or credential, and whether the access was authorized. Document the scope in the incident report.
Notify affected actors within 72 hours. Legal requirement in most jurisdictions. Notification must include: what data was exposed, approximate time window, what BenchBox has done, and what actors should do (e.g., consider rotating passwords if email was exposed). Coordinate with legal counsel if actor volume is large or if regulated data (PII) was involved.
File incident report with full timeline and remediation steps. Report must be retained for a minimum of 2 years.
Re-enable actor access only after root cause is fixed and a security review confirms the breach vector is closed. Do not rush this step. Actors whose keys were revoked are notified and directed to re-provision via
benchbox setup --service.
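The access-log audit in the Class B steps above can be sketched as a filter and group-by over log entries. The entry fields (timestamp, credential, resource) and the function name are hypothetical; real storage and database access logs will have their own formats and need parsing first.

```python
from collections import defaultdict
from datetime import datetime, timezone


def exposure_scope(entries, start, end, authorized_credentials):
    """Map each unauthorized credential to the resources it touched in [start, end]."""
    scope = defaultdict(set)
    for entry in entries:
        in_window = start <= entry["timestamp"] <= end
        if in_window and entry["credential"] not in authorized_credentials:
            scope[entry["credential"]].add(entry["resource"])
    return {cred: sorted(resources) for cred, resources in scope.items()}


def ts(hour):
    return datetime(2026, 4, 1, hour, 0, tzinfo=timezone.utc)


log_entries = [
    {"timestamp": ts(9), "credential": "svc-ingest", "resource": "private/a.tar"},
    {"timestamp": ts(10), "credential": "unknown-key", "resource": "private/a.tar"},
    {"timestamp": ts(11), "credential": "unknown-key", "resource": "private/b.tar"},
    {"timestamp": ts(20), "credential": "unknown-key", "resource": "private/c.tar"},
]
scope = exposure_scope(log_entries, ts(8), ts(12), {"svc-ingest"})
print(scope)
```

The resulting mapping is a starting point for the scope section of the incident report, not a substitute for reviewing the raw logs.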
Class C: Abuse / Spam Campaign
Examples: Actor flooding platform with fake or duplicated benchmark results; coordinated campaign to manipulate the public index; automated scraping that triggers rate limit exhaustion for legitimate actors.
Severity: P3 (Medium)
Detection signals:
Rate limit alert: single actor triggering burst limit repeatedly over hours
Moderation queue spike: large volume of new submissions from one or few actors
Community report: users notice implausible results or identical results submitted under different names
Storage quota approaching exhaustion faster than expected
Response steps:
Block the offending actor (target: < 30 minutes from detection).
benchbox admin block-actor --actor-id X --reason abuse
This revokes the actor’s token and adds them to the deny list. Future token requests from the same email or IP range are rejected.
Withdraw all results from the offending actor. Results submitted during an active abuse campaign cannot be considered trustworthy regardless of technical validity.
benchbox admin withdraw-all --actor-id X --reason abuse
A read model rebuild is triggered automatically.
Review and tighten rate limits or eligibility rules if the campaign exploited a gap. Examples: lower burst limit, add CAPTCHA to token provisioning, require email verification before first submission.
No public notice required unless results from the offending actor reached the public index (i.e., were promoted to
public-curated) before detection. If so, post a brief public notice following the Class A process: what happened, which results were affected (by submission_id), and what was done to remove them.
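One of the detection signals above, identical results submitted under different names, can be approximated by grouping submissions by bundle hash. This is a sketch under the assumption that submission records expose actor_id and bundle_hash fields as in the audit log example; the function name is hypothetical, and matching hashes only catch byte-identical bundles.

```python
from collections import defaultdict


def duplicate_bundles(submissions):
    """Return bundle hashes that were submitted by more than one actor."""
    actors_by_hash = defaultdict(set)
    for sub in submissions:
        actors_by_hash[sub["bundle_hash"]].add(sub["actor_id"])
    return {h: sorted(actors) for h, actors in actors_by_hash.items()
            if len(actors) > 1}


submissions = [
    {"actor_id": "actor_a", "bundle_hash": "sha256:aaa"},
    {"actor_id": "actor_b", "bundle_hash": "sha256:aaa"},
    {"actor_id": "actor_a", "bundle_hash": "sha256:bbb"},
    {"actor_id": "actor_a", "bundle_hash": "sha256:bbb"},
]
suspicious = duplicate_bundles(submissions)
print(suspicious)
```

Resubmissions by the same actor are deliberately not flagged here, since those are already bounded by the per-actor rate limits.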
Contact Directory (On-Call)
Keep this current. Store the authoritative version in the private ops runbook, not in this document.
| Role | Responsibility |
|---|---|
| On-call maintainer | First responder for all incident classes |
| Secondary maintainer | Backup if primary is unavailable |
| Legal counsel | Required for Class B if PII is exposed |
| security@benchbox.dev | Public intake for external reports; acknowledge within 72 hours, triage within 7 days |