BenchBox Results Platform - Operational Runbook¶

Created: 2026-04-01 Phase scope: Phase 3 (Hosted API at api.benchbox.dev) Prerequisite reading: docs/reference/threat-model.md

Phase 1 and Phase 2 have no hosted services. This runbook applies exclusively to Phase 3. Until Phase 3 launches, file GitHub issues for any anomalies found in the static explorer or PR-based submission pipeline.

w4 - Auth Model¶

Token Type: API Keys¶

Phase 3 uses API keys, not OAuth, for initial launch.

Rationale: BenchBox submissions originate from the CLI (benchbox submit), not a browser. OAuth requires a redirect flow that has no natural place in a terminal workflow. API keys are simpler to implement, sufficient for the use-case volume anticipated at Phase 3 launch, and straightforward to revoke. OAuth can be added later if organization/team features require browser-based identity.

Token Scopes¶

Scope	Grants
`submit:write`	POST to `/submissions`; upload bundles; view own submission status
`result:manage`	Withdraw or update metadata on own results; cannot affect other actors’ results

Admin operations (trust promotion, forced withdrawal, user management) require a separate admin token issued only to maintainers. Admin tokens are not distributable via benchbox setup.

Session Lifetime¶

API keys do not expire by default. However, a maximum key lifetime of 12 months is recommended. After 12 months, the API returns 401 Unauthorized with {"error": "token_expired", "message": "API key expired; run benchbox setup --service to provision a new token"}. A renewal prompt is shown in the CLI 30 days before expiry. This directly mitigates the Med-High spoofing/impersonation risk identified in the threat model by limiting the window of exposure for compromised keys.

Admins can revoke any token at any time via the admin CLI (benchbox admin revoke-token --actor-id X). Revocation takes effect immediately - the token is removed from the validation store.

Actors are responsible for rotating their own tokens. The provisioning flow makes it easy to generate a replacement token without losing submission history.

First-Time Provisioning¶

benchbox setup --service

Prompts: “Paste your BenchBox API token:” (token obtained from benchbox.dev/account)
Validates the token against the API (single lightweight GET /me call)

Stores the token under ~/.benchbox/credentials.yaml:

service:
  token: bbx_<token>
  endpoint: https://api.benchbox.dev

Prints: “Token stored. Ready to submit with benchbox submit.”

The credentials file must be chmod 600. benchbox setup enforces this on write and warns if the file is world-readable on subsequent runs.

w5 - Rate Limiting and Quota Model¶

Per-Actor Limits¶

Limit	Value	Scope
Burst rate	10 submissions	Any rolling 60-second window
Daily cap	50 submissions	Per actor, per calendar day (UTC)
Storage quota	500 MB total	Raw bundles; cumulative across all submissions
Bundle size cap	50 MB	Per single submission

Grace Period for New Actors¶

Actors in their first 7 days after token issuance receive:

3× burst budget (30 submissions per 60-second window)
Elevated storage quota of 2 GB (instead of the standard 500 MB; facilitates initial corpus seeding while preventing unbounded abuse)
Same daily cap (50/day) - the grace period is about burst flexibility, not unlimited daily volume

After 7 days the standard limits apply automatically with no action required from the actor.

Grace-period abuse prevention: Auto-revoke any API key provisioned from an IP address that has spawned ≥5 new accounts in 24 hours or ≥20 new accounts in 7 days. This check is enforced at token provisioning time, not retroactively.

Limit Breach Responses¶

Limit Hit	HTTP Status	Response Body
Burst exceeded	429 Too Many Requests	`{"error": "rate_limit_exceeded", "retry_after": <seconds>}` + `Retry-After` header
Daily cap hit	429 Too Many Requests	`{"error": "daily_limit_exceeded", "message": "daily submission limit reached; resets at 00:00 UTC"}`
Storage quota exceeded	413 Content Too Large	`{"error": "storage_quota_exceeded", "message": "storage quota exceeded; contact support to discuss an increase"}`
Bundle size cap exceeded	413 Content Too Large	`{"error": "bundle_too_large", "max_bytes": 52428800}`

The API does not queue submissions when limits are hit. Submissions are rejected immediately so actors can see the error and retry at the correct time.

w6 - Moderation Workflow¶

Trust Tiers¶

Trust Label	Meaning	How Assigned
`self-reported`	Actor-submitted; not independently verified	Automatic on successful ingest
`public-curated`	Reviewed and approved by a maintainer	Manual promotion via admin CLI
`rejected`	Failed validation or maintainer review	Set by ingest pipeline or admin
`withdrawn`	Removed by actor or admin	Set via withdrawal API or admin CLI

Trust Promotion Path (self-reported → curated)¶

Maintainer identifies a candidate submission (manual discovery or triggered by benchbox admin review --submission-id X).
Runs the validation suite:
- Schema conformance: bundle matches schema-v2 specification
- Bundle integrity: server-stored hash matches bundle content
- Cohort compatibility: benchmark, scale factor, and execution mode are consistent with existing cohort members
- Sanity checks: no implausible timings (e.g., sub-millisecond TPC-DS), valid platform identifier, valid query count
If all checks pass, maintainer approves:
```
benchbox admin promote --submission-id X --trust public-curated
```
This updates the trust label in the metadata DB and triggers a read model rebuild so the result appears in the curated index.
If any check fails, maintainer rejects:
```
benchbox admin reject --submission-id X --reason schema_invalid|cohort_mismatch|sanity_fail|manual_review
```
Status is set to rejected with the reason code. The actor receives an email notification (if contact on file). The bundle is retained for 30 days, then purged by lifecycle policy.

Takedown Process¶

Actor-initiated withdrawal (own results only):

benchbox result withdraw --submission-id X

Sets status to withdrawn immediately
Removes the result from the public index within one read model rebuild cycle (target: < 15 minutes)
Bundle is retained for 90 days in case the actor wants to resubmit after correcting an error

Admin-forced withdrawal (any result; triggered by abuse, false data, or legal request):

benchbox admin withdraw --submission-id X --reason abuse|false_data|legal|policy_violation

Sets status to withdrawn with force-withdrawal flag and reason code
Triggers immediate removal from public index (synchronous rebuild)
Actor is notified by email within 24 hours with the reason code
Bundle is retained for 180 days for audit purposes, then purged

Audit Log Schema¶

Every state-changing event is appended to the audit log. The log is append-only; no entries may be modified or deleted. Stored outside the metadata DB (write-once object store or managed audit logging service).

{
  "event_id": "evt_01J...",
  "timestamp_utc": "2026-04-01T12:34:56.789Z",
  "actor_id": "actor_abc123",
  "action": "submitted",
  "target_submission_id": "sub_xyz789",
  "target_bundle_hash": "sha256:deadbeef...",
  "reason_code": null,
  "metadata": {}
}

Valid action values:

Action	Triggered By
`submitted`	Actor via submission API
`validated`	Ingest pipeline (automated)
`published`	Ingest pipeline (result becomes public-self-reported)
`rejected`	Ingest pipeline or admin
`withdrawn`	Actor or admin
`trust-promoted`	Admin
`trust-revoked`	Admin
`redacted`	Admin (sensitive field removed; bundle hash invalidated)
`purged`	Lifecycle policy or admin (bundle deleted from object store)

w7 - Platform SLOs¶

Metric	Target	Measurement Method	Alert Threshold
Submission API ACK latency p99	< 5 seconds	Measured at API gateway (synthetic probe every 60s)	Page if p99 > 10s for any 5-minute window
Explorer page load p95	< 500 ms	GitHub Pages / CDN; real-user monitoring via Performance API	Page if p95 > 1000ms for any 5-minute window
Explorer DuckDB-WASM init time	< 3 seconds (p95)	Real-user monitoring; sampled 1% of page loads	Page if p95 > 6s for any 5-minute window
CI build + deploy end-to-end	< 10 minutes	GitHub Actions job duration	Page if any build exceeds 20 minutes
Backup RPO	< 24 hours	Age of last successful metadata DB and object store backup	Page if last backup age > 24 hours
Backup RTO	< 4 hours	Restore drill results (run quarterly)	Flag if drill exceeds 4 hours; schedule remediation

Alert routing:

All page-level alerts route to the on-call maintainer via PagerDuty or equivalent.
Backup RPO breach alerts to all maintainers immediately (not on-call only).
SLO breaches that persist > 30 minutes trigger a P1 incident (Class A or B, depending on nature).

Restore drills: Run quarterly against a staging environment. Document restore duration in the audit log under metadata: {drill: true}. Drill failures trigger a remediation ticket within one week.

w8 - Incident Response Runbook¶

Incident Severity Levels¶

Severity	Meaning	Response Target
P1 (Critical)	Data breach or service fully unavailable	Acknowledge < 15 minutes; mitigate < 1 hour
P2 (High)	Result integrity compromised; partial outage	Acknowledge < 1 hour; mitigate < 4 hours
P3 (Medium)	Abuse campaign; elevated error rate; SLO at risk	Acknowledge < 4 hours; mitigate < 24 hours
P4 (Low)	Degraded performance within SLO; cosmetic issue	Acknowledge < 24 hours; schedule fix

Class A: Result Integrity Incident¶

Examples: Tampered bundle reaches the public index; hash verification fails post-publish; incorrect query timings attributed to wrong platform.

Severity: P2 (High)

Detection signals:

Automated bundle-hash re-verification job fails for a published result
Community report via GitHub issue or email to security@benchbox.dev
Ingest pipeline audit log shows unexpected validated → published transition for a result with a known-bad hash

Response steps:

Withdraw affected result(s) immediately (target: < 1 hour from detection)
```
benchbox admin withdraw --submission-id X --reason false_data
```
Confirm result is removed from public index and explorer rebuild is triggered.
Re-verify all results published in the same time window (± 2 hours around the affected result’s publish timestamp). Use:
```
benchbox admin verify-window --from <ts> --to <ts>
```
Any result that fails hash verification is withdrawn pending investigation.
Identify root cause. Three categories:
- Storage tampering: compare bundle in object store against original ingest hash in audit log
- Pipeline bug: compare ingest pipeline version at publish time against current version; check for known regressions
- Bad bundle submitted by actor: verify actor’s submitted hash matches the stored bundle; if mismatch, the ingest server failed to reject
Notify affected submitters within 24 hours. Email the actor whose result was affected. Include: which result, what was found, what action was taken.
Post public notice if ≥10 results were affected OR if trust was materially misrepresented (e.g., a public-curated result was found to be tampered). Post to benchbox.dev/blog or GitHub Discussions. Include: what happened, which results were affected (by submission_id), and what was done.
File post-incident report within 72 hours covering: timeline, root cause, affected results, remediation, and prevention measures.

Class B: Data Breach¶

Examples: Private bundle or actor contact records exposed; API key database leaked; misconfigured bucket ACL made private bundles world-readable.

Severity: P1 (Critical)

Detection signals:

Anomalous access patterns in storage or metadata DB access logs
External security researcher report to security@benchbox.dev
Automated ACL drift alert (bucket policy changed unexpectedly)
Unexpected download spike for private-prefix objects

Response steps:

Triage: confirm breach with a second independent signal before mass revocation. Correlate the initial detection signal with at least one independent source (e.g., cross-reference anomalous access patterns with audit logs, verify the report with a second team member, or confirm via a different monitoring channel). This step must complete within 30 minutes of detection. If the second signal confirms the breach, proceed to step 2. If the breach cannot be confirmed but also cannot be ruled out, proceed anyway and treat the incident as real until proven otherwise.
Revoke all API keys and rotate service credentials (target: < 1 hour from detection). This is a broad action - err on the side of revoking too many rather than too few.
```
benchbox admin revoke-all-tokens --reason security_incident
```
Rotate object store credentials, metadata DB credentials, and any service account keys. Re-deploy services with new credentials before re-enabling access.

Rollback for false positives: If post-triage investigation determines the incident was a false positive, restore actor access promptly: re-enable revoked API keys (or direct affected actors to re-provision via benchbox setup --service), post a brief explanation to affected actors, and file an internal report documenting the false positive to improve detection accuracy.
Audit access logs to determine exposure scope. For each access log entry in the affected time window, identify: which resources were accessed, by which IP or credential, and whether the access was authorized. Document the scope in the incident report.
Notify affected actors within 72 hours. Legal requirement in most jurisdictions. Notification must include: what data was exposed, approximate time window, what BenchBox has done, and what actors should do (e.g., consider rotating passwords if email was exposed). Coordinate with legal counsel if actor volume is large or if regulated data (PII) was involved.
File incident report with full timeline and remediation steps. Report must be retained for a minimum of 2 years.
Re-enable actor access only after root cause is fixed and a security review confirms the breach vector is closed. Do not rush this step. Actors whose keys were revoked are notified and directed to re-provision via benchbox setup --service.

Class C: Abuse / Spam Campaign¶

Examples: Actor flooding platform with fake or duplicated benchmark results; coordinated campaign to manipulate the public index; automated scraping that triggers rate limit exhaustion for legitimate actors.

Severity: P3 (Medium)

Detection signals:

Rate limit alert: single actor triggering burst limit repeatedly over hours
Moderation queue spike: large volume of new submissions from one or few actors
Community report: users notice implausible results or identical results submitted under different names
Storage quota approaching exhaustion faster than expected

Response steps:

Block the offending actor (target: < 30 minutes from detection).
```
benchbox admin block-actor --actor-id X --reason abuse
```
This revokes the actor’s token and adds them to the deny list. Future token requests from the same email or IP range are rejected.
Withdraw all results from the offending actor. Results submitted during an active abuse campaign cannot be considered trustworthy regardless of technical validity.
```
benchbox admin withdraw-all --actor-id X --reason abuse
```
A read model rebuild is triggered automatically.
Review and tighten rate limits or eligibility rules if the campaign exploited a gap. Examples: lower burst limit, add CAPTCHA to token provisioning, require email verification before first submission.
No public notice required unless results from the offending actor reached the public index (i.e., were promoted to public-curated) before detection. If so, post a brief public notice following the Class A process: what happened, which results were affected (by submission_id), and what was done to remove them.

Contact Directory (On-Call)¶

Keep this current. Store the authoritative version in the private ops runbook, not in this document.

Role	Responsibility
On-call maintainer	First responder for all incident classes
Secondary maintainer	Backup if primary is unavailable
Legal counsel	Required for Class B if PII is exposed
security@benchbox.dev	Public intake for external reports; acknowledge within 72 hours, triage within 7 days