BenchBox Results Platform Product + Architecture Strategy¶
Created: 2026-03-29
Revised: 2026-04-15 (see Primary Surface Revision)
Originating TODO: productize-result-publishing-and-artifact-sharing
Executive Summary¶
BenchBox should not treat local artifact publication, hosted result submission, and public result analysis as one feature. They are adjacent, but they have different product contracts, trust models, and operational requirements.
Recommended split:
- `benchbox publish` - Local/cloud artifact publication. Publishes canonical result bundles to local or cloud storage (S3, R2, etc.). Does not interact with the hosted results platform or public corpus.
- `benchbox submit` - Public corpus contribution. Packages a canonical bundle with a submission manifest and either prepares a PR against `results-data/` (Phase 2) or uploads directly to the hosted API (Phase 3).
- `benchbox.dev/results/` - A static-first public explorer for browsing, comparing, and analyzing curated public results.
Phase 1 Status: Launched (2026-04-04)¶
Phase 1 launched on 2026-04-04. The launch corpus had 6 maintainer-run bundles
across 2 cohorts (TPC-H SF 0.01 and SSB/star_schema SF 0.01), each with ≥3
platforms. As of 2026-04-12, the repository corpus has expanded to 12 bundles
across 4 cohorts by adding SF 0.1 for both benchmark families. All launch
criteria from the checklist below were met.
Revised Launch Phases¶
| Phase | Goal | Write Path | Infrastructure | Priority |
|---|---|---|---|---|
| 1: Static Explorer MVP | Curated seed corpus + read-only explorer at `benchbox.dev/results/` | Maintainer-only: CI-generated results committed under `results-data/` | Static only: GitHub Pages, no API, no auth, no hosted services | Ship first |
| 2: Community Contributions | Community-submitted results via PR-based workflow | PRs against `results-data/` | Still static: GitHub Actions validates + merges + rebuilds; extract to a dedicated data repo only if churn justifies it | Ship when Phase 1 UX is proven |
| 3: Hosted Platform | Self-service submission API, org/team spaces, richer features | Hosted API + object storage + async ingest | API server, metadata DB, auth, rate limiting, moderation | Only if demand warrants the operational burden |
Key Architecture Decision¶
Phase 1 and Phase 2 require zero backend services. The entire read path is
static (derived JSON manifests + DuckDB/Parquet snapshots served via GitHub
Pages). The write path is git + CI/CD inside this repository
(results-data/ → transform → build → deploy).
A hosted API (Phase 3) is explicitly deferred. The “submit via PR” model used by many successful open-source benchmark databases (e.g., js-framework-benchmark, ClickBench contribution model) proves that community contributions scale well without a custom API until volume demands one.
Product Intent and Positioning¶
Core Hypothesis¶
“People want to browse and compare public benchmark results across platforms.” This is the primary value proposition that Phase 1 must validate.
Target Audience¶
The broader data/analytics community - not just existing BenchBox users. The explorer is a credibility and marketing play: transparent, reproducible, multi-benchmark results that visitors can explore and compare themselves.
Differentiator vs ClickBench¶
ClickBench covers a single workload in a single format. BenchBox’s explorer differentiates on three axes:
- Multi-benchmark coverage - TPC-H, TPC-DS, SSB, and future benchmarks in one place, not siloed sites
- Rich per-query detail - execution plans, tuning configurations, validation status, companion files
- Reproducibility - any published result can be re-run with `benchbox run` using the same parameters

ClickBench was considered and rejected because its format cannot capture what BenchBox measures (axes 2 and 3 above).
Note (2026-04-14): Axes 2 and 3 are currently aspirational. The Phase 1 explorer exposes a narrow read model (11 fields; see Fidelity Gaps section below) that does not yet surface tuning config, execution plans, validation status, execution mode, or cost data. These must be realized before the differentiation claim is credible to a visitor who clicks through to a result detail page. The fidelity gap TODOs (
`explorer-extend-manifest-and-pipeline`, `explorer-add-tuning-config-visibility`, `explorer-add-methodology-disclosure`) address this.
Explorer as Dynamic Tool¶
Superseded 2026-04-15. The “dynamic comparison tool, not a curated leaderboard” framing was correct as a rejection of vanity-ranking pages but wrong as a rejection of the matrix leaderboard pattern. See Primary Surface Revision. The revised position: the per-benchmark primary surface is a ClickBench-style Platform × Query matrix leaderboard, with the dynamic Compare page as a secondary deeper-analysis surface. Both ship; the matrix is the landing view.
The explorer is a dynamic comparison tool, not a static shootout page. Visitors pick benchmark, scale factor, and platforms to build their own comparisons. This is the core UX - not a curated leaderboard.
Corpus Size and DuckDB-WASM Justification¶
The benchmark × platform × scale matrix is large even at launch. Phase 1 should cover most supported benchmarks and platforms at limited scale factors. DuckDB-WASM is justified because:
- The corpus needs to be large enough for a dynamic tool to be useful
- It will grow quickly as new platforms and benchmarks are added
- It demonstrates BenchBox's DuckDB expertise (audience alignment)
The primary corpus criterion is depth per benchmark: each included benchmark must have ≥3 comparable platforms at the same scale factor. The ≥30 total bundle count is a secondary estimate that falls out of the coverage matrix, not an independent target. Depth is the right primary target because the core deliverable is a comparison tool - a cohort needs multiple platforms at the same scale to be meaningful, regardless of total bundle count.
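A minimal sketch of how the depth criterion could be checked mechanically against the derived manifest (assumes the `manifest.json` shape defined under Derived Read Model Schema; the helper name and output shape are illustrative):

```python
import json
from collections import defaultdict

def check_cohort_depth(manifest_path: str, min_platforms: int = 3) -> dict:
    """Group manifest results into (benchmark, scale_factor) cohorts and report
    which cohorts meet the >=3 comparable-platforms criterion."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    cohorts = defaultdict(set)
    for result in manifest["results"]:
        cohorts[(result["benchmark"], result["scale_factor"])].add(result["platform"])

    return {
        cohort: {"platforms": sorted(platforms), "meets_depth": len(platforms) >= min_platforms}
        for cohort, platforms in cohorts.items()
    }

# Example: flag cohorts that are not yet deep enough to ship.
report = check_cohort_depth("results-explorer/public/data/manifest.json")
shallow = [cohort for cohort, info in report.items() if not info["meets_depth"]]
print("cohorts below depth threshold:", shallow)
```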
Visual Identity and Build Pipeline¶
The explorer is a standalone Vite app with its own build pipeline, but shares the benchbox.dev visual identity (header, nav, styling). It is not embedded in the docs or blog - it is a distinct feature of the site.
SEO¶
Search engine discoverability is nice-to-have but not a launch blocker. Users will primarily arrive through benchbox.dev directly. Pre-rendering can be added post-launch if organic traffic becomes a goal.
Timeline¶
No hard deadline or external event. Quality over speed.
Brand Decision¶
Resolved 2026-04-03: The results explorer lives at benchbox.dev/results/ under
the BenchBox brand. See
results-explorer-brand-ownership.md for
the full decision rationale.
Summary: Oxbow Research has no independent web presence, domain, or active product
identity - it was an earlier brand concept that was deliberately separated from
BenchBox in git history. The explorer’s core value proposition (reproducible results,
benchbox run re-execution) is inseparable from BenchBox, making BenchBox placement
the coherent choice. The working default is confirmed; no changes to existing
scaffolding or CI/CD are needed.
Primary Surface Revision (2026-04-15)¶
Added following a strategic review of the explorer vs ClickBench paradigm and an adversarial review of the initial proposal.
What changed and why¶
The 2026-04-14 fidelity work identified that the explorer dropped too many bundle fields in its derived read model. The 2026-04-15 review went one level up: the UX paradigm itself is wrong. The current explorer is browse-first (land on home → navigate to benchmark → get a sorted list → maybe click Compare). ClickBench is compare-first: the landing view for a benchmark is already the cross-platform answer.
That is not a cosmetic difference. It is the product definition. The revised product position:
The per-benchmark landing view is a Platform × Query matrix leaderboard (ClickBench-style). The dynamic Compare page remains as a secondary surface for deeper per-query analysis. The detail page remains as the reproducibility-story surface. All three ship.
Why the original “matrix only, dynamic only” framings were both wrong¶
The original strategy doc rejected a “curated leaderboard” page. That rejection still stands - there is no vanity-ranking page in the product. But a matrix leaderboard is not a vanity ranking: it is a cohort-aware, per-query, per-column-normalized view where ranking emerges from the data rather than from editorial weighting. It is the comparison tool, rendered densely, rather than a competitor to it.
The initial redesign proposal (see the companion review) was correct in its main recommendation but wrong in several specifics. The final adopted shape incorporates the following corrections:
| Area | Initial proposal | Final adopted position |
|---|---|---|
| Artifact key | | Derived artifact keyed on `(benchmark, scale_factor, phase)` |
| Color scale | linear min-max | log10 ratio-to-fastest, clamped at 10× |
| Primary ranking metric | `geomean_ms` default | per-family registry: `power_score` for TPC-H/TPC-DS, `geomean_ms` elsewhere |
| Trust default | filter to | show all tiers with visible badges; trust is an optional filter |
| Compare URL | full result_ids in query string | short hash IDs with full-form fallback |
| Home page | recent results list | cross-benchmark meta-leaderboard (avg-rank aggregation) + recent results secondary |
| Accessibility | unspecified | aria-labels on every data cell, keyboard nav, reduced-color mode, axe-core in CI |
Revised derived read model¶
In addition to the artifacts already specified in Derived Read Model Schema, the pipeline emits:
- `benchmarks/{benchmark}_sf{sf}_{phase}.json.gz` - one file per unique `(benchmark, scale_factor, phase)` tuple, containing the full cross-platform query matrix plus per-platform `power_score`, `geomean_ms`, `cost_usd`, `trust_label`, `tuning_mode`, and `short_id`. Gzip compressed on write.
- `short_ids.json` - lookup table `{short_id: full_result_id}` to allow compact Compare URLs without breaking existing long-form URLs.
- `meta_leaderboard.json` - cross-benchmark rank aggregation used by the home page. Platforms ranked within each cohort (cohorts with ≥2 platforms only), then aggregated as simple mean of ranks across appearances. No weighted composite score.
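A sketch of the `meta_leaderboard.json` aggregation described above - rank within each cohort, then simple mean of ranks across appearances. The function signature and input shape are illustrative; the real pipeline code may differ:

```python
from collections import defaultdict
from statistics import mean

def build_meta_leaderboard(cohort_metrics: dict[tuple, dict[str, float]]) -> list[dict]:
    """cohort_metrics maps (benchmark, scale_factor, phase) -> {platform: value},
    where value is oriented so that lower is better (e.g. geomean_ms, or a
    negated power_score). Cohorts with fewer than two platforms are skipped."""
    ranks = defaultdict(list)
    for metrics in cohort_metrics.values():
        if len(metrics) < 2:  # a single-platform cohort carries no ranking signal
            continue
        for rank, platform in enumerate(sorted(metrics, key=metrics.get), start=1):
            ranks[platform].append(rank)

    rows = [
        {"platform": p, "avg_rank": round(mean(r), 2), "n_cohorts": len(r)}
        for p, r in ranks.items()
    ]
    return sorted(rows, key=lambda row: row["avg_rank"])  # serialized as meta_leaderboard.json
```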
Revised UX surfaces¶
| Surface | Revised role |
|---|---|
| Home (`/results/`) | Cross-benchmark meta-leaderboard is the hero panel. Recent results list drops to secondary position. |
| Benchmark landing | Matrix leaderboard as the default view. List view available via toggle for mobile. |
| Compare (`/results/compare`) | Unchanged in role; accepts short IDs. Reached from matrix-row checkboxes. |
| Detail | Unchanged. Surface for methodology disclosure and reproducibility. |
What explicitly does not change¶
No branded leaderboard page remains the rule. The matrix leaderboard is cohort-aware (single scale, single phase, same benchmark); it is not a cross-context vanity ranking.
All three differentiators remain (multi-benchmark coverage, rich per-query detail, reproducibility). The matrix leaderboard strengthens differentiator #1 by making the multi-benchmark story visible on the home page.
Cohort-breaking guardrails (benchmark, scale, tuning mode, execution mode, phase) remain enforced. The matrix view exposes all five as explicit axes in the filter bar instead of hiding them.
TODO cluster¶
Addressed by the explorer- prefixed TODO cluster added 2026-04-15. See TODO
Cluster and Priority for the full table.
Explorer Fidelity: Known Gaps and Requirements¶
Added 2026-04-14 following a post-launch audit of explorer vs CLI divergence.
Canonical Duration Metric¶
The explorer currently uses total_duration_s (wall-clock sum from the bundle’s
run.total_duration_ms). The CLI uses geometric mean of per-query times as
the primary comparison metric for OLAP workloads. These produce different numbers
for the same result, making cross-referencing between CLI output and the explorer
confusing and misleading for results with different query counts.
Decision: The explorer’s primary comparison metric must be geometric mean
of per-query execution times (geomean_ms), computed at pipeline build time
and stored in both the manifest and DuckDB schema. Wall-clock total duration may
be surfaced as a secondary metric with a clear label, but it must not be the
default sort/compare axis. This aligns with CLI behavior and with standard OLAP
benchmarking practice.
Refinement (2026-04-15): The geomean-is-canonical decision applies where
the benchmark family has no published aggregate metric (ClickBench, SSB,
custom benchmarks). For TPC-H and TPC-DS, the canonical metric is
power_score as defined by the TPC specification - users arriving from
published TPC results expect to see that number. The explorer pipeline
exposes a per-family RANKING_METRIC_BY_FAMILY registry
(benchbox/core/explorer_pipeline/models.py) that selects the primary
metric per benchmark family and serializes the choice into each
BenchmarkSummary artifact’s ranking field - no ranking logic lives in
TypeScript. geomean_ms / display_geomean_ms remains as the universal
secondary metric shown alongside. Implemented in explorer-align-ranking-metric-with-tpc-standards.
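A minimal sketch of both decisions - geometric mean computed at build time and per-family primary-metric selection. The registry contents mirror the text above, but the exact values and module layout of the real `RANKING_METRIC_BY_FAMILY` in `benchbox/core/explorer_pipeline/models.py` may differ:

```python
import math

# Assumed shape of the per-family registry; illustrative, not the shipped constant.
RANKING_METRIC_BY_FAMILY = {
    "tpch": "power_score",   # TPC-defined aggregate; what readers of published TPC results expect
    "tpcds": "power_score",
    "ssb": "geomean_ms",     # no published aggregate metric -> geometric mean is canonical
    "clickbench": "geomean_ms",
}

def geomean_ms(query_times_ms: list[float]) -> float:
    """Geometric mean of per-query execution times, computed at pipeline build time."""
    if not query_times_ms or any(t <= 0 for t in query_times_ms):
        raise ValueError("geomean requires positive per-query timings")
    return math.exp(sum(math.log(t) for t in query_times_ms) / len(query_times_ms))

def primary_metric(benchmark_family: str) -> str:
    """Select the primary ranking metric for a family; geomean is the universal fallback."""
    return RANKING_METRIC_BY_FAMILY.get(benchmark_family, "geomean_ms")
```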
Comparability Model¶
Two results are comparability-breaking on the following dimensions - the compare view must warn or hard-block when they differ:
| Dimension | Breaking? | Action |
|---|---|---|
| Benchmark family | Yes (hard-block) | Already enforced |
| Scale factor | Yes (hard-block) | Already enforced |
| Execution mode (SQL vs DataFrame) | Yes (warn) | Not yet surfaced |
| Tuning mode (tuned vs notuning) | Yes (warn) | Not yet surfaced |
| Query subset vs full benchmark | Yes (warn) | Not yet surfaced |
| Platform version | No (label) | Partially surfaced |
The explorer cannot enforce the tuning and execution-mode warnings until those
fields reach the manifest. See explorer-extend-manifest-and-pipeline.
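A sketch of how the compare view could evaluate these rules once the fields land in the manifest (field names follow the target manifest schema; the function itself is illustrative logic, not the shipped TypeScript):

```python
HARD_BLOCK_FIELDS = ("benchmark", "scale_factor")
WARN_FIELDS = ("execution_mode", "tuning_mode")

def comparability(a: dict, b: dict) -> tuple[bool, list[str]]:
    """Return (allowed, warnings): hard-block dimensions refuse the comparison,
    warn dimensions render it with a visible caveat."""
    for field in HARD_BLOCK_FIELDS:
        if a.get(field) != b.get(field):
            return False, [f"incompatible {field}: {a.get(field)!r} vs {b.get(field)!r}"]
    warnings = [
        f"{field} differs: {a.get(field)!r} vs {b.get(field)!r}"
        for field in WARN_FIELDS
        if a.get(field) != b.get(field)
    ]
    if a.get("query_count") != b.get("query_count"):
        warnings.append("query subset differs (query_count mismatch)")
    return True, warnings
```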
Pipeline Fidelity Gap¶
The schema-v2 bundle contains 13 blocks and 50+ fields. The Phase 1 pipeline transformer emits 11 manifest fields, silently dropping:
| Category | Dropped Fields | Impact |
|---|---|---|
| Execution metadata | | Can't distinguish modes |
| Tuning config | | No tuning UI, can't cross-compare |
| Platform identity | | Only driver version surfaced |
| Run config | | No config disclosure |
| Phase timing | All | Can't see load vs query time |
| Cost | | No cost analysis |
| Validation | | No validation status visible |
| Test type | Power vs throughput distinction | Can't label test type |
| Query granularity | | Can't identify warmup vs measurement |
The DuckDB results table currently has 11 columns. It should grow to ~20 to
enable meaningful client-side filtering. See explorer-extend-manifest-and-pipeline.
Methodology Transparency¶
A visitor to a result detail page currently cannot determine:
- Whether `total_duration_s` is a sum, mean, or median
- Whether the run used tuning (a configuration that can significantly affect results)
- Whether warmup queries are included in the timing
- What hardware the result was produced on beyond OS/arch/CPU count
Every result detail page must include a methodology disclosure panel that
states the aggregation method, tuning state, execution mode, and key environment
parameters. See explorer-add-methodology-disclosure.
Chart Parity Targets¶
The CLI ships 15 chart types; the explorer ships 2 inline SVG components. Full parity is not required, but the following gap is strategically significant for the “dynamic comparison tool” claim:
| Missing chart | Value | Priority |
|---|---|---|
| Normalized speedup (log scale, baseline-relative) | Primary comparison metric for large perf deltas | High |
| Diverging bar (per-query regression/improvement) | Makes "which queries got slower?" obvious | High |
| Phase breakdown (stacked load vs query) | Distinguishes load bottleneck from query bottleneck | Medium |
| Percentile ladder (P50/P90/P95/P99) | Identifies tail-latency outliers | Medium |
| Cost scatter | "Cost vs performance" is a primary analyst question | Low (Phase 2+) |
See explorer-add-comparison-charts.
Phase 1 MVP Definition¶
Primary Deliverable: Dynamic Comparison Tool¶
The core UX of Phase 1 is a dynamic comparison tool where visitors pick a benchmark, scale factor, and platforms, then see a side-by-side query-level timing breakdown. This is the feature that differentiates the explorer from a file listing, makes users come back and share links, and validates the core hypothesis (“people want to browse and compare public benchmark results”).
All other Phase 1 deliverables (home page, browse pages, detail pages) exist to support navigation into and out of comparisons. They are necessary but secondary.
What Ships¶
| Component | Role | Description | Done When |
|---|---|---|---|
| Compare view | Primary | Dynamic comparison tool: pick benchmark/scale/platforms, see side-by-side query-level timing breakdown with cohort validation | Compare works for any compatible results in the corpus; shareable URLs |
| Seed corpus | Data | Result bundles with depth per benchmark (many platforms at key scale factors) rather than just breadth | Bundles exported, validated, committed under `results-data/` |
| Static build pipeline | Data | Transforms canonical schema-v2 bundles into navigation manifest JSON, per-result detail JSON, and DuckDB database snapshot | Pipeline runs in CI, output deployed to GitHub Pages |
| Explorer home + browse | Navigation | Landing page with summary cards; benchmark and platform index pages with filterable result lists and "compare" checkboxes | Users can navigate to any result and select results for comparison |
| Result detail page | Supporting | Stable URL per result showing metadata, query timings, validation status, and raw bundle download | Detail page works for all seed corpus results |
| DuckDB-WASM filtering | Interaction | Client-side filtering by benchmark, platform, scale factor, date range | Filters work over the full corpus |
| GitHub Pages integration | Deployment | Explorer builds and deploys alongside existing landing + docs + blog | Single |
What Does NOT Ship in Phase 1¶
- No user accounts, authentication, or authorization
- No hosted submission API or `benchbox submit` command
- No anonymous or community uploads
- No branded "leaderboard" page, but cohort views may be sorted by total duration or geometric mean - this is a sorted table within a validated cohort, not a cross-context ranking claim
- No organization accounts or private workspaces
- No moderation, trust labels, or abuse controls (not needed - corpus is maintainer-curated)
Launch Criteria¶
Seed corpus has sufficient depth for the dynamic comparison tool to show meaningful cross-platform comparisons: each included benchmark has ≥3 comparable platforms at the same scale factor
Compare view is the primary entry point and works end-to-end: select benchmark/scale/platforms → see query-level timing breakdown → share URL
All explorer pages render correctly with real data
DuckDB-WASM filtering works in Chrome, Firefox, Safari
GitHub Pages deployment succeeds end-to-end from CI
Explorer is navigable from the existing benchbox.dev site header/nav
Brand ownership decision is resolved and reflected in domain/hosting/identity
Phase 2: Community Contributions (Deferred)¶
Model: Submit via Pull Request¶
Instead of building a hosted API, Phase 2 uses a PR-based contribution model:
1. Contributor runs `benchbox submit --output ./submission/`, which packages the canonical schema-v2 bundle with a submission manifest (contributor metadata, benchmark context, optional notes)
2. Contributor opens a PR against this repository touching `results-data/`
3. GitHub Actions CI validates: schema conformance, bundle integrity (hash check), cohort compatibility, and basic sanity checks (no absurd timings, valid platform) - see the sketch after this list
4. Maintainers review and merge
5. Merge triggers rebuild of derived read models + redeploy of explorer
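A sketch of the CI validation step (step 3 above) as a Python check. Schema conformance is assumed to be delegated to the existing validator and omitted here; the submission file layout and field names are assumptions for illustration:

```python
import hashlib
import json
from pathlib import Path

MAX_PLAUSIBLE_QUERY_MS = 24 * 3600 * 1000  # sanity ceiling: no single query over a day

def validate_submission(submission_dir: Path, known_platforms: set[str]) -> list[str]:
    """Return a list of human-readable validation errors (empty list = pass)."""
    errors = []
    bundle_path = submission_dir / "bundle.json"          # assumed layout
    manifest_path = submission_dir / "submission.json"    # assumed layout
    bundle = json.loads(bundle_path.read_text())
    manifest = json.loads(manifest_path.read_text())

    # Bundle integrity: the submission manifest records the bundle hash at packaging time.
    digest = hashlib.sha256(bundle_path.read_bytes()).hexdigest()
    if manifest.get("bundle_sha256") != digest:
        errors.append("bundle hash does not match submission manifest")

    # Basic sanity checks.
    if bundle.get("platform") not in known_platforms:
        errors.append(f"unknown platform: {bundle.get('platform')!r}")
    for q in bundle.get("queries", []):
        if not 0 < q.get("ms", -1) < MAX_PLAUSIBLE_QUERY_MS:
            errors.append(f"implausible timing for {q.get('id')}: {q.get('ms')} ms")
    return errors
```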
A dedicated benchbox-results repository remains an optional future extraction
if corpus size or contribution volume starts to overwhelm the main repo. It is
not a Phase 1 or early Phase 2 requirement.
Why PR-Based Before API-Based¶
| Concern | PR model | API model |
|---|---|---|
| Auth | GitHub identity (free) | Custom auth system (build + operate) |
| Moderation | PR review (familiar, auditable) | Custom moderation UI (build + operate) |
| Abuse prevention | PR rate = human rate | Rate limiting, quotas, captchas |
| Trust labels | Commit author = attribution | Custom trust promotion workflow |
| Operational cost | Zero (GitHub Actions) | API server + DB + storage + monitoring |
| Scalability ceiling | ~100s of submissions/month | Thousands+/month |
The PR model is sufficient until submission volume exceeds what maintainer review can handle. That threshold is unlikely in the first year of a results platform.
What Ships in Phase 2¶
- `benchbox submit` command to package and submit results to the public corpus
- Submission manifest schema (contributor, context, notes)
- CI validation workflow for the data repository
- Trust labels in explorer: "maintainer" vs "community-submitted"
- Contributor guidelines documentation
Phase 3: Hosted Platform (Deferred)¶
Phase 3 is explicitly contingent on Phase 2 reaching a scale where the PR model becomes a bottleneck. Indicators that Phase 3 is needed:
Submission volume exceeds ~50/month sustained
Maintainer review becomes a sustained bottleneck
Users need private/unlisted results (not possible in a public repo)
Organization/team features are requested by paying or strategic users
What Would Ship in Phase 3¶
- Hosted submission API at `api.benchbox.dev`
- Authentication (API keys, OAuth)
- Private and unlisted visibility states
- Automated trust promotion workflow
- Rate limiting, quotas, abuse controls
- Organization/team spaces
- Richer APIs and embedded widgets
The planned Phase 3 operating model is documented in
../operations/results-phase-3-runbook.md.
Cost and Operational Complexity¶
A hosted platform requires:
| Component | Estimated Cost | Operational Burden |
|---|---|---|
| API server (e.g., Fly.io, Railway) | $20-100/mo | Deployment, monitoring, on-call |
| Metadata database (Postgres) | $15-50/mo | Backups, migrations, scaling |
| Object storage (S3/R2) | $5-20/mo | Lifecycle policies, access control |
| Auth provider (Auth0/Clerk) | $0-25/mo | Token management, session handling |
| Monitoring (Sentry, metrics) | $0-30/mo | Alert triage, incident response |
Total: $40-225/month + significant engineering time. This is only justified if the results platform becomes a core product with sustained community usage.
Current BenchBox Constraints¶
| Constraint | Evidence | Strategy implication |
|---|---|---|
| Canonical results already exist as schema-v2 bundles with companion files | | All downstream paths ingest the real exported bundle, not a second payload |
| Public site is currently static GitHub Pages assembled from landing + docs + blog | | The public explorer must be a static subsite - no server dependency |
| BenchBox already hints at a hosted service contract | | CLI public submission ( |
| Existing publishing prototype is process-local | | The prototype is not the hosted service architecture; it is only a source of reusable concepts |
Reference Matrix¶
| Reference | Strong pattern | What BenchBox should copy | What BenchBox should not copy |
|---|---|---|---|
| Geekbench | Stable public result pages, comparison flows, account-linked online result management, offline vs online distinction | Stable result detail pages, obvious compare actions, explicit separation between local results and hosted results | A closed scoring model or consumer-device-centric assumptions |
| CloudSpecs | Static browser app, GitHub Pages hosting, browser-side DuckDB-WASM analysis over a curated dataset | Phase 1 reference architecture: static-first explorer, DuckDB-WASM for browser-side analytics, downloadable snapshots, reproducible analysis artifacts | No-write-path assumptions for the whole product |
| OpenBenchmarking | Centralized submission ecosystem, aggregate comparison, rich result metadata, public/private policy | Cohort-aware comparison, richer metadata, trust labels, aggregate analysis | Day-1 open public firehose without curation, moderation, or clear verification state |
| ASV | Results stored as files, publish to a static website, precomputed regression views | Derived read models published as static assets, regression/change-oriented views, offline-friendly read path | Limiting the product to codebase-over-time regressions only |
Product Boundary¶
BenchBox needs three explicit user contracts, but they do NOT all ship at once.
| Contract | Primary actor | Phase | Runtime boundary |
|---|---|---|---|
| Publish (local/cloud) | BenchBox user sharing files or mirroring artifacts | Independent (existing TODO) | CLI + local/cloud storage backend |
| Explore | Reader/analyst comparing public results | Phase 1 | Static subsite on GitHub Pages |
| Submit (PR) | Community contributor adding results | Phase 2 | |
| Submit (API) | Self-service submitter | Phase 3 | |
Technology Recommendations for Phase 1¶
Explorer Frontend Stack¶
| Choice | Recommendation | Rationale |
|---|---|---|
| Build tool | Vite | Fast, modern, used by CloudSpecs reference. Produces optimized static bundles. |
| Framework | Vanilla TypeScript + Preact (or no framework) | Minimizes bundle size for a content-heavy site. Preact if component model helps; plain TS if it stays simple. |
| Browser analytics | DuckDB-WASM | BenchBox already has deep DuckDB expertise. Enables SQL-powered filtering, comparison, and ad-hoc analysis in the browser. Proven by CloudSpecs. |
| Data format | Static JSON manifests + DuckDB database file | JSON for navigation/SEO/fast page loads. DuckDB database for browser-side analysis. |
| Routing | File-based with real paths | |
| Styling | Tailwind CSS or minimal custom CSS | Consistent with modern static sites. Light enough for GitHub Pages. |
Derived Read Model Schema¶
The static build pipeline transforms canonical schema-v2 bundles into:
- `manifest.json` - Global navigation index. Target schema (fields marked `*` are Phase 1 launched; unmarked fields are required additions per the fidelity gap work):

```jsonc
{
  "results": [
    {
      "id": "tpch-duckdb-sf1-20260315",          // *
      "benchmark": "tpch",                        // *
      "platform": "duckdb",                       // *
      "scale_factor": 1.0,                        // *
      "run_date": "2026-03-15",                   // *
      "total_duration_s": 1.234,                  // * (wall-clock; secondary metric)
      "geomean_ms": 56.2,                         // target: canonical comparison metric
      "query_count": 22,                          // *
      "trust_label": "maintainer-run",            // *
      "visibility": "public-curated",             // *
      "driver_version": "1.1.0",                  // *
      "platform_version": "1.1.0",                // target
      "execution_mode": "sql",                    // target: "sql" | "dataframe"
      "tuning_mode": "tuned",                     // target: "tuned" | "notuning" | "auto"
      "tuning_hash": "abc123",                    // target: stable hash for cross-compare grouping
      "test_type": "power",                       // target: "power" | "throughput"
      "validation_status": "passed",              // target: "passed" | "failed" | "skipped"
      "cost_usd": null,                           // target: null if unavailable
      "bundle_path": "bundles/tpch-duckdb-sf1-20260315.json"  // *
    }
  ],
  "benchmarks": ["tpch", "tpcds", "ssb"],
  "platforms": ["duckdb", "datafusion", "clickhouse", "polars-df"],
  "generated_at": "2026-03-29T00:00:00Z"
}
```
- Per-result detail JSON - Full query timings + metadata for result pages. Target detail record:

```json
{
  "id": "tpch-duckdb-sf1-20260315",
  "metadata": {
    "benchmark": "tpch",
    "platform": "duckdb",
    "environment": "...",
    "execution_mode": "sql",
    "tuning_mode": "tuned",
    "tuning_summary": "DuckDB default tuning profile",
    "platform_version": "1.1.0",
    "engine_version": null
  },
  "queries": [
    {"id": "Q1", "ms": 45.2, "rows": 4, "status": "passed", "run_type": "measurement", "iter": 1, "stream": 1}
  ],
  "summary": {
    "total_ms": 1234,
    "geomean_ms": 56.2,
    "passed": 22,
    "failed": 0,
    "validation_status": "passed"
  },
  "bundle_download": "bundles/tpch-duckdb-sf1-20260315.json"
}
```
- `results.duckdb` - DuckDB database for browser-side analysis:
  - `results` table: one row per result run - target ~20 columns (see manifest target above; all manifest fields should be filterable via DuckDB-WASM)
  - `queries` table: one row per query execution `(result_id, query_id, ms, rows, status, run_type, iter, stream)`
  - Enables queries such as:

```sql
SELECT q.*
FROM queries q
JOIN results r ON r.id = q.result_id
WHERE r.benchmark = 'tpch' AND r.platform = 'duckdb'
ORDER BY q.ms;
```

- `bundles/` - Raw canonical schema-v2 bundles for download
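A sketch of how the pipeline could emit `results.duckdb` with the duckdb Python package, following the two-table shape above; column lists are abbreviated and the function is illustrative, not the shipped transformer:

```python
import duckdb

def write_results_db(db_path: str, results: list[dict], queries: list[dict]) -> None:
    """Emit the browser-facing snapshot: one `results` row per run, one
    `queries` row per query execution (columns abbreviated for illustration)."""
    con = duckdb.connect(db_path)
    try:
        con.execute("""
            CREATE OR REPLACE TABLE results (
                id VARCHAR, benchmark VARCHAR, platform VARCHAR, scale_factor DOUBLE,
                run_date VARCHAR, geomean_ms DOUBLE, total_duration_s DOUBLE,
                trust_label VARCHAR, tuning_mode VARCHAR, execution_mode VARCHAR
            )
        """)
        con.execute("""
            CREATE OR REPLACE TABLE queries (
                result_id VARCHAR, query_id VARCHAR, ms DOUBLE, "rows" BIGINT,
                status VARCHAR, run_type VARCHAR, iter INTEGER, stream INTEGER
            )
        """)
        con.executemany(
            "INSERT INTO results VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
            [[r["id"], r["benchmark"], r["platform"], r["scale_factor"], r["run_date"],
              r.get("geomean_ms"), r.get("total_duration_s"), r.get("trust_label"),
              r.get("tuning_mode"), r.get("execution_mode")] for r in results],
        )
        con.executemany(
            "INSERT INTO queries VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            [[q["result_id"], q["query_id"], q["ms"], q["rows"], q["status"],
              q["run_type"], q["iter"], q["stream"]] for q in queries],
        )
    finally:
        con.close()
```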
Compare URL Format¶
Compare views use query parameters for flexibility:

- `/results/compare?ids=tpch-duckdb-sf1-20260315,tpch-datafusion-sf1-20260315`
- Cohort validation happens client-side: same benchmark + same scale factor required
Cohort mismatch is a hard-block, not a warning. The compare view refuses to render incompatible comparisons (different benchmark or different scale factor). Non-blocking differences (query subset, tuning mode) produce warnings but do not prevent rendering.
Site Integration¶
The explorer is built as a standalone Vite app in results-explorer/. The
static read-model pipeline writes build inputs to results-explorer/public/data/;
the Vite build emits static files to results-explorer/dist/; the existing
GitHub Pages workflow then copies that output into site/results/.
Build flow:
results-data/ + static build pipeline → results-explorer/public/data/
results-explorer/public/data/ + Vite build → results-explorer/dist/
results-explorer/dist/ + [landing page] + [sphinx docs] + [blog] → /site/ → GitHub Pages
Navigation integration: add “Results” link to the shared site header/nav.
Architecture by Phase¶
Phase 1 Architecture (Static Only)¶
CI benchmark runs → schema-v2 bundles in results-data/
↓
static build pipeline → results-explorer/public/data/
↓
Vite build (dist/)
↓
GitHub Pages assembly copies dist/ → site/results/
↓
GitHub Pages (benchbox.dev/results/)
↓
Vite app + DuckDB-WASM in browser
No API. No database. No auth. No hosted services.
Phase 2 Architecture (PR-Based Contributions)¶
Contributor: benchbox submit --output ./submission/
↓
PR touching results-data/ in this repo
↓
CI validates (schema, hash, cohort, sanity)
↓
Maintainer reviews + merges
↓
Same static build pipeline as Phase 1
Still no API. Still no hosted services. GitHub is the auth + moderation layer.
Phase 3 Architecture (Hosted - Deferred)¶
benchbox submit → api.benchbox.dev → object store + metadata DB
↓
async ingest + validation
↓
derived read model rebuild
↓
static explorer update
Only build this if Phase 2 PR volume exceeds maintainer capacity.
Storage Layers (Phase 3 Only)¶
| Layer | Purpose | Properties |
|---|---|---|
| Object store | Immutable raw bundle + companions | Content-addressable, versioned, durable |
| Metadata store | Submission, run, visibility, trust, cohort metadata | Queryable, transactional, auditable |
| Derived public store | Static projections for public reads | Rebuildable, cacheable, CDN-friendly |
Result Identity¶
Result identity is phase-dependent:
| Phase | Identity Scheme |
|---|---|
| Phase 1 | |
| Phase 2 | Same, plus contributor attribution from PR author |
| Phase 3 | Adds |
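One plausible way to derive the Phase 1 identity and the short-ID alias used by `short_ids.json`; the ID composition and hash length here are assumptions for illustration, not the committed scheme:

```python
import hashlib

def result_id(benchmark: str, platform: str, scale_factor: float, run_date: str) -> str:
    """Human-readable, stable identity, e.g. 'tpch-duckdb-sf1-20260315' (assumed composition)."""
    sf = f"sf{scale_factor:g}".replace(".", "_")
    return f"{benchmark}-{platform}-{sf}-{run_date.replace('-', '')}"

def short_id(full_id: str, length: int = 8) -> str:
    """Compact, URL-friendly alias; short_ids.json maps it back to the full ID."""
    return hashlib.sha256(full_id.encode()).hexdigest()[:length]

rid = result_id("tpch", "duckdb", 1.0, "2026-03-15")
lookup = {short_id(rid): rid}   # emitted as short_ids.json by the pipeline
```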
Trust, Visibility, and Ranking¶
Trust complexity scales with phases:
| Phase | Trust Model |
|---|---|
| Phase 1 | All results are maintainer-curated. A simple "Maintainer Run" label is fine, but no richer trust model is needed. |
| Phase 2 | Two labels: maintainer (generated by BenchBox CI) and community (submitted via PR). Both public. |
| Phase 3 | Full trust tiers: private, unlisted, public-self-reported, public-curated, public-verified |
Comparison and ranking should be cohort-aware. Public pages must avoid mixing incompatible runs across materially different contexts:
benchmark family and version
scale factor
execution mode
phase set
query subset vs full benchmark
tuning mode
hardware or platform family where relevant
If a cohort is too heterogeneous for a clean ranking, the explorer should fall back to filters and pairwise comparison rather than pretend the leaderboard is authoritative.
No branded “leaderboard” page. Phase 1 browse views may sort results within a validated cohort (e.g., by total duration or geometric mean), but this is presented as a sorted table, not a ranked competition. Cross-context ranking claims are never shown.
Impact on Existing Publishing TODO¶
productize-result-publishing-and-artifact-sharing remains the BenchBox-local
artifact publication track. It owns benchbox publish workflows for copying
canonical result bundles to local/cloud storage. It does NOT own the results
platform or explorer.
TODO Cluster and Priority¶
Planning and Infrastructure¶
| TODO | Phase | Priority | Status |
|---|---|---|---|
| | Planning | High | Done |
| | Planning | High | Done |
| | Phase 1 | High | Done |
| | Phase 1 | High | Done |
| | Phase 2-3 prep | Medium | Done |
| | Phase 3 | Medium | Not started |
| | Phase 2-3 | Medium | Not started |
| | Phase 3 | Low | Not started |
Explorer Fidelity (added 2026-04-14)¶
These TODOs address the gaps identified in the post-launch audit. They are sequenced: pipeline extension first (all others depend on the richer manifest), then UI features that consume the new fields.
| TODO | Phase | Priority | Rationale |
|---|---|---|---|
| `explorer-extend-manifest-and-pipeline` | Phase 1.5 | High | Unblocks all other fidelity work; adds geomean_ms, execution_mode, tuning_mode, tuning_hash, platform_version, test_type, validation_status, cost_usd to manifest + DuckDB |
| | Phase 1.5 | High | Fixes the CLI vs explorer metric discrepancy; geomean becomes the primary comparison axis |
| `explorer-add-tuning-config-visibility` | Phase 1.5 | Medium-High | Surfaces tuning config in detail pages and enables cross-tuning comparison |
| `explorer-add-methodology-disclosure` | Phase 1.5 | Medium-High | Adds "how this was measured" panel; required before differentiation claims are credible |
| `explorer-add-comparison-charts` | Phase 1.5 | Medium | Normalized speedup + diverging bar charts; closes the most visible CLI parity gap |
| | Phase 2 | Medium | Warn/block when comparing results that differ on execution mode or tuning; depends on pipeline extension |
Primary Surface Revision (added 2026-04-15)¶
These TODOs implement the ClickBench-style matrix leaderboard pivot documented in Primary Surface Revision. They are sequenced: artifact first, then matrix UI + ranking, then trust/URLs, then home-page meta-leaderboard and a11y.
| TODO | Phase | Priority | Rationale |
|---|---|---|---|
| | Phase 1.6 | High | New derived artifact keyed on (benchmark, scale, phase) - foundation for the matrix leaderboard |
| | Phase 1.6 | High | Replaces BenchmarkIndex sorted list with ClickBench-style Platform × Query matrix; log-ratio coloring; row-checkbox → Compare |
| `explorer-align-ranking-metric-with-tpc-standards` | Phase 1.6 | High | Per-family registry so TPC-H/TPC-DS rank by power_score; geomean fallback elsewhere |
| | Phase 1.6 | Medium-High | Trust tiers as visible badges (not a default hide filter); BenchBox's differentiator made explicit |
| | Phase 1.6 | Medium-High | Short hash IDs avoid URL-length limits; backward-compatible |
| | Phase 1.6 | Medium | Home page hero: cross-benchmark rank aggregation - leans into multi-benchmark differentiator |
| | Phase 1.6 | Medium | aria-labels, keyboard nav, reduced-color mode, axe-core in CI |
DuckDB-Only Browser Metric Contract (2026-04-18)¶
Added as the W1 research gate for TODO explorer-canonical-browser-duckdb-read-model.
This section is the schema decision record, direct ingest contract, and cutover plan
that must be locked before any code changes in W2-W7.
Definition of Terms¶
- Canonical Python reference computation - the pipeline's own Python implementation inside `benchbox/core/explorer_pipeline/` plus read-only imports from `benchbox/core/results/`. TypeScript code, pre-calculated CLI JSON artifacts, and bridge artifacts are never a reference source.
- Final-value parity - the rendered user-visible value produced by the DuckDB-backed read path equals the canonical Python reference computation for the same inputs, within the per-metric tolerance declared in `_project/planning/visible_metrics.yaml`. Parity is NOT defined against whatever the CLI, a pre-calculated JSON artifact, or the current TS reduction libraries happen to emit today.
- Source-fidelity parity - raw fields copied into DuckDB equal the corresponding field in the committed JSON bundle verbatim.
- Bridge artifact - a metric-bearing JSON file emitted by the pipeline during the W2-W4 transition window so pages not yet migrated to DuckDB can still render. Bridge artifacts are scaffolding only, regenerated from the same pipeline pass that writes DuckDB, never hand-patched, and always subordinate to DuckDB when the two disagree.
Call-Site Inventory¶
Every user-visible metric read, current data source, and DuckDB target:
| Surface | Current source | Metrics consumed | Target DuckDB table/view |
|---|---|---|---|
| | | total_results, platform list, benchmark list, power_score, geomean_ms, run_date (Recent Results) | |
| | | rank, metric_value, speedup_vs_best, avg_rank, n_cohorts | |
| | | power_score, geomean_ms, run_date, trust_label, tuning_mode | |
| | | scale_factor list, phase list, display_geomean_ms (list view) | |
| | | timings (display_ms per query/platform), power_score, display_geomean_ms, is_ranking_eligible, percentile_stats, compliance_class | |
| | | short_id → result_id mapping | |
| | | display_geomean_ms, power_score, speedup_vs_slowest_per_row, fastest_ms, display_ms per query | DuckDB query over |
| | | same as above | DuckDB query (fallback path deleted once W4 lands) |
| | | power_score, geomean_ms, display_geomean_ms, total_duration_s, display_timings (display_ms + sample_count), queries (raw timings), environment | |
| | | schema column names + types (column picker) | DuckDB introspection ( |
| | | all | |
Bootstrap JSON surviving after W7 (non-metric only):
If manifest.json survives, its top-level keys must be a strict subset of
{routes, navigation, build_meta, schema_version}. Any metric-looking key
(benchmark_summary, result_count, power_score, geomean_ms, etc.) fails G-8.
The preferred outcome is to delete manifest.json entirely in the final W4 slice
and derive all navigation from DuckDB.
results_schema.json is deleted in W3 (replaced by DuckDB introspection).
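A sketch of the G-8 check on any surviving bootstrap manifest.json, using the allowed-key set from the paragraph above (the script location and failure behavior are illustrative):

```python
import json

ALLOWED_BOOTSTRAP_KEYS = {"routes", "navigation", "build_meta", "schema_version"}

def bootstrap_violations(path: str) -> list[str]:
    """G-8: any top-level key outside the allowed set is treated as metric leakage."""
    with open(path) as f:
        manifest = json.load(f)
    return sorted(set(manifest) - ALLOWED_BOOTSTRAP_KEYS)

violations = bootstrap_violations("results-explorer/public/data/manifest.json")
if violations:
    raise SystemExit(f"G-8 violation: metric-bearing keys in bootstrap JSON: {violations}")
```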
Ten Canonical DuckDB Tables and Views¶
All ten surfaces must exist with stable schemas before W3 (frontend migration) begins.
The frozen DDL is in docs/development/browser-duckdb-schema.sql.
| # | Name | Kind | Serves |
|---|---|---|---|
| 1 | | Base table | Home, PlatformIndex, BenchmarkIndex list, Query workbench |
| 2 | | Base table | ResultDetail, BenchmarkIndex matrix, Compare |
| 3 | | Base table | ResultDetail individual samples |
| 4 | | G-11 view | ResultDetail (projects results + environment tables) |
| 5 | | Base table | BenchmarkIndex matrix heatmap |
| 6 | | Base table | BenchmarkIndex matrix/ranks views |
| 7 | | G-11 view | PlatformIndex (projects |
| 8 | | Base table | Home meta-leaderboard (cohort + per-platform ranks) |
| 9 | | Base table | Home meta-leaderboard (cross-cohort platform summary) |
| 10 | | Base table | Compare URL resolution |
Supporting base tables (not canonical surfaces, required by views):
result_environment- OS, arch, CPU, memory, Python per resultresult_phase_durations- phase timing breakdown per result
G-11 compliance: Views project only bare column references, bare aliased columns,
CAST, and COALESCE(col, literal). No arithmetic, aggregation, CASE, window functions,
or non-whitelisted function calls in view projections.
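A sketch of a G-11-compliant view under that rule, using placeholder tables and columns rather than the frozen DDL in docs/development/browser-duckdb-schema.sql:

```python
import duckdb

con = duckdb.connect()  # in-memory, for illustration only

# Placeholder base tables standing in for the frozen schema.
con.execute("CREATE TABLE results (id VARCHAR, benchmark VARCHAR, platform VARCHAR, scale_factor DOUBLE)")
con.execute("CREATE TABLE result_environment (result_id VARCHAR, cpu_count INTEGER, os VARCHAR, arch VARCHAR)")

# G-11 compliant: the projection contains only bare columns, CAST, and COALESCE(col, literal).
con.execute("""
    CREATE VIEW result_detail_view AS
    SELECT
        r.id,
        r.benchmark,
        r.platform,
        CAST(r.scale_factor AS DOUBLE) AS scale_factor,
        COALESCE(e.cpu_count, 0) AS cpu_count,
        e.os,
        e.arch
    FROM results r
    LEFT JOIN result_environment e ON e.result_id = r.id
""")

# Arithmetic, CASE, window functions, or other function calls in the projection would
# violate G-11; such values are computed in Python and persisted as base-table columns.
```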
Metric Parity Registry¶
The complete metric registry is at _project/planning/visible_metrics.yaml. It classifies
every user-visible metric as raw_copy (source-fidelity contract) or derived
(final-value parity contract), with canonical_ref and tolerance per metric.
Derived metrics with existing Python reference computation:
| Metric | Python reference |
|---|---|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
Derived metrics currently TS-only - must be ported to Python in W2:
| TS source | Function | Target in W2 |
|---|---|---|
| | Per-query rank across platforms in a cohort | Port into |
| | | Not persisted; computed as SQL window function |
| | | Already pre-computed in |
| | Heatmap color from timing ratio | Presentation-only (CSS hue/lightness from |
| | Benchmark → primary metric enum | Already in |
| | Compliance class → display string | Presentation-only formatting. Stays TS-side. |
Direct Ingest Contract¶
committed JSON bundles (benchmarks/*.json)
→ pipeline validate (SchemaV2Validator)
→ pipeline transform (BundleTransformer)
→ compute derived values in Python (display_ms, display_geomean_ms, short_id,
is_ranking_eligible, compliance_class, matrix cells, rankings, meta-leaderboard)
→ bulk-insert all ten canonical tables in one DuckDB transaction
→ copy raw bundles to public/data/bundles/ (download-only affordance)
→ [W2-W4 bridge only] regenerate metric-bearing JSON from the same pipeline pass
Rule: JSON is an immutable source input and a download-only output. It is never a steady-state metric read source. If a bridge artifact and DuckDB disagree, DuckDB is canonical and the bridge artifact is regenerated from the same pipeline pass, never hand-patched.
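A sketch of the single-transaction bulk-insert step in this contract, assuming the ten canonical tables already exist per the frozen DDL and that all derived values were computed in Python beforehand (the helper and the `tables` mapping are illustrative):

```python
import duckdb

def bulk_insert_canonical_tables(db_path: str, tables: dict[str, list[tuple]]) -> None:
    """Insert every canonical table in one transaction so a failed build
    never publishes a partially populated database."""
    con = duckdb.connect(db_path)
    try:
        con.execute("BEGIN TRANSACTION")
        for name, rows in tables.items():
            if not rows:
                continue
            placeholders = ", ".join("?" for _ in rows[0])
            con.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)
        con.execute("COMMIT")
    except Exception:
        con.execute("ROLLBACK")
        raise
    finally:
        con.close()
```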
Cutover and Deletion Order¶
| Phase | Pipeline side | Frontend side | Artifacts deleted |
|---|---|---|---|
| W2 | Add ten canonical tables; all existing JSON emitters continue as bridge artifacts | No frontend changes | None |
| W3 | No pipeline changes | Add | |
| W4 Home slice | Delete | Home migrates to DuckDB | |
| W4 PlatformIndex slice | - | PlatformIndex migrates to DuckDB | (reads |
| W4 BenchmarkIndex slice | Delete | BenchmarkIndex migrates to DuckDB | |
| W4 ResultDetail slice | Delete | ResultDetail migrates to DuckDB | |
| W4 Compare + final slice | Delete | Compare migrates to DuckDB; | |
| W7 | Final sweep of any remaining legacy emitters | - | |
`db.ts` hardening (deferred from W3 to W4 final slice):

- Reject `getDb()` on attach failure instead of `console.warn` + silent continue
- Delete the soft JSON fallback at `db.ts:86-89`
- Delete the stale `db.ts:1-17` "Phase 1/2 scaffold" header comment
- Wire per-page `ErrorMessage` UI (G-7 requirement)
Research Gate Findings (RG-1 through RG-9)¶
| Gate | Status | Finding |
|---|---|---|
| RG-1 Cold-start init budget (p50 <1500ms, p95 <3500ms) | TBD | Requires browser measurement against deployed corpus. Measure after W2 builds the full ten-table DB. Accepted risk: proceed to W2; gate must be green before W3 ships to users. |
| RG-2 HTTP range-reads (≤10% total file for single-row lookup) | HIGH CONFIDENCE PASS | |
| RG-3 DB size budget (sub-linear growth, ≤20 MB at 100x) | TBD | Requires synthetic corpus build at 1x, 5x, 20x, 100x. The full ten-table schema (especially |
| RG-4 COEP/COOP hosting | HIGH CONFIDENCE PASS | |
| RG-5 Single DB vs partitioned | N/A unless RG-3 fails | Default: single |
| RG-6 View materialization budget (<250ms p95 cold at 100x) | TBD | Benchmark canonical surface queries against 100x synthetic corpus after W2. Views that fail promote to materialized |
| RG-7 Attach-failure UX (per-page ErrorMessage, no silent fallback) | PLANNED FOR W4 | The |
| RG-8 toJSON fidelity (no precision loss on DOUBLE/BIGINT/DECIMAL) | RISK: MEDIUM | DuckDB-WASM's Arrow→JSON path has known BIGINT→BigInt issues (JSON.stringify throws on BigInt). |
| RG-9 Read-only fuzz (adversarial SQL harmlessly fails) | HIGH CONFIDENCE PASS | DuckDB's |
Research gate exit criterion met when: RG-1 and RG-3 are measured (after W2 DB build),
all others are GREEN or have an accepted-risk note above, and the frozen schema in
docs/development/browser-duckdb-schema.sql is committed alongside this decision record.
External Sources¶
Geekbench editions: https://www.geekbench.com/editions/
Geekbench Browser: https://browser.geekbench.com/
CloudSpecs site: https://cloudspecs.fyi/
CloudSpecs repository: https://github.com/TUM-DIS/cloudspecs
OpenBenchmarking: https://openbenchmarking.org/
airspeed velocity: https://asv.readthedocs.io/en/latest/using.html
js-framework-benchmark (PR-based contribution model): https://github.com/krausest/js-framework-benchmark
ClickBench contribution model: https://github.com/ClickHouse/ClickBench