BenchBox Results Platform Product + Architecture Strategy¶
Created: 2026-03-29
Revised: 2026-04-15 (see Primary Surface Revision)
Originating TODO: productize-result-publishing-and-artifact-sharing
Executive Summary¶
BenchBox should not treat local artifact publication, hosted result submission, and public result analysis as one feature. They are adjacent, but they have different product contracts, trust models, and operational requirements.
Recommended split:
- `benchbox publish` - Local/cloud artifact publication. Publishes canonical result bundles to local or cloud storage (S3, R2, etc.). Does not interact with the hosted results platform or public corpus.
- `benchbox submit` - Public corpus contribution. Packages a canonical bundle with a submission manifest and either prepares a PR against `results-data/` (Phase 2) or uploads directly to the hosted API (Phase 3).
- `benchbox.dev/results/` - A static-first public explorer for browsing, comparing, and analyzing curated public results.
Phase 1 Status: Launched (2026-04-04)¶
Phase 1 launched on 2026-04-04. The launch corpus had 6 maintainer-run bundles
across 2 cohorts (TPC-H SF 0.01 and SSB/star_schema SF 0.01), each with ≥3
platforms. As of 2026-04-12, the repository corpus has expanded to 12 bundles
across 4 cohorts by adding SF 0.1 for both benchmark families. All launch
criteria from the checklist below were met.
Revised Launch Phases¶
| Phase | Goal | Write Path | Infrastructure | Priority |
|---|---|---|---|---|
| 1: Static Explorer MVP | Curated seed corpus + read-only explorer at `benchbox.dev/results/` | Maintainer-only: CI-generated results committed under `results-data/` | Static only: GitHub Pages, no API, no auth, no hosted services | Ship first |
| 2: Community Contributions | Community-submitted results via PR-based workflow | PRs against `results-data/` | Still static: GitHub Actions validates + merges + rebuilds; extract to a dedicated data repo only if churn justifies it | Ship when Phase 1 UX is proven |
| 3: Hosted Platform | Self-service submission API, org/team spaces, richer features | Hosted API + object storage + async ingest | API server, metadata DB, auth, rate limiting, moderation | Only if demand warrants the operational burden |
Key Architecture Decision¶
Phase 1 and Phase 2 require zero backend services. The entire read path is
static (derived JSON manifests + DuckDB/Parquet snapshots served via GitHub
Pages). The write path is git + CI/CD inside this repository
(results-data/ → transform → build → deploy).
A hosted API (Phase 3) is explicitly deferred. The “submit via PR” model used by many successful open-source benchmark databases (e.g., js-framework-benchmark, ClickBench contribution model) proves that community contributions scale well without a custom API until volume demands one.
Product Intent and Positioning¶
Core Hypothesis¶
“People want to browse and compare public benchmark results across platforms.” This is the primary value proposition that Phase 1 must validate.
Target Audience¶
The broader data/analytics community - not just existing BenchBox users. The explorer is a credibility and marketing play: transparent, reproducible, multi-benchmark results that visitors can explore and compare themselves.
Differentiator vs ClickBench¶
ClickBench covers a single workload in a single format. BenchBox’s explorer differentiates on three axes:
- Multi-benchmark coverage - TPC-H, TPC-DS, SSB, and future benchmarks in one place, not siloed sites
- Rich per-query detail - execution plans, tuning configurations, validation status, companion files
- Reproducibility - any published result can be re-run with `benchbox run` using the same parameters

ClickBench was considered and rejected because its format cannot capture what BenchBox measures (axes 2 and 3 above).
Note (2026-04-14): Axes 2 and 3 are currently aspirational. The Phase 1 explorer exposes a narrow read model (11 fields; see Fidelity Gaps section below) that does not yet surface tuning config, execution plans, validation status, execution mode, or cost data. These must be realized before the differentiation claim is credible to a visitor who clicks through to a result detail page. The fidelity gap TODOs (
`explorer-extend-manifest-and-pipeline`, `explorer-add-tuning-config-visibility`, `explorer-add-methodology-disclosure`) address this.
Explorer as Dynamic Tool¶
Superseded 2026-04-15. The “dynamic comparison tool, not a curated leaderboard” framing was correct as a rejection of vanity-ranking pages but wrong as a rejection of the matrix leaderboard pattern. See Primary Surface Revision. The revised position: the per-benchmark primary surface is a ClickBench-style Platform × Query matrix leaderboard, with the dynamic Compare page as a secondary deeper-analysis surface. Both ship; the matrix is the landing view.
The explorer is a dynamic comparison tool, not a static shootout page. Visitors pick benchmark, scale factor, and platforms to build their own comparisons. This is the core UX - not a curated leaderboard.
Corpus Size and DuckDB-WASM Justification¶
The benchmark × platform × scale matrix is large even at launch. Phase 1 should cover most supported benchmarks and platforms at limited scale factors. DuckDB-WASM is justified because:
- The corpus needs to be large enough for a dynamic tool to be useful
- It will grow quickly as new platforms and benchmarks are added
- It demonstrates BenchBox's DuckDB expertise (audience alignment)
The primary corpus criterion is depth per benchmark: each included benchmark must have ≥3 comparable platforms at the same scale factor. The ≥30 total bundle count is a secondary estimate that falls out of the coverage matrix, not an independent target. Depth is the right primary target because the core deliverable is a comparison tool - a cohort needs multiple platforms at the same scale to be meaningful, regardless of total bundle count.
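A minimal sketch of how the depth criterion could be checked mechanically against the derived manifest (assumes the `manifest.json` shape defined under Derived Read Model Schema; the helper name and output shape are illustrative):

```python
import json
from collections import defaultdict

def check_cohort_depth(manifest_path: str, min_platforms: int = 3) -> dict:
    """Group manifest results into (benchmark, scale_factor) cohorts and report
    which cohorts meet the >=3 comparable-platforms criterion."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    cohorts = defaultdict(set)
    for result in manifest["results"]:
        cohorts[(result["benchmark"], result["scale_factor"])].add(result["platform"])

    return {
        cohort: {"platforms": sorted(platforms), "meets_depth": len(platforms) >= min_platforms}
        for cohort, platforms in cohorts.items()
    }

# Example: flag cohorts that are not yet deep enough to ship.
report = check_cohort_depth("results-explorer/public/data/manifest.json")
shallow = [cohort for cohort, info in report.items() if not info["meets_depth"]]
print("cohorts below depth threshold:", shallow)
```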
Visual Identity and Build Pipeline¶
The explorer is a standalone Vite app with its own build pipeline, but shares the benchbox.dev visual identity (header, nav, styling). It is not embedded in the docs or blog - it is a distinct feature of the site.
SEO¶
Search engine discoverability is nice-to-have but not a launch blocker. Users will primarily arrive through benchbox.dev directly. Pre-rendering can be added post-launch if organic traffic becomes a goal.
Timeline¶
No hard deadline or external event. Quality over speed.
Brand Decision¶
Resolved 2026-04-03: The results explorer lives at benchbox.dev/results/ under
the BenchBox brand. See
results-explorer-brand-ownership.md for
the full decision rationale.
Summary: Oxbow Research has no independent web presence, domain, or active product
identity - it was an earlier brand concept that was deliberately separated from
BenchBox in git history. The explorer’s core value proposition (reproducible results,
benchbox run re-execution) is inseparable from BenchBox, making BenchBox placement
the coherent choice. The working default is confirmed; no changes to existing
scaffolding or CI/CD are needed.
Primary Surface Revision (2026-04-15)¶
Added following a strategic review of the explorer vs ClickBench paradigm and an adversarial review of the initial proposal.
What changed and why¶
The 2026-04-14 fidelity work identified that the explorer dropped too many bundle fields in its derived read model. The 2026-04-15 review went one level up: the UX paradigm itself is wrong. The current explorer is browse-first (land on home → navigate to benchmark → get a sorted list → maybe click Compare). ClickBench is compare-first: the landing view for a benchmark is already the cross-platform answer.
That is not a cosmetic difference. It is the product definition. The revised product position:
The per-benchmark landing view is a Platform × Query matrix leaderboard (ClickBench-style). The dynamic Compare page remains as a secondary surface for deeper per-query analysis. The detail page remains as the reproducibility-story surface. All three ship.
Why the original “matrix only, dynamic only” framings were both wrong¶
The original strategy doc rejected a “curated leaderboard” page. That rejection still stands - there is no vanity-ranking page in the product. But a matrix leaderboard is not a vanity ranking: it is a cohort-aware, per-query, per-column-normalized view where ranking emerges from the data rather than from editorial weighting. It is the comparison tool, rendered densely, rather than a competitor to it.
The initial redesign proposal (see the companion review) was correct in its main recommendation but wrong in several specifics. The final adopted shape incorporates the following corrections:
| Area | Initial proposal | Final adopted position |
|---|---|---|
| Artifact key | | Derived artifact keyed on `(benchmark, scale_factor, phase)` |
| Color scale | linear min-max | log10 ratio-to-fastest, clamped at 10× |
| Primary ranking metric | `geomean_ms` default | per-family registry: `power_score` for TPC-H/TPC-DS, `geomean_ms` elsewhere |
| Trust default | filter to | show all tiers with visible badges; trust is an optional filter |
| Compare URL | full result_ids in query string | short hash IDs with full-form fallback |
| Home page | recent results list | cross-benchmark meta-leaderboard (avg-rank aggregation) + recent results secondary |
| Accessibility | unspecified | aria-labels on every data cell, keyboard nav, reduced-color mode, axe-core in CI |
Revised derived read model¶
In addition to the artifacts already specified in Derived Read Model Schema, the pipeline emits:
- `benchmarks/{benchmark}_sf{sf}_{phase}.json.gz` - one file per unique `(benchmark, scale_factor, phase)` tuple, containing the full cross-platform query matrix plus per-platform `power_score`, `geomean_ms`, `cost_usd`, `trust_label`, `tuning_mode`, and `short_id`. Gzip compressed on write.
- `short_ids.json` - lookup table `{short_id: full_result_id}` to allow compact Compare URLs without breaking existing long-form URLs.
- `meta_leaderboard.json` - cross-benchmark rank aggregation used by the home page. Platforms ranked within each cohort (cohorts with ≥2 platforms only), then aggregated as simple mean of ranks across appearances. No weighted composite score.
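A sketch of the `meta_leaderboard.json` aggregation described above - rank within each cohort, then simple mean of ranks across appearances. The function signature and input shape are illustrative; the real pipeline code may differ:

```python
from collections import defaultdict
from statistics import mean

def build_meta_leaderboard(cohort_metrics: dict[tuple, dict[str, float]]) -> list[dict]:
    """cohort_metrics maps (benchmark, scale_factor, phase) -> {platform: value},
    where value is oriented so that lower is better (e.g. geomean_ms, or a
    negated power_score). Cohorts with fewer than two platforms are skipped."""
    ranks = defaultdict(list)
    for metrics in cohort_metrics.values():
        if len(metrics) < 2:  # a single-platform cohort carries no ranking signal
            continue
        for rank, platform in enumerate(sorted(metrics, key=metrics.get), start=1):
            ranks[platform].append(rank)

    rows = [
        {"platform": p, "avg_rank": round(mean(r), 2), "n_cohorts": len(r)}
        for p, r in ranks.items()
    ]
    return sorted(rows, key=lambda row: row["avg_rank"])  # serialized as meta_leaderboard.json
```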
Revised UX surfaces¶
| Surface | Revised role |
|---|---|
| Home (`/results/`) | Cross-benchmark meta-leaderboard is the hero panel. Recent results list drops to secondary position. |
| Benchmark landing | Matrix leaderboard as the default view. List view available via toggle for mobile. |
| Compare (`/results/compare`) | Unchanged in role; accepts short IDs. Reached from matrix-row checkboxes. |
| Detail | Unchanged. Surface for methodology disclosure and reproducibility. |
What explicitly does not change¶
No branded leaderboard page remains the rule. The matrix leaderboard is cohort-aware (single scale, single phase, same benchmark); it is not a cross-context vanity ranking.
All three differentiators remain (multi-benchmark coverage, rich per-query detail, reproducibility). The matrix leaderboard strengthens differentiator #1 by making the multi-benchmark story visible on the home page.
Cohort-breaking guardrails (benchmark, scale, tuning mode, execution mode, phase) remain enforced. The matrix view exposes all five as explicit axes in the filter bar instead of hiding them.
TODO cluster¶
Addressed by the explorer- prefixed TODO cluster added 2026-04-15. See TODO
Cluster and Priority for the full table.
Explorer Fidelity: Known Gaps and Requirements¶
Added 2026-04-14 following a post-launch audit of explorer vs CLI divergence.
Canonical Duration Metric¶
The explorer currently uses total_duration_s (wall-clock sum from the bundle’s
run.total_duration_ms). The CLI uses geometric mean of per-query times as
the primary comparison metric for OLAP workloads. These produce different numbers
for the same result, making cross-referencing between CLI output and the explorer
confusing and misleading for results with different query counts.
Decision: The explorer’s primary comparison metric must be geometric mean
of per-query execution times (geomean_ms), computed at pipeline build time
and stored in both the manifest and DuckDB schema. Wall-clock total duration may
be surfaced as a secondary metric with a clear label, but it must not be the
default sort/compare axis. This aligns with CLI behavior and with standard OLAP
benchmarking practice.
Refinement (2026-04-15): The geomean-is-canonical decision applies where
the benchmark family has no published aggregate metric (ClickBench, SSB,
custom benchmarks). For TPC-H and TPC-DS, the canonical metric is
power_score as defined by the TPC specification - users arriving from
published TPC results expect to see that number. The explorer pipeline
exposes a per-family RANKING_METRIC_BY_FAMILY registry
(benchbox/core/explorer_pipeline/models.py) that selects the primary
metric per benchmark family and serializes the choice into each
BenchmarkSummary artifact’s ranking field - no ranking logic lives in
TypeScript. geomean_ms / display_geomean_ms remains as the universal
secondary metric shown alongside. Implemented in explorer-align-ranking-metric-with-tpc-standards.
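A minimal sketch of both decisions - geometric mean computed at build time and per-family primary-metric selection. The registry contents mirror the text above, but the exact values and module layout of the real `RANKING_METRIC_BY_FAMILY` in `benchbox/core/explorer_pipeline/models.py` may differ:

```python
import math

# Assumed shape of the per-family registry; illustrative, not the shipped constant.
RANKING_METRIC_BY_FAMILY = {
    "tpch": "power_score",   # TPC-defined aggregate; what readers of published TPC results expect
    "tpcds": "power_score",
    "ssb": "geomean_ms",     # no published aggregate metric -> geometric mean is canonical
    "clickbench": "geomean_ms",
}

def geomean_ms(query_times_ms: list[float]) -> float:
    """Geometric mean of per-query execution times, computed at pipeline build time."""
    if not query_times_ms or any(t <= 0 for t in query_times_ms):
        raise ValueError("geomean requires positive per-query timings")
    return math.exp(sum(math.log(t) for t in query_times_ms) / len(query_times_ms))

def primary_metric(benchmark_family: str) -> str:
    """Select the primary ranking metric for a family; geomean is the universal fallback."""
    return RANKING_METRIC_BY_FAMILY.get(benchmark_family, "geomean_ms")
```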
Comparability Model¶
Two results are comparability-breaking on the following dimensions - the compare view must warn or hard-block when they differ:
| Dimension | Breaking? | Action |
|---|---|---|
| Benchmark family | Yes (hard-block) | Already enforced |
| Scale factor | Yes (hard-block) | Already enforced |
| Execution mode (SQL vs DataFrame) | Yes (warn) | Not yet surfaced |
| Tuning mode (tuned vs notuning) | Yes (warn) | Not yet surfaced |
| Query subset vs full benchmark | Yes (warn) | Not yet surfaced |
| Platform version | No (label) | Partially surfaced |
The explorer cannot enforce the tuning and execution-mode warnings until those
fields reach the manifest. See explorer-extend-manifest-and-pipeline.
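A sketch of how the compare view could evaluate these rules once the fields land in the manifest (field names follow the target manifest schema; the function itself is illustrative logic, not the shipped TypeScript):

```python
HARD_BLOCK_FIELDS = ("benchmark", "scale_factor")
WARN_FIELDS = ("execution_mode", "tuning_mode")

def comparability(a: dict, b: dict) -> tuple[bool, list[str]]:
    """Return (allowed, warnings): hard-block dimensions refuse the comparison,
    warn dimensions render it with a visible caveat."""
    for field in HARD_BLOCK_FIELDS:
        if a.get(field) != b.get(field):
            return False, [f"incompatible {field}: {a.get(field)!r} vs {b.get(field)!r}"]
    warnings = [
        f"{field} differs: {a.get(field)!r} vs {b.get(field)!r}"
        for field in WARN_FIELDS
        if a.get(field) != b.get(field)
    ]
    if a.get("query_count") != b.get("query_count"):
        warnings.append("query subset differs (query_count mismatch)")
    return True, warnings
```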
Pipeline Fidelity Gap¶
The schema-v2 bundle contains 13 blocks and 50+ fields. The Phase 1 pipeline transformer emits 11 manifest fields, silently dropping:
| Category | Dropped Fields | Impact |
|---|---|---|
| Execution metadata | | Can't distinguish modes |
| Tuning config | | No tuning UI, can't cross-compare |
| Platform identity | | Only driver version surfaced |
| Run config | | No config disclosure |
| Phase timing | All | Can't see load vs query time |
| Cost | | No cost analysis |
| Validation | | No validation status visible |
| Test type | Power vs throughput distinction | Can't label test type |
| Query granularity | | Can't identify warmup vs measurement |
The DuckDB results table currently has 11 columns. It should grow to ~20 to
enable meaningful client-side filtering. See explorer-extend-manifest-and-pipeline.
Methodology Transparency¶
A visitor to a result detail page currently cannot determine:
- Whether `total_duration_s` is a sum, mean, or median
- Whether the run used tuning (a configuration that can significantly affect results)
- Whether warmup queries are included in the timing
- What hardware the result was produced on beyond OS/arch/CPU count
Every result detail page must include a methodology disclosure panel that
states the aggregation method, tuning state, execution mode, and key environment
parameters. See explorer-add-methodology-disclosure.
Chart Parity Targets¶
The CLI ships 15 chart types; the explorer ships 2 inline SVG components. Full parity is not required, but the following gap is strategically significant for the “dynamic comparison tool” claim:
| Missing chart | Value | Priority |
|---|---|---|
| Normalized speedup (log scale, baseline-relative) | Primary comparison metric for large perf deltas | High |
| Diverging bar (per-query regression/improvement) | Makes "which queries got slower?" obvious | High |
| Phase breakdown (stacked load vs query) | Distinguishes load bottleneck from query bottleneck | Medium |
| Percentile ladder (P50/P90/P95/P99) | Identifies tail-latency outliers | Medium |
| Cost scatter | "Cost vs performance" is a primary analyst question | Low (Phase 2+) |
See explorer-add-comparison-charts.
Phase 1 MVP Definition¶
Primary Deliverable: Dynamic Comparison Tool¶
The core UX of Phase 1 is a dynamic comparison tool where visitors pick a benchmark, scale factor, and platforms, then see a side-by-side query-level timing breakdown. This is the feature that differentiates the explorer from a file listing, makes users come back and share links, and validates the core hypothesis (“people want to browse and compare public benchmark results”).
All other Phase 1 deliverables (home page, browse pages, detail pages) exist to support navigation into and out of comparisons. They are necessary but secondary.
What Ships¶
| Component | Role | Description | Done When |
|---|---|---|---|
| Compare view | Primary | Dynamic comparison tool: pick benchmark/scale/platforms, see side-by-side query-level timing breakdown with cohort validation | Compare works for any compatible results in the corpus; shareable URLs |
| Seed corpus | Data | Result bundles with depth per benchmark (many platforms at key scale factors) rather than just breadth | Bundles exported, validated, committed under `results-data/` |
| Static build pipeline | Data | Transforms canonical schema-v2 bundles into navigation manifest JSON, per-result detail JSON, and DuckDB database snapshot | Pipeline runs in CI, output deployed to GitHub Pages |
| Explorer home + browse | Navigation | Landing page with summary cards; benchmark and platform index pages with filterable result lists and "compare" checkboxes | Users can navigate to any result and select results for comparison |
| Result detail page | Supporting | Stable URL per result showing metadata, query timings, validation status, and raw bundle download | Detail page works for all seed corpus results |
| DuckDB-WASM filtering | Interaction | Client-side filtering by benchmark, platform, scale factor, date range | Filters work over the full corpus |
| GitHub Pages integration | Deployment | Explorer builds and deploys alongside existing landing + docs + blog | Single |
What Does NOT Ship in Phase 1¶
- No user accounts, authentication, or authorization
- No hosted submission API or `benchbox submit` command
- No anonymous or community uploads
- No branded "leaderboard" page, but cohort views may be sorted by total duration or geometric mean - this is a sorted table within a validated cohort, not a cross-context ranking claim
- No organization accounts or private workspaces
- No moderation, trust labels, or abuse controls (not needed - corpus is maintainer-curated)
Launch Criteria¶
Seed corpus has sufficient depth for the dynamic comparison tool to show meaningful cross-platform comparisons: each included benchmark has ≥3 comparable platforms at the same scale factor
Compare view is the primary entry point and works end-to-end: select benchmark/scale/platforms → see query-level timing breakdown → share URL
All explorer pages render correctly with real data
DuckDB-WASM filtering works in Chrome, Firefox, Safari
GitHub Pages deployment succeeds end-to-end from CI
Explorer is navigable from the existing benchbox.dev site header/nav
Brand ownership decision is resolved and reflected in domain/hosting/identity
Phase 2: Community Contributions (Deferred)¶
Model: Submit via Pull Request¶
Instead of building a hosted API, Phase 2 uses a PR-based contribution model:
1. Contributor runs `benchbox submit --output ./submission/`, which packages the canonical schema-v2 bundle with a submission manifest (contributor metadata, benchmark context, optional notes)
2. Contributor opens a PR against this repository touching `results-data/`
3. GitHub Actions CI validates: schema conformance, bundle integrity (hash check), cohort compatibility, and basic sanity checks (no absurd timings, valid platform) - see the sketch after this list
4. Maintainers review and merge
5. Merge triggers rebuild of derived read models + redeploy of explorer
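A sketch of the CI validation step (step 3 above) as a Python check. Schema conformance is assumed to be delegated to the existing validator and omitted here; the submission file layout and field names are assumptions for illustration:

```python
import hashlib
import json
from pathlib import Path

MAX_PLAUSIBLE_QUERY_MS = 24 * 3600 * 1000  # sanity ceiling: no single query over a day

def validate_submission(submission_dir: Path, known_platforms: set[str]) -> list[str]:
    """Return a list of human-readable validation errors (empty list = pass)."""
    errors = []
    bundle_path = submission_dir / "bundle.json"          # assumed layout
    manifest_path = submission_dir / "submission.json"    # assumed layout
    bundle = json.loads(bundle_path.read_text())
    manifest = json.loads(manifest_path.read_text())

    # Bundle integrity: the submission manifest records the bundle hash at packaging time.
    digest = hashlib.sha256(bundle_path.read_bytes()).hexdigest()
    if manifest.get("bundle_sha256") != digest:
        errors.append("bundle hash does not match submission manifest")

    # Basic sanity checks.
    if bundle.get("platform") not in known_platforms:
        errors.append(f"unknown platform: {bundle.get('platform')!r}")
    for q in bundle.get("queries", []):
        if not 0 < q.get("ms", -1) < MAX_PLAUSIBLE_QUERY_MS:
            errors.append(f"implausible timing for {q.get('id')}: {q.get('ms')} ms")
    return errors
```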
A dedicated benchbox-results repository remains an optional future extraction
if corpus size or contribution volume starts to overwhelm the main repo. It is
not a Phase 1 or early Phase 2 requirement.
Why PR-Based Before API-Based¶
| Concern | PR model | API model |
|---|---|---|
| Auth | GitHub identity (free) | Custom auth system (build + operate) |
| Moderation | PR review (familiar, auditable) | Custom moderation UI (build + operate) |
| Abuse prevention | PR rate = human rate | Rate limiting, quotas, captchas |
| Trust labels | Commit author = attribution | Custom trust promotion workflow |
| Operational cost | Zero (GitHub Actions) | API server + DB + storage + monitoring |
| Scalability ceiling | ~100s of submissions/month | Thousands+/month |
The PR model is sufficient until submission volume exceeds what maintainer review can handle. That threshold is unlikely in the first year of a results platform.
What Ships in Phase 2¶
- `benchbox submit` command to package and submit results to the public corpus
- Submission manifest schema (contributor, context, notes)
- CI validation workflow for the data repository
- Trust labels in explorer: "maintainer" vs "community-submitted"
- Contributor guidelines documentation
Phase 3: Hosted Platform (Deferred)¶
Phase 3 is explicitly contingent on Phase 2 reaching a scale where the PR model becomes a bottleneck. Indicators that Phase 3 is needed:
Submission volume exceeds ~50/month sustained
Maintainer review becomes a sustained bottleneck
Users need private/unlisted results (not possible in a public repo)
Organization/team features are requested by paying or strategic users
What Would Ship in Phase 3¶
- Hosted submission API at `api.benchbox.dev`
- Authentication (API keys, OAuth)
- Private and unlisted visibility states
- Automated trust promotion workflow
- Rate limiting, quotas, abuse controls
- Organization/team spaces
- Richer APIs and embedded widgets
The planned Phase 3 operating model is documented in
../operations/results-phase-3-runbook.md.
Cost and Operational Complexity¶
A hosted platform requires:
| Component | Estimated Cost | Operational Burden |
|---|---|---|
| API server (e.g., Fly.io, Railway) | $20-100/mo | Deployment, monitoring, on-call |
| Metadata database (Postgres) | $15-50/mo | Backups, migrations, scaling |
| Object storage (S3/R2) | $5-20/mo | Lifecycle policies, access control |
| Auth provider (Auth0/Clerk) | $0-25/mo | Token management, session handling |
| Monitoring (Sentry, metrics) | $0-30/mo | Alert triage, incident response |
Total: $40-225/month + significant engineering time. This is only justified if the results platform becomes a core product with sustained community usage.
Current BenchBox Constraints¶
| Constraint | Evidence | Strategy implication |
|---|---|---|
| Canonical results already exist as schema-v2 bundles with companion files | | All downstream paths ingest the real exported bundle, not a second payload |
| Public site is currently static GitHub Pages assembled from landing + docs + blog | | The public explorer must be a static subsite - no server dependency |
| BenchBox already hints at a hosted service contract | | CLI public submission ( |
| Existing publishing prototype is process-local | | The prototype is not the hosted service architecture; it is only a source of reusable concepts |
Reference Matrix¶
| Reference | Strong pattern | What BenchBox should copy | What BenchBox should not copy |
|---|---|---|---|
| Geekbench | Stable public result pages, comparison flows, account-linked online result management, offline vs online distinction | Stable result detail pages, obvious compare actions, explicit separation between local results and hosted results | A closed scoring model or consumer-device-centric assumptions |
| CloudSpecs | Static browser app, GitHub Pages hosting, browser-side DuckDB-WASM analysis over a curated dataset | Phase 1 reference architecture: static-first explorer, DuckDB-WASM for browser-side analytics, downloadable snapshots, reproducible analysis artifacts | No-write-path assumptions for the whole product |
| OpenBenchmarking | Centralized submission ecosystem, aggregate comparison, rich result metadata, public/private policy | Cohort-aware comparison, richer metadata, trust labels, aggregate analysis | Day-1 open public firehose without curation, moderation, or clear verification state |
| ASV | Results stored as files, publish to a static website, precomputed regression views | Derived read models published as static assets, regression/change-oriented views, offline-friendly read path | Limiting the product to codebase-over-time regressions only |
Product Boundary¶
BenchBox needs three explicit user contracts, but they do NOT all ship at once.
| Contract | Primary actor | Phase | Runtime boundary |
|---|---|---|---|
| Publish (local/cloud) | BenchBox user sharing files or mirroring artifacts | Independent (existing TODO) | CLI + local/cloud storage backend |
| Explore | Reader/analyst comparing public results | Phase 1 | Static subsite on GitHub Pages |
| Submit (PR) | Community contributor adding results | Phase 2 | |
| Submit (API) | Self-service submitter | Phase 3 | |
Technology Recommendations for Phase 1¶
Explorer Frontend Stack¶
| Choice | Recommendation | Rationale |
|---|---|---|
| Build tool | Vite | Fast, modern, used by CloudSpecs reference. Produces optimized static bundles. |
| Framework | Vanilla TypeScript + Preact (or no framework) | Minimizes bundle size for a content-heavy site. Preact if component model helps; plain TS if it stays simple. |
| Browser analytics | DuckDB-WASM | BenchBox already has deep DuckDB expertise. Enables SQL-powered filtering, comparison, and ad-hoc analysis in the browser. Proven by CloudSpecs. |
| Data format | Static JSON manifests + DuckDB database file | JSON for navigation/SEO/fast page loads. DuckDB database for browser-side analysis. |
| Routing | File-based with real paths | |
| Styling | Tailwind CSS or minimal custom CSS | Consistent with modern static sites. Light enough for GitHub Pages. |
Derived Read Model Schema¶
The static build pipeline transforms canonical schema-v2 bundles into:
- `manifest.json` - Global navigation index. Target schema (fields marked `*` are Phase 1 launched; unmarked fields are required additions per the fidelity gap work):

```jsonc
{
  "results": [
    {
      "id": "tpch-duckdb-sf1-20260315",          // *
      "benchmark": "tpch",                        // *
      "platform": "duckdb",                       // *
      "scale_factor": 1.0,                        // *
      "run_date": "2026-03-15",                   // *
      "total_duration_s": 1.234,                  // * (wall-clock; secondary metric)
      "geomean_ms": 56.2,                         // target: canonical comparison metric
      "query_count": 22,                          // *
      "trust_label": "maintainer-run",            // *
      "visibility": "public-curated",             // *
      "driver_version": "1.1.0",                  // *
      "platform_version": "1.1.0",                // target
      "execution_mode": "sql",                    // target: "sql" | "dataframe"
      "tuning_mode": "tuned",                     // target: "tuned" | "notuning" | "auto"
      "tuning_hash": "abc123",                    // target: stable hash for cross-compare grouping
      "test_type": "power",                       // target: "power" | "throughput"
      "validation_status": "passed",              // target: "passed" | "failed" | "skipped"
      "cost_usd": null,                           // target: null if unavailable
      "bundle_path": "bundles/tpch-duckdb-sf1-20260315.json"  // *
    }
  ],
  "benchmarks": ["tpch", "tpcds", "ssb"],
  "platforms": ["duckdb", "datafusion", "clickhouse", "polars-df"],
  "generated_at": "2026-03-29T00:00:00Z"
}
```
- Per-result detail JSON - Full query timings + metadata for result pages. Target detail record:

```json
{
  "id": "tpch-duckdb-sf1-20260315",
  "metadata": {
    "benchmark": "tpch",
    "platform": "duckdb",
    "environment": "...",
    "execution_mode": "sql",
    "tuning_mode": "tuned",
    "tuning_summary": "DuckDB default tuning profile",
    "platform_version": "1.1.0",
    "engine_version": null
  },
  "queries": [
    {"id": "Q1", "ms": 45.2, "rows": 4, "status": "passed", "run_type": "measurement", "iter": 1, "stream": 1}
  ],
  "summary": {
    "total_ms": 1234,
    "geomean_ms": 56.2,
    "passed": 22,
    "failed": 0,
    "validation_status": "passed"
  },
  "bundle_download": "bundles/tpch-duckdb-sf1-20260315.json"
}
```
- `results.duckdb` - DuckDB database for browser-side analysis:
  - `results` table: one row per result run - target ~20 columns (see manifest target above; all manifest fields should be filterable via DuckDB-WASM)
  - `queries` table: one row per query execution `(result_id, query_id, ms, rows, status, run_type, iter, stream)`
  - Enables queries such as:

```sql
SELECT q.*
FROM queries q
JOIN results r ON r.id = q.result_id
WHERE r.benchmark = 'tpch' AND r.platform = 'duckdb'
ORDER BY q.ms;
```

- `bundles/` - Raw canonical schema-v2 bundles for download
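A sketch of how the pipeline could emit `results.duckdb` with the duckdb Python package, following the two-table shape above; column lists are abbreviated and the function is illustrative, not the shipped transformer:

```python
import duckdb

def write_results_db(db_path: str, results: list[dict], queries: list[dict]) -> None:
    """Emit the browser-facing snapshot: one `results` row per run, one
    `queries` row per query execution (columns abbreviated for illustration)."""
    con = duckdb.connect(db_path)
    try:
        con.execute("""
            CREATE OR REPLACE TABLE results (
                id VARCHAR, benchmark VARCHAR, platform VARCHAR, scale_factor DOUBLE,
                run_date VARCHAR, geomean_ms DOUBLE, total_duration_s DOUBLE,
                trust_label VARCHAR, tuning_mode VARCHAR, execution_mode VARCHAR
            )
        """)
        con.execute("""
            CREATE OR REPLACE TABLE queries (
                result_id VARCHAR, query_id VARCHAR, ms DOUBLE, "rows" BIGINT,
                status VARCHAR, run_type VARCHAR, iter INTEGER, stream INTEGER
            )
        """)
        con.executemany(
            "INSERT INTO results VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
            [[r["id"], r["benchmark"], r["platform"], r["scale_factor"], r["run_date"],
              r.get("geomean_ms"), r.get("total_duration_s"), r.get("trust_label"),
              r.get("tuning_mode"), r.get("execution_mode")] for r in results],
        )
        con.executemany(
            "INSERT INTO queries VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            [[q["result_id"], q["query_id"], q["ms"], q["rows"], q["status"],
              q["run_type"], q["iter"], q["stream"]] for q in queries],
        )
    finally:
        con.close()
```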
Compare URL Format¶
Compare views use query parameters for flexibility:

- `/results/compare?ids=tpch-duckdb-sf1-20260315,tpch-datafusion-sf1-20260315`
- Cohort validation happens client-side: same benchmark + same scale factor required
Cohort mismatch is a hard-block, not a warning. The compare view refuses to render incompatible comparisons (different benchmark or different scale factor). Non-blocking differences (query subset, tuning mode) produce warnings but do not prevent rendering.
Site Integration¶
The explorer is built as a standalone Vite app in results-explorer/. The
static read-model pipeline writes build inputs to results-explorer/public/data/;
the Vite build emits static files to results-explorer/dist/; the existing
GitHub Pages workflow then copies that output into site/results/.
Build flow:
results-data/ + static build pipeline → results-explorer/public/data/
results-explorer/public/data/ + Vite build → results-explorer/dist/
results-explorer/dist/ + [landing page] + [sphinx docs] + [blog] → /site/ → GitHub Pages
Navigation integration: add “Results” link to the shared site header/nav.
Architecture by Phase¶
Phase 1 Architecture (Static Only)¶
CI benchmark runs → schema-v2 bundles in results-data/
↓
static build pipeline → results-explorer/public/data/
↓
Vite build (dist/)
↓
GitHub Pages assembly copies dist/ → site/results/
↓
GitHub Pages (benchbox.dev/results/)
↓
Vite app + DuckDB-WASM in browser
No API. No database. No auth. No hosted services.
Phase 2 Architecture (PR-Based Contributions)¶
Contributor: benchbox submit --output ./submission/
↓
PR touching results-data/ in this repo
↓
CI validates (schema, hash, cohort, sanity)
↓
Maintainer reviews + merges
↓
Same static build pipeline as Phase 1
Still no API. Still no hosted services. GitHub is the auth + moderation layer.
Phase 3 Architecture (Hosted - Deferred)¶
benchbox submit → api.benchbox.dev → object store + metadata DB
↓
async ingest + validation
↓
derived read model rebuild
↓
static explorer update
Only build this if Phase 2 PR volume exceeds maintainer capacity.
Storage Layers (Phase 3 Only)¶
| Layer | Purpose | Properties |
|---|---|---|
| Object store | Immutable raw bundle + companions | Content-addressable, versioned, durable |
| Metadata store | Submission, run, visibility, trust, cohort metadata | Queryable, transactional, auditable |
| Derived public store | Static projections for public reads | Rebuildable, cacheable, CDN-friendly |
Result Identity¶
Result identity is phase-dependent:
| Phase | Identity Scheme |
|---|---|
| Phase 1 | |
| Phase 2 | Same, plus contributor attribution from PR author |
| Phase 3 | Adds |
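One plausible way to derive the Phase 1 identity and the short-ID alias used by `short_ids.json`; the ID composition and hash length here are assumptions for illustration, not the committed scheme:

```python
import hashlib

def result_id(benchmark: str, platform: str, scale_factor: float, run_date: str) -> str:
    """Human-readable, stable identity, e.g. 'tpch-duckdb-sf1-20260315' (assumed composition)."""
    sf = f"sf{scale_factor:g}".replace(".", "_")
    return f"{benchmark}-{platform}-{sf}-{run_date.replace('-', '')}"

def short_id(full_id: str, length: int = 8) -> str:
    """Compact, URL-friendly alias; short_ids.json maps it back to the full ID."""
    return hashlib.sha256(full_id.encode()).hexdigest()[:length]

rid = result_id("tpch", "duckdb", 1.0, "2026-03-15")
lookup = {short_id(rid): rid}   # emitted as short_ids.json by the pipeline
```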
Trust, Visibility, and Ranking¶
Trust complexity scales with phases:
| Phase | Trust Model |
|---|---|
| Phase 1 | All results are maintainer-curated. A simple "Maintainer Run" label is fine, but no richer trust model is needed. |
| Phase 2 | Two labels: maintainer (generated by BenchBox CI) and community (submitted via PR). Both public. |
| Phase 3 | Full trust tiers: private, unlisted, public-self-reported, public-curated, public-verified |
Comparison and ranking should be cohort-aware. Public pages must avoid mixing incompatible runs across materially different contexts:
benchmark family and version
scale factor
execution mode
phase set
query subset vs full benchmark
tuning mode
hardware or platform family where relevant
If a cohort is too heterogeneous for a clean ranking, the explorer should fall back to filters and pairwise comparison rather than pretend the leaderboard is authoritative.
No branded “leaderboard” page. Phase 1 browse views may sort results within a validated cohort (e.g., by total duration or geometric mean), but this is presented as a sorted table, not a ranked competition. Cross-context ranking claims are never shown.
Impact on Existing Publishing TODO¶
productize-result-publishing-and-artifact-sharing remains the BenchBox-local
artifact publication track. It owns benchbox publish workflows for copying
canonical result bundles to local/cloud storage. It does NOT own the results
platform or explorer.
TODO Cluster and Priority¶
Planning and Infrastructure¶
| TODO | Phase | Priority | Status |
|---|---|---|---|
| | Planning | High | Done |
| | Planning | High | Done |
| | Phase 1 | High | Done |
| | Phase 1 | High | Done |
| | Phase 2-3 prep | Medium | Done |
| | Phase 3 | Medium | Not started |
| | Phase 2-3 | Medium | Not started |
| | Phase 3 | Low | Not started |
Explorer Fidelity (added 2026-04-14)¶
These TODOs address the gaps identified in the post-launch audit. They are sequenced: pipeline extension first (all others depend on the richer manifest), then UI features that consume the new fields.
| TODO | Phase | Priority | Rationale |
|---|---|---|---|
| `explorer-extend-manifest-and-pipeline` | Phase 1.5 | High | Unblocks all other fidelity work; adds geomean_ms, execution_mode, tuning_mode, tuning_hash, platform_version, test_type, validation_status, cost_usd to manifest + DuckDB |
| | Phase 1.5 | High | Fixes the CLI vs explorer metric discrepancy; geomean becomes the primary comparison axis |
| `explorer-add-tuning-config-visibility` | Phase 1.5 | Medium-High | Surfaces tuning config in detail pages and enables cross-tuning comparison |
| `explorer-add-methodology-disclosure` | Phase 1.5 | Medium-High | Adds "how this was measured" panel; required before differentiation claims are credible |
| `explorer-add-comparison-charts` | Phase 1.5 | Medium | Normalized speedup + diverging bar charts; closes the most visible CLI parity gap |
| | Phase 2 | Medium | Warn/block when comparing results that differ on execution mode or tuning; depends on pipeline extension |
Primary Surface Revision (added 2026-04-15)¶
These TODOs implement the ClickBench-style matrix leaderboard pivot documented in Primary Surface Revision. They are sequenced: artifact first, then matrix UI + ranking, then trust/URLs, then home-page meta-leaderboard and a11y.
| TODO | Phase | Priority | Rationale |
|---|---|---|---|
| | Phase 1.6 | High | New derived artifact keyed on (benchmark, scale, phase) - foundation for the matrix leaderboard |
| | Phase 1.6 | High | Replaces BenchmarkIndex sorted list with ClickBench-style Platform × Query matrix; log-ratio coloring; row-checkbox → Compare |
| `explorer-align-ranking-metric-with-tpc-standards` | Phase 1.6 | High | Per-family registry so TPC-H/TPC-DS rank by power_score; geomean fallback elsewhere |
| | Phase 1.6 | Medium-High | Trust tiers as visible badges (not a default hide filter); BenchBox's differentiator made explicit |
| | Phase 1.6 | Medium-High | Short hash IDs avoid URL-length limits; backward-compatible |
| | Phase 1.6 | Medium | Home page hero: cross-benchmark rank aggregation - leans into multi-benchmark differentiator |
| | Phase 1.6 | Medium | aria-labels, keyboard nav, reduced-color mode, axe-core in CI |
DuckDB-Only Browser Metric Contract (2026-04-18)¶
Added as the W1 research gate for TODO explorer-canonical-browser-duckdb-read-model.
This section is the schema decision record, direct ingest contract, and cutover plan
that must be locked before any code changes in W2-W7.
Definition of Terms¶
- Canonical Python reference computation - the pipeline's own Python implementation inside `benchbox/core/explorer_pipeline/` plus read-only imports from `benchbox/core/results/`. TypeScript code, pre-calculated CLI JSON artifacts, and bridge artifacts are never a reference source.
- Final-value parity - the rendered user-visible value produced by the DuckDB-backed read path equals the canonical Python reference computation for the same inputs, within the per-metric tolerance declared in `_project/planning/visible_metrics.yaml`. Parity is NOT defined against whatever the CLI, a pre-calculated JSON artifact, or the current TS reduction libraries happen to emit today.
- Source-fidelity parity - raw fields copied into DuckDB equal the corresponding field in the committed JSON bundle verbatim.
- Bridge artifact - a metric-bearing JSON file emitted by the pipeline during the W2-W4 transition window so pages not yet migrated to DuckDB can still render. Bridge artifacts are scaffolding only, regenerated from the same pipeline pass that writes DuckDB, never hand-patched, and always subordinate to DuckDB when the two disagree.
Call-Site Inventory¶
Every user-visible metric read, current data source, and DuckDB target:
| Surface | Current source | Metrics consumed | Target DuckDB table/view |
|---|---|---|---|
| | | total_results, platform list, benchmark list, power_score, geomean_ms, run_date (Recent Results) | |
| | | rank, metric_value, speedup_vs_best, avg_rank, n_cohorts | |
| | | power_score, geomean_ms, run_date, trust_label, tuning_mode | |
| | | scale_factor list, phase list, display_geomean_ms (list view) | |
| | | timings (display_ms per query/platform), power_score, display_geomean_ms, is_ranking_eligible, percentile_stats, compliance_class | |
| | | short_id → result_id mapping | |
| | | display_geomean_ms, power_score, speedup_vs_slowest_per_row, fastest_ms, display_ms per query | DuckDB query over |
| | | same as above | DuckDB query (fallback path deleted once W4 lands) |
| | | power_score, geomean_ms, display_geomean_ms, total_duration_s, display_timings (display_ms + sample_count), queries (raw timings), environment | |
| | | schema column names + types (column picker) | DuckDB introspection ( |
| | | all | |
Bootstrap JSON surviving after W7 (non-metric only):
If manifest.json survives, its top-level keys must be a strict subset of
{routes, navigation, build_meta, schema_version}. Any metric-looking key
(benchmark_summary, result_count, power_score, geomean_ms, etc.) fails G-8.
The preferred outcome is to delete manifest.json entirely in the final W4 slice
and derive all navigation from DuckDB.
results_schema.json is deleted in W3 (replaced by DuckDB introspection).
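A sketch of the G-8 check on any surviving bootstrap manifest.json, using the allowed-key set from the paragraph above (the script location and failure behavior are illustrative):

```python
import json

ALLOWED_BOOTSTRAP_KEYS = {"routes", "navigation", "build_meta", "schema_version"}

def bootstrap_violations(path: str) -> list[str]:
    """G-8: any top-level key outside the allowed set is treated as metric leakage."""
    with open(path) as f:
        manifest = json.load(f)
    return sorted(set(manifest) - ALLOWED_BOOTSTRAP_KEYS)

violations = bootstrap_violations("results-explorer/public/data/manifest.json")
if violations:
    raise SystemExit(f"G-8 violation: metric-bearing keys in bootstrap JSON: {violations}")
```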
Ten Canonical DuckDB Tables and Views¶
All ten surfaces must exist with stable schemas before W3 (frontend migration) begins.
The frozen DDL is in docs/development/browser-duckdb-schema.sql.
| # | Name | Kind | Serves |
|---|---|---|---|
| 1 | | Base table | Home, PlatformIndex, BenchmarkIndex list, Query workbench |
| 2 | | Base table | ResultDetail, BenchmarkIndex matrix, Compare |
| 3 | | Base table | ResultDetail individual samples |
| 4 | | G-11 view | ResultDetail (projects results + environment tables) |
| 5 | | Base table | BenchmarkIndex matrix heatmap |
| 6 | | Base table | BenchmarkIndex matrix/ranks views |
| 7 | | G-11 view | PlatformIndex (projects |
| 8 | | Base table | Home meta-leaderboard (cohort + per-platform ranks) |
| 9 | | Base table | Home meta-leaderboard (cross-cohort platform summary) |
| 10 | | Base table | Compare URL resolution |
Supporting base tables (not canonical surfaces, required by views):
result_environment- OS, arch, CPU, memory, Python per resultresult_phase_durations- phase timing breakdown per result
G-11 compliance: Views project only bare column references, bare aliased columns,
CAST, and COALESCE(col, literal). No arithmetic, aggregation, CASE, window functions,
or non-whitelisted function calls in view projections.
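A sketch of a G-11-compliant view under that rule, using placeholder tables and columns rather than the frozen DDL in docs/development/browser-duckdb-schema.sql:

```python
import duckdb

con = duckdb.connect()  # in-memory, for illustration only

# Placeholder base tables standing in for the frozen schema.
con.execute("CREATE TABLE results (id VARCHAR, benchmark VARCHAR, platform VARCHAR, scale_factor DOUBLE)")
con.execute("CREATE TABLE result_environment (result_id VARCHAR, cpu_count INTEGER, os VARCHAR, arch VARCHAR)")

# G-11 compliant: the projection contains only bare columns, CAST, and COALESCE(col, literal).
con.execute("""
    CREATE VIEW result_detail_view AS
    SELECT
        r.id,
        r.benchmark,
        r.platform,
        CAST(r.scale_factor AS DOUBLE) AS scale_factor,
        COALESCE(e.cpu_count, 0) AS cpu_count,
        e.os,
        e.arch
    FROM results r
    LEFT JOIN result_environment e ON e.result_id = r.id
""")

# Arithmetic, CASE, window functions, or other function calls in the projection would
# violate G-11; such values are computed in Python and persisted as base-table columns.
```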
Metric Parity Registry¶
The complete metric registry is at _project/planning/visible_metrics.yaml. It classifies
every user-visible metric as raw_copy (source-fidelity contract) or derived
(final-value parity contract), with canonical_ref and tolerance per metric.
Derived metrics with existing Python reference computation:
| Metric | Python reference |
|---|---|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
Derived metrics currently TS-only - must be ported to Python in W2:
| TS source | Function | Target in W2 |
|---|---|---|
| | Per-query rank across platforms in a cohort | Port into |
| | | Not persisted; computed as SQL window function |
| | | Already pre-computed in |
| | Heatmap color from timing ratio | Presentation-only (CSS hue/lightness from |
| | Benchmark → primary metric enum | Already in |
| | Compliance class → display string | Presentation-only formatting. Stays TS-side. |
Direct Ingest Contract¶
committed JSON bundles (benchmarks/*.json)
→ pipeline validate (SchemaV2Validator)
→ pipeline transform (BundleTransformer)
→ compute derived values in Python (display_ms, display_geomean_ms, short_id,
is_ranking_eligible, compliance_class, matrix cells, rankings, meta-leaderboard)
→ bulk-insert all ten canonical tables in one DuckDB transaction
→ copy raw bundles to public/data/bundles/ (download-only affordance)
→ [W2-W4 bridge only] regenerate metric-bearing JSON from the same pipeline pass
Rule: JSON is an immutable source input and a download-only output. It is never a steady-state metric read source. If a bridge artifact and DuckDB disagree, DuckDB is canonical and the bridge artifact is regenerated from the same pipeline pass, never hand-patched.
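A sketch of the single-transaction bulk-insert step in this contract, assuming the ten canonical tables already exist per the frozen DDL and that all derived values were computed in Python beforehand (the helper and the `tables` mapping are illustrative):

```python
import duckdb

def bulk_insert_canonical_tables(db_path: str, tables: dict[str, list[tuple]]) -> None:
    """Insert every canonical table in one transaction so a failed build
    never publishes a partially populated database."""
    con = duckdb.connect(db_path)
    try:
        con.execute("BEGIN TRANSACTION")
        for name, rows in tables.items():
            if not rows:
                continue
            placeholders = ", ".join("?" for _ in rows[0])
            con.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)
        con.execute("COMMIT")
    except Exception:
        con.execute("ROLLBACK")
        raise
    finally:
        con.close()
```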
Cutover and Deletion Order¶
| Phase | Pipeline side | Frontend side | Artifacts deleted |
|---|---|---|---|
| W2 | Add ten canonical tables; all existing JSON emitters continue as bridge artifacts | No frontend changes | None |
| W3 | No pipeline changes | Add | |
| W4 Home slice | Delete | Home migrates to DuckDB | |
| W4 PlatformIndex slice | - | PlatformIndex migrates to DuckDB | (reads |
| W4 BenchmarkIndex slice | Delete | BenchmarkIndex migrates to DuckDB | |
| W4 ResultDetail slice | Delete | ResultDetail migrates to DuckDB | |
| W4 Compare + final slice | Delete | Compare migrates to DuckDB; | |
| W7 | Final sweep of any remaining legacy emitters | - | |
`db.ts` hardening (deferred from W3 to W4 final slice):

- Reject `getDb()` on attach failure instead of `console.warn` + silent continue
- Delete the soft JSON fallback at `db.ts:86-89`
- Delete the stale `db.ts:1-17` "Phase 1/2 scaffold" header comment
- Wire per-page `ErrorMessage` UI (G-7 requirement)
Research Gate Findings (RG-1 through RG-9)¶
| Gate | Status | Finding |
|---|---|---|
| RG-1 Cold-start init budget (p50 <1500ms, p95 <3500ms) | TBD | Requires browser measurement against deployed corpus. Measure after W2 builds the full ten-table DB. Accepted risk: proceed to W2; gate must be green before W3 ships to users. |
| RG-2 HTTP range-reads (≤10% total file for single-row lookup) | HIGH CONFIDENCE PASS | |
| RG-3 DB size budget (sub-linear growth, ≤20 MB at 100x) | TBD | Requires synthetic corpus build at 1x, 5x, 20x, 100x. The full ten-table schema (especially |
| RG-4 COEP/COOP hosting | HIGH CONFIDENCE PASS | |
| RG-5 Single DB vs partitioned | N/A unless RG-3 fails | Default: single |
| RG-6 View materialization budget (<250ms p95 cold at 100x) | TBD | Benchmark canonical surface queries against 100x synthetic corpus after W2. Views that fail promote to materialized |
| RG-7 Attach-failure UX (per-page ErrorMessage, no silent fallback) | PLANNED FOR W4 | The |
| RG-8 toJSON fidelity (no precision loss on DOUBLE/BIGINT/DECIMAL) | RISK: MEDIUM | DuckDB-WASM's Arrow→JSON path has known BIGINT→BigInt issues (JSON.stringify throws on BigInt). |
| RG-9 Read-only fuzz (adversarial SQL harmlessly fails) | HIGH CONFIDENCE PASS | DuckDB's |
Research gate exit criterion met when: RG-1 and RG-3 are measured (after W2 DB build),
all others are GREEN or have an accepted-risk note above, and the frozen schema in
docs/development/browser-duckdb-schema.sql is committed alongside this decision record.
External Sources¶
Geekbench editions: https://www.geekbench.com/editions/
Geekbench Browser: https://browser.geekbench.com/
CloudSpecs site: https://cloudspecs.fyi/
CloudSpecs repository: https://github.com/TUM-DIS/cloudspecs
OpenBenchmarking: https://openbenchmarking.org/
airspeed velocity: https://asv.readthedocs.io/en/latest/using.html
js-framework-benchmark (PR-based contribution model): https://github.com/krausest/js-framework-benchmark
ClickBench contribution model: https://github.com/ClickHouse/ClickBench