Splitting approximate analytics across BenchBox read and write benchmarks¶

Approximate aggregates answer one query over raw rows. Sketches store compact summary state that can be merged and queried later. BenchBox now tests those workloads separately.

TL;DR: Approximate analytics is not just a function-mapping problem. A platform can expose approximate aggregate functions without exposing the full sketch lifecycle (build compact state, store it, merge it later, query it). BenchBox now separates those two claims:

read_primitives measures one-shot approximate aggregates: HLL distinct counts, KLL or T-Digest quantiles, vector quantiles, and Top-K.
write_primitives measures persisted sketch state: store compact summaries, merge them later, validate the extracted answer, and track storage size.

The current develop branch also adds ClickHouse-native lifecycle coverage, DuckDB-only CPC and REQ variants, parameter sweeps, and a PySpark DataFrame HLL persist-merge path. Tuple sketches and cloud live verification remain deferred.

Terminal screenshot of the BenchBox write_primitives sketch lifecycle on ClickHouse Local, showing the eight sketch operations (DDL create, three inserts, three merges, DDL drop) with per-operation latencies rendered as a horizontal bar chart

Introduction¶

A support matrix would have been the obvious first response to Databricks’ April 29, 2026 sketch announcement.[1]

Theta: supported here, substituted there. KLL: supported here, skipped there. Top-K: supported here, missing there. That table is useful, and BenchBox has one now. But it does not answer the question the announcement actually raises.

Sketches become interesting when they stop being a single function call: built over partitions, stored as columns, merged across days or regions, and queried against compact state instead of raw rows. Once that is the claim, a benchmark cannot stop at “which function name maps where?”

The announcement was a useful forcing function for BenchBox: one-shot aggregate tests can measure function latency, but they cannot exercise stored, mergeable sketch artifacts.

A note up front about what BenchBox can and cannot claim here. The catalog covers many platforms and families, but catalog coverage is not the same as live, per-platform verification. Local read_primitives paths are straightforward to exercise; chDB gives a local ClickHouse-shaped path for the native -State and -Merge lifecycle; DuckDB KLL, CPC, and REQ work against the installed community extension, while DuckDB Theta and frequent-items remain blocked by community-extension drift. Snowflake, BigQuery, Databricks, and Redshift lifecycle runs remain credential-gated.

Databricks’ SQL docs make that lifecycle explicit for Theta (theta_sketch_agg, theta_union_agg, theta_sketch_estimate), KLL (kll_sketch_agg_double, kll_merge_agg_double, kll_sketch_get_quantile_double), and Top-K (approx_top_k_accumulate, approx_top_k_combine, approx_top_k_estimate).[2][3][4][5][6] BenchBox already had a single-query read benchmark. It needed a second benchmark shape for the persist, merge, and requery loop.

Why parity was only the starting point¶

The announcement moved Databricks closer to sketch surfaces already present elsewhere in BenchBox’s platform matrix. Snowflake documents DataSketches functions for approximate distinct counts and stateful Top-K functions that accumulate, combine, and estimate stored state.[7][8] BigQuery exposes HLL++ sketches and KLL quantile sketches as BYTES values that can be initialized, merged, and extracted later.[9][10] ClickHouse has long exposed the same lifecycle shape through aggregate-state combinators such as uniqState and uniqMerge. DuckDB covers the one-shot approximate aggregate surface with approx_count_distinct, approx_quantile, and approx_top_k.[11][12]

That competitive context makes the benchmark question sharper. BenchBox needs to show where Databricks now overlaps with existing platform capabilities, where the SQL shapes still differ, and which parts of the lifecycle are actually measurable across platforms.

Why read_primitives was not enough¶

read_primitives is intentionally simple. Each query runs in isolation against benchmark tables and reports latency for one read shape: a filter, a join, a group-by, a percentile, a top-N query. That is the right home for one-shot approximate aggregates:

Query	Capability
`approx_count_distinct_simple`	HLL distinct count, single value
`approx_count_distinct_groupby`	HLL distinct count per group
`approx_quantile_groupby`	KLL or T-Digest style single quantile
`approx_quantiles_array`	Multiple quantiles from one aggregate
`approx_top_k_lineitem`	Approximate most-frequent values

Those queries are useful. They answer “how fast is this platform’s approximate aggregate surface compared with its exact aggregate surface?” That is still worth benchmarking.

But the sketch lifecycle is a different workload. It has at least three steps:

Build per-partition sketch state.
Persist that state in a table or file.
Merge stored state and extract a scalar or list later.

The second and third steps are the interesting part. They turn “approximate aggregate” from a function call into a data management pattern. A single-query catalog can produce numbers for step one. It cannot exercise the durable state loop.

This surfaced as a framework-gap finding in our own planning notes: function-parity evaluation under-weights claims that depend on the benchmark execution model. Mapping names is necessary, but it is not enough. We also have to ask whether the benchmark can express the capability under test.

From function parity to lifecycle tests¶

The first pass was the parity table. We mapped Databricks’ sketch families against the other platforms that BenchBox targets:

Platform	Distinct	Quantile	Top-K
DuckDB	`APPROX_COUNT_DISTINCT`	`APPROX_QUANTILE`	`APPROX_TOP_K`
Snowflake	`APPROX_COUNT_DISTINCT`	`APPROX_PERCENTILE`	`APPROX_TOP_K`
BigQuery	`APPROX_COUNT_DISTINCT`	`APPROX_QUANTILES`	`APPROX_TOP_COUNT`
Databricks	`approx_count_distinct`	`PERCENTILE_APPROX`	`approx_top_k`
ClickHouse	`uniq`	`quantileTDigest`	`topK`
Redshift	`APPROXIMATE COUNT(DISTINCT ...)`	`APPROXIMATE PERCENTILE_DISC`	no native top-K
DataFusion	`approx_distinct`	`approx_percentile_cont`	no native top-K

That table belongs in read_primitives, and it now lives in the public approximate-functions doc.[13] It is enough for aggregate latency, but not for persisted sketch state.

Then we looked at persistence. Two implementation options were on the table:

Option	Benefit	Weakness
Add sketch-like queries only to `read_primitives`	Small change, easy to explain, broad platform coverage	Only measures one-shot aggregate latency
Add a `sketch` category to `write_primitives`	Exercises persist, merge, and requery	Requires platform-specific DDL, binary column handling, validation tolerances, and skip rationale

We chose both. read_primitives owns the one-query surface. write_primitives owns the lifecycle.

That choice also forced architecture work we would not have seen from the parity table alone. write_primitives already had per-platform SQL overrides for the operation body, but validation SQL was still single-body. That breaks as soon as ClickHouse writes with uniqState while validation still calls a DuckDB-only datasketch_* function. We added validation_query.platform_overrides so each platform can validate with its own sketch syntax.

The DataFrame write path had a separate gap. Its operation enum modeled row-level work: insert, bulk-load, update, delete, merge, transaction. Sketch persistence is aggregate-state work, not row maintenance. We added AGGREGATE_PERSIST and AGGREGATE_MERGE as explicit operation shapes so DataFrame platforms with sketch APIs can report build latency, durable state size, and merge latency without pretending the workload is row CRUD.

That is the larger lesson from the attempt: the benchmark shape mattered as much as the function map.

The write_primitives sketch lifecycle¶

The first write_primitives sketch category models a full sketch lifecycle with eight core SQL operations:

Operation	Stage	What it measures
`sketch_ddl_create_persistent_table`	DDL	Creating a table with sketch-bearing columns
`sketch_insert_theta_per_partition`	INSERT	Building distinct-count sketches per partition
`sketch_insert_kll_per_partition`	INSERT	Building quantile sketches per partition
`sketch_insert_topk_per_shard`	INSERT	Building frequent-item sketches per shard
`sketch_query_theta_union_merge`	MERGE	Merging distinct-count sketches and extracting an estimate
`sketch_query_kll_quantiles_merge`	MERGE	Merging quantile sketches and extracting a median
`sketch_query_topk_combine`	MERGE	Merging frequent-item sketches and extracting top items
`sketch_drop_persistent_table`	DDL	Cleaning up the sketch table

The catalog separates three kinds of support:

Family	DataSketches-family or binary-oriented path	Native or substituted path	No-support cases
Theta distinct	Databricks, Snowflake, DuckDB extension	ClickHouse `uniqState` and `uniqMerge`; BigQuery and Redshift HLL substitution	DataFusion
KLL quantile	Databricks, Snowflake, BigQuery, DuckDB extension	ClickHouse `quantileTDigestState` and `quantileTDigestMerge`	Redshift, DataFusion
Top-K frequent items	Databricks, Snowflake, DuckDB extension	ClickHouse `topKState` and `topKMerge`	BigQuery, Redshift, DataFusion
CPC distinct	DuckDB extension	none today	Databricks, Snowflake, BigQuery, ClickHouse, Redshift, DataFusion
REQ quantile	DuckDB extension	none today	Databricks, Snowflake, BigQuery, ClickHouse, Redshift, DataFusion

That table defines execution contracts, not rankings. When BenchBox says Redshift is HLL-only here, the catalog emits HLL-specific DDL and HLL merge SQL for the distinct-count path, then skips KLL and Top-K with rationale. When BenchBox says ClickHouse is native-but-distinct, it means the persisted state shape matches the benchmark lifecycle, while the bytes are not portable to Apache DataSketches readers.

The later DuckDB-only CPC and REQ rows are a different kind of coverage. They do not broaden the vendor matrix. They make local sketch-family trade-offs measurable: CPC versus Theta for serialized distinct-count state size, and REQ versus KLL for quantile error semantics and state size.

Validation has two layers for the headline merge operations.

The first layer is scalar bounds. Sketch outputs can vary across algorithms, parameters, and platforms. Exact equality would be the wrong contract. Bounds are wide enough to avoid false failures on healthy sketches and tight enough to catch obvious regressions, such as returning zero or merging the wrong column.

The second layer is storage-size validation where the platform exposes a practical byte-length probe. A sketch benchmark that measures only latency misses half of the cost model. Persisted state size affects table storage, cache behavior, network movement, and merge throughput. The current catalog uses DuckDB and ClickHouse probes where verified, while cloud byte-length checks remain deferred until cloud credentials are available.

Tuple sketches stay out of this first lifecycle category. Databricks includes them in the announcement, and they are important because they combine distinct counting with associated metrics such as sums or mins and maxes.[1] BenchBox defers them because the cross-platform story is different: Databricks and Snowflake have a natural path; BigQuery, Redshift, and ClickHouse do not map cleanly to the same family. Adding Tuple now would make the first lifecycle benchmark broader while mixing unlike support models in the same table.

What changed in the benchmark model¶

Approximate analytics has two operational phases: aggregate functions and sketch persistence.

The aggregate functions phase is a query like “give me an approximate distinct count now.” Most analytical platforms have something here. Some use HLL, some use T-Digest, some use platform-specific Top-K accumulators, and a few still rely on SQL transpilation or hand-written variants. This is the path read_primitives measures.

The persisted-state phase asks a different question: can the platform build sketches over partitions, store them, merge them later, and extract answers cheaply? This is the path write_primitives measures. The platform matrix narrows at this tier, and the portability story becomes more specific. Some platforms expose Apache DataSketches-family artifacts. Some expose native aggregate-state combinators. Some expose HLL only. A few still do not expose any sketch lifecycle at all. (See the catalog-versus-live note in the introduction for what BenchBox can claim today.)

DataFrame support needs its own accounting. Polars, PySpark, DataFusion, pandas, Dask, Modin, and cuDF do not expose the same sketch surfaces. On the read side, some rows are sketch-backed and some rows fall back to exact aggregates. A pandas exact nunique timing is not semantically comparable to a PySpark HLL timing. On the write side, PySpark now has a CLI-runnable HLL persist-merge path; Top-K is guarded by runtime support; the other DataFrame platforms still lack a comparable sketch persistence API.

The pattern is reusable beyond sketches. Any feature announced as “stored, mergeable, requeryable” needs an execution-model check before we start writing parity tables. Vector indexes, materialized aggregates, search indexes, incremental materialized views, and ML feature stores all have the same shape. The question is not only “what function name maps to what?” It is “does this benchmark have the phases needed to test the claim?”

Try it yourself¶

Run the one-shot approximate aggregate path first:

uv run -- benchbox run --platform duckdb --benchmark read_primitives \
  --queries approx_count_distinct_simple,approx_count_distinct_groupby,approx_quantile_groupby,approx_quantiles_array,approx_top_k_lineitem \
  --scale 1 \
  --non-interactive

Then run the sketch lifecycle where your target platform supports it. The DuckDB Theta and Top-K paths currently need the community-extension drift resolved first, so KLL is the safer local DuckDB lifecycle smoke:

uv run -- benchbox run --platform duckdb --benchmark write_primitives \
  --queries sketch_insert_kll_per_partition,sketch_query_kll_quantiles_merge \
  --scale 0.01 \
  --non-interactive

ClickHouse-native sketch state can be exercised locally through clickhouse-local. In a clean checkout, install the optional local ClickHouse dependency first with uv sync --extra clickhouse-local.

QUERIES="sketch_ddl_create_persistent_table,sketch_insert_theta_per_partition,sketch_insert_kll_per_partition,sketch_insert_topk_per_shard"
QUERIES="$QUERIES,sketch_query_theta_union_merge,sketch_query_kll_quantiles_merge,sketch_query_topk_combine,sketch_drop_persistent_table"

uv run -- benchbox run --platform clickhouse-local --benchmark write_primitives \
  --queries "$QUERIES" \
  --scale 0.01 \
  --non-interactive

For DataFrame read aggregates, run each platform separately and label sketch-backed versus exact-fallback results before comparing:

for platform in polars-df pyspark-df datafusion-df pandas-df dask-df; do
  uv run -- benchbox run --platform "$platform" --benchmark read_primitives \
    --queries approx_count_distinct_simple,approx_count_distinct_groupby,approx_quantile_groupby \
    --scale 0.1 \
    --non-interactive
done

For the PySpark write-side DataFrame path, start with HLL. Top-K depends on whether the active Spark distribution exposes the approx_top_k_* Python functions:

uv run -- benchbox run --platform pyspark-df --benchmark write_primitives \
  --queries sketch_df_hll_persist_merge \
  --scale 0.01 \
  --non-interactive

The public reference docs are the best place to start:

Test environment¶

This post was reviewed against origin/develop at commit c48c299b4. The v0.3.0 version bump has not yet landed on develop, so pyproject.toml still reads 0.2.1; the public release will pin against the bump commit.

Evidence status for this post:

Area	Evidence status	Notes
`read_primitives` approximate aggregate catalog	Implemented and documented	Covers the one-shot approximate aggregate surface
DataFrame approximate aggregate surface	Implemented and documented	Labels sketch-backed and exact-fallback paths separately
`write_primitives` KLL Databricks SQL overrides	Docs-checked	Uses typed Databricks SQL names; live Databricks verification is still credential-gated
`validation_query.platform_overrides`	Implemented and unit-covered	Lets each platform validate with its own sketch syntax
`AGGREGATE_PERSIST` / `AGGREGATE_MERGE` DataFrame operation shapes	Implemented and unit-covered	Models aggregate-state work separately from row maintenance
ClickHouse-native sketch lifecycle and storage-size probes	Locally exercised	Uses chDB / `clickhouse-local` for native `-State` and `-Merge` paths
DuckDB sketch families	Mixed local status	KLL, CPC, and REQ are locally exercised; Theta and frequent-items remain blocked by community `datasketches` extension drift
PySpark DataFrame HLL persist-merge path	CLI-runnable and runtime-gated	HLL runs through `sketch_df_hll_persist_merge`; Top-K skips when the Python client lacks `approx_top_k_*` functions
Cloud lifecycle runs	Deferred	Snowflake, BigQuery, Databricks, and Redshift runs require credentials

Limitations¶

Approximation error is out of scope here. A separate benchmark would need paired exact and approximate outputs, relative-error reporting, and workload-specific tolerance choices.

Cross-platform binary sketch portability is also out of scope. Persisting a sketch in one platform and reading it in another needs orchestration across two platforms and byte-level compatibility checks. BenchBox does not have that workflow today.

The support matrix separates catalog support from live verification. Cloud verification, cloud storage-size probes, and DuckDB extension pinning can fill in more of the matrix later.

References¶

Questions or feedback? Open an issue or join the discussion.

May 18, 2026 · Joe Harris · benchbox sketches datasketches databricks approximate-analytics write-primitives read-primitives clickhouse duckdb redshift

← Reworking JoinOrder around the IMDb 2013 dataset