# TPC-DS-OBT Benchmark

CLI name: `tpcds_obt` (use `benchbox run --benchmark tpcds_obt`)
## Overview

The TPC-DS-OBT (One Big Table) benchmark adapts the standard TPC-DS benchmark to run against a single denormalized table instead of the traditional 25-table normalized schema. This experimental benchmark tests how databases handle wide tables with hundreds of columns, a pattern increasingly common in modern data warehouses and lakehouse architectures.

The benchmark is ideal for evaluating column pruning efficiency, wide table scan performance, and storage format effectiveness (Parquet, Delta Lake, Iceberg) on denormalized schemas.

## Key Features

- **Single wide table** - All TPC-DS data flattened into one denormalized table
- **89 of 99 queries** - TPC-DS queries rewritten for the flat schema (10 queries are not OBT-convertible; see below)
- **No joins required** - Tests pure scan and aggregation performance
- **Column pruning focus** - Evaluates optimizer column projection efficiency
- **Modern lakehouse pattern** - Simulates real-world denormalized data models
- **Storage format comparison** - Ideal for Parquet vs Delta vs Iceberg testing
## Use Cases

### When to Use TPC-DS-OBT

- **Lakehouse performance testing** - Evaluate denormalized table performance
- **Column pruning benchmarks** - Test how efficiently engines skip unused columns
- **Wide table handling** - Stress test databases with 200+ column tables
- **Storage format comparison** - Compare Parquet, Delta Lake, Iceberg on wide tables
- **Scan-heavy workloads** - Benchmark pure analytical scan performance without join overhead

### When to Use Standard TPC-DS

- **Join performance testing** - Evaluating multi-table join strategies
- **Normalized schema workloads** - Traditional data warehouse patterns
- **TPC compliance** - Official TPC-DS compliance requires the normalized schema
## Data Model

### One Big Table Schema

The OBT schema denormalizes all 25 TPC-DS tables into a single wide table:

| Aspect | Value |
|---|---|
| Tables | 1 (denormalized) |
| Columns | ~200+ |
| Source Tables | All 25 TPC-DS tables flattened |
| Primary Grain | `store_sales` fact table |
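The flattening itself is one join-and-materialize pass over the fact table and its dimensions. A minimal runnable sketch using sqlite3 (a toy mini-schema with a handful of hypothetical values, not the real TPC-DS column lists):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE store_sales (ss_customer_sk INT, ss_sold_date_sk INT, ss_net_paid REAL);
CREATE TABLE customer    (c_customer_sk INT, c_customer_id TEXT, c_first_name TEXT);
CREATE TABLE date_dim    (d_date_sk INT, d_year INT);

INSERT INTO store_sales VALUES (1, 10, 42.0);
INSERT INTO customer    VALUES (1, 'AAA', 'Ann');
INSERT INTO date_dim    VALUES (10, 2000);

-- Join the fact to its dimensions once, then materialize the result
-- as a single wide table: the OBT pattern in miniature.
CREATE TABLE tpcds_obt AS
SELECT ss.*, c.c_customer_id, c.c_first_name, d.d_year
FROM store_sales ss
JOIN customer c ON c.c_customer_sk = ss.ss_customer_sk
JOIN date_dim d ON d.d_date_sk = ss.ss_sold_date_sk;
""")

# Fact and dimension columns now live side by side in one table.
cols = [r[1] for r in con.execute("PRAGMA table_info(tpcds_obt)")]
row = con.execute("SELECT d_year, c_customer_id FROM tpcds_obt").fetchone()
print(cols)
print(row)
```

The real build repeats this pattern across all 25 source tables, which is where the ~200+ columns come from.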
### Column Groups

The denormalized table contains columns from all TPC-DS dimensions, each group keeping its standard TPC-DS column prefix:

| Source Table | Columns Added | Prefix |
|---|---|---|
| store_sales | ~23 | `ss_` |
| customer | ~18 | `c_` |
| customer_address | ~13 | `ca_` |
| customer_demographics | ~9 | `cd_` |
| date_dim | ~28 | `d_` |
| item | ~22 | `i_` |
| store | ~29 | `s_` |
| promotion | ~19 | `p_` |
| household_demographics | ~5 | `hd_` |
| time_dim | ~10 | `t_` |
| … | … | … |
### Scale Factors

| Scale Factor | Approximate Rows | Approximate Size |
|---|---|---|
| 1 | ~2.8 million | ~2 GB |
| 10 | ~28 million | ~20 GB |
| 100 | ~280 million | ~200 GB |
| 1000 | ~2.8 billion | ~2 TB |
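Both columns scale roughly linearly with the scale factor, so sizing for an arbitrary SF is simple arithmetic (illustrative estimates derived from the SF1 row above, not measured values):

```python
# Back-of-envelope sizing: TPC-DS-OBT row count and on-disk size both
# grow roughly linearly with the scale factor (SF).
def approx_rows(sf: float) -> float:
    return 2.8e6 * sf      # ~2.8 million rows at SF1

def approx_size_gb(sf: float) -> float:
    return 2.0 * sf        # ~2 GB at SF1

print(approx_rows(100))        # matches the ~280 million row estimate
print(approx_size_gb(1000))    # ~2000 GB, i.e. the ~2 TB estimate
```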
## Quick Start

```bash
# Run TPC-DS-OBT on DuckDB
benchbox run --platform duckdb --benchmark tpcds_obt --scale 1.0

# Run specific queries
benchbox run --platform duckdb --benchmark tpcds_obt --scale 1.0 --queries Q1,Q3,Q7

# Compare with standard TPC-DS
benchbox run --platform duckdb --benchmark tpcds --scale 1.0
benchbox run --platform duckdb --benchmark tpcds_obt --scale 1.0
```
## CLI Options (--benchmark-option)

Configure TPC-DS-OBT via `--benchmark-option KEY=VALUE`:

| Option | Default | Description |
|---|---|---|
| `dimension_mode` |  | OBT dimension mode (e.g. `minimal`, `full`) |
| `channels` | `all` | Sales channels to include (e.g. `store`) |
| `output_format` | `parquet` | Output format (`parquet` or `dat`) |
| `tpcds_source_dir` | - | Directory containing TPC-DS source data |
|  | - | Force data regeneration |

Options also accept hyphenated aliases (e.g. `dimension-mode`, `output-format`, `tpcds-source-dir`).
```bash
# Minimal dimensions, store channel only (parquet is the default - no flag needed)
benchbox run --platform duckdb --benchmark tpcds_obt --scale 1 \
  --benchmark-option dimension_mode=minimal \
  --benchmark-option channels=store

# Force the legacy pipe-delimited .dat output (rarely needed):
benchbox run --platform duckdb --benchmark tpcds_obt --scale 1 \
  --benchmark-option output_format=dat
```
### Storage format
Parquet is the default because the OBT is a very wide table (518 columns at dimension_mode=full)
with high null density - store rows carry NULL for all web_site_*, web_page_*, and
call_center_* columns, and so on per channel. Columnar storage with null suppression
reduces the SF1 artifact by an order of magnitude vs the 11 GB pipe-delimited .dat,
and makes the DataFrame platforms (DataFusion, Polars, Pandas) viable for SF1+.
Use --benchmark-option output_format=dat when you need to reproduce results from runs
that pre-date this default change, or when feeding a tool that can only read pipe-delimited
text. Existing .dat files on disk are not deleted automatically; BenchBox will regenerate
as parquet and log the path of the stale .dat so you can reclaim disk space manually.
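One way to see part of the text-format overhead is that pipe-delimited text spends a separator byte on every column, NULL or not, while a null-suppressing columnar layout pays roughly one validity bit per column plus storage for populated values only. A purely illustrative sketch (the populated-column count and value width are hypothetical; real Parquet savings also come from encoding and compression, which this ignores):

```python
# Illustrative arithmetic only - not a measurement of the real artifacts.
TOTAL_COLS = 518    # dimension_mode=full column count (from this page)
POPULATED = 120     # hypothetical: non-NULL columns in a store-channel row
VALUE_BYTES = 8     # hypothetical average width of a populated value

# Pipe-delimited text: one separator per column boundary, even for NULLs.
dat_bytes = POPULATED * VALUE_BYTES + (TOTAL_COLS - 1)

# Null-suppressing columnar layout: values plus a 1-bit-per-column bitmap.
columnar_bytes = POPULATED * VALUE_BYTES + TOTAL_COLS // 8

print(dat_bytes, columnar_bytes)
```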
## Query Adaptations

TPC-DS-OBT provides 89 of the standard 99 TPC-DS queries rewritten to work with the flat schema. 10 queries are excluded because they cannot be expressed against a single denormalized fact:

| Category | Queries | Reason for exclusion |
|---|---|---|
| Inventory fact | Q21, Q22, Q37, Q39, Q72, Q82 | Use the inventory fact table, a separate fact domain not included in the OBT denormalization |
| External customer dimensions | Q46, Q64, Q68, Q84 | Require the customer’s current address/demographics (independent of the sales fact), which the OBT cannot recover from the single-row grain. Q64 additionally uses a cross-channel self-join that defeats OBT’s single-scan design |
Queries 14 and 49 are included but hand-written rather than produced by the automatic converter because their semantics require manual rewrites.
### Example: Query 1

Standard TPC-DS Q1 (with joins):

```sql
SELECT c_customer_id, c_first_name, c_last_name, ...
FROM customer, store_sales, date_dim, store
WHERE c_customer_sk = ss_customer_sk
  AND ss_sold_date_sk = d_date_sk
  AND ss_store_sk = s_store_sk
  ...
```
TPC-DS-OBT Q1 (flat table):

```sql
SELECT c_customer_id, c_first_name, c_last_name, ...
FROM tpcds_obt
WHERE d_year = 2000
  AND s_state = 'TN'
  ...
```
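The rewrite turns every join predicate into a plain column filter on the single table. A minimal runnable sketch of the flat form using sqlite3 (toy table with hypothetical rows, keeping only the columns the snippet above touches):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tpcds_obt (
    c_customer_id TEXT, c_first_name TEXT, c_last_name TEXT,
    d_year INT, s_state TEXT)""")
con.executemany("INSERT INTO tpcds_obt VALUES (?,?,?,?,?)", [
    ("AAA", "Ann", "Ames", 2000, "TN"),
    ("BBB", "Bob", "Bond", 1999, "TN"),   # wrong year: filtered out
    ("CCC", "Cy",  "Cole", 2000, "CA"),   # wrong state: filtered out
])

# Former join keys are gone; the dimension predicates are ordinary filters.
rows = con.execute("""
    SELECT c_customer_id, c_first_name, c_last_name
    FROM tpcds_obt
    WHERE d_year = 2000 AND s_state = 'TN'
""").fetchall()
print(rows)  # [('AAA', 'Ann', 'Ames')]
```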
## Performance Considerations

### Advantages of OBT

- **No join overhead** - Eliminates multi-table join costs
- **Simplified query plans** - Single table scan with filters
- **Columnar format efficiency** - Modern formats excel at column pruning
- **Predictable performance** - Less optimizer variability

### Challenges of OBT

- **Storage overhead** - Denormalization increases data redundancy
- **Column count** - Wide tables stress metadata handling
- **Update complexity** - Changes require full table rewrites
- **Memory pressure** - Wide rows can stress memory buffers
## Platform Support

| Platform | Status | Notes |
|---|---|---|
| DuckDB | ✅ Full | Excellent wide table handling |
| ClickHouse | ✅ Full | Strong columnar performance |
| Databricks | ✅ Full | Native Delta Lake support |
| Snowflake | ✅ Full | Automatic micro-partitioning |
| BigQuery | ✅ Full | Columnar storage optimized |
| Polars | ✅ Full | Efficient Arrow-based scans |
| PostgreSQL | ⚠️ Limited | Row store is less efficient for wide tables |
## See Also

- TPC-DS Benchmark - Standard normalized TPC-DS
- ClickBench - Another single-table analytics benchmark
- Data Vault - Alternative schema modeling approach
- Platform Comparison - Platform capabilities