TPC-DS-OBT Benchmark

Tags: advanced concept, tpc-ds-obt, tpc-ds, experimental

CLI name: tpcds_obt - use benchbox run --benchmark tpcds_obt

Overview

The TPC-DS-OBT (One Big Table) benchmark adapts the standard TPC-DS benchmark to run against a single denormalized table instead of the traditional 25-table normalized schema. This experimental benchmark tests how databases handle wide tables with hundreds of columns, a pattern increasingly common in modern data warehouses and lakehouse architectures.

The benchmark is ideal for evaluating column pruning efficiency, wide table scan performance, and storage format effectiveness (Parquet, Delta Lake, Iceberg) on denormalized schemas.

Key Features

  • Single wide table - All TPC-DS data flattened into one denormalized table

  • 89 of 99 queries - TPC-DS queries rewritten for flat schema (10 queries are not OBT-convertible; see below)

  • No joins required - Tests pure scan and aggregation performance

  • Column pruning focus - Evaluates optimizer column projection efficiency

  • Modern lakehouse pattern - Simulates real-world denormalized data models

  • Storage format comparison - Ideal for Parquet vs Delta vs Iceberg testing

Use Cases

When to Use TPC-DS-OBT

  • Lakehouse performance testing - Evaluate denormalized table performance

  • Column pruning benchmarks - Test how efficiently engines skip unused columns

  • Wide table handling - Stress test databases with 200+ column tables

  • Storage format comparison - Compare Parquet, Delta Lake, Iceberg on wide tables

  • Scan-heavy workloads - Benchmark pure analytical scan performance without join overhead

When to Use Standard TPC-DS

  • Join performance testing - Evaluating multi-table join strategies

  • Normalized schema workloads - Traditional data warehouse patterns

  • TPC compliance - Official TPC-DS compliance requires normalized schema

Data Model

One Big Table Schema

The OBT schema denormalizes all 25 TPC-DS tables into a single wide table:

| Aspect | Value |
| --- | --- |
| Tables | 1 (denormalized) |
| Columns | ~200+ |
| Source Tables | All 25 TPC-DS tables flattened |
| Primary Grain | store_sales fact table |

Column Groups

The denormalized table contains columns from all TPC-DS dimensions:

| Source Table | Columns Added | Prefix |
| --- | --- | --- |
| store_sales | ~23 | ss_ |
| customer | ~18 | c_ |
| customer_address | ~13 | ca_ |
| customer_demographics | ~9 | cd_ |
| date_dim | ~28 | d_ |
| item | ~22 | i_ |
| store | ~29 | s_ |
| promotion | ~19 | p_ |
| household_demographics | ~5 | hd_ |
| time_dim | ~10 | t_ |
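
Because every TPC-DS column already carries its source table's prefix (ss_, c_, d_, ...), flattening can simply concatenate the column groups without renaming anything. A minimal sketch of that invariant (illustrative only, not BenchBox's actual generator):

```python
# Illustrative sketch (not BenchBox's actual generator): the TPC-DS per-table
# prefixes guarantee that concatenating all dimension columns onto the fact
# grain cannot produce a name collision.
def flatten_schema(per_table_columns: dict[str, list[str]]) -> list[str]:
    """Concatenate source-table columns into one wide schema, verifying
    that the prefixes keep every column name unique."""
    flat: list[str] = []
    seen: set[str] = set()
    for table, cols in per_table_columns.items():
        for col in cols:
            if col in seen:
                raise ValueError(f"duplicate column {col!r} from {table}")
            seen.add(col)
            flat.append(col)
    return flat

# Tiny excerpt of the real column groups (three columns per table shown):
schema = flatten_schema({
    "store_sales": ["ss_sold_date_sk", "ss_item_sk", "ss_net_paid"],
    "customer": ["c_customer_id", "c_first_name", "c_last_name"],
    "date_dim": ["d_date_sk", "d_year", "d_moy"],
})
print(len(schema))  # 9
```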

Scale Factors

| Scale Factor | Approximate Rows | Approximate Size |
| --- | --- | --- |
| 1 | ~2.8 million | ~2 GB |
| 10 | ~28 million | ~20 GB |
| 100 | ~280 million | ~200 GB |
| 1000 | ~2.8 billion | ~2 TB |
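
As the table shows, row count and size grow linearly with the scale factor, so intermediate scale factors can be extrapolated. A hypothetical helper (not part of the BenchBox CLI) capturing that relationship:

```python
# Hypothetical helpers (not part of BenchBox) extrapolating the table above:
# OBT row counts and on-disk size scale roughly linearly with scale factor.
BASE_ROWS_SF1 = 2_800_000  # ~2.8 million rows at SF1
BASE_SIZE_GB_SF1 = 2.0     # ~2 GB at SF1

def approx_obt_rows(scale_factor: float) -> int:
    return round(BASE_ROWS_SF1 * scale_factor)

def approx_obt_size_gb(scale_factor: float) -> float:
    return BASE_SIZE_GB_SF1 * scale_factor

print(approx_obt_rows(100))      # 280000000  (~280 million)
print(approx_obt_size_gb(1000))  # 2000.0     (~2 TB)
```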

Quick Start

# Run TPC-DS-OBT on DuckDB
benchbox run --platform duckdb --benchmark tpcds_obt --scale 1.0

# Run specific queries
benchbox run --platform duckdb --benchmark tpcds_obt --scale 1.0 --queries Q1,Q3,Q7

# Compare with standard TPC-DS
benchbox run --platform duckdb --benchmark tpcds --scale 1.0
benchbox run --platform duckdb --benchmark tpcds_obt --scale 1.0

CLI Options (--benchmark-option)

Configure TPC-DS-OBT via --benchmark-option KEY=VALUE:

| Option | Default | Description |
| --- | --- | --- |
| dimension_mode | full | OBT dimension mode: full, minimal |
| channels | all | Sales channels: store, web, catalog (comma-separated) |
| output_format | parquet | Output format: dat, parquet |
| tpcds_source_dir | - | Directory containing TPC-DS source data |
| force_regenerate | - | Force data regeneration (true/false) |

Accepts hyphenated aliases (e.g. dimension-mode, output-format, tpcds-source-dir).

# Minimal dimensions, store channel only (parquet is the default - no flag needed)
benchbox run --platform duckdb --benchmark tpcds_obt --scale 1 \
  --benchmark-option dimension_mode=minimal \
  --benchmark-option channels=store

# Force the legacy pipe-delimited .dat output (rarely needed):
benchbox run --platform duckdb --benchmark tpcds_obt --scale 1 \
  --benchmark-option output_format=dat

Storage format

Parquet is the default because the OBT is a very wide table (518 columns at dimension_mode=full) with high null density - store rows carry NULL for all web_site_*, web_page_*, and call_center_* columns, and so on per channel. Columnar storage with null suppression reduces the SF1 artifact by an order of magnitude vs the 11 GB pipe-delimited .dat, and makes the DataFrame platforms (DataFusion, Polars, Pandas) viable for SF1+.

Use --benchmark-option output_format=dat when you need to reproduce results from runs that pre-date this default change, or when feeding a tool that can only read pipe-delimited text. Existing .dat files on disk are not deleted automatically; BenchBox will regenerate as parquet and log the path of the stale .dat so you can reclaim disk space manually.
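
A back-of-the-envelope sketch of why null suppression matters here. The per-channel column counts below are illustrative assumptions (chosen only so the total matches the 518-column full schema), not the real column breakdown:

```python
# Illustrative column counts (assumptions, not the real schema) chosen so the
# total matches the 518 columns of dimension_mode=full.
CHANNEL_COLUMNS = {
    "store": 60,    # ss_* fact plus store dimension columns
    "web": 80,      # ws_* plus web_site_* / web_page_* columns
    "catalog": 70,  # cs_* plus call_center_* / catalog_page_* columns
}
SHARED_COLUMNS = 308  # customer, address, demographics, date, item, ...

def null_fraction(row_channel: str) -> float:
    """Fraction of a row's columns that are NULL: every column belonging
    to the other two channels is NULL by construction."""
    total = SHARED_COLUMNS + sum(CHANNEL_COLUMNS.values())
    nulls = sum(n for ch, n in CHANNEL_COLUMNS.items() if ch != row_channel)
    return nulls / total

# Under these assumed counts, roughly 30% of a store row is NULL. Parquet
# encodes those NULLs compactly in definition levels, whereas the .dat
# format spends delimiter bytes on every empty column of every row.
print(f"{null_fraction('store'):.2f}")
```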

Query Adaptations

TPC-DS-OBT provides 89 of the standard 99 TPC-DS queries, rewritten to work with the flat schema. The remaining 10 are excluded because they cannot be expressed against a single denormalized fact table:

| Category | Queries | Reason for exclusion |
| --- | --- | --- |
| Inventory fact | Q21, Q22, Q37, Q39, Q72, Q82 | Use the inventory fact table, a separate fact domain not included in the OBT denormalization |
| External customer dimensions | Q46, Q64, Q68, Q84 | Require the customer’s current address/demographics (independent of the sales fact), which the OBT cannot recover from the single-row grain. Q64 additionally uses a cross-channel self-join that defeats OBT’s single-scan design |

Queries 14 and 49 are included but hand-written rather than produced by the automatic converter because their semantics require manual rewrites.

Example: Query 1

Standard TPC-DS Q1 (with joins):

SELECT c_customer_id, c_first_name, c_last_name, ...
FROM customer, store_sales, date_dim, store
WHERE c_customer_sk = ss_customer_sk
  AND ss_sold_date_sk = d_date_sk
  AND ss_store_sk = s_store_sk
  ...

TPC-DS-OBT Q1 (flat table):

SELECT c_customer_id, c_first_name, c_last_name, ...
FROM tpcds_obt
WHERE d_year = 2000
  AND s_state = 'TN'
  ...
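
The mechanical part of this rewrite can be sketched as a toy converter: collapse the multi-table FROM clause to the flat table and drop equi-join predicates, which the denormalized row grain makes implicit. This is purely illustrative; the real BenchBox rewriter is far more involved (and hand-writes Q14/Q49, as noted above):

```python
import re

# Equi-join predicates of the form <col>_sk = <col>_sk, optionally led by
# WHERE/AND. In the OBT these are implicit in the denormalized row grain.
JOIN_PRED = re.compile(r"^(?:WHERE|AND)\s+\w+_sk\s*=\s*\w+_sk$", re.IGNORECASE)

def to_obt(sql: str) -> str:
    """Rewrite a simple multi-table TPC-DS query against the flat table."""
    out, need_where = [], False
    for line in sql.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("FROM "):
            out.append("FROM tpcds_obt")  # collapse the join list
        elif JOIN_PRED.match(stripped):
            # Dropping the WHERE line means the next kept predicate
            # must be promoted from AND to WHERE.
            need_where = need_where or stripped.upper().startswith("WHERE")
        else:
            if need_where and stripped.upper().startswith("AND "):
                line = line.replace("AND", "WHERE", 1)
                need_where = False
            out.append(line)
    return "\n".join(out)

q1_join = """SELECT c_customer_id, c_first_name, c_last_name
FROM customer, store_sales, date_dim, store
WHERE c_customer_sk = ss_customer_sk
  AND ss_sold_date_sk = d_date_sk
  AND d_year = 2000"""
print(to_obt(q1_join))
```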

Performance Considerations

Advantages of OBT

  • No join overhead - Eliminates multi-table join costs

  • Simplified query plans - Single table scan with filters

  • Columnar format efficiency - Modern formats excel at column pruning

  • Predictable performance - Less optimizer variability

Challenges of OBT

  • Storage overhead - Denormalization increases data redundancy

  • Column count - Wide tables stress metadata handling

  • Update complexity - Changes require full table rewrites

  • Memory pressure - Wide rows can stress memory buffers

Platform Support

| Platform | Status | Notes |
| --- | --- | --- |
| DuckDB | ✅ Full | Excellent wide table handling |
| ClickHouse | ✅ Full | Strong columnar performance |
| Databricks | ✅ Full | Native Delta Lake support |
| Snowflake | ✅ Full | Automatic micro-partitioning |
| BigQuery | ✅ Full | Columnar storage optimized |
| Polars | ✅ Full | Efficient Arrow-based scans |
| PostgreSQL | ⚠️ Limited | Row-store less efficient for wide tables |

See Also