TPC-DS-OBT Benchmark

Tags: advanced, concept, tpc-ds-obt, tpc-ds, experimental

Overview

The TPC-DS-OBT (One Big Table) benchmark adapts the standard TPC-DS benchmark to run against a single denormalized table instead of the traditional 25-table normalized schema. This experimental benchmark tests how databases handle wide tables with hundreds of columns, a pattern increasingly common in modern data warehouses and lakehouse architectures.

The benchmark is ideal for evaluating column pruning efficiency, wide table scan performance, and storage format effectiveness (Parquet, Delta Lake, Iceberg) on denormalized schemas.

Key Features

  • Single wide table - All TPC-DS data flattened into one denormalized table

  • Same 99 queries - Standard TPC-DS queries rewritten for flat schema

  • No joins required - Tests pure scan and aggregation performance

  • Column pruning focus - Evaluates optimizer column projection efficiency

  • Modern lakehouse pattern - Simulates real-world denormalized data models

  • Storage format comparison - Ideal for Parquet vs Delta vs Iceberg testing

Use Cases

When to Use TPC-DS-OBT

  • Lakehouse performance testing - Evaluate denormalized table performance

  • Column pruning benchmarks - Test how efficiently engines skip unused columns

  • Wide table handling - Stress test databases with 200+ column tables

  • Storage format comparison - Compare Parquet, Delta Lake, Iceberg on wide tables

  • Scan-heavy workloads - Benchmark pure analytical scan performance without join overhead

When to Use Standard TPC-DS

  • Join performance testing - Evaluating multi-table join strategies

  • Normalized schema workloads - Traditional data warehouse patterns

  • TPC compliance - Official TPC-DS compliance requires normalized schema

Data Model

One Big Table Schema

The OBT schema denormalizes all 25 TPC-DS tables into a single wide table:

| Aspect | Value |
| --- | --- |
| Tables | 1 (denormalized) |
| Columns | ~200+ |
| Source Tables | All 25 TPC-DS tables flattened |
| Primary Grain | store_sales fact table |

Column Groups

The denormalized table contains columns from the store_sales fact table and its dimension tables. Each group keeps the column prefix of its source table:

| Source Table | Columns Added | Prefix |
| --- | --- | --- |
| store_sales | ~23 | ss_ |
| customer | ~18 | c_ |
| customer_address | ~13 | ca_ |
| customer_demographics | ~9 | cd_ |
| date_dim | ~28 | d_ |
| item | ~22 | i_ |
| store | ~29 | s_ |
| promotion | ~19 | p_ |
| household_demographics | ~5 | hd_ |
| time_dim | ~10 | t_ |

Scale Factors

| Scale Factor | Approximate Rows | Approximate Size |
| --- | --- | --- |
| 1 | ~2.8 million | ~2 GB |
| 10 | ~28 million | ~20 GB |
| 100 | ~280 million | ~200 GB |
| 1000 | ~2.8 billion | ~2 TB |
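
The figures in the table scale linearly with the scale factor, so intermediate values can be estimated by extrapolating from the SF 1 row. The sketch below does that; the function name `estimate_obt` is hypothetical, and linear growth for scale factors not listed in the table is an assumption, not a documented guarantee.

```python
# Assumption: rows and size grow linearly with the scale factor, anchored on
# the SF 1 row of the table (~2.8 million rows, ~2 GB). Values are estimates.
ROWS_AT_SF1 = 2_800_000
BYTES_AT_SF1 = 2 * 10**9


def estimate_obt(scale_factor: float) -> tuple[int, int]:
    """Return (approximate rows, approximate bytes) for a scale factor."""
    rows = round(ROWS_AT_SF1 * scale_factor)
    size_bytes = round(BYTES_AT_SF1 * scale_factor)
    return rows, size_bytes


if __name__ == "__main__":
    for sf in (1, 10, 100, 1000):
        rows, size_bytes = estimate_obt(sf)
        print(f"SF {sf}: ~{rows:,} rows, ~{size_bytes / 10**9:,.0f} GB")

    # Average row width implied by the SF 1 figures, in bytes.
    print(f"~{round(BYTES_AT_SF1 / ROWS_AT_SF1)} bytes per row")
```

The implied row width of roughly 700 bytes is a useful sanity check when comparing storage formats, since compressed columnar files will usually come in well below the raw estimate.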

Quick Start

# Run TPC-DS-OBT on DuckDB
benchbox run --platform duckdb --benchmark tpc-ds-obt --scale 1.0

# Run specific queries
benchbox run --platform duckdb --benchmark tpc-ds-obt --scale 1.0 --queries Q1,Q3,Q7

# Compare with standard TPC-DS
benchbox run --platform duckdb --benchmark tpcds --scale 1.0
benchbox run --platform duckdb --benchmark tpc-ds-obt --scale 1.0

Query Adaptations

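TPC-DS-OBT rewrites the standard 99 TPC-DS queries to work with the flat schema. Join conditions disappear, and filters on dimension attributes apply directly to columns of the single table.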

Example: Query 1

Standard TPC-DS Q1 (with joins):

SELECT c_customer_id, c_first_name, c_last_name, ...
FROM customer, store_sales, date_dim, store
WHERE c_customer_sk = ss_customer_sk
  AND ss_sold_date_sk = d_date_sk
  AND ss_store_sk = s_store_sk
  ...

TPC-DS-OBT Q1 (flat table):

SELECT c_customer_id, c_first_name, c_last_name, ...
FROM tpcds_obt
WHERE d_year = 2000
  AND s_state = 'TN'
  ...
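
The equivalence between the two query shapes can be checked on toy data. The following is a minimal sketch using Python's built-in `sqlite3` module, not the benchmark's actual loader: the three tables carry only a handful of columns, the sample rows are made up, and the inner-join flattening step is an assumption about how an OBT table might be built. Only the table name `tpcds_obt` and the column names come from the examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-ins for three TPC-DS tables (a few columns each, made-up rows).
    CREATE TABLE store_sales (ss_sold_date_sk INT, ss_store_sk INT, ss_net_paid REAL);
    CREATE TABLE date_dim (d_date_sk INT, d_year INT);
    CREATE TABLE store (s_store_sk INT, s_state TEXT);

    INSERT INTO date_dim VALUES (1, 2000), (2, 2001);
    INSERT INTO store VALUES (10, 'TN'), (20, 'CA');
    INSERT INTO store_sales VALUES
        (1, 10, 5.0), (1, 10, 7.5), (1, 20, 3.0), (2, 10, 9.0);

    -- Flatten once at load time: one output row per store_sales row.
    CREATE TABLE tpcds_obt AS
    SELECT ss.*, d.d_year, s.s_state
    FROM store_sales ss
    JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
    JOIN store s ON ss.ss_store_sk = s.s_store_sk;
""")

# Normalized form: the join conditions are part of every query.
joined = conn.execute("""
    SELECT SUM(ss_net_paid)
    FROM store_sales, date_dim, store
    WHERE ss_sold_date_sk = d_date_sk
      AND ss_store_sk = s_store_sk
      AND d_year = 2000
      AND s_state = 'TN'
""").fetchone()[0]

# Flat form: only the filters remain.
flat = conn.execute("""
    SELECT SUM(ss_net_paid)
    FROM tpcds_obt
    WHERE d_year = 2000
      AND s_state = 'TN'
""").fetchone()[0]

print(joined, flat)
```

Both queries return the same total because the join work has been moved from query time to load time. Note that an inner join drops fact rows whose foreign keys are NULL, so a real flattening step may need outer joins to preserve the store_sales grain.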

Performance Considerations

Advantages of OBT

  • No join overhead - Eliminates multi-table join costs

  • Simplified query plans - Single table scan with filters

  • Columnar format efficiency - Modern formats excel at column pruning

  • Predictable performance - Less optimizer variability
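
Column pruning, mentioned in the list above, means a query reads only the columns it references and skips the rest. The toy below illustrates the idea in plain Python; `ColumnStore` is a hypothetical class written for this example and does not model any real engine or file format.

```python
class ColumnStore:
    """Toy column-oriented table that records which columns a query reads."""

    def __init__(self, columns: dict[str, list]):
        self._columns = columns
        self.columns_read: set[str] = set()

    def read(self, name: str) -> list:
        # Each column is stored separately, so reading one leaves others untouched.
        self.columns_read.add(name)
        return self._columns[name]


# Four made-up columns here; the real OBT table has ~200+.
table = ColumnStore({
    "ss_net_paid": [5.0, 7.5, 3.0, 9.0],
    "d_year": [2000, 2000, 2000, 2001],
    "s_state": ["TN", "TN", "CA", "TN"],
    "c_first_name": ["Ann", "Bob", "Cy", "Di"],
})

# Equivalent of:
#   SELECT SUM(ss_net_paid) FROM tpcds_obt WHERE d_year = 2000 AND s_state = 'TN'
total = sum(
    paid
    for paid, year, state in zip(
        table.read("ss_net_paid"), table.read("d_year"), table.read("s_state")
    )
    if year == 2000 and state == "TN"
)

print(total)
print(sorted(table.columns_read))
```

The query touches three columns and never reads `c_first_name`. On a table with hundreds of columns, the fraction skipped this way is what the benchmark's column pruning focus is meant to measure.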

Challenges of OBT

  • Storage overhead - Denormalization increases data redundancy

  • Column count - Wide tables stress metadata handling

  • Update complexity - Changing a dimension attribute means rewriting every row that repeats it

  • Memory pressure - Wide rows can stress memory buffers

Platform Support

| Platform | Status | Notes |
| --- | --- | --- |
| DuckDB | ✅ Full | Excellent wide table handling |
| ClickHouse | ✅ Full | Strong columnar performance |
| Databricks | ✅ Full | Native Delta Lake support |
| Snowflake | ✅ Full | Automatic micro-partitioning |
| BigQuery | ✅ Full | Columnar storage optimized |
| Polars | ✅ Full | Efficient Arrow-based scans |
| PostgreSQL | ⚠️ Limited | Row-store less efficient for wide tables |

See Also