DuckDB Platform

Tags: intermediate, guide, duckdb, embedded-platform

DuckDB is an in-process analytical database optimized for OLAP workloads. It is included with BenchBox by default and is well suited to local development, testing, and small-to-medium-scale benchmarks.

Features

  • Zero configuration - No server setup required

  • In-process execution - Embedded in Python process

  • Columnar, vectorized execution - Optimized for analytical queries

  • File-based persistence - Single-file databases, similar in spirit to SQLite

  • Parallel execution - Multi-threaded queries

Why DuckDB is Included by Default

DuckDB is included with BenchBox by default for convenience:

  1. No setup required - Works out of the box with no external dependencies

  2. Native file format support - Reads Parquet and CSV directly

  3. Local execution - No network or service variability

  4. Free and open source - No licensing costs

  5. Suitable for development - Quick iteration on benchmark workflows

Choose the platform that best fits your requirements; see the Platform Selection Guide for help selecting one for your use case.

Installation

# DuckDB is included with BenchBox by default
uv add benchbox

# Or with pip
pip install benchbox

Configuration

Default Usage

No configuration needed - DuckDB works immediately:

benchbox run --platform duckdb --benchmark tpch --scale 0.01

Persistent Database

By default, BenchBox uses in-memory databases. For persistence:

benchbox run --platform duckdb --benchmark tpch --scale 1.0 \
  --platform-option database=/path/to/benchmark.duckdb

CLI Options

benchbox run --platform duckdb --benchmark tpch --scale 1.0 \
  --platform-option threads=8 \
  --platform-option memory_limit=4GB

Platform Options

  Option                 Default     Description
  database               :memory:    Database file path, or :memory: for an in-memory database
  threads                (auto)      Number of worker threads
  memory_limit           (auto)      Maximum memory usage
  temp_directory         (auto)      Directory for temporary spill files
  enable_progress_bar    true        Show query progress
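
These options map onto native DuckDB settings. As a rough sketch, the same settings can be applied directly through the standard duckdb Python package (values here are illustrative; exactly how BenchBox forwards --platform-option values is internal to the adapter):

import duckdb

# In-memory connection; pass a file path instead to mirror the database option
conn = duckdb.connect(":memory:")

# Native DuckDB settings that the options above correspond to (illustrative values)
conn.execute("SET threads = 8")                     # threads
conn.execute("SET memory_limit = '4GB'")            # memory_limit
conn.execute("SET temp_directory = '/tmp/duckdb'")  # temp_directory
conn.execute("SET enable_progress_bar = true")      # enable_progress_bar

# Confirm the values took effect
print(conn.execute("SELECT current_setting('threads'), current_setting('memory_limit')").fetchone())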

Usage Examples

Quick Start

# Minimal benchmark
benchbox run --platform duckdb --benchmark tpch --scale 0.01

# With specific queries
benchbox run --platform duckdb --benchmark tpch --scale 0.1 \
  --queries Q1,Q6,Q17

Scale Factor Guide

  Scale Factor    Data Size    Use Case
  0.01            ~10 MB       Unit testing, CI/CD
  0.1             ~100 MB      Integration testing
  1.0             ~1 GB        Standard benchmarking
  10.0            ~10 GB       Performance testing

Note: Execution times vary based on hardware, query complexity, and configuration. Run benchmarks to establish baselines for your environment.

With Tuning

# Apply optimizations (indexes, etc.)
benchbox run --platform duckdb --benchmark tpch --scale 1.0 \
  --tuning tuned

Python API

from benchbox import TPCH
from benchbox.platforms.duckdb import DuckDBAdapter

# In-memory database
adapter = DuckDBAdapter()

# Or persistent
adapter = DuckDBAdapter(database="./benchmarks.duckdb")

benchmark = TPCH(scale_factor=0.1)
benchmark.generate_data()
adapter.load_benchmark(benchmark)
results = adapter.run_benchmark(benchmark)

print(f"Total runtime: {results.total_time:.2f}s")

Comparison Across Scales

from benchbox import TPCH
from benchbox.platforms.duckdb import DuckDBAdapter

scales = [0.01, 0.1, 1.0]
for sf in scales:
    adapter = DuckDBAdapter()
    benchmark = TPCH(scale_factor=sf)
    benchmark.generate_data()
    adapter.load_benchmark(benchmark)
    results = adapter.run_benchmark(benchmark)
    print(f"SF {sf}: {results.total_time:.2f}s")

Performance Features

Thread Configuration

# Control parallelism
benchbox run --platform duckdb --benchmark tpch \
  --platform-option threads=4

Memory Limits

# Limit memory usage
benchbox run --platform duckdb --benchmark tpch --scale 10.0 \
  --platform-option memory_limit=8GB

Temporary Storage

For large datasets that exceed memory:

benchbox run --platform duckdb --benchmark tpch --scale 10.0 \
  --platform-option temp_directory=/fast/ssd/tmp

Data Loading

DuckDB supports fast data loading from multiple formats:

Parquet (Default)

# BenchBox generates Parquet by default
benchbox run --platform duckdb --benchmark tpch --scale 1.0

CSV

# Force CSV format
benchbox run --platform duckdb --benchmark tpch --scale 1.0 \
  --format csv

Direct Query (No Load)

For testing queries without loading:

import duckdb

conn = duckdb.connect()
# Query Parquet files directly
result = conn.execute("""
    SELECT count(*) FROM read_parquet('lineitem/*.parquet')
""").fetchone()

Best Practices

1. Start Small

Begin with SF 0.01 to validate your workflow:

benchbox run --platform duckdb --benchmark tpch --scale 0.01

2. Use In-Memory for Speed

For benchmarks, in-memory is fastest:

# Default - no database option needed
benchbox run --platform duckdb --benchmark tpch --scale 1.0

3. Persist for Development

When iterating on queries, persist the database:

# Load once
benchbox run --platform duckdb --benchmark tpch --scale 1.0 \
  --phases generate,load \
  --platform-option database=./dev.duckdb

# Run queries multiple times
benchbox run --platform duckdb --benchmark tpch --scale 1.0 \
  --phases power \
  --platform-option database=./dev.duckdb
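
The persisted file is an ordinary DuckDB database, so you can also open it directly with the duckdb Python package between runs to check what was loaded (a minimal sketch using the path from the example above):

import duckdb

# Open the database created by the load phase above
conn = duckdb.connect("./dev.duckdb")

# List loaded tables and spot-check a row count
print(conn.execute("SHOW TABLES").fetchall())
print(conn.execute("SELECT count(*) FROM lineitem").fetchone())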

4. Match Production Scale

Test at similar scale to production platforms:

# If planning to run SF 100 on Snowflake, test at SF 1-10 on DuckDB
benchbox run --platform duckdb --benchmark tpch --scale 10.0

Troubleshooting

Out of Memory

# Increase memory or use spilling
benchbox run --platform duckdb --benchmark tpch --scale 10.0 \
  --platform-option memory_limit=16GB \
  --platform-option temp_directory=/tmp/duckdb

Slow Queries

# Check thread count
benchbox run --platform duckdb --benchmark tpch \
  --platform-option threads=$(nproc)

# Enable progress for visibility
benchbox run --platform duckdb --benchmark tpch \
  --platform-option enable_progress_bar=true
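
To dig into an individual slow query outside BenchBox, DuckDB's EXPLAIN ANALYZE executes the query and reports per-operator timings. A minimal sketch, assuming a persisted database such as ./dev.duckdb with a TPC-H lineitem table:

import duckdb

conn = duckdb.connect("./dev.duckdb")

# EXPLAIN ANALYZE runs the query and returns the profiled physical plan
rows = conn.execute(
    "EXPLAIN ANALYZE SELECT count(*) FROM lineitem WHERE l_quantity < 25"
).fetchall()
for _, plan in rows:
    print(plan)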

Database Locked

# Only one process can open a DuckDB database for writing
# Close other DuckDB connections or use a new database path
benchbox run --platform duckdb --benchmark tpch \
  --platform-option database=./new_benchmark.duckdb
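
Only one process can open a DuckDB database for writing; multiple processes can open it read-only instead. A minimal sketch using the duckdb package's read_only flag:

import duckdb

# Several processes may open the same file read-only at once;
# only one process may hold the read-write lock
conn = duckdb.connect("./new_benchmark.duckdb", read_only=True)
print(conn.execute("SELECT count(*) FROM lineitem").fetchone())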

Disk Space for Temp Files

# Check temp directory space
df -h /tmp

# Use different temp location
benchbox run --platform duckdb --benchmark tpch --scale 10.0 \
  --platform-option temp_directory=/data/tmp

Comparison with Other Platforms

  Feature        DuckDB            SQLite             PostgreSQL
  Setup          None              None               Server required
  Best for       OLAP              OLTP               General purpose
  Parallelism    Multi-threaded    Single-threaded    Multi-process
  Memory         In-process        In-process         Separate server process
  Scale          ~100 GB           ~10 GB             ~TB