DuckDB Platform¶
DuckDB is an in-process analytical database optimized for OLAP workloads. It’s included by default with BenchBox and suitable for local development, testing, and small-to-medium scale benchmarks.
Features¶
Zero configuration - No server setup required
In-process execution - Embedded in Python process
Columnar vectorized - Optimized for analytics
SQLite-compatible - File-based persistence
Parallel execution - Multi-threaded queries
Why DuckDB is Included by Default¶
DuckDB is included with BenchBox by default for convenience:
No setup required - Works out of the box with no external dependencies
Native file format support - Parquet and CSV support
Local execution - No network or service variability
Free and open source - No licensing costs
Suitable for development - Quick iteration on benchmark workflows
Choose the platform that best fits your specific requirements. See the Platform Selection Guide for help selecting the right platform for your use case.
Installation¶
# DuckDB is included with BenchBox by default
uv add benchbox
# Or with pip
pip install benchbox
Configuration¶
Default Usage¶
No configuration needed - DuckDB works immediately:
benchbox run --platform duckdb --benchmark tpch --scale 0.01
Persistent Database¶
By default, BenchBox uses in-memory databases. For persistence:
benchbox run --platform duckdb --benchmark tpch --scale 1.0 \
--platform-option database=/path/to/benchmark.duckdb
CLI Options¶
benchbox run --platform duckdb --benchmark tpch --scale 1.0 \
--platform-option threads=8 \
--platform-option memory_limit=4GB
Platform Options¶
Option |
Default |
Description |
|---|---|---|
|
:memory: |
Database path or :memory: |
|
(auto) |
Number of threads |
|
(auto) |
Maximum memory usage |
|
(auto) |
Temp file location |
|
true |
Show query progress |
Usage Examples¶
Quick Start¶
# Minimal benchmark
benchbox run --platform duckdb --benchmark tpch --scale 0.01
# With specific queries
benchbox run --platform duckdb --benchmark tpch --scale 0.1 \
--queries Q1,Q6,Q17
Scale Factor Guide¶
Scale Factor |
Data Size |
Use Case |
|---|---|---|
0.01 |
~10 MB |
Unit testing, CI/CD |
0.1 |
~100 MB |
Integration testing |
1.0 |
~1 GB |
Standard benchmarking |
10.0 |
~10 GB |
Performance testing |
Note: Execution times vary based on hardware, query complexity, and configuration. Run benchmarks to establish baselines for your environment.
With Tuning¶
# Apply optimizations (indexes, etc.)
benchbox run --platform duckdb --benchmark tpch --scale 1.0 \
--tuning tuned
Python API¶
from benchbox import TPCH
from benchbox.platforms.duckdb import DuckDBAdapter
# In-memory database
adapter = DuckDBAdapter()
# Or persistent
adapter = DuckDBAdapter(database="./benchmarks.duckdb")
benchmark = TPCH(scale_factor=0.1)
benchmark.generate_data()
adapter.load_benchmark(benchmark)
results = adapter.run_benchmark(benchmark)
print(f"Total runtime: {results.total_time:.2f}s")
Comparison Across Scales¶
from benchbox import TPCH
from benchbox.platforms.duckdb import DuckDBAdapter
scales = [0.01, 0.1, 1.0]
for sf in scales:
adapter = DuckDBAdapter()
benchmark = TPCH(scale_factor=sf)
benchmark.generate_data()
adapter.load_benchmark(benchmark)
results = adapter.run_benchmark(benchmark)
print(f"SF {sf}: {results.total_time:.2f}s")
Performance Features¶
Thread Configuration¶
# Control parallelism
benchbox run --platform duckdb --benchmark tpch \
--platform-option threads=4
Memory Limits¶
# Limit memory usage
benchbox run --platform duckdb --benchmark tpch --scale 10.0 \
--platform-option memory_limit=8GB
Temporary Storage¶
For large datasets that exceed memory:
benchbox run --platform duckdb --benchmark tpch --scale 10.0 \
--platform-option temp_directory=/fast/ssd/tmp
Data Loading¶
DuckDB supports fast data loading from multiple formats:
Parquet (Default)¶
# BenchBox generates Parquet by default
benchbox run --platform duckdb --benchmark tpch --scale 1.0
CSV¶
# Force CSV format
benchbox run --platform duckdb --benchmark tpch --scale 1.0 \
--format csv
Direct Query (No Load)¶
For testing queries without loading:
import duckdb
conn = duckdb.connect()
# Query Parquet files directly
result = conn.execute("""
SELECT count(*) FROM read_parquet('lineitem/*.parquet')
""").fetchone()
Best Practices¶
1. Start Small¶
Begin with SF 0.01 to validate your workflow:
benchbox run --platform duckdb --benchmark tpch --scale 0.01
2. Use In-Memory for Speed¶
For benchmarks, in-memory is fastest:
# Default - no database option needed
benchbox run --platform duckdb --benchmark tpch --scale 1.0
3. Persist for Development¶
When iterating on queries, persist the database:
# Load once
benchbox run --platform duckdb --benchmark tpch --scale 1.0 \
--phases generate,load \
--platform-option database=./dev.duckdb
# Run queries multiple times
benchbox run --platform duckdb --benchmark tpch --scale 1.0 \
--phases power \
--platform-option database=./dev.duckdb
4. Match Production Scale¶
Test at similar scale to production platforms:
# If planning to run SF 100 on Snowflake, test at SF 1-10 on DuckDB
benchbox run --platform duckdb --benchmark tpch --scale 10.0
Troubleshooting¶
Out of Memory¶
# Increase memory or use spilling
benchbox run --platform duckdb --benchmark tpch --scale 10.0 \
--platform-option memory_limit=16GB \
--platform-option temp_directory=/tmp/duckdb
Slow Queries¶
# Check thread count
benchbox run --platform duckdb --benchmark tpch \
--platform-option threads=$(nproc)
# Enable progress for visibility
benchbox run --platform duckdb --benchmark tpch \
--platform-option enable_progress_bar=true
Database Locked¶
# Only one connection allowed for write operations
# Close other DuckDB connections or use a new database path
benchbox run --platform duckdb --benchmark tpch \
--platform-option database=./new_benchmark.duckdb
Disk Space for Temp Files¶
# Check temp directory space
df -h /tmp
# Use different temp location
benchbox run --platform duckdb --benchmark tpch --scale 10.0 \
--platform-option temp_directory=/data/tmp
Comparison with Other Platforms¶
Feature |
DuckDB |
SQLite |
PostgreSQL |
|---|---|---|---|
Setup |
None |
None |
Server |
Best for |
OLAP |
OLTP |
General |
Parallelism |
Multi-thread |
Single-thread |
Multi-process |
Memory |
In-process |
In-process |
Separate |
Scale |
~100 GB |
~10 GB |
~TB |