Data Generation¶
BenchBox provides systematic data generation capabilities for all supported benchmarks. This guide covers both basic data generation concepts and the advanced smart data generation features that optimize performance and developer productivity.
Overview¶
Data generation in BenchBox is designed to be:
Scalable: Generate data from small test datasets to production-scale benchmarks
Intelligent: Automatically validate and skip regeneration when valid data exists
Fast: Optimized generation algorithms and caching
Reliable: Comprehensive error handling and validation
Flexible: Support for different output formats and directories
Basic Data Generation¶
Simple Generation¶
All benchmarks support a consistent generate_data() interface:
from benchbox import TPCH, TPCDS
# Generate TPC-H data
tpch = TPCH(scale_factor=0.1)
tables = tpch.generate_data()
print(f"Generated {len(tables)} TPC-H tables")
# Generate TPC-DS data
tpcds = TPCDS(scale_factor=0.1)
tables = tpcds.generate_data()
print(f"Generated {len(tables)} TPC-DS tables")
Scale Factor Configuration¶
Scale factors control the amount of data generated:
# Development and testing - small datasets
benchmark = TPCH(scale_factor=0.001) # ~1MB data
# CI/CD pipelines - fast execution
benchmark = TPCH(scale_factor=0.01) # ~10MB data
# Performance testing - moderate size
benchmark = TPCH(scale_factor=0.1) # ~100MB data
# Production validation - full scale
benchmark = TPCH(scale_factor=1.0) # ~1GB data
Output Directory Management¶
Control where data files are generated:
from pathlib import Path
# Default location (current directory)
benchmark = TPCH(scale_factor=0.1)
# Custom directory
benchmark = TPCH(
scale_factor=0.1,
output_dir="./benchmark-data/tpch"
)
# Temporary directory for testing
import tempfile
temp_dir = tempfile.mkdtemp()
benchmark = TPCH(scale_factor=0.01, output_dir=temp_dir)
Parallel Generation¶
Speed up data generation with parallel processing:
# Use multiple processes for faster generation
benchmark = TPCH(
scale_factor=1.0,
parallel=4 # Use 4 parallel processes
)
tables = benchmark.generate_data()
Smart Data Generation¶
BenchBox includes intelligent data generation that validates existing data and automatically skips regeneration when valid data is present.
How Smart Generation Works¶
Smart generation follows this process:
Check for existing data files in the output directory
Validate row counts against expected values for the scale factor
Verify file integrity and minimum size requirements
Handle platform-specific differences (e.g., TPC-H minimum scale factors)
Skip generation if valid data exists, or regenerate if needed
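The decision flow above can be sketched in a few lines. This is an illustrative simplification, not BenchBox's actual implementation — the real validator also checks row counts against the scale factor, and the function names and `.tbl` suffix here are assumptions:

```python
from pathlib import Path

def data_is_valid(output_dir, expected_tables, min_bytes=1):
    """Check that every expected table file exists and is non-empty."""
    base = Path(output_dir)
    return all(
        (base / f"{table}.tbl").exists()
        and (base / f"{table}.tbl").stat().st_size >= min_bytes
        for table in expected_tables
    )

def generate_data(output_dir, expected_tables, force_regenerate=False):
    """Skip generation when valid data already exists, unless forced."""
    if not force_regenerate and data_is_valid(output_dir, expected_tables):
        return "skipped"  # valid data found, nothing to do
    # ... the real generator (dbgen/dsdgen) would run here ...
    return "generated"
```

The production validator layers row-count checks and platform-specific handling on top of this skeleton.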
Default Behavior (Smart Generation)¶
Smart validation is enabled by default and works transparently:
from benchbox import TPCH
# First run - no data exists, will generate
benchmark = TPCH(scale_factor=1.0, output_dir="./data")
tables = benchmark.generate_data()
# Output: ⚠️ Data validation failed, generating TPC-H data...
# Second run - valid data exists, will skip
benchmark2 = TPCH(scale_factor=1.0, output_dir="./data")
tables2 = benchmark2.generate_data()
# Output: ✅ Valid TPC-H data found, skipping generation
Force Regeneration¶
When you need to regenerate data regardless of existing files:
# Force regeneration even if valid data exists
benchmark = TPCH(
scale_factor=1.0,
output_dir="./data",
force_regenerate=True
)
tables = benchmark.generate_data()
# Output: ⚠️ Force regeneration requested, generating data...
Verbose Validation Reports¶
Enable detailed validation information:
benchmark = TPCH(
scale_factor=1.0,
output_dir="./data",
verbose=True
)
tables = benchmark.generate_data()
Successful validation output:
✅ Valid TPC-H data found for scale factor 1.0
✅ Data validation PASSED
Validated 8 tables
Total data size: 1.2 GB
Skipping data generation
Failed validation output:
⚠️ Data validation failed for scale factor 1.0
❌ Data validation FAILED
Row count mismatches:
customer: expected 150,000, found 75,000
lineitem: expected 6,001,215, found 3,000,000
Issues:
- Table customer: expected ~150000 rows, found 75000 rows
- Table lineitem: expected ~6001215 rows, found 3000000 rows
Generating TPC-H data...
Benchmark-Specific Generation¶
TPC-H Data Generation¶
TPC-H uses the official dbgen tool for standards-compliant data:
from benchbox import TPCH
# Standard generation
tpch = TPCH(scale_factor=1.0)
tables = tpch.generate_data()
# Expected tables and approximate row counts (SF=1.0)
tables_info = {
'customer': 150_000,
'lineitem': 6_001_215,
'nation': 25, # Fixed size
'orders': 1_500_000,
'part': 200_000,
'partsupp': 800_000,
'region': 5, # Fixed size
'supplier': 10_000
}
TPC-H specific features:
Minimum scale factor of 0.1 (dbgen limitation)
Fixed-size dimension tables (nation, region)
Standards-compliant data distribution
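The counts above scale approximately linearly with the scale factor, while nation and region stay fixed. A sketch of that relationship — the base counts come from the table above, and exact generated counts (especially for lineitem) can vary slightly:

```python
# Base row counts at SF=1.0, taken from the table above
BASE_ROWS_SF1 = {
    'customer': 150_000,
    'lineitem': 6_001_215,
    'nation': 25,
    'orders': 1_500_000,
    'part': 200_000,
    'partsupp': 800_000,
    'region': 5,
    'supplier': 10_000,
}
FIXED_TABLES = {'nation', 'region'}  # dimension tables that never scale

def expected_rows(table, scale_factor):
    """Approximate expected row count for a TPC-H table at a given SF."""
    base = BASE_ROWS_SF1[table]
    return base if table in FIXED_TABLES else int(base * scale_factor)
```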
CoffeeShop Data Generation¶
The CoffeeShop benchmark has been rewritten to align with the public reference
generator. It now produces three tables (dim_locations, dim_products, and
order_lines) with an exploded fact table that averages 1.5 lines per order.
from benchbox import CoffeeShop
# Reference-aligned generator (small scale for development)
coffeeshop = CoffeeShop(scale_factor=0.001)
data_files = coffeeshop.generate_data()
print(data_files["order_lines"]) # Path to generated CSV file
Scale factors map directly to order counts via the reference generator's formula order_count = 50_000_000 * scale_factor. The table below outlines the recommended operating ranges:
| Scale Factor | Orders Generated | Approx. Order Lines | Approx. Size |
|---|---|---|---|
| 0.001 | 50,000 | ~75,000 | ~5 MB |
| 0.01 | 500,000 | ~750,000 | ~50 MB |
| 0.1 | 5,000,000 | ~7,500,000 | ~500 MB |
| 1.0 | 50,000,000 | ~75,000,000 | ~5 GB |
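The sizing in the table follows directly from the formula. A quick sketch of the arithmetic, using the ~1.5 lines-per-order average as the approximation:

```python
def coffeeshop_order_count(scale_factor):
    """Orders generated: order_count = 50_000_000 * scale_factor."""
    return int(50_000_000 * scale_factor)

def approx_order_lines(scale_factor, avg_lines_per_order=1.5):
    """Approximate fact-table rows after the order-line explosion."""
    return int(coffeeshop_order_count(scale_factor) * avg_lines_per_order)

print(coffeeshop_order_count(0.001))  # 50,000 orders at SF=0.001
print(approx_order_lines(0.001))      # ~75,000 order lines
```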
CoffeeShop specific highlights:
Seasonal (monthly) weighting, regional bias, and linear growth trend.
Deterministic order-line explosion following the 60/30/5/4/1 distribution.
Vendored seed files for locations and products ensure deterministic IDs and pricing windows.
Default output directories align with the BenchBox data layout (benchmark_runs/datagen/coffeeshop_{format_scale_factor(scale)}, regardless of whether you use the high-level CoffeeShop wrapper or instantiate the generator directly) and honour optional compression settings.
Query suite updated (SA*, PR*, TR*, TM*, QC*) to operate exclusively on the new schema.
Repeated runs with the same inputs overwrite the existing CSV files deterministically, ensuring identical table contents across regenerations.
Seed CSVs are vendored verbatim from JosueBogran/coffeeshopdatageneratorv2 (MIT licensed, commit 2a99993b6bca94c0bc04fae7c695e86cd152add1) and verified via SHA256 checksums in the unit test suite.
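Reading the 60/30/5/4/1 figures as percentage weights for one through five lines per order — an assumption about the exact mapping — the expected value works out to about 1.56, consistent with the "averages 1.5 lines per order" figure:

```python
# Assumed mapping: share of orders with 1..5 lines (60/30/5/4/1 percent)
LINES_PER_ORDER_WEIGHTS = {1: 0.60, 2: 0.30, 3: 0.05, 4: 0.04, 5: 0.01}

def mean_lines_per_order(weights=LINES_PER_ORDER_WEIGHTS):
    """Expected number of order lines per order under the distribution."""
    return sum(lines * share for lines, share in weights.items())
```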
TPC-DS Data Generation¶
TPC-DS uses the official dsdgen tool for complex analytical data:
from benchbox import TPCDS
# Standard generation
tpcds = TPCDS(scale_factor=1.0)
tables = tpcds.generate_data()
# TPC-DS generates 24 tables with complex relationships
print(f"Generated {len(tables)} TPC-DS tables")
TPC-DS specific features:
24 interconnected tables
Complex dimension and fact table relationships
Time-series data with seasonal patterns
Advanced query generation integration
Streaming output via the -FILTER Y flag (see below)
TPC-DS Streaming Generation¶
BenchBox includes patched dsdgen binaries that support streaming data to stdout, enabling efficient compression pipelines:
# Generate table directly to compressed file
dsdgen -TABLE ship_mode -SCALE 1 -FILTER Y | zstd > ship_mode.dat.zst
# Generate with fixed seed for reproducibility
dsdgen -TABLE date_dim -SCALE 1 -FILTER Y -RNGSEED 12345 | gzip > date_dim.dat.gz
This streaming capability reduces disk I/O and memory usage for large-scale data generation. See TPC-DS Streaming Data Generation for detailed examples.
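The shell pipelines above can also be driven from Python with subprocess. The helper below is a generic, illustrative sketch (not a BenchBox API); it assumes the producer and compressor binaries are on PATH:

```python
import subprocess

def pipe_to_compressed(producer_cmd, compressor_cmd, out_path):
    """Pipe producer_cmd's stdout through compressor_cmd into out_path."""
    with open(out_path, "wb") as out:
        producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
        compressor = subprocess.Popen(
            compressor_cmd, stdin=producer.stdout, stdout=out
        )
        producer.stdout.close()  # let the producer see SIGPIPE if zstd exits
        compressor.wait()
        producer.wait()

# Equivalent to: dsdgen -TABLE ship_mode -SCALE 1 -FILTER Y | zstd > ship_mode.dat.zst
# pipe_to_compressed(
#     ["dsdgen", "-TABLE", "ship_mode", "-SCALE", "1", "-FILTER", "Y"],
#     ["zstd"],
#     "ship_mode.dat.zst",
# )
```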
Other Benchmarks¶
# SSB (Star Schema Benchmark)
from benchbox import SSB
ssb = SSB(scale_factor=1.0)
tables = ssb.generate_data()
Data Validation¶
Automatic Validation¶
Smart generation automatically validates:
File existence: All required table files present
Row counts: Match expected values for scale factor
File sizes: Meet minimum size requirements
Data integrity: Files are readable and non-empty
Manual Validation¶
Access validation directly for custom workflows:
from benchbox.utils.data_validation import BenchmarkDataValidator
# Create validator for specific benchmark
validator = BenchmarkDataValidator("tpch", scale_factor=1.0)
# Validate a data directory
result = validator.validate_data_directory("./data")
# Check results
if result.valid:
print("✅ Data is valid")
print(f"Validated {len(result.tables_validated)} tables")
else:
print(f"❌ Validation failed: {len(result.issues)} issues")
for issue in result.issues:
print(f" - {issue}")
# Print detailed report
validator.print_validation_report(result, verbose=True)
Custom Validation Rules¶
For advanced validation scenarios:
# Check if regeneration is needed
should_regenerate, result = validator.should_regenerate_data("./data")
if should_regenerate:
print("Data needs regeneration:")
for issue in result.issues:
print(f" - {issue}")
else:
print("Existing data is valid")
# Force regeneration programmatically
force_regen = len(result.row_count_mismatches) > 0
benchmark = TPCH(scale_factor=1.0, force_regenerate=force_regen)
Performance Optimization¶
Smart Generation Benefits¶
Smart data generation avoids regenerating valid data:
Initial run: Full data generation (time depends on scale factor, hardware, and CPU cores)
Subsequent runs: Validation only - checks existing files and skips generation if valid
This provides significant time savings for iterative development workflows, CI/CD pipelines, and repeated benchmark runs regardless of scale factor.
Parallel Processing¶
Use parallel generation for large scale factors:
import multiprocessing
# Use all available CPU cores
cores = multiprocessing.cpu_count()
benchmark = TPCH(scale_factor=10.0, parallel=cores)
tables = benchmark.generate_data()
Caching Strategies¶
Organize data for maximum reuse:
# Organize by benchmark and scale factor
./benchmark-data/
├── tpch/
│ ├── sf-0.01/ # Small datasets for quick testing
│ ├── sf-0.1/ # Medium datasets for CI/CD
│ ├── sf-1.0/ # Large datasets for performance testing
│ └── sf-10.0/ # Production-scale datasets
└── tpcds/
├── sf-0.01/
└── sf-1.0/
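A small helper can keep this layout consistent across scripts. The function below is illustrative, not part of the BenchBox API:

```python
from pathlib import Path

def cached_data_dir(benchmark_name, scale_factor, root="./benchmark-data"):
    """Build a cache path like ./benchmark-data/tpch/sf-0.1."""
    return Path(root) / benchmark_name / f"sf-{scale_factor}"

# Reuse the same directory across runs so smart generation can skip work:
# benchmark = TPCH(scale_factor=0.1, output_dir=cached_data_dir("tpch", 0.1))
```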
Configuration Options¶
Constructor Parameters¶
All benchmarks support these common parameters:
benchmark = TPCH(
scale_factor=1.0, # Data scale factor
output_dir="./data", # Output directory
verbose=True, # Show detailed output
force_regenerate=False, # Skip smart validation
parallel=1 # Parallel generation processes
)
Environment Variables¶
Control behavior globally via environment variables:
# Force regeneration for all benchmarks
export BENCHBOX_FORCE_REGENERATE=true
# Enable verbose output
export BENCHBOX_VERBOSE=true
# Set default output directory
export BENCHBOX_DATA_DIR=./benchmark_runs/datagen
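A sketch of how such variables might be consumed in Python. The variable names come from the section above; the truthy-value parsing is an assumption, not BenchBox's documented behavior:

```python
import os

def env_flag(name, default=False):
    """Interpret common truthy strings in an environment variable."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")

force_regenerate = env_flag("BENCHBOX_FORCE_REGENERATE")
verbose = env_flag("BENCHBOX_VERBOSE")
data_dir = os.environ.get("BENCHBOX_DATA_DIR", "./benchmark_runs/datagen")
```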
Platform-Specific Settings¶
Some platforms have special considerations:
# ClickHouse - handles large scale factors efficiently
benchmark = TPCH(scale_factor=100.0, platform="clickhouse")
# DuckDB - configured for analytical workloads
benchmark = TPCH(scale_factor=10.0, platform="duckdb")
Advanced Usage¶
Data Format Conversion¶
Generate data in different formats:
# Generate standard delimited files
tpch = TPCH(scale_factor=0.1)
tables = tpch.generate_data()
# Convert to Parquet (requires pyarrow; TPC-H .tbl files are pipe-delimited)
import pyarrow.csv as pacsv
import pyarrow.parquet as pq
for table_name, file_path in tables.items():
    parquet_path = file_path.with_suffix('.parquet')
    arrow_table = pacsv.read_csv(
        file_path,
        parse_options=pacsv.ParseOptions(delimiter='|'),
        read_options=pacsv.ReadOptions(autogenerate_column_names=True),
    )
    pq.write_table(arrow_table, parquet_path)
Custom Data Directories¶
Implement custom directory structures:
from pathlib import Path
from datetime import datetime
# Timestamp-based directories
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
data_dir = Path(f"./runs/{timestamp}/tpch_sf1")
benchmark = TPCH(scale_factor=1.0, output_dir=data_dir)
tables = benchmark.generate_data()
Integration with CI/CD¶
Cache data between runs:
# GitHub Actions example
- name: Cache benchmark data
uses: actions/cache@v3
with:
path: ./benchmark-data
key: benchbox-data-${{ runner.os }}-${{ hashFiles('**/benchmark-config.yaml') }}
- name: Generate benchmark data
run: |
python -c "
from benchbox import TPCH
tpch = TPCH(scale_factor=0.01, output_dir='./benchmark-data/tpch')
tpch.generate_data()
"
Error Handling¶
Common Issues and Solutions¶
Scale Factor Mismatches¶
Problem: Existing data was generated with different scale factor
# Solution 1: Use matching scale factor
benchmark = TPCH(scale_factor=0.5) # Match existing data
# Solution 2: Force regeneration with new scale factor
benchmark = TPCH(scale_factor=1.0, force_regenerate=True)
Missing or Corrupted Files¶
Problem: Some data files are missing or corrupted
# Smart generation automatically detects and regenerates
benchmark = TPCH(scale_factor=1.0, verbose=True)
tables = benchmark.generate_data()
# Output: "Missing files detected, regenerating all data"
Permission Errors¶
Problem: Cannot read/write data files
# Fix directory permissions
chmod 755 ./data
chown $USER:$USER ./data
Disk Space Issues¶
Problem: Not enough space for large scale factors
# Check required space before generation
def estimate_data_size(benchmark, scale_factor):
base_size_mb = {
'tpch': 1000, # ~1GB for SF=1.0
'tpcds': 2000, # ~2GB for SF=1.0
}
return base_size_mb.get(benchmark, 1000) * scale_factor
required_mb = estimate_data_size('tpch', 10.0)
print(f"Required space: ~{required_mb}MB")
Debug Mode¶
Enable detailed debugging:
import logging
logging.basicConfig(level=logging.DEBUG)
benchmark = TPCH(scale_factor=1.0, verbose=True)
tables = benchmark.generate_data()
# Shows detailed generation and validation steps
Best Practices¶
Development Workflow¶
Start Small: Use scale factors 0.001-0.01 during development
Cache Aggressively: Reuse generated data when possible
Validate Early: Test with small datasets before scaling up
Monitor Resources: Track memory and disk usage
# Development pattern
def setup_benchmark(scale_factor=0.01):
"""Setup benchmark with smart caching."""
data_dir = Path("./dev-cache") / f"tpch-sf{scale_factor}"
return TPCH(scale_factor=scale_factor, output_dir=data_dir, verbose=True)
# Use in development
tpch = setup_benchmark() # Uses cache if available
tables = tpch.generate_data()
Production Deployment¶
Pre-generate Data: Generate data during deployment, not runtime
Validate Integrity: Always validate data before benchmarks
Monitor Performance: Track generation time and data quality
Plan Storage: Estimate storage requirements for scale factors
# Production pattern
def setup_production_benchmark(scale_factor=1.0):
"""Setup production benchmark with validation."""
data_dir = Path("/opt/benchmark-data") / f"tpch-sf{scale_factor}"
benchmark = TPCH(
scale_factor=scale_factor,
output_dir=data_dir,
verbose=True,
parallel=multiprocessing.cpu_count()
)
# Pre-validate or generate
tables = benchmark.generate_data()
# Additional validation
validator = BenchmarkDataValidator("tpch", scale_factor)
result = validator.validate_data_directory(data_dir)
if not result.valid:
raise RuntimeError(f"Data validation failed: {result.issues}")
return benchmark, tables
Testing Strategy¶
Unit Tests: Test generation with small scale factors
Integration Tests: Validate full benchmark workflows
Performance Tests: Measure generation performance
Regression Tests: Detect data quality changes
import pytest
import tempfile
def test_data_generation():
"""Test basic data generation."""
with tempfile.TemporaryDirectory() as temp_dir:
benchmark = TPCH(scale_factor=0.001, output_dir=temp_dir)
tables = benchmark.generate_data()
# Validate results
assert len(tables) == 8
for table_path in tables.values():
assert table_path.exists()
assert table_path.stat().st_size > 0
def test_smart_generation():
"""Test smart generation skip logic."""
with tempfile.TemporaryDirectory() as temp_dir:
# First generation
benchmark1 = TPCH(scale_factor=0.001, output_dir=temp_dir)
tables1 = benchmark1.generate_data()
# Second generation should skip
benchmark2 = TPCH(scale_factor=0.001, output_dir=temp_dir, verbose=True)
tables2 = benchmark2.generate_data()
# Should return same tables
assert tables1.keys() == tables2.keys()
Supported Benchmarks¶
Data generation support by benchmark:
| Benchmark | Smart Generation | Row Validation | Parallel Support | Notes |
|---|---|---|---|---|
| TPC-H | ✅ Full | ✅ Yes | ✅ Yes | Official dbgen tool |
| TPC-DS | ✅ Full | ✅ Yes | ✅ Yes | Official dsdgen tool |
| Read Primitives | ✅ Full | ✅ Yes | ✅ Yes | Shares TPC-H data |
| SSB | ⚠️ Basic | ❌ No | ✅ Yes | Will be improved |
| ClickBench | ⚠️ Basic | ❌ No | ❌ No | External data |
| Join Order | ⚠️ Basic | ❌ No | ❌ No | IMDB dataset |
| Others | ⚠️ Basic | ❌ No | ❌ No | Being improved |
Future Enhancements¶
Planned data generation improvements:
Checksum Validation: Verify data integrity using file checksums
Schema Validation: Check table structure and column types
Compression Support: Handle compressed data files (.gz, .bz2)
Remote Data: Support cloud-stored benchmark datasets
Incremental Updates: Update only changed tables
Cross-Platform Sync: Share data between database platforms
Format Conversion: Auto-convert between CSV, Parquet, etc.
Resource Estimation: Predict generation time and space requirements
Troubleshooting¶
Performance Issues¶
# Monitor generation performance
import time
start_time = time.time()
benchmark = TPCH(scale_factor=1.0, verbose=True, parallel=4)
tables = benchmark.generate_data()
generation_time = time.time() - start_time
print(f"Generated {len(tables)} tables in {generation_time:.2f} seconds")
print(f"Average: {generation_time/len(tables):.2f}s per table")
Memory Issues¶
# For large scale factors, monitor memory usage
import psutil
def monitor_generation():
process = psutil.Process()
print(f"Memory before: {process.memory_info().rss / 1024**2:.1f} MB")
benchmark = TPCH(scale_factor=10.0, parallel=2)
tables = benchmark.generate_data()
print(f"Memory after: {process.memory_info().rss / 1024**2:.1f} MB")
return tables
Validation Issues¶
# Debug validation problems
from benchbox.utils.data_validation import BenchmarkDataValidator
validator = BenchmarkDataValidator("tpch", scale_factor=1.0)
result = validator.validate_data_directory("./data")
# Detailed analysis
if not result.valid:
print("Validation Issues:")
print(f"Missing tables: {result.missing_tables}")
print(f"Row count mismatches: {result.row_count_mismatches}")
print(f"File size issues: {[f for f, size in result.file_size_info.items() if size == 0]}")
Data generation in BenchBox is designed to be powerful, intelligent, and developer-friendly. The smart generation features avoid redundant regeneration while maintaining data integrity and standards compliance.