Data Generation

Tags: intermediate, guide, data-generation

BenchBox provides systematic data generation capabilities for all supported benchmarks. This guide covers both basic data generation concepts and the advanced smart data generation features that optimize performance and developer productivity.

Overview

Data generation in BenchBox is designed to be:

  • Scalable: Generate data from small test datasets to production-scale benchmarks

  • Intelligent: Automatically validate and skip regeneration when valid data exists

  • Fast: Optimized generation algorithms and caching

  • Reliable: Comprehensive error handling and validation

  • Flexible: Support for different output formats and directories

Basic Data Generation

Simple Generation

All benchmarks support a consistent generate_data() interface:

from benchbox import TPCH, TPCDS

# Generate TPC-H data
tpch = TPCH(scale_factor=0.1)
tables = tpch.generate_data()
print(f"Generated {len(tables)} TPC-H tables")

# Generate TPC-DS data
tpcds = TPCDS(scale_factor=0.1)
tables = tpcds.generate_data()
print(f"Generated {len(tables)} TPC-DS tables")

Scale Factor Configuration

Scale factors control the amount of data generated:

# Development and testing - small datasets
benchmark = TPCH(scale_factor=0.001)  # ~1MB data

# CI/CD pipelines - fast execution
benchmark = TPCH(scale_factor=0.01)   # ~10MB data

# Performance testing - moderate size
benchmark = TPCH(scale_factor=0.1)    # ~100MB data

# Production validation - full scale
benchmark = TPCH(scale_factor=1.0)    # ~1GB data

Output Directory Management

Control where data files are generated:

from pathlib import Path

# Default location (current directory)
benchmark = TPCH(scale_factor=0.1)

# Custom directory
benchmark = TPCH(
    scale_factor=0.1,
    output_dir="./benchmark-data/tpch"
)

# Temporary directory for testing
import tempfile
temp_dir = tempfile.mkdtemp()
benchmark = TPCH(scale_factor=0.01, output_dir=temp_dir)

Parallel Generation

Speed up data generation with parallel processing:

# Use multiple processes for faster generation
benchmark = TPCH(
    scale_factor=1.0,
    parallel=4  # Use 4 parallel processes
)
tables = benchmark.generate_data()

Smart Data Generation

BenchBox includes intelligent data generation that validates existing data and automatically skips regeneration when valid data is present.

How Smart Generation Works

Smart generation follows this process (a sketch of the decision logic follows the list):

  1. Check for existing data files in the output directory

  2. Validate row counts against expected values for the scale factor

  3. Verify file integrity and minimum size requirements

  4. Handle platform-specific differences (e.g., TPC-H minimum scale factors)

  5. Skip generation if valid data exists, or regenerate if needed
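
The decision flow can be sketched as follows. This is illustrative only: helper names such as validate_existing_data, run_generator, and existing_tables are hypothetical stand-ins, not BenchBox internals.

# Illustrative sketch of the smart-generation decision flow; the helpers
# (validate_existing_data, run_generator, existing_tables) are hypothetical
def smart_generate(benchmark, output_dir, force_regenerate=False):
    if force_regenerate:
        return run_generator(benchmark, output_dir)      # regenerate unconditionally

    result = validate_existing_data(benchmark, output_dir)  # steps 1-4 above
    if result.valid:
        return existing_tables(output_dir)               # step 5: skip generation
    return run_generator(benchmark, output_dir)          # step 5: regenerate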

Default Behavior (Smart Generation)

Smart validation is enabled by default and works transparently:

from benchbox import TPCH

# First run - no data exists, will generate
benchmark = TPCH(scale_factor=1.0, output_dir="./data")
tables = benchmark.generate_data()
# Output: ⚠️ Data validation failed, generating TPC-H data...

# Second run - valid data exists, will skip
benchmark2 = TPCH(scale_factor=1.0, output_dir="./data")
tables2 = benchmark2.generate_data()
# Output: ✅ Valid TPC-H data found, skipping generation

Force Regeneration

When you need to regenerate data regardless of existing files:

# Force regeneration even if valid data exists
benchmark = TPCH(
    scale_factor=1.0,
    output_dir="./data",
    force_regenerate=True
)
tables = benchmark.generate_data()
# Output: ⚠️ Force regeneration requested, generating data...

Verbose Validation Reports

Enable detailed validation information:

benchmark = TPCH(
    scale_factor=1.0,
    output_dir="./data",
    verbose=True
)
tables = benchmark.generate_data()

Successful validation output:

✅ Valid TPC-H data found for scale factor 1.0
✅ Data validation PASSED
   Validated 8 tables
   Total data size: 1.2 GB
   Skipping data generation

Failed validation output:

⚠️  Data validation failed for scale factor 1.0
❌ Data validation FAILED
   Row count mismatches:
     customer: expected 150,000, found 75,000
     lineitem: expected 6,001,215, found 3,000,000
   Issues:
     - Table customer: expected ~150000 rows, found 75000 rows
     - Table lineitem: expected ~6001215 rows, found 3000000 rows
   Generating TPC-H data...

Benchmark-Specific Generation

TPC-H Data Generation

TPC-H uses the official dbgen tool for standards-compliant data:

from benchbox import TPCH

# Standard generation
tpch = TPCH(scale_factor=1.0)
tables = tpch.generate_data()

# Expected tables and approximate row counts (SF=1.0)
tables_info = {
    'customer': 150_000,
    'lineitem': 6_001_215,
    'nation': 25,          # Fixed size
    'orders': 1_500_000,
    'part': 200_000,
    'partsupp': 800_000,
    'region': 5,           # Fixed size
    'supplier': 10_000
}

TPC-H specific features:

  • Minimum scale factor of 0.1 (dbgen limitation)

  • Fixed-size dimension tables (nation, region)

  • Standards-compliant data distribution

CoffeeShop Data Generation

The CoffeeShop benchmark has been rewritten to align with the public reference generator. It now produces three tables (dim_locations, dim_products, and order_lines) with an exploded fact table that averages 1.5 lines per order.

from benchbox import CoffeeShop

# Reference-aligned generator (small scale for development)
coffeeshop = CoffeeShop(scale_factor=0.001)
data_files = coffeeshop.generate_data()
print(data_files["order_lines"])  # Path to generated CSV file

Scale factors map directly to order counts via the formula order_count = 50_000_000 * scale_factor. The table below outlines the recommended operating ranges (a one-liner reproducing these figures follows the table):

| Scale Factor | Orders Generated | Approx. Order Lines | Approx. Size |
|--------------|------------------|---------------------|--------------|
| 0.001        | 50,000           | ~75,000             | ~5 MB        |
| 0.01         | 500,000          | ~750,000            | ~50 MB       |
| 0.1          | 5,000,000        | ~7,500,000          | ~500 MB      |
| 1.0          | 50,000,000       | ~75,000,000         | ~5 GB        |
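
The figures above follow directly from the formula and the ~1.5 lines-per-order average; a quick check:

# Reproduce the table rows from the formula and the ~1.5 lines/order average
for sf in (0.001, 0.01, 0.1, 1.0):
    orders = int(50_000_000 * sf)
    print(f"SF {sf}: {orders:,} orders, ~{int(orders * 1.5):,} order lines")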

CoffeeShop specific highlights:

  • Seasonal (monthly) weighting, regional bias, and linear growth trend.

  • Deterministic order-line explosion following the 60/30/5/4/1 distribution (a sampling sketch follows this list).

  • Vendored seed files for locations and products ensure deterministic IDs and pricing windows.

  • Default output directories follow the BenchBox data layout (benchmark_runs/datagen/coffeeshop_{format_scale_factor(scale)}), whether you use the high-level CoffeeShop wrapper or instantiate the generator directly, and honour optional compression settings.

  • Query suite updated (SA*, PR*, TR*, TM*, QC*) to operate exclusively on the new schema.

  • Repeated runs with the same inputs overwrite the existing CSV files deterministically, ensuring identical table contents across regenerations.

  • Seed CSVs are vendored verbatim from JosueBogran/coffeeshopdatageneratorv2 (MIT licensed, commit 2a99993b6bca94c0bc04fae7c695e86cd152add1) and verified via SHA256 checksums in the unit test suite.
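
A minimal sketch of sampling a 60/30/5/4/1 line-count distribution deterministically. The weights come from the list above, but the sampler itself is illustrative, not the BenchBox generator.

import random

# 60/30/5/4/1: percentage of orders with 1..5 order lines (weights from above)
LINE_COUNT_WEIGHTS = {1: 60, 2: 30, 3: 5, 4: 4, 5: 1}

def lines_for_order(rng):
    counts, weights = zip(*LINE_COUNT_WEIGHTS.items())
    return rng.choices(counts, weights=weights, k=1)[0]

rng = random.Random(42)  # fixed seed keeps repeated runs deterministic
sample = [lines_for_order(rng) for _ in range(100_000)]
print(sum(sample) / len(sample))  # empirical mean (~1.56 with these exact weights)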

TPC-DS Data Generation

TPC-DS uses the official dsdgen tool for complex analytical data:

from benchbox import TPCDS

# Standard generation
tpcds = TPCDS(scale_factor=1.0)
tables = tpcds.generate_data()

# TPC-DS generates 24 tables with complex relationships
print(f"Generated {len(tables)} TPC-DS tables")

TPC-DS specific features:

  • 24 interconnected tables

  • Complex dimension and fact table relationships

  • Time-series data with seasonal patterns

  • Advanced query generation integration

  • Streaming output via -FILTER Y flag (see below)

TPC-DS Streaming Generation

BenchBox includes patched dsdgen binaries that support streaming data to stdout, enabling efficient compression pipelines:

# Generate table directly to compressed file
dsdgen -TABLE ship_mode -SCALE 1 -FILTER Y | zstd > ship_mode.dat.zst

# Generate with fixed seed for reproducibility
dsdgen -TABLE date_dim -SCALE 1 -FILTER Y -RNGSEED 12345 | gzip > date_dim.dat.gz

This streaming capability reduces disk I/O and memory usage for large-scale data generation. See TPC-DS Streaming Data Generation for detailed examples.
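
The same pipeline can be driven from Python. The sketch below assumes the patched dsdgen and zstd binaries are on your PATH and uses only the flags shown above.

import subprocess
from pathlib import Path

def stream_table_compressed(table, scale, out_path, seed=None):
    """Pipe dsdgen stdout through zstd into a compressed file (sketch only)."""
    cmd = ["dsdgen", "-TABLE", table, "-SCALE", str(scale), "-FILTER", "Y"]
    if seed is not None:
        cmd += ["-RNGSEED", str(seed)]
    with open(out_path, "wb") as sink:
        dsdgen = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        subprocess.run(["zstd", "-q"], stdin=dsdgen.stdout, stdout=sink, check=True)
        dsdgen.stdout.close()
        if dsdgen.wait() != 0:
            raise RuntimeError(f"dsdgen failed for table {table}")

stream_table_compressed("ship_mode", 1, Path("ship_mode.dat.zst"))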

Primitives Benchmark - Shared TPC-H Data

The Read Primitives benchmark automatically shares data with TPC-H to avoid duplication:

from benchbox import ReadPrimitives

# Read Primitives automatically uses TPC-H data
read_primitives = ReadPrimitives(scale_factor=0.1)
tables = read_primitives.generate_data()

How Primitives Data Sharing Works:

  1. Automatic Path Selection: Primitives defaults to the same data directory as TPC-H (tpch_sf*), not a separate primitives_sf* directory

  2. Shared Data Files: Primitives uses TPC-H tables (customer, orders, lineitem, etc.) directly

  3. No Duplication: One set of TPC-H data serves both benchmarks, saving disk space and generation time

  4. Manifest Compatibility: Primitives accepts TPC-H manifests without modification

Benefits of Data Sharing:

  • Disk Space Savings: Eliminates duplicate TPC-H data (saves GBs for larger scale factors)

  • Faster Setup: If TPC-H data exists, Primitives uses it immediately without regeneration

  • Consistency: Both benchmarks use identical data, ensuring comparable results

Example - Shared Data Usage:

from benchbox import TPCH, ReadPrimitives

# Generate TPC-H data once
tpch = TPCH(scale_factor=1.0)  # Data goes to: data/tpch_sf1.0/
tpch_tables = tpch.generate_data()

# Primitives automatically uses the same data
read_primitives = ReadPrimitives(scale_factor=1.0)  # Also uses: data/tpch_sf1.0/
prim_tables = read_primitives.generate_data()  # No regeneration needed!

# Both benchmarks share the same table files
assert tpch_tables['customer'] == prim_tables['customer']  # Same file path

Custom Output Paths:

You can still use custom paths for isolated Primitives data:

# Isolated Primitives data (won't share with TPC-H)
read_primitives = ReadPrimitives(
    scale_factor=0.1,
    output_dir="./custom-primitives-data"
)
tables = read_primitives.generate_data()  # Generates fresh TPC-H data here

Other Benchmarks

# SSB (Star Schema Benchmark)
from benchbox import SSB
ssb = SSB(scale_factor=1.0)
tables = ssb.generate_data()

Data Validation

Automatic Validation

Smart generation automatically validates the following (a hand-rolled sketch of these checks appears after the list):

  • File existence: All required table files present

  • Row counts: Match expected values for scale factor

  • File sizes: Meet minimum size requirements

  • Data integrity: Files are readable and non-empty
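
For intuition, here is a hand-rolled sketch of the same checks; the expected row counts and the .tbl file naming are illustrative assumptions, and BenchmarkDataValidator (below) is the supported API.

from pathlib import Path

# Illustrative expected counts (TPC-H SF=1.0 values from earlier in this guide)
EXPECTED_ROWS = {"customer": 150_000, "supplier": 10_000}

def rough_checks(data_dir, tolerance=0.05):
    issues = []
    for table, expected in EXPECTED_ROWS.items():
        path = Path(data_dir) / f"{table}.tbl"   # file naming is an assumption
        if not path.exists():                    # file existence
            issues.append(f"missing {path.name}")
        elif path.stat().st_size == 0:           # minimum size / non-empty
            issues.append(f"empty {path.name}")
        else:
            with path.open() as f:               # readable + row count check
                rows = sum(1 for _ in f)
            if abs(rows - expected) > expected * tolerance:
                issues.append(f"{table}: expected ~{expected} rows, found {rows}")
    return issues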

Manual Validation

Access validation directly for custom workflows:

from benchbox.utils.data_validation import BenchmarkDataValidator

# Create validator for specific benchmark
validator = BenchmarkDataValidator("tpch", scale_factor=1.0)

# Validate a data directory
result = validator.validate_data_directory("./data")

# Check results
if result.valid:
    print("✅ Data is valid")
    print(f"Validated {len(result.tables_validated)} tables")
else:
    print(f"❌ Validation failed: {len(result.issues)} issues")
    for issue in result.issues:
        print(f"  - {issue}")

# Print detailed report
validator.print_validation_report(result, verbose=True)

Custom Validation Rules

For advanced validation scenarios:

# Check if regeneration is needed
should_regenerate, result = validator.should_regenerate_data("./data")

if should_regenerate:
    print("Data needs regeneration:")
    for issue in result.issues:
        print(f"  - {issue}")
else:
    print("Existing data is valid")

# Force regeneration programmatically
force_regen = len(result.row_count_mismatches) > 0
benchmark = TPCH(scale_factor=1.0, force_regenerate=force_regen)

Performance Optimization

Smart Generation Benefits

Smart data generation avoids regenerating valid data:

  • Initial run: Full data generation (time depends on scale factor, hardware, and CPU cores)

  • Subsequent runs: Validation only - checks existing files and skips generation if valid

This provides significant time savings for iterative development workflows, CI/CD pipelines, and repeated benchmark runs regardless of scale factor.
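
You can observe the saving directly by timing two consecutive runs against the same directory:

import tempfile
import time

from benchbox import TPCH

with tempfile.TemporaryDirectory() as tmp:
    for run in ("initial", "subsequent"):
        start = time.perf_counter()
        TPCH(scale_factor=0.01, output_dir=tmp).generate_data()
        print(f"{run} run: {time.perf_counter() - start:.2f}s")
# The subsequent run performs validation only and should finish much faster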

Parallel Processing

Use parallel generation for large scale factors:

import multiprocessing

# Use all available CPU cores
cores = multiprocessing.cpu_count()
benchmark = TPCH(scale_factor=10.0, parallel=cores)
tables = benchmark.generate_data()

Caching Strategies

Organize data for maximum reuse:

# Organize by benchmark and scale factor
./benchmark-data/
├── tpch/
│   ├── sf-0.01/    # Small datasets for quick testing
│   ├── sf-0.1/     # Medium datasets for CI/CD
│   ├── sf-1.0/     # Large datasets for performance testing
│   └── sf-10.0/    # Production-scale datasets
└── tpcds/
    ├── sf-0.01/
    └── sf-1.0/
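
A small helper keeps benchmarks pointed at this layout (the helper itself is hypothetical, not part of BenchBox):

from pathlib import Path

from benchbox import TPCH

def cached_data_dir(benchmark_name, scale_factor, root=Path("./benchmark-data")):
    """Map (benchmark, scale factor) onto the cache layout shown above."""
    return root / benchmark_name / f"sf-{scale_factor}"

tpch = TPCH(scale_factor=0.1, output_dir=cached_data_dir("tpch", 0.1))
tables = tpch.generate_data()  # reuses the cached files when they validate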

Configuration Options

Constructor Parameters

All benchmarks support these common parameters:

benchmark = TPCH(
    scale_factor=1.0,           # Data scale factor
    output_dir="./data",        # Output directory
    verbose=True,               # Show detailed output
    force_regenerate=False,     # When True, skip validation and always regenerate
    parallel=1                  # Parallel generation processes
)

Environment Variables

Control behavior globally via environment variables:

# Force regeneration for all benchmarks
export BENCHBOX_FORCE_REGENERATE=true

# Enable verbose output
export BENCHBOX_VERBOSE=true

# Set default output directory
export BENCHBOX_DATA_DIR=./benchmark_runs/datagen

Platform-Specific Settings

Some platforms have special considerations:

# ClickHouse - handles large scale factors efficiently
benchmark = TPCH(scale_factor=100.0, platform="clickhouse")

# DuckDB - configured for analytical workloads
benchmark = TPCH(scale_factor=10.0, platform="duckdb")

Advanced Usage

Data Format Conversion

Generate data in different formats:

import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Generate standard delimited files
tpch = TPCH(scale_factor=0.1)
tables = tpch.generate_data()

# Convert to Parquet (if pyarrow is available); the delimiter and header
# settings below are assumptions - adjust them to match your generated files
for table_name, file_path in tables.items():
    parquet_path = file_path.with_suffix('.parquet')
    arrow_table = pacsv.read_csv(
        file_path,
        read_options=pacsv.ReadOptions(autogenerate_column_names=True),
        parse_options=pacsv.ParseOptions(delimiter='|'))
    pq.write_table(arrow_table, parquet_path)

Custom Data Directories

Implement custom directory structures:

from pathlib import Path
from datetime import datetime

# Timestamp-based directories
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
data_dir = Path(f"./runs/{timestamp}/tpch_sf1")

benchmark = TPCH(scale_factor=1.0, output_dir=data_dir)
tables = benchmark.generate_data()

Integration with CI/CD

Cache data between runs:

# GitHub Actions example
- name: Cache benchmark data
  uses: actions/cache@v3
  with:
    path: ./benchmark-data
    key: benchbox-data-${{ runner.os }}-${{ hashFiles('**/benchmark-config.yaml') }}

- name: Generate benchmark data
  run: |
    python -c "
    from benchbox import TPCH
    tpch = TPCH(scale_factor=0.01, output_dir='./benchmark-data/tpch')
    tpch.generate_data()
    "

Error Handling

Common Issues and Solutions

Scale Factor Mismatches

Problem: Existing data was generated with a different scale factor

# Solution 1: Use matching scale factor
benchmark = TPCH(scale_factor=0.5)  # Match existing data

# Solution 2: Force regeneration with new scale factor
benchmark = TPCH(scale_factor=1.0, force_regenerate=True)

Missing or Corrupted Files

Problem: Some data files are missing or corrupted

# Smart generation automatically detects and regenerates
benchmark = TPCH(scale_factor=1.0, verbose=True)
tables = benchmark.generate_data()
# Output: "Missing files detected, regenerating all data"

Permission Errors

Problem: Cannot read/write data files

# Fix directory permissions
chmod 755 ./data
chown $USER:$USER ./data

Disk Space Issues

Problem: Not enough space for large scale factors

# Check required space before generation
def estimate_data_size(benchmark_name, scale_factor):
    """Rough on-disk size estimate in MB for a benchmark at a given scale factor."""
    base_size_mb = {
        'tpch': 1000,    # ~1GB for SF=1.0
        'tpcds': 2000,   # ~2GB for SF=1.0
    }
    return base_size_mb.get(benchmark_name, 1000) * scale_factor

required_mb = estimate_data_size('tpch', 10.0)
print(f"Required space: ~{required_mb:.0f}MB")

Debug Mode

Enable detailed debugging:

import logging
logging.basicConfig(level=logging.DEBUG)

benchmark = TPCH(scale_factor=1.0, verbose=True)
tables = benchmark.generate_data()
# Shows detailed generation and validation steps

Best Practices

Development Workflow

  1. Start Small: Use scale factors 0.001-0.01 during development

  2. Cache Aggressively: Reuse generated data when possible

  3. Validate Early: Test with small datasets before scaling up

  4. Monitor Resources: Track memory and disk usage

# Development pattern
from pathlib import Path

from benchbox import TPCH

def setup_benchmark(scale_factor=0.01):
    """Setup benchmark with smart caching."""
    data_dir = Path("./dev-cache") / f"tpch-sf{scale_factor}"
    return TPCH(scale_factor=scale_factor, output_dir=data_dir, verbose=True)

# Use in development
tpch = setup_benchmark()  # Uses cache if available
tables = tpch.generate_data()

Production Deployment

  1. Pre-generate Data: Generate data during deployment, not runtime

  2. Validate Integrity: Always validate data before benchmarks

  3. Monitor Performance: Track generation time and data quality

  4. Plan Storage: Estimate storage requirements for scale factors

# Production pattern
import multiprocessing
from pathlib import Path

from benchbox import TPCH
from benchbox.utils.data_validation import BenchmarkDataValidator

def setup_production_benchmark(scale_factor=1.0):
    """Setup production benchmark with validation."""
    data_dir = Path("/opt/benchmark-data") / f"tpch-sf{scale_factor}"

    benchmark = TPCH(
        scale_factor=scale_factor,
        output_dir=data_dir,
        verbose=True,
        parallel=multiprocessing.cpu_count()
    )

    # Pre-validate or generate
    tables = benchmark.generate_data()

    # Additional validation
    validator = BenchmarkDataValidator("tpch", scale_factor)
    result = validator.validate_data_directory(data_dir)

    if not result.valid:
        raise RuntimeError(f"Data validation failed: {result.issues}")

    return benchmark, tables

Testing Strategy

  1. Unit Tests: Test generation with small scale factors

  2. Integration Tests: Validate full benchmark workflows

  3. Performance Tests: Measure generation performance

  4. Regression Tests: Detect data quality changes

import tempfile

from benchbox import TPCH

def test_data_generation():
    """Test basic data generation."""
    with tempfile.TemporaryDirectory() as temp_dir:
        benchmark = TPCH(scale_factor=0.001, output_dir=temp_dir)
        tables = benchmark.generate_data()

        # Validate results
        assert len(tables) == 8
        for table_path in tables.values():
            assert table_path.exists()
            assert table_path.stat().st_size > 0

def test_smart_generation():
    """Test smart generation skip logic."""
    with tempfile.TemporaryDirectory() as temp_dir:
        # First generation
        benchmark1 = TPCH(scale_factor=0.001, output_dir=temp_dir)
        tables1 = benchmark1.generate_data()

        # Second generation should skip
        benchmark2 = TPCH(scale_factor=0.001, output_dir=temp_dir, verbose=True)
        tables2 = benchmark2.generate_data()

        # Should return same tables
        assert tables1.keys() == tables2.keys()

Supported Benchmarks

Data generation support by benchmark:

| Benchmark       | Smart Generation | Row Validation | Parallel Support | Notes                |
|-----------------|------------------|----------------|------------------|----------------------|
| TPC-H           | ✅ Full          | ✅ Yes         | ✅ Yes           | Official dbgen tool  |
| TPC-DS          | ✅ Full          | ✅ Yes         | ✅ Yes           | Official dsdgen tool |
| Read Primitives | ✅ Full          | ✅ Yes         | ✅ Yes           | Shares TPC-H data    |
| SSB             | ⚠️ Basic         | ❌ No          | ✅ Yes           | Will be improved     |
| ClickBench      | ⚠️ Basic         | ❌ No          | ❌ No            | External data        |
| Join Order      | ⚠️ Basic         | ❌ No          | ❌ No            | IMDB dataset         |
| Others          | ⚠️ Basic         | ❌ No          | ❌ No            | Being improved       |

Future Enhancements

Planned data generation improvements:

  • Checksum Validation: Verify data integrity using file checksums

  • Schema Validation: Check table structure and column types

  • Compression Support: Handle compressed data files (.gz, .bz2)

  • Remote Data: Support cloud-stored benchmark datasets

  • Incremental Updates: Update only changed tables

  • Cross-Platform Sync: Share data between database platforms

  • Format Conversion: Auto-convert between CSV, Parquet, etc.

  • Resource Estimation: Predict generation time and space requirements

Troubleshooting

Performance Issues

# Monitor generation performance
import time

start_time = time.time()
benchmark = TPCH(scale_factor=1.0, verbose=True, parallel=4)
tables = benchmark.generate_data()
generation_time = time.time() - start_time

print(f"Generated {len(tables)} tables in {generation_time:.2f} seconds")
print(f"Average: {generation_time/len(tables):.2f}s per table")

Memory Issues

# For large scale factors, monitor memory usage
import psutil

def monitor_generation():
    process = psutil.Process()

    print(f"Memory before: {process.memory_info().rss / 1024**2:.1f} MB")

    benchmark = TPCH(scale_factor=10.0, parallel=2)
    tables = benchmark.generate_data()

    print(f"Memory after: {process.memory_info().rss / 1024**2:.1f} MB")
    return tables

Validation Issues

# Debug validation problems
from benchbox.utils.data_validation import BenchmarkDataValidator

validator = BenchmarkDataValidator("tpch", scale_factor=1.0)
result = validator.validate_data_directory("./data")

# Detailed analysis
if not result.valid:
    print("Validation Issues:")
    print(f"Missing tables: {result.missing_tables}")
    print(f"Row count mismatches: {result.row_count_mismatches}")
    print(f"File size issues: {[f for f, size in result.file_size_info.items() if size == 0]}")

Data generation in BenchBox is designed to be powerful, intelligent, and developer-friendly. The smart generation features keep repeated runs fast while maintaining data integrity and standards compliance.