Data Compression in BenchBox

Tags intermediate guide data-generation

BenchBox provides comprehensive data compression support to reduce storage requirements and improve I/O performance during benchmark data generation.

Overview

The compression system offers:

  • 6-8x storage reduction for typical benchmark data

  • Multiple compression algorithms (Gzip, Zstd, or no compression)

  • Streaming compression without intermediate uncompressed files

  • Transparent integration with existing data generators

  • Configurable compression levels for performance tuning

Quick Start

CLI Usage

Enable compression with command-line options:

# Basic compression (uses zstd by default)
benchbox run --platform duckdb --benchmark tpch --compression zstd

# Specify compression type and level
benchbox run --platform duckdb --benchmark tpch --compression gzip:9

# Use zstd with custom level for maximum compression
benchbox run --platform snowflake --benchmark tpcds --compression zstd:19

# Disable compression
benchbox run --platform duckdb --benchmark tpch --compression none

Programmatic Usage

from benchbox.core.ssb.benchmark import SSBBenchmark

# Create benchmark with compression
benchmark = SSBBenchmark(
    scale_factor=1.0,
    output_dir="./data",
    compress_data=True,
    compression_type='zstd',
    compression_level=5
)

# Generate compressed data
data_files = benchmark.generate_data()
# Files will be saved as .zst files with automatic compression

Compression Types

Gzip (Universal Compatibility)

  • Widely supported across platforms

  • Compression levels: 1-9 (default: 6)

  • Typical ratios: 5-8:1 for benchmark data

  • Use case: Maximum compatibility requirements

benchbox run --platform duckdb --benchmark tpch --compression gzip

None (No Compression)

  • Standard uncompressed files

  • Use case: When compression is not desired or supported

benchbox run --platform duckdb --benchmark tpch --compression none

Performance Results

Based on testing with SSB benchmark data:

Compression

File Size

Compression Ratio

Space Savings

None

215,514 bytes

1.00:1

0%

Gzip

32,730 bytes

6.58:1

85%

Zstd

26,944 bytes

8.00:1

87.5%

Compression Levels

Gzip Levels (1-9)

  • Level 1: Fastest compression, lower ratio

  • Level 6: Default, good balance

  • Level 9: Maximum compression, slower

# Fast compression
benchbox run --platform duckdb --benchmark tpch --compression gzip:1

# Maximum compression
benchbox run --platform duckdb --benchmark tpch --compression gzip:9

Zstd Levels (1-22)

  • Level 1: Fastest compression

  • Level 3: Default, excellent balance

  • Level 19: Maximum compression (higher levels exist but may be very slow)

# Fast compression
benchbox run --platform duckdb --benchmark tpch --compression zstd:1

# High compression
benchbox run --platform duckdb --benchmark tpch --compression zstd:15

Integration with Data Generators

Adding Compression Support to New Generators

To add compression support to a custom data generator:

from benchbox.utils.compression_mixin import CompressionMixin

class MyDataGenerator(CompressionMixin):
    def __init__(self, scale_factor=1.0, output_dir=None, **kwargs):
        # Initialize compression mixin
        super().__init__(**kwargs)
        
        self.scale_factor = scale_factor
        self.output_dir = Path(output_dir) if output_dir else Path.cwd()

    def generate_data(self):
        """Generate data with optional compression."""
        # Get compressed filename
        filename = self.get_compressed_filename("data.csv")
        file_path = self.output_dir / filename
        
        # Open file with compression if enabled
        with self.open_output_file(file_path, "wt") as f:
            writer = csv.writer(f)
            # Write data...
        
        # Print compression report if enabled
        if self.should_use_compression():
            files = {"data": file_path}
            self.print_compression_report(files)
        
        return {"data": str(file_path)}

Benchmark Integration

To integrate compression into benchmark classes:

class MyBenchmark(BaseBenchmark):
    def __init__(self, scale_factor=1.0, **config):
        super().__init__(scale_factor, **config)
        
        # Pass compression settings to data generator
        self.data_generator = MyDataGenerator(
            scale_factor=scale_factor,
            output_dir=self.output_dir,
            compress_data=config.get('compress_data', False),
            compression_type=config.get('compression_type', 'zstd'),
            compression_level=config.get('compression_level', None)
        )

Platform Adapter Support

Platform adapters can automatically detect and handle compressed files:

from benchbox.utils.compression import CompressionManager

def load_data_file(file_path):
    """Load data file with automatic decompression if needed."""
    compression_manager = CompressionManager()
    compression_type = compression_manager.detect_compression(file_path)
    
    if compression_type == 'none':
        # Standard file loading
        with open(file_path, 'rt') as f:
            return f.read()
    else:
        # Compressed file loading
        compressor = compression_manager.get_compressor(compression_type)
        with compressor.open_for_read(file_path, 'rt') as f:
            return f.read()

Advanced Usage

Custom Compression Levels by Use Case

# CI/CD environments - prioritize speed
benchbox run --platform duckdb --benchmark tpch --compression zstd:1

# Storage-constrained environments - prioritize size
benchbox run --platform duckdb --benchmark tpch --compression zstd:19

# Production benchmarking - balanced performance
benchbox run --platform duckdb --benchmark tpch --compression zstd  # Uses default level 3

Dry Run with Compression

Preview compression settings without execution:

benchbox run --platform duckdb --benchmark tpch --scale 1.0 \
  --compression zstd:5 --dry-run ./preview

Environment Variables

Set compression defaults via environment variables:

export BENCHBOX_COMPRESS_DATA=true
export BENCHBOX_COMPRESSION_TYPE=zstd
export BENCHBOX_COMPRESSION_LEVEL=5

benchbox run --platform duckdb --benchmark tpch

Best Practices

When to Use Compression

✅ Recommended for:

  • Large scale factors (≥1.0)

  • Storage-constrained environments

  • CI/CD pipelines

  • Network file systems

  • Repeated benchmark runs

❌ Consider avoiding for:

  • Very small datasets (scale < 0.1)

  • Platforms that don’t support compressed files

  • Time-critical scenarios where compression overhead matters

Compression Type Selection

Use Zstd when:

  • You want the best compression ratio and speed

  • Storage space is a primary concern

  • You’re using modern systems

Use Gzip when:

  • You need maximum compatibility

  • Downstream tools only support gzip

  • You’re working with legacy systems

Use None when:

  • Dataset is very small

  • Platform adapters don’t support decompression

  • Debugging file contents manually

Performance Tuning

For Speed:

benchbox run --platform duckdb --benchmark tpch --compression zstd:1

For Size:

benchbox run --platform duckdb --benchmark tpch --compression zstd:15

For Balance:

benchbox run --platform duckdb --benchmark tpch --compression zstd  # Uses default level 3

Troubleshooting

Common Issues

“zstandard library not available”

# Install zstandard
uv pip install zstandard
# or
pip install zstandard

“Compressed file not found”

  • Check that --compress-data flag was used during generation

  • Verify file extensions (.gz, .zst)

  • Ensure compression completed successfully

“Decompression failed”

  • File may be corrupted during transfer

  • Check compression/decompression compatibility

  • Verify sufficient disk space

Debug Mode

Enable verbose logging to troubleshoot compression:

benchbox run --platform duckdb --benchmark tpch --compress-data --verbose

File Verification

Verify compressed file integrity:

from benchbox.utils.compression import CompressionManager

manager = CompressionManager()
compression_type = manager.detect_compression(Path("data.csv.gz"))
compressor = manager.get_compressor(compression_type)

# Test decompression
try:
    with compressor.open_for_read(Path("data.csv.gz"), 'rt') as f:
        content = f.read(100)  # Read first 100 characters
    print("File is valid")
except Exception as e:
    print(f"File is corrupted: {e}")

API Reference

CompressionManager

from benchbox.utils.compression import CompressionManager

manager = CompressionManager()

# Get available compressors
compressors = manager.get_available_compressors()

# Get specific compressor
compressor = manager.get_compressor('zstd', level=5)

# Detect compression type
compression_type = manager.detect_compression(Path("file.csv.gz"))

# Get compression statistics
info = manager.get_compression_info(original_file, compressed_file)

CompressionMixin

from benchbox.utils.compression_mixin import CompressionMixin

class MyGenerator(CompressionMixin):
    # Mixin methods available:
    # - get_compressed_filename(filename) -> str
    # - open_output_file(path, mode) -> file_object
    # - compress_existing_file(path) -> Path
    # - should_use_compression() -> bool
    # - print_compression_report(files) -> None

CLI Options

--compression TYPE[:LEVEL]   # Compression: zstd, zstd:9, gzip:6, none
                             # Examples: --compression zstd
                             #           --compression zstd:15
                             #           --compression gzip:9
                             #           --compression none

Examples Repository

For more examples, see the /examples directory:

  • examples/duckdb_tpch_compressed.py - TPC-H with compression

  • examples/compression_performance.py - Performance testing

  • examples/multi_benchmark_compression.py - Multiple benchmarks with compression