Data Compression in BenchBox

Tags: intermediate, guide, data-generation

BenchBox provides comprehensive data compression support to reduce storage requirements and improve I/O performance during benchmark data generation.

Overview

The compression system offers:

  • 6-8x storage reduction for typical benchmark data

  • Multiple compression algorithms (Gzip, Zstd, or no compression)

  • Streaming compression without intermediate uncompressed files

  • Transparent integration with existing data generators

  • Configurable compression levels for performance tuning
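
The streaming behavior listed above can be sketched with the standard library: rows are compressed as they are written, so no uncompressed intermediate file ever touches disk. This is a minimal illustration using gzip, not BenchBox's internal code:

```python
import csv
import gzip
import tempfile
from pathlib import Path

# Write rows straight into a gzip stream; data is compressed on the fly,
# so no uncompressed copy of the file is produced.
out = Path(tempfile.mkdtemp()) / "orders.csv.gz"
with gzip.open(out, "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "amount"])
    for i in range(3):
        writer.writerow([i, i * 10])

# Read it back through the same streaming interface.
with gzip.open(out, "rt", newline="") as f:
    rows = list(csv.reader(f))
```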

Quick Start

CLI Usage

Enable compression with command-line options:

# Basic compression (uses zstd by default)
benchbox run --platform duckdb --benchmark tpch --compression zstd

# Specify compression type and level
benchbox run --platform duckdb --benchmark tpch --compression gzip:9

# Use zstd with custom level for maximum compression
benchbox run --platform snowflake --benchmark tpcds --compression zstd:19

# Disable compression
benchbox run --platform duckdb --benchmark tpch --compression none

Programmatic Usage

from benchbox.core.ssb.benchmark import SSBBenchmark

# Create benchmark with compression
benchmark = SSBBenchmark(
    scale_factor=1.0,
    output_dir="./data",
    compress_data=True,
    compression_type='zstd',
    compression_level=5
)

# Generate compressed data
data_files = benchmark.generate_data()
# Files will be saved as .zst files with automatic compression

Compression Types

Gzip (Universal Compatibility)

  • Widely supported across platforms

  • Compression levels: 1-9 (default: 6)

  • Typical ratios: 5-8:1 for benchmark data

  • Use case: Maximum compatibility requirements

benchbox run --platform duckdb --benchmark tpch --compression gzip

Zstd (Recommended Default)

  • Best balance of compression ratio and speed

  • Compression levels: 1-22 (default: 3)

  • Typical ratios: up to 8:1 for benchmark data

  • Use case: The default choice for modern platforms

benchbox run --platform duckdb --benchmark tpch --compression zstd

None (No Compression)

  • Standard uncompressed files

  • Use case: When compression is not desired or supported

benchbox run --platform duckdb --benchmark tpch --compression none

Performance Results

Based on testing with SSB benchmark data:

Compression | File Size     | Compression Ratio | Space Savings
------------|---------------|-------------------|--------------
None        | 215,514 bytes | 1.00:1            | 0%
Gzip        | 32,730 bytes  | 6.58:1            | 85%
Zstd        | 26,944 bytes  | 8.00:1            | 87.5%
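
The ratio and savings columns follow directly from the file sizes. As a quick sanity check (plain arithmetic, not a BenchBox API):

```python
def compression_stats(original_size: int, compressed_size: int) -> tuple[str, str]:
    """Compute the compression ratio and space savings shown in the table."""
    ratio = original_size / compressed_size
    savings = 1 - compressed_size / original_size
    return f"{ratio:.2f}:1", f"{savings:.1%}"

print(compression_stats(215_514, 26_944))  # Zstd row -> ('8.00:1', '87.5%')
```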

Compression Levels

Gzip Levels (1-9)

  • Level 1: Fastest compression, lower ratio

  • Level 6: Default, good balance

  • Level 9: Maximum compression, slower

# Fast compression
benchbox run --platform duckdb --benchmark tpch --compression gzip:1

# Maximum compression
benchbox run --platform duckdb --benchmark tpch --compression gzip:9
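
The speed-versus-ratio trade-off between levels can be observed with the standard library's gzip module. This is illustrative only; BenchBox's own generators will show different absolute numbers:

```python
import gzip
import time

# Repetitive CSV-like payload, similar in character to benchmark data.
data = b"1994-01-01,ORDER,1234,56.78\n" * 50_000

results = {}
for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = gzip.compress(data, compresslevel=level)
    results[level] = (len(compressed), time.perf_counter() - start)
    print(f"level {level}: {len(compressed):>8} bytes in {results[level][1]:.3f}s")
```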

Zstd Levels (1-22)

  • Level 1: Fastest compression

  • Level 3: Default, excellent balance

  • Level 19: Maximum compression (higher levels exist but take 10x+ longer)

# Fast compression
benchbox run --platform duckdb --benchmark tpch --compression zstd:1

# High compression
benchbox run --platform duckdb --benchmark tpch --compression zstd:15

Supported Generators

Compression is available across all BenchBox data generators. The following generators use the CompressionMixin for streaming compression during data generation:

  • TPC-H, TPC-DS, SSB - Compression supported since v0.1.2

  • CoffeeShop, Join Order, AMPLab, H2O.db - Compression supported since v0.1.2

  • TSBS DevOps - Compression support added in v0.2.1

  • FlightData - Compression support added in v0.2.1

  • NYC Taxi (Yellow, Green, and HVFHV downloaders) - Compression support added in v0.2.1

All generators that support compression write a _datagen_manifest.json with compression metadata and per-table row counts.
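
Reading the manifest back is plain JSON handling. The field layout below is an illustrative assumption, not a documented schema:

```python
import json
import tempfile
from pathlib import Path

# Create a sample manifest for the demonstration (layout is assumed).
manifest_path = Path(tempfile.mkdtemp()) / "_datagen_manifest.json"
manifest_path.write_text(json.dumps({
    "compression": {"type": "zstd", "level": 3},
    "row_counts": {"lineorder": 6_001_171, "customer": 30_000},
}))

manifest = json.loads(manifest_path.read_text())
print(manifest["compression"]["type"])
for table, rows in manifest["row_counts"].items():
    print(table, rows)
```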

Integration with Data Generators

Adding Compression Support to New Generators

To add compression support to a custom data generator:

import csv
from pathlib import Path

from benchbox.utils.compression_mixin import CompressionMixin

class MyDataGenerator(CompressionMixin):
    def __init__(self, scale_factor=1.0, output_dir=None, **kwargs):
        # Initialize compression mixin
        super().__init__(**kwargs)

        self.scale_factor = scale_factor
        self.output_dir = Path(output_dir) if output_dir else Path.cwd()

    def generate_data(self):
        """Generate data with optional compression."""
        # Get compressed filename
        filename = self.get_compressed_filename("data.csv")
        file_path = self.output_dir / filename

        # Open file with compression if enabled
        with self.open_output_file(file_path, "wt") as f:
            writer = csv.writer(f)
            # Write data...

        # Print compression report if enabled
        if self.should_use_compression():
            files = {"data": file_path}
            self.print_compression_report(files)

        return {"data": str(file_path)}

Benchmark Integration

To integrate compression into benchmark classes:

class MyBenchmark(BaseBenchmark):
    def __init__(self, scale_factor=1.0, **config):
        super().__init__(scale_factor, **config)

        # Pass compression settings to data generator
        self.data_generator = MyDataGenerator(
            scale_factor=scale_factor,
            output_dir=self.output_dir,
            compress_data=config.get('compress_data', False),
            compression_type=config.get('compression_type', 'zstd'),
            compression_level=config.get('compression_level', None)
        )

Platform Adapter Support

Platform adapters automatically detect and load compressed data files. During data loading, BenchBox inspects each file’s extension (e.g., .zst, .gz) using detect_compression() and passes the appropriate compression codec to the reader. No user configuration is required: compressed files generated by benchbox run --compression zstd are loaded transparently on subsequent runs.

How It Works

  • DuckDB, DataFusion, ClickHouse: Native support for zstd-compressed files via their built-in readers.

  • Spark-based platforms (Apache Spark, LakeSail): BenchBox detects compression from the file extension and sets the Spark CSV reader’s compression option automatically. This is required because some engines (notably LakeSail/Sail) default to UNCOMPRESSED and do not auto-detect compression from file extensions.

  • DataFrame platforms (Polars, Pandas, cuDF): Compression is handled by the underlying library’s reader (e.g., Polars natively reads .csv.zst files).

Advanced Usage

Custom Compression Levels by Use Case

# CI/CD environments - prioritize speed
benchbox run --platform duckdb --benchmark tpch --compression zstd:1

# Storage-constrained environments - prioritize size
benchbox run --platform duckdb --benchmark tpch --compression zstd:19

# Production benchmarking - balanced performance
benchbox run --platform duckdb --benchmark tpch --compression zstd  # Uses default level 3

Dry Run with Compression

Preview compression settings without execution:

benchbox run --platform duckdb --benchmark tpch --scale 1.0 \
  --compression zstd:5 --dry-run ./preview

Environment Variables

Set compression defaults via environment variables:

export BENCHBOX_COMPRESS_DATA=true
export BENCHBOX_COMPRESSION_TYPE=zstd
export BENCHBOX_COMPRESSION_LEVEL=5

benchbox run --platform duckdb --benchmark tpch
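
A sketch of how these variables might be resolved into defaults. The parsing logic here is illustrative, not BenchBox's actual implementation:

```python
import os

def compression_defaults() -> dict:
    """Resolve compression defaults from the documented environment variables."""
    level = os.environ.get("BENCHBOX_COMPRESSION_LEVEL")
    return {
        "compress_data": os.environ.get("BENCHBOX_COMPRESS_DATA", "false").lower() == "true",
        "compression_type": os.environ.get("BENCHBOX_COMPRESSION_TYPE", "zstd"),
        "compression_level": int(level) if level is not None else None,
    }
```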

Best Practices

When to Use Compression

✅ Recommended for:

  • Large scale factors (≥1.0)

  • Storage-constrained environments

  • CI/CD pipelines

  • Network file systems

  • Repeated benchmark runs

❌ Consider avoiding for:

  • Very small datasets (scale < 0.1)

  • Platforms that don’t support compressed files

  • Time-critical scenarios where compression overhead matters

Compression Type Selection

Use Zstd when:

  • You want the best compression ratio and speed

  • Storage space is a primary concern

  • You’re using modern systems

Use Gzip when:

  • You need maximum compatibility

  • Downstream tools only support gzip

  • You’re working with legacy systems

Use None when:

  • Dataset is very small

  • Platform adapters don’t support decompression

  • Debugging file contents manually

Performance Tuning

For Speed:

benchbox run --platform duckdb --benchmark tpch --compression zstd:1

For Size:

benchbox run --platform duckdb --benchmark tpch --compression zstd:15

For Balance:

benchbox run --platform duckdb --benchmark tpch --compression zstd  # Uses default level 3

Troubleshooting

Common Issues

“zstandard library not available”

# Install zstandard
uv pip install zstandard
# or
pip install zstandard
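
You can also check for the optional dependency programmatically before generating data:

```python
import importlib.util

def zstd_available() -> bool:
    """True if the optional zstandard package can be imported."""
    return importlib.util.find_spec("zstandard") is not None

if not zstd_available():
    print("zstandard missing; install it or fall back to --compression gzip")
```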

“Compressed file not found”

  • Check that the --compression flag was used during generation

  • Verify file extensions (.gz, .zst)

  • Ensure compression completed successfully

“Decompression failed”

  • File may be corrupted during transfer

  • Check compression/decompression compatibility

  • Verify sufficient disk space

Debug Mode

Enable verbose logging to troubleshoot compression:

benchbox run --platform duckdb --benchmark tpch --compression zstd --verbose

File Verification

Verify compressed file integrity:

from pathlib import Path

from benchbox.utils.compression import CompressionManager

manager = CompressionManager()
compression_type = manager.detect_compression(Path("data.csv.gz"))
compressor = manager.get_compressor(compression_type)

# Test decompression
try:
    with compressor.open_for_read(Path("data.csv.gz"), 'rt') as f:
        content = f.read(100)  # Read first 100 characters
    print("File is valid")
except Exception as e:
    print(f"File is corrupted: {e}")

API Reference

CompressionManager

from pathlib import Path

from benchbox.utils.compression import CompressionManager

manager = CompressionManager()

# Get available compressors
compressors = manager.get_available_compressors()

# Get specific compressor
compressor = manager.get_compressor('zstd', level=5)

# Detect compression type
compression_type = manager.detect_compression(Path("file.csv.gz"))

# Get compression statistics
info = manager.get_compression_info(original_file, compressed_file)

CompressionMixin

from benchbox.utils.compression_mixin import CompressionMixin

class MyGenerator(CompressionMixin):
    # Mixin methods available:
    # - get_compressed_filename(filename) -> str
    # - open_output_file(path, mode) -> file_object
    # - compress_existing_file(path) -> Path
    # - should_use_compression() -> bool
    # - print_compression_report(files) -> None

CLI Options

--compression TYPE[:LEVEL]   # Compression: zstd, zstd:9, gzip:6, none
                             # Examples: --compression zstd
                             #           --compression zstd:15
                             #           --compression gzip:9
                             #           --compression none

Examples Repository

For more examples, see the /examples directory:

  • examples/duckdb_tpch_compressed.py - TPC-H with compression

  • examples/compression_performance.py - Performance testing

  • examples/multi_benchmark_compression.py - Multiple benchmarks with compression