Data Compression in BenchBox¶
BenchBox provides comprehensive data compression support to reduce storage requirements and improve I/O performance during benchmark data generation.
Overview¶
The compression system offers:
6-8x storage reduction for typical benchmark data
Multiple compression algorithms (Gzip, Zstd, or no compression)
Streaming compression without intermediate uncompressed files
Transparent integration with existing data generators
Configurable compression levels for performance tuning
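The streaming point in the list above can be illustrated with the Python standard library alone. The sketch below writes CSV rows straight into a gzip stream, so no uncompressed copy is written to disk. It is not BenchBox's implementation: `write_rows_compressed` and the sample rows are hypothetical names chosen for illustration.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def write_rows_compressed(path, rows, level=6):
    """Stream rows into a gzip-compressed CSV file (hypothetical helper)."""
    # Text mode ("wt") lets csv.writer write str rows; gzip compresses on the fly.
    with gzip.open(path, "wt", compresslevel=level, newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)
    return Path(path)


# Rows come from a generator, so they are never all held in memory at once.
rows = ((i, f"customer_{i % 50}", i * 7 % 1000) for i in range(10_000))

with tempfile.TemporaryDirectory() as tmp_dir:
    out = write_rows_compressed(Path(tmp_dir) / "data.csv.gz", rows)
    compressed_size = out.stat().st_size
    # Read the file back through gzip to confirm every row survived.
    with gzip.open(out, "rt", newline="") as f:
        row_count = sum(1 for _ in csv.reader(f))
```

The same pattern should carry over to other codecs that expose a file-like writer; only the `open` call changes.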
Quick Start¶
CLI Usage¶
Enable compression with command-line options:
# Basic compression (uses zstd by default)
benchbox run --platform duckdb --benchmark tpch --compression zstd
# Specify compression type and level
benchbox run --platform duckdb --benchmark tpch --compression gzip:9
# Use zstd with custom level for maximum compression
benchbox run --platform snowflake --benchmark tpcds --compression zstd:19
# Disable compression
benchbox run --platform duckdb --benchmark tpch --compression none
Programmatic Usage¶
from benchbox.core.ssb.benchmark import SSBBenchmark

# Create benchmark with compression
benchmark = SSBBenchmark(
    scale_factor=1.0,
    output_dir="./data",
    compress_data=True,
    compression_type='zstd',
    compression_level=5
)

# Generate compressed data
data_files = benchmark.generate_data()
# Files will be saved as .zst files with automatic compression
Compression Types¶
Zstd (Default, Recommended)¶
Best overall performance and compression ratio
Compression levels: 1-22 (default: 3)
Typical ratios: 6-10:1 for benchmark data
Use case: Default choice for most scenarios
benchbox run --platform duckdb --benchmark tpch --compression zstd
Gzip (Universal Compatibility)¶
Widely supported across platforms
Compression levels: 1-9 (default: 6)
Typical ratios: 5-8:1 for benchmark data
Use case: Maximum compatibility requirements
benchbox run --platform duckdb --benchmark tpch --compression gzip
None (No Compression)¶
Standard uncompressed files
Use case: When compression is not desired or supported
benchbox run --platform duckdb --benchmark tpch --compression none
Performance Results¶
Based on testing with SSB benchmark data:
| Compression | File Size | Compression Ratio | Space Savings |
|---|---|---|---|
| None | 215,514 bytes | 1.00:1 | 0% |
| Gzip | 32,730 bytes | 6.58:1 | 85% |
| Zstd | 26,944 bytes | 8.00:1 | 87.5% |
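The ratio and savings columns follow directly from the file sizes. The short sketch below recomputes them from the byte counts reported above; `compression_stats` is a hypothetical helper name, not part of BenchBox.

```python
def compression_stats(original_bytes, compressed_bytes):
    """Return (compression ratio, space savings in percent)."""
    ratio = original_bytes / compressed_bytes
    savings = (1 - compressed_bytes / original_bytes) * 100
    return round(ratio, 2), round(savings, 1)


ORIGINAL = 215_514  # uncompressed size in bytes, from the results above

gzip_stats = compression_stats(ORIGINAL, 32_730)  # (6.58, 84.8)
zstd_stats = compression_stats(ORIGINAL, 26_944)  # (8.0, 87.5)
```

The gzip savings come out at 84.8%, which the results above round to 85%.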
Compression Levels¶
Gzip Levels (1-9)¶
Level 1: Fastest compression, lower ratio
Level 6: Default, good balance
Level 9: Maximum compression, slower
# Fast compression
benchbox run --platform duckdb --benchmark tpch --compression gzip:1
# Maximum compression
benchbox run --platform duckdb --benchmark tpch --compression gzip:9
Zstd Levels (1-22)¶
Level 1: Fastest compression
Level 3: Default, excellent balance
Level 19: Highest commonly used level (levels 20-22 exist but may be very slow)
# Fast compression
benchbox run --platform duckdb --benchmark tpch --compression zstd:1
# High compression
benchbox run --platform duckdb --benchmark tpch --compression zstd:15
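The level trade-off described above can be observed without BenchBox. The sketch below compresses a synthetic CSV-like payload with `zlib`, which implements the same DEFLATE algorithm that gzip uses, at levels 1, 6, and 9. The payload is invented for illustration, so the sizes it prints say nothing about real benchmark data.

```python
import zlib

# Synthetic CSV-like payload (illustrative only, not real benchmark output).
payload = "".join(
    f"{i},customer_{i % 50},region_{i % 5},{i * 7 % 1000}\n"
    for i in range(20_000)
).encode("utf-8")

sizes = {}
for level in (1, 6, 9):
    compressed = zlib.compress(payload, level)
    # Round-trip check: decompression must restore the payload exactly.
    assert zlib.decompress(compressed) == payload
    sizes[level] = len(compressed)

for level, size in sizes.items():
    print(f"level {level}: {size:,} bytes ({len(payload) / size:.2f}:1)")
```

Higher levels usually produce smaller output at the cost of more CPU time, but the gain depends on the data, so it is worth measuring on a sample of your own tables before settling on a level.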
Integration with Data Generators¶
Adding Compression Support to New Generators¶
To add compression support to a custom data generator:
import csv
from pathlib import Path

from benchbox.utils.compression_mixin import CompressionMixin

class MyDataGenerator(CompressionMixin):
    def __init__(self, scale_factor=1.0, output_dir=None, **kwargs):
        # Initialize compression mixin
        super().__init__(**kwargs)
        self.scale_factor = scale_factor
        self.output_dir = Path(output_dir) if output_dir else Path.cwd()

    def generate_data(self):
        """Generate data with optional compression."""
        # Get compressed filename
        filename = self.get_compressed_filename("data.csv")
        file_path = self.output_dir / filename
        # Open file with compression if enabled
        with self.open_output_file(file_path, "wt") as f:
            writer = csv.writer(f)
            # Write data...
        # Print compression report if enabled
        if self.should_use_compression():
            files = {"data": file_path}
            self.print_compression_report(files)
        return {"data": str(file_path)}
Benchmark Integration¶
To integrate compression into benchmark classes:
class MyBenchmark(BaseBenchmark):
    def __init__(self, scale_factor=1.0, **config):
        super().__init__(scale_factor, **config)
        # Pass compression settings to data generator
        self.data_generator = MyDataGenerator(
            scale_factor=scale_factor,
            output_dir=self.output_dir,
            compress_data=config.get('compress_data', False),
            compression_type=config.get('compression_type', 'zstd'),
            compression_level=config.get('compression_level', None)
        )
Platform Adapter Support¶
Platform adapters can automatically detect and handle compressed files:
from benchbox.utils.compression import CompressionManager
def load_data_file(file_path):
    """Load data file with automatic decompression if needed."""
    compression_manager = CompressionManager()
    compression_type = compression_manager.detect_compression(file_path)
    if compression_type == 'none':
        # Standard file loading
        with open(file_path, 'rt') as f:
            return f.read()
    else:
        # Compressed file loading
        compressor = compression_manager.get_compressor(compression_type)
        with compressor.open_for_read(file_path, 'rt') as f:
            return f.read()
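How `detect_compression` decides on a type is not described here. One common approach, sketched below as an assumption and not as BenchBox's actual logic, is to inspect the first bytes of the file: gzip streams start with `1f 8b`, and Zstandard frames start with `28 b5 2f fd`. The function name `sniff_compression` is hypothetical.

```python
import gzip
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"          # gzip member header
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # Zstandard frame magic number


def sniff_compression(path):
    """Guess the compression type from a file's leading bytes."""
    with open(path, "rb") as f:
        header = f.read(4)
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    if header.startswith(ZSTD_MAGIC):
        return "zstd"
    return "none"


with tempfile.TemporaryDirectory() as tmp_dir:
    base = Path(tmp_dir)

    plain = base / "data.csv"
    plain.write_text("id,name\n1,alice\n")

    gz = base / "data.csv.gz"
    with gzip.open(gz, "wt") as f:
        f.write("id,name\n1,alice\n")

    # Only the zstd magic plus padding: enough to sniff, not a decodable frame.
    fake_zst = base / "data.csv.zst"
    fake_zst.write_bytes(ZSTD_MAGIC + b"\x00" * 8)

    detected = {p.name: sniff_compression(p) for p in (plain, gz, fake_zst)}
```

Checking content instead of the file extension keeps detection working when files are renamed, at the cost of one small read per file.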
Advanced Usage¶
Custom Compression Levels by Use Case¶
# CI/CD environments - prioritize speed
benchbox run --platform duckdb --benchmark tpch --compression zstd:1
# Storage-constrained environments - prioritize size
benchbox run --platform duckdb --benchmark tpch --compression zstd:19
# Production benchmarking - balanced performance
benchbox run --platform duckdb --benchmark tpch --compression zstd # Uses default level 3
Dry Run with Compression¶
Preview compression settings without execution:
benchbox run --platform duckdb --benchmark tpch --scale 1.0 \
--compression zstd:5 --dry-run ./preview
Environment Variables¶
Set compression defaults via environment variables:
export BENCHBOX_COMPRESS_DATA=true
export BENCHBOX_COMPRESSION_TYPE=zstd
export BENCHBOX_COMPRESSION_LEVEL=5
benchbox run --platform duckdb --benchmark tpch
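To make the mapping from these variables to generator settings concrete, the sketch below reads them into the keyword arguments used in the programmatic examples. It is illustrative only: the helper name, the set of accepted truthy strings, and the fallback defaults are assumptions, and BenchBox's own parsing may differ.

```python
import os


def compression_settings_from_env(env=None):
    """Translate BENCHBOX_* variables into keyword arguments (hypothetical)."""
    env = os.environ if env is None else env

    # Assumption: these strings count as "enabled"; anything else is disabled.
    raw_enabled = env.get("BENCHBOX_COMPRESS_DATA", "false")
    enabled = raw_enabled.strip().lower() in {"1", "true", "yes"}

    compression_type = env.get("BENCHBOX_COMPRESSION_TYPE", "zstd").strip().lower()

    # An unset or empty level means "use the algorithm's default".
    raw_level = env.get("BENCHBOX_COMPRESSION_LEVEL", "").strip()
    level = int(raw_level) if raw_level else None

    return {
        "compress_data": enabled,
        "compression_type": compression_type,
        "compression_level": level,
    }


# Passing a plain dict keeps the example independent of the real environment.
settings = compression_settings_from_env(
    {
        "BENCHBOX_COMPRESS_DATA": "true",
        "BENCHBOX_COMPRESSION_TYPE": "zstd",
        "BENCHBOX_COMPRESSION_LEVEL": "5",
    }
)
defaults = compression_settings_from_env({})
```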
Best Practices¶
When to Use Compression¶
✅ Recommended for:
Large scale factors (≥1.0)
Storage-constrained environments
CI/CD pipelines
Network file systems
Repeated benchmark runs
❌ Consider avoiding for:
Very small datasets (scale < 0.1)
Platforms that don’t support compressed files
Time-critical scenarios where compression overhead matters
Compression Type Selection¶
Use Zstd when:
You want the best compression ratio and speed
Storage space is a primary concern
You’re using modern systems
Use Gzip when:
You need maximum compatibility
Downstream tools only support gzip
You’re working with legacy systems
Use None when:
Dataset is very small
Platform adapters don’t support decompression
Debugging file contents manually
Performance Tuning¶
For Speed:
benchbox run --platform duckdb --benchmark tpch --compression zstd:1
For Size:
benchbox run --platform duckdb --benchmark tpch --compression zstd:15
For Balance:
benchbox run --platform duckdb --benchmark tpch --compression zstd # Uses default level 3
Troubleshooting¶
Common Issues¶
“zstandard library not available”
# Install zstandard
uv pip install zstandard
# or
pip install zstandard
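A script can check for the `zstandard` package before requesting zstd, which avoids hitting this error in the middle of a run. The sketch below uses `importlib.util.find_spec` from the standard library. The fallback to gzip is a design choice made for this example, not documented BenchBox behavior, and both function names are hypothetical.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_compression(preferred="zstd"):
    """Fall back to gzip when zstd is requested but zstandard is missing."""
    if preferred == "zstd" and not module_available("zstandard"):
        return "gzip"
    return preferred


compression = pick_compression("zstd")
print(f"Using compression: {compression}")
```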
“Compressed file not found”
Check that the --compress-data flag was used during generation
Verify file extensions (.gz, .zst)
Ensure compression completed successfully
“Decompression failed”
File may have been corrupted during transfer
Check compression/decompression compatibility
Verify sufficient disk space
Debug Mode¶
Enable verbose logging to troubleshoot compression:
benchbox run --platform duckdb --benchmark tpch --compress-data --verbose
File Verification¶
Verify compressed file integrity:
from pathlib import Path

from benchbox.utils.compression import CompressionManager

manager = CompressionManager()
compression_type = manager.detect_compression(Path("data.csv.gz"))
compressor = manager.get_compressor(compression_type)

# Test decompression
try:
    with compressor.open_for_read(Path("data.csv.gz"), 'rt') as f:
        content = f.read(100)  # Read first 100 characters
    print("File is valid")
except Exception as e:
    print(f"File is corrupted: {e}")
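Reading only the first 100 characters confirms that the header and the start of the stream decode, but damage further into the file can go unnoticed. For gzip in particular, the CRC32 checksum sits in the trailer and is only checked once the whole stream has been read. The sketch below uses the standard `gzip` module directly, not BenchBox's `CompressionManager`, and reads to the end in chunks; `gzip_is_intact` is a hypothetical helper name.

```python
import gzip
import tempfile
import zlib
from pathlib import Path


def gzip_is_intact(path, chunk_size=1 << 20):
    """Decompress the whole file in chunks; return False on any decode error."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk_size):
                pass  # Discard the data; only a full read triggers the CRC check.
        return True
    except (OSError, EOFError, zlib.error):
        # BadGzipFile is a subclass of OSError; truncation raises EOFError.
        return False


with tempfile.TemporaryDirectory() as tmp_dir:
    good = Path(tmp_dir) / "good.csv.gz"
    with gzip.open(good, "wt") as f:
        f.write("id,name\n" * 1000)

    # Simulate an interrupted transfer by keeping only the first half.
    data = good.read_bytes()
    truncated = Path(tmp_dir) / "truncated.csv.gz"
    truncated.write_bytes(data[: len(data) // 2])

    good_ok = gzip_is_intact(good)
    truncated_ok = gzip_is_intact(truncated)
```

Chunked reads keep memory use flat, so the check stays practical for large scale factors.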
API Reference¶
CompressionManager¶
from pathlib import Path

from benchbox.utils.compression import CompressionManager

manager = CompressionManager()

# Get available compressors
compressors = manager.get_available_compressors()

# Get specific compressor
compressor = manager.get_compressor('zstd', level=5)

# Detect compression type
compression_type = manager.detect_compression(Path("file.csv.gz"))

# Get compression statistics
# (original_file and compressed_file must be defined by the caller)
info = manager.get_compression_info(original_file, compressed_file)
CompressionMixin¶
from benchbox.utils.compression_mixin import CompressionMixin
class MyGenerator(CompressionMixin):
    # Mixin methods available:
    # - get_compressed_filename(filename) -> str
    # - open_output_file(path, mode) -> file_object
    # - compress_existing_file(path) -> Path
    # - should_use_compression() -> bool
    # - print_compression_report(files) -> None
    ...
CLI Options¶
--compression TYPE[:LEVEL] # Compression: zstd, zstd:9, gzip:6, none
# Examples: --compression zstd
# --compression zstd:15
# --compression gzip:9
# --compression none
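The `TYPE[:LEVEL]` syntax is easy to validate in scripts that wrap the CLI. The parser below is a sketch, not BenchBox's own argument handling: the function name and error messages are hypothetical, while the level ranges and defaults are the ones stated in this document (gzip 1-9, default 6; zstd 1-22, default 3).

```python
LEVEL_RANGES = {"gzip": (1, 9), "zstd": (1, 22)}
DEFAULT_LEVELS = {"gzip": 6, "zstd": 3}


def parse_compression_spec(spec):
    """Parse 'TYPE[:LEVEL]' into a (type, level) tuple; level is None for 'none'."""
    name, sep, raw_level = spec.strip().lower().partition(":")

    if name == "none":
        if sep:
            raise ValueError("'none' does not take a level")
        return ("none", None)

    if name not in LEVEL_RANGES:
        raise ValueError(f"unknown compression type: {name!r}")

    if not sep:
        # No explicit level: fall back to the documented default.
        return (name, DEFAULT_LEVELS[name])

    try:
        level = int(raw_level)
    except ValueError:
        raise ValueError(f"level must be an integer, got {raw_level!r}") from None

    low, high = LEVEL_RANGES[name]
    if not low <= level <= high:
        raise ValueError(f"{name} level must be between {low} and {high}")
    return (name, level)


parsed = [parse_compression_spec(s) for s in ("zstd", "zstd:15", "gzip:9", "none")]
```

Rejecting out-of-range levels up front gives a clearer error than letting the underlying codec fail during data generation.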
Examples Repository¶
For more examples, see the /examples directory:
examples/duckdb_tpch_compressed.py - TPC-H with compression
examples/compression_performance.py - Performance testing
examples/multi_benchmark_compression.py - Multiple benchmarks with compression