End-to-End (E2E) Testing Guide¶
This guide covers BenchBox’s end-to-end test suite, which validates complete benchmark workflows through the CLI interface.
Overview¶
The E2E test suite (tests/e2e/) validates that BenchBox works correctly as an integrated system. Unlike unit tests that test individual components or integration tests that test component interactions, E2E tests exercise the full CLI workflow from command invocation through result generation.
Test Count: 162 test functions across 6 test modules
Test Structure¶
tests/e2e/
├── conftest.py # Fixtures and helpers
├── test_cli_options.py # CLI option validation (31 tests)
├── test_error_handling.py # Error handling coverage (31 tests)
├── test_result_validation.py # Result schema validation (40 tests)
├── test_local_platforms.py # Local platform tests (16 tests)
├── test_cloud_platforms.py # Cloud platform dry-run tests (26 tests)
├── test_dataframe_platforms.py # DataFrame platform tests (18 tests)
└── utils/
├── platform_detection.py # Platform availability detection
└── result_validators.py # Result JSON validation
Quick Start¶
Run All E2E Tests¶
# All E2E tests
uv run -- python -m pytest tests/e2e/ -v
# With parallel execution
uv run -- python -m pytest tests/e2e/ -n auto
Run Quick E2E Tests (Dry-Run Mode)¶
# Quick tests using dry-run mode (no actual benchmark execution)
make test-e2e-quick
# or
uv run -- python -m pytest -m e2e_quick
Run Local Platform Tests¶
# Full execution against local databases
uv run -- python -m pytest -m e2e_local
# Specific platform
uv run -- python -m pytest tests/e2e/test_local_platforms.py -v
Test Categories¶
CLI Options Tests (test_cli_options.py)¶
Validates all CLI options work correctly:
Option |
Tests |
|---|---|
|
TPC-H, TPC-DS, SSB, ClickBench selection |
|
Scale factor validation (0.01, 0.1) |
|
Phase selection (generate, load, warmup, power) |
|
Query subset: single, multiple, case normalization |
|
Tuning modes (tuned, notuning, auto) |
|
Compression formats (zstd, gzip, none) |
|
Validation levels (disabled, loose) |
|
Deterministic seeding (seed 42 reproducibility) |
|
Force modes (datagen, all) |
# Run CLI option tests
uv run -- python -m pytest tests/e2e/test_cli_options.py -v
# Test specific option
uv run -- python -m pytest tests/e2e/test_cli_options.py -k "scale" -v
Error Handling Tests (test_error_handling.py)¶
Validates proper error messages and exit codes:
Missing required parameters (
--platform,--benchmark)Invalid platform/benchmark names
Invalid scale factors (negative, zero; TPC-DS requires scale ≥ 1)
Query format violations (max 100 queries, alphanumeric IDs only, SQL injection patterns)
Invalid tuning, compression, and validation modes
Platform option format validation (
key=value)Dry-run directory edge cases (file instead of directory)
Help and version commands (
--help,--help all,--help examples,--version)
# Run error handling tests
uv run -- python -m pytest tests/e2e/test_error_handling.py -v
Result Validation Tests (test_result_validation.py)¶
Validates benchmark result file structure and content:
Result structure: Required fields, null handling, timestamp formats, negative durations, query count consistency
Query execution: Valid/invalid query IDs, status values (SUCCESS, FAILED, ERROR, TIMEOUT, SKIPPED), execution times
Phase timing: Duration constraints, zero-duration handling
Execution phases: Setup (data_generation, data_loading), power_test, throughput_test validation
Validator reuse: Multiple validation passes, error collection
File loading: Valid JSON, nonexistent files, malformed JSON, end-to-end load-and-validate
# Run result validation tests
uv run -- python -m pytest tests/e2e/test_result_validation.py -v
Local Platform Tests (test_local_platforms.py)¶
Full benchmark execution against local databases:
Platform |
Tests |
|---|---|
DuckDB |
Full TPC-H, query subsets, |
SQLite |
Full TPC-H, phase-limited execution (generate,load only) |
DataFusion |
Full TPC-H (skipped if unavailable) |
ClickHouse (chDB) |
Dry-run with local mode and platform options |
Cross-platform |
DuckDB vs SQLite result comparison (Q1) |
All platforms also include dry-run artifact validation tests, plus parametrized tests for help commands and dry-run across platform/benchmark combinations.
# Run local platform tests
uv run -- python -m pytest tests/e2e/test_local_platforms.py -v
# Run DuckDB tests only
uv run -- python -m pytest tests/e2e/test_local_platforms.py -k duckdb -v
Note: Local platform tests execute actual benchmarks and may take several minutes.
Cloud Platform Tests (test_cloud_platforms.py)¶
Dry-run tests for cloud platforms (no credentials required):
Platform |
Tests |
|---|---|
Snowflake |
Dry-run validation, artifact generation |
BigQuery |
Dry-run validation, artifact generation |
Redshift |
Dry-run validation, artifact generation |
Athena |
Dry-run validation, artifact generation |
Databricks |
Dry-run validation, artifact generation |
Firebolt |
Dry-run validation, artifact generation |
Trino |
Dry-run validation, artifact generation |
Presto |
Dry-run validation, artifact generation |
ClickHouse (cloud) |
Dry-run validation (server mode) |
Each platform class also has a test_full_execution_with_credentials() test gated behind the live_integration marker. Parametrized tests cover all platforms across multiple benchmark types (including TPC-DS with SF=1 constraint).
# Run cloud platform dry-run tests
uv run -- python -m pytest tests/e2e/test_cloud_platforms.py -v
# Run live integration tests (requires credentials)
uv run -- python -m pytest -m live_integration -v
DataFrame Platform Tests (test_dataframe_platforms.py)¶
Tests for DataFrame-based platforms:
Platform |
Tests |
|---|---|
Pandas |
Full TPC-H, query subsets, dry-run |
Polars |
Full TPC-H, query subsets, dry-run |
Dask |
Full TPC-H, dry-run |
PySpark |
Full TPC-H (marked |
cuDF |
Full TPC-H (requires NVIDIA GPU) |
Modin |
Full TPC-H execution |
DataFusion (DF) |
Full TPC-H execution |
Cross-platform |
Pandas vs Polars result comparison (Q1) |
# Run DataFrame platform tests
uv run -- python -m pytest tests/e2e/test_dataframe_platforms.py -v
# Skip slow tests (PySpark)
uv run -- python -m pytest tests/e2e/test_dataframe_platforms.py -m "not slow" -v
Test Markers¶
E2E tests use several pytest markers for selective execution:
Marker |
Description |
|---|---|
|
All E2E tests |
|
Quick dry-run tests (no benchmark execution) |
|
Local platform tests (full execution) |
|
Cloud platform dry-run tests |
|
DataFrame platform tests |
|
Tests that take significant time (e.g., PySpark) |
|
Tests requiring real cloud credentials |
|
Benchmark-specific tests |
|
Platform-specific tests |
# Examples
uv run -- python -m pytest -m e2e_quick # Quick tests only
uv run -- python -m pytest -m e2e_local # Local platforms
uv run -- python -m pytest -m "e2e and not slow" # Fast E2E tests
uv run -- python -m pytest -m "e2e and tpch" # TPC-H tests only
uv run -- python -m pytest -m live_integration # Live cloud tests
Fixtures¶
The E2E test suite provides fixtures in conftest.py:
Directory Fixtures¶
@pytest.fixture
def results_dir(tmp_path):
"""Temporary directory for benchmark results."""
@pytest.fixture
def dry_run_dir(tmp_path):
"""Temporary directory for dry-run output."""
Platform Configuration Fixtures¶
# Local platforms
@pytest.fixture
def duckdb_config():
return {"platform": "duckdb", "benchmark": "tpch", "scale": "0.01"}
@pytest.fixture
def sqlite_config():
return {"platform": "sqlite", "benchmark": "tpch", "scale": "0.01"}
@pytest.fixture
def datafusion_config():
return {"platform": "datafusion", "benchmark": "tpch", "scale": "0.01"}
# DataFrame platforms
@pytest.fixture
def pandas_df_config():
return {"platform": "pandas-df", "benchmark": "tpch", "scale": "0.01"}
@pytest.fixture
def polars_df_config():
return {"platform": "polars-df", "benchmark": "tpch", "scale": "0.01"}
# Cloud platforms (dry-run)
@pytest.fixture
def snowflake_dry_run_config(dry_run_dir):
return {"platform": "snowflake", "benchmark": "tpch", "dry_run": str(dry_run_dir)}
Helper Functions¶
def build_cli_args(config, extra_args=None):
"""Build CLI arguments from config dictionary."""
def run_benchmark(config, extra_args=None, env=None, timeout=600):
"""Run a benchmark with configuration."""
def find_result_files(directory, pattern="*.json"):
"""Find result files in directory."""
def find_latest_result(directory, pattern="*.json"):
"""Find the most recent result file."""
Writing E2E Tests¶
Basic Test Pattern¶
import pytest
from tests.e2e.conftest import build_cli_args, run_benchmark
def test_basic_benchmark(duckdb_config, results_dir):
"""Test basic benchmark execution."""
config = {**duckdb_config, "output": str(results_dir)}
result = run_benchmark(config)
assert result.returncode == 0
assert "Benchmark completed" in result.stdout
Testing Error Conditions¶
def test_invalid_scale_factor():
"""Test that invalid scale factors produce clear errors."""
config = {"platform": "duckdb", "benchmark": "tpcds", "scale": "0.1"}
result = run_benchmark(config)
assert result.returncode != 0
assert "scale factor" in result.stderr.lower()
Testing Result Files¶
import json
from tests.e2e.conftest import find_result_files
def test_result_file_schema(duckdb_config, results_dir):
"""Test that result files match expected schema."""
config = {**duckdb_config, "output": str(results_dir)}
run_benchmark(config)
result_files = find_result_files(results_dir / "results")
assert len(result_files) > 0
with open(result_files[0]) as f:
data = json.load(f)
assert "platform" in data
assert "benchmark" in data
assert "successful_queries" in data
CI/CD Integration¶
GitHub Actions¶
jobs:
e2e-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
- name: Install dependencies
run: uv sync --group dev
- name: Run E2E quick tests
run: uv run -- python -m pytest -m e2e_quick -v
- name: Run E2E local tests
run: uv run -- python -m pytest -m e2e_local -v
timeout-minutes: 15
Makefile Targets¶
test-e2e-quick:
uv run -- python -m pytest -m e2e_quick -v
test-e2e-local:
uv run -- python -m pytest -m e2e_local -v
test-e2e-all:
uv run -- python -m pytest tests/e2e/ -v
Platform Availability Gating¶
Tests use decorators from tests/e2e/utils/platform_detection.py to skip gracefully when dependencies or credentials are unavailable:
from tests.e2e.utils.platform_detection import requires_platform, requires_gpu
@requires_platform("datafusion")
def test_datafusion_execution():
"""Skipped if datafusion not importable."""
@requires_gpu()
def test_cudf_execution():
"""Skipped if no NVIDIA GPU/CUDA available."""
Available decorators:
Decorator |
Checks |
|---|---|
|
Platform module is importable |
|
DataFrame platform available |
|
NVIDIA GPU and CUDA present |
|
Required env vars set (e.g., |
Troubleshooting¶
Tests Timing Out¶
E2E tests have a 10-minute timeout by default. For slower systems:
# Increase timeout
uv run -- python -m pytest tests/e2e/ --timeout=900
ClickHouse Tests Without Dependencies¶
The test suite includes automatic stub generation for ClickHouse dependencies:
@pytest.fixture
def clickhouse_stub_dir(tmp_path):
"""Create minimal chDB and clickhouse_driver stubs."""
# Stubs are created automatically for dry-run tests
Debugging Test Failures¶
# Run with verbose output
uv run -- python -m pytest tests/e2e/test_cli_options.py -v -s
# Run single test with debugging
uv run -- python -m pytest tests/e2e/test_cli_options.py::test_benchmark_selection -v -s --tb=long
Result File Inspection¶
# Find generated result files
find /tmp -name "*.json" -path "*/benchmark_results/*" 2>/dev/null
# Validate JSON schema
python -c "import json; json.load(open('result.json'))"