CI/CD Integration Guide¶
Complete guide for integrating BenchBox into CI/CD pipelines for automated performance testing and regression detection.
Overview¶
BenchBox can be integrated into CI/CD pipelines to:
Automate Performance Testing: Run benchmarks on every commit or pull request
Detect Regressions: Automatically identify performance degradations
Track Performance Trends: Build historical performance baselines
Validate Optimizations: Verify that code changes improve performance
Generate Reports: Create automated performance reports
Table of Contents¶
GitHub Actions
GitLab CI
Jenkins
CircleCI
Azure Pipelines
Best Practices
Performance Baselines
Regression Detection
Reporting
Troubleshooting
See Also
GitHub Actions¶
Basic Workflow¶
Create .github/workflows/benchmarks.yml:
name: Benchmark Tests
on:
pull_request:
branches: [main]
push:
branches: [main]
schedule:
# Run daily at 2 AM UTC
- cron: '0 2 * * *'
jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
      - name: Install UV
        run: |
          curl -LsSf https://astral.sh/uv/install.sh | sh
          # Make uv visible to later steps (installer uses ~/.cargo/bin or ~/.local/bin)
          echo "$HOME/.cargo/bin" >> "$GITHUB_PATH"
          echo "$HOME/.local/bin" >> "$GITHUB_PATH"
      - name: Install BenchBox
        run: uv pip install --system "benchbox[all]"
- name: Run TPC-H Benchmark
run: |
benchbox run \
--platform duckdb \
--benchmark tpch \
--scale 0.01 \
--output results/
- name: Upload Results
uses: actions/upload-artifact@v4
with:
name: benchmark-results
path: results/*.json
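The regression workflow below looks for a baseline-results artifact. One way to produce it is an extra upload step on pushes to main (a sketch; artifacts are scoped to a single workflow run, so durable cross-run baselines usually need external storage, as described under Baseline Management):

      - name: Upload Baseline
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        uses: actions/upload-artifact@v4
        with:
          name: baseline-results
          path: results/*.json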
Regression Detection Workflow¶
Create .github/workflows/benchmark-regression.yml:
name: Performance Regression Check
on:
pull_request:
branches: [main]
jobs:
benchmark-comparison:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0 # Need history for baseline
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
      - name: Install UV
        run: |
          curl -LsSf https://astral.sh/uv/install.sh | sh
          # Make uv visible to later steps (installer uses ~/.cargo/bin or ~/.local/bin)
          echo "$HOME/.cargo/bin" >> "$GITHUB_PATH"
          echo "$HOME/.local/bin" >> "$GITHUB_PATH"
      - name: Install BenchBox
        run: uv pip install --system "benchbox[all]"
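      # download-artifact@v4 only sees artifacts from the same workflow run
      # unless run-id and github-token are supplied; use external storage
      # (see Baseline Management) for durable cross-run baselines.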
- name: Download Baseline
uses: actions/download-artifact@v4
with:
name: baseline-results
path: baseline/
continue-on-error: true
- name: Run Current Benchmark
run: |
benchbox run \
--platform duckdb \
--benchmark tpch \
--scale 0.01 \
--output current/
      - name: Compare Results
        id: compare
        # The script exits non-zero on regression; don't fail here so the
        # PR comment and explicit failure steps below still run.
        continue-on-error: true
        run: |
python scripts/compare_benchmarks.py \
--baseline baseline/tpch_sf001_duckdb.json \
--current current/tpch_sf001_duckdb.json \
--threshold 10
- name: Comment PR
if: github.event_name == 'pull_request'
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const report = fs.readFileSync('comparison_report.md', 'utf8');
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: report
});
- name: Fail on Regression
if: steps.compare.outputs.regression == 'true'
run: |
echo "Performance regression detected!"
exit 1
Benchmark Comparison Script¶
Create scripts/compare_benchmarks.py:
#!/usr/bin/env python3
"""Compare benchmark results and detect regressions."""
import argparse
import json
import os
import sys
from pathlib import Path
def compare_benchmarks(baseline_path: Path, current_path: Path, threshold: float = 10.0):
"""Compare two benchmark results.
Args:
baseline_path: Path to baseline results
current_path: Path to current results
threshold: Regression threshold percentage
Returns:
True if regression detected
"""
with open(baseline_path) as f:
baseline = json.load(f)
with open(current_path) as f:
current = json.load(f)
# Extract timing data
baseline_timing = baseline.get("results", {}).get("timing", {})
current_timing = current.get("results", {}).get("timing", {})
baseline_avg = baseline_timing.get("avg_ms", 0) / 1000
current_avg = current_timing.get("avg_ms", 0) / 1000
# Calculate change
if baseline_avg == 0:
print("⚠️ No baseline data available")
return False
change_pct = ((current_avg - baseline_avg) / baseline_avg) * 100
# Generate report
report = f"""## Benchmark Comparison Report
### Overall Performance
- **Baseline**: {baseline_avg:.3f}s average query time
- **Current**: {current_avg:.3f}s average query time
- **Change**: {change_pct:+.2f}%
"""
# Determine status
is_regression = change_pct > threshold
if is_regression:
report += f"### ❌ Regression Detected\n\n"
report += f"Performance degraded by {change_pct:.2f}% (threshold: {threshold}%)\n"
elif change_pct < -threshold:
report += f"### ✅ Performance Improvement\n\n"
report += f"Performance improved by {abs(change_pct):.2f}%\n"
else:
report += f"### ✓ No Significant Change\n\n"
report += f"Performance change within acceptable range\n"
# Write report
Path("comparison_report.md").write_text(report)
# Set GitHub Actions output
with open(os.environ.get("GITHUB_OUTPUT", "/dev/null"), "a") as f:
f.write(f"regression={'true' if is_regression else 'false'}\n")
print(report)
return is_regression
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--baseline", type=Path, required=True)
parser.add_argument("--current", type=Path, required=True)
parser.add_argument("--threshold", type=float, default=10.0)
args = parser.parse_args()
is_regression = compare_benchmarks(args.baseline, args.current, args.threshold)
sys.exit(1 if is_regression else 0)
GitLab CI¶
Create .gitlab-ci.yml:
stages:
- benchmark
- compare
- report
variables:
BENCHMARK_SCALE: "0.01"
REGRESSION_THRESHOLD: "10"
benchmark:
stage: benchmark
image: python:3.11
before_script:
- curl -LsSf https://astral.sh/uv/install.sh | sh
- export PATH="$HOME/.cargo/bin:$PATH"
    - uv pip install --system "benchbox[all]"
script:
- |
benchbox run \
--platform duckdb \
--benchmark tpch \
--scale ${BENCHMARK_SCALE} \
--output results/
artifacts:
paths:
- results/*.json
expire_in: 30 days
compare:
stage: compare
image: python:3.11
dependencies:
- benchmark
script:
- |
if [ -f "baseline/tpch_sf001_duckdb.json" ]; then
python scripts/compare_benchmarks.py \
--baseline baseline/tpch_sf001_duckdb.json \
--current results/*.json \
--threshold ${REGRESSION_THRESHOLD}
else
echo "No baseline found, skipping comparison"
fi
allow_failure: false
artifacts:
paths:
- comparison_report.md
when: always
report:
stage: report
image: python:3.11
dependencies:
- compare
script:
- |
if [ -f "comparison_report.md" ]; then
cat comparison_report.md
fi
when: always
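A scheduled pipeline can refresh the baseline from the latest results. A minimal sketch (the update-baseline job name and artifact layout are assumptions to adapt):

update-baseline:
  stage: report
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  dependencies:
    - benchmark
  script:
    - mkdir -p baseline
    - cp results/*.json baseline/
  artifacts:
    paths:
      - baseline/
    expire_in: never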
Jenkins¶
Create Jenkinsfile:
pipeline {
agent any
parameters {
string(name: 'BENCHMARK', defaultValue: 'tpch', description: 'Benchmark to run')
string(name: 'SCALE_FACTOR', defaultValue: '0.01', description: 'Scale factor')
string(name: 'PLATFORM', defaultValue: 'duckdb', description: 'Database platform')
}
    environment {
        // Make uv (installed to ~/.cargo/bin) available to later stages
        PATH = "${env.HOME}/.cargo/bin:${env.PATH}"
    }
stages {
stage('Setup') {
steps {
sh '''
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.cargo/bin:$PATH"
                    uv pip install --system benchbox[all]
'''
}
}
stage('Run Benchmark') {
steps {
sh """
benchbox run \
--platform ${params.PLATFORM} \
--benchmark ${params.BENCHMARK} \
--scale ${params.SCALE_FACTOR} \
--output results/
"""
}
}
stage('Compare with Baseline') {
when {
expression {
fileExists('baseline/tpch_sf001_duckdb.json')
}
}
steps {
sh '''
python scripts/compare_benchmarks.py \
--baseline baseline/*.json \
--current results/*.json \
--threshold 10
'''
}
}
stage('Archive Results') {
steps {
archiveArtifacts artifacts: 'results/*.json', fingerprint: true
archiveArtifacts artifacts: 'results/*.html', fingerprint: true
}
}
stage('Publish Report') {
steps {
publishHTML([
reportDir: 'results',
reportFiles: '*.html',
reportName: 'Benchmark Report'
])
}
}
}
post {
always {
cleanWs()
}
failure {
emailext(
subject: "Benchmark Failed: ${env.JOB_NAME} - ${env.BUILD_NUMBER}",
body: "Benchmark test failed. Check ${env.BUILD_URL} for details.",
to: '${DEFAULT_RECIPIENTS}'
)
}
}
}
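To mirror the nightly schedule from the GitHub Actions example, a cron trigger can be added inside the pipeline block:

    triggers {
        // Nightly run; H hashes the minute to spread load across jobs
        cron('H 2 * * *')
    }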
CircleCI¶
Create .circleci/config.yml:
version: 2.1
executors:
python-executor:
docker:
- image: cimg/python:3.11
working_directory: ~/project
jobs:
run-benchmark:
executor: python-executor
steps:
- checkout
- run:
name: Install UV
command: curl -LsSf https://astral.sh/uv/install.sh | sh
- run:
name: Install BenchBox
command: |
export PATH="$HOME/.cargo/bin:$PATH"
            uv pip install --system "benchbox[all]"
- run:
name: Run TPC-H Benchmark
command: |
benchbox run \
--platform duckdb \
--benchmark tpch \
--scale 0.01 \
--output results/
- store_artifacts:
path: results/
destination: benchmark-results
- persist_to_workspace:
root: .
paths:
- results/*.json
compare-results:
executor: python-executor
steps:
- checkout
- attach_workspace:
at: .
- run:
name: Compare with Baseline
command: |
if [ -f "baseline/tpch_sf001_duckdb.json" ]; then
python scripts/compare_benchmarks.py \
--baseline baseline/tpch_sf001_duckdb.json \
--current results/*.json \
--threshold 10
fi
- store_artifacts:
path: comparison_report.md
workflows:
version: 2
benchmark-workflow:
jobs:
- run-benchmark
- compare-results:
requires:
- run-benchmark
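A scheduled workflow gives CircleCI the same nightly cadence as the GitHub Actions example; a sketch to add under workflows:

  nightly-benchmark:
    triggers:
      - schedule:
          cron: "0 2 * * *"
          filters:
            branches:
              only:
                - main
    jobs:
      - run-benchmark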
Azure Pipelines¶
Create azure-pipelines.yml:
trigger:
branches:
include:
- main
paths:
include:
- src/*
pr:
branches:
include:
- main
pool:
vmImage: 'ubuntu-latest'
variables:
BENCHMARK: 'tpch'
SCALE_FACTOR: '0.01'
PLATFORM: 'duckdb'
stages:
- stage: Benchmark
jobs:
- job: RunBenchmark
steps:
- task: UsePythonVersion@0
inputs:
versionSpec: '3.11'
- script: |
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.cargo/bin:$PATH"
        uv pip install --system "benchbox[all]"
displayName: 'Install Dependencies'
- script: |
benchbox run \
--platform $(PLATFORM) \
--benchmark $(BENCHMARK) \
--scale $(SCALE_FACTOR) \
--output $(Build.ArtifactStagingDirectory)/results/
displayName: 'Run Benchmark'
- task: PublishBuildArtifacts@1
inputs:
PathtoPublish: '$(Build.ArtifactStagingDirectory)/results'
ArtifactName: 'benchmark-results'
- stage: Compare
dependsOn: Benchmark
condition: succeeded()
jobs:
- job: CompareResults
steps:
          - task: DownloadBuildArtifacts@1
            inputs:
              buildType: 'current'
              downloadType: 'single'
              artifactName: 'benchmark-results'
              downloadPath: '$(Build.ArtifactStagingDirectory)'
          # Azure condition expressions cannot test for files, so check in the script
          - script: |
              if [ -f baseline/tpch_sf001_duckdb.json ]; then
                python scripts/compare_benchmarks.py \
                  --baseline baseline/tpch_sf001_duckdb.json \
                  --current $(Build.ArtifactStagingDirectory)/benchmark-results/*.json \
                  --threshold 10
              else
                echo "No baseline found, skipping comparison"
              fi
            displayName: 'Compare with Baseline'
Best Practices¶
1. Small Scale Factors¶
Use small scale factors in CI/CD to keep execution times reasonable:
# Fast feedback in CI
--scale 0.01 # ~10 MB data, ~30 seconds
# Nightly builds can use larger scales
--scale 0.1 # ~100 MB data, ~5 minutes
2. Baseline Management¶
Maintain stable baselines:
# Store baseline in version control
git add baseline/tpch_sf001_duckdb.json
git commit -m "Update performance baseline"
# Or use artifact storage (aws s3 cp takes a single source, so copy the directory)
aws s3 cp results/ s3://benchmarks/baselines/$(git rev-parse HEAD)/ \
  --recursive --exclude "*" --include "*.json"
3. Conditional Execution¶
Run benchmarks only when relevant:
# GitHub Actions - only on specific paths
on:
push:
paths:
- 'src/**'
- 'benchmarks/**'
4. Parallel Execution¶
Run multiple benchmarks in parallel:
strategy:
matrix:
benchmark: [tpch, ssb, clickbench]
scale: [0.01, 0.1]
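Each matrix combination then feeds the run step directly; a GitHub Actions sketch (install steps omitted):

      - name: Run ${{ matrix.benchmark }} at scale ${{ matrix.scale }}
        run: |
          benchbox run \
            --platform duckdb \
            --benchmark ${{ matrix.benchmark }} \
            --scale ${{ matrix.scale }} \
            --output results/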
5. Caching¶
Cache dependencies and data:
- name: Cache BenchBox data
uses: actions/cache@v4
with:
path: ~/.benchbox/data
key: benchbox-data-${{ hashFiles('**/benchmark-config.yml') }}
Performance Baselines¶
Creating Baselines¶
Create initial baseline:
from benchbox.tpch import TPCH
from benchbox.platforms.duckdb import DuckDBAdapter
from benchbox.core.results.exporter import ResultExporter
# Run benchmark
benchmark = TPCH(scale_factor=0.01)
adapter = DuckDBAdapter()
results = adapter.run_benchmark(benchmark)
# Export as baseline
exporter = ResultExporter(output_dir="baseline")
baseline_files = exporter.export_result(results, formats=["json"])
print(f"Baseline created: {baseline_files['json']}")
Updating Baselines¶
Update baseline after verified improvements:
# Run comparison
python scripts/compare_benchmarks.py \
--baseline baseline/tpch_sf001_duckdb.json \
--current results/tpch_sf001_duckdb.json
# If improvement is verified, update baseline
cp results/tpch_sf001_duckdb.json baseline/tpch_sf001_duckdb.json
git add baseline/tpch_sf001_duckdb.json
git commit -m "Update baseline with verified improvements"
Regression Detection¶
Advanced Comparison Script¶
Create scripts/regression_detector.py:
#!/usr/bin/env python3
"""Advanced regression detection with query-level analysis."""
import argparse
import json
import sys
from pathlib import Path
from typing import Dict, List
class RegressionDetector:
def __init__(self, threshold: float = 10.0):
self.threshold = threshold
def load_results(self, path: Path) -> dict:
with open(path) as f:
return json.load(f)
def extract_query_timings(self, results: dict) -> Dict[str, float]:
"""Extract query-level timing data."""
query_timings = {}
queries = results.get("results", {}).get("queries", {}).get("details", [])
for query in queries:
query_id = query.get("id") or query.get("query_id")
exec_time_ms = query.get("execution_time_ms", 0)
query_timings[query_id] = exec_time_ms / 1000
return query_timings
def compare_queries(
self,
baseline_timings: Dict[str, float],
current_timings: Dict[str, float]
) -> List[dict]:
"""Compare query-level performance."""
comparisons = []
for query_id in baseline_timings.keys():
if query_id not in current_timings:
continue
baseline_time = baseline_timings[query_id]
current_time = current_timings[query_id]
if baseline_time == 0:
continue
change_pct = ((current_time - baseline_time) / baseline_time) * 100
comparisons.append({
"query_id": query_id,
"baseline_time": baseline_time,
"current_time": current_time,
"change_pct": change_pct,
"is_regression": change_pct > self.threshold,
"is_improvement": change_pct < -self.threshold
})
return comparisons
def generate_report(
self,
baseline: dict,
current: dict,
comparisons: List[dict]
) -> str:
"""Generate detailed regression report."""
# Overall metrics
baseline_timing = baseline.get("results", {}).get("timing", {})
current_timing = current.get("results", {}).get("timing", {})
baseline_avg = baseline_timing.get("avg_ms", 0) / 1000
current_avg = current_timing.get("avg_ms", 0) / 1000
overall_change = ((current_avg - baseline_avg) / baseline_avg) * 100 if baseline_avg else 0
# Categorize queries
regressions = [c for c in comparisons if c["is_regression"]]
improvements = [c for c in comparisons if c["is_improvement"]]
unchanged = [c for c in comparisons if not c["is_regression"] and not c["is_improvement"]]
# Build report
report = f"""# Benchmark Regression Report
## Overall Performance
| Metric | Baseline | Current | Change |
|--------|----------|---------|--------|
| Average Query Time | {baseline_avg:.3f}s | {current_avg:.3f}s | {overall_change:+.2f}% |
"""
# Summary
if overall_change > self.threshold:
report += f"### ❌ Regression Detected\n\n"
report += f"Overall performance degraded by {overall_change:.2f}%\n\n"
elif overall_change < -self.threshold:
report += f"### ✅ Performance Improvement\n\n"
report += f"Overall performance improved by {abs(overall_change):.2f}%\n\n"
else:
report += f"### ✓ Performance Stable\n\n"
report += f"No significant overall change ({overall_change:+.2f}%)\n\n"
# Query breakdown
report += f"""## Query-Level Analysis
- **Total Queries**: {len(comparisons)}
- **Regressions**: {len(regressions)}
- **Improvements**: {len(improvements)}
- **Unchanged**: {len(unchanged)}
"""
# Regressions detail
if regressions:
report += "### Regressed Queries\n\n"
report += "| Query | Baseline | Current | Change |\n"
report += "|-------|----------|---------|--------|\n"
for r in sorted(regressions, key=lambda x: x["change_pct"], reverse=True)[:10]:
report += f"| {r['query_id']} | {r['baseline_time']:.3f}s | {r['current_time']:.3f}s | {r['change_pct']:+.2f}% |\n"
report += "\n"
# Improvements detail
if improvements:
report += "### Improved Queries\n\n"
report += "| Query | Baseline | Current | Change |\n"
report += "|-------|----------|---------|--------|\n"
for i in sorted(improvements, key=lambda x: x["change_pct"])[:10]:
report += f"| {i['query_id']} | {i['baseline_time']:.3f}s | {i['current_time']:.3f}s | {i['change_pct']:+.2f}% |\n"
report += "\n"
return report
def detect_regressions(
self,
baseline_path: Path,
current_path: Path
) -> bool:
"""Main regression detection logic."""
baseline = self.load_results(baseline_path)
current = self.load_results(current_path)
baseline_timings = self.extract_query_timings(baseline)
current_timings = self.extract_query_timings(current)
comparisons = self.compare_queries(baseline_timings, current_timings)
report = self.generate_report(baseline, current, comparisons)
# Write report
Path("regression_report.md").write_text(report)
print(report)
# Determine if there are regressions
has_regressions = any(c["is_regression"] for c in comparisons)
return has_regressions
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--baseline", type=Path, required=True)
parser.add_argument("--current", type=Path, required=True)
parser.add_argument("--threshold", type=float, default=10.0)
args = parser.parse_args()
detector = RegressionDetector(threshold=args.threshold)
has_regressions = detector.detect_regressions(args.baseline, args.current)
sys.exit(1 if has_regressions else 0)
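Because the script exits non-zero when any query regresses, it can gate a pipeline directly; for example, as a GitHub Actions step:

      - name: Detect Query-Level Regressions
        run: |
          python scripts/regression_detector.py \
            --baseline baseline/tpch_sf001_duckdb.json \
            --current current/tpch_sf001_duckdb.json \
            --threshold 10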
Reporting¶
HTML Report Generation¶
Generate HTML reports with charts:
from benchbox.core.results.exporter import ResultExporter
exporter = ResultExporter(output_dir="reports")
# Export with HTML format
files = exporter.export_result(results, formats=["html"])
# Export comparison report
comparison = exporter.compare_results(baseline_path, current_path)
report_path = exporter.export_comparison_report(comparison)
print(f"Report: {report_path}")
Slack Notifications¶
Send results to Slack:
import os

import requests
def send_slack_notification(webhook_url: str, results: dict):
"""Send benchmark results to Slack."""
benchmark_name = results.get("benchmark", {}).get("name", "Unknown")
execution_id = results.get("execution", {}).get("id", "")
timing = results.get("results", {}).get("timing", {})
avg_time = timing.get("avg_ms", 0) / 1000
total_queries = results.get("results", {}).get("queries", {}).get("total", 0)
message = {
"blocks": [
{
"type": "header",
"text": {
"type": "plain_text",
"text": f"Benchmark Results: {benchmark_name}"
}
},
{
"type": "section",
"fields": [
{"type": "mrkdwn", "text": f"*Execution ID:*\n{execution_id}"},
{"type": "mrkdwn", "text": f"*Total Queries:*\n{total_queries}"},
{"type": "mrkdwn", "text": f"*Average Time:*\n{avg_time:.3f}s"},
]
}
]
}
response = requests.post(webhook_url, json=message)
response.raise_for_status()
# Usage
webhook = os.environ.get("SLACK_WEBHOOK_URL")
if webhook:
send_slack_notification(webhook, results)
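Keep the webhook URL out of the repository and inject it from CI secrets. A GitHub Actions sketch, assuming the code above lives in a hypothetical scripts/notify_slack.py:

      - name: Notify Slack
        if: always()
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        run: python scripts/notify_slack.py results/tpch_sf001_duckdb.json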
Troubleshooting¶
Long Execution Times¶
Problem: Benchmarks take too long in CI
Solution: Use smaller scale factors and query subsets:
# Very small scale and a subset of queries for fast feedback
benchbox run \
  --platform duckdb \
  --benchmark tpch \
  --scale 0.001 \
  --queries 1,6,12
Resource Constraints¶
Problem: CI runners run out of memory
Solution: Use resource-appropriate configurations:
# GitHub Actions - larger runners use labels defined in your org settings
runs-on: ubuntu-latest-8-cores
# Or limit benchmark scope
--scale 0.01 # Smaller data size
Flaky Results¶
Problem: Benchmark results vary between runs
Solution: Run multiple iterations and use median:
# Run multiple times
results = []
for i in range(3):
result = adapter.run_benchmark(benchmark)
results.append(result.average_query_time)
# Use median for stability
median_time = sorted(results)[len(results) // 2]
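The standard library's statistics.median does the same selection (and averages the middle pair for even counts); discarding a warmup run can further reduce noise, though how much it helps is an assumption to validate on your runners:

import statistics

adapter.run_benchmark(benchmark)  # warmup run, result discarded
times = [adapter.run_benchmark(benchmark).average_query_time for _ in range(3)]
median_time = statistics.median(times)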
See Also¶
Testing Guide - Testing strategies
Performance Guide - Performance optimization
Result Analysis API - Result analysis utilities
Examples - Usage examples