ClickHouse Local Mode

Tags intermediate guide clickhouse embedded-platform

BenchBox supports ClickHouse in two deployment modes:

  • Server Mode: Connects to an external ClickHouse server (default)

  • Local Mode: Uses chDB for in-process ClickHouse engine (new)

Overview

ClickHouse Local Mode uses chDB, the official embedded ClickHouse engine, to run ClickHouse queries directly in Python without requiring a separate ClickHouse server installation.

Key Benefits

  • Zero Server Setup: No ClickHouse server installation required

  • Native Performance: In-process execution eliminates IPC overhead

  • Development Friendly: Perfect for testing, development, and quick analysis

  • Same SQL Compatibility: Full ClickHouse SQL dialect support

  • Easy Installation: Single pip install chdb command

Installation

Prerequisites

  • Python 3.10+

  • Supported platforms: macOS and Linux (x86_64 and ARM64)

Install chDB

# Install chDB for embedded mode support
pip install chdb

# Verify installation
python -c "import chdb; print(chdb.chdb_version())"

Install BenchBox with ClickHouse Support

# Install BenchBox (if not already installed)
uv add benchbox

# Or with pip
pip install benchbox

Usage

Basic Usage

# Run TPC-H benchmark in embedded mode
benchbox run tpch --platform=clickhouse --mode=local --scale-factor=0.01

# Run with custom data path
benchbox run tpch --platform=clickhouse --mode=local --data-path=/tmp/benchmark_data

# Compare with server mode
benchbox run tpch --platform=clickhouse --mode=server --host=localhost --port=9000

CLI Arguments

Mode Selection

  • --mode=local - Use embedded ClickHouse via chDB

  • --mode=server - Use ClickHouse server (default for backward compatibility)

Embedded Mode Specific Arguments

  • --data-path=PATH - Optional data path for file operations

Server Mode Arguments (not used in embedded mode)

  • --host=HOST - ClickHouse server host

  • --port=PORT - ClickHouse server port

  • --user=USER - Username for server authentication

  • --password=PASS - Password for server authentication

  • --secure - Use TLS connection

Performance Characteristics

Embedded Mode

  • Memory Usage: Lower baseline memory (~50-200MB)

  • Startup Time: No network connection setup required

  • Query Execution: Columnar engine for analytical workloads

  • Scalability: Suited for small to medium datasets (< 10GB)

  • Concurrency: Single-process, sequential query execution

Server Mode

  • Memory Usage: Higher baseline (server overhead)

  • Startup Time: Network connection overhead

  • Query Execution: Same columnar engine, distributed architecture available

  • Scalability: Designed for large datasets (TB+)

  • Concurrency: Multi-client support, parallel query execution

When to Use Each Mode

Use Embedded Mode When:

  • Development & Testing: Quick benchmark development and validation

  • CI/CD Pipelines: Automated testing without infrastructure setup

  • Data Analysis: Interactive data exploration and analysis

  • Prototyping: Rapid benchmark prototyping and iteration

  • Small to Medium Data: Datasets under 10GB

  • Single-User Scenarios: Personal analysis and development

Use Server Mode When:

  • Production Benchmarking: Large-scale production environment testing

  • Large Datasets: Working with multi-TB datasets

  • Multi-User Access: Shared benchmark environments

  • Enterprise Deployments: Integration with existing ClickHouse infrastructure

  • Performance Testing: Maximum throughput and scalability testing

  • Cluster Configurations: Testing distributed ClickHouse setups

Examples

TPC-H Benchmark

# Small scale for development
benchbox run tpch --platform=clickhouse --mode=local --scale-factor=0.01

# Medium scale for testing
benchbox run tpch --platform=clickhouse --mode=local --scale-factor=1.0

ClickBench Benchmark

# Run ClickBench analytical queries
benchbox run clickbench --platform=clickhouse --mode=local

Custom Data Directory

# Use specific directory for generated data
benchbox run tpch \
  --platform=clickhouse \
  --mode=local \
  --scale-factor=0.1 \
  --data-path=/path/to/benchmark/data

Troubleshooting

Common Issues and Solutions

1. chDB Not Installed

Error: ClickHouse local mode requires chDB but it is not installed.

Solution:

pip install chdb

2. Platform Not Supported

Error: chDB installation failed or not compatible with your platform

Solution:

  • Ensure you’re on macOS or Linux (x86_64/ARM64)

  • Try upgrading pip: pip install --upgrade pip

  • Check Python version: python --version (3.8+ required)

3. Memory Issues with Large Datasets

Error: Memory limit exceeded or system running out of memory

Solution:

  • Use smaller scale factors for testing

  • Switch to server mode for large datasets

  • Monitor system memory usage

4. Query Performance Issues

Queries running slower than expected in embedded mode

Solution:

  • Embedded mode is optimized for small-medium datasets

  • For large datasets or maximum performance, use server mode

  • Consider data partitioning or smaller scale factors

Getting Help

  1. Check Installation: Verify chDB is properly installed

    python -c "import chdb; print('chDB version:', chdb.chdb_version())"
    
  2. Verbose Output: Run with verbose logging

    benchbox run tpch --platform=clickhouse --mode=local --verbose
    
  3. Compare Modes: Test both modes to isolate issues

    # Test embedded mode
    benchbox run tpch --platform=clickhouse --mode=local --scale-factor=0.01
    
    # Test server mode (if available)
    benchbox run tpch --platform=clickhouse --mode=server --scale-factor=0.01
    

Advanced Usage

Performance Tuning

While embedded mode has fewer tuning options than server mode, you can optimize performance:

# Use appropriate scale factors
benchbox run tpch --platform=clickhouse --mode=local --scale-factor=0.1

# Monitor memory usage during execution
top -p $(pgrep -f benchbox)

Integration with Other Tools

# Export results for analysis
benchbox run tpch --platform=clickhouse --mode=local --output=json > results.json

# Run multiple benchmarks
for benchmark in tpch tpcds ssb; do
  echo "Running $benchmark..."
  benchbox run $benchmark --platform=clickhouse --mode=local --scale-factor=0.01
done

Technical Details

Architecture

  • chDB Integration: Uses official ClickHouse local engine

  • Connection Management: Persistent connection maintains table state

  • Query Execution: Direct SQL execution without network overhead

  • Result Processing: Native Python data type conversion

  • Error Handling: Comprehensive error messages with resolution guidance

File Formats

Embedded mode supports all standard formats:

  • CSV, TSV (tab-separated)

  • Parquet (future enhancement)

  • JSON (future enhancement)

Limitations

  • Single Process: No multi-process parallelism

  • Memory Bounds: Limited by available system memory

  • No Clustering: Single-node execution only

  • No Replication: No built-in data redundancy

Migration Guide

From Server to Embedded Mode

# Old server mode command
benchbox run tpch --platform=clickhouse --host=localhost --port=9000

# New embedded mode equivalent
benchbox run tpch --platform=clickhouse --mode=local

From Embedded to Server Mode

# Current embedded mode command
benchbox run tpch --platform=clickhouse --mode=local

# Server mode equivalent (requires ClickHouse server)
benchbox run tpch --platform=clickhouse --mode=server --host=localhost --port=9000

Contributing

To contribute to ClickHouse local mode support:

  1. Testing: Run the embedded mode test suite

    pytest tests/unit/platforms/test_clickhouse_local.py -v
    
  2. Development: Set up development environment

    pip install -e .[dev]
    pip install chdb
    
  3. Bug Reports: Include system information and chDB version

    python -c "import chdb, platform; print(f'chDB: {chdb.chdb_version()}, Platform: {platform.platform()}')"
    

References