ClickHouse Platform Adapter

Tags: reference, python-api, clickhouse

The ClickHouse adapter provides high-performance columnar database execution for analytical benchmarks.

Overview

ClickHouse is an open-source column-oriented database management system that provides:

  • Columnar architecture - Optimized for analytical queries

  • Scalability - Support for petabyte-scale datasets

  • Flexible deployment - Server mode or embedded local mode

  • Compression - Columnar compression for storage efficiency

  • OLAP focus - Designed for analytical workloads

The ClickHouse adapter supports two modes:

  • Server mode - Connect to ClickHouse server (local or remote)

  • Local mode - Embedded execution using chDB library

Common use cases:

  • Analytical workloads

  • Large-scale benchmarking (100GB+)

  • Performance comparison with other columnar databases

  • Real-time analytics applications

Quick Start

Server Mode (Default)

from benchbox.tpch import TPCH
from benchbox.platforms.clickhouse import ClickHouseAdapter

# Connect to ClickHouse server
adapter = ClickHouseAdapter(
    host="localhost",
    port=9000,
    database="benchmark",
    username="default",
    password=""
)

# Run benchmark
benchmark = TPCH(scale_factor=1.0)
results = benchmark.run_with_platform(adapter)

Local Mode (Embedded)

from benchbox.platforms.clickhouse import ClickHouseAdapter

# Embedded ClickHouse with chDB
adapter = ClickHouseAdapter(
    mode="local",
    data_path="./benchmark.chdb"  # Optional persistent storage
)

# Run benchmark
benchmark = TPCH(scale_factor=0.1)
results = benchmark.run_with_platform(adapter)

API Reference

ClickHouseAdapter Class

class ClickHouseAdapter(**config)

Bases: ClickHouseMetadataMixin, ClickHouseSetupMixin, ClickHouseDiagnosticsMixin, ClickHouseWorkloadMixin, ClickHouseTuningMixin, PlatformAdapter

High-level adapter coordinating ClickHouse operations.

Known Limitations:

TPC-DS queries with known incompatibilities:

  • Query 14: INTERSECT DISTINCT requires manual alias addition

  • Query 30: Query plan cloning not implemented for aggregation steps (Code: 48)

  • Query 66: Nested aggregation not supported; requires query rewrite

  • Query 81: Query plan cloning not implemented for aggregation steps (Code: 48)

These queries may fail even with transformations applied and require manual query rewriting or ClickHouse engine improvements.

KNOWN_INCOMPATIBLE_QUERIES = {'tpcds': [14, 30, 66, 81]}
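
A minimal sketch for filtering these queries out before a run, using the class attribute above (how the benchmark consumes the resulting list is left open and may differ from this):

from benchbox.platforms.clickhouse import ClickHouseAdapter

# TPC-DS defines 99 queries; drop the known-incompatible ones up front
skip = set(ClickHouseAdapter.KNOWN_INCOMPATIBLE_QUERIES["tpcds"])
runnable = [q for q in range(1, 100) if q not in skip]
print(f"Running {len(runnable)} of 99 TPC-DS queries")
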
__init__(**config)

Constructor Parameters

Server Mode:

ClickHouseAdapter(
    mode: str = "server",
    host: str = "localhost",
    port: int = 9000,
    database: str = "default",
    username: str = "default",
    password: str = "",
    secure: bool = False,
    compression: bool = False,
    max_memory_usage: str = "8GB",
    max_execution_time: int = 300,
    max_threads: int = 8
)

Local Mode:

ClickHouseAdapter(
    mode: str = "local",
    data_path: Optional[str] = None,
    max_memory_usage: str = "4GB",
    max_execution_time: int = 300,
    max_threads: int = 4
)

Parameters:

Connection (Server Mode):

  • mode (str): Connection mode - “server” (default) or “local”

  • host (str): Server hostname or IP address. Default: “localhost”

  • port (int): Native protocol port. Default: 9000 (HTTP: 8123)

  • database (str): Database name. Default: “default”

  • username (str): Username for authentication. Default: “default”

  • password (str): Password for authentication. Default: “”

  • secure (bool): Use TLS/SSL connection. Default: False

  • compression (bool): Enable compression (disabled by default due to a compatibility issue on Python 3.13+). Default: False

Performance:

  • max_memory_usage (str): Maximum memory per query. Default: “8GB” (server), “4GB” (local)

  • max_execution_time (int): Query timeout in seconds. Default: 300

  • max_threads (int): Maximum query threads. Default: 8 (server), 4 (local)

Local Mode:

  • data_path (str, optional): Path for persistent chDB storage. Default: None (in-memory)

Configuration Examples

Server Mode - Local Development

from benchbox.platforms.clickhouse import ClickHouseAdapter

# Default local server
adapter = ClickHouseAdapter(
    host="localhost",
    port=9000,
    database="benchmark"
)

# With authentication
adapter = ClickHouseAdapter(
    host="localhost",
    port=9000,
    database="benchmark",
    username="benchmark_user",
    password="secure_password"
)

Server Mode - Production

# Production server with TLS
adapter = ClickHouseAdapter(
    host="clickhouse.example.com",
    port=9440,  # Secure native port
    database="production_benchmarks",
    username="admin",
    password="production_password",
    secure=True,
    max_memory_usage="32GB",
    max_threads=16
)

Local Mode - Development

# In-memory execution (fast, no persistence)
adapter = ClickHouseAdapter(mode="local")

# Persistent storage (data survives restarts)
adapter = ClickHouseAdapter(
    mode="local",
    data_path="./benchmarks/clickhouse_local.chdb"
)

Performance Tuning

# High-performance configuration
adapter = ClickHouseAdapter(
    host="localhost",
    database="benchmark",
    max_memory_usage="64GB",     # Increase for large datasets
    max_execution_time=600,      # 10 minute timeout
    max_threads=32,              # Use all available cores
    compression=False            # Disabled by default for compatibility
)
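
To confirm which settings actually took effect, you can query the standard system.settings table on a live connection:

# Inspect which session settings differ from the server defaults
conn = adapter.create_connection()
result = conn.execute("""
    SELECT name, value
    FROM system.settings
    WHERE changed
""")
for name, value in result:
    print(f"{name} = {value}")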

Data Loading

Bulk Loading from Files

from benchbox.platforms.clickhouse import ClickHouseAdapter

adapter = ClickHouseAdapter(host="localhost", database="benchmark")
conn = adapter.create_connection()

# Load from CSV
conn.execute("""
    CREATE TABLE lineitem (
        l_orderkey UInt32,
        l_partkey UInt32,
        l_suppkey UInt32,
        l_linenumber UInt8,
        l_quantity Decimal(15, 2),
        l_extendedprice Decimal(15, 2),
        l_discount Decimal(15, 2),
        l_tax Decimal(15, 2),
        l_returnflag String,
        l_linestatus String,
        l_shipdate Date,
        l_commitdate Date,
        l_receiptdate Date,
        l_shipinstruct String,
        l_shipmode String,
        l_comment String
    ) ENGINE = MergeTree()
    ORDER BY (l_orderkey, l_linenumber)
""")

# Bulk insert from a local file. FROM INFILE is a client-side feature
# (it works in clickhouse-client but not with every Python driver), and
# TPC-H .tbl files are pipe-delimited, so override the CSV delimiter first.
conn.execute("SET format_csv_delimiter = '|'")
conn.execute("""
    INSERT INTO lineitem
    FROM INFILE 'data/lineitem.tbl'
    FORMAT CSV
""")

Loading from S3

# ClickHouse can read directly from S3. Note that CREATE TABLE ... AS SELECT
# requires an explicit ENGINE unless default_table_engine is set.
conn.execute("""
    CREATE TABLE lineitem
    ENGINE = MergeTree()
    ORDER BY (l_orderkey, l_linenumber)
    AS SELECT * FROM s3(
        'https://s3.amazonaws.com/bucket/lineitem/*.parquet',
        'Parquet'
    )
""")

# With credentials
conn.execute("""
    CREATE TABLE lineitem
    ENGINE = MergeTree()
    ORDER BY (l_orderkey, l_linenumber)
    AS SELECT * FROM s3(
        'https://s3.amazonaws.com/bucket/lineitem/*.parquet',
        'aws_access_key_id',
        'aws_secret_access_key',
        'Parquet'
    )
""")

Query Execution

Execute Queries Directly

from benchbox.platforms.clickhouse import ClickHouseAdapter

adapter = ClickHouseAdapter(host="localhost", database="benchmark")
conn = adapter.create_connection()

# Simple query
result = conn.execute("SELECT COUNT(*) FROM lineitem")
row_count = result[0][0]  # Result is list of tuples

# Complex analytical query
result = conn.execute("""
    SELECT
        l_returnflag,
        l_linestatus,
        sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        count(*) as count_order
    FROM lineitem
    WHERE l_shipdate <= '1998-09-01'
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus
""")

Query Plans and Optimization

# Get query plan
plan = conn.execute("""
    EXPLAIN
    SELECT * FROM lineitem
    WHERE l_shipdate > '1995-01-01'
""")
for row in plan:
    print(row[0])

# Analyze the query pipeline (how the query maps to processors and threads)
pipeline = conn.execute("""
    EXPLAIN PIPELINE
    SELECT COUNT(*) FROM lineitem
    GROUP BY l_orderkey
""")
for row in pipeline:
    print(row[0])

Advanced Features

Table Engines

# MergeTree (most common for analytics)
conn.execute("""
    CREATE TABLE orders (
        o_orderkey UInt32,
        o_custkey UInt32,
        o_orderstatus String,
        o_totalprice Decimal(15, 2),
        o_orderdate Date
    ) ENGINE = MergeTree()
    ORDER BY (o_orderdate, o_orderkey)
    PARTITION BY toYYYYMM(o_orderdate)
""")

# ReplacingMergeTree (deduplication)
conn.execute("""
    CREATE TABLE customer_updates (
        c_custkey UInt32,
        c_name String,
        c_address String,
        update_timestamp DateTime
    ) ENGINE = ReplacingMergeTree(update_timestamp)
    ORDER BY c_custkey
""")

Materialized Views

# Create materialized view for pre-aggregation
conn.execute("""
    CREATE MATERIALIZED VIEW orders_by_date
    ENGINE = SummingMergeTree()
    ORDER BY order_date
    AS SELECT
        toDate(o_orderdate) AS order_date,
        count() AS order_count,
        sum(o_totalprice) AS total_revenue
    FROM orders
    GROUP BY order_date
""")

Distributed Queries

# Query across multiple shards (cluster setup required)
result = conn.execute("""
    SELECT
        l_returnflag,
        count() AS cnt
    FROM cluster('benchmark_cluster', default.lineitem)
    GROUP BY l_returnflag
""")

Best Practices

Memory Management

  1. Set appropriate memory limits per query:

    adapter = ClickHouseAdapter(
        host="localhost",
        max_memory_usage="16GB"  # Per query limit
    )
    
  2. Monitor memory usage during execution:

    # Check memory usage
    result = conn.execute("""
        SELECT
            query,
            memory_usage,
            formatReadableSize(memory_usage) AS readable_memory
        FROM system.processes
        WHERE user = currentUser()
    """)
    
  3. Use external aggregation for large GROUP BY:

    # Spill GROUP BY state to disk once it exceeds ~10 GB
    conn.execute("SET max_bytes_before_external_group_by = 10000000000")
    

Performance Optimization

  1. Choose optimal table engine and ordering key:

    # Good: Order by commonly filtered columns
    CREATE TABLE lineitem (...)
    ENGINE = MergeTree()
    ORDER BY (l_shipdate, l_orderkey)
    
    # Better: Include all filter columns
    ORDER BY (l_shipdate, l_returnflag, l_orderkey)
    
  2. Use appropriate data types (a concrete sketch follows this list):

    # Prefer smaller types:
    #   UInt8 instead of UInt32 for small integers
    #   Date instead of DateTime for date-only fields
    #   LowCardinality(String) for repeated strings
    
  3. Partition large tables:

    CREATE TABLE lineitem (...)
    ENGINE = MergeTree()
    PARTITION BY toYYYYMM(l_shipdate)  # Monthly partitions
    ORDER BY (l_orderkey, l_linenumber)
    

Connection Management

  1. Reuse connections for multiple queries (an exception-safe variant follows this list):

    adapter = ClickHouseAdapter(host="localhost")
    conn = adapter.create_connection()
    
    # Run multiple queries
    for query_id in range(1, 23):
        result = conn.execute(queries[query_id])
    
    # Close when done
    adapter.close_connection(conn)
    
  2. Set connection timeouts appropriately:

    # Long-running benchmarks need longer timeouts
    adapter = ClickHouseAdapter(
        host="localhost",
        max_execution_time=600  # 10 minutes
    )
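
As referenced in item 1, a try/finally block keeps the connection from leaking when a query raises. This sketch uses only the adapter methods shown above; queries is assumed to be a mapping of query number to SQL text:

adapter = ClickHouseAdapter(host="localhost")
conn = adapter.create_connection()
try:
    for query_id in range(1, 23):
        result = conn.execute(queries[query_id])
finally:
    # Runs even if a query raises, so the connection is never leaked
    adapter.close_connection(conn)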
    

Common Issues

Connection Refused

Problem: Cannot connect to ClickHouse server

Solutions:

# 1. Check if server is running
ps aux | grep clickhouse-server

# 2. Start server if not running
sudo service clickhouse-server start

# 3. Check port is listening
netstat -ln | grep 9000

# 4. Test connection with the native client
clickhouse-client --host=localhost --port=9000

# 5. Verify the port is reachable from Python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
result = sock.connect_ex(('localhost', 9000))
if result == 0:
    print("Port 9000 is open")
else:
    print("Cannot connect to port 9000")
sock.close()

Memory Limit Exceeded

Problem: Query fails with “Memory limit exceeded”

Solutions:

# 1. Increase memory limit
adapter = ClickHouseAdapter(
    host="localhost",
    max_memory_usage="32GB"
)

# 2. Enable external operations
conn = adapter.create_connection()
conn.execute("SET max_bytes_before_external_group_by = 20000000000")
conn.execute("SET max_bytes_before_external_sort = 20000000000")

# 3. Reduce scale factor for testing
benchmark = TPCH(scale_factor=0.1)  # Start small

Local Mode Import Error

Problem: “chdb is not installed” error in local mode

Solution:

# Install chDB
uv pip install chdb

# Or switch to server mode
adapter = ClickHouseAdapter(mode="server", host="localhost")

Slow Query Performance

Problem: Queries execute slowly

Solutions:

# 1. Increase thread count
adapter = ClickHouseAdapter(
    host="localhost",
    max_threads=16  # Use more CPU cores
)

# 2. Check query plan
plan = conn.execute("EXPLAIN SELECT ...")
# A ReadFromMergeTree step that reads all parts/granules indicates a full scan; consider a better ORDER BY

# 3. Enable query profiling
conn.execute("SET log_queries = 1")
conn.execute("SET log_query_threads = 1")

# Run query
result = conn.execute("SELECT ...")

# Check query log
log = conn.execute("""
    SELECT
        query,
        query_duration_ms,
        memory_usage,
        read_rows,
        read_bytes
    FROM system.query_log
    WHERE type = 'QueryFinish'
    ORDER BY event_time DESC
    LIMIT 1
""")

See Also

Platform Documentation

Benchmark Guides

API Reference

External Resources