ClickHouse Platform Adapter¶
The ClickHouse adapter provides high-performance columnar database execution for analytical benchmarks.
Overview¶
ClickHouse is an open-source column-oriented database management system that provides:
Columnar architecture - Optimized for analytical queries
Scalability - Support for petabyte-scale datasets
Flexible deployment - Server mode or embedded local mode
Compression - Columnar compression for storage efficiency
OLAP focus - Designed for analytical workloads
The ClickHouse adapter supports two modes:
Server mode - Connect to ClickHouse server (local or remote)
Local mode - Embedded execution using chDB library
Common use cases:
Analytical workloads
Large-scale benchmarking (100GB+)
Performance comparison with other columnar databases
Real-time analytics applications
Quick Start¶
Server Mode (Default)¶
from benchbox.tpch import TPCH
from benchbox.platforms.clickhouse import ClickHouseAdapter
# Connect to ClickHouse server
adapter = ClickHouseAdapter(
host="localhost",
port=9000,
database="benchmark",
username="default",
password=""
)
# Run benchmark
benchmark = TPCH(scale_factor=1.0)
results = benchmark.run_with_platform(adapter)
Local Mode (Embedded)¶
from benchbox.platforms.clickhouse import ClickHouseAdapter
# Embedded ClickHouse with chDB
adapter = ClickHouseAdapter(
mode="local",
data_path="./benchmark.chdb" # Optional persistent storage
)
# Run benchmark
benchmark = TPCH(scale_factor=0.1)
results = benchmark.run_with_platform(adapter)
API Reference¶
ClickHouseAdapter Class¶
- class ClickHouseAdapter(**config)[source]¶
Bases:
ClickHouseMetadataMixin,ClickHouseSetupMixin,ClickHouseDiagnosticsMixin,ClickHouseWorkloadMixin,ClickHouseTuningMixin,PlatformAdapterHigh-level adapter coordinating ClickHouse operations.
- Known Limitations:
TPC-DS queries with known incompatibilities: - Query 14: INTERSECT DISTINCT requires manual alias addition - Query 30: Query plan cloning not implemented for aggregation steps (Code: 48) - Query 66: Nested aggregation not supported - requires query rewrite - Query 81: Query plan cloning not implemented for aggregation steps (Code: 48)
These queries may fail even with transformations applied and require manual query rewriting or ClickHouse engine improvements.
- KNOWN_INCOMPATIBLE_QUERIES = {'tpcds': [14, 30, 66, 81]}¶
Constructor Parameters¶
Server Mode:
ClickHouseAdapter(
mode: str = "server",
host: str = "localhost",
port: int = 9000,
database: str = "default",
username: str = "default",
password: str = "",
secure: bool = False,
compression: bool = False,
max_memory_usage: str = "8GB",
max_execution_time: int = 300,
max_threads: int = 8
)
Local Mode:
ClickHouseAdapter(
mode: str = "local",
data_path: Optional[str] = None,
max_memory_usage: str = "4GB",
max_execution_time: int = 300,
max_threads: int = 4
)
Parameters:
Connection (Server Mode):
mode (str): Connection mode - “server” (default) or “local”
host (str): Server hostname or IP address. Default: “localhost”
port (int): Native protocol port. Default: 9000 (HTTP: 8123)
database (str): Database name. Default: “default”
username (str): Username for authentication. Default: “default”
password (str): Password for authentication. Default: “”
secure (bool): Use TLS/SSL connection. Default: False
compression (bool): Enable compression (disabled by default due to Python 3.13+ compatibility). Default: False
Performance:
max_memory_usage (str): Maximum memory per query. Default: “8GB” (server), “4GB” (local)
max_execution_time (int): Query timeout in seconds. Default: 300
max_threads (int): Maximum query threads. Default: 8 (server), 4 (local)
Local Mode:
data_path (str, optional): Path for persistent chDB storage. Default: None (in-memory)
Configuration Examples¶
Server Mode - Local Development¶
from benchbox.platforms.clickhouse import ClickHouseAdapter
# Default local server
adapter = ClickHouseAdapter(
host="localhost",
port=9000,
database="benchmark"
)
# With authentication
adapter = ClickHouseAdapter(
host="localhost",
port=9000,
database="benchmark",
username="benchmark_user",
password="secure_password"
)
Server Mode - Production¶
# Production server with TLS
adapter = ClickHouseAdapter(
host="clickhouse.example.com",
port=9440, # Secure native port
database="production_benchmarks",
username="admin",
password="production_password",
secure=True,
max_memory_usage="32GB",
max_threads=16
)
Local Mode - Development¶
# In-memory execution (fast, no persistence)
adapter = ClickHouseAdapter(mode="local")
# Persistent storage (data survives restarts)
adapter = ClickHouseAdapter(
mode="local",
data_path="./benchmarks/clickhouse_local.chdb"
)
Performance Tuning¶
# High-performance configuration
adapter = ClickHouseAdapter(
host="localhost",
database="benchmark",
max_memory_usage="64GB", # Increase for large datasets
max_execution_time=600, # 10 minute timeout
max_threads=32, # Use all available cores
compression=False # Disabled by default for compatibility
)
Data Loading¶
Bulk Loading from Files¶
from benchbox.platforms.clickhouse import ClickHouseAdapter
adapter = ClickHouseAdapter(host="localhost", database="benchmark")
conn = adapter.create_connection()
# Load from CSV
conn.execute("""
CREATE TABLE lineitem (
l_orderkey UInt32,
l_partkey UInt32,
l_suppkey UInt32,
l_linenumber UInt8,
l_quantity Decimal(15, 2),
l_extendedprice Decimal(15, 2),
l_discount Decimal(15, 2),
l_tax Decimal(15, 2),
l_returnflag String,
l_linestatus String,
l_shipdate Date,
l_commitdate Date,
l_receiptdate Date,
l_shipinstruct String,
l_shipmode String,
l_comment String
) ENGINE = MergeTree()
ORDER BY (l_orderkey, l_linenumber)
""")
# Bulk insert from CSV file
conn.execute("""
INSERT INTO lineitem
FROM INFILE 'data/lineitem.tbl'
FORMAT CSV
""")
Loading from S3¶
# ClickHouse can read directly from S3
conn.execute("""
CREATE TABLE lineitem AS
SELECT * FROM s3(
'https://s3.amazonaws.com/bucket/lineitem/*.parquet',
'Parquet'
)
""")
# With credentials
conn.execute("""
CREATE TABLE lineitem AS
SELECT * FROM s3(
'https://s3.amazonaws.com/bucket/lineitem/*.parquet',
'aws_access_key_id',
'aws_secret_access_key',
'Parquet'
)
""")
Query Execution¶
Execute Queries Directly¶
from benchbox.platforms.clickhouse import ClickHouseAdapter
adapter = ClickHouseAdapter(host="localhost", database="benchmark")
conn = adapter.create_connection()
# Simple query
result = conn.execute("SELECT COUNT(*) FROM lineitem")
row_count = result[0][0] # Result is list of tuples
# Complex analytical query
result = conn.execute("""
SELECT
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
count(*) as count_order
FROM lineitem
WHERE l_shipdate <= '1998-09-01'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus
""")
Query Plans and Optimization¶
# Get query plan
plan = conn.execute("""
EXPLAIN
SELECT * FROM lineitem
WHERE l_shipdate > '1995-01-01'
""")
for row in plan:
print(row[0])
# Analyze query pipeline
pipeline = conn.execute("""
EXPLAIN PIPELINE
SELECT COUNT(*) FROM lineitem
GROUP BY l_orderkey
""")
Advanced Features¶
Table Engines¶
# MergeTree (most common for analytics)
conn.execute("""
CREATE TABLE orders (
o_orderkey UInt32,
o_custkey UInt32,
o_orderstatus String,
o_totalprice Decimal(15, 2),
o_orderdate Date
) ENGINE = MergeTree()
ORDER BY (o_orderdate, o_orderkey)
PARTITION BY toYYYYMM(o_orderdate)
""")
# ReplacingMergeTree (deduplication)
conn.execute("""
CREATE TABLE customer_updates (
c_custkey UInt32,
c_name String,
c_address String,
update_timestamp DateTime
) ENGINE = ReplacingMergeTree(update_timestamp)
ORDER BY c_custkey
""")
Materialized Views¶
# Create materialized view for pre-aggregation
conn.execute("""
CREATE MATERIALIZED VIEW orders_by_date
ENGINE = SummingMergeTree()
ORDER BY order_date
AS SELECT
toDate(o_orderdate) AS order_date,
count() AS order_count,
sum(o_totalprice) AS total_revenue
FROM orders
GROUP BY order_date
""")
Distributed Queries¶
# Query across multiple shards (cluster setup required)
result = conn.execute("""
SELECT
l_returnflag,
count() AS cnt
FROM cluster('benchmark_cluster', default.lineitem)
GROUP BY l_returnflag
""")
Best Practices¶
Memory Management¶
Set appropriate memory limits per query:
adapter = ClickHouseAdapter( host="localhost", max_memory_usage="16GB" # Per query limit )
Monitor memory usage during execution:
# Check memory usage result = conn.execute(""" SELECT query, memory_usage, formatReadableSize(memory_usage) AS readable_memory FROM system.processes WHERE user = currentUser() """)
Use external aggregation for large GROUP BY:
# Enable external aggregation automatically conn.execute("SET max_bytes_before_external_group_by = 10000000000")
Performance Optimization¶
Choose optimal table engine and ordering key:
# Good: Order by commonly filtered columns CREATE TABLE lineitem (...) ENGINE = MergeTree() ORDER BY (l_shipdate, l_orderkey) # Better: Include all filter columns ORDER BY (l_shipdate, l_returnflag, l_orderkey)
Use appropriate data types:
# Prefer smaller types UInt8 instead of UInt32 for small integers Date instead of DateTime for date-only fields LowCardinality(String) for repeated strings
Partition large tables:
CREATE TABLE lineitem (...) ENGINE = MergeTree() PARTITION BY toYYYYMM(l_shipdate) # Monthly partitions ORDER BY (l_orderkey, l_linenumber)
Connection Management¶
Reuse connections for multiple queries:
adapter = ClickHouseAdapter(host="localhost") conn = adapter.create_connection() # Run multiple queries for query_id in range(1, 23): result = conn.execute(queries[query_id]) # Close when done adapter.close_connection(conn)
Set connection timeouts appropriately:
# Long-running benchmarks need longer timeouts adapter = ClickHouseAdapter( host="localhost", max_execution_time=600 # 10 minutes )
Common Issues¶
Connection Refused¶
Problem: Cannot connect to ClickHouse server
Solutions:
# 1. Check if server is running
ps aux | grep clickhouse-server
# 2. Start server if not running
sudo service clickhouse-server start
# 3. Check port is listening
netstat -ln | grep 9000
# 4. Test connection
clickhouse-client --host=localhost --port=9000
# Verify connection in Python
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
result = sock.connect_ex(('localhost', 9000))
if result == 0:
print("Port 9000 is open")
else:
print("Cannot connect to port 9000")
Memory Limit Exceeded¶
Problem: Query fails with “Memory limit exceeded”
Solutions:
# 1. Increase memory limit
adapter = ClickHouseAdapter(
host="localhost",
max_memory_usage="32GB"
)
# 2. Enable external operations
conn = adapter.create_connection()
conn.execute("SET max_bytes_before_external_group_by = 20000000000")
conn.execute("SET max_bytes_before_external_sort = 20000000000")
# 3. Reduce scale factor for testing
benchmark = TPCH(scale_factor=0.1) # Start small
Local Mode Import Error¶
Problem: “chdb is not installed” error in local mode
Solution:
# Install chDB
uv pip install chdb
# Or switch to server mode
adapter = ClickHouseAdapter(mode="server", host="localhost")
Slow Query Performance¶
Problem: Queries execute slowly
Solutions:
# 1. Increase thread count
adapter = ClickHouseAdapter(
host="localhost",
max_threads=16 # Use more CPU cores
)
# 2. Check query plan
plan = conn.execute("EXPLAIN SELECT ...")
# Look for FullScanStep (table scan) - may need better ORDER BY
# 3. Enable query profiling
conn.execute("SET log_queries = 1")
conn.execute("SET log_query_threads = 1")
# Run query
result = conn.execute("SELECT ...")
# Check query log
log = conn.execute("""
SELECT
query,
query_duration_ms,
memory_usage,
read_rows,
read_bytes
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 1
""")
See Also¶
Platform Documentation¶
Platform Selection Guide - Choosing ClickHouse vs other platforms
Multi-Platform Database Support - Quick setup for all platforms
Platform Comparison Matrix - Feature comparison
ClickHouse Local Mode - Local mode guide
Benchmark Guides¶
TPC-H Benchmark - TPC-H on ClickHouse
TPC-DS Benchmark - TPC-DS on ClickHouse
ClickBench (ClickHouse Analytics Benchmark) - ClickBench on ClickHouse
API Reference¶
DuckDB Platform Adapter - DuckDB adapter for comparison
Base Benchmark API - Base benchmark interface
Python API Reference - Python API overview
/usage/api-reference - High-level API guide
External Resources¶
ClickHouse Documentation - Official ClickHouse docs
ClickHouse Performance Guide - Performance tuning
chDB Documentation - Local mode library
ClickHouse Table Engines - Storage engines