ClickBench (ClickHouse Analytics Benchmark)¶
Overview¶
ClickBench is an analytics benchmark designed to test the performance of analytical database systems using real-world web analytics data patterns. Originally developed by ClickHouse, it has become a standard for evaluating OLAP database performance across different systems and architectures.
The benchmark uses a single flat table with web analytics data containing approximately 100 columns and tests 43 different query patterns that are representative of real-world analytics workloads. ClickBench is particularly valuable for testing columnar databases, OLAP engines, and systems designed for high-performance analytics.
Key Features¶
43 analytical queries covering diverse performance patterns
Single flat table design - Emphasizes column store performance
Real-world data patterns - Based on actual web analytics data
Comprehensive column coverage - ~100 columns testing various data types
Performance-focused - Designed for precise timing comparisons
Cross-system compatibility - Standard benchmark across different databases
Scalable testing - Configurable dataset sizes for different scenarios
Schema Description¶
ClickBench uses a single table called HITS that represents web analytics data, with comprehensive coverage of user interactions, browser information, geographic data, and various event attributes.
HITS Table Structure¶
The HITS table contains ~100 columns covering all aspects of web analytics:
Core Event Data¶
| Column | Type | Description |
|---|---|---|
| WatchID | BIGINT | Unique event identifier |
| EventTime | TIMESTAMP | Event timestamp |
| EventDate | DATE | Event date |
| CounterID | INTEGER | Analytics counter ID |
| UserID | BIGINT | User identifier |
User and Session Information¶
| Column | Type | Description |
|---|---|---|
| UserAgent | SMALLINT | Browser user agent ID |
| ClientIP | INTEGER | Client IP address |
| RegionID | INTEGER | Geographic region ID |
| Age | SMALLINT | User age |
| Sex | SMALLINT | User gender |
| Income | SMALLINT | User income bracket |
Browser and Device Data¶
| Column | Type | Description |
|---|---|---|
| ResolutionWidth | SMALLINT | Screen resolution width |
| ResolutionHeight | SMALLINT | Screen resolution height |
| IsMobile | SMALLINT | Mobile device flag |
| MobilePhone | SMALLINT | Mobile phone type |
| MobilePhoneModel | VARCHAR(255) | Mobile phone model |
| JavaEnable | SMALLINT | Java enabled flag |
| JavascriptEnable | SMALLINT | JavaScript enabled flag |
| CookieEnable | SMALLINT | Cookies enabled flag |
Search and Advertising Data¶
| Column | Type | Description |
|---|---|---|
| SearchEngineID | SMALLINT | Search engine identifier |
| SearchPhrase | VARCHAR(1024) | Search query phrase |
| AdvEngineID | SMALLINT | Advertising engine ID |
Technical and Performance Data¶
| Column | Type | Description |
|---|---|---|
| SendTiming | INTEGER | Send timing in milliseconds |
| DNSTiming | INTEGER | DNS timing in milliseconds |
| ConnectTiming | INTEGER | Connection timing |
| ResponseStartTiming | INTEGER | Response start timing |
| ResponseEndTiming | INTEGER | Response end timing |
| FetchTiming | INTEGER | Fetch timing |
Data Characteristics¶
The HITS table contains realistic web analytics distributions (a quick verification sketch follows this list):
Temporal patterns: Realistic time-of-day and seasonal usage patterns
Geographic distribution: Global user base with regional concentrations
Device diversity: Mixed desktop, mobile, and tablet usage
Search patterns: Real search query distributions and patterns
User behavior: Realistic session lengths, bounce rates, and engagement metrics
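A quick way to sanity-check these distributions after generating data is to load the CSV into DuckDB and run a couple of profiling queries. This is a minimal sketch, assuming the default clickbench_data/hits.csv output path and that EventTime is auto-detected as a timestamp:
import duckdb
# Load the generated CSV into an in-memory DuckDB table for profiling.
conn = duckdb.connect()
conn.execute("CREATE TABLE hits AS SELECT * FROM read_csv('clickbench_data/hits.csv', header=true, auto_detect=true)")
# Temporal pattern: events per hour of day.
print(conn.execute("SELECT EXTRACT(hour FROM EventTime) AS hour, COUNT(*) AS events FROM hits GROUP BY hour ORDER BY hour").fetchall())
# Device mix: share of mobile vs. non-mobile traffic.
print(conn.execute("SELECT IsMobile, COUNT(*) AS events, ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct FROM hits GROUP BY IsMobile").fetchall())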
Query Characteristics¶
ClickBench includes 43 queries that test different aspects of analytical performance:
Basic Aggregation Queries (Q1-Q7)¶
Q1: Simple Count
SELECT COUNT(*) FROM hits;
Purpose: Test basic table scanning performance
Performance focus: Sequential scan optimization
Q2: Filtered Count
SELECT COUNT(*) FROM hits WHERE AdvEngineID <> 0;
Purpose: Test filtering with aggregation
Performance focus: Predicate evaluation during scan
Q3: Multiple Aggregations
SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits;
Purpose: Test multiple aggregation functions
Performance focus: Aggregation pipeline efficiency
Q5: Distinct Count
SELECT COUNT(DISTINCT UserID) FROM hits;
Purpose: Test cardinality estimation and distinct operations
Performance focus: Hash table efficiency for high cardinality
Grouping and Ordering Queries (Q8-Q19)¶
Q8: Basic GROUP BY
SELECT AdvEngineID, COUNT(*) FROM hits
WHERE AdvEngineID <> 0
GROUP BY AdvEngineID
ORDER BY COUNT(*) DESC;
Purpose: Test basic grouping with ordering
Performance focus: Hash aggregation and sorting
Q9: Distinct Users by Region
SELECT RegionID, COUNT(DISTINCT UserID) AS u FROM hits
GROUP BY RegionID
ORDER BY u DESC
LIMIT 10;
Purpose: Test complex aggregation with TOP-N
Performance focus: Distinct counting within groups
Q13: Search Phrase Analysis
SELECT SearchPhrase, COUNT(*) AS c FROM hits
WHERE SearchPhrase <> ''
GROUP BY SearchPhrase
ORDER BY c DESC
LIMIT 10;
Purpose: Test string grouping with high cardinality
Performance focus: String hash performance
String Operations Queries (Q20-Q29)¶
Q21: String Pattern Matching
SELECT COUNT(*) FROM hits WHERE URL LIKE '%google%';
Purpose: Test string pattern matching performance
Performance focus: Text search optimization
Q22: Complex String Analysis
SELECT SearchPhrase, MIN(URL), COUNT(*) AS c FROM hits
WHERE URL LIKE '%google%' AND SearchPhrase <> ''
GROUP BY SearchPhrase
ORDER BY c DESC
LIMIT 10;
Purpose: Test combined string operations with grouping
Performance focus: String processing with aggregation
Q29: Regex Operations
SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k,
AVG(LENGTH(Referer)) AS l,
COUNT(*) AS c,
MIN(Referer)
FROM hits
WHERE Referer <> ''
GROUP BY k
HAVING COUNT(*) > 100000
ORDER BY l DESC
LIMIT 25;
Purpose: Test complex regex and string functions
Performance focus: Regex engine performance
Complex Analytics Queries (Q30-Q43)¶
Q30: Wide Aggregation
SELECT SUM(ResolutionWidth), SUM(ResolutionWidth + 1), ..., SUM(ResolutionWidth + 89)
FROM hits;
Purpose: Test wide query processing (90 aggregation expressions)
Performance focus: Expression evaluation and vectorization
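The elided column list can be generated programmatically rather than typed out. A small sketch of how the 90 SUM expressions expand (the exact SQL text emitted by the benchmark may differ in whitespace):
# Expand the elided list: SUM(ResolutionWidth), SUM(ResolutionWidth + 1), ..., SUM(ResolutionWidth + 89)
sums = ["SUM(ResolutionWidth)"] + [f"SUM(ResolutionWidth + {i})" for i in range(1, 90)]
wide_query = f"SELECT {', '.join(sums)} FROM hits;"
print(len(sums), "aggregation expressions")  # 90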
Q37: Time Series Analysis
SELECT SUM(ResolutionWidth),
SUM(ResolutionWidth + 1),
SUM(ResolutionWidth + 2)
FROM hits
GROUP BY AdvEngineID, DATE_TRUNC('minute', EventTime)
ORDER BY SUM(ResolutionWidth) DESC
LIMIT 10;
Purpose: Test temporal grouping with expressions
Performance focus: Time-based aggregation optimization
Q43: Advanced Analytics
SELECT CounterID,
AVG(LENGTH(URL)) AS l,
COUNT(*) AS c
FROM hits
WHERE URL <> ''
GROUP BY CounterID
HAVING COUNT(*) > 100000
ORDER BY l DESC
LIMIT 25;
Purpose: Test complex analytical patterns with HAVING
Performance focus: Two-phase aggregation optimization
Usage Examples¶
Basic Benchmark Setup¶
from benchbox import ClickBench
# Initialize ClickBench benchmark
clickbench = ClickBench(scale_factor=1.0, output_dir="clickbench_data")
# Generate web analytics data
data_files = clickbench.generate_data()
# Get all 43 benchmark queries
queries = clickbench.get_queries()
print(f"Generated {len(queries)} ClickBench queries")
# Get specific query
count_query = clickbench.get_query("Q1")
print(count_query)
Data Generation and Loading¶
# Generate ClickBench data at different scales
clickbench_small = ClickBench(scale_factor=0.01, output_dir="clickbench_small")
data_files = clickbench_small.generate_data()
# Check generated data
hits_file = clickbench_small.output_dir / "hits.csv"
size_mb = hits_file.stat().st_size / (1024 * 1024)
print(f"Generated hits data: {size_mb:.2f} MB")
# Get schema information
schema = clickbench_small.get_schema()
print(f"HITS table has {len(schema[0]['columns'])} columns")
DuckDB Integration Example¶
import duckdb
from benchbox import ClickBench
import time
# Initialize and generate data
clickbench = ClickBench(scale_factor=0.01, output_dir="clickbench_tiny")
data_files = clickbench.generate_data()
# Create DuckDB connection and schema
conn = duckdb.connect("clickbench.duckdb")
schema_sql = clickbench.get_create_tables_sql()
conn.execute(schema_sql)
# Load hits data
hits_file = clickbench.output_dir / "hits.csv"
print(f"Loading data from {hits_file}...")
load_start = time.time()
conn.execute(f"""
INSERT INTO hits
SELECT * FROM read_csv('{hits_file}',
header=true,
auto_detect=true,
ignore_errors=true)
""")
load_time = time.time() - load_start
row_count = conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
print(f"Loaded {row_count:,} rows in {load_time:.2f} seconds")
# Run ClickBench query categories
query_categories = {
'basic_agg': ['Q1', 'Q2', 'Q3', 'Q4', 'Q5'],
'grouping': ['Q8', 'Q9', 'Q10', 'Q13', 'Q16'],
'string_ops': ['Q21', 'Q22', 'Q25', 'Q26'],
'complex': ['Q28', 'Q31', 'Q37', 'Q43']
}
results = {}
for category, query_ids in query_categories.items():
print(f"\\nRunning {category} queries...")
category_results = {}
for query_id in query_ids:
try:
query_sql = clickbench.get_query(query_id)
start_time = time.time()
result = conn.execute(query_sql).fetchall()
execution_time = time.time() - start_time
category_results[query_id] = {
'time': execution_time,
'rows': len(result)
}
print(f" {query_id}: {execution_time:.3f}s ({len(result)} rows)")
except Exception as e:
category_results[query_id] = {
'error': str(e)
}
print(f" {query_id}: ERROR - {str(e)[:50]}...")
results[category] = category_results
conn.close()
# Print summary
total_time = sum(
r['time'] for cat_results in results.values()
for r in cat_results.values()
if 'time' in r
)
total_queries = sum(
1 for cat_results in results.values()
for r in cat_results.values()
if 'time' in r
)
print(f"\\nSummary: {total_queries} queries executed in {total_time:.2f}s")
Performance Analysis Framework¶
import time
from typing import Dict, List
from statistics import mean, median
class ClickBenchPerformanceTester:
def __init__(self, clickbench: ClickBench, connection):
self.clickbench = clickbench
self.connection = connection
def benchmark_query_category(self, category: str, iterations: int = 3) -> Dict:
"""Benchmark specific ClickBench query categories."""
categories = {
'scan': ['Q1', 'Q2', 'Q7'],
'aggregation': ['Q3', 'Q4', 'Q5', 'Q6'],
'grouping': ['Q8', 'Q9', 'Q10', 'Q11', 'Q12'],
'string': ['Q13', 'Q14', 'Q21', 'Q22', 'Q23'],
'complex': ['Q28', 'Q29', 'Q30', 'Q31'],
'analytics': ['Q37', 'Q40', 'Q41', 'Q42', 'Q43']
}
if category not in categories:
raise ValueError(f"Invalid category: {category}")
query_ids = categories[category]
results = {}
for query_id in query_ids:
print(f"Benchmarking {query_id} ({category})...")
times = []
for iteration in range(iterations):
query_sql = self.clickbench.get_query(query_id)
start_time = time.time()
try:
result = self.connection.execute(query_sql).fetchall()
execution_time = time.time() - start_time
times.append(execution_time)
except Exception as e:
print(f" Error in iteration {iteration + 1}: {e}")
continue
if times:
results[query_id] = {
'category': category,
'avg_time': mean(times),
'median_time': median(times),
'min_time': min(times),
'max_time': max(times),
'rows_returned': len(result) if 'result' in locals() else 0,
'times': times,
'success_rate': len(times) / iterations
}
print(f" Average: {mean(times):.3f}s, Median: {median(times):.3f}s")
else:
results[query_id] = {
'category': category,
'error': 'All iterations failed'
}
return results
def run_complete_benchmark(self) -> Dict:
"""Run all ClickBench categories and return systematic results."""
complete_results = {}
# Test each category
categories = ['scan', 'aggregation', 'grouping', 'string', 'complex', 'analytics']
for category in categories:
print(f"\\nRunning {category.upper()} queries...")
try:
category_results = self.benchmark_query_category(category)
complete_results[category] = category_results
except Exception as e:
print(f"Error in {category} category: {e}")
complete_results[category] = {'error': str(e)}
# Calculate summary statistics
all_times = []
successful_queries = 0
total_queries = 0
for category_data in complete_results.values():
if isinstance(category_data, dict) and 'error' not in category_data:
for query_data in category_data.values():
total_queries += 1
if isinstance(query_data, dict) and 'times' in query_data:
all_times.extend(query_data['times'])
successful_queries += 1
if all_times:
complete_results['summary'] = {
'total_queries': total_queries,
'successful_queries': successful_queries,
'success_rate': successful_queries / total_queries,
'total_avg_time': mean(all_times),
'total_median_time': median(all_times),
'total_min_time': min(all_times),
'total_max_time': max(all_times),
'geomean_time': self._geometric_mean(all_times)
}
return complete_results
def _geometric_mean(self, values: List[float]) -> float:
"""Calculate geometric mean of execution times."""
if not values:
return 0.0
product = 1.0
for value in values:
product *= value
return product ** (1.0 / len(values))
def analyze_column_performance(self) -> Dict:
"""Analyze performance across different column types and operations."""
column_tests = [
('integer_scan', 'SELECT COUNT(*) FROM hits WHERE RegionID > 1000'),
('string_scan', "SELECT COUNT(*) FROM hits WHERE SearchPhrase LIKE '%google%'"),
('timestamp_scan', "SELECT COUNT(*) FROM hits WHERE EventTime > '2013-07-01'"),
('integer_agg', 'SELECT RegionID, COUNT(*) FROM hits GROUP BY RegionID LIMIT 10'),
('string_agg', "SELECT SearchPhrase, COUNT(*) FROM hits WHERE SearchPhrase <> '' GROUP BY SearchPhrase LIMIT 10"),
('mixed_agg', "SELECT RegionID, SearchPhrase, COUNT(*) FROM hits WHERE SearchPhrase <> '' GROUP BY RegionID, SearchPhrase LIMIT 10")
]
results = {}
for test_name, query_sql in column_tests:
print(f"Testing {test_name}...")
times = []
for iteration in range(3):
try:
start_time = time.time()
result = self.connection.execute(query_sql).fetchall()
execution_time = time.time() - start_time
times.append(execution_time)
except Exception as e:
print(f" Error: {e}")
break
if times:
results[test_name] = {
'avg_time': mean(times),
'rows_returned': len(result) if 'result' in locals() else 0,
'times': times
}
else:
results[test_name] = {'error': 'Failed to execute'}
return results
# Usage
performance_tester = ClickBenchPerformanceTester(clickbench, conn)
# Test individual categories
scan_results = performance_tester.benchmark_query_category('scan')
grouping_results = performance_tester.benchmark_query_category('grouping')
print("\\nCategory Performance Summary:")
print(f"Scan Queries: {scan_results}")
print(f"Grouping Queries: {grouping_results}")
# Run complete benchmark
complete_results = performance_tester.run_complete_benchmark()
print(f"\\nComplete Benchmark Summary: {complete_results.get('summary', 'No summary available')}")
# Analyze column performance
column_results = performance_tester.analyze_column_performance()
print(f"\\nColumn Performance Analysis: {column_results}")
ClickHouse Native Integration¶
import time
import clickhouse_connect
from clickhouse_connect.driver.tools import insert_file
from benchbox import ClickBench
# Connect to a local ClickHouse server
client = clickhouse_connect.get_client(host='localhost', port=8123)
clickbench = ClickBench(scale_factor=1.0, output_dir="clickbench_data")
# Generate data
data_files = clickbench.generate_data()
# Create ClickHouse table with appropriate settings
create_table_sql = """
CREATE TABLE hits (
WatchID UInt64,
JavaEnable UInt8,
Title String,
GoodEvent UInt8,
EventTime DateTime,
EventDate Date,
CounterID UInt32,
ClientIP UInt32,
RegionID UInt32,
UserID UInt64,
CounterClass UInt8,
OS UInt8,
UserAgent UInt8,
URL String,
Referer String,
IsRefresh UInt8,
RefererCategoryID UInt16,
RefererRegionID UInt32,
URLCategoryID UInt16,
URLRegionID UInt32,
ResolutionWidth UInt16,
ResolutionHeight UInt16,
ResolutionDepth UInt8,
FlashMajor UInt8,
FlashMinor UInt8,
FlashMinor2 String,
NetMajor UInt8,
NetMinor UInt8,
UserAgentMajor UInt16,
UserAgentMinor String,
CookieEnable UInt8,
JavascriptEnable UInt8,
IsMobile UInt8,
MobilePhone UInt8,
MobilePhoneModel String,
Params String,
IPNetworkID UInt32,
TraficSourceID Int8,
SearchEngineID UInt16,
SearchPhrase String,
AdvEngineID UInt8,
IsArtifical UInt8,
WindowClientWidth UInt16,
WindowClientHeight UInt16,
ClientTimeZone Int16,
ClientEventTime DateTime,
SilverlightVersion1 UInt8,
SilverlightVersion2 UInt8,
SilverlightVersion3 UInt32,
SilverlightVersion4 UInt16,
PageCharset String,
CodeVersion UInt32,
IsLink UInt8,
IsDownload UInt8,
IsNotBounce UInt8,
FUniqID UInt64,
OriginalURL String,
HID UInt32,
IsOldCounter UInt8,
IsEvent UInt8,
IsParameter UInt8,
DontCountHits UInt8,
WithHash UInt8,
HitColor FixedString(1),
LocalEventTime DateTime,
Age UInt8,
Sex UInt8,
Income UInt8,
Interests UInt16,
Robotness UInt8,
RemoteIP UInt32,
WindowName Int32,
OpenerName Int32,
HistoryLength Int16,
BrowserLanguage FixedString(2),
BrowserCountry FixedString(2),
SocialNetwork String,
SocialAction String,
HTTPError UInt8,
SendTiming UInt32,
DNSTiming UInt32,
ConnectTiming UInt32,
ResponseStartTiming UInt32,
ResponseEndTiming UInt32,
FetchTiming UInt32,
SocialSourceNetworkID UInt8,
SocialSourcePage String,
ParamPrice Int64,
ParamOrderID String,
ParamCurrency FixedString(3),
ParamCurrencyID UInt16,
OpenstatServiceName String,
OpenstatCampaignID String,
OpenstatAdID String,
OpenstatSourceID String,
UTMSource String,
UTMMedium String,
UTMCampaign String,
UTMContent String,
UTMTerm String,
FromTag String,
HasGCLID UInt8,
RefererHash UInt64,
URLHash UInt64,
CLID UInt32
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID);
"""
client.command(create_table_sql)
# Load data using clickhouse_connect's CSV file insert helper
hits_file = clickbench.output_dir / "hits.csv"
insert_file(client, 'hits', str(hits_file), fmt='CSV')
# Run all 43 ClickBench queries
print("Running all 43 ClickBench queries on ClickHouse...")
query_results = {}
for i in range(1, 44):
query_id = f"Q{i}"
try:
query_sql = clickbench.get_query(query_id)
start_time = time.time()
result = client.query(query_sql)
execution_time = time.time() - start_time
query_results[query_id] = {
'time': execution_time,
'rows': len(result.result_rows)
}
print(f"{query_id}: {execution_time:.3f}s ({len(result.result_rows)} rows)")
except Exception as e:
query_results[query_id] = {'error': str(e)}
print(f"{query_id}: ERROR - {str(e)[:50]}...")
# Calculate performance statistics
successful_queries = [r for r in query_results.values() if 'time' in r]
if successful_queries:
total_time = sum(r['time'] for r in successful_queries)
avg_time = total_time / len(successful_queries)
print(f"\\nClickBench Results Summary:")
print(f"Successful queries: {len(successful_queries)}/43")
print(f"Total execution time: {total_time:.2f}s")
print(f"Average query time: {avg_time:.3f}s")
Performance Characteristics¶
Query Performance Patterns¶
Scan Queries (Q1, Q2, Q7):
Primary bottleneck: Sequential scan speed and basic filtering
Optimization targets: Column scanning, SIMD operations, predicate pushdown
Typical performance: Milliseconds to seconds
Scaling characteristics: Linear with data size
Aggregation Queries (Q3-Q6):
Primary bottleneck: Aggregation function computation
Optimization targets: Vectorized aggregation, parallel execution
Typical performance: Seconds for large datasets
Scaling characteristics: Sub-linear with good optimization
Grouping Queries (Q8-Q19):
Primary bottleneck: Hash table construction and group management
Optimization targets: Hash aggregation algorithms, memory efficiency
Typical performance: Seconds to tens of seconds
Scaling characteristics: Depends on cardinality and memory
String Operations (Q20-Q29):
Primary bottleneck: String processing and pattern matching
Optimization targets: String function optimization, regex engines
Typical performance: Highly variable, can be slow
Scaling characteristics: Often super-linear for complex operations
Complex Analytics (Q30-Q43):
Primary bottleneck: Multiple complex operations combined
Optimization targets: Query plan optimization, vectorization
Typical performance: Minutes for large datasets
Scaling characteristics: Highly dependent on optimization quality
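To relate raw timings back to these categories, the per-query results collected in the DuckDB example above can be rolled up per category. A minimal sketch, assuming the same results dictionary structure ({category: {query_id: {'time': ...}}}):
from statistics import mean
def summarize_by_category(results):
    # Roll up per-query timings into per-category statistics.
    summary = {}
    for category, queries in results.items():
        times = [r['time'] for r in queries.values() if 'time' in r]
        if times:
            summary[category] = {'queries': len(times), 'avg_s': mean(times), 'slowest_s': max(times)}
    return summary
for category, stats in summarize_by_category(results).items():
    print(f"{category}: {stats['queries']} queries, avg {stats['avg_s']:.3f}s, slowest {stats['slowest_s']:.3f}s")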
System Optimization Opportunities¶
| System Type | Scan Performance | Aggregation | Grouping | String Ops | Complex Analytics |
|---|---|---|---|---|---|
| Columnar | Column scanning | Vectorized ops | Column-wise hash | String columns | Vector processing |
| Row-based | Index usage | Row aggregation | Row-wise groups | String indexes | Traditional optimization |
| In-Memory | Memory bandwidth | Fast aggregation | In-memory hash | String caching | Memory-optimized execution |
| GPU | Parallel scanning | GPU aggregation | GPU hash tables | Limited support | Mixed CPU/GPU |
Configuration Options¶
Scale Factor Guidelines¶
| Scale Factor | Rows | Data Size | Query Times | Use Case |
|---|---|---|---|---|
| 0.001 | ~100K | ~10 MB | Milliseconds | Development |
| 0.01 | ~1M | ~100 MB | Seconds | Testing |
| 0.1 | ~10M | ~1 GB | 10s-60s | Integration |
| 1.0 | ~100M | ~10 GB | Minutes | Standard benchmark |
| 10.0 | ~1B | ~100 GB | Hours | Large-scale testing |
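A small helper for picking a scale factor from the guideline table above; the sizes are rough estimates, so treat this as a starting point rather than a guarantee (the mapping below is hard-coded from the table, not provided by benchbox):
# Approximate on-disk size (GB) per scale factor, taken from the table above.
APPROX_SIZE_GB = {0.001: 0.01, 0.01: 0.1, 0.1: 1.0, 1.0: 10.0, 10.0: 100.0}
def pick_scale_factor(max_data_gb: float) -> float:
    """Return the largest scale factor whose estimated data size fits the budget."""
    fitting = [sf for sf, gb in APPROX_SIZE_GB.items() if gb <= max_data_gb]
    return max(fitting) if fitting else 0.001
print(pick_scale_factor(5.0))  # 0.1 -> roughly 10M rows / 1 GB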
Advanced Configuration¶
clickbench = ClickBench(
scale_factor=1.0,
output_dir="clickbench_data",
# Data generation options
date_range_days=365, # Range of event dates
user_count=10000000, # Number of unique users
enable_compression=True, # Compress output files
# Performance options
partition_by_date=True, # Partition by EventDate
create_indices=True, # Create performance indices
optimize_for_analytics=True # Analytics-optimized generation
)
Best Practices¶
Data Generation¶
Scale appropriately - Use realistic scale factors for your testing needs
Consider compression - ClickBench data compresses well
Partition strategy - Partition by date for temporal queries
Index planning - Create indices on frequently filtered columns
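The compression and partition-strategy recommendations above can be covered in one step on DuckDB by rewriting the loaded table as compressed, date-partitioned Parquet. A sketch; adjust the target path and partition column to your layout:
import duckdb
conn = duckdb.connect("clickbench.duckdb")
# Write the hits table as ZSTD-compressed Parquet files partitioned by event date.
conn.execute("COPY hits TO 'clickbench_parquet' (FORMAT PARQUET, PARTITION_BY (EventDate), COMPRESSION 'zstd')")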
Query Optimization¶
Column store optimization - Leverage columnar storage for scans
String optimization - Optimize string operations and pattern matching
Memory management - Monitor memory usage for large aggregations
Parallel execution - Use parallel query execution where available
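On DuckDB, for instance, the thread count and memory budget can be set per connection before running the heavier grouping queries; other engines expose equivalent knobs under different names. A minimal sketch:
import duckdb
conn = duckdb.connect("clickbench.duckdb")
conn.execute("SET threads TO 8")          # parallel query execution
conn.execute("SET memory_limit = '8GB'")  # cap memory used by large aggregations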
Performance Testing¶
Multiple iterations - Run queries multiple times for stable timings
Cold vs. warm runs - Test both cold and warm query performance (see the sketch after this list)
Resource monitoring - Monitor CPU, memory, and I/O during execution
Cross-system comparison - Use ClickBench for database comparisons
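A minimal sketch of the cold-vs-warm pattern: open a fresh connection for the cold run, then repeat the query on the same connection for warm timings. A truly cold run also needs the OS page cache cleared, which is outside this sketch:
import time
import duckdb
def time_query(conn, sql):
    start = time.time()
    conn.execute(sql).fetchall()
    return time.time() - start
sql = clickbench.get_query("Q13")
cold_conn = duckdb.connect("clickbench.duckdb")   # fresh connection, no cached state
cold = time_query(cold_conn, sql)
warm = [time_query(cold_conn, sql) for _ in range(3)]  # repeated runs on the now-warm connection
print(f"cold: {cold:.3f}s, warm avg: {sum(warm) / len(warm):.3f}s")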
Common Issues and Solutions¶
Performance Issues¶
Issue: Slow string operations
-- Solution: Create appropriate indices and use optimized string functions
CREATE INDEX idx_hits_search_phrase ON hits(SearchPhrase);
CREATE INDEX idx_hits_url ON hits USING hash(URL);
-- Use database-specific string optimizations
SELECT COUNT(*) FROM hits WHERE SearchPhrase ILIKE '%google%'; -- Case-insensitive
Issue: Memory exhaustion on complex queries
-- Solution: Optimize memory usage and use incremental processing
SET max_memory_usage = 10000000000; -- 10GB limit
SET max_bytes_before_external_group_by = 5000000000; -- External sorting
-- Break down complex queries
SELECT * FROM (
SELECT RegionID, COUNT(*) as cnt FROM hits GROUP BY RegionID
) WHERE cnt > 1000;
Data Loading Issues¶
Issue: Slow data loading
# Solution: Use optimized loading strategies
clickbench = ClickBench(
scale_factor=1.0,
output_dir="clickbench_data",
compression='gzip', # Compress for faster I/O
batch_size=100000, # Optimize batch size
parallel_loading=True # Use parallel loading
)
Issue: Schema mismatches across databases
-- Solution: Use database-specific schema adaptations
-- For PostgreSQL
ALTER TABLE hits ALTER COLUMN EventTime TYPE TIMESTAMP;
-- For MySQL
ALTER TABLE hits MODIFY COLUMN Title TEXT CHARACTER SET utf8mb4;
-- For DuckDB
ALTER TABLE hits ALTER COLUMN ResolutionWidth TYPE USMALLINT;
See Also¶
Understanding BenchBox¶
Architecture Overview - Component design and data flow
Workflow Patterns - Benchmarking workflow examples
Data Model - Result schema documentation
Glossary - Benchmark terminology reference
Getting Started¶
Getting Started Guide - Run your first benchmark in 5 minutes
CLI Quick Reference - Command-line usage
Configuration Handbook - Advanced configuration
Examples - Code snippets and patterns
Platform-Specific Guides¶
Platform Selection Guide - Choosing the right database
ClickHouse Local Mode - Running ClickHouse locally
Platform Quick Reference - Platform setup guides
Performance Tuning - Optimization strategies
External Resources¶
ClickBench GitHub - Official benchmark repository
ClickBench.com Results - Cross-database performance comparisons
ClickBench Methodology - Detailed methodology
DuckDB Labs Benchmark - Updated benchmark results