Apache Doris Platform¶
Apache Doris is a high-performance real-time analytical database based on MPP (Massively Parallel Processing) architecture. Originally developed at Baidu as Palo, it graduated as an Apache Top-Level Project in 2022. BenchBox connects via the MySQL protocol using PyMySQL (port 9030) and supports Stream Load via the FE HTTP API (port 8030) for high-throughput data ingestion. SQLGlot provides official doris dialect support for SQL translation.
Apache Doris is used in production by Baidu, Xiaomi, ByteDance, JD.com, Meituan, and thousands of other enterprises globally. It delivers competitive performance against ClickHouse, StarRocks, and Trino on analytical workloads, with particularly strong support for real-time analytics and high-concurrency point queries.
Features¶
MySQL protocol connectivity - Standard MySQL wire protocol via PyMySQL (port 9030)
Stream Load HTTP API - High-throughput data ingestion via FE HTTP port 8030
Full TPC-H support - All 22 queries with row count validation
Full TPC-DS support - All 99 queries with row count validation
Official SQLGlot dialect - Native doris dialect for accurate SQL translation
Vectorized execution engine - High-performance columnar processing (Doris 2.0+)
Multiple table models - Duplicate, Aggregate, and Unique Key models
Distributed hash partitioning - Automatic data distribution across Backend nodes
Tuning support - Partitioning, sorting, and distribution tuning at table creation time
Quick Start¶
# Install PyMySQL dependency
uv add pymysql
# Or install via the Doris extra
uv add benchbox --extra doris
# Configure connection (Doris must be running)
export DORIS_HOST=localhost
export DORIS_PORT=9030
# Run TPC-H benchmark
benchbox run --platform doris --benchmark tpch --scale 0.01
Docker Quick Start¶
# Start Apache Doris with Docker (all-in-one FE + BE)
docker run -p 9030:9030 -p 8030:8030 -p 8040:8040 \
  apache/doris:doris-all-in-one-2.1
# Verify connectivity
mysql -h 127.0.0.1 -P 9030 -u root -e "SELECT 1"
# Run benchmark
benchbox run --platform doris --benchmark tpch --scale 1.0
Configuration Options¶
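| Option | CLI Argument | Environment Variable | Default | Description |
|---|---|---|---|---|
| | --doris-host | DORIS_HOST | | Doris FE node hostname |
| | --doris-port | DORIS_PORT | | MySQL protocol port (FE) |
| | --doris-username | DORIS_USER | | Database username |
| | --doris-password | DORIS_PASSWORD | | Database password |
| | --doris-database | DORIS_DATABASE | auto-generated | Target database name |
| | | DORIS_HTTP_PORT | 8030 | FE HTTP port for Stream Load |
| | --doris-use-tls | - | | Use HTTPS for Stream Load API |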
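As a rough illustration of how these settings could be combined, the sketch below merges explicit values, the DORIS_* environment variables, and fallback values. The helper name resolve_doris_config, the precedence order (explicit value, then environment, then fallback), and the fallback values themselves are assumptions made for this example; they are not confirmed BenchBox behavior.

```python
import os
from typing import Mapping, Optional

# Fallback values are assumptions for illustration only, not confirmed
# BenchBox defaults.
ASSUMED_DEFAULTS = {
    "host": "localhost",
    "port": "9030",
    "user": "root",
    "password": "",
    "http_port": "8030",
}

# Environment variable names as they appear in this guide.
ENV_NAMES = {
    "host": "DORIS_HOST",
    "port": "DORIS_PORT",
    "user": "DORIS_USER",
    "password": "DORIS_PASSWORD",
    "database": "DORIS_DATABASE",
    "http_port": "DORIS_HTTP_PORT",
}


def resolve_doris_config(
    overrides: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Merge explicit overrides, environment variables, and assumed fallbacks."""
    env = os.environ if env is None else env
    overrides = overrides or {}
    config = {}
    for key, env_name in ENV_NAMES.items():
        value = overrides.get(key) or env.get(env_name) or ASSUMED_DEFAULTS.get(key)
        if value is not None:
            config[key] = value
    # Ports arrive as strings from the environment; convert them once here.
    for key in ("port", "http_port"):
        config[key] = int(config[key])
    return config
```

The database key is left out when nothing supplies it, which mirrors the auto-generated database name described in the table.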
Data Loading¶
BenchBox loads data into Apache Doris using the Stream Load HTTP API when the requests library is available. When requests is not installed, it falls back to batch INSERT statements via the MySQL protocol. The adapter handles both TPC pipe-delimited (.tbl) and standard CSV formats automatically.
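TPC data generators end every row of a .tbl file with a trailing delimiter, which would otherwise be read as an extra empty column. The following is a minimal sketch of normalizing such input before loading; the function names are hypothetical and this is not the adapter's actual code.

```python
import csv
import io
from typing import Iterable, Iterator, List


def strip_trailing_delimiter(lines: Iterable[str], delimiter: str = "|") -> Iterator[str]:
    """Yield each line without its line ending and without one trailing delimiter."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.endswith(delimiter):
            line = line[: -len(delimiter)]
        yield line


def parse_tbl(text: str, delimiter: str = "|") -> List[List[str]]:
    """Parse TPC-style .tbl text into rows of string fields."""
    cleaned = strip_trailing_delimiter(io.StringIO(text), delimiter)
    return list(csv.reader(cleaned, delimiter=delimiter))
```

Only one trailing delimiter is removed per line, so a genuinely empty last column (two delimiters in a row at the end) is preserved.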
Loading Process¶
Schema creation - Tables are created with Doris-optimized DDL (Duplicate Key model, hash distribution)
Type conversion - DuckDB/standard SQL types are translated via the SQLGlot doris dialect
Stream Load (primary) - Data files are sent via HTTP PUT to the FE Stream Load endpoint on port 8030
INSERT fallback - If requests is not installed, data is loaded in batches of 1,000 rows using parameterized INSERT INTO ... VALUES statements
Constraint handling - Foreign keys are removed (Doris does not enforce them); primary keys map to Doris key models
Stream Load (HTTP API)¶
Apache Doris provides a high-throughput Stream Load API on the FE HTTP port (default 8030). BenchBox uses this as the primary data loading method:
Endpoint: http://<fe_host>:8030/api/<database>/<table>/_stream_load
Protocol: HTTP PUT with CSV payload and 100-continue header
Authentication: HTTP Basic Auth using Doris credentials
TPC format handling: Trailing delimiters are automatically stripped before loading
TLS support: Enable --doris-use-tls to use HTTPS for encrypted Stream Load transfers
Throughput: Significantly higher than row-by-row INSERT for large datasets
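The request described above can be sketched as a function that assembles the URL and headers without sending anything. The function name and the label format are hypothetical; format, column_separator, and label are standard Stream Load headers, but the exact set BenchBox sends is an assumption here.

```python
import base64
import uuid
from typing import Dict, Tuple


def build_stream_load_request(
    host: str,
    database: str,
    table: str,
    user: str,
    password: str,
    http_port: int = 8030,
    use_tls: bool = False,
    column_separator: str = "|",
) -> Tuple[str, Dict[str, str]]:
    """Return the Stream Load URL and headers for one table (nothing is sent)."""
    scheme = "https" if use_tls else "http"
    url = f"{scheme}://{host}:{http_port}/api/{database}/{table}/_stream_load"
    # HTTP Basic Auth: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Expect": "100-continue",
        "format": "csv",
        "column_separator": column_separator,
        # A unique label per load; the naming scheme is an assumption.
        "label": f"benchbox_{table}_{uuid.uuid4().hex}",
    }
    return url, headers
```

With the requests library, the result could be passed to requests.put(url, headers=headers, data=payload). The FE commonly redirects Stream Load requests to a BE node, so the HTTP client may need to follow the redirect and resend the credentials.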
INSERT Fallback¶
When the requests library is not available, the adapter falls back to batch INSERT:
Rows are loaded in batches of 1,000 using executemany()
TPC pipe-delimited format is handled automatically
Suitable for small to medium datasets (up to SF 1)
For larger datasets, install requests to enable Stream Load
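The batching described above can be sketched as follows. This is an illustration under assumptions, not the adapter's implementation: the helper names are hypothetical, and the cursor is any DB-API cursor with executemany(), such as one from PyMySQL, which uses %s placeholders.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence


def insert_statement(table: str, columns: Sequence[str]) -> str:
    """Build a parameterized INSERT with one %s placeholder per column."""
    placeholders = ", ".join(["%s"] * len(columns))
    column_list = ", ".join(columns)
    return f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"


def batched(rows: Iterable[Sequence], batch_size: int = 1000) -> Iterator[List[Sequence]]:
    """Yield lists of at most batch_size rows until the input is exhausted."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def load_rows(cursor, table: str, columns: Sequence[str],
              rows: Iterable[Sequence], batch_size: int = 1000) -> int:
    """Send rows to the cursor in batches and return the number of rows sent."""
    sql = insert_statement(table, columns)
    total = 0
    for batch in batched(rows, batch_size):
        cursor.executemany(sql, batch)
        total += len(batch)
    return total
```

Because batched() consumes an iterator lazily, a large data file can be streamed through load_rows() without holding every row in memory.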
Table Models¶
Apache Doris supports three table models, each optimized for different workloads. BenchBox uses the Duplicate Key model by default for benchmark tables, as it preserves all rows without deduplication.
| Model | Use Case | Key Behavior | BenchBox Usage |
|---|---|---|---|
| Duplicate Key | Analytics, logs | All rows preserved, no deduplication | Default for benchmarks |
| Aggregate | Pre-aggregation | Rows with same key are merged by aggregate functions | Not used |
| Unique Key | Dimension tables, real-time updates | Last write wins for same key (merge-on-read or merge-on-write) | Not used |
The Duplicate Key model is optimal for TPC-H and TPC-DS workloads because:
No deduplication overhead during data loading
All original rows are preserved for accurate query results
Sort keys can be specified for scan optimization
Usage Examples¶
Basic Benchmarks¶
# TPC-H at scale factor 1
benchbox run --platform doris --benchmark tpch --scale 1.0
# TPC-DS at scale factor 10
benchbox run --platform doris --benchmark tpcds --scale 10.0
# Run specific queries only
benchbox run --platform doris --benchmark tpch --queries Q1,Q6,Q17
Environment Variable Configuration¶
export DORIS_HOST=doris-fe.example.com
export DORIS_PORT=9030
export DORIS_USER=benchbox
export DORIS_PASSWORD=secret
export DORIS_DATABASE=benchmark_db
export DORIS_HTTP_PORT=8030
benchbox run --platform doris --benchmark tpch --scale 10.0
CLI Argument Configuration¶
benchbox run --platform doris --benchmark tpch --scale 1.0 \
  --doris-host doris-fe.example.com \
  --doris-port 9030 \
  --doris-username benchbox \
  --doris-password secret \
  --doris-database my_benchmarks
Dry Run (Preview)¶
# Preview execution plan without running
benchbox run --platform doris --benchmark tpch --scale 1.0 --dry-run ./preview
TLS-Encrypted Stream Load¶
# Use HTTPS for Stream Load API (e.g., managed cloud deployments)
benchbox run --platform doris --benchmark tpch --scale 1.0 \
  --doris-use-tls
Architecture¶
Adapter Structure¶
The Doris adapter is a single-file implementation:
| Module | Class | Responsibility |
|---|---|---|
| | DorisAdapter | Connection management, schema creation, data loading, query execution, tuning |
Connection Model¶
BenchBox CLI
    |
    v
DorisAdapter
    |
    +-- MySQL Protocol (PyMySQL) --> FE (port 9030)
    |     - Schema DDL
    |     - Query execution
    |     - Batch INSERT fallback
    |
    +-- HTTP API (Stream Load) --> FE (port 8030)
          - High-throughput CSV data ingestion
          - HTTP PUT with Basic Auth
Doris Cluster Ports¶
| Port | Service | Protocol | Purpose |
|---|---|---|---|
| 9030 | FE MySQL | TCP | SQL queries, DDL, DML via MySQL protocol |
| 8030 | FE HTTP | HTTP/HTTPS | Stream Load API, web UI, REST API |
| 8040 | BE HTTP | HTTP | BE web server, internal data transfer |
| 9010 | FE Edit Log | TCP | FE metadata replication (internal) |
Platform Information¶
At runtime, BenchBox captures platform metadata:
{
  "platform_type": "doris",
  "platform_name": "Apache Doris",
  "configuration": {
    "host": "localhost",
    "port": 9030,
    "database": "benchbox_tpch_sf1",
    "http_port": 8030,
    "stream_load_available": true
  },
  "platform_version": "2.1.x",
  "client_library_version": "1.x.x"
}
Tuning and Optimization¶
Automatic Benchmark Configuration¶
The adapter automatically applies session-level optimizations when running benchmarks:
SQL cache: Disabled (enable_sql_cache = false) for accurate timing
Parallel execution: Set to 8 fragment instances (parallel_fragment_exec_instance_num = 8)
Memory limit: Set to 8 GB for OLAP workloads (exec_mem_limit = 8589934592)
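The three settings above correspond to ordinary session-level SET statements. The sketch below renders them from a dictionary so they can be executed one by one on a connection. The constant and function names are hypothetical, and how the adapter actually issues the statements is an assumption.

```python
from typing import Dict, List

# Values taken from the list above; 8 * 1024**3 bytes equals 8589934592.
BENCHMARK_SESSION_SETTINGS: Dict[str, object] = {
    "enable_sql_cache": False,
    "parallel_fragment_exec_instance_num": 8,
    "exec_mem_limit": 8 * 1024**3,
}


def session_statements(settings: Dict[str, object]) -> List[str]:
    """Render each setting as a SET statement, with booleans as true/false."""
    statements = []
    for name, value in settings.items():
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        else:
            rendered = str(value)
        statements.append(f"SET {name} = {rendered}")
    return statements
```

Each returned string can be passed to cursor.execute() right after connecting, before any timed query runs.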
Supported Tuning Types¶
| Tuning Type | Support | Notes |
|---|---|---|
| Partitioning | Yes | |
| Sorting | Yes | Sort keys via Duplicate Key model key ordering |
| Distribution | Yes | |
| Clustering | No | No CLUSTER command |
| Primary Keys | Yes | Unique Key table model |
| Foreign Keys | No | No FK enforcement in Doris |
Distribution Keys¶
Choosing effective distribution keys is critical for Doris query performance:
-- Hash distribution on frequently joined columns
CREATE TABLE lineitem (
    l_orderkey BIGINT,
    l_partkey BIGINT,
    ...
) DUPLICATE KEY(l_orderkey)
DISTRIBUTED BY HASH(l_orderkey) BUCKETS 16;
Guidelines:
Use high-cardinality columns for even data distribution
Align distribution keys with common join predicates
Start with 8-16 buckets and adjust based on data volume
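To experiment with different key and bucket choices, the DDL can be generated from a column list. This is a minimal sketch with a hypothetical helper name, not the adapter's DDL generator; it assumes Doris's rule that key columns must be the leading columns of the table, in order.

```python
from typing import Sequence, Tuple


def duplicate_key_ddl(
    table: str,
    columns: Sequence[Tuple[str, str]],
    key_columns: Sequence[str],
    distribution_columns: Sequence[str],
    buckets: int = 16,
) -> str:
    """Build a Duplicate Key CREATE TABLE statement with hash distribution."""
    names = [name for name, _ in columns]
    # Key columns have to be a prefix of the column list, in the same order.
    if not key_columns or list(key_columns) != names[: len(key_columns)]:
        raise ValueError("key columns must be the leading columns, in order")
    if not distribution_columns or any(c not in names for c in distribution_columns):
        raise ValueError("distribution columns must exist in the table")
    if buckets < 1:
        raise ValueError("bucket count must be positive")
    column_sql = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table} (\n{column_sql}\n)\n"
        f"DUPLICATE KEY({', '.join(key_columns)})\n"
        f"DISTRIBUTED BY HASH({', '.join(distribution_columns)}) BUCKETS {buckets};"
    )
```

Calling it with the lineitem columns, l_orderkey as both key and distribution column, and the default of 16 buckets produces a statement of the same shape as the example above.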
Bloom Filter and Bitmap Indexes¶
Doris supports secondary indexes for accelerating point queries and filter predicates:
-- Bloom filter index for high-cardinality columns
ALTER TABLE lineitem SET ("bloom_filter_columns" = "l_orderkey, l_partkey");
-- Bitmap index for low-cardinality columns
CREATE INDEX idx_shipmode ON lineitem (l_shipmode) USING BITMAP;
Colocate Join Groups¶
For frequently joined tables, colocate groups ensure data locality:
-- Create colocate group for TPC-H tables
CREATE TABLE orders (
    o_orderkey BIGINT,
    ...
) DUPLICATE KEY(o_orderkey)
DISTRIBUTED BY HASH(o_orderkey) BUCKETS 16
PROPERTIES ("colocate_with" = "tpch_group");
CREATE TABLE lineitem (
    l_orderkey BIGINT,
    ...
) DUPLICATE KEY(l_orderkey)
DISTRIBUTED BY HASH(l_orderkey) BUCKETS 16
PROPERTIES ("colocate_with" = "tpch_group");
Colocate joins eliminate data shuffle across BE nodes for join operations on aligned distribution keys.
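Tables can share a colocate group only when their layouts line up: the distribution column types, the bucket count, and the replica count all have to match. The sketch below checks two layouts before the DDL is issued. TableLayout and colocate_mismatches are hypothetical names for this illustration and are not part of BenchBox or Doris.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TableLayout:
    """Hypothetical summary of a table's distribution settings."""
    name: str
    distribution_types: Tuple[str, ...]  # types of the HASH columns, in order
    buckets: int
    replication_num: int


def colocate_mismatches(a: TableLayout, b: TableLayout) -> List[str]:
    """Return the reasons two tables could not share a colocate group."""
    problems = []
    if a.distribution_types != b.distribution_types:
        problems.append("distribution column types differ")
    if a.buckets != b.buckets:
        problems.append("bucket counts differ")
    if a.replication_num != b.replication_num:
        problems.append("replication_num differs")
    return problems
```

An empty result means the two layouts are compatible; for the orders and lineitem tables above, both use a single BIGINT distribution column and 16 buckets.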
Managed Cloud Options¶
Several vendors offer managed Apache Doris cloud services:
| Provider | Service | Description |
|---|---|---|
| | VeloDB Cloud | Fully managed Doris by core contributors |
| | SelectDB Cloud | Enterprise managed Doris with compute-storage separation |
| Alibaba Cloud | ApsaraDB for SelectDB | Managed Doris on Alibaba Cloud infrastructure |
When using managed cloud services:
The host, port, and credentials are provided by the service
Stream Load endpoints may use HTTPS (enable --doris-use-tls)
HTTP port may differ from the default 8030
Troubleshooting¶
Connection Refused¶
Error: Failed to connect to Doris
Solutions:
Verify Doris FE is running and accessible on the configured host and port
Check that port 9030 (MySQL protocol) is open and not blocked by a firewall
For Docker deployments, ensure port mapping is correct (-p 9030:9030)
Test connectivity directly: mysql -h <host> -P 9030 -u root
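When no mysql client is installed on the BenchBox host, a plain TCP probe can show whether the FE ports are reachable at all. This is a minimal sketch using only the standard library; port_is_open is a hypothetical helper, and a successful probe only shows that something is listening, not that Doris is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Probing port 9030 checks the MySQL protocol endpoint, and probing port 8030 checks the FE HTTP endpoint used by Stream Load.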
Missing PyMySQL Dependency¶
Error: Missing dependencies for doris platform: pymysql
Solutions:
Install PyMySQL: uv add pymysql
Or install the Doris extra: uv add benchbox --extra doris
Stream Load Failures¶
Error: Stream Load failed with status 503
Solutions:
Verify the FE HTTP port (default 8030) is accessible from the BenchBox host
Check that BE nodes are alive: SHOW BACKENDS; via MySQL client
Ensure the target table exists before loading data
For large files (>1 GB), consider increasing the timeout or using smaller batch files
Install the requests library: uv add requests
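A Stream Load request can also fail while returning HTTP 200, because the outcome is reported in the JSON response body. The sketch below checks both layers. The class and function names are hypothetical, and only the Status value "Success" is treated as success here; other status values should be checked against the Doris Stream Load documentation.

```python
import json


class StreamLoadError(RuntimeError):
    """Raised when a Stream Load response does not indicate success."""


def check_stream_load_response(http_status: int, body: str) -> int:
    """Validate a Stream Load response and return the number of loaded rows."""
    if http_status != 200:
        raise StreamLoadError(f"Stream Load failed with status {http_status}")
    result = json.loads(body)
    status = result.get("Status")
    if status != "Success":
        message = result.get("Message", "")
        raise StreamLoadError(f"Stream Load status {status!r}: {message}")
    return int(result.get("NumberLoadedRows", 0))
```

Logging the Message field from a failed response usually narrows the problem down faster than the HTTP status code alone.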
Schema Creation Failures¶
Error: critical CREATE TABLE statement(s) failed
Solutions:
Check that the Doris user has CREATE TABLE and CREATE DATABASE permissions
Verify the target database exists or the user has CREATE DATABASE privilege
Review Doris FE logs (fe.log) for detailed error messages
Ensure sufficient disk space and memory on BE nodes
Check for incompatible column types in the translated DDL
Slow Data Loading¶
Solutions:
Install requests to enable Stream Load (the recommended bulk-ingest path; the INSERT fallback sends batches of 1,000 rows and is typically 10x or more slower for large datasets)
Ensure the FE HTTP port (8030) is accessible for Stream Load
Increase bucket count for better parallelism during loading
Check network latency between BenchBox host and Doris cluster
Monitor BE resource utilization during loading
Query Timeout¶
Error: Query execution exceeded timeout
Solutions:
Check Doris resource utilization (CPU, memory, disk I/O) via SHOW PROC '/backends';
Verify data distribution is balanced across BE nodes
Run ANALYZE TABLE to update column statistics for the query optimizer
Review the query plan with EXPLAIN for performance bottlenecks
Consider increasing exec_mem_limit for memory-intensive queries
See Also¶
Platform Comparison Matrix - Compare all platforms
Platform Selection Guide - Choosing the right platform
StarRocks Platform - Similar MPP OLAP database (shared heritage with Doris)
TPC-H Benchmark - TPC-H benchmark guide
TPC-DS Benchmark - TPC-DS benchmark guide
Deployment Modes Guide - Platform deployment architecture