MCP Server Reference¶
Complete reference for the BenchBox MCP (Model Context Protocol) server, including all available tools, resources, and prompts.
Running the Server¶
Prerequisites¶
Install BenchBox with MCP dependencies:
uv sync --extra mcp
Starting the Server¶
# Via Python module (recommended)
uv run python -m benchbox.mcp
# Via entry point (if installed globally)
benchbox-mcp
The server communicates via stdio using JSON-RPC, compatible with Claude Code and other MCP clients.
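For example, the server can be registered in a project-level MCP configuration file such as .mcp.json. The sketch below assumes the common mcpServers layout used by Claude Code and Claude Desktop; the server name "benchbox" is chosen for illustration:
{
  "mcpServers": {
    "benchbox": {
      "command": "uv",
      "args": ["run", "python", "-m", "benchbox.mcp"]
    }
  }
}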
Testing Locally¶
To verify the server works, send it a test request from the command line:
# Start server and send a test request
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' | uv run python -m benchbox.mcp
This should return a JSON response listing all available tools.
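Tool calls follow the same pattern via a tools/call request (a full MCP client performs an initialize handshake before issuing tool calls). For example, invoking list_platforms, which takes no arguments:
{"jsonrpc": "2.0", "id": 2, "method": "tools/call", "params": {"name": "list_platforms", "arguments": {}}}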
Using the MCP Inspector¶
For interactive testing, use the MCP Inspector:
# Launch the inspector and connect it to BenchBox
npx @modelcontextprotocol/inspector uv run python -m benchbox.mcp
The inspector provides a web UI to browse tools, test calls, and view responses.
Server Options¶
The MCP server accepts environment variables for configuration:
| Variable | Default | Description |
|---|---|---|
| BENCHBOX_MCP_LOG_LEVEL | | Logging verbosity (DEBUG, INFO, WARNING, ERROR) |
| | | Directory for benchmark results |
Example with custom logging:
BENCHBOX_MCP_LOG_LEVEL=DEBUG uv run python -m benchbox.mcp
Tools¶
Tools are executable actions that can be invoked by AI assistants.
Discovery Tools¶
list_platforms¶
List all available database platforms.
Returns:
platforms: List of platform objects with:
    name: Platform identifier
    display_name: Human-readable name
    category: Platform category (analytical, cloud, embedded, dataframe)
    available: Whether dependencies are installed
    supports_sql: SQL query support
    supports_dataframe: DataFrame API support
count: Total number of platforms
summary: Aggregated statistics
Example:
{
"platforms": [
{
"name": "duckdb",
"display_name": "DuckDB",
"category": "embedded",
"available": true,
"supports_sql": true,
"supports_dataframe": false
}
],
"count": 15,
"summary": {"available": 8, "sql_platforms": 12, "dataframe_platforms": 5}
}
list_benchmarks¶
List all available benchmarks.
Returns:
benchmarks: List of benchmark objects with:
    name: Benchmark identifier
    display_name: Human-readable name
    query_count: Number of queries
    scale_factors: Supported scale factors
    dataframe_support: Whether DataFrame mode is supported
count: Total number of benchmarks
categories: Benchmarks grouped by category
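An illustrative response is shown below; field values and category names are examples only and will vary with the installed version:
{
  "benchmarks": [
    {
      "name": "tpch",
      "display_name": "TPC-H",
      "query_count": 22,
      "scale_factors": [0.01, 0.1, 1.0, 10.0],
      "dataframe_support": true
    }
  ],
  "count": 6,
  "categories": {"decision_support": ["tpch", "tpcds"]}
}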
get_benchmark_info¶
Get detailed information about a specific benchmark.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| | string | Yes | Benchmark name (e.g., ‘tpch’, ‘tpcds’) |
Returns:
name: Benchmark identifier
queries: Query information including IDs and count
schema: Table information
scale_factors: Supported scale factors with defaults
system_profile¶
Get system profile information for capacity planning.
Returns:
cpu: Core count, thread count, architecture
memory: Total and available memory in GB
disk: Disk usage for key paths
python: Python version and implementation
packages: Versions of key packages (polars, pandas, duckdb, pyarrow)
benchbox: BenchBox version
recommendations: Suggested maximum scale factor based on resources
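A sketch of the response shape is shown below; the nested key names and values are illustrative, not an exact schema:
{
  "cpu": {"cores": 8, "threads": 16, "architecture": "arm64"},
  "memory": {"total_gb": 32.0, "available_gb": 24.5},
  "disk": {"results_dir_free_gb": 120.0},
  "python": {"version": "3.12.4", "implementation": "CPython"},
  "packages": {"polars": "1.9.0", "pandas": "2.2.3", "duckdb": "1.1.0", "pyarrow": "17.0.0"},
  "benchbox": "0.1.0",
  "recommendations": {"max_scale_factor": 10}
}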
Benchmark Tools¶
run_benchmark¶
Run a benchmark on a database platform.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| | string | Yes | - | Target platform (e.g., ‘duckdb’, ‘polars-df’) |
| | string | Yes | - | Benchmark name (e.g., ‘tpch’, ‘tpcds’) |
| | float | No | 0.01 | Data scale factor |
| | string | No | None | Comma-separated query IDs (e.g., “1,3,6”) |
| | string | No | “load,power” | Comma-separated phases |
Returns:
execution_id: Unique run identifier
status: "completed", "failed", or "no_results"
execution_time_seconds: Total runtime
summary: Query counts and total runtime
queries: Per-query results (limited to 20)
Example:
{
"execution_id": "mcp_a1b2c3d4",
"status": "completed",
"platform": "duckdb",
"benchmark": "tpch",
"scale_factor": 0.01,
"execution_time_seconds": 5.23,
"summary": {
"total_queries": 22,
"successful_queries": 22,
"failed_queries": 0,
"total_execution_time": 1.234
}
}
dry_run¶
Preview what a benchmark run would do without executing.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| | string | Yes | - | Target platform |
| | string | Yes | - | Benchmark name |
| | float | No | 0.01 | Data scale factor |
| | string | No | None | Optional query subset |
Returns:
status: Always "dry_run"
execution_mode: Execution mode ("sql" or "dataframe")
execution_plan: Phases and queries that would run
    phases: List of phases (e.g., ["load", "power"])
    total_queries: Number of queries to execute
    query_ids: List of query IDs (truncated to 30)
    test_execution_type: Execution type (e.g., "standard")
    execution_context: Context description
resource_estimates: Estimated resource requirements
    data_size_gb: Estimated data size in GB
    memory_recommended_gb: Recommended memory in GB
    disk_space_recommended_gb: Recommended disk space in GB
    cpu_cores_available: Available CPU cores on system
    memory_gb_available: Available memory on system in GB
notes: Important considerations
warnings: Platform availability warnings (if any)
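An illustrative dry_run response (values are examples only):
{
  "status": "dry_run",
  "execution_mode": "sql",
  "execution_plan": {
    "phases": ["load", "power"],
    "total_queries": 22,
    "query_ids": ["1", "2", "3"],
    "test_execution_type": "standard",
    "execution_context": "local execution on duckdb"
  },
  "resource_estimates": {
    "data_size_gb": 0.01,
    "memory_recommended_gb": 1.0,
    "disk_space_recommended_gb": 1.0,
    "cpu_cores_available": 8,
    "memory_gb_available": 24.5
  },
  "notes": ["No data is generated and no queries are executed."],
  "warnings": []
}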
validate_config¶
Validate a benchmark configuration before running.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| | string | Yes | - | Target platform |
| | string | Yes | - | Benchmark name |
| | float | No | 1.0 | Data scale factor |
Returns:
valid: Boolean indicating if configuration is valid
errors: List of validation errors
warnings: List of warnings
recommendations: Configuration recommendations
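An illustrative response (values are examples only):
{
  "valid": true,
  "errors": [],
  "warnings": ["Scale factor 1.0 generates roughly 1 GB of data"],
  "recommendations": ["Run dry_run first to review resource estimates"]
}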
Results Tools¶
list_recent_runs¶
List recent benchmark runs.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| | int | No | 10 | Maximum results to return |
| | string | No | None | Filter by platform |
| | string | No | None | Filter by benchmark |
Returns:
runs: List of run metadata
count: Number of runs returned
total_available: Total runs in results directory
get_results¶
Get detailed results from a benchmark run.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| | string | Yes | - | Result filename from list_recent_runs |
| | bool | No | true | Include per-query details |
Returns:
platform: Platform configuration
benchmark: Benchmark name
summary: Execution summary
queries: Per-query timing details
compare_results¶
Compare two benchmark runs to identify performance changes.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| | string | Yes | - | Baseline result file |
| | string | Yes | - | Comparison result file |
| | float | No | 10.0 | Change threshold for highlighting |
Returns:
baseline: Baseline run metadata
comparison: Comparison run metadata
summary: Regression/improvement counts
regressions: Query IDs that regressed
improvements: Query IDs that improved
query_comparisons: Per-query delta details
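An illustrative response; the nested keys in query_comparisons are examples of the delta details, not an exact schema:
{
  "baseline": {"filename": "run1.json", "platform": "duckdb", "benchmark": "tpch"},
  "comparison": {"filename": "run2.json", "platform": "duckdb", "benchmark": "tpch"},
  "summary": {"regressions": 1, "improvements": 3, "unchanged": 18},
  "regressions": ["9"],
  "improvements": ["1", "6", "13"],
  "query_comparisons": [
    {"query_id": "9", "baseline_seconds": 0.120, "comparison_seconds": 0.151, "change_percent": 25.8}
  ]
}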
export_summary¶
Export a formatted summary of benchmark results.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| | string | Yes | - | Result filename |
| | string | No | “text” | Output format: ‘text’, ‘markdown’, or ‘json’ |
Returns:
format: Output format used
content: Formatted summary string
export_results¶
Export benchmark results to different formats.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| | string | Yes | - | Result filename |
| | string | No | “json” | Output format: ‘json’, ‘csv’, or ‘html’ |
| | string | No | None | File path to write output (relative to results dir) |
Returns:
status: "exported"
format: Format used
content: Formatted content (if output_path not specified)
output_path: Path to written file (if output_path specified)
size_bytes: Size of exported content
generate_data¶
Generate benchmark data without running queries.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| | string | Yes | - | Benchmark name (e.g., ‘tpch’, ‘tpcds’) |
| | float | No | 0.01 | Data scale factor |
| | string | No | “parquet” | Data format: ‘parquet’ or ‘csv’ |
| | bool | No | false | Force regeneration if data exists |
Returns:
status: "generated" or "exists"
data_path: Path to generated data
file_count: Number of files generated
total_size_mb: Total data size in MB
check_dependencies¶
Check platform dependencies and installation status.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| | string | No | None | Specific platform to check (checks all if not provided) |
| | bool | No | false | Include detailed package information |
Returns:
platforms: Dictionary of platform status with availability and missing packages
summary: Total, available, and missing dependency counts
recommendations: Installation suggestions
quick_status: Quick status for single platform queries
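An illustrative response; package names, counts, and nested keys are examples only:
{
  "platforms": {
    "duckdb": {"available": true, "missing": []},
    "snowflake": {"available": false, "missing": ["snowflake-connector-python"]}
  },
  "summary": {"total": 15, "available": 8, "missing": 7},
  "recommendations": ["Install the missing Snowflake connector to enable that platform"],
  "quick_status": null
}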
Analytics Tools¶
get_query_plan¶
Get query execution plan from benchmark results.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| | string | Yes | - | Result filename containing query plans |
| | string | Yes | - | Query identifier (e.g., ‘1’, ‘Q1’, ‘q05’) |
| | string | No | “tree” | Output format: ‘tree’, ‘json’, or ‘summary’ |
Returns:
status: "success" or "no_plan"
query_id: Requested query ID
plan: Query plan in requested format
runtime_ms: Query runtime
Note: Query plans must be captured using --capture-plans during benchmark execution.
detect_regressions¶
Automatically detect performance regressions across recent runs.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| | string | No | None | Filter by platform |
| | string | No | None | Filter by benchmark |
| | float | No | 10.0 | Regression threshold percentage |
| | int | No | 5 | Number of recent runs to analyze |
Returns:
comparison: Baseline and current run metadata
summary: Regression and improvement counts
regressions: List of regressed queries with severity
improvements: List of improved queries
recommendations: Suggested actions
Historical Analysis Tools¶
get_performance_trends¶
Get performance trends over multiple benchmark runs.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| | string | No | None | Filter by platform |
| | string | No | None | Filter by benchmark |
| | string | No | “geometric_mean” | Performance metric: ‘geometric_mean’, ‘p50’, ‘p95’, ‘p99’, ‘total_time’ |
| | int | No | 10 | Maximum runs to analyze |
Returns:
metric: Metric used
summary: Run count, first/last run timestamps, trend direction
data_points: Time-series performance data
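An illustrative response (timestamps, values, and nested keys are examples only):
{
  "metric": "geometric_mean",
  "summary": {"run_count": 5, "first_run": "2025-01-02T10:00:00", "last_run": "2025-01-20T10:00:00", "trend": "improving"},
  "data_points": [
    {"timestamp": "2025-01-02T10:00:00", "value": 0.152},
    {"timestamp": "2025-01-20T10:00:00", "value": 0.131}
  ]
}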
aggregate_results¶
Aggregate multiple benchmark results into summary statistics.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| | string | No | None | Filter by platform |
| | string | No | None | Filter by benchmark |
| | string | No | “platform” | Grouping dimension: ‘platform’, ‘benchmark’, or ‘date’ |
Returns:
group_by: Grouping used
summary: Total groups and runs
aggregates: Statistical summaries per group (mean, std, min, max, percentiles)
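An illustrative response grouped by platform (statistics and nested keys are examples only):
{
  "group_by": "platform",
  "summary": {"total_groups": 2, "total_runs": 12},
  "aggregates": {
    "duckdb": {"mean": 0.14, "std": 0.02, "min": 0.11, "max": 0.18, "p50": 0.13, "p95": 0.17},
    "polars-df": {"mean": 0.21, "std": 0.03, "min": 0.17, "max": 0.26, "p50": 0.20, "p95": 0.25}
  }
}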
Prompts¶
Prompts are reusable templates for AI analysis. Invoke via slash commands in Claude Code.
analyze_results¶
Analyze benchmark results and identify performance patterns.
Arguments (positional):
benchmark (default: “tpch”)
platform (default: “duckdb”)
focus (optional): Focus area like ‘slowest_queries’, ‘memory’, ‘io’
Usage:
/mcp__benchbox__analyze_results tpch duckdb slowest_queries
compare_platforms¶
Compare benchmark performance across multiple platforms.
Arguments (positional):
benchmark (default: “tpch”)
platforms (default: “duckdb,polars-df”): Comma-separated platform names
scale_factor (default: 0.01)
Usage:
/mcp__benchbox__compare_platforms tpch "duckdb,polars-df,sqlite" 0.1
identify_regressions¶
Identify performance regressions between benchmark runs.
Arguments (positional):
baseline_run (optional): Baseline result file
comparison_run (optional): Comparison result file
threshold_percent (default: 10.0)
Usage:
/mcp__benchbox__identify_regressions run1.json run2.json 5
benchmark_planning¶
Help plan a benchmark strategy for a specific use case.
Arguments (positional):
use_case (default: “testing”): One of ‘testing’, ‘production’, ‘comparison’, ‘regression’
platforms (optional): Comma-separated platform list
time_budget_minutes (default: 30)
Usage:
/mcp__benchbox__benchmark_planning comparison "duckdb,snowflake" 60
troubleshoot_failure¶
Diagnose and resolve benchmark failures.
Arguments (positional):
error_message (optional): Error message from failed run
platform (optional): Platform where failure occurred
benchmark (optional): Benchmark that failed
Usage:
/mcp__benchbox__troubleshoot_failure "Connection refused" snowflake tpch
benchmark_run¶
Execute a planned benchmark with validation and dependency checks.
Arguments (positional):
platform (default: “duckdb”): Target platform
benchmark (default: “tpch”): Benchmark to run
scale_factor (default: 0.01): Data scale factor
queries (optional): Query subset (e.g., “1,5,10”)
Usage:
/mcp__benchbox__benchmark_run duckdb tpch 0.1
/mcp__benchbox__benchmark_run snowflake tpcds 1 "1,5,10"
This prompt:
Validates the configuration
Checks dependencies
Runs the benchmark
Provides execution summary and recommendations
platform_tuning¶
Get tuning recommendations for a specific platform.
Arguments (positional):
platform (default: “duckdb”): Platform to tune
workload (optional): Workload characteristics description
Usage:
/mcp__benchbox__platform_tuning duckdb
/mcp__benchbox__platform_tuning snowflake "heavy aggregation workload"
This prompt provides:
Memory configuration recommendations
Parallelism settings
I/O optimization
Platform-specific tuning parameters
Resources¶
Resources provide read-only access to BenchBox data. Currently, resources are accessed indirectly through tools.
Error Handling¶
All tools return structured error information:
{
"error": "Description of the error",
"error_type": "ExceptionClassName",
"suggestion": "How to resolve the issue"
}
Common error types:
ConfigurationError: Invalid platform or benchmark configuration
DependencyError: Missing required dependencies
FileNotFoundError: Result file not found