MCP Server Reference¶
Complete reference for the BenchBox MCP (Model Context Protocol) server, including all available tools, resources, and prompts.
Running the Server¶
Prerequisites¶
Install BenchBox with MCP dependencies:
uv sync --extra mcp
Starting the Server¶
# Via Python module
uv run python -m benchbox.mcp
# Via entry point (if installed globally)
benchbox-mcp
# With explicit MCP path overrides
benchbox-mcp --results-dir /tmp/benchbox-results --charts-dir /tmp/benchbox-charts
The server communicates via stdio using JSON-RPC, compatible with Claude Code and other MCP clients.
Testing Locally¶
To verify the server works, you can test it interactively:
# Start server and send a test request
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' | uv run python -m benchbox.mcp
This should return a JSON response listing all available tools.
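MCP stdio transports exchange newline-delimited JSON-RPC 2.0 messages, so you can also drive the server from a script. As a minimal sketch, the helper below just frames a request line like the one piped in above (the framing follows JSON-RPC conventions; it does not perform the full MCP initialize handshake):

```python
import json

def jsonrpc_request(request_id, method, params=None):
    """Frame a JSON-RPC 2.0 request as one newline-delimited line."""
    msg = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg) + "\n"

# The same tools/list request shown in the echo example above:
line = jsonrpc_request(1, "tools/list")
print(line.strip())
```

Writing this line to the server's stdin and reading one line back from stdout yields the tool listing as a JSON-RPC response.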
Using the MCP Inspector¶
For interactive testing, use the MCP Inspector:
# Install and run the inspector
npx @modelcontextprotocol/inspector

# Connect to BenchBox
npx @modelcontextprotocol/inspector uv run python -m benchbox.mcp
The inspector provides a web UI to browse tools, test calls, and view responses.
Server Options¶
benchbox-mcp supports explicit flags and environment-variable fallback.
CLI flags

| Flag | Description |
|---|---|
| `--results-dir` | Results root used by MCP result reads/writes |
| `--charts-dir` | Charts root used by MCP visualization paths |
| `--log-level` | Logging level (DEBUG, INFO, WARNING, ERROR) |
Environment variables

| Variable | Default | Description |
|---|---|---|
| `BENCHBOX_RESULTS_DIR` | `benchmark_runs/results` | Results root when `--results-dir` is not set |
| `BENCHBOX_CHARTS_DIR` | `benchmark_runs/charts` | Charts root when `--charts-dir` is not set |
| `BENCHBOX_OUTPUT_DIR` | None | Base root used to derive results/charts when specific vars are unset |
| `BENCHBOX_LOG_LEVEL` | `INFO` | Logging level when `--log-level` is not set |
Precedence

1. Explicit MCP flag (`--results-dir`, `--charts-dir`, `--log-level`)
2. Specific env var (`BENCHBOX_RESULTS_DIR`, `BENCHBOX_CHARTS_DIR`, `BENCHBOX_LOG_LEVEL`)
3. Derived from `BENCHBOX_OUTPUT_DIR` for paths
4. Built-in defaults (`benchmark_runs/results`, `benchmark_runs/charts`, `INFO`)
Example:
BENCHBOX_RESULTS_DIR=/tmp/results BENCHBOX_LOG_LEVEL=DEBUG benchbox-mcp
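The precedence chain above can be sketched as a small resolver. This is a hypothetical helper for illustration, not BenchBox's actual implementation:

```python
import os

def resolve_results_dir(flag_value=None, env=None):
    """Resolve the MCP results root using the documented precedence order."""
    env = os.environ if env is None else env
    if flag_value:                            # 1. explicit --results-dir flag
        return flag_value
    if env.get("BENCHBOX_RESULTS_DIR"):       # 2. specific env var
        return env["BENCHBOX_RESULTS_DIR"]
    if env.get("BENCHBOX_OUTPUT_DIR"):        # 3. derived from base output dir
        return os.path.join(env["BENCHBOX_OUTPUT_DIR"], "results")
    return "benchmark_runs/results"           # 4. built-in default
```

For example, `resolve_results_dir(env={})` falls all the way through to the built-in default `benchmark_runs/results`.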
Tools¶
Tools are executable actions that can be invoked by AI assistants.
Discovery Tools¶
list_platforms¶
List all available database platforms.
Returns:
- `platforms`: List of platform objects with:
  - `name`: Platform identifier
  - `display_name`: Human-readable name
  - `category`: Platform category (analytical, cloud, embedded, dataframe)
  - `available`: Whether dependencies are installed
  - `supports_sql`: SQL query support
  - `supports_dataframe`: DataFrame API support
- `count`: Total number of platforms
- `summary`: Aggregated statistics
Example:
{
"platforms": [
{
"name": "duckdb",
"display_name": "DuckDB",
"category": "embedded",
"available": true,
"supports_sql": true,
"supports_dataframe": false
}
],
"count": 15,
"summary": {"available": 8, "sql_platforms": 12, "dataframe_platforms": 5}
}
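A client can post-process this response directly, for instance to select installed platforms that accept SQL benchmarks. The entries below are illustrative, not BenchBox's real platform list:

```python
# Illustrative list_platforms response (truncated to the relevant fields)
response = {
    "platforms": [
        {"name": "duckdb", "available": True, "supports_sql": True},
        {"name": "polars-df", "available": True, "supports_sql": False},
        {"name": "snowflake", "available": False, "supports_sql": True},
    ],
}

# Keep only platforms that are both installed and SQL-capable
sql_ready = [p["name"] for p in response["platforms"]
             if p["available"] and p["supports_sql"]]
print(sql_ready)  # ['duckdb']
```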
list_benchmarks¶
List all available benchmarks.
Returns:
- `benchmarks`: List of benchmark objects with:
  - `name`: Benchmark identifier
  - `display_name`: Human-readable name
  - `query_count`: Number of queries
  - `scale_factors`: Supported scale factors
  - `dataframe_support`: Whether DataFrame mode is supported
- `count`: Total number of benchmarks
- `categories`: Benchmarks grouped by category
get_benchmark_info¶
Get detailed information about a specific benchmark.
Parameters:
| Name | Type | Required | Description |
|---|---|---|---|
| `benchmark` | string | Yes | Benchmark name (e.g., 'tpch', 'tpcds') |
Returns:
- `name`: Benchmark identifier
- `queries`: Query information including IDs and count
- `schema`: Table information
- `scale_factors`: Supported scale factors with defaults
system_profile¶
Get system profile information for capacity planning.
Returns:
- `cpu`: Core count, thread count, architecture
- `memory`: Total and available memory in GB
- `disk`: Disk usage for key paths
- `python`: Python version and implementation
- `packages`: Versions of key packages (polars, pandas, duckdb, pyarrow)
- `benchbox`: BenchBox version
- `recommendations`: Suggested maximum scale factor based on resources
Benchmark Tools¶
run_benchmark¶
Run a benchmark on a database platform.
Parameters:
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| `platform` | string | Yes | - | Target platform (e.g., 'duckdb', 'polars-df') |
| `benchmark` | string | Yes | - | Benchmark name (e.g., 'tpch', 'tpcds') |
| `scale_factor` | float | No | 0.01 | Data scale factor |
| `queries` | string | No | None | Comma-separated query IDs (e.g., "1,3,6") |
| `phases` | string | No | "load,power" | Comma-separated phases |
Returns:
- `execution_id`: Unique run identifier
- `status`: "completed", "failed", or "no_results"
- `execution_time_seconds`: Total runtime
- `summary`: Query counts and total runtime
- `queries`: Per-query results (limited to 20)
Example:
{
"execution_id": "mcp_a1b2c3d4",
"status": "completed",
"platform": "duckdb",
"benchmark": "tpch",
"scale_factor": 0.01,
"execution_time_seconds": 5.23,
"summary": {
"total_queries": 22,
"successful_queries": 22,
"failed_queries": 0,
"total_execution_time": 1.234
}
}
dry_run¶
Preview what a benchmark run would do without executing.
Parameters:
Name |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
string |
Yes |
- |
Target platform |
|
string |
Yes |
- |
Benchmark name |
|
float |
No |
0.01 |
Data scale factor |
|
string |
No |
None |
Optional query subset |
Returns:
- `status`: Always "dry_run"
- `execution_mode`: Execution mode ("sql" or "dataframe")
- `execution_plan`: Phases and queries that would run
  - `phases`: List of phases (e.g., ["load", "power"])
  - `total_queries`: Number of queries to execute
  - `query_ids`: List of query IDs (truncated to 30)
  - `test_execution_type`: Execution type (e.g., "standard")
  - `execution_context`: Context description
- `resource_estimates`: Estimated resource requirements
  - `data_size_gb`: Estimated data size in GB
  - `memory_recommended_gb`: Recommended memory in GB
  - `disk_space_recommended_gb`: Recommended disk space in GB
  - `cpu_cores_available`: Available CPU cores on system
  - `memory_gb_available`: Available memory on system in GB
- `notes`: Important considerations
- `warnings`: Platform availability warnings (if any)
validate_config¶
Validate a benchmark configuration before running.
Parameters:
Name |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
string |
Yes |
- |
Target platform |
|
string |
Yes |
- |
Benchmark name |
|
float |
No |
1.0 |
Data scale factor |
Returns:
- `valid`: Boolean indicating if the configuration is valid
- `errors`: List of validation errors
- `warnings`: List of warnings
- `recommendations`: Configuration recommendations
Results Tools¶
list_recent_runs¶
List recent benchmark runs.
Parameters:
Name |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
int |
No |
10 |
Maximum results to return |
|
string |
No |
None |
Filter by platform |
|
string |
No |
None |
Filter by benchmark |
Returns:
- `runs`: List of run metadata
- `count`: Number of runs returned
- `total_available`: Total runs in results directory
get_results¶
Get detailed results from a benchmark run.
Parameters:
Name |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
string |
Yes |
- |
Result filename from |
|
bool |
No |
true |
Include per-query details |
Returns:
- `platform`: Platform configuration
- `benchmark`: Benchmark name
- `summary`: Execution summary
- `queries`: Per-query timing details
compare_results¶
Compare two benchmark runs to identify performance changes.
Parameters:
Name |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
string |
Yes |
- |
Baseline result file |
|
string |
Yes |
- |
Comparison result file |
|
float |
No |
10.0 |
Change threshold for highlighting |
Returns:
- `baseline`: Baseline run metadata
- `comparison`: Comparison run metadata
- `summary`: Regression/improvement counts
- `regressions`: Query IDs that regressed
- `improvements`: Query IDs that improved
- `query_comparisons`: Per-query delta details
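The threshold is applied to the relative change in per-query runtime. A sketch of how a client might reproduce the regression/improvement split, assuming simple `{query_id: seconds}` mappings (the field names here are illustrative):

```python
def classify_queries(baseline, comparison, threshold_percent=10.0):
    """Split queries into regressions and improvements by percent runtime change."""
    regressions, improvements = [], []
    for qid, base_time in baseline.items():
        if qid not in comparison or base_time <= 0:
            continue
        delta_pct = (comparison[qid] - base_time) / base_time * 100
        if delta_pct >= threshold_percent:        # slower beyond threshold
            regressions.append(qid)
        elif delta_pct <= -threshold_percent:     # faster beyond threshold
            improvements.append(qid)
    return regressions, improvements

# Q1 slows from 1.0s to 1.5s (+50%); Q6 speeds up from 2.0s to 1.0s (-50%)
print(classify_queries({"1": 1.0, "6": 2.0}, {"1": 1.5, "6": 1.0}))
```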
export_summary¶
Export a formatted summary of benchmark results.
Parameters:
Name |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
string |
Yes |
- |
Result filename |
|
string |
No |
“text” |
Output format: ‘text’, ‘markdown’, or ‘json’ |
Returns:
- `format`: Output format used
- `content`: Formatted summary string
export_results¶
Export benchmark results to different formats.
Parameters:
Name |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
string |
Yes |
- |
Result filename |
|
string |
No |
“json” |
Output format: ‘json’, ‘csv’, or ‘html’ |
|
string |
No |
None |
File path to write output (relative to results dir) |
Returns:
- `status`: "exported"
- `format`: Format used
- `content`: Formatted content (if `output_path` not specified)
- `output_path`: Path to written file (if `output_path` specified)
- `size_bytes`: Size of exported content
generate_data¶
Generate benchmark data without running queries.
Parameters:
Name |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
string |
Yes |
- |
Benchmark name (e.g., ‘tpch’, ‘tpcds’) |
|
float |
No |
0.01 |
Data scale factor |
|
string |
No |
“parquet” |
Data format: ‘parquet’ or ‘csv’ |
|
bool |
No |
false |
Force regeneration if data exists |
Returns:
- `status`: "generated" or "exists"
- `data_path`: Path to generated data
- `file_count`: Number of files generated
- `total_size_mb`: Total data size in MB
check_dependencies¶
Check platform dependencies and installation status.
Parameters:
Name |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
string |
No |
None |
Specific platform to check (checks all if not provided) |
|
bool |
No |
false |
Include detailed package information |
Returns:
- `platforms`: Dictionary of platform status with availability and missing packages
- `summary`: Total, available, and missing dependency counts
- `recommendations`: Installation suggestions
- `quick_status`: Quick status for single-platform queries
Analytics Tools¶
get_query_plan¶
Get query execution plan from benchmark results.
Parameters:
Name |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
string |
Yes |
- |
Result filename containing query plans |
|
string |
Yes |
- |
Query identifier (e.g., ‘1’, ‘Q1’, ‘q05’) |
|
string |
No |
“tree” |
Output format: ‘tree’, ‘json’, or ‘summary’ |
Returns:
- `status`: "success" or "no_plan"
- `query_id`: Requested query ID
- `plan`: Query plan in requested format
- `runtime_ms`: Query runtime
Note: Query plans must be captured using `--capture-plans` during benchmark execution.
detect_regressions¶
Automatically detect performance regressions across recent runs.
Parameters:
Name |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
string |
No |
None |
Filter by platform |
|
string |
No |
None |
Filter by benchmark |
|
float |
No |
10.0 |
Regression threshold percentage |
|
int |
No |
5 |
Number of recent runs to analyze |
Returns:
- `comparison`: Baseline and current run metadata
- `summary`: Regression and improvement counts
- `regressions`: List of regressed queries with severity
- `improvements`: List of improved queries
- `recommendations`: Suggested actions
Visualization Tools¶
suggest_charts¶
Analyze results and suggest appropriate chart types.
Parameters:
Name |
Type |
Required |
Description |
|---|---|---|---|
|
string |
Yes |
Comma-separated list of result filenames |
Returns:
- `suggestions`: List of chart recommendations with priority and reasoning
- `primary`: Best chart type recommendation with reason
- `data_profile`: Analysis of result data (result_count, platforms, has_cost_data, etc.)
- `source_files`: Input files used
generate_chart¶
Generate charts from benchmark results.
Parameters:
Name |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
string |
Yes |
- |
Comma-separated list of result filenames |
|
string |
No |
“performance_bar” |
Chart type: performance_bar, power_bar, distribution_box, query_heatmap, query_histogram, cost_scatter, time_series, comparison_bar, diverging_bar, summary_box, percentile_ladder, normalized_speedup, stacked_phase, sparkline_table, cdf_chart, rank_table |
|
string |
No |
None |
Template: default, flagship, head_to_head, trends, cost_optimization, comparison, latency_deep_dive, regression_triage, executive_summary |
|
string |
No |
None |
Output directory (relative to charts dir) |
|
string |
No |
“ascii” |
Output format: ‘ascii’ for terminal-friendly text output |
Returns:
- `status`: "generated"
- `chart_type`: Generated chart type (or `template` if using a template)
- `format`: "ascii"
- `content`: ASCII chart rendered inline
- `source_files`: Input files used
- `note`: Usage instructions
Examples:
# Single chart
generate_chart(result_files="run1.json", chart_type="performance_bar")
# Template (multiple charts combined with separators)
generate_chart(result_files="run1.json,run2.json", template="head_to_head")
# Platform comparison
generate_chart(result_files="duckdb.json,polars.json", chart_type="query_heatmap")
Output Features:
- ANSI colors for terminal display (copy-paste preserves formatting)
- Unicode box-drawing characters for clean visualization
- Best/worst highlighting, legends, and scale indicators
- Template output combines multiple charts with `────────` separators
Historical Analysis Tools¶
get_performance_trends¶
Get performance trends over multiple benchmark runs.
Parameters:
Name |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
string |
No |
None |
Filter by platform |
|
string |
No |
None |
Filter by benchmark |
|
string |
No |
“geometric_mean” |
Performance metric: ‘geometric_mean’, ‘p50’, ‘p95’, ‘p99’, ‘total_time’ |
|
int |
No |
10 |
Maximum runs to analyze |
Returns:
- `metric`: Metric used
- `summary`: Run count, first/last run timestamps, trend direction
- `data_points`: Time-series performance data
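The default metric is a geometric mean, which is the usual choice for benchmark suites because it weights relative changes evenly across fast and slow queries. A minimal computation sketch (not BenchBox's internal code):

```python
import math

def geometric_mean(times):
    """Geometric mean of positive query runtimes, via the log-sum form."""
    return math.exp(sum(math.log(t) for t in times) / len(times))

# A 2x slowdown and a 2x speedup cancel out under the geometric mean
print(geometric_mean([0.5, 2.0]))  # 1.0
```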
aggregate_results¶
Aggregate multiple benchmark results into summary statistics.
Parameters:
Name |
Type |
Required |
Default |
Description |
|---|---|---|---|---|
|
string |
No |
None |
Filter by platform |
|
string |
No |
None |
Filter by benchmark |
|
string |
No |
“platform” |
Grouping dimension: ‘platform’, ‘benchmark’, or ‘date’ |
Returns:
- `group_by`: Grouping used
- `summary`: Total groups and runs
- `aggregates`: Statistical summaries per group (mean, std, min, max, percentiles)
Prompts¶
Prompts are reusable templates for AI analysis. Invoke via slash commands in Claude Code.
analyze_results¶
Analyze benchmark results and identify performance patterns.
Arguments (positional):
- `benchmark` (default: "tpch")
- `platform` (default: "duckdb")
- `focus` (optional): Focus area like 'slowest_queries', 'memory', 'io'
Usage:
/mcp__benchbox__analyze_results tpch duckdb slowest_queries
compare_platforms¶
Compare benchmark performance across multiple platforms.
Arguments (positional):
- `benchmark` (default: "tpch")
- `platforms` (default: "duckdb,polars-df"): Comma-separated platform names
- `scale_factor` (default: 0.01)
Usage:
/mcp__benchbox__compare_platforms tpch "duckdb,polars-df,sqlite" 0.1
identify_regressions¶
Identify performance regressions between benchmark runs.
Arguments (positional):
- `baseline_run` (optional): Baseline result file
- `comparison_run` (optional): Comparison result file
- `threshold_percent` (default: 10.0)
Usage:
/mcp__benchbox__identify_regressions run1.json run2.json 5
benchmark_planning¶
Help plan a benchmark strategy for a specific use case.
Arguments (positional):
- `use_case` (default: "testing"): One of 'testing', 'production', 'comparison', 'regression'
- `platforms` (optional): Comma-separated platform list
- `time_budget_minutes` (default: 30)
Usage:
/mcp__benchbox__benchmark_planning comparison "duckdb,snowflake" 60
troubleshoot_failure¶
Diagnose and resolve benchmark failures.
Arguments (positional):
- `error_message` (optional): Error message from failed run
- `platform` (optional): Platform where failure occurred
- `benchmark` (optional): Benchmark that failed
Usage:
/mcp__benchbox__troubleshoot_failure "Connection refused" snowflake tpch
benchmark_run¶
Execute a planned benchmark with validation and dependency checks.
Arguments (positional):
- `platform` (default: "duckdb"): Target platform
- `benchmark` (default: "tpch"): Benchmark to run
- `scale_factor` (default: 0.01): Data scale factor
- `queries` (optional): Query subset (e.g., "1,5,10")
Usage:
/mcp__benchbox__benchmark_run duckdb tpch 0.1
/mcp__benchbox__benchmark_run snowflake tpcds 1 "1,5,10"
This prompt:

1. Validates the configuration
2. Checks dependencies
3. Runs the benchmark
4. Provides an execution summary and recommendations
platform_tuning¶
Get tuning recommendations for a specific platform.
Arguments (positional):
- `platform` (default: "duckdb"): Platform to tune
- `workload` (optional): Workload characteristics description
Usage:
/mcp__benchbox__platform_tuning duckdb
/mcp__benchbox__platform_tuning snowflake "heavy aggregation workload"
This prompt provides:

- Memory configuration recommendations
- Parallelism settings
- I/O optimization
- Platform-specific tuning parameters
Resources¶
Resources provide read-only access to BenchBox data. Currently, resources are accessed indirectly through tools.
Error Handling¶
All tools return structured error information:
{
"error": "Description of the error",
"error_type": "ExceptionClassName",
"suggestion": "How to resolve the issue"
}
Common error types:

- `ConfigurationError`: Invalid platform or benchmark configuration
- `DependencyError`: Missing required dependencies
- `FileNotFoundError`: Result file not found
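Since errors arrive as ordinary JSON objects rather than exceptions, a client can branch on the presence of the `error` key. A minimal handler sketch (the surfacing logic here is an assumption, not part of BenchBox):

```python
def handle_tool_result(result):
    """Return the payload on success; raise with the server's hint on error."""
    if isinstance(result, dict) and "error" in result:
        suggestion = result.get("suggestion", "no suggestion provided")
        raise RuntimeError(f"{result.get('error_type', 'Error')}: "
                           f"{result['error']} (hint: {suggestion})")
    return result

print(handle_tool_result({"status": "completed"}))  # passes through unchanged
```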