Data Model¶
Understanding BenchBox’s result schema, data structures, and serialization formats.
Overview¶
BenchBox produces structured, machine-readable results that enable:
Reproducibility: Complete execution metadata for result verification
Comparison: Standardized format for cross-platform and temporal analysis
Integration: JSON schema compatible with analysis tools and databases
Compliance: TPC-compatible metric calculation and validation
Result Schema Hierarchy¶
BenchmarkResults
├── Core Metadata (benchmark name, platform, scale factor, timestamp)
├── Execution Phases
│   ├── Setup Phase
│   │   ├── Data Generation
│   │   ├── Schema Creation
│   │   ├── Data Loading
│   │   └── Validation
│   ├── Power Test Phase (optional)
│   └── Throughput Test Phase (optional)
├── Query Results (list of QueryResult objects)
├── Query Definitions (SQL text and parameters)
├── System Profile (hardware, software, configuration)
├── Platform Info (driver versions, configuration)
└── Validation Results (correctness checks)
Core Data Structures¶
BenchmarkResults¶
Top-level object containing complete benchmark execution information.
Key Fields:
{
    # Identification
    "benchmark_name": str,          # e.g., "TPC-H", "TPC-DS", "ClickBench"
    "platform": str,                # e.g., "DuckDB", "Snowflake"
    "execution_id": str,            # Unique run identifier
    "timestamp": datetime,          # ISO 8601 timestamp

    # Configuration
    "scale_factor": float,          # Data size multiplier
    "test_execution_type": str,     # "standard", "power", "throughput"

    # Timing Summary
    "duration_seconds": float,      # Total execution time
    "total_execution_time": float,  # Query execution time only
    "average_query_time": float,    # Mean query time
    "data_loading_time": float,     # Time to load data
    "schema_creation_time": float,  # Time to create schema

    # Query Statistics
    "total_queries": int,           # Number of queries attempted
    "successful_queries": int,      # Queries that succeeded
    "failed_queries": int,          # Queries that failed

    # Detailed Results
    "query_results": List[QueryResult],
    "query_definitions": Dict[str, QueryDefinition],
    "execution_phases": ExecutionPhases,

    # Validation
    "validation_status": str,       # "PASSED", "FAILED", "SKIPPED"
    "validation_details": dict,     # Validation check results

    # System Context
    "system_profile": dict,         # Hardware/software info
    "platform_info": dict,          # Platform-specific metadata
    "tunings_applied": dict,        # Performance tuning configuration
}
Example:
{
    "benchmark_name": "TPC-H",
    "platform": "DuckDB",
    "scale_factor": 1.0,
    "execution_id": "tpch_1729613234",
    "timestamp": "2025-10-12T10:30:34Z",
    "duration_seconds": 47.3,
    "total_execution_time": 45.2,
    "average_query_time": 2.06,
    "total_queries": 22,
    "successful_queries": 22,
    "failed_queries": 0,
    "validation_status": "PASSED"
}
QueryResult¶
Individual query execution details.
Structure:
{
    "query_id": str,              # e.g., "q1", "query15"
    "stream_id": str,             # "stream_1" (default), "stream_2" (throughput)
    "execution_time": float,      # Seconds
    "status": str,                # "SUCCESS", "FAILED", "SKIPPED"
    "row_count": int,             # Number of rows returned
    "data_scanned_bytes": int,    # Bytes scanned (if available)
    "error_message": str | None,  # Error details if failed
    "query_text": str,            # Executed SQL
    "parameters": dict | None,    # Query parameters used
    "start_time": datetime,       # Query start timestamp
    "end_time": datetime,         # Query completion timestamp
}
Example:
{
    "query_id": "q1",
    "stream_id": "stream_1",
    "execution_time": 2.134,
    "status": "SUCCESS",
    "row_count": 4,
    "data_scanned_bytes": 104857600,
    "error_message": null,
    "start_time": "2025-10-12T10:30:35Z",
    "end_time": "2025-10-12T10:30:37Z"
}
ExecutionPhases¶
Detailed timing breakdown for benchmark phases.
Structure:
{
    "setup": SetupPhase,                     # Data gen, schema, loading
    "power_test": PowerTestPhase,            # Single-stream execution
    "throughput_test": ThroughputTestPhase,  # Multi-stream execution
}
SetupPhase¶
{
    "data_generation": {
        "duration_ms": int,
        "status": str,
        "tables_generated": int,
        "total_rows_generated": int,
        "total_data_size_bytes": int,
        "per_table_stats": dict,
    },
    "schema_creation": {
        "duration_ms": int,
        "status": str,
        "tables_created": int,
        "constraints_applied": int,
        "indexes_created": int,
    },
    "data_loading": {
        "duration_ms": int,
        "status": str,
        "total_rows_loaded": int,
        "tables_loaded": int,
        "per_table_stats": dict,
    },
    "validation": {
        "duration_ms": int,
        "row_count_validation": str,
        "schema_validation": str,
        "data_integrity_checks": str,
    }
}
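To aggregate these timings, a minimal sketch (assuming the results JSON has been loaded into a dict named data with the structure above):

# Sum the setup phase durations, all reported in milliseconds
setup = data["execution_phases"]["setup"]
total_setup_ms = sum(
    setup[phase]["duration_ms"]
    for phase in ("data_generation", "schema_creation", "data_loading", "validation")
)
print(f"Total setup time: {total_setup_ms / 1000:.1f}s")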
PowerTestPhase¶
{
    "query_stream": List[QueryResult],  # Single stream execution
    "start_time": datetime,
    "end_time": datetime,
    "geometric_mean": float,            # Geomean of query times
}
ThroughputTestPhase¶
{
    "streams": List[QueryStream],                # Multiple concurrent streams
    "refresh_functions": List[RefreshFunction],  # Maintenance operations
    "start_time": datetime,
    "end_time": datetime,
    "measurement_interval_seconds": float,
    "throughput_qph": float,                     # Queries per hour
}
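How throughput_qph relates to the other fields can be illustrated with a short sketch. This is an assumed derivation (queries completed across all streams, scaled to an hour), not necessarily the exact TPC throughput formula, which also factors in scale factor:

# Illustrative only: values below are made up for the example
completed_queries = 132                    # e.g., 6 streams x 22 queries
measurement_interval_seconds = 300.0
throughput_qph = completed_queries * 3600 / measurement_interval_seconds
print(throughput_qph)                      # 1584.0 queries per hour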
QueryDefinition¶
SQL query template and parameters.
Structure:
{
    "sql": str,                 # Query text (may have placeholders)
    "parameters": dict | None,  # Parameter values for substitution
    "description": str | None,  # Query purpose/description
}
Example:
{
    "q1": {
        "sql": "SELECT l_returnflag, l_linestatus, ...",
        "parameters": {"date": "1998-09-02"},
        "description": "Pricing Summary Report"
    }
}
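A sketch of how such a definition might be resolved before execution, assuming simple ":name"-style placeholders (the placeholder syntax actually used by a benchmark's templates may differ):

# Naive parameter substitution, for illustration only
qdef = data["query_definitions"]["q1"]
sql = qdef["sql"]
for name, value in (qdef["parameters"] or {}).items():
    sql = sql.replace(f":{name}", f"'{value}'")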
System Profile¶
Hardware and software context for reproducibility.
{
    "os": str,                    # "Linux", "macOS", "Windows"
    "os_version": str,            # "Ubuntu 22.04", "macOS 14.1"
    "python_version": str,        # "3.11.5"
    "benchbox_version": str,      # "0.1.0"
    "cpu_model": str,             # "Intel Xeon E5-2686 v4"
    "cpu_cores": int,             # 8
    "ram_gb": float,              # 32.0
    "anonymous_machine_id": str,  # Hashed machine identifier
}
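Most of these fields can be gathered with the standard library; a rough sketch of how such a profile might be assembled (BenchBox's own collection code may differ, e.g. in mapping "Darwin" to "macOS"):

import os
import platform

# Subset of the system profile, collected from the standard library
profile = {
    "os": platform.system(),          # "Linux", "Darwin", "Windows"
    "os_version": platform.platform(),
    "python_version": platform.python_version(),
    "cpu_cores": os.cpu_count(),
}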
Platform Info¶
Platform-specific metadata.
{
    "platform_name": str,          # "DuckDB", "Snowflake", etc.
    "driver_name": str,            # "duckdb", "snowflake-connector-python"
    "driver_version": str,         # "0.9.2", "3.0.4"
    "configuration": dict,         # Platform-specific settings
    "warehouse_size": str | None,  # Cloud warehouse size
    "cluster_id": str | None,      # Cloud cluster identifier
}
Example (Snowflake):
{
    "platform_name": "Snowflake",
    "driver_name": "snowflake-connector-python",
    "driver_version": "3.0.4",
    "configuration": {
        "account": "xy12345",
        "warehouse": "COMPUTE_WH",
        "warehouse_size": "LARGE",
        "database": "BENCHBOX",
        "schema": "TPCH_SF1"
    }
}
Validation Results¶
Correctness verification details.
{
    "validation_status": str,  # "PASSED", "FAILED", "SKIPPED"
    "validation_mode": str,    # "none", "basic", "strict"
    "checks": {
        "row_count": {
            "status": str,       # "PASSED", "FAILED"
            "expected": dict,    # Expected row counts per query
            "actual": dict,      # Actual row counts
            "mismatches": list,  # Queries with wrong row counts
        },
        "result_checksum": {
            "status": str,
            "expected": dict,    # Expected checksums
            "actual": dict,      # Actual checksums
            "mismatches": list,
        },
        "data_integrity": {
            "status": str,
            "checks_performed": list,
            "failures": list,
        }
    }
}
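Downstream analysis should be gated on this block; a minimal sketch, assuming the results JSON is loaded into a dict named data and validation_details carries the structure above:

# Refuse to analyze timings from a run that failed validation
if data["validation_status"] == "FAILED":
    checks = data["validation_details"]["checks"]
    raise RuntimeError(f"Row-count mismatches: {checks['row_count']['mismatches']}")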
Serialization Formats¶
JSON (Primary Format)¶
Default output format for all benchmark results.
Features:
Human-readable
Standard library support (Python json module)
Compatible with analysis tools (jq, pandas, etc.)
Example:
# Pretty-print results
jq '.query_results[] | {query_id, execution_time}' results.json

# Load query results into pandas
import json
import pandas as pd

with open("results.json") as f:
    data = json.load(f)
df = pd.json_normalize(data["query_results"])
CSV (Export Format)¶
Use case: Simplified analysis in spreadsheets
# Export query results to CSV
benchbox export results.json --format csv --output-dir ./
CSV Columns:
query_id,execution_time,status,row_count
q1,2.134,SUCCESS,4
q2,3.421,SUCCESS,100
...
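The exported CSV loads directly into pandas for quick summaries (the filename below is illustrative; the actual name depends on the export command):

import pandas as pd

df = pd.read_csv("query_results.csv")  # hypothetical export filename
print(df.groupby("status")["execution_time"].describe())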
Parquet (Archival Format)¶
Use case: Long-term storage, data lakes
# Convert results to Parquet using pandas
import json

import pandas as pd

# Load JSON results
with open("results.json") as f:
    data = json.load(f)

# Extract query results using the schema fields documented above
query_data = [
    {
        "query_id": q["query_id"],
        "execution_time": q["execution_time"],  # seconds
        "status": q["status"],
    }
    for q in data["query_results"]
]

# Convert to DataFrame and save as Parquet
df = pd.DataFrame(query_data)
df.to_parquet("results.parquet", compression="snappy")
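The file round-trips cleanly, so later analysis can start from the compressed copy:

df = pd.read_parquet("results.parquet")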
Benefits:
Compressed (smaller file size)
Column-oriented (fast analytical queries)
Schema evolution support
Working with Results¶
Python API¶
from benchbox.core.results.models import BenchmarkResults

# Load results from JSON
results = BenchmarkResults.from_json_file("results.json")

# Access fields
print(f"Benchmark: {results.benchmark_name}")
print(f"Duration: {results.duration_seconds:.2f}s")
print(f"Success rate: {results.successful_queries}/{results.total_queries}")

# Iterate through query results
for qr in results.query_results:
    if qr.status == "SUCCESS":
        print(f"{qr.query_id}: {qr.execution_time:.3f}s")

# Save to file
results.to_json_file("results_copy.json")
Command-Line Tools¶
# Query results with jq
jq '.query_results[] | select(.execution_time > 5)' results.json
# Extract timing summary
jq '{benchmark: .benchmark_name, total_time: .total_execution_time, avg_time: .average_query_time}' results.json
# Compare two results
benchbox compare baseline.json current.json
Analysis Examples¶
Calculate Geometric Mean¶
import math

query_times = [
    qr.execution_time
    for qr in results.query_results
    if qr.status == "SUCCESS"
]
geomean = math.prod(query_times) ** (1.0 / len(query_times))
print(f"Geometric mean: {geomean:.3f}s")
Detect Regressions¶
def compare_results(baseline, current, threshold=1.1):
    """Flag queries with >10% regression."""
    baseline_times = {qr.query_id: qr.execution_time
                      for qr in baseline.query_results}
    current_times = {qr.query_id: qr.execution_time
                     for qr in current.query_results}
    regressions = []
    for qid in baseline_times:
        if qid in current_times:
            ratio = current_times[qid] / baseline_times[qid]
            if ratio > threshold:
                regressions.append((qid, ratio))
    return regressions
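Combined with the Python API above, a typical usage looks like:

baseline = BenchmarkResults.from_json_file("baseline.json")
current = BenchmarkResults.from_json_file("current.json")
for qid, ratio in compare_results(baseline, current):
    print(f"{qid}: {ratio:.2f}x slower than baseline")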
Export to DataFrame¶
import pandas as pd

# Convert to DataFrame
df = pd.DataFrame([
    {
        "query_id": qr.query_id,
        "execution_time": qr.execution_time,
        "status": qr.status,
        "row_count": qr.row_count,
    }
    for qr in results.query_results
])

# Analyze
print(df.describe())
print(df.groupby("status").count())
Schema Evolution¶
BenchBox maintains backward compatibility through schema versioning.
Current Version: v1 (result_schema_v1)
Compatibility Promise:
New fields may be added (with defaults)
Existing fields won’t be removed or renamed
Type changes are breaking (trigger major version)
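Consumers that parse raw JSON can guard against breaking changes with a version check; a sketch assuming a hypothetical schema_version field (the exact field name is not part of the schema listing above):

SUPPORTED = "result_schema_v1"
version = data.get("schema_version", SUPPORTED)  # hypothetical field name
if version != SUPPORTED:
    raise ValueError(f"Unsupported result schema: {version}")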
Migration: If schema changes occur, migration guides will be provided.
See: Migration Guides
Best Practices¶
Result Storage¶
Version Control: Store result JSON files in git for history tracking
Naming Convention: Use descriptive names, e.g. {benchmark}_{sf}_{platform}_{timestamp}.json (see the sketch after this list)
Archival: Compress old results with gzip or convert to Parquet
Metadata: Include git commit SHA in execution_id for traceability
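A sketch of the naming convention above (values are illustrative):

from datetime import datetime, timezone

ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
filename = f"tpch_sf1.0_duckdb_{ts}.json"  # {benchmark}_{sf}_{platform}_{timestamp}.json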
Result Analysis¶
Geometric Mean: Use for TPC benchmark reporting (not arithmetic mean)
Outliers: Investigate queries with >3x variance from baseline
Validation: Always check validation_status before trusting results
Context: Compare results with matching system_profile values
Performance Monitoring¶
Baseline: Establish baseline results with known-good configuration
Trending: Track results over time to detect gradual degradation
Alerting: Set thresholds for acceptable regression (e.g., 10%)
Root Cause: Cross-reference system_profile for environmental changes