AI/ML Primitives Benchmark¶
The AI/ML Primitives benchmark tests SQL-based AI functions across cloud data platforms. It enables consistent evaluation of AI capabilities in Snowflake Cortex, BigQuery ML, and Databricks AI Functions using TPC-H data.
Overview¶
Property |
Value |
|---|---|
Total Queries |
16 |
Categories |
4 (generative, nlp, transform, embedding) |
Data Source |
TPC-H (shared data) |
Supported Platforms |
Snowflake, BigQuery, Databricks |
Cost Control |
Built-in budget enforcement |
Why AI Primitives?¶
Modern cloud data platforms increasingly offer built-in AI/ML capabilities through SQL functions:
Snowflake Cortex:
COMPLETE,SUMMARIZE,SENTIMENT,EXTRACT_ANSWER,EMBED_TEXTBigQuery ML:
ML.GENERATE_TEXT,ML.UNDERSTAND_TEXT,ML.GENERATE_EMBEDDINGDatabricks AI:
ai_query,ai_summarize,ai_classify,ai_generate
The AI Primitives benchmark provides:
Standardized Testing: Same queries across all platforms
Cost Estimation: Pre-execution cost estimates with budget enforcement
Performance Comparison: Consistent metrics for platform evaluation
Production Data: Uses TPC-H customer/supplier comments as realistic text
Query Categories¶
Generative (4 queries)¶
Text generation and completion tasks:
Query ID |
Description |
|---|---|
|
Basic text completion with short prompt |
|
Generate customer profile summaries from comments |
|
Answer questions about customer data |
|
Generate SQL from natural language |
NLP (5 queries)¶
Natural language processing and understanding:
Query ID |
Description |
|---|---|
|
Analyze sentiment of customer comments |
|
Classify customer feedback into categories |
|
Extract entities (products, locations, etc.) |
|
Detect language of supplier comments |
|
Extract key terms from order comments |
Transform (4 queries)¶
Text transformation tasks:
Query ID |
Description |
|---|---|
|
Summarize long customer comments |
|
Translate supplier comments |
|
Correct grammar in customer feedback |
|
Rewrite comments in professional tone |
Embedding (3 queries)¶
Vector embedding generation:
Query ID |
Description |
|---|---|
|
Generate embeddings for customer comments |
|
Generate embeddings for supplier descriptions |
|
Find similar items using vector similarity |
Quick Start¶
CLI Usage¶
# Run AI Primitives benchmark on Snowflake
benchbox run --platform snowflake --benchmark ai_primitives --scale 0.01
# Dry run to estimate costs first
benchbox run --platform snowflake --benchmark ai_primitives --dry-run ./preview
# Set cost budget (fails if exceeded)
benchbox run --platform snowflake --benchmark ai_primitives \
--benchmark-option max_cost_usd=1.00
# Run specific categories only
benchbox run --platform databricks --benchmark ai_primitives \
--benchmark-option categories=nlp,transform
Programmatic Usage¶
from benchbox.core.ai_primitives import AIPrimitivesBenchmark
# Initialize with cost budget
benchmark = AIPrimitivesBenchmark(
scale_factor=0.01,
max_cost_usd=1.00,
dry_run=False
)
# Estimate costs before execution
total_cost, estimates = benchmark.estimate_cost(
platform="snowflake",
categories=["nlp", "transform"]
)
print(f"Estimated cost: ${total_cost:.4f}")
# Run benchmark
result = benchmark.run_benchmark(
connection=snowflake_connection,
platform="snowflake",
categories=["nlp"]
)
print(f"Successful: {result.successful_queries}/{result.total_queries}")
print(f"Total cost: ${result.total_cost_estimated_usd:.4f}")
Platform Support¶
Snowflake (Cortex)¶
Snowflake Cortex provides LLM functions directly in SQL:
-- Text completion
SELECT SNOWFLAKE.CORTEX.COMPLETE('llama3-8b', 'Summarize: ' || comment) AS summary
FROM customer LIMIT 10;
-- Sentiment analysis
SELECT SNOWFLAKE.CORTEX.SENTIMENT(c_comment) AS sentiment
FROM customer WHERE c_nationkey = 1;
-- Embedding generation
SELECT SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', c_comment) AS embedding
FROM customer LIMIT 10;
Supported Models: llama3-8b, llama3-70b, mistral-large, mixtral-8x7b
BigQuery (ML Functions)¶
BigQuery ML provides AI functions through external model connections:
-- Text generation
SELECT ML.GENERATE_TEXT(
MODEL `project.dataset.llm_model`,
(SELECT 'Summarize: ' || comment AS prompt),
STRUCT(100 AS max_output_tokens)
).ml_generate_text_result AS summary
FROM `project.dataset.customer` LIMIT 10;
-- Embedding generation
SELECT ML.GENERATE_EMBEDDING(
MODEL `project.dataset.embedding_model`,
(SELECT c_comment AS content)
).ml_generate_embedding_result AS embedding
FROM `project.dataset.customer` LIMIT 10;
Setup Required: Create external model connection to Vertex AI or other LLM provider.
Databricks (AI Functions)¶
Databricks provides AI functions through Foundation Model APIs:
-- Text completion
SELECT ai_query(
'databricks-meta-llama-3-1-70b-instruct',
CONCAT('Summarize this customer: ', c_comment)
) AS summary
FROM customer LIMIT 10;
-- Classification
SELECT ai_classify(c_comment, ARRAY('positive', 'negative', 'neutral')) AS sentiment
FROM customer LIMIT 10;
Supported Models: Foundation Model APIs including Llama, DBRX, and external endpoints.
Cost Management¶
AI queries incur costs based on token usage. The benchmark provides built-in cost controls.
Cost Estimation¶
from benchbox.core.ai_primitives import AIPrimitivesBenchmark
benchmark = AIPrimitivesBenchmark(scale_factor=0.01)
# Get detailed cost estimates
total_cost, estimates = benchmark.estimate_cost(platform="snowflake")
for estimate in estimates:
print(f"{estimate.query_id}: ${estimate.estimated_cost_usd:.4f}")
print(f" Tokens: {estimate.estimated_tokens}")
print(f" Model: {estimate.model}")
Budget Enforcement¶
# Set maximum cost (execution fails if exceeded)
benchbox run --platform snowflake --benchmark ai_primitives \
--benchmark-option max_cost_usd=1.00
# Dry run mode (estimates only, no AI calls)
benchbox run --platform snowflake --benchmark ai_primitives \
--dry-run ./cost_preview
Typical Costs (Estimated)¶
Category |
Queries |
Est. Cost (SF=0.01) |
|---|---|---|
Generative |
4 |
~$0.05-0.15 |
NLP |
5 |
~$0.02-0.08 |
Transform |
4 |
~$0.03-0.10 |
Embedding |
3 |
~$0.01-0.05 |
Total |
16 |
~$0.10-0.40 |
Costs vary by platform, model, and actual token usage.
Configuration Options¶
Option |
Type |
Default |
Description |
|---|---|---|---|
|
float |
0.01 |
TPC-H scale factor (affects data volume) |
|
float |
0.0 |
Maximum allowed cost (0 = unlimited) |
|
bool |
False |
Estimate costs without executing |
|
list |
all |
Filter by category |
|
list |
all |
Run specific query IDs |
Result Structure¶
@dataclass
class AIBenchmarkResult:
benchmark: str = "AI Primitives"
platform: str = ""
scale_factor: float = 1.0
dry_run: bool = False
total_queries: int = 0
successful_queries: int = 0
failed_queries: int = 0
skipped_queries: int = 0
total_execution_time_ms: float = 0.0
total_cost_estimated_usd: float = 0.0
cost_tracker: CostTracker | None = None
query_results: list[AIQueryResult] = field(default_factory=list)
Unsupported Platforms¶
The following platforms do not have built-in AI functions and will skip AI queries:
DuckDB
ClickHouse
SQLite
PostgreSQL
DataFusion
Trino/Presto
Redshift
Azure Synapse
Microsoft Fabric
When running on these platforms, queries are skipped with informative messages.
Best Practices¶
1. Start with Dry Run¶
Always estimate costs before running AI queries:
benchbox run --platform snowflake --benchmark ai_primitives \
--dry-run ./preview --scale 0.01
2. Use Small Scale Factors¶
AI functions process text row-by-row. Start with --scale 0.01:
benchbox run --platform databricks --benchmark ai_primitives --scale 0.01
3. Set Cost Budgets¶
Prevent runaway costs with budget limits:
benchmark = AIPrimitivesBenchmark(
scale_factor=0.01,
max_cost_usd=1.00 # Hard limit
)
4. Test by Category¶
Run specific categories to focus testing:
# Test NLP only
benchbox run --platform snowflake --benchmark ai_primitives \
--benchmark-option categories=nlp
5. Monitor Actual Costs¶
Compare estimates vs actuals in results:
result = benchmark.run_benchmark(connection, "snowflake")
print(f"Estimated: ${result.total_cost_estimated_usd:.4f}")
print(f"Tracker: {result.cost_tracker.get_summary()}")
Extending the Benchmark¶
Adding New Queries¶
Add queries to benchbox/core/ai_primitives/catalog/queries.yaml:
- id: my_new_query
category: nlp
description: My new AI query
model: llama3-8b
estimated_tokens: 100
batch_size: 10
cost_per_1k_tokens: 0.0003
sql: |
SELECT 'placeholder' as result
variants:
snowflake: |
SELECT SNOWFLAKE.CORTEX.SENTIMENT(c_comment) FROM customer LIMIT 10
bigquery: |
-- BigQuery variant
databricks: |
-- Databricks variant
skip_on:
- duckdb
- clickhouse
Adding New Platforms¶
Add platform to
SUPPORTED_PLATFORMSinbenchmark.pyAdd query variants in
queries.yamlImplement cost estimation for the platform’s pricing model