Platform Selection Guide¶

Tags beginner guide sql-platform

This guide helps you choose the optimal database platform for your BenchBox benchmarking needs based on your specific requirements, constraints, and objectives. Start by running uv run benchbox check-deps --matrix to see which optional connectors are installed locally, then use benchbox platforms install <name> and benchbox platforms enable <name> to activate the adapters that match your shortlist.

Curious about engines that are being evaluated next? Check the potential future platforms reference for research notes and roadmap context.

Quick Start: Choose Your Path¶

I want to start immediately with minimal setup¶

Included by default: DuckDB

No additional dependencies or configuration required
In-process analytical database with columnar storage
Suitable for learning and experimentation

I need the most cost-effective solution¶

Open-source options: DuckDB (free) or ClickHouse (self-hosted)

Open source with no licensing costs
Run on your own hardware
Minimal operational overhead

I need maximum performance¶

High-performance options: ClickHouse or Snowflake with large warehouses

Column-oriented architecture optimized for analytical workloads
Scalable compute resources
Advanced query optimization features

I want fully managed cloud service¶

Cloud platform options: BigQuery, Snowflake, or Databricks

No infrastructure management required
Auto-scaling capabilities
Enterprise support and SLAs available

I need enterprise features and governance¶

Enterprise platform options: Snowflake or Databricks

Advanced security and compliance features
Role-based access control
Audit logging and governance capabilities

I want to benchmark DataFrame libraries (not SQL)¶

DataFrame platforms: Polars-df, Pandas-df, PySpark-df, DataFusion-df

Native DataFrame API benchmarking (no SQL)
Compare Polars vs Pandas vs PySpark on identical workloads
Measure performance of data science workflows
See DataFrame Platforms Guide for details

# Quick start with DataFrame platforms
benchbox run --platform polars-df --benchmark tpch --scale 0.1   # Expression API
benchbox run --platform pandas-df --benchmark tpch --scale 0.1   # Pandas API

Platform Extras and Installation Matrix¶

BenchBox uses optional dependency extras to keep the core installation lightweight. Install just the connectors you need, or combine extras for broader coverage. Always wrap the extras specification in quotes when using shells like zsh.

Extra	What it Enables	Install Command
`(none)`	DuckDB + SQLite (local development)	`uv add benchbox`
`[cloudstorage]`	Cloudpathlib helpers for S3, GCS, Azure paths	`uv add benchbox --extra cloudstorage`
`[cloud]`	Databricks, BigQuery, Redshift, Snowflake connectors	`uv add benchbox --extra cloud`
`[databricks]`	Databricks SQL Warehouses + Unity Catalog tooling	`uv add benchbox --extra databricks`
`[bigquery]`	BigQuery client libraries + Google Cloud Storage support	`uv add benchbox --extra bigquery`
`[redshift]`	Redshift connector, boto3, and cloud storage helpers	`uv add benchbox --extra redshift`
`[snowflake]`	Snowflake connector + cloud storage helpers	`uv add benchbox --extra snowflake`
`[trino]`	Trino/Starburst distributed SQL connector	`uv add benchbox --extra trino`
`[presto]`	PrestoDB (Meta) distributed SQL connector	`uv add benchbox --extra presto`
`[clickhouse]`	ClickHouse native driver	`uv add benchbox --extra clickhouse`
`[all]`	Every connector and helper listed above	`uv add benchbox --extra all`

Need to double-check your environment? Run benchbox check-deps --matrix for the full installation matrix or benchbox check-deps --platform <name> for targeted guidance.

Presto Lineage: Trino vs PrestoDB vs Athena¶

Trino: Recommended default for new distributed deployments and Starburst; uses trino Python client and X-Trino-* headers.
PrestoDB: Choose for legacy Presto clusters or Meta ecosystem compatibility; uses presto-python-client with X-Presto-* headers.
Athena: AWS managed service built on the Presto/Trino lineage; choose when you want serverless execution without running your own coordinator.

SQL vs DataFrame: When to Choose Each¶

Before diving into platform specifics, consider whether SQL or DataFrame APIs better match your benchmarking needs.

When to Choose SQL Platforms¶

Choose SQL platforms (DuckDB, BigQuery, Snowflake, etc.) when:

You’re benchmarking database systems - Comparing cloud data warehouses or analytical databases
Your production workloads use SQL - Testing the same queries your applications execute
You need distributed execution - Cloud platforms automatically scale compute
Your team thinks in SQL - Familiar with SQL query patterns and optimization

# SQL platforms execute queries as SQL strings
benchbox run --platform duckdb --benchmark tpch --scale 1
benchbox run --platform snowflake --benchmark tpch --scale 100

When to Choose DataFrame Platforms¶

Choose DataFrame platforms (Polars-df, Pandas-df, PySpark-df, etc.) when:

You’re benchmarking DataFrame libraries - Comparing Polars vs Pandas vs PySpark
Your production code uses DataFrame APIs - Data science workflows, ML pipelines
You want to compare paradigms - SQL vs native DataFrame performance
You need lazy evaluation control - Fine-grained control over execution plans

# DataFrame platforms execute using native APIs
benchbox run --platform polars-df --benchmark tpch --scale 1
benchbox run --platform pandas-df --benchmark tpch --scale 1

Side-by-Side Comparison¶

Aspect	SQL Platforms	DataFrame Platforms
Query Format	SQL strings	Native API calls (method chaining)
Execution	Database engine	Library (in-process or distributed)
Scale Limit	Petabyte+ (cloud)	Varies: SF 10-100 (single-node), SF 1000+ (PySpark)
Use Case	Production analytics, data warehousing	Data science, ML pipelines, library evaluation
Debugging	EXPLAIN plans	Step-through execution, lazy plan inspection
Type Safety	Runtime (SQL parsing)	Some compile-time (IDE support)

DataFrame Platform Quick Reference¶

Platform	CLI Name	Best For	Max Scale Factor
Polars	`polars-df`	High-performance single-node analytics	SF 100 (lazy eval)
Pandas	`pandas-df`	Data science prototyping, compatibility	SF 10 (memory-bound)
PySpark	`pyspark-df`	Distributed processing, big data	SF 1000+ (cluster)
DataFusion	`datafusion-df`	Arrow-native workflows, Rust performance	SF 50+

For detailed DataFrame platform documentation, see the DataFrame Platforms Guide.

Detailed Selection Criteria¶

1. Data Scale Requirements¶

Small Scale (< 1GB)¶

Platform Options: DuckDB, DataFusion, SQLite

# Commonly used for development and testing
from benchbox import TPCH
from benchbox.platforms.duckdb import DuckDBAdapter

benchmark = TPCH(scale_factor=0.1)  # ~100MB
adapter = DuckDBAdapter()  # In-memory processing
results = adapter.run_benchmark(benchmark)

# Or use DataFusion for PyArrow-based workflows
from benchbox.platforms.datafusion import DataFusionAdapter
adapter = DataFusionAdapter(
    memory_limit="4G",
    data_format="parquet"
)
results = adapter.run_benchmark(benchmark)

Characteristics:

No additional setup or configuration required
Embedded databases suitable for small datasets
Free and lightweight
Commonly used in CI/CD pipelines
DataFusion offers PyArrow integration for Arrow-based workflows

Medium Scale (1GB - 100GB)¶

Platform Options: DuckDB, DataFusion, ClickHouse, BigQuery

# DuckDB with persistent storage
adapter = DuckDBAdapter(
    database_path="analytics.duckdb",
    memory_limit="16GB"
)

# Or ClickHouse for distributed processing
from benchbox.platforms.clickhouse import ClickHouseAdapter
adapter = ClickHouseAdapter(
    host="clickhouse-cluster",
    max_memory_usage="32GB",
    max_threads=16
)

Characteristics:

Balance of performance and cost
Support for complex analytical queries
Moderate resource requirements

Large Scale (100GB+)¶

Platform Options: BigQuery, Snowflake, Databricks, Redshift

# Serverless architecture with BigQuery
from benchbox.platforms.bigquery import BigQueryAdapter
adapter = BigQueryAdapter(
    project_id="analytics-project",
    dataset_id="large_scale_benchmarks",
    location="US",
    job_priority="INTERACTIVE"
)

Characteristics:

Support for petabyte-scale analytics
Auto-scaling infrastructure
Managed cloud data warehouse platforms

2. Budget and Cost Considerations¶

Free/Open Source Solutions¶

Platform	Cost Structure	Ongoing Costs
DuckDB	Completely free	Hardware only
DataFusion	Completely free	Hardware only
SQLite	Completely free	Hardware only
ClickHouse	Open source	Infrastructure costs

Example Setup:

# Completely free solution with DuckDB
adapter = DuckDBAdapter(
    database_path="cost_effective.duckdb",
    memory_limit="8GB"  # Use available RAM efficiently
)

# Or DataFusion for in-memory analytics
from benchbox.platforms.datafusion import DataFusionAdapter
adapter = DataFusionAdapter(
    memory_limit="8G",
    data_format="parquet"  # Better performance
)

Pay-per-Use Cloud Solutions¶

Platform	Pricing Model	Cost Control
BigQuery	$5/TB queried	Query limits, slots
Databricks	DBU consumption	Auto-termination

Cost-Optimized Configuration:

# BigQuery with cost controls
adapter = BigQueryAdapter(
    maximum_bytes_billed=1000000000,  # 1GB limit per query
    job_priority="BATCH",  # Lower cost
    query_cache=True  # Reuse results
)

Provisioned Cloud Solutions¶

Platform	Pricing Model	Cost Optimization
Snowflake	Warehouse time	Auto-suspend/resume
Redshift	Cluster hours	Reserved instances

Cost-Optimized Configuration:

# Snowflake with aggressive auto-suspend
adapter = SnowflakeAdapter(
    warehouse_size="MEDIUM",
    auto_suspend=60,  # 1 minute
    auto_resume=True
)

3. Performance Requirements¶

Latency-Sensitive Applications¶

Platform Options: DuckDB, ClickHouse

# DuckDB: In-process query execution
adapter = DuckDBAdapter(
    memory_limit="32GB",  # Keep data in memory
    thread_limit=None     # Use all CPU cores
)

# ClickHouse: Column-oriented analytical database
adapter = ClickHouseAdapter(
    max_threads=32,
    max_memory_usage="64GB",
    compression=True  # Reduce network I/O
)

Characteristics:

In-process or local execution reduces network overhead
Column-oriented storage optimized for analytical queries
Query latency depends on data volume, query complexity, and hardware resources

Throughput-Optimized Workloads¶

Platform Options: BigQuery, Snowflake, Databricks

# BigQuery: Serverless auto-scaling
adapter = BigQueryAdapter(
    job_priority="INTERACTIVE",
    query_cache=False,  # Always compute fresh results
    location="US"       # Multi-region deployment
)

# Snowflake: Multi-cluster scaling
adapter = SnowflakeAdapter(
    warehouse_size="X-LARGE",
    multi_cluster_warehouse=True,
    auto_suspend=600  # 10 minutes for sustained workloads
)

Performance Characteristics:

Concurrent user support: Hundreds to thousands of users
Auto-scaling: Automatic resource adjustment available
Query queuing: Workload management capabilities

4. Operational Requirements¶

Minimal Operations (Fully Managed)¶

Platform Options: BigQuery, Snowflake

# Fully managed serverless architecture
adapter = BigQueryAdapter(
    # No infrastructure to manage
    # Automatic optimization
    # Built-in monitoring
)

Characteristics:

No database administration required
Automatic updates and maintenance
Built-in monitoring and alerting
High uptime SLAs (typically 99.9%+)

Some Operations Acceptable¶

Platform Options: Databricks, Redshift, ClickHouse (managed)

# Managed service with configuration options
adapter = DatabricksAdapter(
    cluster_size="large",
    auto_terminate_minutes=30,  # Cost control
    runtime_engine="PHOTON"     # Vectorized query engine
)

Characteristics:

Managed infrastructure with user control
Configurable performance settings
Access to advanced features
Professional support available

Full Control Required¶

Platform Options: ClickHouse (self-hosted), DuckDB

# Self-managed deployment
adapter = ClickHouseAdapter(
    host="your-optimized-cluster.company.com",
    # Custom cluster configuration
    # User-managed monitoring and alerting
    # Complete data control
)

Characteristics:

Complete infrastructure control
Custom optimization opportunities
Data locality and privacy
No vendor dependencies

5. Team and Organizational Factors¶

Individual Developer/Researcher¶

Platform Options: DuckDB

# No additional setup or configuration required
adapter = DuckDBAdapter()
benchmark = TPCH(scale_factor=1)
results = adapter.run_benchmark(benchmark)
# Included by default with BenchBox

Characteristics:

No database administration knowledge required
Comprehensive documentation and community
Reproducible results across environments
Focus on analysis rather than infrastructure

Small Team (2-10 people)¶

Platform Options: DuckDB, BigQuery, or ClickHouse

# Shared cloud project access
adapter = BigQueryAdapter(
    project_id="team-analytics-project",
    dataset_id="shared_benchmarks",
    # Shared Google Cloud project
    # Collaboration features available
    # Cost visibility and controls
)

Characteristics:

Shared access and collaboration capabilities
Cost transparency and controls
Minimal administrative overhead
Scalable as team grows

Medium Team (10-50 people)¶

Platform Options: Snowflake, Databricks

# Enterprise features for growing teams
adapter = SnowflakeAdapter(
    # Role-based access control
    # Multiple warehouses for different workloads
    # Query result sharing
    # Usage monitoring and governance
)

Characteristics:

Advanced security and governance
Workload isolation capabilities
Resource management features
Enterprise support available

Large Enterprise (50+ people)¶

Platform Options: Snowflake, Databricks, Redshift

# Enterprise deployment
adapter = DatabricksAdapter(
    # Unity Catalog for governance
    # Multiple workspaces
    # Advanced security features
    # Enterprise support and SLAs
)

Characteristics:

Enterprise security and compliance features
Advanced governance capabilities
Professional services and support
Integration with enterprise tools

6. Technical Ecosystem¶

Python-Centric Environment¶

All platforms supported. DuckDB and Databricks offer native Python integration:

# DuckDB: Native Python integration
import duckdb
import pandas as pd

# Direct DataFrame integration
df = pd.read_csv("data.csv")
result = duckdb.query("SELECT * FROM df WHERE value > 100").to_df()

# Databricks: Spark/Python integration
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("BenchBox").getOrCreate()

SQL-Heavy Teams¶

Platform Options: BigQuery, Snowflake, Redshift

# ANSI SQL support with platform-specific extensions
adapter = BigQueryAdapter()
# ANSI SQL dialect support
# Platform-managed optimization
# Standard SQL interface

Multi-Language Requirements¶

Platform Options: ClickHouse, Databricks

# ClickHouse: Multiple client libraries available
# - Python, Java, Go, C++, JavaScript
# - HTTP API for any language
# - JDBC/ODBC drivers

# Databricks: Spark ecosystem language support
# - Scala, Python, Java, R, SQL
# - REST APIs
# - Notebook interface

7. Cloud Strategy¶

Google Cloud Platform¶

Native platform: BigQuery Alternative options: ClickHouse (GKE), DuckDB

# GCP-native integration
adapter = BigQueryAdapter(
    project_id="gcp-project",
    location="US",  # Or EU for data residency
    # Integration with:
    # - Cloud Storage
    # - Cloud Functions
    # - Dataflow
    # - AI/ML services
)

Amazon Web Services¶

Native platform: Redshift Alternative options: DuckDB, ClickHouse (EKS)

# AWS-native integration
adapter = RedshiftAdapter(
    # Integration with:
    # - S3 for data loading
    # - IAM for security
    # - CloudWatch for monitoring
    # - Lambda for automation
)

Microsoft Azure¶

Multi-cloud platforms available on Azure: Snowflake, Databricks Azure-native platforms (planned): Microsoft Fabric, Azure Synapse Analytics Alternative options: DuckDB

Current support:

# Multi-cloud platforms available on Azure
adapter = SnowflakeAdapter(
    # Available on Azure infrastructure
    # Integration with Azure services
    # Enterprise features
)

adapter = DatabricksAdapter(
    # Available on Azure (all three clouds: AWS, Azure, GCP)
    # Unity Catalog governance
    # Lakehouse architecture
)

Planned Azure-native platforms:

Microsoft Azure’s native analytics platforms are under evaluation for future BenchBox releases. See the Azure Platforms page for details on:

Microsoft Fabric - Unified SaaS data platform (OneLake, integrated workloads)
Azure Synapse Analytics - Enterprise data warehouse (dedicated/serverless SQL pools)
Azure Data Explorer - Real-time analytics engine (Kusto/KQL)

Azure users can currently benchmark using Databricks or Snowflake, both fully supported on Azure infrastructure.

Multi-Cloud or Cloud-Agnostic¶

Multi-cloud platforms: Snowflake, Databricks Cloud-agnostic options: DuckDB, ClickHouse

# Multi-cloud deployment
adapter = SnowflakeAdapter(
    # Available on AWS, Azure, GCP
    # Consistent interface across clouds
    # Reduced cloud vendor lock-in
)

On-Premises or Hybrid¶

Self-hosted options: ClickHouse, DuckDB Hybrid option: Databricks (private cloud)

# On-premises deployment
adapter = ClickHouseAdapter(
    host="on-prem-cluster.company.com",
    # Complete data control
    # No cloud dependencies
    # Custom security policies
)

Selection Criteria Summary¶

By Primary Goal¶

Learning/Research
├── Small data (< 1GB): DuckDB (included by default)
└── Large data (> 1GB): DuckDB or BigQuery (has free tier)

Development/Testing
├── Local development: DuckDB (no dependencies)
├── CI/CD pipeline: SQLite or DuckDB (embedded)
└── Integration testing: DuckDB (fast setup)

Production Analytics
├── Open-source: ClickHouse (self-hosted)
└── Cloud-native: BigQuery, Snowflake, or Databricks (managed)

Performance Benchmarking
├── Cross-platform comparison: Multiple platforms
└── Single-platform optimization: Platform-specific testing

By Key Decision Factors¶

1. Data scale considerations:
   ├── < 1GB → DuckDB or SQLite (embedded databases)
   ├── 1-100GB → DuckDB, ClickHouse, or cloud platforms
   └── > 100GB → Cloud platforms (BigQuery, Snowflake, Databricks, Redshift)

2. Budget model:
   ├── No cost → DuckDB, SQLite, ClickHouse (self-hosted)
   ├── Pay-per-use → BigQuery, Databricks (usage-based)
   └── Predictable costs → Snowflake, Redshift (provisioned)

3. Team size:
   ├── Individual → DuckDB (minimal setup)
   ├── Small team → DuckDB, BigQuery (shared access)
   ├── Medium team → Snowflake, Databricks (governance features)
   └── Large enterprise → Snowflake, Databricks, Redshift (enterprise features)

4. Cloud environment:
   ├── Google Cloud → BigQuery (GCP-native)
   ├── AWS → Redshift (AWS-native)
   ├── Azure → Snowflake, Databricks (available on Azure)
   ├── Multi-cloud → Snowflake, Databricks (multi-cloud support)
   └── On-premises → ClickHouse, DuckDB (self-hosted)

5. Architecture characteristics:
   ├── In-process columnar → DuckDB, ClickHouse (no network overhead)
   ├── Serverless/auto-scaling → BigQuery, Snowflake (elastic compute)
   └── Managed clusters → Various cloud platforms available

Platform Combination Strategies¶

Development to Production Pipeline¶

# Development: Fast, local, free
dev_adapter = DuckDBAdapter()

# Testing: Automated, reproducible
test_adapter = DuckDBAdapter(database_path="test.duckdb")

# Production: Scalable, managed
prod_adapter = SnowflakeAdapter(
    warehouse_size="LARGE",
    database="PRODUCTION"
)

# Use same benchmark code across all environments
benchmark = TPCH(scale_factor=0.1)  # Small for dev/test
prod_benchmark = TPCH(scale_factor=100)  # Large for prod

Multi-Platform Validation¶

# Validate results across platforms
platforms = [
    ("DuckDB", DuckDBAdapter()),
    ("ClickHouse", ClickHouseAdapter(host="test-cluster")),
    ("BigQuery", BigQueryAdapter(project_id="validation-project")),
]

benchmark = TPCH(scale_factor=1)
results = {}

for name, adapter in platforms:
    results[name] = adapter.run_benchmark(benchmark)
    print(f"{name}: {results[name].total_time:.2f}s")

# Compare query results for consistency

Cost-Performance Optimization¶

# Start with cost-effective option
adapter = DuckDBAdapter(memory_limit="16GB")
results = adapter.run_benchmark(benchmark)

if results.total_time > target_sla:
    # Upgrade to cloud platform
    adapter = SnowflakeAdapter(warehouse_size="LARGE")
    results = adapter.run_benchmark(benchmark)

# Monitor costs and performance over time

Migration Strategies¶

From SQLite to Production¶

# Phase 1: Validate on SQLite (testing only)
sqlite_adapter = SQLiteAdapter()
small_benchmark = ReadPrimitivesBenchmark(scale_factor=0.001)
sqlite_results = sqlite_adapter.run_benchmark(small_benchmark)

# Phase 2: Scale testing with DuckDB
duckdb_adapter = DuckDBAdapter()
medium_benchmark = TPCH(scale_factor=1)
duckdb_results = duckdb_adapter.run_benchmark(medium_benchmark)

# Phase 3: Production deployment
prod_adapter = BigQueryAdapter(project_id="production")
prod_benchmark = TPCH(scale_factor=100)
prod_results = prod_adapter.run_benchmark(prod_benchmark)

From DuckDB to Cloud¶

# Development on DuckDB
dev_config = {"scale_factor": 0.1}
dev_adapter = DuckDBAdapter()

# Migration validation
migration_config = {"scale_factor": 1}
cloud_adapter = BigQueryAdapter(
    project_id="migration-test",
    maximum_bytes_billed=1000000000  # Cost protection
)

# Production deployment
prod_config = {"scale_factor": 100}
prod_adapter = BigQueryAdapter(
    project_id="production",
    job_priority="INTERACTIVE"
)

Common Anti-Patterns¶

❌ Wrong Platform Choices¶

Using SQLite for large datasets

# DON'T: SQLite with TPC-H SF=10
adapter = SQLiteAdapter()
benchmark = TPCH(scale_factor=10)  # Will be extremely slow

Using expensive platforms for development

# DON'T: Snowflake X-Large for development
adapter = SnowflakeAdapter(warehouse_size="4X-LARGE")  # $$$
benchmark = TPCH(scale_factor=0.01)  # Tiny dataset

Ignoring cost controls

# DON'T: No cost limits on pay-per-query platforms
adapter = BigQueryAdapter()  # No maximum_bytes_billed
# Risk: Unexpected large bills

✅ Better Alternatives¶

Progressive scaling approach

# Start small, scale up as needed
if dataset_size < 1_000_000:
    adapter = DuckDBAdapter()
elif dataset_size < 100_000_000:
    adapter = ClickHouseAdapter()
else:
    adapter = BigQueryAdapter()

Cost-aware cloud usage

# Always set cost controls
adapter = BigQueryAdapter(
    maximum_bytes_billed=5_000_000_000,  # 5GB limit
    job_priority="BATCH"  # Lower cost
)

Environment-specific configurations

import os

if os.getenv("ENVIRONMENT") == "development":
    adapter = DuckDBAdapter()
elif os.getenv("ENVIRONMENT") == "testing":
    adapter = DuckDBAdapter(database_path="test.duckdb")
else:  # production
    adapter = SnowflakeAdapter(warehouse_size="LARGE")

Getting Started¶

First-Time Users¶

DuckDB is included by default

uv add benchbox

from benchbox import TPCH
from benchbox.platforms.duckdb import DuckDBAdapter

benchmark = TPCH(scale_factor=0.1)
adapter = DuckDBAdapter()
results = adapter.run_benchmark(benchmark)

Scale factor options
- 0.001 (1MB): Rapid iteration
- 0.01 (10MB): Quick testing
- 0.1 (100MB): Small dataset testing
- 1.0 (1GB): Medium-scale testing
Available benchmarks
- TPC-H: Industry-standard analytical benchmark
- ClickBench: Real-world analytical queries
- Primitives: Fundamental operation testing

Moving to Production¶

Evaluate cloud platforms with test datasets
Configure cost controls and monitoring
Validate with realistic data volumes
Implement security and governance policies
Plan for capacity and growth

Expert Tips¶

Performance Optimization¶

Match platform to workload characteristics
- OLAP-heavy: ClickHouse, BigQuery
- Mixed workloads: Snowflake, Databricks
- Development: DuckDB
Use appropriate scale factors for testing
- Development: 0.001-0.01
- Integration testing: 0.1-1
- Performance testing: 10-100+
Leverage platform-specific optimizations
- DuckDB: Memory sizing, threading
- ClickHouse: Partitioning, ordering
- BigQuery: Partitioning, clustering
- Snowflake: Warehouse sizing, clustering

Cost Management¶

Implement cost controls early
Monitor usage patterns and optimize
Use development/production separation
Consider reserved capacity for predictable workloads

Operational Excellence¶

Automate deployments and scaling
Implement proper monitoring and alerting
Plan for disaster recovery and backup
Document configuration and procedures

Summary¶

Platform selection depends on your specific requirements. Common platform choices by use case:

SQL Platforms¶

Initial setup: DuckDB (included by default, no dependencies)
No-cost options: DuckDB, SQLite, or ClickHouse (self-hosted)
In-process execution: DuckDB or ClickHouse (columnar architecture, no network overhead)
Enterprise teams: Snowflake or Databricks (governance features)
Cloud-native deployments: BigQuery, Snowflake, or Databricks (managed services)
AWS environments: Redshift (AWS-native integration)
Self-hosted deployments: ClickHouse (full control)

DataFrame Platforms¶

Single-node analytics: Polars-df (multi-threaded, lazy evaluation)
Data science prototyping: Pandas-df (familiar API, ecosystem compatibility)
Distributed big data: PySpark-df (cluster-scale processing)
Arrow-native workflows: DataFusion-df (Rust-based, Arrow integration)
GPU acceleration: cuDF-df (NVIDIA RAPIDS, coming soon)

DuckDB is included with BenchBox by default for SQL benchmarking. Polars is included for DataFrame benchmarking. Cloud platforms and additional DataFrame libraries can be added as requirements grow.

Platform Selection Guide¶

Quick Start: Choose Your Path¶

I want to start immediately with minimal setup¶

I need the most cost-effective solution¶

I need maximum performance¶

I want fully managed cloud service¶

I need enterprise features and governance¶

I want to benchmark DataFrame libraries (not SQL)¶

Platform Extras and Installation Matrix¶

Presto Lineage: Trino vs PrestoDB vs Athena¶

SQL vs DataFrame: When to Choose Each¶

When to Choose SQL Platforms¶

When to Choose DataFrame Platforms¶

Side-by-Side Comparison¶

DataFrame Platform Quick Reference¶

Detailed Selection Criteria¶

1. Data Scale Requirements¶

Small Scale (< 1GB)¶

Medium Scale (1GB - 100GB)¶

Large Scale (100GB+)¶

2. Budget and Cost Considerations¶

Free/Open Source Solutions¶

Pay-per-Use Cloud Solutions¶

Provisioned Cloud Solutions¶

3. Performance Requirements¶

Latency-Sensitive Applications¶

Throughput-Optimized Workloads¶

4. Operational Requirements¶

Minimal Operations (Fully Managed)¶

Some Operations Acceptable¶

Full Control Required¶

5. Team and Organizational Factors¶

Individual Developer/Researcher¶

Small Team (2-10 people)¶

Medium Team (10-50 people)¶

Large Enterprise (50+ people)¶

6. Technical Ecosystem¶

Python-Centric Environment¶

SQL-Heavy Teams¶

Multi-Language Requirements¶

7. Cloud Strategy¶

Google Cloud Platform¶

Amazon Web Services¶

Microsoft Azure¶

Multi-Cloud or Cloud-Agnostic¶

On-Premises or Hybrid¶

Selection Criteria Summary¶

By Primary Goal¶

By Key Decision Factors¶

Platform Combination Strategies¶

Development to Production Pipeline¶

Multi-Platform Validation¶

Cost-Performance Optimization¶

Migration Strategies¶

From SQLite to Production¶

From DuckDB to Cloud¶

Common Anti-Patterns¶

❌ Wrong Platform Choices¶

✅ Better Alternatives¶

Getting Started¶

First-Time Users¶

Moving to Production¶

Expert Tips¶

Performance Optimization¶

Cost Management¶

Operational Excellence¶

Summary¶

SQL Platforms¶

DataFrame Platforms¶

See Also¶

Platform Documentation¶

Getting Started¶

Understanding BenchBox¶

Benchmarks¶

Advanced Topics¶