Databricks Platform¶
Databricks provides a unified lakehouse platform that combines data lake storage with data warehouse capabilities. BenchBox supports both Databricks SQL Warehouses and classic clusters, with Delta Lake-specific optimizations.
Features¶
- **Delta Lake native** - ACID transactions and time travel
- **Unity Catalog** - Unified governance and security
- **Photon engine** - Vectorized query execution
- **Auto-scaling** - Dynamic cluster management
- **Multi-cloud** - AWS, Azure, and GCP support
Prerequisites¶
- Databricks workspace (AWS, Azure, or GCP)
- SQL Warehouse or All-Purpose Cluster
- Personal Access Token or OAuth credentials
- Unity Catalog (recommended) or Hive Metastore
Installation¶
# Install Databricks SQL connector
pip install databricks-sql-connector databricks-sdk
# Or via BenchBox extras
pip install "benchbox[databricks]"
Configuration¶
Environment Variables (Recommended)¶
export DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
export DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/abc123def456
export DATABRICKS_TOKEN=dapi1234567890abcdef
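These variables are enough to drive a raw connection for a smoke test. The sketch below uses databricks-sql-connector directly; note that server_hostname expects a bare hostname, without the https:// scheme that DATABRICKS_HOST carries:

import os
from databricks import sql

# DATABRICKS_HOST includes the scheme; the connector wants a bare hostname
hostname = os.environ["DATABRICKS_HOST"].removeprefix("https://")

with sql.connect(
    server_hostname=hostname,
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT current_catalog(), current_schema()")
        print(cursor.fetchone())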
Interactive Setup¶
benchbox platforms setup --platform databricks
CLI Options¶
benchbox run --platform databricks --benchmark tpch --scale 1.0 \
--platform-option server_hostname=your-workspace.cloud.databricks.com \
--platform-option http_path=/sql/1.0/warehouses/abc123def456 \
--platform-option access_token=dapi1234...
Platform Options¶
| Option | Default | Description |
|---|---|---|
| `server_hostname` | (env) | Workspace URL |
| `http_path` | (env) | SQL Warehouse or cluster path |
| `access_token` | (env) | Personal Access Token |
| `catalog` | (default) | Unity Catalog name |
| `schema` | (auto) | Schema name |
| `use_volumes` | `true` | Use UC Volumes for staging |
| `volume_path` | (auto) | Path within volume |
Authentication Methods¶
Personal Access Token¶
# Generate token: User Settings > Developer > Access Tokens
export DATABRICKS_TOKEN=dapi1234567890abcdef
benchbox run --platform databricks --benchmark tpch --scale 1.0
OAuth (M2M)¶
# Service principal authentication
export DATABRICKS_CLIENT_ID=your_client_id
export DATABRICKS_CLIENT_SECRET=your_client_secret
benchbox run --platform databricks --benchmark tpch \
--platform-option auth_type=oauth-m2m
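The same flow can be exercised outside BenchBox through the connector's credentials_provider hook; a minimal sketch following the pattern documented for databricks-sql-connector:

import os
from databricks import sql
from databricks.sdk.core import Config, oauth_service_principal

def credentials_provider():
    # Exchange the service principal secret for an OAuth token
    config = Config(
        host=os.environ["DATABRICKS_HOST"],
        client_id=os.environ["DATABRICKS_CLIENT_ID"],
        client_secret=os.environ["DATABRICKS_CLIENT_SECRET"],
    )
    return oauth_service_principal(config)

with sql.connect(
    server_hostname=os.environ["DATABRICKS_HOST"].removeprefix("https://"),
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    credentials_provider=credentials_provider,
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT 1")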
Azure AD (Azure Databricks)¶
# Azure Active Directory token
export ARM_CLIENT_ID=your_client_id
export ARM_CLIENT_SECRET=your_client_secret
export ARM_TENANT_ID=your_tenant_id
benchbox run --platform databricks --benchmark tpch \
--platform-option auth_type=azure-ad
Usage Examples¶
Basic Benchmark¶
# TPC-H on SQL Warehouse
benchbox run --platform databricks --benchmark tpch --scale 1.0
With Unity Catalog¶
# Specify catalog and schema
benchbox run --platform databricks --benchmark tpch --scale 10.0 \
--platform-option catalog=benchmarks \
--platform-option schema=tpch_sf10
With Tuning¶
# Apply Delta Lake optimizations
benchbox run --platform databricks --benchmark tpch --scale 10.0 \
--tuning tuned
Python API¶
from benchbox import TPCH
from benchbox.platforms.databricks import DatabricksAdapter

# Connect to a SQL Warehouse in the `benchmarks` catalog
adapter = DatabricksAdapter(
    server_hostname="your-workspace.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abc123def456",
    access_token="dapi1234567890abcdef",
    catalog="benchmarks",
)

# Generate data locally, load it into Delta tables, then run the queries
benchmark = TPCH(scale_factor=1.0)
benchmark.generate_data()
adapter.load_benchmark(benchmark)
results = adapter.run_benchmark(benchmark)
SQL Warehouse Sizing¶
| Size | DBU/Hour | Recommended Scale |
|---|---|---|
| 2X-Small | 2 | SF 0.01-0.1 |
| X-Small | 4 | SF 0.1-1.0 |
| Small | 8 | SF 1.0-10.0 |
| Medium | 16 | SF 10.0-100.0 |
| Large | 32 | SF 100.0+ |
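The DBU figures support a back-of-envelope cost estimate before committing to a run. In this sketch the run duration and per-DBU price are assumptions; actual rates vary by cloud, region, and warehouse tier:

# Hypothetical cost estimate for a Medium warehouse
dbu_per_hour = 16      # Medium, from the table above
run_hours = 2.0        # assumed benchmark duration
usd_per_dbu = 0.22     # assumed list price; check your contract
print(f"estimated cost: ~${dbu_per_hour * run_hours * usd_per_dbu:.2f}")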
Performance Features¶
Delta Lake Optimizations¶
BenchBox applies Delta optimizations with --tuning tuned:
-- Optimize file layout with Z-ordering
OPTIMIZE lineitem ZORDER BY (l_shipdate);
-- Or use liquid clustering (an alternative to ZORDER; do not combine both on one table)
ALTER TABLE lineitem CLUSTER BY (l_shipdate, l_orderkey);
-- Vacuum old versions (RETAIN 0 HOURS requires disabling the retention safety check)
VACUUM lineitem RETAIN 0 HOURS;
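The same statements can be issued by hand through the SQL connector, for example to re-tune between runs. A sketch with connection parameters elided; the 168-hour retention here is a deliberately safer choice than 0 hours:

from databricks import sql

statements = [
    "OPTIMIZE lineitem ZORDER BY (l_shipdate)",
    "VACUUM lineitem RETAIN 168 HOURS",  # safer than RETAIN 0 HOURS
]

with sql.connect(server_hostname="...", http_path="...", access_token="...") as conn:
    with conn.cursor() as cursor:
        for stmt in statements:
            cursor.execute(stmt)  # OPTIMIZE can be long-running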
Photon Acceleration¶
Photon is automatically enabled on SQL Warehouses:
# Verify Photon is enabled
benchbox run --platform databricks --benchmark tpch --scale 1.0 \
--platform-option check_photon=true
Query Caching¶
To keep timings honest, BenchBox defeats the query result cache by tagging each query with a unique identifier, so repeated runs are executed by the engine rather than served from cache.
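Result caching can also be ruled out at the session level with the use_cached_result parameter; a sketch using the raw connector, with connection parameters elided:

from databricks import sql

with sql.connect(server_hostname="...", http_path="...", access_token="...") as conn:
    with conn.cursor() as cursor:
        # Disable the Databricks SQL result cache for this session
        cursor.execute("SET use_cached_result = false")
        cursor.execute("SELECT count(*) FROM lineitem")  # always re-executes
        print(cursor.fetchone())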
Data Loading¶
Unity Catalog Volumes (Default)¶
Data is uploaded to a managed Unity Catalog volume, then loaded into Delta tables via COPY INTO:
# Automatic with UC enabled
benchbox run --platform databricks --benchmark tpch --scale 1.0
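Conceptually, the flow is a two-step upload-then-load. The sketch below is illustrative: the volume path and file name are assumptions, the upload uses the databricks-sdk Files API, and SDK credentials come from the environment:

from databricks.sdk import WorkspaceClient
from databricks import sql

w = WorkspaceClient()  # reads DATABRICKS_HOST / DATABRICKS_TOKEN

# 1. Stage a generated data file in a UC volume (illustrative path)
with open("lineitem.parquet", "rb") as f:
    w.files.upload(
        "/Volumes/benchmarks/staging/uploads/lineitem.parquet", f, overwrite=True
    )

# 2. Load the staged files into a Delta table
with sql.connect(server_hostname="...", http_path="...", access_token="...") as conn:
    with conn.cursor() as cursor:
        cursor.execute("""
            COPY INTO lineitem
            FROM '/Volumes/benchmarks/staging/uploads/'
            FILEFORMAT = PARQUET
        """)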
External Location (S3/ADLS/GCS)¶
For large datasets, use external cloud storage:
# Configure external staging
benchbox run --platform databricks --benchmark tpch --scale 100.0 \
--staging-root s3://bucket/benchbox/ \
--platform-option external_location=s3://bucket/benchbox/
DBFS (Legacy)¶
For workspaces without Unity Catalog:
benchbox run --platform databricks --benchmark tpch --scale 1.0 \
--platform-option use_volumes=false \
--platform-option dbfs_path=/tmp/benchbox/
Cost Optimization¶
Auto-Stop¶
Configure warehouses to auto-stop:
-- Set via UI or API
-- Warehouse Settings > Auto Stop > 10 minutes
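The auto-stop interval can also be set programmatically via the databricks-sdk warehouses API. A sketch, assuming the warehouse id from earlier examples and that a partial edit leaves other warehouse settings untouched:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
# Stop the warehouse after 10 idle minutes (id is illustrative)
w.warehouses.edit(id="abc123def456", auto_stop_mins=10)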
Serverless Warehouses¶
Serverless SQL Warehouses start quickly and bill only while running, which suits variable workloads:
benchbox run --platform databricks --benchmark tpch \
--platform-option http_path=/sql/1.0/warehouses/serverless_wh
Troubleshooting¶
Authentication Failed¶
# Verify token is valid
curl -H "Authorization: Bearer $DATABRICKS_TOKEN" \
https://your-workspace.cloud.databricks.com/api/2.0/clusters/list
# Check token expiration
# Tokens expire after 90 days by default
SQL Warehouse Not Found¶
# List warehouses via API
curl -H "Authorization: Bearer $DATABRICKS_TOKEN" \
https://your-workspace.cloud.databricks.com/api/2.0/sql/warehouses
# Verify http_path format
# SQL Warehouse: /sql/1.0/warehouses/<warehouse_id>
# Cluster: /sql/protocolv1/o/<org_id>/<cluster_id>
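The databricks-sdk can produce the same listing without hand-rolled curl; a short sketch:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # reads DATABRICKS_HOST / DATABRICKS_TOKEN
for wh in w.warehouses.list():
    # wh.id is what goes into /sql/1.0/warehouses/<warehouse_id>
    print(wh.id, wh.name, wh.state)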
Unity Catalog Access Denied¶
-- Grant catalog access
GRANT USE CATALOG ON CATALOG benchmarks TO `user@company.com`;
GRANT CREATE SCHEMA ON CATALOG benchmarks TO `user@company.com`;
Volume Upload Failed¶
# Verify volume exists and has write access
# Create volume if needed (SQL Warehouse)
CREATE VOLUME IF NOT EXISTS benchmarks.staging.uploads;
# Grant permissions
GRANT WRITE VOLUME ON VOLUME benchmarks.staging.uploads TO `user@company.com`;
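Write access can be verified up front with the databricks-sdk Files API before a full benchmark run; a sketch against the volume created above (the marker file name is arbitrary):

import io
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Upload, then delete, a marker file to prove WRITE VOLUME works
path = "/Volumes/benchmarks/staging/uploads/_benchbox_write_test"
w.files.upload(path, io.BytesIO(b"ok"), overwrite=True)
w.files.delete(path)
print("volume write access confirmed")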