Databricks Platform


Databricks provides a unified lakehouse platform that combines data lake storage with data warehouse capabilities. BenchBox supports Databricks SQL Warehouses and classic clusters, with Delta Lake optimizations.

Features

  • Delta Lake native - ACID transactions and time travel

  • Unity Catalog - Unified governance and security

  • Photon engine - Vectorized query execution

  • Auto-scaling - Dynamic cluster management

  • Multi-cloud - AWS, Azure, and GCP support

Prerequisites

  • Databricks workspace (AWS, Azure, or GCP)

  • SQL Warehouse or All-Purpose Cluster

  • Personal Access Token or OAuth credentials

  • Unity Catalog (recommended) or Hive Metastore

Installation

# Install Databricks SQL connector
pip install databricks-sql-connector databricks-sdk

# Or via BenchBox extras
pip install "benchbox[databricks]"

Configuration

Interactive Setup

benchbox platforms setup --platform databricks

CLI Options

benchbox run --platform databricks --benchmark tpch --scale 1.0 \
  --platform-option server_hostname=your-workspace.cloud.databricks.com \
  --platform-option http_path=/sql/1.0/warehouses/abc123def456 \
  --platform-option access_token=dapi1234...

Platform Options

Option            Default    Description
server_hostname   (env)      Workspace URL
http_path         (env)      HTTP path to a SQL Warehouse or cluster
access_token      (env)      Personal Access Token
catalog           (default)  Unity Catalog name
schema            (auto)     Schema name
use_volumes       true       Use UC Volumes for staging
volume_path       (auto)     Path within the staging volume
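
Options marked (env) fall back to environment variables when not passed on the command line. Spelled out explicitly in Python, and assuming the conventional DATABRICKS_* variable names (an assumption; check your BenchBox setup for the exact names it resolves):

import os
from benchbox.platforms.databricks import DatabricksAdapter

# Resolve connection details from the environment explicitly.
adapter = DatabricksAdapter(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
)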

Authentication Methods

Personal Access Token

# Generate token: User Settings > Developer > Access Tokens
export DATABRICKS_TOKEN=dapi1234567890abcdef

benchbox run --platform databricks --benchmark tpch --scale 1.0

OAuth (M2M)

# Service principal authentication
export DATABRICKS_CLIENT_ID=your_client_id
export DATABRICKS_CLIENT_SECRET=your_client_secret

benchbox run --platform databricks --benchmark tpch \
  --platform-option auth_type=oauth-m2m
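
OAuth M2M maps onto the SQL connector's credentials_provider hook. If you need the same flow outside BenchBox, a sketch using databricks-sdk (the hostname and warehouse ID are placeholders):

import os
from databricks import sql
from databricks.sdk.core import Config, oauth_service_principal

def credential_provider():
    # Exchange the service principal's client credentials for an OAuth token.
    config = Config(
        host="https://your-workspace.cloud.databricks.com",
        client_id=os.environ["DATABRICKS_CLIENT_ID"],
        client_secret=os.environ["DATABRICKS_CLIENT_SECRET"],
    )
    return oauth_service_principal(config)

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abc123def456",
    credentials_provider=credential_provider,
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT current_user()")
        print(cursor.fetchone())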

Azure AD (Azure Databricks)

# Azure Active Directory token
export ARM_CLIENT_ID=your_client_id
export ARM_CLIENT_SECRET=your_client_secret
export ARM_TENANT_ID=your_tenant_id

benchbox run --platform databricks --benchmark tpch \
  --platform-option auth_type=azure-ad

Usage Examples

Basic Benchmark

# TPC-H on SQL Warehouse
benchbox run --platform databricks --benchmark tpch --scale 1.0

With Unity Catalog

# Specify catalog and schema
benchbox run --platform databricks --benchmark tpch --scale 10.0 \
  --platform-option catalog=benchmarks \
  --platform-option schema=tpch_sf10

With Tuning

# Apply Delta Lake optimizations
benchbox run --platform databricks --benchmark tpch --scale 10.0 \
  --tuning tuned

Python API

from benchbox import TPCH
from benchbox.platforms.databricks import DatabricksAdapter

# Connect to a SQL Warehouse in the target Unity Catalog.
adapter = DatabricksAdapter(
    server_hostname="your-workspace.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abc123def456",
    access_token="dapi1234567890abcdef",
    catalog="benchmarks",
)

# Generate TPC-H data locally, load it, and run the query set.
benchmark = TPCH(scale_factor=1.0)
benchmark.generate_data()
adapter.load_benchmark(benchmark)
results = adapter.run_benchmark(benchmark)

SQL Warehouse Sizing

Size       DBU/Hour   Recommended Scale
2X-Small   2          SF 0.01-0.1
X-Small    4          SF 0.1-1.0
Small      8          SF 1.0-10.0
Medium     16         SF 10.0-100.0
Large      32         SF 100.0+

Performance Features

Delta Lake Optimizations

BenchBox applies Delta optimizations with --tuning tuned:

-- Optimize file layout with Z-ordering
OPTIMIZE lineitem ZORDER BY (l_shipdate);

-- Or use liquid clustering instead (a table uses one or the other, not both)
ALTER TABLE lineitem CLUSTER BY (l_shipdate, l_orderkey);

-- Vacuum old versions (RETAIN 0 HOURS requires disabling the retention
-- duration check and removes time-travel history)
VACUUM lineitem RETAIN 0 HOURS;
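
If you want to apply or adapt these statements outside of --tuning tuned, they can be issued through the SQL connector. A sketch (connection details and the retention window are placeholders; Delta's default retention is 7 days):

from databricks import sql

# Tuning statements to apply after loading; adjust per table.
TUNING_STATEMENTS = [
    "OPTIMIZE lineitem ZORDER BY (l_shipdate)",
    "VACUUM lineitem RETAIN 168 HOURS",
]

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abc123def456",
    access_token="dapi1234567890abcdef",
) as connection:
    with connection.cursor() as cursor:
        for statement in TUNING_STATEMENTS:
            cursor.execute(statement)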

Photon Acceleration

Photon is enabled by default on SQL Warehouses:

# Verify Photon is enabled
benchbox run --platform databricks --benchmark tpch --scale 1.0 \
  --platform-option check_photon=true

Query Caching

BenchBox prevents result-cache hits by making each query's text unique with a query tag, so repeated runs measure actual execution rather than cached results.
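
For ad-hoc comparisons outside BenchBox, Databricks SQL also exposes a session parameter that disables the result cache directly. A sketch (connection details as above):

from databricks import sql

with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abc123def456",
    access_token="dapi1234567890abcdef",
) as connection:
    with connection.cursor() as cursor:
        # Skip the result cache so repeated runs measure real execution.
        cursor.execute("SET use_cached_result = false")
        cursor.execute("SELECT count(*) FROM lineitem")
        print(cursor.fetchone())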

Data Loading

Unity Catalog Volumes (Default)

Data is uploaded to a managed Volume, then loaded via COPY INTO:

# Automatic with UC enabled
benchbox run --platform databricks --benchmark tpch --scale 1.0
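
The manual equivalent of this flow, staging a file into a Volume with the Databricks SDK and loading it with COPY INTO, looks roughly like this (catalog, schema, volume, and file names are placeholders):

from databricks import sql
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # reads DATABRICKS_HOST / DATABRICKS_TOKEN

# Stage a local data file into a Unity Catalog Volume.
with open("lineitem.tbl", "rb") as f:
    w.files.upload(
        "/Volumes/benchmarks/staging/uploads/lineitem.tbl", f, overwrite=True
    )

# Load the staged file into a Delta table.
with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abc123def456",
    access_token="dapi1234567890abcdef",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("""
            COPY INTO benchmarks.tpch_sf1.lineitem
            FROM '/Volumes/benchmarks/staging/uploads/lineitem.tbl'
            FILEFORMAT = CSV
            FORMAT_OPTIONS ('delimiter' = '|', 'header' = 'false')
        """)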

External Location (S3/ADLS/GCS)

For large datasets, use external cloud storage:

# Configure external staging
benchbox run --platform databricks --benchmark tpch --scale 100.0 \
  --staging-root s3://bucket/benchbox/ \
  --platform-option external_location=s3://bucket/benchbox/

DBFS (Legacy)

For workspaces without Unity Catalog:

benchbox run --platform databricks --benchmark tpch --scale 1.0 \
  --platform-option use_volumes=false \
  --platform-option dbfs_path=/tmp/benchbox/

Cost Optimization

Auto-Stop

Configure warehouses to auto-stop:

-- Set via UI or API
-- Warehouse Settings > Auto Stop > 10 minutes
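
The same setting is available programmatically through the Databricks SDK (a sketch; the warehouse ID is a placeholder, and the current name and size are passed back to avoid clearing them):

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Fetch current settings, then lower the idle shutdown to 10 minutes.
wh = w.warehouses.get(id="abc123def456")
w.warehouses.edit(
    id=wh.id,
    name=wh.name,
    cluster_size=wh.cluster_size,
    auto_stop_mins=10,
)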

Serverless Warehouses

For variable workloads:

benchbox run --platform databricks --benchmark tpch \
  --platform-option http_path=/sql/1.0/warehouses/serverless_wh

Troubleshooting

Authentication Failed

# Verify token is valid
curl -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  https://your-workspace.cloud.databricks.com/api/2.0/clusters/list

# Check token expiration
# Tokens expire after 90 days by default

SQL Warehouse Not Found

# List warehouses via API
curl -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  https://your-workspace.cloud.databricks.com/api/2.0/sql/warehouses

# Verify http_path format
# SQL Warehouse: /sql/1.0/warehouses/<warehouse_id>
# Cluster: /sql/protocolv1/o/<org_id>/<cluster_id>

Unity Catalog Access Denied

-- Grant catalog access
GRANT USE CATALOG ON CATALOG benchmarks TO `user@company.com`;
GRANT CREATE SCHEMA ON CATALOG benchmarks TO `user@company.com`;

Volume Upload Failed

-- Verify the volume exists and you have write access;
-- create it if needed (run on a SQL Warehouse):
CREATE VOLUME IF NOT EXISTS benchmarks.staging.uploads;

-- Grant write permissions
GRANT WRITE VOLUME ON VOLUME benchmarks.staging.uploads TO `user@company.com`;