Databricks DataFrame Platform

Tags: intermediate, guide, databricks-df, dataframe-platform, cloud-platform

CLI name: databricks-df - use benchbox run --platform databricks-df

Databricks DataFrame executes BenchBox benchmarks using the PySpark DataFrame API on Databricks compute, via Spark Connect or Databricks Connect. It reuses the full SQL-mode Databricks infrastructure - Unity Catalog, UC Volume staging, Delta Lake tables - but runs queries as DataFrame expressions instead of SQL.
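To make the distinction concrete, here is a rough sketch of how a SQL aggregation maps onto DataFrame expressions. The table name, column names, and the surrounding `spark` session (obtained via Databricks Connect) are illustrative assumptions, not BenchBox's actual query definitions:

```python
def pricing_summary(spark):
    """Sketch of a TPC-H Q1-style aggregation as DataFrame expressions.

    Assumes `spark` is a session created externally, e.g. via
    DatabricksSession.builder.getOrCreate() from databricks-connect.
    Table and column names are illustrative only.
    """
    from pyspark.sql import functions as F  # pulled in by databricks-connect

    return (
        spark.table("lineitem")
        .filter(F.col("l_shipdate") <= "1998-09-02")
        .groupBy("l_returnflag", "l_linestatus")
        .agg(
            F.sum("l_extendedprice").alias("sum_base_price"),
            F.avg("l_discount").alias("avg_disc"),
            F.count("*").alias("count_order"),
        )
        .orderBy("l_returnflag", "l_linestatus")
    )
```

The SQL-mode databricks platform would submit the equivalent SELECT ... GROUP BY statement to a SQL Warehouse; databricks-df builds an expression tree like this client-side and executes it on the connected compute.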

Features

  • PySpark DataFrame API - filter, groupBy, agg, join expressions on Databricks compute

  • Spark Connect / Databricks Connect - remote DataFrame execution

  • Shared SQL infrastructure - same auth, UC Volume staging, Delta tables as databricks SQL mode

  • Expression family - comparable to pyspark-df, datafusion-df, polars-df

  • Unity Catalog - catalog / schema-aware execution

When to Use

| Goal | Recommended Platform |
| --- | --- |
| Evaluate Databricks SQL Warehouse performance | databricks (SQL) |
| Evaluate PySpark on Databricks Connect compute | databricks-df |
| Compare SQL vs. DataFrame on the same workload | Run both, same benchmark |

Installation

# Installs databricks-sql-connector + databricks-sdk + databricks-connect
uv add benchbox --extra cloud-spark-databricks

The base databricks extra covers SQL-mode only; it does not pull in databricks-connect. Use cloud-spark-databricks (or the legacy databricks-connect alias) for DataFrame-mode runs. See Databricks Connect docs.

Authentication

Same credentials as the databricks SQL platform - set via environment variables or --platform-option:

| Variable | Purpose |
| --- | --- |
| DATABRICKS_HOST | Workspace URL (xxx.cloud.databricks.com) |
| DATABRICKS_TOKEN | Personal access token |
| DATABRICKS_HTTP_PATH | SQL Warehouse HTTP path (used for the load phase) |

The DataFrame adapter itself accepts cluster_id, but BenchBox does not yet surface that field through the shared Databricks config builder or --platform-option. Select the target cluster in your external Databricks Connect / Spark Connect configuration instead. DATABRICKS_HTTP_PATH remains required for the load phase, which uses the SQL connector regardless of execution mode.
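A minimal environment setup might look like the following; the host, token, and warehouse path are placeholders to be replaced with your workspace's values:

```shell
# Placeholder values - substitute your workspace's host, token, and warehouse path
export DATABRICKS_HOST="https://xxx.cloud.databricks.com"
export DATABRICKS_TOKEN="<personal-access-token>"
export DATABRICKS_HTTP_PATH="/sql/1.0/warehouses/<warehouse-id>"
```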

Usage

# Run TPC-H with DataFrame execution
benchbox run --platform databricks-df --benchmark tpch --scale 1.0

# Compare SQL vs DataFrame on the same benchmark
benchbox run --platform databricks    --benchmark tpch --scale 1.0  # SQL
benchbox run --platform databricks-df --benchmark tpch --scale 1.0  # DataFrame

Supported Benchmarks

DataFrame-mode benchmarks require DataFrame query definitions. Covered today:

  • TPC-H (22 queries)

  • SSB (13 queries)

  • Read Primitives (selected categories)

  • Write Primitives (selected categories)

See the DataFrame Platforms Overview for the full DataFrame execution model and compatibility matrix.

Notes

  • Data loading uses the standard Databricks SQL connector (UC Volume staging + Delta Lake) regardless of execution mode. Only query execution differs.

  • Credentials and Unity Catalog configuration are shared between databricks and databricks-df - a single authenticated environment serves both.

  • Expect different performance profiles: a SQL Warehouse (Photon) and Databricks Connect (JVM Spark) can execute the same logical query very differently.

See Also