# Microsoft Fabric Spark Platform

Microsoft Fabric is Microsoft’s unified analytics platform, providing SaaS Spark, Data Factory, Power BI, and more. BenchBox integrates with Fabric’s Spark pools via the Livy API, executing benchmarks against OneLake storage.
## Features

- **SaaS** - Fully managed, no infrastructure to configure
- **OneLake** - Unified storage with automatic lakehouse semantics
- **Delta Lake** - Native Delta format support for tables
- **Entra ID** - Azure Active Directory authentication
- **Livy API** - Apache Livy REST API for Spark session management
## Installation

```bash
# Install with Fabric Spark support
uv add benchbox --extra fabric-spark

# Dependencies installed: azure-identity, azure-storage-file-datalake, requests
```
## Prerequisites

- Microsoft Fabric workspace with Spark capabilities
- Lakehouse created in the workspace
- Azure Entra ID authentication configured, via one of:
    - Interactive: `az login`
    - Service principal: environment variables
    - Managed identity: on Azure VMs
## Configuration

### Environment Variables

```bash
# Required
export FABRIC_WORKSPACE_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
export FABRIC_LAKEHOUSE_ID=yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy

# Optional
export AZURE_TENANT_ID=zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
export FABRIC_SPARK_POOL=my-spark-pool
```
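A small sketch of reading these variables from Python with fail-fast validation. The variable names match the ones documented above; the helper itself and the returned dict shape are illustrative, not BenchBox API:

```python
import os

# Required and optional settings, mirroring the environment variables above.
REQUIRED = ("FABRIC_WORKSPACE_ID", "FABRIC_LAKEHOUSE_ID")
OPTIONAL = ("AZURE_TENANT_ID", "FABRIC_SPARK_POOL")

def load_fabric_config(env=os.environ):
    """Collect Fabric settings, raising early if a required one is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required variables: {', '.join(missing)}")
    config = {name.lower(): env[name] for name in REQUIRED}
    config.update({name.lower(): env[name] for name in OPTIONAL if env.get(name)})
    return config
```

Failing fast here avoids waiting for a Livy session request to reject a half-configured run.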
## CLI Usage

```bash
# Basic usage
benchbox run --platform fabric-spark --benchmark tpch --scale 1.0 \
    --platform-option workspace_id=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
    --platform-option lakehouse_id=yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy

# With tenant ID for service principal auth
benchbox run --platform fabric-spark --benchmark tpch --scale 1.0 \
    --platform-option workspace_id=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
    --platform-option lakehouse_id=yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy \
    --platform-option tenant_id=zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz

# Dry-run to preview queries
benchbox run --platform fabric-spark --benchmark tpch --dry-run ./preview \
    --platform-option workspace_id=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
    --platform-option lakehouse_id=yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
```
## Platform Options

| Option | Default | Description |
|---|---|---|
| `workspace_id` | required | Fabric workspace GUID |
| `lakehouse_id` | required | Fabric lakehouse GUID |
| `tenant_id` | - | Azure tenant ID (for service principal) |
| | auto-derived | Custom Livy API endpoint URL |
| | auto-derived | OneLake path for data staging |
| `spark_pool` | - | Spark pool name (uses workspace default) |
| `timeout_minutes` | 60 | Statement timeout in minutes |
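How the auto-derived options might be resolved can be pictured as follows. This is an illustrative sketch only: the endpoint pattern shown is an assumption, not the adapter's actual derivation logic, and `resolve_options` is not a BenchBox function:

```python
def resolve_options(workspace_id, lakehouse_id, *, timeout_minutes=60,
                    livy_endpoint=None, spark_pool=None):
    """Fill in defaults the way the table above describes them."""
    if livy_endpoint is None:
        # Hypothetical Livy endpoint derived from the two GUIDs.
        livy_endpoint = (
            "https://api.fabric.microsoft.com/v1/workspaces/"
            f"{workspace_id}/lakehouses/{lakehouse_id}/livyapi"
        )
    return {
        "workspace_id": workspace_id,
        "lakehouse_id": lakehouse_id,
        "livy_endpoint": livy_endpoint,
        "spark_pool": spark_pool,          # None means "use workspace default"
        "timeout_minutes": timeout_minutes,
    }
```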
## Python API

```python
from benchbox.platforms.azure import FabricSparkAdapter

# Initialize with workspace and lakehouse
adapter = FabricSparkAdapter(
    workspace_id="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    lakehouse_id="yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy",
    tenant_id="zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz",  # Optional
)

# Verify connection
adapter.create_connection()

# Create schema (optional; the lakehouse manages this automatically)
adapter.create_schema("tpch_benchmark")

# Load data to OneLake and create Delta tables
adapter.load_data(
    tables=["lineitem", "orders", "customer"],
    source_dir="/path/to/tpch/data",
)

# Execute a query via Livy
result = adapter.execute_query("SELECT COUNT(*) FROM lineitem")

# Clean up the session
adapter.close()
```
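Because an open Livy session keeps consuming capacity, it is worth closing the adapter even when a query fails. A minimal pattern, assuming only the `execute_query()`/`close()` methods shown above; the wrapper and its timing fields are illustrative:

```python
import time

def run_queries(adapter, queries):
    """Run queries in order, always closing the session afterwards."""
    results = []
    try:
        for sql in queries:
            start = time.perf_counter()
            rows = adapter.execute_query(sql)
            results.append({"sql": sql, "rows": rows,
                            "seconds": time.perf_counter() - start})
    finally:
        adapter.close()  # release the Spark session (and its capacity) promptly
    return results
```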
## Execution Model

Fabric Spark executes benchmarks via the Livy API:

1. **Session Creation** - Livy session created in the Spark pool
2. **Statement Execution** - SQL submitted as Livy statements
3. **Result Retrieval** - Results returned via Livy output
4. **Data Staging** - OneLake used for data storage
5. **Session Cleanup** - Session closed after completion
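The Livy-facing steps map onto standard Apache Livy REST calls. A hedged sketch that expresses them as `(method, path, body)` tuples rather than live HTTP; the Fabric base URL and bearer token are left to the caller, the request bodies are illustrative, and the data-staging step happens against OneLake rather than Livy:

```python
def livy_plan(session_id, sql):
    """Outline of the Livy calls for one statement; session_id comes
    from the response to the initial POST /sessions call."""
    return [
        ("POST", "/sessions", {"kind": "sql"}),                 # 1. create session
        ("POST", f"/sessions/{session_id}/statements",
         {"code": sql}),                                        # 2. submit statement
        ("GET", f"/sessions/{session_id}/statements/0", None),  # 3. poll for output
        # 4. data staging happens via OneLake, outside Livy
        ("DELETE", f"/sessions/{session_id}", None),            # 5. close session
    ]
```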
## Authentication

Fabric Spark uses Azure Entra ID (Azure AD) for authentication.

### Interactive Login (Development)

```bash
az login
```

### Service Principal (Automation)

```bash
export AZURE_CLIENT_ID=app-client-id
export AZURE_CLIENT_SECRET=app-client-secret
export AZURE_TENANT_ID=tenant-id
```

### Managed Identity (Azure VMs)

No configuration needed - the VM’s managed identity is used automatically.
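The order in which these credential sources are tried mirrors azure-identity’s `DefaultAzureCredential` chain (environment-based service principal first, then managed identity, then the cached CLI login). A simplified sketch that just makes the selection explicit; the function is illustrative, not how the adapter actually authenticates:

```python
import os

def pick_auth_method(env=os.environ, on_azure_vm=False):
    """Return which of the three documented auth modes would apply."""
    if env.get("AZURE_CLIENT_ID") and env.get("AZURE_CLIENT_SECRET"):
        return "service_principal"   # environment variables are set
    if on_azure_vm:
        return "managed_identity"    # VM identity, no configuration needed
    return "interactive"             # falls back to the cached `az login`
```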
## Cost Estimation

Fabric uses Capacity Units (CU) for billing:

| SKU | vCores | Memory | Price/Hour |
|---|---|---|---|
| F2 | 2 | 10 GB | ~$0.36 |
| F4 | 4 | 20 GB | ~$0.72 |
| F8 | 8 | 40 GB | ~$1.44 |
| F16 | 16 | 80 GB | ~$2.88 |

| Scale Factor | Data Size | Est. Runtime | Est. Cost |
|---|---|---|---|
| 0.01 | ~10 MB | ~15 min | ~$0.20 |
| 1.0 | ~1 GB | ~1 hour | ~$1.00 |
| 10.0 | ~10 GB | ~3 hours | ~$5.00 |
| 100.0 | ~100 GB | ~8 hours | ~$15.00 |

Estimates are based on the F4 SKU. Actual costs vary by SKU and workload.
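The cost figures above fold in load and orchestration overhead; the raw compute component is simple arithmetic. A minimal lower-bound sketch using the approximate hourly rates from the SKU table (the helper is illustrative, not a BenchBox utility):

```python
# Approximate hourly rates from the SKU table above.
F_SKU_RATE = {"F2": 0.36, "F4": 0.72, "F8": 1.44, "F16": 2.88}

def estimate_compute_cost(sku, runtime_hours):
    """Lower-bound compute cost: hourly SKU rate x estimated runtime."""
    return round(F_SKU_RATE[sku] * runtime_hours, 2)
```

For example, a 3-hour scale-factor-10 run on F4 has a compute floor of about $2.16; the table's ~$5.00 estimate also covers data loading and idle session time.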
## Workspace Setup

### Creating a Lakehouse

1. Go to your Fabric workspace
2. Click **New** > **Lakehouse**
3. Name your lakehouse (e.g., “benchbox_tpch”)
4. Note the lakehouse ID from the URL or properties

### Finding IDs

**Workspace ID**: found in the workspace URL:

```
https://app.fabric.microsoft.com/groups/{WORKSPACE_ID}/...
```

**Lakehouse ID**: found in the lakehouse URL:

```
https://app.fabric.microsoft.com/groups/.../lakehouses/{LAKEHOUSE_ID}
```

Or via PowerShell:

```powershell
# List workspaces
Get-FabricWorkspace

# List lakehouses in a workspace
Get-FabricLakehouse -WorkspaceId $workspaceId
```
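As a convenience, the two GUIDs can also be pulled out of a Fabric URL with a regular expression. A sketch assuming the URL shapes shown above; `ids_from_url` is a hypothetical helper, not part of BenchBox:

```python
import re

# A GUID is 32 hex digits plus 4 hyphens: 36 characters total.
_GUID = r"[0-9a-fA-F-]{36}"
_PATTERN = re.compile(
    rf"app\.fabric\.microsoft\.com/groups/({_GUID})(?:/lakehouses/({_GUID}))?"
)

def ids_from_url(url):
    """Return (workspace_id, lakehouse_id); lakehouse_id is None for
    a plain workspace URL."""
    match = _PATTERN.search(url)
    if not match:
        raise ValueError(f"Not a recognised Fabric URL: {url}")
    return match.group(1), match.group(2)
```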
## Troubleshooting

### Common Issues

**Authentication fails:**

- Run `az login` for interactive authentication
- Check service principal credentials if using automation
- Verify the tenant ID is correct

**Workspace not found:**

- Verify the workspace ID is a valid GUID
- Check that you have access to the workspace
- Ensure the workspace has Fabric capacity assigned

**Session creation fails:**

- Check that the Fabric capacity is running (not paused)
- Verify the Spark pool is enabled in the workspace
- Review Fabric capacity limits

**Query timeout:**

- Increase `timeout_minutes` for large scale factors
- Check for data skew in queries
- Consider a larger capacity SKU
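For transient session-creation failures (e.g. a capacity just resuming from pause), retrying with exponential backoff is one mitigation. A hedged sketch; `retry_with_backoff` is illustrative and not part of BenchBox, and `create` can be any zero-argument callable such as `adapter.create_connection`:

```python
import time

def retry_with_backoff(create, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call create() up to `attempts` times, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            return create()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```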
## Comparison with Other Platforms

| Aspect | Fabric Spark | Synapse Spark | Databricks |
|---|---|---|---|
| Deployment | SaaS | PaaS | SaaS |
| Storage | OneLake | ADLS Gen2 | DBFS/Unity |
| Billing | Capacity Units | vCores | DBUs |
| Startup time | Seconds | Minutes | Seconds |
| Integration | Power BI, DF | Azure ecosystem | MLflow, SQL |