Azure Synapse Spark Platform¶
Azure Synapse Analytics is Microsoft’s enterprise analytics platform providing integrated Spark, SQL, and Data Explorer capabilities. BenchBox integrates with Synapse Spark pools via the Livy API for benchmark execution with ADLS Gen2 storage.
Features¶
Enterprise - Mature platform with extensive enterprise features
ADLS Gen2 - Azure Data Lake Storage integration
Spark Pools - Dedicated pools with configurable sizing
Entra ID - Azure Active Directory authentication
Integration - Native integration with Synapse SQL pools
Installation¶
# Install with Synapse Spark support
uv add benchbox --extra synapse-spark
# Dependencies installed: azure-identity, azure-storage-file-datalake, requests
Prerequisites¶
Azure Synapse Analytics workspace
Spark pool created in the workspace
ADLS Gen2 storage account linked to workspace
Azure Entra ID authentication configured:
Interactive: az login
Service principal: Environment variables
Managed identity: On Azure VMs
Configuration¶
Environment Variables¶
# Required
export SYNAPSE_WORKSPACE_NAME=my-synapse-workspace
export SYNAPSE_SPARK_POOL=sparkpool1
export SYNAPSE_STORAGE_ACCOUNT=mystorageaccount
export SYNAPSE_STORAGE_CONTAINER=benchbox
# Optional
export AZURE_TENANT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
export SYNAPSE_STORAGE_PATH=data/benchbox
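The variables above can be gathered into an adapter configuration before launching a run; a minimal sketch (the variable names match the list above; the `config_from_env` helper and its validation logic are illustrative, not part of BenchBox):

```python
import os

# Required variables from the list above; the run cannot proceed without them.
REQUIRED = [
    "SYNAPSE_WORKSPACE_NAME",
    "SYNAPSE_SPARK_POOL",
    "SYNAPSE_STORAGE_ACCOUNT",
    "SYNAPSE_STORAGE_CONTAINER",
]

def config_from_env(env=os.environ):
    """Collect Synapse settings, failing fast on missing required variables."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required variables: {', '.join(missing)}")
    return {
        "workspace_name": env["SYNAPSE_WORKSPACE_NAME"],
        "spark_pool_name": env["SYNAPSE_SPARK_POOL"],
        "storage_account": env["SYNAPSE_STORAGE_ACCOUNT"],
        "storage_container": env["SYNAPSE_STORAGE_CONTAINER"],
        # Optional variables fall back to their documented defaults.
        "tenant_id": env.get("AZURE_TENANT_ID"),
        "storage_path": env.get("SYNAPSE_STORAGE_PATH", "benchbox"),
    }
```

Failing fast on missing variables gives a clearer error than a mid-run authentication or storage failure.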
CLI Usage¶
# Basic usage
benchbox run --platform synapse-spark --benchmark tpch --scale 1.0 \
--platform-option workspace_name=my-synapse-workspace \
--platform-option spark_pool_name=sparkpool1 \
--platform-option storage_account=mystorageaccount \
--platform-option storage_container=benchbox
# With storage path
benchbox run --platform synapse-spark --benchmark tpch --scale 1.0 \
--platform-option workspace_name=my-synapse-workspace \
--platform-option spark_pool_name=sparkpool1 \
--platform-option storage_account=mystorageaccount \
--platform-option storage_container=benchbox \
--platform-option storage_path=data/benchbox
# Dry-run to preview queries
benchbox run --platform synapse-spark --benchmark tpch --dry-run ./preview \
--platform-option workspace_name=my-synapse-workspace \
--platform-option spark_pool_name=sparkpool1 \
--platform-option storage_account=mystorageaccount \
--platform-option storage_container=benchbox
Platform Options¶
| Option | Default | Description |
|---|---|---|
| `workspace_name` | required | Synapse workspace name |
| `spark_pool_name` | required | Spark pool name |
| `storage_account` | required | ADLS Gen2 storage account name |
| `storage_container` | required | ADLS Gen2 container name |
| `storage_path` | benchbox | Path within container for staging |
| `tenant_id` | - | Azure tenant ID (for service principal) |
| `livy_endpoint` | auto-derived | Custom Livy API endpoint URL |
| `timeout_minutes` | 60 | Statement timeout in minutes |
Python API¶
from benchbox.platforms.azure import SynapseSparkAdapter
# Initialize with workspace and storage
adapter = SynapseSparkAdapter(
workspace_name="my-synapse-workspace",
spark_pool_name="sparkpool1",
storage_account="mystorageaccount",
storage_container="benchbox",
tenant_id="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", # Optional
)
# Verify connection
adapter.create_connection()
# Create schema
adapter.create_schema("tpch_benchmark")
# Load data to ADLS and create tables
adapter.load_data(
tables=["lineitem", "orders", "customer"],
source_dir="/path/to/tpch/data",
)
# Execute query via Livy
result = adapter.execute_query("SELECT COUNT(*) FROM lineitem")
# Clean up session
adapter.close()
Execution Model¶
Synapse Spark executes benchmarks via the Livy API:
Pool Startup - Spark pool started (or uses running pool)
Session Creation - Livy session created in pool
Statement Execution - SQL submitted as Livy statements
Result Retrieval - Results returned via Livy output
Session Cleanup - Session closed after completion
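The statement-execution step above amounts to a polling loop against the Livy statement endpoint; a minimal sketch, where `fetch_state` stands in for a GET on the statement resource (the state names follow the Livy API; the function itself is illustrative, not BenchBox's implementation):

```python
import time

def wait_for_statement(fetch_state, poll_interval=1.0, max_polls=600):
    """Poll a Livy statement until it leaves its in-progress states.

    fetch_state: callable returning the statement's current state string;
    Livy reports states such as "waiting", "running", "available",
    "error", and "cancelled".
    """
    for _ in range(max_polls):
        state = fetch_state()
        if state == "available":
            return state  # results can now be read from the statement output
        if state in ("error", "cancelled"):
            raise RuntimeError(f"Livy statement finished in state {state!r}")
        time.sleep(poll_interval)
    raise TimeoutError("Statement did not complete within the polling budget")
```

A bounded poll count keeps a hung statement from blocking a benchmark run indefinitely; the `timeout_minutes` platform option serves the same purpose at a higher level.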
Spark Pool Configuration¶
Node Sizes¶
| Size | vCores | Memory | Use Case |
|---|---|---|---|
| Small | 4 | 32 GB | Development, small datasets |
| Medium | 8 | 64 GB | General benchmarking |
| Large | 16 | 128 GB | Large scale factors |
| XLarge | 32 | 256 GB | Enterprise workloads |
| XXLarge | 64 | 512 GB | Maximum performance |
Auto-Pause and Auto-Scale¶
Auto-pause: Pool pauses after idle timeout (saves costs)
Auto-scale: Pool scales nodes based on workload
Min nodes: Minimum nodes to keep warm
Max nodes: Maximum nodes for scaling
Authentication¶
Synapse Spark uses Azure Entra ID (Azure AD) for authentication:
Interactive Login (Development)¶
az login
Service Principal (Automation)¶
export AZURE_CLIENT_ID=app-client-id
export AZURE_CLIENT_SECRET=app-client-secret
export AZURE_TENANT_ID=tenant-id
Managed Identity (Azure VMs)¶
No configuration needed - automatically uses VM’s managed identity.
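The precedence among the three modes above can be sketched as a simple check; this is illustrative logic only — BenchBox relies on azure-identity's `DefaultAzureCredential`, which applies a similar but richer chain, and on Azure VMs managed identity is detected via IMDS rather than environment variables (the `IDENTITY_ENDPOINT` check here mirrors App Service-style hosting):

```python
def pick_auth_mode(env):
    """Illustrative precedence for the three documented auth modes."""
    # Service principal: all three variables from the section above are set.
    if all(env.get(k) for k in
           ("AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET", "AZURE_TENANT_ID")):
        return "service_principal"
    # Managed identity: hosting environment advertises an identity endpoint.
    if env.get("IDENTITY_ENDPOINT") or env.get("MSI_ENDPOINT"):
        return "managed_identity"
    # Otherwise fall back to the cached `az login` session.
    return "interactive"
```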
Cost Estimation¶
Synapse Spark uses vCore-hour billing:
| Node Size | vCores | Price/Hour |
|---|---|---|
| Small | 4 | ~$0.22 |
| Medium | 8 | ~$0.44 |
| Large | 16 | ~$0.88 |
| XLarge | 32 | ~$1.76 |
| Scale Factor | Data Size | Est. Runtime | Est. Cost* |
|---|---|---|---|
| 0.01 | ~10 MB | ~15 min | ~$0.30 |
| 1.0 | ~1 GB | ~1 hour | ~$2.00 |
| 10.0 | ~10 GB | ~3 hours | ~$8.00 |
| 100.0 | ~100 GB | ~8 hours | ~$25.00 |
*Estimates based on 3-node Medium pool. Actual costs vary.
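The per-node prices above support a quick back-of-the-envelope estimate; a sketch (the prices are the approximate figures from the table, not an Azure price sheet, and the helper is illustrative):

```python
# Approximate per-node hourly prices, taken from the table above.
NODE_PRICE_PER_HOUR = {"Small": 0.22, "Medium": 0.44, "Large": 0.88, "XLarge": 1.76}

def estimate_cost(node_size: str, node_count: int, hours: float) -> float:
    """Rough cost estimate: nodes x per-node hourly price x wall-clock hours."""
    return round(NODE_PRICE_PER_HOUR[node_size] * node_count * hours, 2)
```

For example, a 3-node Medium pool running for one hour comes to roughly $1.32 in compute, before storage and any warm-up or auto-pause overhead.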
Workspace Setup¶
Creating a Spark Pool¶
Go to Synapse Studio
Navigate to Manage > Apache Spark pools
Click New to create pool
Configure:
Name: e.g., “sparkpool1”
Node size: Medium (8 vCores)
Auto-scale: Enable
Auto-pause: Enable (15 min idle)
Linking Storage¶
Go to Manage > Linked services
Click New > Azure Data Lake Storage Gen2
Configure:
Name: Primary storage link
Authentication: Managed identity
Account: Your ADLS Gen2 account
Troubleshooting¶
Common Issues¶
Authentication fails:
Run az login for interactive authentication
Check service principal credentials
Verify tenant ID is correct
Spark pool not starting:
Check pool auto-start is enabled
Verify pool isn’t at max capacity
Review Azure resource quotas
Storage access denied:
Check Storage Blob Data Contributor role
Verify managed identity is configured
Check container exists and is accessible
Session timeout:
Increase timeout_minutes for large scale factors
Check pool hasn’t auto-paused
Review session idle timeout settings
Comparison with Other Azure Platforms¶
| Aspect | Synapse Spark | Fabric Spark | Databricks |
|---|---|---|---|
| Deployment | PaaS | SaaS | SaaS |
| Storage | ADLS Gen2 | OneLake | DBFS/Unity |
| Billing | vCore-hours | Capacity Units | DBUs |
| Integration | Synapse SQL | Power BI | MLflow |
| Maturity | Established | New | Established |