Azure Synapse Spark Platform

Tags: intermediate, guide, synapse-spark, cloud-platform

Azure Synapse Analytics is Microsoft’s enterprise analytics platform, providing integrated Spark, SQL, and Data Explorer capabilities. BenchBox integrates with Synapse Spark pools via the Livy API, executing benchmarks against data staged in ADLS Gen2 storage.

Features

  • Enterprise - Mature platform with extensive enterprise features

  • ADLS Gen2 - Azure Data Lake Storage integration

  • Spark Pools - Dedicated pools with configurable sizing

  • Entra ID - Microsoft Entra ID (formerly Azure Active Directory) authentication

  • Integration - Native integration with Synapse SQL pools

Installation

# Install with Synapse Spark support
uv add benchbox --extra synapse-spark

# Dependencies installed: azure-identity, azure-storage-file-datalake, requests

Prerequisites

  1. Azure Synapse Analytics workspace

  2. Spark pool created in the workspace

  3. ADLS Gen2 storage account linked to workspace

  4. Microsoft Entra ID authentication configured, using one of:

    • Interactive: az login

    • Service principal: Environment variables

    • Managed identity: On Azure VMs

Configuration

Environment Variables

# Required
export SYNAPSE_WORKSPACE_NAME=my-synapse-workspace
export SYNAPSE_SPARK_POOL=sparkpool1
export SYNAPSE_STORAGE_ACCOUNT=mystorageaccount
export SYNAPSE_STORAGE_CONTAINER=benchbox

# Optional
export AZURE_TENANT_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
export SYNAPSE_STORAGE_PATH=data/benchbox
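
These variables can also drive the Python adapter directly. A minimal sketch, assuming the constructor kwargs shown in the Python API section below:

import os

from benchbox.platforms.azure import SynapseSparkAdapter

# Required settings come straight from the environment; the optional
# tenant_id and storage_path fall back to defaults when unset.
adapter = SynapseSparkAdapter(
    workspace_name=os.environ["SYNAPSE_WORKSPACE_NAME"],
    spark_pool_name=os.environ["SYNAPSE_SPARK_POOL"],
    storage_account=os.environ["SYNAPSE_STORAGE_ACCOUNT"],
    storage_container=os.environ["SYNAPSE_STORAGE_CONTAINER"],
    tenant_id=os.environ.get("AZURE_TENANT_ID"),  # optional
    storage_path=os.environ.get("SYNAPSE_STORAGE_PATH", "benchbox"),
)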

CLI Usage

# Basic usage
benchbox run --platform synapse-spark --benchmark tpch --scale 1.0 \
  --platform-option workspace_name=my-synapse-workspace \
  --platform-option spark_pool_name=sparkpool1 \
  --platform-option storage_account=mystorageaccount \
  --platform-option storage_container=benchbox

# With storage path
benchbox run --platform synapse-spark --benchmark tpch --scale 1.0 \
  --platform-option workspace_name=my-synapse-workspace \
  --platform-option spark_pool_name=sparkpool1 \
  --platform-option storage_account=mystorageaccount \
  --platform-option storage_container=benchbox \
  --platform-option storage_path=data/benchbox

# Dry-run to preview queries
benchbox run --platform synapse-spark --benchmark tpch --dry-run ./preview \
  --platform-option workspace_name=my-synapse-workspace \
  --platform-option spark_pool_name=sparkpool1 \
  --platform-option storage_account=mystorageaccount \
  --platform-option storage_container=benchbox

Platform Options

| Option | Default | Description |
|---|---|---|
| workspace_name | required | Synapse workspace name |
| spark_pool_name | required | Spark pool name |
| storage_account | required | ADLS Gen2 storage account name |
| storage_container | required | ADLS Gen2 container name |
| storage_path | benchbox | Path within container for staging |
| tenant_id | - | Azure tenant ID (for service principal) |
| livy_endpoint | auto-derived | Custom Livy API endpoint URL |
| timeout_minutes | 60 | Statement timeout in minutes |
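
The optional livy_endpoint and timeout_minutes options can be passed to the Python adapter alongside the required ones. A sketch, assuming the constructor kwargs mirror the option names above:

from benchbox.platforms.azure import SynapseSparkAdapter

# Kwarg names assumed to mirror the platform options table.
adapter = SynapseSparkAdapter(
    workspace_name="my-synapse-workspace",
    spark_pool_name="sparkpool1",
    storage_account="mystorageaccount",
    storage_container="benchbox",
    livy_endpoint="https://my-synapse-workspace.dev.azuresynapse.net/livyApi",  # normally auto-derived
    timeout_minutes=120,  # raise for large scale factors
)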

Python API

from benchbox.platforms.azure import SynapseSparkAdapter

# Initialize with workspace and storage
adapter = SynapseSparkAdapter(
    workspace_name="my-synapse-workspace",
    spark_pool_name="sparkpool1",
    storage_account="mystorageaccount",
    storage_container="benchbox",
    tenant_id="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",  # Optional
)

# Verify connection
adapter.create_connection()

# Create schema
adapter.create_schema("tpch_benchmark")

# Load data to ADLS and create tables
adapter.load_data(
    tables=["lineitem", "orders", "customer"],
    source_dir="/path/to/tpch/data",
)

# Execute query via Livy
result = adapter.execute_query("SELECT COUNT(*) FROM lineitem")

# Clean up session
adapter.close()
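
Livy sessions keep the pool busy (and billing) until they are closed, so it is worth guaranteeing cleanup even when a query fails. A small wrapper, sketched against the adapter API above:

from contextlib import contextmanager

from benchbox.platforms.azure import SynapseSparkAdapter

@contextmanager
def synapse_session(**kwargs):
    # Open a connection, hand the adapter to the caller, and always
    # close the Livy session afterwards - even on error.
    adapter = SynapseSparkAdapter(**kwargs)
    adapter.create_connection()
    try:
        yield adapter
    finally:
        adapter.close()

with synapse_session(
    workspace_name="my-synapse-workspace",
    spark_pool_name="sparkpool1",
    storage_account="mystorageaccount",
    storage_container="benchbox",
) as adapter:
    print(adapter.execute_query("SELECT 1"))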

Execution Model

Synapse Spark executes benchmarks via the Livy API (a REST sketch follows this list):

  1. Pool Startup - Spark pool starts (or an already-running pool is reused)

  2. Session Creation - Livy session created in pool

  3. Statement Execution - SQL submitted as Livy statements

  4. Result Retrieval - Results returned via Livy output

  5. Session Cleanup - Session closed after completion
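
Under the hood these steps are plain REST calls against the workspace’s Livy endpoint. A trimmed sketch of the lifecycle (the URL shape follows the public Synapse Livy API; session-creation fields beyond name and kind are omitted, and real pools may require sizing fields):

import time

import requests
from azure.identity import DefaultAzureCredential

workspace, pool = "my-synapse-workspace", "sparkpool1"
base = (
    f"https://{workspace}.dev.azuresynapse.net"
    f"/livyApi/versions/2019-11-01-preview/sparkPools/{pool}"
)
token = DefaultAzureCredential().get_token("https://dev.azuresynapse.net/.default")
headers = {"Authorization": f"Bearer {token.token}"}

# Steps 1-2: create a session (starts the pool if it is paused)
session = requests.post(f"{base}/sessions", headers=headers,
                        json={"name": "benchbox", "kind": "sql"}).json()
sid = session["id"]
while requests.get(f"{base}/sessions/{sid}", headers=headers).json()["state"] != "idle":
    time.sleep(10)  # pool startup can take minutes; error states skipped for brevity

# Step 3: submit SQL as a Livy statement
stmt = requests.post(f"{base}/sessions/{sid}/statements", headers=headers,
                     json={"kind": "sql", "code": "SELECT COUNT(*) FROM lineitem"}).json()

# Step 4: poll until the statement output is available
while (out := requests.get(f"{base}/sessions/{sid}/statements/{stmt['id']}",
                           headers=headers).json())["state"] != "available":
    time.sleep(5)
print(out["output"])

# Step 5: close the session so the pool can auto-pause
requests.delete(f"{base}/sessions/{sid}", headers=headers)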

Spark Pool Configuration

Node Sizes

| Size | vCores | Memory | Use Case |
|---|---|---|---|
| Small | 4 | 32 GB | Development, small datasets |
| Medium | 8 | 64 GB | General benchmarking |
| Large | 16 | 128 GB | Large scale factors |
| XLarge | 32 | 256 GB | Enterprise workloads |
| XXLarge | 64 | 512 GB | Maximum performance |

Auto-Pause and Auto-Scale

  • Auto-pause - Pool pauses after an idle timeout to save costs

  • Auto-scale - Pool scales node count up and down with workload

  • Min nodes - Minimum number of nodes kept warm

  • Max nodes - Maximum number of nodes auto-scale can reach
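
These settings live on the pool resource itself (Microsoft.Synapse/workspaces/bigDataPools). A sketch of the relevant ARM properties as a Python dict, with illustrative values; apply them with Bicep, Terraform, the az CLI, or the azure-mgmt-synapse SDK:

# ARM properties for a bigDataPools resource; values are illustrative.
pool_properties = {
    "nodeSize": "Medium",              # 8 vCores / 64 GB per node
    "nodeSizeFamily": "MemoryOptimized",
    "sparkVersion": "3.4",
    "autoScale": {
        "enabled": True,
        "minNodeCount": 3,             # nodes kept warm
        "maxNodeCount": 10,            # scaling ceiling
    },
    "autoPause": {
        "enabled": True,
        "delayInMinutes": 15,          # idle time before the pool pauses
    },
}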

Authentication

Synapse Spark authenticates with Microsoft Entra ID (formerly Azure Active Directory):

Interactive Login (Development)

az login

Service Principal (Automation)

export AZURE_CLIENT_ID=app-client-id
export AZURE_CLIENT_SECRET=app-client-secret
export AZURE_TENANT_ID=tenant-id

Managed Identity (Azure VMs)

No configuration needed - automatically uses VM’s managed identity.
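
All three methods are covered by the standard azure-identity credential chain; a minimal sketch (whether BenchBox uses exactly this chain internally is an assumption):

from azure.identity import DefaultAzureCredential

# Tries, in order: environment variables (service principal), managed
# identity, then developer logins such as `az login`.
credential = DefaultAzureCredential()
token = credential.get_token("https://dev.azuresynapse.net/.default")
print(token.expires_on)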

Cost Estimation

Synapse Spark uses vCore-hour billing:

| Node Size | vCores | Price/Hour |
|---|---|---|
| Small | 4 | ~$0.22 |
| Medium | 8 | ~$0.44 |
| Large | 16 | ~$0.88 |
| XLarge | 32 | ~$1.76 |

| Scale Factor | Data Size | Est. Runtime | Est. Cost* |
|---|---|---|---|
| 0.01 | ~10 MB | ~15 min | ~$0.30 |
| 1.0 | ~1 GB | ~1 hour | ~$2.00 |
| 10.0 | ~10 GB | ~3 hours | ~$8.00 |
| 100.0 | ~100 GB | ~8 hours | ~$25.00 |

*Estimates assume a 3-node Medium pool. Actual costs vary.
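
The estimates are simple vCore-hour arithmetic; a back-of-the-envelope sketch using the approximate prices above:

# 3-node Medium pool at ~$0.44 per node-hour (~$0.055 per vCore-hour)
nodes = 3
price_per_node_hour = 0.44
runtime_hours = 1.0        # scale factor 1.0 row above

compute_cost = nodes * price_per_node_hour * runtime_hours
print(f"~${compute_cost:.2f}")  # ~$1.32; startup and staging overhead push the estimate toward ~$2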

Workspace Setup

Creating a Spark Pool

  1. Go to Synapse Studio

  2. Navigate to Manage > Apache Spark pools

  3. Click New to create pool

  4. Configure:

    • Name: e.g., “sparkpool1”

    • Node size: Medium (8 vCores)

    • Auto-scale: Enable

    • Auto-pause: Enable (15 min idle)

Linking Storage

  1. Go to Manage > Linked services

  2. Click New > Azure Data Lake Storage Gen2

  3. Configure:

    • Name: Primary storage link

    • Authentication: Managed identity

    • Account: Your ADLS Gen2 account

Troubleshooting

Common Issues

Authentication fails:

  • Run az login for interactive authentication

  • Check service principal credentials

  • Verify tenant ID is correct

Spark pool not starting:

  • Check pool auto-start is enabled

  • Verify pool isn’t at max capacity

  • Review Azure resource quotas

Storage access denied:

  • Check Storage Blob Data Contributor role

  • Verify managed identity is configured

  • Check container exists and is accessible

Session timeout:

  • Increase timeout_minutes for large scale factors

  • Check pool hasn’t auto-paused

  • Review session idle timeout settings

Comparison with Other Azure Platforms

| Aspect | Synapse Spark | Fabric Spark | Databricks |
|---|---|---|---|
| Deployment | PaaS | SaaS | SaaS |
| Storage | ADLS Gen2 | OneLake | DBFS/Unity |
| Billing | vCore-hours | Capacity Units | DBUs |
| Integration | Synapse SQL | Power BI | MLflow |
| Maturity | Established | New | Established |