Snowpark Connect for Spark¶
Snowpark Connect provides a PySpark DataFrame API that executes natively on Snowflake. This is not Apache Spark: DataFrame operations are translated to Snowflake SQL, providing a familiar API without requiring a Spark cluster.
Features¶
PySpark Compatibility - Familiar DataFrame API for Spark developers
No Cluster Required - Executes on Snowflake’s native engine
Instant Startup - No cluster provisioning delay
Snowflake Optimization - Benefits from Snowflake’s query optimizer
Credit-Based Billing - Standard Snowflake warehouse credits
Important Limitations¶
Unlike Apache Spark, Snowpark Connect has some limitations:
| Feature | Snowpark Connect | Apache Spark |
|---|---|---|
| RDD APIs | Not supported | Full support |
| DataFrame.hint() | No-op | Optimizer hints |
| DataFrame.repartition() | No-op | Partition control |
| UDFs | Snowpark UDFs only | Python/Scala UDFs |
| Streaming | Not supported | Structured Streaming |
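In practice, the no-op methods are accepted without error but silently ignored. A minimal sketch of what that means, using the adapter DataFrame API introduced in the Python API section below (the calls shown are standard PySpark methods; the behavior follows the table above):
df = adapter.get_dataframe("lineitem")
# On Apache Spark these shape the physical plan; on Snowpark Connect
# they return the DataFrame unchanged.
df = df.hint("broadcast")   # no-op: no optimizer hint reaches Snowflake
df = df.repartition(16)     # no-op: Snowflake manages data distribution
# The SQL sent to Snowflake is identical with or without those calls.
result = adapter.execute_dataframe(df.limit(10))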
Installation¶
# Install with Snowpark Connect support
uv add benchbox --extra snowpark-connect
# Dependencies installed: snowflake-snowpark-python
Prerequisites¶
Snowflake account with active warehouse
Credentials configured via one of:
Username/password authentication
Key-pair authentication
OAuth (browser-based)
Configuration¶
Environment Variables¶
# Required
export SNOWFLAKE_ACCOUNT=xy12345.us-east-1
export SNOWFLAKE_USER=my_user
export SNOWFLAKE_PASSWORD=my_password
# Optional
export SNOWFLAKE_WAREHOUSE=COMPUTE_WH
export SNOWFLAKE_DATABASE=BENCHBOX
export SNOWFLAKE_ROLE=SYSADMIN
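If you drive the Python API directly rather than the CLI, the same variables can be wired into the adapter by hand. A minimal sketch (the COMPUTE_WH fallback is an assumption mirroring the default in the options table below):
import os
from benchbox.platforms.snowpark_connect import SnowparkConnectAdapter
# Build the adapter from the environment variables above
adapter = SnowparkConnectAdapter(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse=os.environ.get("SNOWFLAKE_WAREHOUSE", "COMPUTE_WH"),
)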
CLI Usage¶
# Basic usage
benchbox run --platform snowpark-connect --benchmark tpch --scale 1.0 \
--platform-option account=xy12345.us-east-1 \
--platform-option user=my_user \
--platform-option password=my_password
# With custom warehouse
benchbox run --platform snowpark-connect --benchmark tpch --scale 1.0 \
--platform-option account=xy12345.us-east-1 \
--platform-option user=my_user \
--platform-option password=my_password \
--platform-option warehouse=BENCHMARK_WH
# Dry-run to preview queries
benchbox run --platform snowpark-connect --benchmark tpch --dry-run ./preview \
--platform-option account=xy12345.us-east-1 \
--platform-option user=my_user \
--platform-option password=my_password
Platform Options¶
| Option | Default | Description |
|---|---|---|
| account | required | Snowflake account identifier |
| user | required | Snowflake username |
| password | - | Password (required unless using key-pair auth) |
| warehouse | COMPUTE_WH | Virtual warehouse name |
| database | BENCHBOX | Database name |
| schema | PUBLIC | Schema name |
| role | - | Role to use for the session |
| authenticator | - | Auth method (snowflake, externalbrowser, oauth) |
| private_key_path | - | Path to private key for key-pair auth |
| warehouse_size | MEDIUM | Warehouse size for scaling |
Python API¶
from benchbox.platforms.snowpark_connect import SnowparkConnectAdapter
# Initialize with credentials
adapter = SnowparkConnectAdapter(
account="xy12345.us-east-1",
user="my_user",
password="my_password",
warehouse="COMPUTE_WH",
database="BENCHBOX",
)
# Create session (no Spark cluster needed!)
adapter.create_connection()
# Create schema
adapter.create_schema("tpch_benchmark")
# Execute SQL query
result = adapter.execute_query("SELECT COUNT(*) FROM lineitem")
# Or use DataFrame API
df = adapter.get_dataframe("lineitem")
result = adapter.execute_dataframe(df.filter(df["l_quantity"] > 30).limit(10))
# Close session
adapter.close()
Execution Model¶
Snowpark Connect uses a fundamentally different execution model than Spark (see the sketch after this list):
Session Start - Create Snowpark session (instant, no cluster)
DataFrame Operations - Build DataFrame expression tree
Translation - Operations translated to Snowflake SQL
Execution - SQL runs on Snowflake warehouse
Results - Data returned directly
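A minimal sketch of that flow, reusing the adapter from the Python API section (the table and column names are illustrative TPC-H examples):
# 1. Session start: a Snowpark session is created instantly; no cluster spins up.
adapter.create_connection()
# 2. DataFrame operations only build an expression tree; nothing executes yet.
df = adapter.get_dataframe("orders")
pending = df.filter(df["o_totalprice"] > 100000).limit(5)
# 3-5. Execution translates the tree to Snowflake SQL, runs it on the
# warehouse, and returns the rows directly.
result = adapter.execute_dataframe(pending)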
When to Use Snowpark Connect¶
Good fit:
Teams familiar with PySpark who use Snowflake
Existing Snowflake warehouse investments
DataFrame-based analytics without cluster management
Quick ad-hoc analysis with PySpark syntax
Consider alternatives:
Need RDD APIs → Apache Spark
Need Spark Streaming → Databricks/EMR
Need full Spark UDF support → Apache Spark
Need partition control → Apache Spark
Authentication¶
Password Authentication¶
adapter = SnowparkConnectAdapter(
account="xy12345.us-east-1",
user="my_user",
password="my_password",
)
Key-Pair Authentication¶
adapter = SnowparkConnectAdapter(
account="xy12345.us-east-1",
user="my_user",
private_key_path="/path/to/rsa_key.p8",
private_key_passphrase="optional_passphrase",
)
Browser-Based SSO¶
adapter = SnowparkConnectAdapter(
account="xy12345.us-east-1",
user="my_user",
authenticator="externalbrowser",
)
Cost Estimation¶
Snowpark Connect uses standard Snowflake credit consumption:
| Warehouse Size | Credits/Hour | Typical $/Credit |
|---|---|---|
| X-Small | 1 | $2-4 |
| Small | 2 | $2-4 |
| Medium | 4 | $2-4 |
| Large | 8 | $2-4 |
| X-Large | 16 | $2-4 |

| Scale Factor | Data Size | Est. Runtime | Est. Cost* |
|---|---|---|---|
| 0.01 | ~10 MB | ~5 min | ~$0.05 |
| 1.0 | ~1 GB | ~30 min | ~$0.50 |
| 10.0 | ~10 GB | ~2 hours | ~$4.00 |
| 100.0 | ~100 GB | ~6 hours | ~$20.00 |
*Estimates based on Medium warehouse. Actual costs vary by contract.
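For a back-of-the-envelope estimate of your own runs, multiply warehouse runtime by the credit rate and your contract price. A minimal sketch (the rates come from the table above; per-second billing and auto-suspend typically lower actual spend):
def estimate_cost(runtime_hours: float, credits_per_hour: float, usd_per_credit: float) -> float:
    """Rough cost: warehouse runtime x credit burn rate x contract price."""
    return runtime_hours * credits_per_hour * usd_per_credit

# Example: a 2-hour run on an X-Small warehouse (1 credit/hour) at $2/credit
print(f"~${estimate_cost(2.0, 1, 2.0):.2f}")  # ~$4.00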
Comparison with Other Platforms¶
| Aspect | Snowpark Connect | Snowflake SQL | Databricks Spark |
|---|---|---|---|
| API | PySpark DataFrame | SQL | PySpark/SQL |
| Cluster | None (Snowflake) | None (Snowflake) | Spark cluster |
| RDD Support | No | N/A | Yes |
| Startup | Instant | Instant | Minutes |
| Billing | Credits | Credits | DBU + compute |
| Use Case | DataFrame + Snowflake | SQL + Snowflake | Full Spark |
Troubleshooting¶
Common Issues¶
Account identifier format:
Format: xy12345.region or xy12345.region.cloud
Example: xy12345.us-east-1 or xy12345.us-east-1.aws
Don't include .snowflakecomputing.com
Warehouse not found:
Verify the warehouse exists: SHOW WAREHOUSES;
Check permissions on the warehouse
Warehouse may be suspended (auto-resumes on first query)
Authentication errors:
Verify account identifier is correct
Check username and password
For key-pair: ensure private key format is correct (PKCS#8)
Permission denied:
Check role has necessary privileges
Verify database and schema access
Check warehouse usage permission
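When diagnosing permission errors, it helps to print the session context first. A minimal sketch using the adapter's execute_query from the Python API section (the SQL functions are standard Snowflake):
# A mismatch here usually explains a permission or "not found" error
result = adapter.execute_query(
    "SELECT CURRENT_ROLE(), CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA()"
)
print(result)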