Snowpark Connect for Spark¶
Snowpark Connect provides a PySpark DataFrame API that executes natively on Snowflake. This is not Apache Spark: DataFrame operations are translated to Snowflake SQL, providing a familiar API without requiring a Spark cluster.
Features¶
PySpark Compatibility - Familiar DataFrame API for Spark developers
No Cluster Required - Executes on Snowflake’s native engine
Instant Startup - No cluster provisioning delay
Snowflake Optimization - Benefits from Snowflake’s query optimizer
Credit-Based Billing - Standard Snowflake warehouse credits
Important Limitations¶
Unlike Apache Spark, Snowpark Connect has some limitations:
| Feature | Snowpark Connect | Apache Spark |
|---|---|---|
| RDD APIs | Not supported | Full support |
| DataFrame.hint() | No-op | Optimizer hints |
| DataFrame.repartition() | No-op | Partition control |
| UDFs | Snowpark UDFs only | Python/Scala UDFs |
| Streaming | Not supported | Structured Streaming |
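In practice, the no-op methods are accepted without error but silently ignored. A minimal sketch of what that means, using the adapter DataFrame API introduced in the Python API section below (the calls shown are standard PySpark methods; the behavior follows the table above):
df = adapter.get_dataframe("lineitem")
# On Apache Spark these shape the physical plan; on Snowpark Connect
# they return the DataFrame unchanged.
df = df.hint("broadcast")   # no-op: no optimizer hint reaches Snowflake
df = df.repartition(16)     # no-op: Snowflake manages data distribution
# The SQL sent to Snowflake is identical with or without those calls.
result = adapter.execute_dataframe(df.limit(10))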
Installation¶
# Install with Snowpark Connect support
uv add benchbox --extra snowpark-connect
# Dependencies installed: snowflake-snowpark-python
Prerequisites¶
Snowflake account with active warehouse
Credentials configured via one of:
Username/password authentication
Key-pair authentication
OAuth (browser-based)
Configuration¶
Environment Variables¶
# Required
export SNOWFLAKE_ACCOUNT=xy12345.us-east-1
export SNOWFLAKE_USER=my_user
export SNOWFLAKE_PASSWORD=my_password
# Optional
export SNOWFLAKE_WAREHOUSE=COMPUTE_WH
export SNOWFLAKE_DATABASE=BENCHBOX
export SNOWFLAKE_ROLE=SYSADMIN
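If you drive the Python API directly rather than the CLI, the same variables can be wired into the adapter by hand. A minimal sketch (the COMPUTE_WH fallback is an assumption mirroring the default in the options table below):
import os
from benchbox.platforms.snowpark_connect import SnowparkConnectAdapter
# Build the adapter from the environment variables above
adapter = SnowparkConnectAdapter(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse=os.environ.get("SNOWFLAKE_WAREHOUSE", "COMPUTE_WH"),
)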
CLI Usage¶
# Basic usage
benchbox run --platform snowpark-connect --benchmark tpch --scale 1.0 \
--platform-option account=xy12345.us-east-1 \
--platform-option user=my_user \
--platform-option password=my_password
# With custom warehouse
benchbox run --platform snowpark-connect --benchmark tpch --scale 1.0 \
--platform-option account=xy12345.us-east-1 \
--platform-option user=my_user \
--platform-option password=my_password \
--platform-option warehouse=BENCHMARK_WH
# Dry-run to preview queries
benchbox run --platform snowpark-connect --benchmark tpch --dry-run ./preview \
--platform-option account=xy12345.us-east-1 \
--platform-option user=my_user \
--platform-option password=my_password
Platform Options¶
| Option | Default | Description |
|---|---|---|
| account | required | Snowflake account identifier |
| user | required | Snowflake username |
| password | - | Password (required unless using key-pair auth) |
| warehouse | COMPUTE_WH | Virtual warehouse name |
| database | BENCHBOX | Database name |
| schema | PUBLIC | Schema name |
| role | - | Role to use for the session |
| authenticator | - | Auth method (snowflake, externalbrowser, oauth) |
| private_key_path | - | Path to private key for key-pair auth |
| warehouse_size | MEDIUM | Warehouse size for scaling |
Python API¶
from benchbox.platforms.snowpark_connect import SnowparkConnectAdapter
# Initialize with credentials
adapter = SnowparkConnectAdapter(
account="xy12345.us-east-1",
user="my_user",
password="my_password",
warehouse="COMPUTE_WH",
database="BENCHBOX",
)
# Create session (no Spark cluster needed!)
adapter.create_connection()
# Create schema
adapter.create_schema("tpch_benchmark")
# Execute SQL query
result = adapter.execute_query("SELECT COUNT(*) FROM lineitem")
# Or use DataFrame API
df = adapter.get_dataframe("lineitem")
result = adapter.execute_dataframe(df.filter(df["l_quantity"] > 30).limit(10))
# Close session
adapter.close()
Execution Model¶
Snowpark Connect uses a fundamentally different execution model than Spark (see the sketch after this list):
Session Start - Create Snowpark session (instant, no cluster)
DataFrame Operations - Build DataFrame expression tree
Translation - Operations translated to Snowflake SQL
Execution - SQL runs on Snowflake warehouse
Results - Data returned directly
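A minimal sketch of that flow, reusing the adapter from the Python API section (the table and column names are illustrative TPC-H examples):
# 1. Session start: a Snowpark session is created instantly; no cluster spins up.
adapter.create_connection()
# 2. DataFrame operations only build an expression tree; nothing executes yet.
df = adapter.get_dataframe("orders")
pending = df.filter(df["o_totalprice"] > 100000).limit(5)
# 3-5. Execution translates the tree to Snowflake SQL, runs it on the
# warehouse, and returns the rows directly.
result = adapter.execute_dataframe(pending)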
When to Use Snowpark Connect¶
Good fit:
Teams familiar with PySpark who use Snowflake
Existing Snowflake warehouse investments
DataFrame-based analytics without cluster management
Quick ad-hoc analysis with PySpark syntax
Consider alternatives:
Need RDD APIs → Apache Spark
Need Spark Streaming → Databricks/EMR
Need full Spark UDF support → Apache Spark
Need partition control → Apache Spark
Authentication¶
Password Authentication¶
adapter = SnowparkConnectAdapter(
account="xy12345.us-east-1",
user="my_user",
password="my_password",
)
Key-Pair Authentication¶
adapter = SnowparkConnectAdapter(
account="xy12345.us-east-1",
user="my_user",
private_key_path="/path/to/rsa_key.p8",
private_key_passphrase="optional_passphrase",
)
Browser-Based SSO¶
adapter = SnowparkConnectAdapter(
account="xy12345.us-east-1",
user="my_user",
authenticator="externalbrowser",
)
Cost Estimation¶
Snowpark Connect uses standard Snowflake credit consumption:
| Warehouse Size | Credits/Hour | Typical $/Credit |
|---|---|---|
| X-Small | 1 | $2-4 |
| Small | 2 | $2-4 |
| Medium | 4 | $2-4 |
| Large | 8 | $2-4 |
| X-Large | 16 | $2-4 |

| Scale Factor | Data Size | Est. Runtime | Est. Cost* |
|---|---|---|---|
| 0.01 | ~10 MB | ~5 min | ~$0.05 |
| 1.0 | ~1 GB | ~30 min | ~$0.50 |
| 10.0 | ~10 GB | ~2 hours | ~$4.00 |
| 100.0 | ~100 GB | ~6 hours | ~$20.00 |
*Estimates based on Medium warehouse. Actual costs vary by contract.
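For a back-of-the-envelope estimate of your own runs, multiply warehouse runtime by the credit rate and your contract price. A minimal sketch (the rates come from the table above; per-second billing and auto-suspend typically lower actual spend):
def estimate_cost(runtime_hours: float, credits_per_hour: float, usd_per_credit: float) -> float:
    """Rough cost: warehouse runtime x credit burn rate x contract price."""
    return runtime_hours * credits_per_hour * usd_per_credit

# Example: a 2-hour run on an X-Small warehouse (1 credit/hour) at $2/credit
print(f"~${estimate_cost(2.0, 1, 2.0):.2f}")  # ~$4.00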
Comparison with Other Platforms¶
| Aspect | Snowpark Connect | Snowflake SQL | Databricks Spark |
|---|---|---|---|
| API | PySpark DataFrame | SQL | PySpark/SQL |
| Cluster | None (Snowflake) | None (Snowflake) | Spark cluster |
| RDD Support | No | N/A | Yes |
| Startup | Instant | Instant | Minutes |
| Billing | Credits | Credits | DBU + compute |
| Use Case | DataFrame + Snowflake | SQL + Snowflake | Full Spark |
Troubleshooting¶
Common Issues¶
Account identifier format:
Format: xy12345.region or xy12345.region.cloud
Example: xy12345.us-east-1 or xy12345.us-east-1.aws
Don't include .snowflakecomputing.com
Warehouse not found:
Verify the warehouse exists: SHOW WAREHOUSES;
Check permissions on the warehouse
Warehouse may be suspended (auto-resumes on first query)
Authentication errors:
Verify account identifier is correct
Check username and password
For key-pair: ensure private key format is correct (PKCS#8)
Permission denied:
Check role has necessary privileges
Verify database and schema access
Check warehouse usage permission
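When diagnosing permission errors, it helps to print the session context first. A minimal sketch using the adapter's execute_query from the Python API section (the SQL functions are standard Snowflake):
# A mismatch here usually explains a permission or "not found" error
result = adapter.execute_query(
    "SELECT CURRENT_ROLE(), CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA()"
)
print(result)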