Snowpark Connect for Spark

Tags: intermediate, guide, snowflake, cloud-platform

Snowpark Connect provides a PySpark DataFrame API that executes natively on Snowflake. This is not Apache Spark: DataFrame operations are translated to Snowflake SQL, giving Spark developers a familiar API without requiring a Spark cluster.
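As a rough illustration of the translation (the adapter API is covered under Python API below; the generated SQL shown is approximate, not the exact statement Snowflake produces):

from benchbox.platforms.snowpark_connect import SnowparkConnectAdapter

adapter = SnowparkConnectAdapter(
    account="xy12345.us-east-1", user="my_user", password="my_password"
)
adapter.create_connection()

# PySpark-style pipeline built against Snowpark Connect
df = adapter.get_dataframe("lineitem")
result = adapter.execute_dataframe(
    df.filter(df["l_quantity"] > 30).groupBy("l_returnflag").count()
)
# Roughly the SQL that runs on the warehouse:
#   SELECT l_returnflag, COUNT(*) FROM lineitem
#   WHERE l_quantity > 30
#   GROUP BY l_returnflag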

Features

  • PySpark Compatibility - Familiar DataFrame API for Spark developers

  • No Cluster Required - Executes on Snowflake’s native engine

  • Instant Startup - No cluster provisioning delay

  • Snowflake Optimization - Benefits from Snowflake’s query optimizer

  • Credit-Based Billing - Standard Snowflake warehouse credits

Important Limitations

Unlike Apache Spark, Snowpark Connect has some limitations:

| Feature | Snowpark Connect | Apache Spark |
|---------|------------------|--------------|
| RDD APIs | Not supported | Full support |
| DataFrame.hint() | No-op | Optimizer hints |
| DataFrame.repartition() | No-op | Partition control |
| UDFs | Snowpark UDFs only | Python/Scala UDFs |
| Streaming | Not supported | Structured Streaming |
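In practice, the no-op entries mean Spark tuning calls succeed but do nothing: they are accepted for source compatibility rather than raising errors. A sketch, assuming a df obtained from the adapter as shown later in this guide:

# Accepted for PySpark compatibility, but ignored by Snowflake's engine:
df = adapter.get_dataframe("lineitem")
df = df.repartition(200)      # no-op: Snowflake manages parallelism itself
df = df.hint("broadcast")     # no-op: the Snowflake optimizer picks join strategies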

Installation

# Install with Snowpark Connect support
uv add benchbox --extra snowpark-connect

# Dependencies installed: snowflake-snowpark-python

Prerequisites

  1. Snowflake account with active warehouse

  2. Credentials configured (any one of):

    • Username/password authentication

    • Key-pair authentication

    • OAuth (browser-based)

Configuration

Environment Variables

# Required
export SNOWFLAKE_ACCOUNT=xy12345.us-east-1
export SNOWFLAKE_USER=my_user
export SNOWFLAKE_PASSWORD=my_password

# Optional
export SNOWFLAKE_WAREHOUSE=COMPUTE_WH
export SNOWFLAKE_DATABASE=BENCHBOX
export SNOWFLAKE_ROLE=SYSADMIN
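A minimal sketch of passing these variables through to the Python API; the explicit wiring below is an assumption (whether benchbox also reads these variables automatically is not covered here):

import os
from benchbox.platforms.snowpark_connect import SnowparkConnectAdapter

# Explicitly wire the environment variables into the adapter.
adapter = SnowparkConnectAdapter(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse=os.environ.get("SNOWFLAKE_WAREHOUSE", "COMPUTE_WH"),
    database=os.environ.get("SNOWFLAKE_DATABASE", "BENCHBOX"),
    role=os.environ.get("SNOWFLAKE_ROLE"),  # assumption: None falls back to the default role
)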

CLI Usage

# Basic usage
benchbox run --platform snowpark-connect --benchmark tpch --scale 1.0 \
  --platform-option account=xy12345.us-east-1 \
  --platform-option user=my_user \
  --platform-option password=my_password

# With custom warehouse
benchbox run --platform snowpark-connect --benchmark tpch --scale 1.0 \
  --platform-option account=xy12345.us-east-1 \
  --platform-option user=my_user \
  --platform-option password=my_password \
  --platform-option warehouse=BENCHMARK_WH

# Dry-run to preview queries
benchbox run --platform snowpark-connect --benchmark tpch --dry-run ./preview \
  --platform-option account=xy12345.us-east-1 \
  --platform-option user=my_user \
  --platform-option password=my_password

Platform Options

| Option | Default | Description |
|--------|---------|-------------|
| account | required | Snowflake account identifier |
| user | required | Snowflake username |
| password | - | Password (required unless using key-pair auth) |
| warehouse | COMPUTE_WH | Virtual warehouse name |
| database | BENCHBOX | Database name |
| schema | PUBLIC | Schema name |
| role | - | Role to use for the session |
| authenticator | - | Auth method (snowflake, externalbrowser, oauth) |
| private_key_path | - | Path to private key for key-pair auth |
| private_key_passphrase | - | Passphrase for an encrypted private key |
| warehouse_size | MEDIUM | Warehouse size for scaling |

Python API

from benchbox.platforms.snowpark_connect import SnowparkConnectAdapter

# Initialize with credentials
adapter = SnowparkConnectAdapter(
    account="xy12345.us-east-1",
    user="my_user",
    password="my_password",
    warehouse="COMPUTE_WH",
    database="BENCHBOX",
)

# Create session (no Spark cluster needed!)
adapter.create_connection()

# Create schema
adapter.create_schema("tpch_benchmark")

# Execute SQL query
result = adapter.execute_query("SELECT COUNT(*) FROM lineitem")

# Or use DataFrame API
df = adapter.get_dataframe("lineitem")
result = adapter.execute_dataframe(df.filter(df["l_quantity"] > 30).limit(10))

# Close session
adapter.close()
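On top of the calls above, a simple timing loop is enough for ad-hoc benchmarking. The wrapper below is illustrative; only the adapter methods come from the API shown above:

import time

queries = {
    "count": "SELECT COUNT(*) FROM lineitem",
    "agg": "SELECT l_returnflag, SUM(l_extendedprice) FROM lineitem GROUP BY l_returnflag",
}

adapter.create_connection()
try:
    for name, sql in queries.items():
        start = time.perf_counter()
        adapter.execute_query(sql)
        print(f"{name}: {time.perf_counter() - start:.2f}s")
finally:
    adapter.close()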

Execution Model

Snowpark Connect uses a fundamentally different execution model from Apache Spark:

  1. Session Start - Create Snowpark session (instant, no cluster)

  2. DataFrame Operations - Build DataFrame expression tree

  3. Translation - Operations translated to Snowflake SQL

  4. Execution - SQL runs on Snowflake warehouse

  5. Results - Data returned directly
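Steps 1-3 are local and effectively free: building a DataFrame sends nothing to Snowflake. SQL is generated and executed only when a result is requested, as in this sketch using the adapter API from above:

# Steps 1-3: local only; no SQL has been sent to Snowflake yet.
df = adapter.get_dataframe("lineitem")
pipeline = df.filter(df["l_quantity"] > 30).limit(10)

# Steps 4-5: translation to SQL and warehouse execution happen here.
result = adapter.execute_dataframe(pipeline)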

When to Use Snowpark Connect

Good fit:

  • Teams familiar with PySpark who use Snowflake

  • Existing Snowflake warehouse investments

  • DataFrame-based analytics without cluster management

  • Quick ad-hoc analysis with PySpark syntax

Consider alternatives:

  • Need RDD APIs → Apache Spark

  • Need Spark Streaming → Databricks/EMR

  • Need full Spark UDF support → Apache Spark

  • Need partition control → Apache Spark

Authentication

Password Authentication

adapter = SnowparkConnectAdapter(
    account="xy12345.us-east-1",
    user="my_user",
    password="my_password",
)

Key-Pair Authentication

adapter = SnowparkConnectAdapter(
    account="xy12345.us-east-1",
    user="my_user",
    private_key_path="/path/to/rsa_key.p8",
    private_key_passphrase="optional_passphrase",
)

Browser-Based SSO

adapter = SnowparkConnectAdapter(
    account="xy12345.us-east-1",
    user="my_user",
    authenticator="externalbrowser",
)

Cost Estimation

Snowpark Connect uses standard Snowflake credit consumption:

| Warehouse Size | Credits/Hour | Typical $/Credit |
|----------------|--------------|------------------|
| X-Small | 1 | $2-4 |
| Small | 2 | $2-4 |
| Medium | 4 | $2-4 |
| Large | 8 | $2-4 |
| X-Large | 16 | $2-4 |

| Scale Factor | Data Size | Est. Runtime | Est. Cost* |
|--------------|-----------|--------------|------------|
| 0.01 | ~10 MB | ~5 min | ~$0.05 |
| 1.0 | ~1 GB | ~30 min | ~$0.50 |
| 10.0 | ~10 GB | ~2 hours | ~$4.00 |
| 100.0 | ~100 GB | ~6 hours | ~$20.00 |

*Estimates based on Medium warehouse. Actual costs vary by contract.
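The underlying arithmetic is credits/hour for the warehouse size, times active hours, times your contracted $/credit. Snowflake bills warehouses per second (with a 60-second minimum) and auto-suspends idle warehouses, so actual spend tracks active query time rather than wall-clock runtime. A sketch with a hypothetical estimate_cost helper:

CREDITS_PER_HOUR = {"X-Small": 1, "Small": 2, "Medium": 4, "Large": 8, "X-Large": 16}

def estimate_cost(size: str, active_hours: float, usd_per_credit: float = 3.0) -> float:
    # Hypothetical helper: an upper-bound estimate that assumes the
    # warehouse is active (and billing) for the whole duration.
    return CREDITS_PER_HOUR[size] * active_hours * usd_per_credit

# Example: 10 minutes of active Medium-warehouse time at $3/credit
print(f"${estimate_cost('Medium', 10 / 60):.2f}")  # $2.00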

Comparison with Other Platforms

| Aspect | Snowpark Connect | Snowflake SQL | Databricks Spark |
|--------|------------------|---------------|------------------|
| API | PySpark DataFrame | SQL | PySpark/SQL |
| Cluster | None (Snowflake) | None (Snowflake) | Spark cluster |
| RDD Support | No | N/A | Yes |
| Startup | Instant | Instant | Minutes |
| Billing | Credits | Credits | DBU + compute |
| Use Case | DataFrame + Snowflake | SQL + Snowflake | Full Spark |

Troubleshooting

Common Issues

Account identifier format:

  • Format: xy12345.region or xy12345.region.cloud

  • Example: xy12345.us-east-1 or xy12345.us-east-1.aws

  • Don’t include .snowflakecomputing.com
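A hypothetical sanity check for the identifier format (the regex below is illustrative, not Snowflake's official grammar):

import re

ACCOUNT_RE = re.compile(r"^[A-Za-z0-9]+\.[A-Za-z0-9-]+(\.[A-Za-z]+)?$")

def check_account(identifier: str) -> str:
    # Strip a full hostname if one was pasted by mistake.
    identifier = identifier.removesuffix(".snowflakecomputing.com")
    if not ACCOUNT_RE.match(identifier):
        raise ValueError(f"Unexpected account identifier: {identifier!r}")
    return identifier

print(check_account("xy12345.us-east-1.snowflakecomputing.com"))  # xy12345.us-east-1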

Warehouse not found:

  • Verify warehouse exists: SHOW WAREHOUSES;

  • Check permissions on warehouse

  • Warehouse may be suspended (auto-resumes on first query)

Authentication errors:

  • Verify account identifier is correct

  • Check username and password

  • For key-pair: ensure the private key format is correct (PKCS#8); a quick local check is sketched below
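A minimal sketch of that check, assuming the cryptography package (a dependency of the Snowflake Python connector) is available:

from cryptography.hazmat.primitives import serialization

# Attempt to parse the key file; a failure here points at a malformed
# or wrongly formatted key. Encrypted keys need password=b"...".
with open("/path/to/rsa_key.p8", "rb") as f:
    serialization.load_pem_private_key(f.read(), password=None)
print("Private key parsed OK")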

Permission denied:

  • Check role has necessary privileges

  • Verify database and schema access

  • Check warehouse usage permission