Amazon Athena for Apache Spark Platform¶
Athena for Apache Spark is AWS's interactive Spark service, with session startup in under a second. Unlike EMR Serverless or AWS Glue, it uses a notebook-style execution model with persistent sessions.
Features¶
Sub-second Startup - Pre-provisioned Spark capacity for instant execution
Interactive Sessions - Notebook-style execution with persistent state
Serverless - No cluster management required
S3 Integration - Native S3 and Glue Data Catalog support
Session-based - Efficient for multiple queries in a session
Installation¶
# Install with Athena Spark support
uv add benchbox --extra athena-spark
# Dependencies installed: boto3
Prerequisites¶
Spark-enabled Athena workgroup (created via Console or CLI)
S3 bucket for data staging
AWS credentials configured via one of the following (see the quick check below):
AWS CLI: aws configure
Environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
IAM role (on EC2/ECS/Lambda)
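A quick way to confirm that credentials resolve before running a benchmark — a minimal sketch using boto3's default credential chain, which any of the three mechanisms above will satisfy:

```python
import boto3

# Resolves credentials via the default chain (CLI profile, env vars, or IAM role)
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])
```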
Configuration¶
Environment Variables¶
# Required
export ATHENA_SPARK_WORKGROUP=my-spark-workgroup
export ATHENA_S3_STAGING_DIR=s3://my-bucket/benchbox
# Optional
export AWS_REGION=us-east-1
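If you construct the adapter yourself rather than going through the CLI, the same variables can be read explicitly (a sketch; whether BenchBox also picks these variables up automatically is not shown here):

```python
import os

from benchbox.platforms.aws import AthenaSparkAdapter

# Build the adapter from the environment variables defined above
adapter = AthenaSparkAdapter(
    workgroup=os.environ["ATHENA_SPARK_WORKGROUP"],
    s3_staging_dir=os.environ["ATHENA_S3_STAGING_DIR"],
    region=os.environ.get("AWS_REGION", "us-east-1"),
)
```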
CLI Usage¶
# Basic usage
benchbox run --platform athena-spark --benchmark tpch --scale 1.0 \
--platform-option workgroup=my-spark-workgroup \
--platform-option s3_staging_dir=s3://my-bucket/benchbox
# With custom region
benchbox run --platform athena-spark --benchmark tpch --scale 1.0 \
--platform-option workgroup=my-spark-workgroup \
--platform-option s3_staging_dir=s3://my-bucket/benchbox \
--platform-option region=eu-west-1
# With custom DPU configuration
benchbox run --platform athena-spark --benchmark tpch --scale 1.0 \
--platform-option workgroup=my-spark-workgroup \
--platform-option s3_staging_dir=s3://my-bucket/benchbox \
--platform-option coordinator_dpu_size=2 \
--platform-option max_concurrent_dpus=40
# Dry-run to preview queries
benchbox run --platform athena-spark --benchmark tpch --dry-run ./preview \
--platform-option workgroup=my-spark-workgroup \
--platform-option s3_staging_dir=s3://my-bucket/benchbox
Platform Options¶
| Option | Default | Description |
|---|---|---|
| workgroup | required | Spark-enabled Athena workgroup name |
| s3_staging_dir | required | S3 path for data staging |
| region | us-east-1 | AWS region |
| database | benchbox | Glue Data Catalog database |
| coordinator_dpu_size | 1 | Coordinator DPU size |
| max_concurrent_dpus | 20 | Maximum concurrent DPUs |
| default_executor_dpu_size | 1 | Default executor DPU size |
| idle_timeout_minutes | 15 | Session idle timeout (minutes) |
| timeout_minutes | 60 | Calculation timeout (minutes) |
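The DPU and timeout options shown in the CLI examples presumably map to constructor keyword arguments as well; the sketch below assumes that mapping and should be checked against the adapter's actual signature:

```python
from benchbox.platforms.aws import AthenaSparkAdapter

# Assumed: platform options map 1:1 to constructor kwargs (verify locally)
adapter = AthenaSparkAdapter(
    workgroup="my-spark-workgroup",
    s3_staging_dir="s3://my-bucket/benchbox",
    region="us-east-1",
    coordinator_dpu_size=2,   # assumed kwarg, mirrors --platform-option
    max_concurrent_dpus=40,   # assumed kwarg, mirrors --platform-option
)
```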
Python API¶
from benchbox.platforms.aws import AthenaSparkAdapter
# Initialize with workgroup and staging
adapter = AthenaSparkAdapter(
workgroup="my-spark-workgroup",
s3_staging_dir="s3://my-bucket/benchbox",
region="us-east-1",
)
# Start session
adapter.create_connection()
# Create schema
adapter.create_schema("tpch_benchmark")
# Load data to S3 and create tables
adapter.load_data(
tables=["lineitem", "orders", "customer"],
source_dir="/path/to/tpch/data",
)
# Execute query via session
result = adapter.execute_query("SELECT COUNT(*) FROM lineitem")
# Terminate session
adapter.close()
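Because a session accrues DPU-hours while it is active, it is worth guaranteeing cleanup even when a query fails; a minimal pattern using the same adapter API:

```python
adapter.create_connection()
try:
    result = adapter.execute_query("SELECT COUNT(*) FROM lineitem")
finally:
    # Terminate the session even if the query raises, so billing stops
    adapter.close()
```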
Execution Model¶
Athena Spark uses a session-based execution model:
Session Start - Start a session in a Spark-enabled workgroup
Calculation Submit - Submit SQL or PySpark calculations
Result Retrieval - Results written to S3 automatically
Session End - Terminate session when complete (or auto-terminate on idle)
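BenchBox drives this lifecycle for you, but the same four steps can be reproduced directly against the Athena API. A minimal boto3 sketch (the workgroup name and region are placeholders, and the polling loops are simplified — a production client would also handle FAILED states):

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# 1. Session start in a Spark-enabled workgroup
session = athena.start_session(
    WorkGroup="my-spark-workgroup",
    EngineConfiguration={"MaxConcurrentDpus": 20},
)
session_id = session["SessionId"]

# Wait until the session is IDLE before submitting work
while athena.get_session_status(SessionId=session_id)["Status"]["State"] != "IDLE":
    time.sleep(2)

# 2. Calculation submit (a PySpark code block)
calc = athena.start_calculation_execution(
    SessionId=session_id,
    CodeBlock="spark.sql('SELECT 1').show()",
)

# 3. Result retrieval: poll status; outputs are written to S3 automatically
while True:
    status = athena.get_calculation_execution(
        CalculationExecutionId=calc["CalculationExecutionId"]
    )
    if status["Status"]["State"] in ("COMPLETED", "FAILED", "CANCELED"):
        break
    time.sleep(2)
print(status["Status"]["State"], status.get("Result", {}).get("StdOutS3Uri"))

# 4. Session end (or let the idle timeout terminate it)
athena.terminate_session(SessionId=session_id)
```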
Creating a Spark-Enabled Workgroup¶
Via AWS Console¶
Go to Athena Console
Navigate to Workgroups
Click Create workgroup
Select Apache Spark as the engine type
Configure:
Name: e.g., “spark-benchbox”
Query result location: s3://your-bucket/athena-results/
DPU allocation settings
Create workgroup
Via AWS CLI¶
aws athena create-work-group \
--name spark-benchbox \
--configuration '{"EngineVersion": {"SelectedEngineVersion": "PySpark engine version 3"}}' \
--description "Spark workgroup for BenchBox"
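The same call via boto3; note that a Spark-enabled workgroup typically also needs an execution role and a result location, which the CLI example above omits (the role ARN below is a placeholder):

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")
athena.create_work_group(
    Name="spark-benchbox",
    Description="Spark workgroup for BenchBox",
    Configuration={
        "EngineVersion": {"SelectedEngineVersion": "PySpark engine version 3"},
        # IAM role the session assumes for S3/Glue access (placeholder ARN)
        "ExecutionRole": "arn:aws:iam::123456789012:role/athena-spark-role",
        "ResultConfiguration": {"OutputLocation": "s3://your-bucket/athena-results/"},
    },
)
```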
DPU Configuration¶
| Size | vCPUs | Memory | Use Case |
|---|---|---|---|
| 1 DPU | 4 | 16 GB | Development, small datasets |
| 2 DPU | 8 | 32 GB | General benchmarking |
| 4 DPU | 16 | 64 GB | Large scale factors |
Authentication¶
Athena Spark uses standard AWS authentication:
AWS CLI (Development)¶
aws configure
Environment Variables (Automation)¶
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_REGION=us-east-1
IAM Role (EC2/ECS/Lambda)¶
No configuration needed - automatically uses instance/task role.
Cost Estimation¶
Athena Spark uses DPU-hour billing:
| Resource | Price |
|---|---|
| DPU-hour | ~$0.35 |

| Scale Factor | Data Size | Est. Runtime | Est. Cost* |
|---|---|---|---|
| 0.01 | ~10 MB | ~15 min | ~$0.10 |
| 1.0 | ~1 GB | ~45 min | ~$0.75 |
| 10.0 | ~10 GB | ~2 hours | ~$3.00 |
| 100.0 | ~100 GB | ~5 hours | ~$10.00 |
*Estimates based on 2 DPUs. Actual costs vary.
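These figures follow directly from the DPU-hour model; a small helper to reproduce them for your own configuration (the rate is the assumed us-east-1 price — check current pricing for your region):

```python
DPU_HOUR_RATE = 0.35  # USD per DPU-hour, assumed us-east-1 rate

def estimate_cost(dpus: int, runtime_hours: float) -> float:
    """Rough session cost: DPUs held for the session duration."""
    return dpus * runtime_hours * DPU_HOUR_RATE

# SF 1.0 row above: 2 DPUs for ~45 minutes
print(f"${estimate_cost(2, 0.75):.2f}")  # ~$0.53 before per-run overhead
```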
Session vs Batch Comparison¶
| Aspect | Athena Spark (Sessions) | EMR Serverless (Batches) |
|---|---|---|
| Startup | Sub-second | Seconds to minutes |
| Model | Interactive sessions | Batch jobs |
| State | Persists in session | Stateless per job |
| Use Case | Ad-hoc, exploration | Production pipelines |
| Billing | DPU-hour | vCPU-hour + GB-hour |
Troubleshooting¶
Common Issues¶
Workgroup not Spark-enabled:
Athena Spark requires a workgroup with Apache Spark engine
SQL workgroups will not work
Create a new Spark-enabled workgroup
Session fails to start:
Check workgroup DPU limits
Verify IAM permissions for Athena and S3
Check service quotas
Calculation timeout:
Increase timeout_minutes for large scale factors
Check the session hasn’t auto-terminated
Review calculation logs in CloudWatch
S3 access denied:
Check S3 bucket permissions
Verify IAM role has S3 read/write access
Ensure the bucket is in the same region as the workgroup
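When diagnosing session problems, the Athena API exposes session state directly; a small boto3 sketch for inspecting recent sessions in a workgroup (the workgroup name is a placeholder):

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# List recent sessions and their states (IDLE, BUSY, FAILED, TERMINATED, ...)
for s in athena.list_sessions(WorkGroup="my-spark-workgroup")["Sessions"]:
    print(s["SessionId"], s["Status"]["State"], s["Status"].get("StateChangeReason"))
```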