Amazon EMR Serverless Platform¶
Amazon EMR Serverless is AWS’s serverless deployment option for running Apache Spark without managing clusters. BenchBox integrates with EMR Serverless to execute benchmarks with automatic scaling and sub-second startup times.
Features¶
Serverless - No clusters to manage, automatic scaling
Fast startup - Sub-second cold starts with pre-initialized capacity
Cost-effective - Pay only for vCPU-hours and memory-GB-hours used
Integrated - Native S3 and Glue Data Catalog integration
Spark optimization - Uses cloud-spark shared infrastructure for config tuning
Installation¶
# Install with EMR Serverless support
uv add benchbox --extra emr-serverless
# Dependencies installed: boto3
Prerequisites¶
AWS Account with EMR Serverless access
S3 Bucket for data staging and job scripts
IAM Execution Role with the following permissions:
emr-serverless:StartApplication,emr-serverless:GetApplicationemr-serverless:StartJobRun,emr-serverless:GetJobRuns3:GetObject,s3:PutObject,s3:ListBucketglue:GetDatabase,glue:CreateDatabase,glue:GetTable,glue:CreateTable
AWS Credentials configured via CLI, environment, or IAM role
Configuration¶
Environment Variables¶
# Required
export EMR_S3_STAGING_DIR=s3://your-bucket/benchbox/
export EMR_EXECUTION_ROLE_ARN=arn:aws:iam::123456789012:role/EMRServerlessRole
# Application (one of these required)
export EMR_APPLICATION_ID=00f12345abc67890 # Existing application
# OR
export EMR_CREATE_APPLICATION=true # Create new application
# Optional
export AWS_REGION=us-east-1
export EMR_DATABASE=benchbox
export EMR_RELEASE_LABEL=emr-7.0.0
CLI Usage¶
# Basic usage with existing application
benchbox run --platform emr-serverless --benchmark tpch --scale 1.0 \
--platform-option application_id=00f12345abc67890 \
--platform-option s3_staging_dir=s3://bucket/benchbox/ \
--platform-option execution_role_arn=arn:aws:iam::123456789012:role/EMRRole
# Create new application
benchbox run --platform emr-serverless --benchmark tpch --scale 1.0 \
--platform-option s3_staging_dir=s3://bucket/benchbox/ \
--platform-option execution_role_arn=arn:aws:iam::123456789012:role/EMRRole \
--platform-option create_application=true
# Dry-run to preview queries
benchbox run --platform emr-serverless --benchmark tpch --dry-run ./preview \
--platform-option s3_staging_dir=s3://bucket/benchbox/ \
--platform-option execution_role_arn=arn:aws:iam::123456789012:role/EMRRole
Platform Options¶
Option |
Default |
Description |
|---|---|---|
|
required |
S3 path for data staging |
|
required |
IAM role ARN for job execution |
|
- |
Existing EMR Serverless application ID |
|
false |
Create new application if ID not provided |
|
us-east-1 |
AWS region |
|
benchbox |
Glue Data Catalog database |
|
emr-7.0.0 |
EMR release label |
|
60 |
Job timeout in minutes |
Python API¶
from benchbox.platforms.aws import EMRServerlessAdapter
# Initialize with existing application
adapter = EMRServerlessAdapter(
application_id="00f12345abc67890",
s3_staging_dir="s3://my-bucket/benchbox/",
execution_role_arn="arn:aws:iam::123456789012:role/EMRRole",
region="us-east-1",
database="tpch_benchmark",
)
# Or create new application
adapter = EMRServerlessAdapter(
s3_staging_dir="s3://my-bucket/benchbox/",
execution_role_arn="arn:aws:iam::123456789012:role/EMRRole",
create_application=True,
application_name="benchbox-tpch",
)
# Create database in Glue Data Catalog
adapter.create_schema("tpch_sf1")
# Load data to S3 and create Glue tables
adapter.load_data(
tables=["lineitem", "orders", "customer"],
source_dir="/path/to/tpch/data",
)
# Execute query (submitted as job run)
result = adapter.execute_query("SELECT COUNT(*) FROM lineitem")
# Clean up (optionally stops application)
adapter.close()
Execution Model¶
EMR Serverless executes benchmarks as job runs within an application:
Application - Container for job runs with auto-start/stop
Job Submission - Query script uploaded to S3 and submitted
Automatic Scaling - Resources provisioned based on workload
Result Retrieval - Results read from S3 output location
Resource Tracking - vCPU-hours and memory-GB-hours logged
Pre-Initialized Capacity¶
For sub-second startup, configure pre-initialized workers:
adapter = EMRServerlessAdapter(
application_id="00f12345abc67890",
s3_staging_dir="s3://bucket/benchbox/",
execution_role_arn="arn:aws:iam::123456789012:role/EMRRole",
initial_capacity={
"Driver": {
"workerCount": 1,
"workerConfiguration": {
"cpu": "4 vCPU",
"memory": "16 GB"
}
},
"Executor": {
"workerCount": 4,
"workerConfiguration": {
"cpu": "4 vCPU",
"memory": "16 GB"
}
}
},
)
Note: Pre-initialized capacity is charged even when idle.
Cost Estimation¶
Resource |
Price |
|---|---|
vCPU-hour |
~$0.052624 |
Memory GB-hour |
~$0.0057785 |
Scale Factor |
Data Size |
Est. Runtime |
Est. Cost |
|---|---|---|---|
0.01 |
~10 MB |
~20 min |
~$0.05 |
1.0 |
~1 GB |
~1 hour |
~$0.50 |
10.0 |
~10 GB |
~3 hours |
~$3.00 |
100.0 |
~100 GB |
~8 hours |
~$16.00 |
Estimates based on auto-scaling. Actual costs vary.
IAM Role Policy¶
Minimum required IAM policy for the execution role:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::your-bucket",
"arn:aws:s3:::your-bucket/*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase",
"glue:GetTable",
"glue:CreateTable",
"glue:UpdateTable"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "arn:aws:logs:*:*:*"
}
]
}
Also add a trust relationship for EMR Serverless:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "emr-serverless.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
Troubleshooting¶
Common Issues¶
Application fails to start:
Verify IAM role has correct trust relationship
Check region has EMR Serverless available
Review application configuration in console
Job fails immediately:
Check S3 bucket permissions
Verify execution role has required permissions
Review job logs in S3 or CloudWatch
Slow startup:
Configure pre-initialized capacity for warm workers
Check application auto-start is enabled
Review network configuration (VPC settings)
Query timeout:
Increase
timeout_minutesfor large scale factorsCheck for data skew in queries
Review Spark configuration
Comparison with Other Platforms¶
Aspect |
EMR Serverless |
AWS Glue |
EMR on EC2 |
|---|---|---|---|
Cluster management |
None |
None |
Full control |
Startup time |
Sub-second* |
30-60s |
Minutes |
Billing |
vCPU + memory hours |
DPU-hours |
Instance hours |
Use case |
Interactive |
ETL batch |
Long-running |
Scaling |
Automatic |
Automatic |
Manual/Auto |
*With pre-initialized capacity