AWS Glue Platform¶
AWS Glue is a fully managed, serverless ETL service that runs Apache Spark for distributed data processing. BenchBox integrates with Glue to execute benchmarks as batch jobs, leveraging the Glue Data Catalog for metadata and S3 for data storage.
Features¶
Serverless Spark - Automatic cluster provisioning and scaling
Pay-per-use - Charged per DPU-hour (~$0.44/DPU-hour)
Glue Data Catalog - Centralized metadata compatible with Athena/EMR
S3 integration - Native support for S3 data lakes
Spark optimization - Built on the shared cloud-spark infrastructure for config tuning
Installation¶
```bash
# Install with Glue support
uv add benchbox --extra glue

# Dependencies installed: boto3
```
Prerequisites¶
AWS Account with Glue service access
S3 Bucket for data staging and job scripts
IAM Role with the following permissions:
glue:CreateJob, glue:StartJobRun, glue:GetJobRun
glue:CreateDatabase, glue:GetDatabase, glue:CreateTable
s3:GetObject, s3:PutObject, s3:ListBucket
AWS Credentials configured via CLI, environment, or IAM role
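Before configuring anything, it is worth verifying that credentials actually resolve. A minimal check with boto3:

```python
import boto3

# Confirm credentials resolve; region and profile are picked up from
# the environment or ~/.aws/config.
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])
```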
Configuration¶
Environment Variables¶
```bash
# Required
export GLUE_S3_STAGING_DIR=s3://your-bucket/benchbox/
export GLUE_JOB_ROLE=arn:aws:iam::123456789012:role/GlueBenchmarkRole

# Optional
export AWS_REGION=us-east-1
export AWS_PROFILE=default
export GLUE_DATABASE=benchbox
export GLUE_WORKER_TYPE=G.1X
export GLUE_NUM_WORKERS=2
export GLUE_VERSION=4.0
```
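These variables map directly onto the adapter arguments shown in the Python API section. A minimal sketch of wiring them up by hand (the fallbacks mirror the defaults in the Platform Options table; the `int` cast is this example's choice):

```python
import os

from benchbox.platforms.aws import AWSGlueAdapter

# Required settings fail fast with a KeyError if unset.
adapter = AWSGlueAdapter(
    s3_staging_dir=os.environ["GLUE_S3_STAGING_DIR"],
    job_role=os.environ["GLUE_JOB_ROLE"],
    region=os.environ.get("AWS_REGION", "us-east-1"),
    database=os.environ.get("GLUE_DATABASE", "benchbox"),
    worker_type=os.environ.get("GLUE_WORKER_TYPE", "G.1X"),
    number_of_workers=int(os.environ.get("GLUE_NUM_WORKERS", "2")),
)
```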
CLI Usage¶
```bash
# Basic usage
benchbox run --platform glue --benchmark tpch --scale 1.0 \
    --platform-option s3_staging_dir=s3://bucket/benchbox/ \
    --platform-option job_role=arn:aws:iam::123456789012:role/GlueRole

# With custom workers
benchbox run --platform glue --benchmark tpch --scale 10.0 \
    --platform-option s3_staging_dir=s3://bucket/benchbox/ \
    --platform-option job_role=arn:aws:iam::123456789012:role/GlueRole \
    --platform-option worker_type=G.2X \
    --platform-option number_of_workers=10

# Dry-run to preview queries
benchbox run --platform glue --benchmark tpch --dry-run ./preview \
    --platform-option s3_staging_dir=s3://bucket/benchbox/ \
    --platform-option job_role=arn:aws:iam::123456789012:role/GlueRole
```
Platform Options¶
| Option | Default | Description |
|---|---|---|
| `s3_staging_dir` | required | S3 path for data staging |
| `job_role` | required | IAM role ARN for job execution |
| `region` | us-east-1 | AWS region |
| `database` | benchbox | Glue Data Catalog database |
| `worker_type` | G.1X | Worker type (G.025X, G.1X, G.2X, Z.2X) |
| `number_of_workers` | 2 | Number of workers (min 2 for standard) |
| `glue_version` | 4.0 | Glue version (3.0 or 4.0) |
| `timeout_minutes` | 60 | Job timeout in minutes |
Worker Types¶
| Type | vCPU | Memory | DPU | Cost/Hour | Best For |
|---|---|---|---|---|---|
| G.025X | 2 | 4 GB | 0.25 | ~$0.11 | Development |
| G.1X | 4 | 16 GB | 1 | ~$0.44 | Standard workloads |
| G.2X | 8 | 32 GB | 2 | ~$0.88 | Memory-intensive |
| Z.2X | 8 | 64 GB | 2 | ~$0.88 | ML workloads |
Python API¶
```python
from benchbox.platforms.aws import AWSGlueAdapter

# Initialize adapter
adapter = AWSGlueAdapter(
    s3_staging_dir="s3://my-bucket/benchbox/",
    job_role="arn:aws:iam::123456789012:role/GlueRole",
    region="us-east-1",
    database="tpch_benchmark",
    worker_type="G.1X",
    number_of_workers=4,
)

# Create database in Glue Data Catalog
adapter.create_schema("tpch_sf1")

# Load data to S3 and create Glue tables
adapter.load_data(
    tables=["lineitem", "orders", "customer"],
    source_dir="/path/to/tpch/data",
)

# Execute query (submitted as Glue job)
result = adapter.execute_query("SELECT COUNT(*) FROM lineitem")

# Clean up
adapter.close()
```
Execution Model¶
Unlike interactive query services, Glue executes benchmarks as batch jobs:
Job Creation - BenchBox creates a Glue job with optimized Spark configuration
Script Upload - Query execution script is uploaded to S3
Job Submission - Job run is started with the query as an argument
Polling - BenchBox polls for job completion
Result Retrieval - Results are read from S3 output location
This batch model means:
Longer startup times (30-60 seconds per job)
Higher throughput for large queries
DPU-hour billing for entire job duration
Results persisted in S3
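The five steps can be sketched directly against boto3; the job name, script location, and `--query` argument below are illustrative, not BenchBox's internal names:

```python
import time

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Steps 1-3: create the job (script already uploaded to S3) and start a run.
glue.create_job(
    Name="benchbox-demo",  # illustrative job name
    Role="arn:aws:iam::123456789012:role/GlueRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://bucket/benchbox/scripts/run_query.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)
run_id = glue.start_job_run(
    JobName="benchbox-demo",
    Arguments={"--query": "SELECT COUNT(*) FROM lineitem"},  # illustrative
)["JobRunId"]

# Step 4: poll until the run reaches a terminal state.
state = "STARTING"
while state not in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
    time.sleep(15)
    run = glue.get_job_run(JobName="benchbox-demo", RunId=run_id)
    state = run["JobRun"]["JobRunState"]

# Step 5: on success, results are read back from the S3 staging location.
print("Job finished:", state)
```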
Spark Configuration¶
BenchBox automatically optimizes Spark configuration based on benchmark type and scale factor:
```python
# Automatic configuration includes:
# - Adaptive Query Execution (AQE) settings
# - Shuffle partition tuning
# - Memory allocation
# - Join optimization

# For TPC-H at SF=10 with 4 G.1X workers:
# spark.sql.shuffle.partitions = 200
# spark.sql.adaptive.enabled = true
# spark.sql.adaptive.skewJoin.enabled = true
```
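One common route for passing such settings to a Glue job is the special `--conf` job argument. The tuning function below is a hypothetical illustration of scale-based logic, not BenchBox's actual implementation:

```python
def spark_conf_for(workers: int, cores_per_worker: int = 4) -> dict[str, str]:
    # Hypothetical heuristic: size shuffle partitions from available cores,
    # with a floor of 200, and always enable AQE plus skew-join handling.
    partitions = max(200, workers * cores_per_worker * 4)
    return {
        "spark.sql.shuffle.partitions": str(partitions),
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.skewJoin.enabled": "true",
    }

# Glue reads extra Spark properties from the special --conf argument;
# additional properties are chained inside the same value.
conf = spark_conf_for(workers=4)
arguments = {"--conf": " --conf ".join(f"{k}={v}" for k, v in conf.items())}
```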
Cost Estimation¶
Scale Factor |
Data Size |
Workers |
Est. Runtime |
Est. Cost |
|---|---|---|---|---|
0.01 |
~10 MB |
2x G.1X |
~10 min |
~$0.15 |
1.0 |
~1 GB |
2x G.1X |
~30 min |
~$0.44 |
10.0 |
~10 GB |
4x G.1X |
~2 hours |
~$3.52 |
100.0 |
~100 GB |
10x G.2X |
~4 hours |
~$35.20 |
Estimates based on standard TPC-H workloads. Actual costs vary.
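These figures follow from simple DPU-hour arithmetic: cost ≈ workers × DPU per worker × hours × $0.44. Checking the SF=10 row:

```python
# SF=10 row: 4 G.1X workers (1 DPU each) for ~2 hours at ~$0.44/DPU-hour.
workers, dpu_per_worker, hours, rate = 4, 1, 2.0, 0.44
print(f"~${workers * dpu_per_worker * hours * rate:.2f}")  # ~$3.52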
IAM Role Policy¶
Minimum required IAM policy for the Glue job role:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateJob",
                "glue:DeleteJob",
                "glue:GetJob",
                "glue:StartJobRun",
                "glue:GetJobRun",
                "glue:GetJobRuns",
                "glue:CreateDatabase",
                "glue:GetDatabase",
                "glue:CreateTable",
                "glue:GetTable",
                "glue:UpdateTable",
                "glue:DeleteTable"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::your-bucket",
                "arn:aws:s3:::your-bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        }
    ]
}
```
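The role must also carry a trust relationship that allows Glue to assume it; a missing trust policy is one common cause of the "Job fails to start" symptom covered under Troubleshooting:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "Service": "glue.amazonaws.com" },
            "Action": "sts:AssumeRole"
        }
    ]
}
```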
Troubleshooting¶
Common Issues¶
Job fails to start:
Verify IAM role has correct trust relationship for Glue
Check S3 bucket permissions
Ensure Glue service is available in your region
Slow job startup:
The first job in a session incurs a cold start (~30-60s)
Subsequent jobs use warm pools if available
Use G.1X or larger for production
Query timeout:
Increase `timeout_minutes` for large scale factors
Add more workers for parallel execution
Check Glue console for detailed job metrics
Data not found:
Verify S3 paths are correct (s3:// prefix required)
Check Glue Data Catalog for table definitions
Ensure table format matches uploaded files (parquet)
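To confirm a table was actually registered, the catalog can be queried directly with boto3 (the database and table names below are examples):

```python
import boto3

glue = boto3.client("glue")

# Raises EntityNotFoundException if the table was never created.
table = glue.get_table(DatabaseName="benchbox", Name="lineitem")
print(table["Table"]["StorageDescriptor"]["Location"])
```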
Comparison with Athena¶
| Aspect | AWS Glue | AWS Athena |
|---|---|---|
| Execution | Batch jobs (Spark) | Interactive queries (Trino) |
| Billing | Per DPU-hour | Per TB scanned |
| Startup | 30-60 seconds | Near-instant |
| Best for | ETL, large batch | Ad-hoc analytics |
| Spark features | Full Spark SQL | Trino SQL |
| Catalog | Glue Data Catalog | Glue Data Catalog |