AWS Athena Platform¶
Amazon Athena is AWS’s serverless interactive query service for analyzing data directly in Amazon S3 using standard SQL. Under the hood, Athena runs Trino, optimized for ad-hoc querying of data lakes.
Features¶
Serverless - No infrastructure to manage, scales automatically
Pay-per-query - Charged based on data scanned ($5 per TB)
S3 native - Query data directly in S3 without data movement
AWS Glue integration - Uses Glue Data Catalog for metadata
Multiple formats - Parquet, ORC, JSON, CSV, Avro support
Partition pruning - Efficient queries on partitioned data
Installation¶
# Install required dependencies
pip install pyathena boto3
# Or via BenchBox extras
pip install "benchbox[athena]"
Configuration¶
Environment Variables¶
# AWS credentials
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
# Or use AWS profile
export AWS_PROFILE=your-profile
CLI Options¶
benchbox run --platform athena --benchmark tpch --scale 1.0 \
--platform-option region=us-east-1 \
--platform-option workgroup=primary \
--platform-option database=benchbox \
--platform-option s3_staging_dir=s3://your-bucket/athena-results/
Platform Options¶
Option |
Default |
Description |
|---|---|---|
|
us-east-1 |
AWS region |
|
primary |
Athena workgroup for cost tracking |
|
default |
Glue Data Catalog database |
|
(required) |
S3 path for query results |
|
(none) |
S3 path for benchmark data |
|
AwsDataCatalog |
Data catalog name |
|
(none) |
AWS credentials profile |
Usage Examples¶
Basic Benchmark Run¶
# Run TPC-H on Athena
benchbox run --platform athena --benchmark tpch --scale 1.0 \
--platform-option s3_staging_dir=s3://my-bucket/athena-results/ \
--platform-option database=benchmarks
Python API¶
from benchbox import TPCH
from benchbox.platforms.athena import AthenaAdapter
# Initialize adapter
adapter = AthenaAdapter(
region="us-east-1",
workgroup="primary",
database="benchmarks",
s3_staging_dir="s3://my-bucket/athena-results/",
s3_data_dir="s3://my-bucket/benchmark-data/",
)
# Load and run benchmark
benchmark = TPCH(scale_factor=1.0)
adapter.load_benchmark(benchmark)
results = adapter.run_benchmark(benchmark)
S3 Data Staging¶
BenchBox stages benchmark data to S3 before querying:
# Specify data location
benchbox run --platform athena --benchmark tpch --scale 1.0 \
--output s3://my-bucket/benchmarks/tpch_sf1/
Recommended Data Format¶
For best performance, use Parquet with Snappy compression:
benchbox run --platform athena --benchmark tpch \
--convert-format parquet \
--compression-type snappy
Cost Optimization¶
Reduce Data Scanned¶
Athena charges $5 per TB scanned. Optimize costs with:
Columnar formats - Parquet/ORC scan only needed columns
Partitioning - Partition by date/region for predicate pushdown
Compression - Smaller files = less data scanned
Workgroup Limits¶
Set query data scan limits in your workgroup:
# Create workgroup with cost controls
aws athena create-work-group \
--name benchbox \
--configuration "BytesScannedCutoffPerQuery=10737418240" # 10 GB limit
Cost Estimation¶
Scale Factor |
Data Size |
Est. Full Run Cost |
|---|---|---|
0.1 |
~100 MB |
< $0.01 |
1.0 |
~1 GB |
~$0.02 |
10.0 |
~10 GB |
~$0.20 |
100.0 |
~100 GB |
~$2.00 |
Performance Tips¶
Use Partitioning¶
-- Create partitioned table
CREATE EXTERNAL TABLE lineitem (...)
PARTITIONED BY (l_shipdate STRING)
STORED AS PARQUET
LOCATION 's3://bucket/lineitem/'
Enable Query Result Reuse¶
benchbox run --platform athena --benchmark tpch \
--platform-option result_reuse_enabled=true
Optimize File Sizes¶
Minimum: 128 MB per file
Optimal: 256 MB - 1 GB per file
Avoid many small files
Limitations¶
Query timeout: 30 minutes maximum
Result size: 2 GB maximum per query
Concurrent queries: Limited by workgroup (default: 20)
No updates: Read-only queries on S3 data
Troubleshooting¶
Access Denied¶
# Verify S3 permissions
aws s3 ls s3://your-bucket/
# Check IAM policy includes:
# - s3:GetObject
# - s3:ListBucket
# - s3:PutObject (for results)
# - athena:StartQueryExecution
# - glue:GetTable, glue:GetDatabase
Query Timeout¶
# For long-running queries, increase timeout
benchbox run --platform athena --benchmark tpcds \
--platform-option query_timeout=1800 # 30 minutes
Data Not Found¶
# Verify Glue table exists
aws glue get-table --database-name benchmarks --name lineitem
# Run MSCK REPAIR for partitioned tables
MSCK REPAIR TABLE lineitem;