AWS Athena Platform¶

Tags intermediate guide athena cloud-platform

Amazon Athena is AWS’s serverless interactive query service for analyzing data directly in Amazon S3 using standard SQL. Under the hood, Athena runs Trino, optimized for ad-hoc querying of data lakes.

Features¶

Serverless - No infrastructure to manage, scales automatically
Pay-per-query - Charged based on data scanned ($5 per TB)
S3 native - Query data directly in S3 without data movement
AWS Glue integration - Uses Glue Data Catalog for metadata
Multiple formats - Parquet, ORC, JSON, CSV, Avro support
Partition pruning - Efficient queries on partitioned data

Installation¶

# Install required dependencies
pip install pyathena boto3

# Or via BenchBox extras
pip install "benchbox[athena]"

Configuration¶

Environment Variables¶

# AWS credentials
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1

# Or use AWS profile
export AWS_PROFILE=your-profile

CLI Options¶

benchbox run --platform athena --benchmark tpch --scale 1.0 \
  --platform-option region=us-east-1 \
  --platform-option workgroup=primary \
  --platform-option database=benchbox \
  --platform-option s3_staging_dir=s3://your-bucket/athena-results/

Platform Options¶

Option	Default	Description
`region`	us-east-1	AWS region
`workgroup`	primary	Athena workgroup for cost tracking
`database`	default	Glue Data Catalog database
`s3_staging_dir`	(required)	S3 path for query results
`s3_data_dir`	(none)	S3 path for benchmark data
`catalog`	AwsDataCatalog	Data catalog name
`aws_profile`	(none)	AWS credentials profile

Usage Examples¶

Basic Benchmark Run¶

# Run TPC-H on Athena
benchbox run --platform athena --benchmark tpch --scale 1.0 \
  --platform-option s3_staging_dir=s3://my-bucket/athena-results/ \
  --platform-option database=benchmarks

Python API¶

from benchbox import TPCH
from benchbox.platforms.athena import AthenaAdapter

# Initialize adapter
adapter = AthenaAdapter(
    region="us-east-1",
    workgroup="primary",
    database="benchmarks",
    s3_staging_dir="s3://my-bucket/athena-results/",
    s3_data_dir="s3://my-bucket/benchmark-data/",
)

# Load and run benchmark
benchmark = TPCH(scale_factor=1.0)
adapter.load_benchmark(benchmark)
results = adapter.run_benchmark(benchmark)

S3 Data Staging¶

BenchBox stages benchmark data to S3 before querying:

# Specify data location
benchbox run --platform athena --benchmark tpch --scale 1.0 \
  --output s3://my-bucket/benchmarks/tpch_sf1/

Recommended Data Format¶

For best performance, use Parquet with Snappy compression:

benchbox run --platform athena --benchmark tpch \
  --convert-format parquet \
  --compression-type snappy

Cost Optimization¶

Reduce Data Scanned¶

Athena charges $5 per TB scanned. Optimize costs with:

Columnar formats - Parquet/ORC scan only needed columns
Partitioning - Partition by date/region for predicate pushdown
Compression - Smaller files = less data scanned

Workgroup Limits¶

Set query data scan limits in your workgroup:

# Create workgroup with cost controls
aws athena create-work-group \
  --name benchbox \
  --configuration "BytesScannedCutoffPerQuery=10737418240"  # 10 GB limit

Cost Estimation¶

Scale Factor	Data Size	Est. Full Run Cost
0.1	~100 MB	< $0.01
1.0	~1 GB	~$0.02
10.0	~10 GB	~$0.20
100.0	~100 GB	~$2.00

Performance Tips¶

Use Partitioning¶

-- Create partitioned table
CREATE EXTERNAL TABLE lineitem (...)
PARTITIONED BY (l_shipdate STRING)
STORED AS PARQUET
LOCATION 's3://bucket/lineitem/'

Enable Query Result Reuse¶

benchbox run --platform athena --benchmark tpch \
  --platform-option result_reuse_enabled=true

Optimize File Sizes¶

Minimum: 128 MB per file
Optimal: 256 MB - 1 GB per file
Avoid many small files

Limitations¶

Query timeout: 30 minutes maximum
Result size: 2 GB maximum per query
Concurrent queries: Limited by workgroup (default: 20)
No updates: Read-only queries on S3 data

Troubleshooting¶

Access Denied¶

# Verify S3 permissions
aws s3 ls s3://your-bucket/

# Check IAM policy includes:
# - s3:GetObject
# - s3:ListBucket
# - s3:PutObject (for results)
# - athena:StartQueryExecution
# - glue:GetTable, glue:GetDatabase

Query Timeout¶

# For long-running queries, increase timeout
benchbox run --platform athena --benchmark tpcds \
  --platform-option query_timeout=1800  # 30 minutes

Data Not Found¶

# Verify Glue table exists
aws glue get-table --database-name benchmarks --name lineitem

# Run MSCK REPAIR for partitioned tables
MSCK REPAIR TABLE lineitem;