Amazon Redshift Platform Adapter¶
The Redshift adapter provides AWS-native data warehouse execution with S3 integration and columnar storage optimization.
Overview¶
Amazon Redshift is a fully managed petabyte-scale data warehouse service that provides:
Columnar storage - Optimized for analytical queries
Massively parallel processing - Distributed query execution
S3 integration - COPY command for bulk loading
Automatic backups - Point-in-time recovery capabilities
Concurrency scaling - Automatic scaling for concurrent workloads
Redshift Spectrum - Query data directly in S3
Common use cases:
AWS-native analytics workloads
Large-scale data warehousing (TB to PB scale)
Integration with AWS ecosystem
Reserved or serverless deployment options
Federated queries across data lake and warehouse
Quick Start¶
Basic Configuration¶
from benchbox.tpch import TPCH
from benchbox.platforms.redshift import RedshiftAdapter
# Connect to Redshift cluster
adapter = RedshiftAdapter(
host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
port=5439,
database="dev",
username="admin",
password="SecurePassword123"
)
# Run benchmark
benchmark = TPCH(scale_factor=1.0)
results = benchmark.run_with_platform(adapter)
With S3 Data Loading¶
# Efficient loading via S3 COPY command
adapter = RedshiftAdapter(
host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
username="admin",
password="SecurePassword123",
database="benchbox",
s3_bucket="my-redshift-data",
s3_prefix="benchbox/staging",
iam_role="arn:aws:iam::123456789:role/RedshiftCopyRole"
)
API Reference¶
RedshiftAdapter Class¶
- class RedshiftAdapter(**config)[source]¶
Bases: PlatformAdapter
Amazon Redshift platform adapter with S3 integration.
- __init__(**config)[source]¶
Initialize the platform adapter with configuration.
- Parameters:
**config – Platform-specific configuration options
- property platform_name: str¶
Return the name of this database platform.
Default implementation returns the class name. Concrete adapters may override to provide a user-friendly display name. Tests may instantiate lightweight mock adapters without overriding this property.
- get_platform_info(connection=None)[source]¶
Get Redshift platform information.
Captures comprehensive Redshift configuration including:
- Deployment type (serverless vs provisioned)
- Capacity configuration (RPUs for serverless, node type/count for provisioned)
- Redshift version
- WLM (Workload Management) configuration
- AWS region
- Encryption and security settings
Uses a fallback chain: AWS API → SQL queries → hostname parsing. Gracefully degrades if permissions are insufficient or AWS credentials are unavailable.
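For example, the captured metadata can be inspected before a run. This is a minimal sketch assuming the method returns a dict-like mapping; the actual keys depend on deployment type and available permissions:
# Inspect platform metadata before benchmarking (keys shown are illustrative)
conn = adapter.create_connection()
info = adapter.get_platform_info(conn)
print(f"Platform: {adapter.platform_name}")
for key, value in info.items():
    print(f"{key}: {value}")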
- check_server_database_exists(**connection_config)[source]¶
Check if database exists in Redshift cluster.
Connects to admin database to query pg_database for the target database.
- drop_database(**connection_config)[source]¶
Drop database in Redshift cluster.
Connects to admin database to drop the target database. Note: DROP DATABASE must run with autocommit enabled, and Redshift doesn't support IF EXISTS for DROP DATABASE, so the adapter checks for the database first.
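If you manage databases outside the adapter, the same constraints apply. A minimal sketch, assuming a psycopg2-compatible connection (the adapter's drop_database() performs an equivalent sequence internally):
conn.autocommit = True  # DROP DATABASE cannot run inside a transaction
with conn.cursor() as cur:
    # No IF EXISTS support, so check pg_database first
    cur.execute("SELECT 1 FROM pg_database WHERE datname = 'benchbox'")
    if cur.fetchone():
        cur.execute("DROP DATABASE benchbox")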
- create_schema(benchmark, connection)[source]¶
Create schema using Redshift-optimized table definitions.
- load_data(benchmark, connection, data_dir)[source]¶
Load data using Redshift COPY command with S3 integration.
- configure_for_benchmark(connection, benchmark_type)[source]¶
Apply Redshift-specific optimizations based on benchmark type.
- validate_session_cache_control(connection)[source]¶
Validate that session-level cache control settings were successfully applied.
- Parameters:
connection (Any) – Active Redshift database connection
- Returns:
dict with the following keys:
validated: bool - Whether validation passed
cache_disabled: bool - Whether cache is actually disabled
settings: dict - Actual session settings
warnings: list[str] - Any validation warnings
errors: list[str] - Any validation errors
- Return type:
dict
- Raises:
ConfigurationError – If cache control validation fails and strict_validation=True
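A usage sketch based on the documented return keys:
# Confirm result-cache suppression took effect before timing queries
validation = adapter.validate_session_cache_control(conn)
if not validation["validated"] or not validation["cache_disabled"]:
    for message in validation["warnings"] + validation["errors"]:
        print(f"cache control issue: {message}")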
- execute_query(connection, query, query_id, benchmark_type=None, scale_factor=None, validate_row_count=True, stream_id=None)[source]¶
Execute query with detailed timing and performance tracking.
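A usage sketch; the structure of the returned result is not documented here, so only the call itself is shown, with placeholder query text and identifiers:
# Placeholder query and identifiers for illustration
result = adapter.execute_query(
    conn,
    "SELECT count(*) FROM lineitem",
    query_id="q0",
    benchmark_type="tpch",
    scale_factor=1.0,
    validate_row_count=False,
)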
- supports_tuning_type(tuning_type)[source]¶
Check if Redshift supports a specific tuning type.
Redshift supports:
- DISTRIBUTION: Via DISTSTYLE and DISTKEY clauses
- SORTING: Via SORTKEY clause (compound and interleaved)
- PARTITIONING: Through table design patterns and date partitioning
- Parameters:
tuning_type – The type of tuning to check support for
- Returns:
True if the tuning type is supported by Redshift
- Return type:
bool
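A capability-check sketch; whether the method expects an enum or a string for tuning_type is not shown here, so the identifiers below are illustrative:
# Illustrative identifiers matching the tuning types listed above
for tuning_type in ("DISTRIBUTION", "SORTING", "PARTITIONING"):
    if adapter.supports_tuning_type(tuning_type):
        print(f"{tuning_type} is supported on Redshift")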
- generate_tuning_clause(table_tuning)[source]¶
Generate Redshift-specific tuning clauses for CREATE TABLE statements.
Redshift supports:
- DISTSTYLE (EVEN | KEY | ALL) and DISTKEY (column)
- SORTKEY (column1, column2, …) or INTERLEAVED SORTKEY (column1, column2, …)
- Parameters:
table_tuning – The tuning configuration for the table
- Returns:
SQL clause string to be appended to CREATE TABLE statement
- Return type:
str
- apply_table_tunings(table_tuning, connection)[source]¶
Apply tuning configurations to a Redshift table.
Redshift tuning approach:
- DISTRIBUTION: Handled via DISTSTYLE/DISTKEY in CREATE TABLE
- SORTING: Handled via SORTKEY in CREATE TABLE
- Post-creation optimizations via ANALYZE and VACUUM
- Parameters:
table_tuning – The tuning configuration to apply
connection (Any) – Redshift connection
- Raises:
ValueError – If the tuning configuration is invalid for Redshift
- apply_unified_tuning(unified_config, connection)[source]¶
Apply unified tuning configuration to Redshift.
- Parameters:
unified_config (UnifiedTuningConfiguration) – Unified tuning configuration to apply
connection (Any) – Redshift connection
- apply_platform_optimizations(platform_config, connection)[source]¶
Apply Redshift-specific platform optimizations.
Redshift optimizations include:
- Workload Management (WLM) queue configuration
- Query group settings for resource allocation
- Compression encoding optimization
- Statistics collection and maintenance
- Parameters:
platform_config (PlatformOptimizationConfiguration) – Platform optimization configuration
connection (Any) – Redshift connection
- apply_constraint_configuration(primary_key_config, foreign_key_config, connection)[source]¶
Apply constraint configurations to Redshift.
Note: Redshift supports PRIMARY KEY and FOREIGN KEY constraints for query optimization, but they are informational only (not enforced). Constraints must be declared at table creation time.
- Parameters:
primary_key_config (PrimaryKeyConfiguration) – Primary key constraint configuration
foreign_key_config (ForeignKeyConfiguration) – Foreign key constraint configuration
connection (Any) – Redshift connection
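For reference, informational constraints in Redshift are plain SQL declared in CREATE TABLE; they are recorded for the planner but never enforced:
# Informational constraints only: Redshift does not enforce
# uniqueness or referential integrity
cursor.execute("""
CREATE TABLE customer (
c_custkey BIGINT PRIMARY KEY,
c_name VARCHAR(25)
)
""")
cursor.execute("""
CREATE TABLE orders (
o_orderkey BIGINT PRIMARY KEY,
o_custkey BIGINT REFERENCES customer (c_custkey),
o_orderdate DATE
)
""")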
Constructor Parameters¶
RedshiftAdapter(
host: str,
username: str,
password: str,
port: int = 5439,
database: str = "dev",
cluster_identifier: Optional[str] = None,
s3_bucket: Optional[str] = None,
s3_prefix: str = "benchbox-data",
iam_role: Optional[str] = None,
aws_access_key_id: Optional[str] = None,
aws_secret_access_key: Optional[str] = None,
aws_region: str = "us-east-1",
wlm_query_slot_count: int = 1,
compression_encoding: str = "AUTO",
auto_vacuum: bool = True,
auto_analyze: bool = True
)
Parameters:
Connection:
- host (str): Cluster endpoint hostname
- username (str): Database user
- password (str): User password
- port (int): Cluster port. Default: 5439
- database (str): Database name. Default: "dev"
- cluster_identifier (str, optional): Redshift cluster identifier
S3 Integration:
- s3_bucket (str, optional): S3 bucket for data staging
- s3_prefix (str): Prefix within the bucket. Default: "benchbox-data"
- iam_role (str, optional): IAM role ARN used by COPY
- aws_access_key_id (str, optional): AWS access key
- aws_secret_access_key (str, optional): AWS secret key
- aws_region (str): AWS region. Default: "us-east-1"
Optimization:
- wlm_query_slot_count (int): WLM query slots per query. Default: 1
- compression_encoding (str): Compression encoding. Default: "AUTO"
- auto_vacuum (bool): Automatically run VACUUM on tables. Default: True
- auto_analyze (bool): Automatically run ANALYZE on tables. Default: True
Configuration Examples¶
IAM Role Authentication (Recommended)¶
# Create IAM role with S3 read permissions
aws iam create-role --role-name RedshiftCopyRole \
--assume-role-policy-document '{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"Service": "redshift.amazonaws.com"},
"Action": "sts:AssumeRole"
}]
}'
# Attach S3 read policy
aws iam attach-role-policy --role-name RedshiftCopyRole \
--policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
# Associate role with cluster
aws redshift modify-cluster-iam-roles \
--cluster-identifier my-cluster \
--add-iam-roles arn:aws:iam::123456789:role/RedshiftCopyRole
adapter = RedshiftAdapter(
host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
username="admin",
password="password",
s3_bucket="my-data-bucket",
iam_role="arn:aws:iam::123456789:role/RedshiftCopyRole"
)
Access Key Authentication¶
adapter = RedshiftAdapter(
host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
username="admin",
password="password",
s3_bucket="my-data-bucket",
aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
aws_secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
aws_region="us-east-1"
)
Workload Management (WLM)¶
# Use multiple query slots for large queries
adapter = RedshiftAdapter(
host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
username="admin",
password="password",
wlm_query_slot_count=3 # Use 3 slots for more resources
)
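With manual WLM, the slot count corresponds to a session-level setting; the adapter presumably applies it per session, and you can also set it directly:
# Allocate 3 slots from the queue to the current session's queries
cursor.execute("SET wlm_query_slot_count TO 3")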
Data Loading¶
Via S3 COPY (Recommended)¶
from benchbox.platforms.redshift import RedshiftAdapter
from benchbox.tpch import TPCH
from pathlib import Path
adapter = RedshiftAdapter(
host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
username="admin",
password="password",
database="benchbox",
s3_bucket="my-redshift-staging",
s3_prefix="benchbox/tpch",
iam_role="arn:aws:iam::123456789:role/RedshiftCopyRole"
)
# Generate data locally
benchmark = TPCH(scale_factor=1.0)
data_dir = Path("./tpch_data")
benchmark.generate_data(data_dir)
# Load data (automatically uploads to S3 then uses COPY)
conn = adapter.create_connection()
table_stats, load_time = adapter.load_data(benchmark, conn, data_dir)
print(f"Loaded {sum(table_stats.values()):,} rows in {load_time:.2f}s")
Direct Loading (Small Datasets)¶
# For small datasets (< 100MB), skip S3
adapter = RedshiftAdapter(
host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
username="admin",
password="password"
# No s3_bucket specified - uses direct INSERT
)
Advanced Features¶
Distribution Keys¶
# Create table with distribution key
cursor.execute("""
CREATE TABLE orders (
o_orderkey BIGINT,
o_custkey BIGINT,
o_orderstatus CHAR(1),
o_totalprice DECIMAL(15,2),
o_orderdate DATE
)
DISTSTYLE KEY
DISTKEY (o_custkey)
SORTKEY (o_orderdate)
""")
Sort Keys¶
-- Compound sort key (most common)
CREATE TABLE lineitem (...)
SORTKEY (l_shipdate, l_orderkey)

-- Interleaved sort key (for multiple filters)
CREATE TABLE lineitem (...)
INTERLEAVED SORTKEY (l_shipdate, l_orderkey, l_partkey)
Compression¶
# Automatic compression analysis
adapter = RedshiftAdapter(
host="my-cluster...",
compression_encoding="AUTO" # Redshift analyzes and applies optimal encoding
)
# Check compression
cursor.execute("""
SELECT
"column",
type,
encoding
FROM pg_table_def
WHERE tablename = 'lineitem'
""")
Vacuum and Analyze¶
# Manual maintenance
adapter.vacuum_table(conn, "lineitem")
adapter.analyze_table(conn, "lineitem")
# Or automatic
adapter = RedshiftAdapter(
host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
username="admin",
password="password",
auto_vacuum=True,
auto_analyze=True
)
Best Practices¶
Distribution Strategy¶
Choose appropriate DISTSTYLE:
-- EVEN: Small tables, no joins
CREATE TABLE region (...) DISTSTYLE EVEN

-- KEY: Large fact tables (distribute by join key)
CREATE TABLE orders (...) DISTSTYLE KEY DISTKEY (o_custkey)

-- ALL: Small dimension tables (broadcast to all nodes)
CREATE TABLE nation (...) DISTSTYLE ALL
Sort Keys¶
Use compound sort keys for range/equality filters:
-- Good for: WHERE l_shipdate BETWEEN ... AND l_orderkey = ...
SORTKEY (l_shipdate, l_orderkey)
Use interleaved for multiple filter combinations:
-- Good for varying filter combinations
INTERLEAVED SORTKEY (l_shipdate, l_orderkey, l_partkey)
Data Loading¶
Use COPY from S3 for best performance (see the sketch after this list)
Load compressed files (GZIP recommended)
Use manifest files for multiple files
Run ANALYZE after loading
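A sketch of a COPY invocation that follows these guidelines; the bucket, manifest path, and IAM role ARN are placeholders:
# COPY from a manifest of gzipped, pipe-delimited files, then refresh statistics
cursor.execute("""
COPY lineitem
FROM 's3://my-redshift-staging/benchbox/tpch/lineitem.manifest'
IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftCopyRole'
MANIFEST
GZIP
DELIMITER '|'
""")
cursor.execute("ANALYZE lineitem")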
Cost Optimization¶
Use reserved instances for predictable workloads
Pause clusters when not in use (see the boto3 sketch after this list)
Use concurrency scaling for burst workloads
Monitor query performance with system tables
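A pause/resume sketch using boto3; it assumes a provisioned cluster that supports pause/resume and credentials with the matching Redshift permissions:
# Stop compute billing while the cluster is idle (storage is still billed)
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")
redshift.pause_cluster(ClusterIdentifier="my-cluster")
# Later, before the next benchmark run:
redshift.resume_cluster(ClusterIdentifier="my-cluster")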
Common Issues¶
Connection Timeout¶
Problem: Cannot connect to cluster
Solutions:
# 1. Check cluster status
aws redshift describe-clusters --cluster-identifier my-cluster
# 2. Verify security group allows inbound on port 5439
# 3. Check VPC routing and NAT gateway
# 4. Test connectivity
psql -h my-cluster.123456.us-east-1.redshift.amazonaws.com \
-U admin -d dev -p 5439
S3 COPY Errors¶
Problem: COPY command fails
Solutions:
# 1. Verify IAM role permissions
# Role needs: s3:GetObject, s3:ListBucket
# 2. Check S3 bucket region matches cluster region
# 3. View error details
cursor.execute("""
SELECT * FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10
""")
Slow Query Performance¶
Problem: Queries slower than expected
Solutions:
# 1. Check query execution plan
plan = adapter.get_query_plan(conn, query)
# 2. Verify distribution keys
cursor.execute("""
SELECT
TRIM(t.name) AS table,
TRIM(c.name) AS column,
c.distkey
FROM stv_tbl_perm t
JOIN pg_attribute a ON a.attrelid = t.id
JOIN pg_class c ON c.oid = t.id
WHERE c.distkey = TRUE
""")
# 3. Check sort key usage
cursor.execute("""
SELECT * FROM svv_table_info
WHERE "table" = 'lineitem'
""")
# 4. Run VACUUM and ANALYZE
adapter.vacuum_table(conn, "lineitem")
adapter.analyze_table(conn, "lineitem")
See Also¶
Platform Documentation¶
Platform Selection Guide - Choosing Redshift vs other platforms
Multi-Platform Database Support - Quick setup for all platforms
Platform Comparison Matrix - Feature comparison
API Reference¶
Snowflake Platform Adapter - Snowflake adapter
Databricks Platform Adapter - Databricks adapter
BigQuery Platform Adapter - BigQuery adapter
Base Benchmark API - Base benchmark interface
Python API Reference - Python API overview
External Resources¶
Redshift Documentation - Official Redshift docs
Best Practices - Performance optimization
Distribution Styles - Distribution guidance
COPY Command - Data loading reference