Amazon Redshift Platform Adapter¶
The Redshift adapter provides AWS-native data warehouse execution with S3 integration and columnar storage optimization.
Overview¶
Amazon Redshift is a fully managed petabyte-scale data warehouse service that provides:
Columnar storage - Optimized for analytical queries
Massively parallel processing - Distributed query execution
S3 integration - COPY command for bulk loading
Automatic backups - Point-in-time recovery capabilities
Concurrency scaling - Automatic scaling for concurrent workloads
Redshift Spectrum - Query data directly in S3
Common use cases:
AWS-native analytics workloads
Large-scale data warehousing (TB to PB scale)
Integration with AWS ecosystem
Reserved or serverless deployment options
Federated queries across data lake and warehouse
Quick Start¶
Basic Configuration¶
from benchbox.tpch import TPCH
from benchbox.platforms.redshift import RedshiftAdapter
# Connect to Redshift cluster
adapter = RedshiftAdapter(
host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
port=5439,
database="dev",
username="admin",
password="SecurePassword123"
)
# Run benchmark
benchmark = TPCH(scale_factor=1.0)
results = benchmark.run_with_platform(adapter)
With S3 Data Loading¶
# Efficient loading via S3 COPY command
adapter = RedshiftAdapter(
host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
username="admin",
password="SecurePassword123",
database="benchbox",
s3_bucket="my-redshift-data",
s3_prefix="benchbox/staging",
iam_role="arn:aws:iam::123456789:role/RedshiftCopyRole"
)
API Reference¶
RedshiftAdapter Class¶
- class RedshiftAdapter(**config)[source]¶
Bases: PlatformAdapter
Amazon Redshift platform adapter with S3 integration.
- __init__(**config)[source]¶
Initialize the platform adapter with configuration.
- Parameters:
**config – Platform-specific configuration options
- property platform_name: str¶
Return the name of this database platform.
Default implementation returns the class name. Concrete adapters may override to provide a user-friendly display name. Tests may instantiate lightweight mock adapters without overriding this property.
- get_platform_info(connection=None)[source]¶
Get Redshift platform information.
Captures comprehensive Redshift configuration including:
- Deployment type (serverless vs provisioned)
- Capacity configuration (RPUs for serverless, node type/count for provisioned)
- Redshift version
- WLM (Workload Management) configuration
- AWS region
- Encryption and security settings
Uses a fallback chain: AWS API → SQL queries → hostname parsing. Gracefully degrades if permissions are insufficient or AWS credentials are unavailable.
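For example, the captured metadata can be inspected before a run. This is a minimal sketch assuming the method returns a dict-like mapping; the actual keys depend on deployment type and available permissions:
# Inspect platform metadata before benchmarking (keys shown are illustrative)
conn = adapter.create_connection()
info = adapter.get_platform_info(conn)
print(f"Platform: {adapter.platform_name}")
for key, value in info.items():
    print(f"{key}: {value}")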
- check_server_database_exists(**connection_config)[source]¶
Check if database exists in Redshift cluster.
Connects to admin database to query pg_database for the target database.
- drop_database(**connection_config)[source]¶
Drop database in Redshift cluster.
Connects to admin database to drop the target database. Note: DROP DATABASE must run with autocommit enabled, and Redshift doesn't support IF EXISTS for DROP DATABASE, so the adapter checks for the database first.
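If you manage databases outside the adapter, the same constraints apply. A minimal sketch, assuming a psycopg2-compatible connection (the adapter's drop_database() performs an equivalent sequence internally):
conn.autocommit = True  # DROP DATABASE cannot run inside a transaction
with conn.cursor() as cur:
    # No IF EXISTS support, so check pg_database first
    cur.execute("SELECT 1 FROM pg_database WHERE datname = 'benchbox'")
    if cur.fetchone():
        cur.execute("DROP DATABASE benchbox")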
- create_schema(benchmark, connection)[source]¶
Create schema using Redshift-optimized table definitions.
- load_data(benchmark, connection, data_dir)[source]¶
Load data using Redshift COPY command with S3 integration.
- configure_for_benchmark(connection, benchmark_type)[source]¶
Apply Redshift-specific optimizations based on benchmark type.
- validate_session_cache_control(connection)[source]¶
Validate that session-level cache control settings were successfully applied.
- Parameters:
connection (Any) – Active Redshift database connection
- Returns:
dict with the following keys:
validated: bool - Whether validation passed
cache_disabled: bool - Whether cache is actually disabled
settings: dict - Actual session settings
warnings: list[str] - Any validation warnings
errors: list[str] - Any validation errors
- Return type:
dict
- Raises:
ConfigurationError – If cache control validation fails and strict_validation=True
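A usage sketch based on the documented return keys:
# Confirm result-cache suppression took effect before timing queries
validation = adapter.validate_session_cache_control(conn)
if not validation["validated"] or not validation["cache_disabled"]:
    for message in validation["warnings"] + validation["errors"]:
        print(f"cache control issue: {message}")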
- execute_query(connection, query, query_id, benchmark_type=None, scale_factor=None, validate_row_count=True, stream_id=None)[source]¶
Execute query with detailed timing and performance tracking.
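A usage sketch; the structure of the returned result is not documented here, so only the call itself is shown, with placeholder query text and identifiers:
# Placeholder query and identifiers for illustration
result = adapter.execute_query(
    conn,
    "SELECT count(*) FROM lineitem",
    query_id="q0",
    benchmark_type="tpch",
    scale_factor=1.0,
    validate_row_count=False,
)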
- supports_tuning_type(tuning_type)[source]¶
Check if Redshift supports a specific tuning type.
Redshift supports:
- DISTRIBUTION: Via DISTSTYLE and DISTKEY clauses
- SORTING: Via SORTKEY clause (compound and interleaved)
- PARTITIONING: Through table design patterns and date partitioning
- Parameters:
tuning_type – The type of tuning to check support for
- Returns:
True if the tuning type is supported by Redshift
- Return type:
bool
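A capability-check sketch; whether the method expects an enum or a string for tuning_type is not shown here, so the identifiers below are illustrative:
# Illustrative identifiers matching the tuning types listed above
for tuning_type in ("DISTRIBUTION", "SORTING", "PARTITIONING"):
    if adapter.supports_tuning_type(tuning_type):
        print(f"{tuning_type} is supported on Redshift")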
- generate_tuning_clause(table_tuning)[source]¶
Generate Redshift-specific tuning clauses for CREATE TABLE statements.
Redshift supports:
- DISTSTYLE (EVEN | KEY | ALL) and DISTKEY (column)
- SORTKEY (column1, column2, …) or INTERLEAVED SORTKEY (column1, column2, …)
- Parameters:
table_tuning – The tuning configuration for the table
- Returns:
SQL clause string to be appended to CREATE TABLE statement
- Return type:
str
- apply_table_tunings(table_tuning, connection)[source]¶
Apply tuning configurations to a Redshift table.
Redshift tuning approach:
- DISTRIBUTION: Handled via DISTSTYLE/DISTKEY in CREATE TABLE
- SORTING: Handled via SORTKEY in CREATE TABLE
- Post-creation optimizations via ANALYZE and VACUUM
- Parameters:
table_tuning – The tuning configuration to apply
connection (Any) – Redshift connection
- Raises:
ValueError – If the tuning configuration is invalid for Redshift
- apply_unified_tuning(unified_config, connection)[source]¶
Apply unified tuning configuration to Redshift.
- Parameters:
unified_config (UnifiedTuningConfiguration) – Unified tuning configuration to apply
connection (Any) – Redshift connection
- apply_platform_optimizations(platform_config, connection)[source]¶
Apply Redshift-specific platform optimizations.
Redshift optimizations include:
- Workload Management (WLM) queue configuration
- Query group settings for resource allocation
- Compression encoding optimization
- Statistics collection and maintenance
- Parameters:
platform_config (PlatformOptimizationConfiguration) – Platform optimization configuration
connection (Any) – Redshift connection
- apply_constraint_configuration(primary_key_config, foreign_key_config, connection)[source]¶
Apply constraint configurations to Redshift.
Note: Redshift supports PRIMARY KEY and FOREIGN KEY constraints for query optimization, but they are informational only (not enforced). Constraints must be declared at table creation time.
- Parameters:
primary_key_config (PrimaryKeyConfiguration) – Primary key constraint configuration
foreign_key_config (ForeignKeyConfiguration) – Foreign key constraint configuration
connection (Any) – Redshift connection
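For reference, informational constraints in Redshift are plain SQL declared in CREATE TABLE; they are recorded for the planner but never enforced:
# Informational constraints only: Redshift does not enforce
# uniqueness or referential integrity
cursor.execute("""
CREATE TABLE customer (
c_custkey BIGINT PRIMARY KEY,
c_name VARCHAR(25)
)
""")
cursor.execute("""
CREATE TABLE orders (
o_orderkey BIGINT PRIMARY KEY,
o_custkey BIGINT REFERENCES customer (c_custkey),
o_orderdate DATE
)
""")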
Constructor Parameters¶
RedshiftAdapter(
host: str,
username: str,
password: str,
port: int = 5439,
database: str = "dev",
cluster_identifier: Optional[str] = None,
s3_bucket: Optional[str] = None,
s3_prefix: str = "benchbox-data",
iam_role: Optional[str] = None,
aws_access_key_id: Optional[str] = None,
aws_secret_access_key: Optional[str] = None,
aws_region: str = "us-east-1",
wlm_query_slot_count: int = 1,
compression_encoding: str = "AUTO",
auto_vacuum: bool = True,
auto_analyze: bool = True
)
Parameters:
Connection:
- host (str): Cluster endpoint hostname
- username (str): Database user
- password (str): User password
- port (int): Cluster port. Default: 5439
- database (str): Database name. Default: "dev"
- cluster_identifier (str, optional): Redshift cluster identifier
S3 Integration:
- s3_bucket (str, optional): S3 bucket for data staging
- s3_prefix (str): Prefix within the bucket. Default: "benchbox-data"
- iam_role (str, optional): IAM role ARN used by COPY
- aws_access_key_id (str, optional): AWS access key
- aws_secret_access_key (str, optional): AWS secret key
- aws_region (str): AWS region. Default: "us-east-1"
Optimization:
- wlm_query_slot_count (int): WLM query slots per query. Default: 1
- compression_encoding (str): Compression encoding. Default: "AUTO"
- auto_vacuum (bool): Automatically run VACUUM on tables. Default: True
- auto_analyze (bool): Automatically run ANALYZE on tables. Default: True
Configuration Examples¶
IAM Role Authentication (Recommended)¶
# Create IAM role with S3 read permissions
aws iam create-role --role-name RedshiftCopyRole \
--assume-role-policy-document '{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"Service": "redshift.amazonaws.com"},
"Action": "sts:AssumeRole"
}]
}'
# Attach S3 read policy
aws iam attach-role-policy --role-name RedshiftCopyRole \
--policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
# Associate role with cluster
aws redshift modify-cluster-iam-roles \
--cluster-identifier my-cluster \
--add-iam-roles arn:aws:iam::123456789:role/RedshiftCopyRole
adapter = RedshiftAdapter(
host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
username="admin",
password="password",
s3_bucket="my-data-bucket",
iam_role="arn:aws:iam::123456789:role/RedshiftCopyRole"
)
Access Key Authentication¶
adapter = RedshiftAdapter(
host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
username="admin",
password="password",
s3_bucket="my-data-bucket",
aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
aws_secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
aws_region="us-east-1"
)
Workload Management (WLM)¶
# Use multiple query slots for large queries
adapter = RedshiftAdapter(
host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
username="admin",
password="password",
wlm_query_slot_count=3 # Use 3 slots for more resources
)
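With manual WLM, the slot count corresponds to a session-level setting; the adapter presumably applies it per session, and you can also set it directly:
# Allocate 3 slots from the queue to the current session's queries
cursor.execute("SET wlm_query_slot_count TO 3")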
Data Loading¶
Via S3 COPY (Recommended)¶
from benchbox.platforms.redshift import RedshiftAdapter
from benchbox.tpch import TPCH
from pathlib import Path
adapter = RedshiftAdapter(
host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
username="admin",
password="password",
database="benchbox",
s3_bucket="my-redshift-staging",
s3_prefix="benchbox/tpch",
iam_role="arn:aws:iam::123456789:role/RedshiftCopyRole"
)
# Generate data locally
benchmark = TPCH(scale_factor=1.0)
data_dir = Path("./tpch_data")
benchmark.generate_data(data_dir)
# Load data (automatically uploads to S3 then uses COPY)
conn = adapter.create_connection()
table_stats, load_time = adapter.load_data(benchmark, conn, data_dir)
print(f"Loaded {sum(table_stats.values()):,} rows in {load_time:.2f}s")
Direct Loading (Small Datasets)¶
# For small datasets (< 100MB), skip S3
adapter = RedshiftAdapter(
host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
username="admin",
password="password"
# No s3_bucket specified - uses direct INSERT
)
Advanced Features¶
Distribution Keys¶
# Create table with distribution key
cursor.execute("""
CREATE TABLE orders (
o_orderkey BIGINT,
o_custkey BIGINT,
o_orderstatus CHAR(1),
o_totalprice DECIMAL(15,2),
o_orderdate DATE
)
DISTSTYLE KEY
DISTKEY (o_custkey)
SORTKEY (o_orderdate)
""")
Sort Keys¶
-- Compound sort key (most common)
CREATE TABLE lineitem (...)
SORTKEY (l_shipdate, l_orderkey)

-- Interleaved sort key (for multiple filters)
CREATE TABLE lineitem (...)
INTERLEAVED SORTKEY (l_shipdate, l_orderkey, l_partkey)
Compression¶
# Automatic compression analysis
adapter = RedshiftAdapter(
host="my-cluster...",
compression_encoding="AUTO" # Redshift analyzes and applies optimal encoding
)
# Check compression
cursor.execute("""
SELECT
"column",
type,
encoding
FROM pg_table_def
WHERE tablename = 'lineitem'
""")
Vacuum and Analyze¶
# Manual maintenance
adapter.vacuum_table(conn, "lineitem")
adapter.analyze_table(conn, "lineitem")
# Or automatic
adapter = RedshiftAdapter(
host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
username="admin",
password="password",
auto_vacuum=True,
auto_analyze=True
)
Best Practices¶
Distribution Strategy¶
Choose appropriate DISTSTYLE:
-- EVEN: Small tables, no joins
CREATE TABLE region (...) DISTSTYLE EVEN

-- KEY: Large fact tables (distribute by join key)
CREATE TABLE orders (...) DISTSTYLE KEY DISTKEY (o_custkey)

-- ALL: Small dimension tables (broadcast to all nodes)
CREATE TABLE nation (...) DISTSTYLE ALL
Sort Keys¶
Use compound sort keys for range/equality filters:
-- Good for: WHERE l_shipdate BETWEEN ... AND l_orderkey = ...
SORTKEY (l_shipdate, l_orderkey)
Use interleaved for multiple filter combinations:
-- Good for varying filter combinations
INTERLEAVED SORTKEY (l_shipdate, l_orderkey, l_partkey)
Data Loading¶
Use COPY from S3 for best performance (see the sketch after this list)
Load compressed files (GZIP recommended)
Use manifest files for multiple files
Run ANALYZE after loading
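A sketch of a COPY invocation that follows these guidelines; the bucket, manifest path, and IAM role ARN are placeholders:
# COPY from a manifest of gzipped, pipe-delimited files, then refresh statistics
cursor.execute("""
COPY lineitem
FROM 's3://my-redshift-staging/benchbox/tpch/lineitem.manifest'
IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftCopyRole'
MANIFEST
GZIP
DELIMITER '|'
""")
cursor.execute("ANALYZE lineitem")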
Cost Optimization¶
Use reserved instances for predictable workloads
Pause clusters when not in use (see the boto3 sketch after this list)
Use concurrency scaling for burst workloads
Monitor query performance with system tables
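A pause/resume sketch using boto3; it assumes a provisioned cluster that supports pause/resume and credentials with the matching Redshift permissions:
# Stop compute billing while the cluster is idle (storage is still billed)
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")
redshift.pause_cluster(ClusterIdentifier="my-cluster")
# Later, before the next benchmark run:
redshift.resume_cluster(ClusterIdentifier="my-cluster")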
Common Issues¶
Connection Timeout¶
Problem: Cannot connect to cluster
Solutions:
# 1. Check cluster status
aws redshift describe-clusters --cluster-identifier my-cluster
# 2. Verify security group allows inbound on port 5439
# 3. Check VPC routing and NAT gateway
# 4. Test connectivity
psql -h my-cluster.123456.us-east-1.redshift.amazonaws.com \
-U admin -d dev -p 5439
S3 COPY Errors¶
Problem: COPY command fails
Solutions:
# 1. Verify IAM role permissions
# Role needs: s3:GetObject, s3:ListBucket
# 2. Check S3 bucket region matches cluster region
# 3. View error details
cursor.execute("""
SELECT * FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10
""")
Slow Query Performance¶
Problem: Queries slower than expected
Solutions:
# 1. Check query execution plan
plan = adapter.get_query_plan(conn, query)
# 2. Verify distribution keys
cursor.execute("""
SELECT
TRIM(t.name) AS table,
TRIM(c.name) AS column,
c.distkey
FROM stv_tbl_perm t
JOIN pg_attribute a ON a.attrelid = t.id
JOIN pg_class c ON c.oid = t.id
WHERE c.distkey = TRUE
""")
# 3. Check sort key usage
cursor.execute("""
SELECT * FROM svv_table_info
WHERE "table" = 'lineitem'
""")
# 4. Run VACUUM and ANALYZE
adapter.vacuum_table(conn, "lineitem")
adapter.analyze_table(conn, "lineitem")
See Also¶
Platform Documentation¶
Platform Selection Guide - Choosing Redshift vs other platforms
Multi-Platform Database Support - Quick setup for all platforms
Platform Comparison Matrix - Feature comparison
API Reference¶
Snowflake Platform Adapter - Snowflake adapter
Databricks Platform Adapter - Databricks adapter
BigQuery Platform Adapter - BigQuery adapter
Base Benchmark API - Base benchmark interface
Python API Reference - Python API overview
External Resources¶
Redshift Documentation - Official Redshift docs
Best Practices - Performance optimization
Distribution Styles - Distribution guidance
COPY Command - Data loading reference