Amazon Redshift Platform Adapter


The Redshift adapter provides AWS-native data warehouse execution with S3 integration and columnar storage optimization.

Overview

Amazon Redshift is a fully managed petabyte-scale data warehouse service that provides:

  • Columnar storage - Optimized for analytical queries

  • Massively parallel processing - Distributed query execution

  • S3 integration - COPY command for bulk loading

  • Automatic backups - Point-in-time recovery capabilities

  • Concurrency scaling - Automatic scaling for concurrent workloads

  • Redshift Spectrum - Query data directly in S3

Common use cases:

  • AWS-native analytics workloads

  • Large-scale data warehousing (TB to PB scale)

  • Integration with AWS ecosystem

  • Reserved or serverless deployment options

  • Federated queries across data lake and warehouse

Quick Start

Basic Configuration

from benchbox.tpch import TPCH
from benchbox.platforms.redshift import RedshiftAdapter

# Connect to Redshift cluster
adapter = RedshiftAdapter(
    host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
    port=5439,
    database="dev",
    username="admin",
    password="SecurePassword123"
)

# Run benchmark
benchmark = TPCH(scale_factor=1.0)
results = benchmark.run_with_platform(adapter)

With S3 Data Loading

# Efficient loading via S3 COPY command
adapter = RedshiftAdapter(
    host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
    username="admin",
    password="SecurePassword123",
    database="benchbox",
    s3_bucket="my-redshift-data",
    s3_prefix="benchbox/staging",
    iam_role="arn:aws:iam::123456789:role/RedshiftCopyRole"
)
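
When s3_bucket is configured, load_data() stages the generated benchmark files under s3://my-redshift-data/benchbox/staging and bulk-loads them with the COPY command using the supplied IAM role; with no bucket configured, the adapter falls back to direct INSERT (see Data Loading below).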

API Reference

RedshiftAdapter Class

class RedshiftAdapter(**config)[source]

Bases: PlatformAdapter

Amazon Redshift platform adapter with S3 integration.

__init__(**config)[source]

Initialize the platform adapter with configuration.

Parameters:

**config – Platform-specific configuration options

property platform_name: str

Return the name of this database platform.

Default implementation returns the class name. Concrete adapters may override to provide a user-friendly display name. Tests may instantiate lightweight mock adapters without overriding this property.

static add_cli_arguments(parser)[source]

Add Redshift-specific CLI arguments.
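
A minimal usage sketch, assuming a standard argparse parser (the specific flags registered are defined by the adapter and not listed here):

import argparse

parser = argparse.ArgumentParser(description="Run benchbox against Redshift")
RedshiftAdapter.add_cli_arguments(parser)  # registers Redshift-specific flags
args = parser.parse_args()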

classmethod from_config(config)[source]

Create Redshift adapter from unified configuration.

get_platform_info(connection=None)[source]

Get Redshift platform information.

Captures comprehensive Redshift configuration including:

  • Deployment type (serverless vs provisioned)

  • Capacity configuration (RPUs for serverless, node type/count for provisioned)

  • Redshift version

  • WLM (Workload Management) configuration

  • AWS region

  • Encryption and security settings

Uses the fallback chain: AWS API → SQL queries → hostname parsing. Gracefully degrades if permissions are insufficient or AWS credentials are unavailable.

get_target_dialect()[source]

Return the target SQL dialect for Redshift.

check_server_database_exists(**connection_config)[source]

Check if database exists in Redshift cluster.

Connects to admin database to query pg_database for the target database.

drop_database(**connection_config)[source]

Drop database in Redshift cluster.

Connects to the admin database to drop the target database. Note: DROP DATABASE must run with autocommit enabled, and because Redshift doesn't support IF EXISTS for DROP DATABASE, the adapter checks that the database exists first.
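
The check-then-drop pattern described above looks roughly like this (a sketch only, assuming a DB-API style connection; the adapter's internals may differ):

# Autocommit is required because DROP DATABASE cannot run inside a
# transaction, and Redshift has no IF EXISTS for DROP DATABASE.
admin_conn.autocommit = True
cur = admin_conn.cursor()
cur.execute("SELECT 1 FROM pg_database WHERE datname = %s", ("benchbox",))
if cur.fetchone():
    cur.execute("DROP DATABASE benchbox")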

create_connection(**connection_config)[source]

Create optimized Redshift connection.

create_schema(benchmark, connection)[source]

Create schema using Redshift-optimized table definitions.

load_data(benchmark, connection, data_dir)[source]

Load data using Redshift COPY command with S3 integration.

configure_for_benchmark(connection, benchmark_type)[source]

Apply Redshift-specific optimizations based on benchmark type.

validate_session_cache_control(connection)[source]

Validate that session-level cache control settings were successfully applied.

Parameters:

connection (Any) – Active Redshift database connection

Returns:

  • validated: bool - Whether validation passed

  • cache_disabled: bool - Whether cache is actually disabled

  • settings: dict - Actual session settings

  • warnings: list[str] - Any validation warnings

  • errors: list[str] - Any validation errors

Return type:

dict

Raises:

ConfigurationError – If cache control validation fails and strict_validation=True
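
The documented return keys make the result straightforward to act on:

result = adapter.validate_session_cache_control(conn)
if not result["validated"]:
    for warning in result["warnings"]:
        print(f"cache-control warning: {warning}")
    for error in result["errors"]:
        print(f"cache-control error: {error}")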

execute_query(connection, query, query_id, benchmark_type=None, scale_factor=None, validate_row_count=True, stream_id=None)[source]

Execute query with detailed timing and performance tracking.
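
A usage sketch with the documented parameters (the shape of the returned timing data is not specified here, so treat it as opaque; the benchmark_type value is an assumption):

timing = adapter.execute_query(
    conn,
    query="SELECT count(*) FROM lineitem",
    query_id="q1",
    benchmark_type="tpch",   # assumption: passed as a plain string
    scale_factor=1.0,
    validate_row_count=True,
)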

analyze_table(connection, table_name)[source]

Run ANALYZE on table for query optimization.

vacuum_table(connection, table_name)[source]

Run VACUUM on table for space reclamation.

get_query_plan(connection, query)[source]

Get query execution plan for analysis.

close_connection(connection)[source]

Close Redshift connection.

supports_tuning_type(tuning_type)[source]

Check if Redshift supports a specific tuning type.

Redshift supports:

  • DISTRIBUTION: Via DISTSTYLE and DISTKEY clauses

  • SORTING: Via SORTKEY clause (compound and interleaved)

  • PARTITIONING: Through table design patterns and date partitioning

Parameters:

tuning_type – The type of tuning to check support for

Returns:

True if the tuning type is supported by Redshift

Return type:

bool
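
A hedged usage sketch; the name and import path of the tuning-type enum are assumptions:

from benchbox.tuning import TuningType  # hypothetical import path

# All three should report True per the list above.
for tuning in (TuningType.DISTRIBUTION, TuningType.SORTING, TuningType.PARTITIONING):
    print(tuning, adapter.supports_tuning_type(tuning))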

generate_tuning_clause(table_tuning)[source]

Generate Redshift-specific tuning clauses for CREATE TABLE statements.

Redshift supports:

  • DISTSTYLE (EVEN | KEY | ALL) and DISTKEY (column)

  • SORTKEY (column1, column2, …) or INTERLEAVED SORTKEY (column1, column2, …)

Parameters:

table_tuning – The tuning configuration for the table

Returns:

SQL clause string to be appended to CREATE TABLE statement

Return type:

str
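
Since the clause is appended to a CREATE TABLE statement, typical use looks like this sketch (the clause text shown in the comment is illustrative):

# table_tuning: a tuning configuration built elsewhere
clause = adapter.generate_tuning_clause(table_tuning)
# e.g. clause == "DISTSTYLE KEY DISTKEY (o_custkey) SORTKEY (o_orderdate)"
cursor.execute(f"CREATE TABLE orders (o_orderkey BIGINT, o_custkey BIGINT) {clause}")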

apply_table_tunings(table_tuning, connection)[source]

Apply tuning configurations to a Redshift table.

Redshift tuning approach:

  • DISTRIBUTION: Handled via DISTSTYLE/DISTKEY in CREATE TABLE

  • SORTING: Handled via SORTKEY in CREATE TABLE

  • Post-creation optimizations via ANALYZE and VACUUM

Parameters:
  • table_tuning – The tuning configuration to apply

  • connection (Any) – Redshift connection

Raises:

ValueError – If the tuning configuration is invalid for Redshift

apply_unified_tuning(unified_config, connection)[source]

Apply unified tuning configuration to Redshift.

Parameters:
  • unified_config (UnifiedTuningConfiguration) – Unified tuning configuration to apply

  • connection (Any) – Redshift connection

apply_platform_optimizations(platform_config, connection)[source]

Apply Redshift-specific platform optimizations.

Redshift optimizations include:

  • Workload Management (WLM) queue configuration

  • Query group settings for resource allocation

  • Compression encoding optimization

  • Statistics collection and maintenance

Parameters:
  • platform_config (PlatformOptimizationConfiguration) – Platform optimization configuration

  • connection (Any) – Redshift connection

apply_constraint_configuration(primary_key_config, foreign_key_config, connection)[source]

Apply constraint configurations to Redshift.

Note: Redshift supports PRIMARY KEY and FOREIGN KEY constraints for query optimization, but they are informational only (not enforced). Constraints must be applied at table creation time.

Parameters:
  • primary_key_config (PrimaryKeyConfiguration) – Primary key constraint configuration

  • foreign_key_config (ForeignKeyConfiguration) – Foreign key constraint configuration

  • connection (Any) – Redshift connection
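
Because constraints must be declared at table creation time, they go directly in the DDL; Redshift's planner uses them as hints without enforcing them. A short illustration:

cursor.execute("""
    CREATE TABLE customer (
        c_custkey BIGINT PRIMARY KEY,  -- informational: not enforced
        c_name VARCHAR(25) NOT NULL    -- NOT NULL is enforced
    )
""")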

Constructor Parameters

RedshiftAdapter(
    host: str,
    username: str,
    password: str,
    port: int = 5439,
    database: str = "dev",
    cluster_identifier: Optional[str] = None,
    s3_bucket: Optional[str] = None,
    s3_prefix: str = "benchbox-data",
    iam_role: Optional[str] = None,
    aws_access_key_id: Optional[str] = None,
    aws_secret_access_key: Optional[str] = None,
    aws_region: str = "us-east-1",
    wlm_query_slot_count: int = 1,
    compression_encoding: str = "AUTO",
    auto_vacuum: bool = True,
    auto_analyze: bool = True
)

Parameters:

Connection:

  • host (str): Cluster endpoint hostname

  • username (str): Database user

  • password (str): User password

  • port (int): Cluster port. Default: 5439

  • database (str): Database name. Default: "dev"

S3 Integration:

  • s3_bucket (str, optional): S3 bucket for data staging

  • s3_prefix (str): Prefix in bucket. Default: "benchbox-data"

  • iam_role (str, optional): IAM role ARN for COPY

  • aws_access_key_id (str, optional): AWS access key

  • aws_secret_access_key (str, optional): AWS secret key

  • aws_region (str): AWS region. Default: "us-east-1"

Optimization:

  • wlm_query_slot_count (int): WLM slots. Default: 1

  • compression_encoding (str): Compression type. Default: "AUTO"

  • auto_vacuum (bool): Auto vacuum tables. Default: True

  • auto_analyze (bool): Auto analyze tables. Default: True

Configuration Examples

Access Key Authentication

adapter = RedshiftAdapter(
    host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
    username="admin",
    password="password",
    s3_bucket="my-data-bucket",
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
    aws_secret_access_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    aws_region="us-east-1"
)

Workload Management (WLM)

# Use multiple query slots for large queries
adapter = RedshiftAdapter(
    host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
    username="admin",
    password="password",
    wlm_query_slot_count=3  # Use 3 slots for more resources
)
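
The session-level counterpart of wlm_query_slot_count=3 is Redshift's wlm_query_slot_count setting; the adapter presumably issues something like:

cursor.execute("SET wlm_query_slot_count TO 3")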

Data Loading

Direct Loading (Small Datasets)

# For small datasets (< 100MB), skip S3
adapter = RedshiftAdapter(
    host="my-cluster.123456.us-east-1.redshift.amazonaws.com",
    username="admin",
    password="password"
    # No s3_bucket specified - uses direct INSERT
)

Advanced Features

Distribution Keys

# Create table with distribution key
cursor.execute("""
    CREATE TABLE orders (
        o_orderkey BIGINT,
        o_custkey BIGINT,
        o_orderstatus CHAR(1),
        o_totalprice DECIMAL(15,2),
        o_orderdate DATE
    )
    DISTSTYLE KEY
    DISTKEY (o_custkey)
    SORTKEY (o_orderdate)
""")

Sort Keys

-- Compound sort key (most common)
CREATE TABLE lineitem (...)
SORTKEY (l_shipdate, l_orderkey)

-- Interleaved sort key (for multiple filters)
CREATE TABLE lineitem (...)
INTERLEAVED SORTKEY (l_shipdate, l_orderkey, l_partkey)

Compression

# Automatic compression analysis
adapter = RedshiftAdapter(
    host="my-cluster...",
    compression_encoding="AUTO"  # Redshift analyzes and applies optimal encoding
)

# Check compression
cursor.execute("""
    SELECT
        "column",
        type,
        encoding
    FROM pg_table_def
    WHERE tablename = 'lineitem'
""")

Vacuum and Analyze

# Manual maintenance
adapter.vacuum_table(conn, "lineitem")
adapter.analyze_table(conn, "lineitem")

# Or automatic
adapter = RedshiftAdapter(
    auto_vacuum=True,
    auto_analyze=True
)

Best Practices

Distribution Strategy

  1. Choose appropriate DISTSTYLE:

    -- EVEN: Small tables, no joins
    CREATE TABLE region (...) DISTSTYLE EVEN
    
    -- KEY: Large fact tables (distribute by join key)
    CREATE TABLE orders (...) DISTSTYLE KEY DISTKEY (o_custkey)
    
    -- ALL: Small dimension tables (broadcast to all nodes)
    CREATE TABLE nation (...) DISTSTYLE ALL
    

Sort Keys

  1. Use compound sort keys for range/equality filters:

    -- Good for: WHERE l_shipdate BETWEEN ... AND l_orderkey = ...
    SORTKEY (l_shipdate, l_orderkey)
    
  2. Use interleaved for multiple filter combinations:

    -- Good for varying filter combinations
    INTERLEAVED SORTKEY (l_shipdate, l_orderkey, l_partkey)
    

Data Loading

  1. Use COPY from S3 for best performance (a combined example follows this list)

  2. Load compressed files (GZIP recommended)

  3. Use manifest files for multiple files

  4. Run ANALYZE after loading
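
Putting these together, a hedged example of such a COPY (option choices are illustrative, not the adapter's exact command):

cursor.execute("""
    COPY lineitem
    FROM 's3://my-redshift-data/benchbox/staging/lineitem.manifest'
    IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftCopyRole'
    MANIFEST
    GZIP
    DELIMITER '|'
""")
cursor.execute("ANALYZE lineitem")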

Cost Optimization

  1. Use reserved instances for predictable workloads

  2. Pause clusters when not in use

  3. Use concurrency scaling for burst workloads

  4. Monitor query performance with system tables (see the query below)
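
For the last point, the stl_query system table records recent statement timings; a simple starting point:

cursor.execute("""
    SELECT query, TRIM(querytxt) AS sql,
           DATEDIFF(ms, starttime, endtime) AS elapsed_ms
    FROM stl_query
    ORDER BY starttime DESC
    LIMIT 20
""")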

Common Issues

Connection Timeout

Problem: Cannot connect to cluster

Solutions:

# 1. Check cluster status
aws redshift describe-clusters --cluster-identifier my-cluster

# 2. Verify security group allows inbound on port 5439
# 3. Check VPC routing and NAT gateway

# 4. Test connectivity
psql -h my-cluster.123456.us-east-1.redshift.amazonaws.com \
     -U admin -d dev -p 5439

S3 COPY Errors

Problem: COPY command fails

Solutions:

# 1. Verify IAM role permissions
# Role needs: s3:GetObject, s3:ListBucket

# 2. Check S3 bucket region matches cluster region

# 3. View error details
cursor.execute("""
    SELECT * FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 10
""")

Slow Query Performance

Problem: Queries slower than expected

Solutions:

# 1. Check query execution plan
plan = adapter.get_query_plan(conn, query)

# 2. Verify distribution keys
cursor.execute("""
    SELECT
        tablename,
        "column",
        distkey
    FROM pg_table_def
    WHERE distkey = TRUE
""")

# 3. Check sort key usage
cursor.execute("""
    SELECT * FROM svv_table_info
    WHERE "table" = 'lineitem'
""")

# 4. Run VACUUM and ANALYZE
adapter.vacuum_table(conn, "lineitem")
adapter.analyze_table(conn, "lineitem")

See Also

Platform Documentation

API Reference

External Resources