LakeSail Sail Platform

Tags: intermediate, guide, lakesail, sql-platform, dataframe-platform, spark-compatible, local

LakeSail Sail is a Rust-based, drop-in replacement for Apache Spark built on Apache DataFusion. According to LakeSail's TPC-H SF100 benchmarks, it delivers roughly 4x faster execution at 94% lower hardware cost than Spark. BenchBox supports LakeSail in both SQL mode (lakesail) and DataFrame mode (lakesail-df), connecting via the standard Spark Connect protocol for zero-rewrite migration from PySpark.

Features

  • 4x faster than Spark - Rust-based DataFusion engine with significant performance gains at lower cost

  • Zero-rewrite migration - Uses standard PySpark client via Spark Connect protocol

  • Dual execution modes - SQL benchmarking (lakesail) and DataFrame benchmarking (lakesail-df)

  • Spark SQL dialect - SQLGlot spark dialect for query translation

  • Local and distributed - Multi-threaded single-host or distributed cluster deployment

  • Parquet and ORC - Native support for columnar table formats

  • Adaptive Query Execution - AQE support for runtime query optimization

  • Full PySpark API - DataFrame, Column expression, and Window function compatibility

Quick Start

# Install PySpark client (LakeSail uses the standard PySpark package)
uv add pyspark pyarrow

# Start your LakeSail Sail server (see LakeSail documentation)
# Default endpoint: sc://localhost:50051

# Run SQL benchmark
benchbox run --platform lakesail --benchmark tpch --scale 1.0

# Run DataFrame benchmark
benchbox run --platform lakesail-df --benchmark tpch --scale 1.0

Configuration

LakeSail Sail connects to a running Sail server via the Spark Connect protocol. No additional authentication is required for local deployments.
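Endpoints follow the Spark Connect URL form sc://host:port. As a minimal sketch of validating an endpoint before launching a benchmark (parse_endpoint is a hypothetical helper, not part of BenchBox, and it handles only the plain host:port form):

```python
import re

# Spark Connect URLs have the form sc://host:port. Sail's documented
# default endpoint is sc://localhost:50051, so 50051 is the fallback here.
_ENDPOINT_RE = re.compile(r"^sc://(?P<host>[^:/]+)(?::(?P<port>\d+))?$")

def parse_endpoint(url: str) -> tuple[str, int]:
    """Split a Spark Connect URL into (host, port), defaulting the port."""
    match = _ENDPOINT_RE.match(url)
    if match is None:
        raise ValueError(f"Not a Spark Connect URL: {url!r}")
    port = int(match.group("port") or 50051)
    return match.group("host"), port

host, port = parse_endpoint("sc://localhost:50051")
```

A check like this gives a clearer error message than a failed gRPC handshake when the URL is mistyped.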

Configuration Methods

CLI Options:

benchbox run --platform lakesail --benchmark tpch --scale 1.0 \
    --lakesail-endpoint sc://localhost:50051 \
    --lakesail-mode local \
    --driver-memory 8g \
    --shuffle-partitions 16

Environment Variables:

# Store credentials for reuse
benchbox credentials set lakesail \
    --option endpoint=sc://my-sail-server:50051 \
    --option sail_mode=distributed \
    --option sail_workers=4

benchbox run --platform lakesail --benchmark tpch --scale 1.0

Configuration Options

Option              CLI Flag              Default               Description
------              --------              -------               -----------
endpoint            --lakesail-endpoint   sc://localhost:50051  Sail server Spark Connect URL
sail_mode           --lakesail-mode       local                 Deployment mode: local or distributed
sail_workers        --lakesail-workers    -                     Worker count for distributed mode
app_name            --app-name            BenchBox-LakeSail     Application name for the session
driver_memory       --driver-memory       4g                    Driver memory allocation (e.g., 4g, 8g)
shuffle_partitions  --shuffle-partitions  200                   Number of shuffle partitions
table_format        --table-format        parquet               Table format: parquet or orc
adaptive_enabled    --adaptive-enabled    true                  Enable Adaptive Query Execution (AQE)
disable_cache       -                     true                  Disable result caching for accurate benchmarking
spark_config        -                     {}                    Additional Spark configuration properties (dict)

Usage Examples

SQL Mode

# TPC-H at scale factor 1
benchbox run --platform lakesail --benchmark tpch --scale 1.0

# TPC-DS at scale factor 10 with tuned settings
benchbox run --platform lakesail --benchmark tpcds --scale 10.0 \
    --driver-memory 16g \
    --shuffle-partitions 32

# Specific queries only
benchbox run --platform lakesail --benchmark tpch --scale 1.0 --queries Q1,Q6,Q17

# Dry run to preview execution
benchbox run --dry-run ./preview --platform lakesail --benchmark tpch

DataFrame Mode

# TPC-H DataFrame benchmark
benchbox run --platform lakesail-df --benchmark tpch --scale 1.0

# With custom endpoint and memory
benchbox run --platform lakesail-df --benchmark tpch --scale 1.0 \
    --lakesail-endpoint sc://sail-server:50051 \
    --driver-memory 8g

Comparison with Apache Spark

Run the same benchmark on both platforms to compare performance:

# LakeSail Sail
benchbox run --platform lakesail --benchmark tpch --scale 10.0

# Apache Spark (for comparison)
benchbox run --platform spark --benchmark tpch --scale 10.0

# Compare results
benchbox results compare lakesail_tpch_sf10.json spark_tpch_sf10.json
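Per-query speedups can also be computed from the result files with a short script. The JSON shape below (a top-level "queries" mapping of query name to elapsed seconds) is an assumption for illustration, not BenchBox's actual result schema:

```python
import json

def speedups(sail: dict, spark: dict) -> dict:
    """Per-query ratio spark_time / sail_time (>1.0 means Sail is faster)."""
    return {
        q: spark["queries"][q] / sail["queries"][q]
        for q in sail["queries"]
        if q in spark["queries"]
    }

# Inline JSON stands in for the two result files:
sail = json.loads('{"queries": {"Q1": 1.0, "Q6": 0.5}}')
spark = json.loads('{"queries": {"Q1": 4.0, "Q6": 1.5}}')
ratios = speedups(sail, spark)
```

Intersecting on query names keeps the comparison valid even if one run skipped queries.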

Python API

SQL Mode

from benchbox import TPCH
from benchbox.platforms.lakesail import LakeSailAdapter

adapter = LakeSailAdapter(
    endpoint="sc://localhost:50051",
    sail_mode="local",
    driver_memory="8g",
    shuffle_partitions=16,
    table_format="parquet",
)

benchmark = TPCH(scale_factor=1.0)
benchmark.generate_data()
adapter.load_benchmark(benchmark)
results = adapter.run_benchmark(benchmark)

DataFrame Mode

from pathlib import Path

from benchbox.platforms.dataframe.lakesail_df import LakeSailDataFrameAdapter

adapter = LakeSailDataFrameAdapter(
    endpoint="sc://localhost:50051",
    driver_memory="8g",
    shuffle_partitions=16,
    enable_aqe=True,
)

# Use as context manager for automatic cleanup
with adapter as ctx:
    df = ctx.read_parquet(Path("lineitem.parquet"))
    result = df.filter(df["l_quantity"] > 25).groupBy("l_returnflag").count()
    rows = ctx.collect(result)

Architecture

LakeSail Sail replaces the Spark execution engine with a Rust-based runtime built on Apache DataFusion, while maintaining full compatibility with the Spark Connect protocol.

Spark Connect Integration

┌───────────────────┐     Spark Connect     ┌───────────────────────┐
│  PySpark Client   │ ──── Protocol ────>   │   LakeSail Sail       │
│  (standard API)   │       (gRPC)          │   Server              │
└───────────────────┘                       ├───────────────────────┤
                                            │  Query Optimizer      │
                                            │  (DataFusion)         │
                                            ├───────────────────────┤
                                            │  Rust Execution       │
                                            │  Workers              │
                                            └───────────────────────┘

Key architectural points:

  • Client: Standard PySpark library – no custom client needed

  • Protocol: Spark Connect (gRPC) for client-server communication

  • SQL Dialect: Spark SQL, translated via SQLGlot spark dialect

  • Optimizer: DataFusion query optimizer with cost-based optimization

  • Execution: Rust-based workers for vectorized query processing

  • Constraints: Primary and foreign keys are informational only (not enforced), matching Spark behavior

Tuning Support

LakeSail supports the following tuning types:

Tuning Type   Supported      Notes
-----------   ---------      -----
Partitioning  Yes            PARTITIONED BY clause on table creation
Sorting       Yes            Sort-based optimizations
Primary Keys  Informational  Not enforced, used for optimizer hints
Foreign Keys  Informational  Not enforced, used for optimizer hints

Deployment Modes

Local Mode

Single-node, multi-threaded execution. Best for development, testing, and small-to-medium scale benchmarks.

benchbox run --platform lakesail --benchmark tpch --scale 1.0 \
    --lakesail-mode local \
    --driver-memory 8g

  • Sail server runs on a single machine

  • Multi-threaded query execution via Rust workers

  • No cluster coordination overhead

  • Default endpoint: sc://localhost:50051

Distributed Mode

Multi-node cluster execution for large-scale benchmarks.

benchbox run --platform lakesail --benchmark tpch --scale 100.0 \
    --lakesail-mode distributed \
    --lakesail-workers 4 \
    --driver-memory 16g \
    --shuffle-partitions 200

  • Multiple Rust worker nodes coordinated by a Sail server

  • Horizontal scaling for large datasets

  • Requires cluster infrastructure setup (see LakeSail documentation)

Comparison: LakeSail vs Apache Spark

Feature              LakeSail Sail               Apache Spark
-------              -------------               ------------
Language             Rust (DataFusion)           Scala/Java (JVM)
Performance          ~4x faster (TPC-H SF100)    Baseline
Hardware Cost        ~94% lower                  Baseline
API Compatibility    Full PySpark API            Native
SQL Dialect          Spark SQL                   Spark SQL
Connection Protocol  Spark Connect               Native / Spark Connect
Migration Effort     Zero rewrites               N/A
Local Mode           Multi-threaded Rust         JVM-based
Distributed Mode     Rust workers                JVM executors
DataFrame API        Full PySpark compatibility  Native
Table Formats        Parquet, ORC                Parquet, ORC, Delta, Iceberg
Maturity             Newer                       Battle-tested

When to Use LakeSail

Use LakeSail when:

  • You want Spark compatibility with native execution performance (avoiding JVM overhead)

  • Migrating from an existing PySpark/Spark SQL workload with no code changes

  • Running OLAP and analytics benchmarks where execution speed matters

  • Hardware cost reduction is a priority

  • You need both SQL and DataFrame benchmarking on the same engine

Use Apache Spark instead when:

  • You need the broadest ecosystem of connectors and integrations

  • You require Delta Lake or Iceberg table format support

  • You depend on Spark-specific plugins or UDFs not yet supported by Sail

  • You need a battle-tested production platform with long-term support history

Troubleshooting

Connection Refused

Failed to connect to LakeSail Sail: Connection refused

Solutions:

  1. Verify the Sail server is running and accepting connections

  2. Check the endpoint URL format: sc://host:port (default: sc://localhost:50051)

  3. Verify the port is not blocked by a firewall

  4. Check server logs for startup errors
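A quick first check is a plain TCP probe of the endpoint's host and port; it only proves the port is open, not that the Spark Connect service behind it is healthy. A self-contained sketch, demonstrated against a throwaway local listener rather than a live Sail server:

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Stand-in for a running Sail server: a listener on an ephemeral port.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))   # the OS picks a free port
listener.listen(1)
port = listener.getsockname()[1]
reachable = is_port_open("127.0.0.1", port)
listener.close()
```

In practice you would probe the host and port from your --lakesail-endpoint value, e.g. is_port_open("localhost", 50051).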

PySpark Not Installed

PySpark not installed. Install with: pip install pyspark pyarrow

Solutions:

  1. Install the PySpark client: uv add pyspark pyarrow

  2. LakeSail uses the standard PySpark package – no special client needed

Database Creation Failed

Failed to create database: ...

Solutions:

  1. Verify the Sail server has write permissions

  2. Check that the database name uses only alphanumeric characters and underscores

  3. Try connecting with a simpler database name: --platform-option database=test

Query Execution Timeout

Query execution timed out

Solutions:

  1. Increase driver memory: --driver-memory 16g

  2. Adjust shuffle partitions: --shuffle-partitions 32

  3. For large-scale benchmarks, use distributed mode: --lakesail-mode distributed

  4. Check Sail server resource availability

Table Already Exists

Table 'lineitem' already exists

Solutions:

  1. The adapter handles this automatically by dropping and recreating tables

  2. Force data regeneration: benchbox run --force all --platform lakesail ...

  3. Manually drop the database: connect to the Sail server and run DROP DATABASE <name> CASCADE

Compressed Data Files

BenchBox generates zstd-compressed data files by default (e.g., lineitem.tbl.1.zst). LakeSail’s Sail engine supports reading compressed CSV files, but its CSV reader defaults to UNCOMPRESSED and does not auto-detect compression from file extensions.

BenchBox handles this automatically – during data loading, it detects the compression type from the file extension and passes the appropriate compression option to the Spark CSV reader. No user action is required.
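The extension-based detection can be sketched roughly like this; the suffix-to-codec mapping is illustrative and not exhaustive, and the codec names follow the gzip/bzip2/zstd spellings the Spark CSV reader's compression option understands:

```python
from pathlib import Path

# Map file suffixes to codec names for the CSV reader's `compression`
# option (illustrative subset, not BenchBox's actual table).
_SUFFIX_TO_CODEC = {".zst": "zstd", ".gz": "gzip", ".bz2": "bzip2"}

def detect_compression(path: Path) -> str:
    """Infer the compression codec from a data file's extension."""
    return _SUFFIX_TO_CODEC.get(path.suffix, "none")

codec = detect_compression(Path("lineitem.tbl.1.zst"))
```

The detected codec would then be passed through when reading, along the lines of spark.read.option("compression", codec).csv(...).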

If you encounter No files found in the specified paths errors during data loading, ensure you are running a current version of BenchBox that includes this auto-detection. As a workaround, you can regenerate data without compression:

benchbox run --platform lakesail --benchmark tpch --compression none --force datagen

See Also