LakeSail Sail Platform¶
LakeSail Sail is a Rust-based, drop-in replacement for Apache Spark built on DataFusion. It delivers 4x faster execution with 94% lower hardware costs compared to Apache Spark (TPC-H SF100). BenchBox supports LakeSail in both SQL mode (lakesail) and DataFrame mode (lakesail-df), connecting via the standard Spark Connect protocol for zero-rewrite migration from PySpark.
Features¶
4x faster than Spark - Rust-based DataFusion engine with significant performance gains at lower cost
Zero-rewrite migration - Uses standard PySpark client via Spark Connect protocol
Dual execution modes - SQL benchmarking (lakesail) and DataFrame benchmarking (lakesail-df)
Spark SQL dialect - SQLGlot spark dialect for query translation
Local and distributed - Multi-threaded single-host or distributed cluster deployment
Parquet and ORC - Native support for columnar table formats
Adaptive Query Execution - AQE support for runtime query optimization
Full PySpark API - DataFrame, Column expression, and Window function compatibility
Quick Start¶
# Install PySpark client (LakeSail uses the standard PySpark package)
uv add pyspark pyarrow
# Start your LakeSail Sail server (see LakeSail documentation)
# Default endpoint: sc://localhost:50051
# Run SQL benchmark
benchbox run --platform lakesail --benchmark tpch --scale 1.0
# Run DataFrame benchmark
benchbox run --platform lakesail-df --benchmark tpch --scale 1.0
Configuration¶
LakeSail Sail connects to a running Sail server via the Spark Connect protocol. No additional authentication is required for local deployments.
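The endpoint follows the Spark Connect URL scheme, sc://host:port. As an illustrative sketch (not part of the BenchBox API — the helper name is hypothetical), an endpoint can be validated with the standard library before starting a run:

```python
from urllib.parse import urlparse


def parse_sail_endpoint(endpoint: str) -> tuple[str, int]:
    """Validate a Spark Connect endpoint like sc://localhost:50051
    and return (host, port). Hypothetical helper for illustration."""
    parts = urlparse(endpoint)
    if parts.scheme != "sc":
        raise ValueError(f"expected sc:// scheme, got {endpoint!r}")
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"endpoint must include host and port: {endpoint!r}")
    return parts.hostname, parts.port


print(parse_sail_endpoint("sc://localhost:50051"))  # ('localhost', 50051)
```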
Configuration Methods¶
CLI Options:
benchbox run --platform lakesail --benchmark tpch --scale 1.0 \
--lakesail-endpoint sc://localhost:50051 \
--lakesail-mode local \
--driver-memory 8g \
--shuffle-partitions 16
Environment Variables:
# Store credentials for reuse
benchbox credentials set lakesail \
--option endpoint=sc://my-sail-server:50051 \
--option sail_mode=distributed \
--option sail_workers=4
benchbox run --platform lakesail --benchmark tpch --scale 1.0
Configuration Options¶
| Option | CLI Flag | Default | Description |
|---|---|---|---|
| `endpoint` | `--lakesail-endpoint` | `sc://localhost:50051` | Sail server Spark Connect URL |
| `sail_mode` | `--lakesail-mode` | `local` | Deployment mode: `local` or `distributed` |
| `sail_workers` | `--lakesail-workers` | - | Worker count for distributed mode |
| - | - | - | Application name for the session |
| `driver_memory` | `--driver-memory` | - | Driver memory allocation (e.g., `8g`) |
| `shuffle_partitions` | `--shuffle-partitions` | - | Number of shuffle partitions |
| `table_format` | - | `parquet` | Table format: `parquet` or `orc` |
| `enable_aqe` | - | - | Enable Adaptive Query Execution (AQE) |
| - | - | - | Disable result caching for accurate benchmarking |
| - | - | - | Additional Spark configuration properties (dict) |
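Several of these options correspond to standard Spark configuration properties (`spark.driver.memory`, `spark.sql.shuffle.partitions`, `spark.sql.adaptive.enabled`). A minimal sketch of that mapping — the function itself is hypothetical, not part of BenchBox:

```python
def build_spark_conf(driver_memory="8g", shuffle_partitions=16,
                     enable_aqe=True, extra=None):
    """Sketch of how adapter options could map onto standard Spark
    configuration keys (hypothetical helper for illustration)."""
    conf = {
        "spark.driver.memory": driver_memory,
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
        "spark.sql.adaptive.enabled": str(enable_aqe).lower(),
    }
    conf.update(extra or {})  # additional Spark properties win on conflict
    return conf


print(build_spark_conf(shuffle_partitions=32)["spark.sql.shuffle.partitions"])  # 32
```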
Usage Examples¶
SQL Mode¶
# TPC-H at scale factor 1
benchbox run --platform lakesail --benchmark tpch --scale 1.0
# TPC-DS at scale factor 10 with tuned settings
benchbox run --platform lakesail --benchmark tpcds --scale 10.0 \
--driver-memory 16g \
--shuffle-partitions 32
# Specific queries only
benchbox run --platform lakesail --benchmark tpch --scale 1.0 --queries Q1,Q6,Q17
# Dry run to preview execution
benchbox run --dry-run ./preview --platform lakesail --benchmark tpch
DataFrame Mode¶
# TPC-H DataFrame benchmark
benchbox run --platform lakesail-df --benchmark tpch --scale 1.0
# With custom endpoint and memory
benchbox run --platform lakesail-df --benchmark tpch --scale 1.0 \
--lakesail-endpoint sc://sail-server:50051 \
--driver-memory 8g
Comparison with Apache Spark¶
Run the same benchmark on both platforms to compare performance:
# LakeSail Sail
benchbox run --platform lakesail --benchmark tpch --scale 10.0
# Apache Spark (for comparison)
benchbox run --platform spark --benchmark tpch --scale 10.0
# Compare results
benchbox results compare lakesail_tpch_sf10.json spark_tpch_sf10.json
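A per-query speedup can also be computed directly, assuming each results file has been reduced to a `{query: seconds}` mapping — an assumption for illustration; the actual BenchBox results schema may differ:

```python
def per_query_speedup(baseline_times, candidate_times):
    """Per-query speedup = baseline seconds / candidate seconds.
    Assumes plain {query: seconds} dicts (illustrative only)."""
    common = sorted(set(baseline_times) & set(candidate_times))
    return {q: baseline_times[q] / candidate_times[q] for q in common}


# Hypothetical timings for two queries:
spark_times = {"Q1": 12.0, "Q6": 4.0}
sail_times = {"Q1": 3.0, "Q6": 1.0}
print(per_query_speedup(spark_times, sail_times))  # {'Q1': 4.0, 'Q6': 4.0}
```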
Python API¶
SQL Mode¶
from benchbox import TPCH
from benchbox.platforms.lakesail import LakeSailAdapter
adapter = LakeSailAdapter(
endpoint="sc://localhost:50051",
sail_mode="local",
driver_memory="8g",
shuffle_partitions=16,
table_format="parquet",
)
benchmark = TPCH(scale_factor=1.0)
benchmark.generate_data()
adapter.load_benchmark(benchmark)
results = adapter.run_benchmark(benchmark)
DataFrame Mode¶
from pathlib import Path

from benchbox.platforms.dataframe.lakesail_df import LakeSailDataFrameAdapter
adapter = LakeSailDataFrameAdapter(
endpoint="sc://localhost:50051",
driver_memory="8g",
shuffle_partitions=16,
enable_aqe=True,
)
# Use as context manager for automatic cleanup
with adapter as ctx:
df = ctx.read_parquet(Path("lineitem.parquet"))
result = df.filter(df["l_quantity"] > 25).groupBy("l_returnflag").count()
rows = ctx.collect(result)
Architecture¶
LakeSail Sail replaces the Spark execution engine with a Rust-based runtime built on Apache DataFusion, while maintaining full compatibility with the Spark Connect protocol.
Spark Connect Integration¶
┌──────────────────┐ Spark Connect ┌──────────────────────┐
│ PySpark Client │ ──── Protocol ────> │ LakeSail Sail │
│ (standard API) │ (gRPC) │ Server │
└──────────────────┘ ├──────────────────────┤
│ Query Optimizer │
│ (DataFusion) │
├──────────────────────┤
│ Rust Execution │
│ Workers │
└──────────────────────┘
Key architectural points:
Client: Standard PySpark library – no custom client needed
Protocol: Spark Connect (gRPC) for client-server communication
SQL Dialect: Spark SQL, translated via SQLGlot's spark dialect
Optimizer: DataFusion query optimizer with cost-based optimization
Execution: Rust-based workers for vectorized query processing
Constraints: Primary and foreign keys are informational only (not enforced), matching Spark behavior
Tuning Support¶
LakeSail supports the following tuning types:
| Tuning Type | Supported | Notes |
|---|---|---|
| Partitioning | Yes | - |
| Sorting | Yes | Sort-based optimizations |
| Primary Keys | Informational | Not enforced, used for optimizer hints |
| Foreign Keys | Informational | Not enforced, used for optimizer hints |
Deployment Modes¶
Local Mode¶
Single-node, multi-threaded execution. Best for development, testing, and small-to-medium scale benchmarks.
benchbox run --platform lakesail --benchmark tpch --scale 1.0 \
--lakesail-mode local \
--driver-memory 8g
Sail server runs on a single machine
Multi-threaded query execution via Rust workers
No cluster coordination overhead
Default endpoint:
sc://localhost:50051
Distributed Mode¶
Multi-node cluster execution for large-scale benchmarks.
benchbox run --platform lakesail --benchmark tpch --scale 100.0 \
--lakesail-mode distributed \
--lakesail-workers 4 \
--driver-memory 16g \
--shuffle-partitions 200
Multiple Rust worker nodes coordinated by a Sail server
Horizontal scaling for large datasets
Requires cluster infrastructure setup (see LakeSail documentation)
Comparison: LakeSail vs Apache Spark¶
| Feature | LakeSail Sail | Apache Spark |
|---|---|---|
| Language | Rust (DataFusion) | Scala/Java (JVM) |
| Performance | ~4x faster (TPC-H SF100) | Baseline |
| Hardware Cost | ~94% lower | Baseline |
| API Compatibility | Full PySpark API | Native |
| SQL Dialect | Spark SQL | Spark SQL |
| Connection Protocol | Spark Connect | Native / Spark Connect |
| Migration Effort | Zero rewrites | N/A |
| Local Mode | Multi-threaded Rust | JVM-based |
| Distributed Mode | Rust workers | JVM executors |
| DataFrame API | Full PySpark compatibility | Native |
| Table Formats | Parquet, ORC | Parquet, ORC, Delta, Iceberg |
| Maturity | Newer | Battle-tested |
When to Use LakeSail¶
Use LakeSail when:
You want Spark compatibility with native execution performance (avoiding JVM overhead)
Migrating from an existing PySpark/Spark SQL workload with no code changes
Running OLAP and analytics benchmarks where execution speed matters
Hardware cost reduction is a priority
You need both SQL and DataFrame benchmarking on the same engine
Use Apache Spark instead when:
You need the broadest ecosystem of connectors and integrations
You require Delta Lake or Iceberg table format support
You depend on Spark-specific plugins or UDFs not yet supported by Sail
You need a battle-tested production platform with long-term support history
Troubleshooting¶
Connection Refused¶
Failed to connect to LakeSail Sail: Connection refused
Solutions:
Verify the Sail server is running and accepting connections
Check the endpoint URL format: sc://host:port (default: sc://localhost:50051)
Verify the port is not blocked by a firewall
Check server logs for startup errors
PySpark Not Installed¶
PySpark not installed. Install with: pip install pyspark pyarrow
Solutions:
Install the PySpark client: uv add pyspark pyarrow
LakeSail uses the standard PySpark package – no special client needed
Database Creation Failed¶
Failed to create database: ...
Solutions:
Verify the Sail server has write permissions
Check that the database name uses only alphanumeric characters and underscores
Try connecting with a simpler database name:
--platform-option database=test
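The naming guidance above can be checked up front. An illustrative helper (hypothetical, not part of the BenchBox API) that accepts only alphanumeric characters and underscores:

```python
import re


def is_safe_database_name(name: str) -> bool:
    """Return True if the name uses only alphanumeric characters and
    underscores, matching the guidance above (illustrative helper)."""
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None


print(is_safe_database_name("tpch_sf1"))   # True
print(is_safe_database_name("tpch-sf1"))   # False
```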
Query Execution Timeout¶
Query execution timed out
Solutions:
Increase driver memory: --driver-memory 16g
Adjust shuffle partitions: --shuffle-partitions 32
For large-scale benchmarks, use distributed mode: --lakesail-mode distributed
Check Sail server resource availability
Table Already Exists¶
Table 'lineitem' already exists
Solutions:
The adapter handles this automatically by dropping and recreating tables
Force data regeneration: benchbox run --force all --platform lakesail ...
Manually drop the database: connect to the Sail server and run DROP DATABASE <name> CASCADE
See Also¶
Spark Platform - Apache Spark benchmarking (for comparison)
Deployment Modes Guide - Platform deployment architecture
Platform Selection Guide - Choose the right platform
Getting Started - Quick start guide
LakeSail Documentation - Official LakeSail documentation portal