LakeSail Sail Platform¶
LakeSail Sail is a Rust-based, drop-in replacement for Apache Spark built on DataFusion. It is reported to deliver roughly 4x faster execution at about 94% lower hardware cost than Apache Spark on TPC-H SF100. BenchBox supports LakeSail in both SQL mode (lakesail) and DataFrame mode (lakesail-df) and connects through the standard Spark Connect protocol, so existing PySpark code runs without rewrites.
Features¶
4x faster than Spark - Rust-based DataFusion engine with significant performance gains at lower cost
Zero-rewrite migration - Uses standard PySpark client via Spark Connect protocol
Dual execution modes - SQL benchmarking (lakesail) and DataFrame benchmarking (lakesail-df)
Spark SQL dialect - SQLGlot spark dialect for query translation
Local and distributed - Multi-threaded single-host or distributed cluster deployment
Parquet and ORC - Native support for columnar table formats
Adaptive Query Execution - AQE support for runtime query optimization
Full PySpark API - DataFrame, Column expression, and Window function compatibility
Quick Start¶
# Install PySpark client (LakeSail uses the standard PySpark package)
uv add pyspark pyarrow
# Start your LakeSail Sail server (see LakeSail documentation)
# Default endpoint: sc://localhost:50051
# Run SQL benchmark
benchbox run --platform lakesail --benchmark tpch --scale 1.0
# Run DataFrame benchmark
benchbox run --platform lakesail-df --benchmark tpch --scale 1.0
Configuration¶
BenchBox connects to a running Sail server via the Spark Connect protocol. Local deployments require no additional authentication.
Configuration Methods¶
CLI Options:
benchbox run --platform lakesail --benchmark tpch --scale 1.0 \
--lakesail-endpoint sc://localhost:50051 \
--lakesail-mode local \
--driver-memory 8g \
--shuffle-partitions 16
Stored Credentials:
# Store credentials for reuse
benchbox credentials set lakesail \
--option endpoint=sc://my-sail-server:50051 \
--option sail_mode=distributed \
--option sail_workers=4
benchbox run --platform lakesail --benchmark tpch --scale 1.0
Configuration Options¶
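| Option | CLI Flag | Default | Description |
|---|---|---|---|
| endpoint | --lakesail-endpoint | sc://localhost:50051 | Sail server Spark Connect URL |
| sail_mode | --lakesail-mode |  | Deployment mode: local or distributed |
| sail_workers | --lakesail-workers | - | Worker count for distributed mode |
|  |  |  | Application name for the session |
| driver_memory | --driver-memory |  | Driver memory allocation (e.g., 8g) |
| shuffle_partitions | --shuffle-partitions |  | Number of shuffle partitions |
| table_format |  |  | Table format: parquet or orc |
| enable_aqe |  |  | Enable Adaptive Query Execution (AQE) |
|  | - |  | Disable result caching for accurate benchmarking |
|  | - |  | Additional Spark configuration properties (dict) |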
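To make the relationship between option names and CLI flags concrete, here is a minimal sketch that builds a `benchbox run` command line from keyword options. The helper `build_run_command` and the `OPTION_FLAGS` mapping are hypothetical and not part of BenchBox; the mapping pairs only option names and flags that appear in the examples on this page, and the pairing itself is inferred from those examples.

```python
import shlex

# Option name -> CLI flag. Inferred from the examples on this page;
# treat the pairing as an assumption, not as BenchBox's definitive mapping.
OPTION_FLAGS = {
    "endpoint": "--lakesail-endpoint",
    "sail_mode": "--lakesail-mode",
    "sail_workers": "--lakesail-workers",
    "driver_memory": "--driver-memory",
    "shuffle_partitions": "--shuffle-partitions",
}


def build_run_command(benchmark, scale, platform="lakesail", **options):
    """Return the argv list for a `benchbox run` call (hypothetical helper)."""
    argv = ["benchbox", "run", "--platform", platform,
            "--benchmark", benchmark, "--scale", str(scale)]
    for name, value in options.items():
        if name not in OPTION_FLAGS:
            raise KeyError(f"no known CLI flag for option {name!r}")
        argv += [OPTION_FLAGS[name], str(value)]
    return argv


cmd = build_run_command(
    "tpch", 1.0,
    endpoint="sc://localhost:50051",
    sail_mode="local",
    driver_memory="8g",
    shuffle_partitions=16,
)
# Print a shell-ready command line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems when it is handed to `subprocess.run`; `shlex.join` is used only for display.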
Usage Examples¶
SQL Mode¶
# TPC-H at scale factor 1
benchbox run --platform lakesail --benchmark tpch --scale 1.0
# TPC-DS at scale factor 10 with tuned settings
benchbox run --platform lakesail --benchmark tpcds --scale 10.0 \
--driver-memory 16g \
--shuffle-partitions 32
# Specific queries only
benchbox run --platform lakesail --benchmark tpch --scale 1.0 --queries Q1,Q6,Q17
# Dry run to preview execution
benchbox run --dry-run ./preview --platform lakesail --benchmark tpch
DataFrame Mode¶
# TPC-H DataFrame benchmark
benchbox run --platform lakesail-df --benchmark tpch --scale 1.0
# With custom endpoint and memory
benchbox run --platform lakesail-df --benchmark tpch --scale 1.0 \
--lakesail-endpoint sc://sail-server:50051 \
--driver-memory 8g
Comparison with Apache Spark¶
Run the same benchmark on both platforms to compare performance:
# LakeSail Sail
benchbox run --platform lakesail --benchmark tpch --scale 10.0
# Apache Spark (for comparison)
benchbox run --platform spark --benchmark tpch --scale 10.0
# Compare results
benchbox results compare lakesail_tpch_sf10.json spark_tpch_sf10.json
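For readers who want to summarize a comparison by hand, the sketch below computes per-query speedups and their geometric mean. The layout of the BenchBox result files is not described on this page, so the sketch works on plain dictionaries of query name to seconds; the function names and the timings are made up for illustration and are not measured results. `benchbox results compare` remains the supported way to compare result files.

```python
from math import exp, log


def speedups(baseline, candidate):
    """Per-query speedup (baseline time / candidate time) for shared queries."""
    shared = sorted(set(baseline) & set(candidate))
    return {q: baseline[q] / candidate[q] for q in shared}


def geometric_mean(values):
    """Geometric mean, a common way to average ratios such as speedups."""
    values = list(values)
    return exp(sum(log(v) for v in values) / len(values))


# Made-up timings in seconds, for illustration only (not measured results).
spark_times = {"Q1": 6.0, "Q6": 3.0, "Q17": 12.0}
sail_times = {"Q1": 3.0, "Q6": 3.0, "Q17": 3.0}

ratio = speedups(spark_times, sail_times)
overall = geometric_mean(ratio.values())
print(ratio)
print(round(overall, 2))
```

The geometric mean is used instead of the arithmetic mean so that one very fast or very slow query does not dominate the summary.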
Python API¶
SQL Mode¶
from benchbox import TPCH
from benchbox.platforms.lakesail import LakeSailAdapter
adapter = LakeSailAdapter(
    endpoint="sc://localhost:50051",
    sail_mode="local",
    driver_memory="8g",
    shuffle_partitions=16,
    table_format="parquet",
)
benchmark = TPCH(scale_factor=1.0)
benchmark.generate_data()
adapter.load_benchmark(benchmark)
results = adapter.run_benchmark(benchmark)
DataFrame Mode¶
from pathlib import Path

from benchbox.platforms.dataframe.lakesail_df import LakeSailDataFrameAdapter

adapter = LakeSailDataFrameAdapter(
    endpoint="sc://localhost:50051",
    driver_memory="8g",
    shuffle_partitions=16,
    enable_aqe=True,
)

# Use as context manager for automatic cleanup
with adapter as ctx:
    df = ctx.read_parquet(Path("lineitem.parquet"))
    # Count rows with l_quantity > 25, grouped by return flag
    result = df.filter(df["l_quantity"] > 25).groupBy("l_returnflag").count()
    rows = ctx.collect(result)
Architecture¶
LakeSail Sail replaces the Spark execution engine with a Rust-based runtime built on Apache DataFusion, while maintaining full compatibility with the Spark Connect protocol.
Spark Connect Integration¶
┌──────────────────┐    Spark Connect    ┌──────────────────────┐
│  PySpark Client  │ ──── Protocol ────> │    LakeSail Sail     │
│  (standard API)  │       (gRPC)        │        Server        │
└──────────────────┘                     ├──────────────────────┤
                                         │   Query Optimizer    │
                                         │     (DataFusion)     │
                                         ├──────────────────────┤
                                         │    Rust Execution    │
                                         │       Workers        │
                                         └──────────────────────┘
Key architectural points:
Client: Standard PySpark library – no custom client needed
Protocol: Spark Connect (gRPC) for client-server communication
SQL Dialect: Spark SQL, translated via SQLGlot spark dialect
Optimizer: DataFusion query optimizer with cost-based optimization
Execution: Rust-based workers for vectorized query processing
Constraints: Primary and foreign keys are informational only (not enforced), matching Spark behavior
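Because the client is the stock PySpark library, connecting to a Sail server should look the same as connecting to any Spark Connect endpoint. The sketch below illustrates this under stated assumptions: `check_endpoint` and `open_session` are hypothetical helper names, the URL check requires an explicit port to match the sc://host:port form used on this page, and `SparkSession.builder.remote` is the standard PySpark entry point for Spark Connect (PySpark 3.4 or later).

```python
from urllib.parse import urlparse


def check_endpoint(endpoint):
    """Validate a Spark Connect URL of the form sc://host:port."""
    parsed = urlparse(endpoint)
    if parsed.scheme != "sc" or not parsed.hostname or parsed.port is None:
        raise ValueError(f"expected sc://host:port, got {endpoint!r}")
    return parsed.hostname, parsed.port


def open_session(endpoint="sc://localhost:50051"):
    """Open a Spark Connect session against a Sail server using stock PySpark."""
    check_endpoint(endpoint)
    # Imported lazily so the URL check works even without PySpark installed.
    from pyspark.sql import SparkSession
    return SparkSession.builder.remote(endpoint).getOrCreate()


# The URL check alone needs neither PySpark nor a running server.
print(check_endpoint("sc://localhost:50051"))
```

With a Sail server running, `spark = open_session()` followed by `spark.sql("SELECT 1 AS ok").show()` is a quick way to confirm that the session works end to end.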
Tuning Support¶
LakeSail supports the following tuning types:
| Tuning Type | Supported | Notes |
|---|---|---|
| Partitioning | Yes |  |
| Sorting | Yes | Sort-based optimizations |
| Primary Keys | Informational | Not enforced, used for optimizer hints |
| Foreign Keys | Informational | Not enforced, used for optimizer hints |
Deployment Modes¶
Local Mode¶
Single-node, multi-threaded execution. Best for development, testing, and small-to-medium scale benchmarks.
benchbox run --platform lakesail --benchmark tpch --scale 1.0 \
--lakesail-mode local \
--driver-memory 8g
Sail server runs on a single machine
Multi-threaded query execution via Rust workers
No cluster coordination overhead
Default endpoint: sc://localhost:50051
Distributed Mode¶
Multi-node cluster execution for large-scale benchmarks.
benchbox run --platform lakesail --benchmark tpch --scale 100.0 \
--lakesail-mode distributed \
--lakesail-workers 4 \
--driver-memory 16g \
--shuffle-partitions 200
Multiple Rust worker nodes coordinated by a Sail server
Horizontal scaling for large datasets
Requires cluster infrastructure setup (see LakeSail documentation)
Comparison: LakeSail vs Apache Spark¶
| Feature | LakeSail Sail | Apache Spark |
|---|---|---|
| Language | Rust (DataFusion) | Scala/Java (JVM) |
| Performance | ~4x faster (TPC-H SF100) | Baseline |
| Hardware Cost | ~94% lower | Baseline |
| API Compatibility | Full PySpark API | Native |
| SQL Dialect | Spark SQL | Spark SQL |
| Connection Protocol | Spark Connect | Native / Spark Connect |
| Migration Effort | Zero rewrites | N/A |
| Local Mode | Multi-threaded Rust | JVM-based |
| Distributed Mode | Rust workers | JVM executors |
| DataFrame API | Full PySpark compatibility | Native |
| Table Formats | Parquet, ORC | Parquet, ORC, Delta, Iceberg |
| Maturity | Newer | Battle-tested |
When to Use LakeSail¶
Use LakeSail when:
You want Spark compatibility with native execution performance (avoiding JVM overhead)
You are migrating an existing PySpark or Spark SQL workload and want to avoid code changes
You are running OLAP and analytics benchmarks where execution speed matters
Reducing hardware cost is a priority
You need both SQL and DataFrame benchmarking on the same engine
Use Apache Spark instead when:
You need the broadest ecosystem of connectors and integrations
You require Delta Lake or Iceberg table format support
You depend on Spark-specific plugins or UDFs not yet supported by Sail
You need a battle-tested production platform with long-term support history
Troubleshooting¶
Connection Refused¶
Failed to connect to LakeSail Sail: Connection refused
Solutions:
Verify the Sail server is running and accepting connections
Check the endpoint URL format: sc://host:port (default: sc://localhost:50051)
Verify the port is not blocked by a firewall
Check server logs for startup errors
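The first and third checks above can be approximated with a plain TCP probe before involving PySpark at all. This is a minimal sketch using only the Python standard library; `can_reach` is a hypothetical helper name. A successful TCP connection shows only that something is listening on the port, not that it is a healthy Sail server.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


# Probe the default endpoint from this page (sc://localhost:50051).
print(can_reach("localhost", 50051))
```

If the probe returns False from the client machine but True on the server itself, a firewall or bind-address issue is the likely cause.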
PySpark Not Installed¶
PySpark not installed. Install with: pip install pyspark pyarrow
Solutions:
Install the PySpark client: uv add pyspark pyarrow
LakeSail uses the standard PySpark package – no special client needed
Database Creation Failed¶
Failed to create database: ...
Solutions:
Verify the Sail server has write permissions
Check that the database name uses only alphanumeric characters and underscores
Try connecting with a simpler database name: --platform-option database=test
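The naming rule above is easy to check before starting a run. The sketch below is a hypothetical helper, not BenchBox code, and it assumes that "alphanumeric" means ASCII letters and digits.

```python
import re

# One or more ASCII letters, digits, or underscores, and nothing else.
_VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_database_name(name):
    """True if the name contains only ASCII letters, digits, and underscores."""
    return _VALID_NAME.fullmatch(name) is not None


for candidate in ("test", "tpch_sf1", "tpch-sf1", "tpch sf1.0"):
    print(candidate, is_valid_database_name(candidate))
```

Names containing hyphens, spaces, or dots fail the check and are worth replacing with underscores.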
Query Execution Timeout¶
Query execution timed out
Solutions:
Increase driver memory: --driver-memory 16g
Adjust shuffle partitions: --shuffle-partitions 32
For large-scale benchmarks, use distributed mode: --lakesail-mode distributed
Check Sail server resource availability
Table Already Exists¶
Table 'lineitem' already exists
Solutions:
The adapter handles this automatically by dropping and recreating tables
Force data regeneration: benchbox run --force all --platform lakesail ...
Manually drop the database: connect to the Sail server and run DROP DATABASE <name> CASCADE
Compressed Data Files¶
BenchBox generates zstd-compressed data files by default (e.g., lineitem.tbl.1.zst). LakeSail’s Sail engine supports reading compressed CSV files, but its CSV reader defaults to UNCOMPRESSED and does not auto-detect compression from file extensions.
BenchBox handles this automatically – during data loading, it detects the compression type from the file extension and passes the appropriate compression option to the Spark CSV reader. No user action is required.
If you encounter "No files found in the specified paths" errors during data loading, ensure you are running a current version of BenchBox that includes this auto-detection. As a workaround, you can regenerate data without compression:
benchbox run --platform lakesail --benchmark tpch --compression none --force datagen
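The extension-based detection described in this section could look roughly like the sketch below. It is not BenchBox's actual implementation: the function names, the suffix table, and the codec strings are assumptions for illustration. The `sep` and `header` keys are standard Spark CSV reader options, and the pipe delimiter matches TPC-H .tbl files.

```python
from pathlib import Path

# File suffix -> codec name. The codec strings are assumptions for illustration.
SUFFIX_TO_CODEC = {".zst": "zstd", ".gz": "gzip", ".bz2": "bzip2"}


def detect_compression(path):
    """Return a codec name based on the final file suffix, or None if uncompressed."""
    return SUFFIX_TO_CODEC.get(Path(path).suffix.lower())


def csv_reader_options(path):
    """Build CSV reader options, adding `compression` only when a codec is detected."""
    options = {"sep": "|", "header": "false"}
    codec = detect_compression(path)
    if codec is not None:
        options["compression"] = codec
    return options


print(csv_reader_options("lineitem.tbl.1.zst"))
print(csv_reader_options("lineitem.tbl"))
```

With a live session, the resulting dictionary could be passed to the PySpark reader as `spark.read.options(**csv_reader_options(path)).csv(path)`.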
See Also¶
Spark Platform - Apache Spark benchmarking (for comparison)
Deployment Modes Guide - Platform deployment architecture
Platform Selection Guide - Choose the right platform
Getting Started - Quick start guide
LakeSail Documentation - Official LakeSail documentation portal