Polars Platform¶
Polars is a high-performance DataFrame library implemented in Rust with Python bindings. It provides both lazy and eager execution modes, with automatic query optimization and parallelization.
BenchBox supports two modes for benchmarking Polars:
| Mode | CLI Name | Query Type | Best For |
|---|---|---|---|
| SQL Mode | `polars` | SQL strings via SQLContext | SQL-oriented workflows |
| DataFrame Mode | `polars-df` | Native expression API | DataFrame-native workflows |
Features¶
- High performance - Rust-based implementation with SIMD optimizations
- Lazy execution - Build query plans and optimize before execution
- Memory efficient - Zero-copy operations and out-of-core processing
- SQL support - Query DataFrames using SQL via `pl.SQLContext`
- Parallel execution - Automatic parallelization across CPU cores
- Native file formats - Parquet, CSV, JSON, Arrow IPC
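Each of these formats can be scanned lazily with the corresponding `pl.scan_*` function. A minimal sketch (the file names below are placeholders for your own data):

import polars as pl

# Lazily scan each native format; nothing is read until .collect()
parquet_lf = pl.scan_parquet("lineitem.parquet")
csv_lf = pl.scan_csv("lineitem.csv")
ipc_lf = pl.scan_ipc("lineitem.arrow")
ndjson_lf = pl.scan_ndjson("lineitem.ndjson")  # line-delimited JSON

print(parquet_lf.select(pl.len()).collect())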
Installation¶
# Install Polars
pip install polars
# Or with all optional dependencies
pip install "polars[all]"
Configuration¶
CLI Options¶
# SQL Mode - queries executed via Polars SQLContext
benchbox run --platform polars --benchmark tpch --scale 0.1
# DataFrame Mode - queries executed via native expression API
benchbox run --platform polars-df --benchmark tpch --scale 0.1
Platform Options (SQL Mode: polars)¶
| Option | Default | Description |
|---|---|---|
| `execution_mode` | `lazy` | Execution mode (`lazy` or `eager`) |
| `streaming` | `false` | Enable streaming execution for large datasets |
| | `auto` | Number of threads for parallel execution |
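Assuming these keys are passed with the same `--platform-option key=value` flag used later in this guide, and that they match the `PolarsAdapter` constructor arguments shown below (both assumptions worth verifying against your BenchBox version), an eager run might look like:

# SQL Mode with eager execution (option key assumed to match the adapter argument)
benchbox run --platform polars --benchmark tpch --scale 0.1 \
  --platform-option execution_mode=eager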
Platform Options (DataFrame Mode: polars-df)¶
| Option | Default | Description |
|---|---|---|
| `streaming` | `false` | Enable streaming mode for large datasets |
| `rechunk` | `true` | Rechunk data for better memory layout |
| | `-` | Limit rows to read (for testing) |
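These options roughly mirror knobs on Polars' own scan functions. A direct-Polars sketch of the same ideas (the parameter names below are `pl.scan_csv` arguments, not BenchBox option keys):

import polars as pl

# rechunk and n_rows are native scan_csv parameters; row limiting is handy for testing
lf = pl.scan_csv("lineitem.csv", separator="|", rechunk=True, n_rows=1_000)
df = lf.collect()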
Usage Examples¶
Basic Benchmark Run¶
# SQL Mode - Run TPC-H using SQL queries
benchbox run --platform polars --benchmark tpch --scale 0.1
# DataFrame Mode - Run TPC-H using native expressions
benchbox run --platform polars-df --benchmark tpch --scale 0.1
Compare SQL vs DataFrame Performance¶
# Same benchmark, different execution paradigms
benchbox run --platform polars --benchmark tpch --scale 1 --output-dir ./polars-sql
benchbox run --platform polars-df --benchmark tpch --scale 1 --output-dir ./polars-df
# Compare results
benchbox compare ./polars-sql ./polars-df
Python API¶
from benchbox import TPCH
from benchbox.platforms.polars_platform import PolarsAdapter
# Initialize adapter
adapter = PolarsAdapter()
# Load and run benchmark
benchmark = TPCH(scale_factor=0.1)
benchmark.generate_data()
adapter.load_benchmark(benchmark)
results = adapter.run_benchmark(benchmark)
Direct Polars Usage¶
import polars as pl
from benchbox import TPCH
# Generate benchmark data
tpch = TPCH(scale_factor=0.1)
tpch.generate_data()
# Load data into Polars
lineitem = pl.scan_csv(tpch.tables["lineitem"], separator="|")
orders = pl.scan_csv(tpch.tables["orders"], separator="|")
# Run SQL query
ctx = pl.SQLContext()
ctx.register("lineitem", lineitem)
ctx.register("orders", orders)
result = ctx.execute("""
    SELECT l_returnflag, SUM(l_quantity) AS total_qty
    FROM lineitem
    GROUP BY l_returnflag
    ORDER BY l_returnflag
""").collect()
print(result)
Execution Modes¶
Lazy Execution (Default)¶
Lazy execution builds a query plan and optimizes it before execution:
# Polars optimizes the entire query plan
df = (
    pl.scan_csv("lineitem.csv")
    .filter(pl.col("l_quantity") > 10)
    .group_by("l_returnflag")
    .agg(pl.sum("l_quantity"))
    .collect()  # Execute the optimized plan
)
Optimizations include:
- Predicate pushdown
- Projection pushdown
- Common subexpression elimination
- Join reordering
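You can inspect what the optimizer did by printing the query plan. A quick sketch using the same query as above:

import polars as pl

lf = (
    pl.scan_csv("lineitem.csv")
    .filter(pl.col("l_quantity") > 10)
    .group_by("l_returnflag")
    .agg(pl.sum("l_quantity"))
)

# The optimized plan shows predicate and projection pushdown applied at the scan
print(lf.explain())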
Eager Execution¶
For immediate results without optimization:
adapter = PolarsAdapter(execution_mode="eager")
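In direct Polars usage, eager execution corresponds to the `read_*` functions, which materialize a DataFrame immediately and run each operation as it is called (a sketch mirroring the lazy example above):

import polars as pl

# Eager: the file is read up front and no plan-level optimization is applied
df = pl.read_csv("lineitem.csv")
result = (
    df.filter(pl.col("l_quantity") > 10)
    .group_by("l_returnflag")
    .agg(pl.sum("l_quantity"))
)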
Streaming Mode¶
For datasets larger than memory:
adapter = PolarsAdapter(streaming=True)
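In direct Polars usage, streaming is requested when the lazy plan is collected. The exact spelling has changed across Polars releases, so check your installed version; this sketch uses the long-standing `streaming` flag:

import polars as pl

# Stream a larger-than-memory aggregation; newer Polars releases may expose this differently
lf = (
    pl.scan_parquet("lineitem.parquet")
    .group_by("l_returnflag")
    .agg(pl.sum("l_quantity"))
)
result = lf.collect(streaming=True)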
Performance Characteristics¶
Strengths¶
- CPU-bound queries: Excellent vectorized execution
- In-memory analytics: Optimized for RAM-resident data
- File scanning: Efficient Parquet/CSV reading with pushdown
- Parallel aggregations: Automatic multi-core utilization
Considerations¶
- Single-node only: No distributed execution
- Memory bound: Large datasets require streaming mode
- SQL subset: Some advanced SQL features not yet supported
Benchmark Recommendations¶
Best For¶
- Local development and testing
- Small to medium datasets (up to ~100 GB)
- Quick iteration and prototyping
- Comparison baseline against DuckDB
Scale Factor Guidelines¶
| Scale Factor | Memory Required | Use Case |
|---|---|---|
| 0.01 | ~100 MB | Unit testing, CI/CD |
| 0.1 | ~500 MB | Integration testing |
| 1.0 | ~4 GB | Standard benchmarking |
| 10.0 | ~40 GB (streaming) | Large-scale testing |
Note: Execution times vary based on query complexity, hardware, and configuration. Run benchmarks to establish baselines for your environment.
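To check what a given scale factor actually costs in memory on your machine, you can load one of the generated tables and ask Polars for its in-memory size (a sketch; the `tpch.tables` path and `|` separator follow the Direct Polars Usage example above):

import polars as pl
from benchbox import TPCH

tpch = TPCH(scale_factor=0.1)
tpch.generate_data()

# Report the in-memory size of the loaded lineitem table
lineitem = pl.read_csv(tpch.tables["lineitem"], separator="|")
print(f"lineitem: {lineitem.estimated_size('mb'):.1f} MB")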
Comparison with DuckDB¶
| Aspect | Polars | DuckDB |
|---|---|---|
| Language | Rust/Python | C++/Python |
| Execution | DataFrame API + SQL | SQL-first |
| Persistence | External files | Embedded database |
| Best for | DataFrame workflows | SQL queries |
Troubleshooting¶
Memory Errors¶
# Enable streaming for large datasets
adapter = PolarsAdapter(streaming=True)
SQL Syntax Errors¶
Polars SQL supports a subset of SQL. For unsupported features, use the DataFrame API:
# Instead of unsupported SQL, use DataFrame API
result = (
    df.filter(pl.col("column") > 10)
    .group_by("category")
    .agg(pl.sum("value"))
)
Thread Configuration¶
# Limit thread usage: set POLARS_MAX_THREADS before polars is first imported
import os
os.environ["POLARS_MAX_THREADS"] = "4"

import polars as pl
DataFrame Mode Deep Dive¶
The polars-df platform executes TPC-H queries using Polars’ native expression API instead of SQL. This provides:
- Full lazy optimization - The Polars query planner optimizes the entire expression chain
- Type safety - Expression errors are caught at construction time
- Composability - Build complex queries from reusable expression components
Expression API Example¶
# TPC-H Q1 implemented with Polars expressions
from datetime import date

from polars import col, lit

# Q1 filter date: 1998-12-01 minus the 90-day default DELTA
cutoff_date = date(1998, 9, 2)

result = (
    lineitem.filter(col("l_shipdate") <= lit(cutoff_date))
    .group_by("l_returnflag", "l_linestatus")
    .agg(
        col("l_quantity").sum().alias("sum_qty"),
        col("l_extendedprice").sum().alias("sum_base_price"),
        (col("l_extendedprice") * (lit(1) - col("l_discount")))
        .sum().alias("sum_disc_price"),
    )
    .sort("l_returnflag", "l_linestatus")
)
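The composability point above can be seen by factoring repeated arithmetic into named expression components and reusing them across aggregations (a sketch extending the Q1 snippet; `l_tax` and `sum_charge` come from the full Q1 definition):

from polars import col, lit

# Reusable expression components
disc_price = col("l_extendedprice") * (lit(1) - col("l_discount"))
charge = disc_price * (lit(1) + col("l_tax"))

# Reuse them wherever Q1 needs the discounted price or the final charge
agg_exprs = [
    disc_price.sum().alias("sum_disc_price"),
    charge.sum().alias("sum_charge"),
]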
When to Use DataFrame Mode¶
| Use Case | Recommended Mode |
|---|---|
| SQL-based workflows | `polars` |
| DataFrame-native development | `polars-df` |
| Comparing with Pandas | `polars-df` |
| Maximum optimization control | `polars-df` |
Streaming Mode for Large Data¶
# Enable streaming for datasets larger than memory
benchbox run --platform polars-df --benchmark tpch --scale 100 \
--platform-option streaming=true