Polars Platform¶
Polars is a high-performance DataFrame library implemented in Rust with Python bindings. It provides both lazy and eager execution modes, with automatic query optimization and parallelization.
BenchBox supports two modes for benchmarking Polars:
| Mode | CLI Name | Query Type | Best For |
|---|---|---|---|
| SQL Mode | `polars` | SQL strings via SQLContext | SQL-oriented workflows |
| DataFrame Mode | `polars-df` | Native expression API | DataFrame-native workflows |
Features¶
- High performance - Rust-based implementation with SIMD optimizations
- Lazy execution - Build query plans and optimize before execution
- Memory efficient - Zero-copy operations and out-of-core processing
- SQL support - Query DataFrames using SQL via `pl.SQLContext`
- Parallel execution - Automatic parallelization across CPU cores
- Native file formats - Parquet, CSV, JSON, Arrow IPC
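Each of these formats can be scanned lazily with the corresponding `pl.scan_*` function. A minimal sketch (the file names below are placeholders for your own data):

import polars as pl

# Lazily scan each native format; nothing is read until .collect()
parquet_lf = pl.scan_parquet("lineitem.parquet")
csv_lf = pl.scan_csv("lineitem.csv")
ipc_lf = pl.scan_ipc("lineitem.arrow")
ndjson_lf = pl.scan_ndjson("lineitem.ndjson")  # line-delimited JSON

print(parquet_lf.select(pl.len()).collect())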
Installation¶
# Install Polars
pip install polars
# Or with all optional dependencies
pip install "polars[all]"
Configuration¶
CLI Options¶
# SQL Mode - queries executed via Polars SQLContext
benchbox run --platform polars --benchmark tpch --scale 0.1
# DataFrame Mode - queries executed via native expression API
benchbox run --platform polars-df --benchmark tpch --scale 0.1
Platform Options (SQL Mode: polars)¶
| Option | Default | Description |
|---|---|---|
| `execution_mode` | `lazy` | Execution mode (`lazy` or `eager`) |
| `streaming` | `false` | Enable streaming execution for large datasets |
| | `auto` | Number of threads for parallel execution |
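Assuming these keys are passed with the same `--platform-option key=value` flag used later in this guide, and that they match the `PolarsAdapter` constructor arguments shown below (both assumptions worth verifying against your BenchBox version), an eager run might look like:

# SQL Mode with eager execution (option key assumed to match the adapter argument)
benchbox run --platform polars --benchmark tpch --scale 0.1 \
  --platform-option execution_mode=eager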
Platform Options (DataFrame Mode: polars-df)¶
| Option | Default | Description |
|---|---|---|
| `streaming` | `false` | Enable streaming mode for large datasets |
| `rechunk` | `true` | Rechunk data for better memory layout |
| | `-` | Limit rows to read (for testing) |
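These options roughly mirror knobs on Polars' own scan functions. A direct-Polars sketch of the same ideas (the parameter names below are `pl.scan_csv` arguments, not BenchBox option keys):

import polars as pl

# rechunk and n_rows are native scan_csv parameters; row limiting is handy for testing
lf = pl.scan_csv("lineitem.csv", separator="|", rechunk=True, n_rows=1_000)
df = lf.collect()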
Usage Examples¶
Basic Benchmark Run¶
# SQL Mode - Run TPC-H using SQL queries
benchbox run --platform polars --benchmark tpch --scale 0.1
# DataFrame Mode - Run TPC-H using native expressions
benchbox run --platform polars-df --benchmark tpch --scale 0.1
Compare SQL vs DataFrame Performance¶
# Same benchmark, different execution paradigms
benchbox run --platform polars --benchmark tpch --scale 1 --output-dir ./polars-sql
benchbox run --platform polars-df --benchmark tpch --scale 1 --output-dir ./polars-df
# Compare results
benchbox compare ./polars-sql ./polars-df
Python API¶
from benchbox import TPCH
from benchbox.platforms.polars_platform import PolarsAdapter
# Initialize adapter
adapter = PolarsAdapter()
# Load and run benchmark
benchmark = TPCH(scale_factor=0.1)
benchmark.generate_data()
adapter.load_benchmark(benchmark)
results = adapter.run_benchmark(benchmark)
Direct Polars Usage¶
import polars as pl
from benchbox import TPCH
# Generate benchmark data
tpch = TPCH(scale_factor=0.1)
tpch.generate_data()
# Load data into Polars
lineitem = pl.scan_csv(tpch.tables["lineitem"], separator="|")
orders = pl.scan_csv(tpch.tables["orders"], separator="|")
# Run SQL query
ctx = pl.SQLContext()
ctx.register("lineitem", lineitem)
ctx.register("orders", orders)
result = ctx.execute("""
    SELECT l_returnflag, SUM(l_quantity) AS total_qty
    FROM lineitem
    GROUP BY l_returnflag
    ORDER BY l_returnflag
""").collect()
print(result)
Execution Modes¶
Lazy Execution (Default)¶
Lazy execution builds a query plan and optimizes it before execution:
# Polars optimizes the entire query plan
df = (
    pl.scan_csv("lineitem.csv")
    .filter(pl.col("l_quantity") > 10)
    .group_by("l_returnflag")
    .agg(pl.sum("l_quantity"))
    .collect()  # Execute the optimized plan
)
Optimizations include:
- Predicate pushdown
- Projection pushdown
- Common subexpression elimination
- Join reordering
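You can inspect what the optimizer did by printing the query plan. A quick sketch using the same query as above:

import polars as pl

lf = (
    pl.scan_csv("lineitem.csv")
    .filter(pl.col("l_quantity") > 10)
    .group_by("l_returnflag")
    .agg(pl.sum("l_quantity"))
)

# The optimized plan shows predicate and projection pushdown applied at the scan
print(lf.explain())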
Eager Execution¶
For immediate results without optimization:
adapter = PolarsAdapter(execution_mode="eager")
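In direct Polars usage, eager execution corresponds to the `read_*` functions, which materialize a DataFrame immediately and run each operation as it is called (a sketch mirroring the lazy example above):

import polars as pl

# Eager: the file is read up front and no plan-level optimization is applied
df = pl.read_csv("lineitem.csv")
result = (
    df.filter(pl.col("l_quantity") > 10)
    .group_by("l_returnflag")
    .agg(pl.sum("l_quantity"))
)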
Streaming Mode¶
For datasets larger than memory:
adapter = PolarsAdapter(streaming=True)
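In direct Polars usage, streaming is requested when the lazy plan is collected. The exact spelling has changed across Polars releases, so check your installed version; this sketch uses the long-standing `streaming` flag:

import polars as pl

# Stream a larger-than-memory aggregation; newer Polars releases may expose this differently
lf = (
    pl.scan_parquet("lineitem.parquet")
    .group_by("l_returnflag")
    .agg(pl.sum("l_quantity"))
)
result = lf.collect(streaming=True)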
Performance Characteristics¶
Strengths¶
- CPU-bound queries: Excellent vectorized execution
- In-memory analytics: Optimized for RAM-resident data
- File scanning: Efficient Parquet/CSV reading with pushdown
- Parallel aggregations: Automatic multi-core utilization
Considerations¶
- Single-node only: No distributed execution
- Memory bound: Large datasets require streaming mode
- SQL subset: Some advanced SQL features not yet supported
Benchmark Recommendations¶
Best For¶
- Local development and testing
- Small to medium datasets (up to ~100 GB)
- Quick iteration and prototyping
- Comparison baseline against DuckDB
Scale Factor Guidelines¶
| Scale Factor | Memory Required | Use Case |
|---|---|---|
| 0.01 | ~100 MB | Unit testing, CI/CD |
| 0.1 | ~500 MB | Integration testing |
| 1.0 | ~4 GB | Standard benchmarking |
| 10.0 | ~40 GB (streaming) | Large-scale testing |
Note: Execution times vary based on query complexity, hardware, and configuration. Run benchmarks to establish baselines for your environment.
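To check what a given scale factor actually costs in memory on your machine, you can load one of the generated tables and ask Polars for its in-memory size (a sketch; the `tpch.tables` path and `|` separator follow the Direct Polars Usage example above):

import polars as pl
from benchbox import TPCH

tpch = TPCH(scale_factor=0.1)
tpch.generate_data()

# Report the in-memory size of the loaded lineitem table
lineitem = pl.read_csv(tpch.tables["lineitem"], separator="|")
print(f"lineitem: {lineitem.estimated_size('mb'):.1f} MB")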
Comparison with DuckDB¶
| Aspect | Polars | DuckDB |
|---|---|---|
| Language | Rust/Python | C++/Python |
| Execution | DataFrame API + SQL | SQL-first |
| Persistence | External files | Embedded database |
| Best for | DataFrame workflows | SQL queries |
Troubleshooting¶
Memory Errors¶
# Enable streaming for large datasets
adapter = PolarsAdapter(streaming=True)
SQL Syntax Errors¶
Polars SQL supports a subset of SQL. For unsupported features, use the DataFrame API:
# Instead of unsupported SQL, use DataFrame API
result = (
    df.filter(pl.col("column") > 10)
    .group_by("category")
    .agg(pl.sum("value"))
)
Thread Configuration¶
# Limit thread usage: set POLARS_MAX_THREADS before polars is first imported
import os
os.environ["POLARS_MAX_THREADS"] = "4"

import polars as pl
DataFrame Mode Deep Dive¶
The polars-df platform executes TPC-H queries using Polars’ native expression API instead of SQL. This provides:
- Full lazy optimization - The Polars query planner optimizes the entire expression chain
- Type safety - Expression errors are caught at construction time
- Composability - Build complex queries from reusable expression components
Expression API Example¶
# TPC-H Q1 implemented with Polars expressions
from datetime import date

from polars import col, lit

# Q1 filter date: 1998-12-01 minus the 90-day default DELTA
cutoff_date = date(1998, 9, 2)

result = (
    lineitem.filter(col("l_shipdate") <= lit(cutoff_date))
    .group_by("l_returnflag", "l_linestatus")
    .agg(
        col("l_quantity").sum().alias("sum_qty"),
        col("l_extendedprice").sum().alias("sum_base_price"),
        (col("l_extendedprice") * (lit(1) - col("l_discount")))
        .sum().alias("sum_disc_price"),
    )
    .sort("l_returnflag", "l_linestatus")
)
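The composability point above can be seen by factoring repeated arithmetic into named expression components and reusing them across aggregations (a sketch extending the Q1 snippet; `l_tax` and `sum_charge` come from the full Q1 definition):

from polars import col, lit

# Reusable expression components
disc_price = col("l_extendedprice") * (lit(1) - col("l_discount"))
charge = disc_price * (lit(1) + col("l_tax"))

# Reuse them wherever Q1 needs the discounted price or the final charge
agg_exprs = [
    disc_price.sum().alias("sum_disc_price"),
    charge.sum().alias("sum_charge"),
]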
When to Use DataFrame Mode¶
| Use Case | Recommended Mode |
|---|---|
| SQL-based workflows | `polars` |
| DataFrame-native development | `polars-df` |
| Comparing with Pandas | `polars-df` |
| Maximum optimization control | `polars-df` |
Streaming Mode for Large Data¶
# Enable streaming for datasets larger than memory
benchbox run --platform polars-df --benchmark tpch --scale 100 \
--platform-option streaming=true