Vortex Guide

Tags: guide, formats, vortex, experimental

Vortex is a columnar format with composable encodings, designed for high-performance analytics. This guide covers Vortex’s architecture, current status, and when to consider it for benchmarking.

Overview

Every few years, a new columnar format appears claiming to surpass Parquet. Most fade away. Vortex is different: it’s backed by the Linux Foundation AI & Data Foundation, with contributions from Microsoft, Snowflake, and Palantir. Its SIGMOD 2024 paper was recognized by TUM’s database group for its adaptive compression approach.

We added Vortex support to BenchBox because we’re curious about its performance claims and because DuckDB users asked for it. This guide covers what Vortex is, how it works, and what our initial assessment shows.

Fair warning: Vortex is in incubation. The format is evolving, platform support is limited, and performance characteristics may change. We’re sharing what we’ve learned, not declaring a winner.

Project Status and Maturity

Origin

Vortex was developed by SpiralDB and donated to the Linux Foundation AI & Data Foundation in August 2025. It’s an Incubation-stage project, meaning the specification is stabilizing but not yet production-hardened.

Current Status (Incubation)

Key milestones:

  • SIGMOD 2024: “Vortex: A Stream-oriented Storage Engine For Big Data Analytics”

  • August 2025: Donated to LF AI & Data Foundation

  • Contributors: Microsoft, Snowflake, Palantir

Backward Compatibility

Backward compatibility is guaranteed from version 0.36.0 onward; earlier versions may include breaking changes.

Design Philosophy

Composable Encodings

Vortex’s core insight: No single compression scheme is best for all data types and distributions.

Parquet uses a fixed set of encoding schemes (dictionary, RLE, delta, etc.). Vortex provides composable encodings that can be chained based on data characteristics:

  • FSST for strings (specialized string compression)

  • ALP for floating-point values (adaptive lossless floating-point compression)

  • Custom encodings for specific data patterns

Cloud Storage Optimization

The format is designed for:

  • Minimal read overhead: Complete footer information loads within 64KB, enabling two-round-trip reads from cloud storage

  • Wide schemas: Efficient handling of tables with many columns

  • Partial reads: Column pruning and predicate pushdown

Modern Hardware Targets

Vortex targets modern hardware:

  • GPU workloads: Memory layout optimized for GPU processing

  • Vectorized operations: SIMD-friendly data layout

  • Large memory: Designed for systems with substantial RAM

Performance Claims

Official Claims

From the official Vortex documentation, claimed improvements over Parquet:

  • Random access: 100x faster

  • Scan operations: 10-20x faster

  • Write performance: 5x faster

  • Compression ratio: similar

External Validation

  • TUM database group: Recognized Vortex for adaptive compression

  • Microsoft: Demonstrated 30% runtime reductions when running Spark workloads with Vortex in Apache Iceberg

Caveats

These are significant claims. Our initial testing shows:

  • Storage sizes are similar to well-tuned Parquet for TPC-H data

  • At SF1 with data in memory, format differences are minimal (compute dominates I/O)

  • The claimed 10-20x improvements would be more visible at larger scale factors where I/O becomes the bottleneck

  • Tooling is still maturing

Vortex Architecture

File Structure

file.vortex
├── Magic: VTXF (4 bytes)
├── Data segments (compressed column chunks)
├── Postscript (max 65KB)
│   ├── DType segment (schema)
│   ├── Layout segment
│   ├── Statistics segment
│   └── Footer segment
├── Version tag (16-bit)
├── Postscript length (16-bit)
└── Magic: VTXF (4 bytes)

The postscript design is notable: complete footer information loads within 64KB, enabling two-round-trip reads from cloud storage. This matters for S3/GCS workloads where each request has latency overhead.
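
To make that read pattern concrete, here is a hedged Python sketch of the two-round-trip access against an object store. The range_get helper and parse_postscript placeholder are hypothetical stand-ins (a ranged S3/GCS GET and the real postscript decoding, respectively); only the access pattern is the point.

# Hypothetical sketch of a two-round-trip Vortex read from object storage.
# range_get stands in for a ranged GET; parse_postscript stands in for the
# real postscript decoding, which is not reproduced here.
TAIL_BYTES = 64 * 1024  # footer information fits in the final 64KB

def parse_postscript(tail: bytes) -> dict:
    """Placeholder: would decode schema, layout, and per-column segment ranges."""
    raise NotImplementedError

def read_columns(path: str, columns: list[str], file_size: int, range_get):
    # Round trip 1: one ranged GET for the file tail (postscript + trailer).
    tail = range_get(path, file_size - TAIL_BYTES, TAIL_BYTES)
    layout = parse_postscript(tail)
    # Round trip 2: ranged GETs for just the requested columns' segments,
    # issued concurrently so they complete together.
    return [range_get(path, *layout[col]) for col in columns]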

Encoding Strategies

Vortex differs from Parquet in how it encodes data:

Parquet approach:

  • Fixed encoding schemes per logical type

  • Dictionary, RLE, delta encoding

  • Compression applied after encoding

Vortex approach:

  • Composable encodings that chain together

  • Type-aware compressors (FSST for strings, ALP for floating-point values)

  • Per-segment compression selection

  • Adaptive encoding based on data distribution

This flexibility means Vortex can potentially achieve better compression for specific data patterns, though our testing shows similar ratios to well-tuned Parquet.
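
To illustrate what "composable" means in practice, here is a purely conceptual Python sketch, not the Vortex API: it chains dictionary encoding with a narrower integer width for the resulting codes, the kind of composition Vortex applies automatically per segment.

# Conceptual illustration only -- not the Vortex API. Two encodings chained:
# dictionary encoding, then a narrower integer type for the codes.
import numpy as np

values = np.array(["US", "DE", "US", "FR", "US", "DE"] * 1000)

# Step 1: dictionary-encode the strings into integer codes.
dictionary, codes = np.unique(values, return_inverse=True)

# Step 2: only 3 distinct values, so the codes fit in a single byte.
codes = codes.astype(np.uint8)

print(dictionary)                          # ['DE' 'FR' 'US']
print(values.nbytes, "->", codes.nbytes)   # raw bytes vs. encoded bytes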

Compression Options

Vortex supports standard compression algorithms:

  • None

  • LZ4

  • ZLib

  • ZStd

Each segment can use different compression, enabling fine-grained optimization.
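
A simple way to picture per-segment selection is a trial-compression heuristic. The sketch below is our own illustration, not Vortex's actual logic: it compresses each segment independently and keeps the codec only when it pays for itself, which is the kind of fine-grained choice per-segment compression enables.

# Our own illustration of per-segment codec selection -- not Vortex's logic.
# Tiny or high-entropy segments stay uncompressed; repetitive ones use zlib.
import zlib

def choose_codec(segment: bytes, min_size: int = 4096, min_ratio: float = 1.2):
    if len(segment) < min_size:
        return "none", segment             # not worth the CPU or metadata
    compressed = zlib.compress(segment, level=6)
    if len(segment) / len(compressed) >= min_ratio:
        return "zlib", compressed          # compression pays for itself
    return "none", segment                 # incompressible: store raw

codec, payload = choose_codec(b"ab" * 100_000)
print(codec, len(payload))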

Platform Support

DuckDB Extension

# Install vortex extension (one-time)
duckdb -c "INSTALL vortex; LOAD vortex;"

# Query Vortex files
duckdb -c "SELECT * FROM read_vortex('customer.vortex');"
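
The extension also works from DuckDB's Python API. The snippet below is a sketch of a selective read, the access pattern where Vortex's random-access claims should matter most; it assumes a local customer.vortex file with TPC-H customer columns.

# Selective read of a Vortex file through DuckDB's Python API.
# Assumes customer.vortex exists locally; columns and filter are examples.
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL vortex; LOAD vortex;")
result = conn.execute("""
    SELECT c_custkey, c_name
    FROM read_vortex('customer.vortex')
    WHERE c_acctbal > 9000
""").fetch_arrow_table()
print(result.num_rows)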

DataFusion Support

DataFusion has experimental Vortex support in progress.

Limitations

  • DuckDB: supported via extension (INSTALL vortex; LOAD vortex;)

  • DataFusion: experimental; native support in progress

  • Other platforms: not supported; use Parquet

Vortex support is currently limited to DuckDB and DataFusion. If you need cross-platform benchmarks, stick with Parquet.
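
If only part of a pipeline can read Vortex, one pragmatic bridge is converting through Arrow. The sketch below reuses the read call shown in the BenchBox usage section plus pyarrow; filenames are placeholders, and you may need to adjust if your vortex version returns an Arrow array rather than a table.

# Bridge sketch: read Vortex into Arrow, write Parquet for other platforms.
# Filenames are placeholders.
import vortex
import pyarrow.parquet as pq

array = vortex.io.read("customer.vortex")
table = array.to_arrow()                 # as in the reading example below
pq.write_table(table, "customer.parquet", compression="zstd")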

When to Consider Vortex

Best-Fit Scenarios

Vortex is a good fit for:

  • DuckDB-centric workflows: native extension support

  • Analytical workloads with selective queries: fast random access

  • Cloud storage with 100ms+ round-trip latency: efficient read patterns

  • Exploring new technologies: stay current with format evolution

When to Stay with Parquet

Parquet remains the better choice for:

  • Cross-platform benchmarks: universal support

  • Production stability: 10+ years of battle-testing

  • Cloud data warehouses: no Vortex support on Snowflake or Databricks

  • Ecosystem tooling: most tools expect Parquet

Maturity Considerations

Vortex is in Incubation stage:

  • API may change before 1.0 release

  • Extension compatibility requires attention

  • Community smaller than Parquet ecosystem

  • Documentation still evolving

For production benchmarks, we recommend Parquet. For exploration and DuckDB-specific testing, Vortex is worth trying.

BenchBox Usage

Installation

# Install vortex Python library
uv add vortex

Running Benchmarks

# Convert data to Vortex format
benchbox convert --input ./data --format vortex

# Run benchmark with Vortex on DuckDB
benchbox run --platform duckdb --benchmark tpch --format vortex --scale 1
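
A quick way to sanity-check the "similar storage size" observation after converting is to compare the on-disk size of the same table in both formats. The paths below are placeholders for wherever benchbox wrote the converted files.

# Compare on-disk sizes of the same table in both formats.
# Paths are placeholders for the benchbox output locations.
import os

for path in ["./data/lineitem.parquet", "./data/lineitem.vortex"]:
    print(f"{path}: {os.path.getsize(path) / 1e6:.1f} MB")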

Reading Vortex Files

# Python vortex library
import vortex

array = vortex.io.read('customer.vortex')
table = array.to_arrow()

# DuckDB (requires the vortex extension)
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL vortex; LOAD vortex;")
conn.execute("SELECT * FROM read_vortex('customer.vortex')")

See Also