Vortex Guide¶
Vortex is a columnar format with composable encodings, designed for high-performance analytics. This guide covers Vortex’s architecture, current status, and when to consider it for benchmarking.
Overview¶
Every few years, a new columnar format appears claiming to surpass Parquet. Most fade away. Vortex is different: it’s backed by the Linux Foundation AI & Data Foundation, with contributions from Microsoft, Snowflake, and Palantir. Its SIGMOD 2024 paper was recognized by TUM’s database group for its adaptive compression approach.
We added Vortex support to BenchBox because we’re curious about its performance claims and because DuckDB users asked for it. This guide covers what Vortex is, how it works, and what our initial assessment shows.
Fair warning: Vortex is in incubation. The format is evolving, platform support is limited, and performance characteristics may change. We’re sharing what we’ve learned, not declaring a winner.
Project Status and Maturity¶
Origin¶
Vortex was developed by SpiralDB and donated to the Linux Foundation AI & Data Foundation in August 2025. It’s an Incubation-stage project, meaning the specification is stabilizing but not yet production-hardened.
Current Status (Incubation)¶
Key milestones:
SIGMOD 2024: “Vortex: A Stream-oriented Storage Engine For Big Data Analytics”
August 2025: Donated to LF AI & Data Foundation
Contributors: Microsoft, Snowflake, Palantir
Backward Compatibility¶
Backward compatibility is guaranteed from version 0.36.0 onward. Earlier versions may include breaking changes.
Design Philosophy¶
Composable Encodings¶
Vortex’s core insight: No single compression scheme is best for all data types and distributions.
Parquet uses a fixed set of encoding schemes (dictionary, RLE, delta, etc.). Vortex provides composable encodings that can be chained based on data characteristics:
FSST for strings (specialized string compression)
ALP for floating-point values (adaptive lossless floating-point compression)
Custom encodings for specific data patterns
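To make "composable" concrete, here is a hand-rolled sketch of chaining two simple encodings: a dictionary pass that exploits low cardinality, followed by a run-length pass that exploits repetition in the resulting codes. This is plain Python for illustration only; it is not Vortex's encoding API.

```python
# Illustration only: composing two encodings by hand to show why chaining helps.

def dictionary_encode(values):
    """Map each distinct value to a small integer code."""
    codes, mapping = [], {}
    for v in values:
        codes.append(mapping.setdefault(v, len(mapping)))
    return codes, mapping

def run_length_encode(codes):
    """Collapse runs of repeated codes into [code, run_length] pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

column = ["US", "US", "US", "DE", "DE", "US", "US"]
codes, mapping = dictionary_encode(column)  # low cardinality -> small codes
runs = run_length_encode(codes)             # repeated codes -> short run list
print(mapping)  # {'US': 0, 'DE': 1}
print(runs)     # [[0, 3], [1, 2], [0, 2]]
```

Each stage is cheap and reversible, and the best chain depends on the data's distribution, which is exactly what an adaptive, per-segment selector can exploit.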
Cloud Storage Optimization¶
The format is designed for:
Minimal read overhead: Complete footer information loads within 64KB, enabling two-round-trip reads from cloud storage
Wide schemas: Efficient handling of tables with many columns
Partial reads: Column pruning and predicate pushdown
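The read pattern this enables looks roughly like the sketch below, written against plain HTTP range requests. The URL and byte offsets are placeholders, not a real client API.

```python
import urllib.request

def range_read(url, byte_range):
    """Fetch part of a remote object with an HTTP Range request."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={byte_range}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

url = "https://example.com/data/customer.vortex"  # hypothetical object

# Round trip 1: a suffix read of the last 64KB is guaranteed to capture the
# complete footer information (schema, layout, statistics).
tail = range_read(url, "-65536")

# Round trip 2: with the layout decoded, fetch only the byte ranges of the
# columns the query actually touches (offsets here are placeholders).
segment = range_read(url, "4-1048575")
```

By contrast, a Parquet reader that does not speculatively over-read the file tail needs an extra request just to learn the footer length, and every extra request pays a full round trip on S3/GCS.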
Modern Hardware Targets¶
Vortex targets modern hardware:
GPU workloads: Memory layout optimized for GPU processing
Vectorized operations: SIMD-friendly data layout
Large memory: Designed for systems with substantial RAM
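As a small illustration of why a contiguous, fixed-width columnar layout suits vectorized execution, the sketch below uses NumPy as a stand-in for SIMD hardware; nothing in it is Vortex-specific.

```python
import numpy as np

# Row-oriented data: one column's values are scattered across tuples.
rows = [(i, i * 1.5) for i in range(1_000_000)]

# Columnar copies: each column becomes one contiguous, fixed-width buffer,
# the shape SIMD units (and GPUs) consume efficiently.
ids = np.array([r[0] for r in rows], dtype=np.int64)
prices = np.array([r[1] for r in rows], dtype=np.float64)

# One vectorized pass: filter and aggregate with no per-row interpreter overhead.
total = prices[ids % 2 == 0].sum()
```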
Performance Claims¶
Official Claims¶
From the official Vortex documentation:
| Metric | Claimed Improvement vs Parquet |
|---|---|
| Random access | 100x faster |
| Scan operations | 10-20x faster |
| Write performance | 5x faster |
| Compression ratio | Similar |
External Validation¶
TUM database group: Recognized Vortex for adaptive compression
Microsoft: Demonstrated 30% runtime reductions when running Spark workloads with Vortex in Apache Iceberg
Caveats¶
These are significant claims. Our initial testing shows:
Storage sizes are similar to well-tuned Parquet for TPC-H data
At SF1 with data in memory, format differences are minimal (compute dominates I/O)
The claimed 10-20x improvements would be more visible at larger scale factors where I/O becomes the bottleneck
Tooling is still maturing
Vortex Architecture¶
File Structure¶
```
file.vortex
├── Magic: VTXF (4 bytes)
├── Data segments (compressed column chunks)
├── Postscript (max 65KB)
│   ├── DType segment (schema)
│   ├── Layout segment
│   ├── Statistics segment
│   └── Footer segment
├── Version tag (16-bit)
├── Postscript length (16-bit)
└── Magic: VTXF (4 bytes)
```
The postscript design is notable: because the complete footer information fits in the final 64KB of the file, a reader needs only two round trips to cloud storage, one for the tail and one for the data it actually needs. This matters for S3/GCS workloads, where every request pays latency overhead.
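As a sketch of how a reader bootstraps from that structure, the following walks the fixed-size trailer described above. The little-endian byte order is an assumption made for illustration, not taken from the spec.

```python
import struct

def read_trailer(path):
    """Walk the trailer: version tag (2B), postscript length (2B), magic (4B)."""
    with open(path, "rb") as f:
        f.seek(-8, 2)  # the last 8 bytes of the file hold the trailer
        version, ps_len, magic = struct.unpack("<HH4s", f.read(8))
        if magic != b"VTXF":
            raise ValueError("not a Vortex file")
        f.seek(-(8 + ps_len), 2)  # the postscript sits just before the trailer
        return version, f.read(ps_len)  # caller decodes the postscript segments
```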
Encoding Strategies¶
Vortex differs from Parquet in how it encodes data:
Parquet approach:
Fixed encoding schemes per logical type
Dictionary, RLE, delta encoding
Compression applied after encoding
Vortex approach:
Composable encodings that chain together
Type-aware compressors (FSST for strings, ALP for floats)
Per-segment compression selection
Adaptive encoding based on data distribution
This flexibility means Vortex can potentially achieve better compression for specific data patterns, though our testing shows similar ratios to well-tuned Parquet.
Compression Options¶
Vortex supports standard compression algorithms:
None
LZ4
ZLib
ZStd
Each segment can use different compression, enabling fine-grained optimization.
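Per-segment selection can be as simple as trying each candidate codec and keeping the smallest output. The sketch below shows that idea using only Python's standard zlib; the codec list is illustrative, and Vortex's real selection logic is more involved.

```python
import zlib

# Candidate codecs for one segment (LZ4/ZStd omitted; not in the stdlib).
CODECS = {
    "none": lambda buf: buf,
    "zlib": lambda buf: zlib.compress(buf, 6),
}

def pick_codec(segment: bytes):
    """Compress a segment with every candidate and keep the smallest result."""
    encoded = {name: fn(segment) for name, fn in CODECS.items()}
    return min(encoded.items(), key=lambda kv: len(kv[1]))

print(pick_codec(b"abab" * 4096)[0])   # repetitive data: "zlib" wins
print(pick_codec(b"\x9f\x03\xc1")[0])  # tiny segment: "none" wins
```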
Platform Support¶
DuckDB Extension¶
```bash
# Install vortex extension (one-time)
duckdb -c "INSTALL vortex; LOAD vortex;"

# Query Vortex files (LOAD is per-session)
duckdb -c "LOAD vortex; SELECT * FROM read_vortex('customer.vortex');"
```
DataFusion Support¶
Experimental Vortex support for DataFusion is in progress.
Limitations¶
| Platform | Support Level | Notes |
|---|---|---|
| DuckDB | Extension | |
| DataFusion | Experimental | Native support in progress |
| Others | Not supported | Use Parquet |
Vortex support is currently limited to DuckDB and DataFusion. If you need cross-platform benchmarks, stick with Parquet.
When to Consider Vortex¶
Best-Fit Scenarios¶
| Scenario | Why Vortex |
|---|---|
| DuckDB-centric workflows | Native extension support |
| Analytical workloads with selective queries | Fast random access |
| Cloud storage with 100ms+ round-trip latency | Efficient read patterns |
| Exploring new technologies | Stay current with format evolution |
When to Stay with Parquet¶
| Scenario | Why Parquet |
|---|---|
| Cross-platform benchmarks | Universal support |
| Production stability | 10+ years of battle-testing |
| Cloud data warehouses | No Vortex support on Snowflake, Databricks |
| Ecosystem tooling | Most tools expect Parquet |
Maturity Considerations¶
Vortex is in Incubation stage:
API may change before 1.0 release
Extension compatibility requires attention
Community smaller than Parquet ecosystem
Documentation still evolving
For production benchmarks, we recommend Parquet. For exploration and DuckDB-specific testing, Vortex is worth trying.
BenchBox Usage¶
Installation¶
```bash
# Install vortex Python library
uv add vortex
```
Running Benchmarks¶
```bash
# Convert data to Vortex format
benchbox convert --input ./data --format vortex

# Run benchmark with Vortex on DuckDB
benchbox run --platform duckdb --benchmark tpch --format vortex --scale 1
```
Reading Vortex Files¶
```python
# Python vortex library
import vortex

array = vortex.io.read('customer.vortex')
table = array.to_arrow()

# DuckDB (requires extension)
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL vortex; LOAD vortex;")
conn.execute("SELECT * FROM read_vortex('customer.vortex')")
```
See Also¶
Table Format Guides: Overview of all formats
Format Conversion Reference: CLI commands for format conversion
Parquet Deep Dive: The established alternative
Compression Guide: Data compression strategies