Parquet Deep Dive¶
Apache Parquet is the de facto standard for analytical data storage. Understanding row groups, column chunks, and compression options helps you run better benchmarks.
Overview¶
Apache Parquet has become the universal language of analytical data. Created in 2013 as a collaboration between Twitter and Cloudera, it’s now an Apache top-level project with support across virtually every analytics platform.
We made Parquet the default format in BenchBox because it’s the one format we can count on everywhere. DuckDB, Spark, Snowflake, BigQuery, Polars, DataFusion: they all read Parquet natively. This universality matters when you’re running the same benchmark across platforms.
But using Parquet effectively requires understanding a few core concepts. This guide covers what benchmarkers need to know: row groups, column chunks, compression options, and when (rarely) to tune these settings.
Why Columnar Matters for Analytics¶
Before diving into Parquet specifics, let’s understand why columnar storage helps analytical workloads.
Row vs Columnar Storage¶
| Storage Type | Data Layout | Best For |
|---|---|---|
| Row-oriented | All columns for row 1, then all columns for row 2… | OLTP, full-row access |
| Columnar | All values for column 1, then all values for column 2… | OLAP, analytical queries |
Analytical queries typically read few columns but many rows:
-- Only needs the l_extendedprice column, but all 6M rows (at SF1)
SELECT SUM(l_extendedprice) FROM lineitem;
-- Only needs 3 columns out of 16
SELECT l_returnflag, l_linestatus, SUM(l_quantity)
FROM lineitem
GROUP BY l_returnflag, l_linestatus;
With row storage, the query reads entire rows, including columns it doesn’t need. With columnar storage, it reads only the columns requested.
Analytical Query Patterns¶
Parquet enables several optimizations for analytical queries:
Column pruning: Read only required columns, skip the rest
Predicate pushdown: Apply filters at the storage layer before loading into memory
Better compression: Similar values group together within columns
Vectorized processing: Process columns in batches using CPU SIMD instructions
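These optimizations are not tied to any one engine; PyArrow, for example, exposes column pruning and predicate pushdown directly at read time. The sketch below is illustrative only: it assumes a local lineitem.parquet file (like the one in the structure diagram later in this guide) with TPC-H column names, and whether whole row groups are actually skipped depends on how the data is ordered.

import pyarrow.parquet as pq

# Column pruning: only the listed columns are read from disk.
subset = pq.read_table(
    "lineitem.parquet",                                  # assumed local file
    columns=["l_returnflag", "l_linestatus", "l_quantity"],
)

# Predicate pushdown: the filter is checked against row group min/max
# statistics, so data pages of non-matching row groups can be skipped.
filtered = pq.read_table(
    "lineitem.parquet",
    columns=["l_extendedprice"],
    filters=[("l_orderkey", "<", 1_000)],
)

print(subset.num_rows, filtered.num_rows)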
Parquet File Structure¶
Row Groups¶
Parquet files are divided into row groups, chunks of rows stored together. Each row group is independently readable, enabling parallel processing.
lineitem.parquet
├── Row Group 0 (rows 0-999,999)
│ ├── Column: l_orderkey (INT64, compressed)
│ ├── Column: l_quantity (DECIMAL, compressed)
│ └── Column: l_extendedprice (DECIMAL, compressed)
├── Row Group 1 (rows 1,000,000-1,999,999)
│ └── ...
└── Footer (schema, statistics, offsets)
Why row groups matter for benchmarks:
Parallelism: Each row group can be processed independently. More row groups enable more parallel tasks, up to a point.
Memory: Larger row groups require more memory to process. For memory-constrained environments, smaller row groups help.
Statistics: Parquet stores min/max values per column per row group. Queries with range filters can skip row groups entirely.
BenchBox default: 1 million rows per row group (the Parquet default). This works well for most benchmark scenarios.
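To see how a given file is actually laid out, you can read just the footer with PyArrow. This is a small sketch, assuming a local lineitem.parquet file; the attribute names are PyArrow's metadata API, not anything BenchBox-specific.

import pyarrow.parquet as pq

meta = pq.ParquetFile("lineitem.parquet").metadata   # reads the footer only
print(meta.num_row_groups, "row groups /", meta.num_rows, "rows")

# Per-column min/max statistics for the first row group; range filters
# use exactly these values to decide whether a row group can be skipped.
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    if col.statistics is not None and col.statistics.has_min_max:
        print(col.path_in_schema, col.statistics.min, col.statistics.max)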
Column Chunks¶
Within each row group, data is stored by column. This is where columnar benefits appear.
Column pruning: A query selecting 4 of a table's 16 columns reads roughly a quarter of the data (assuming similarly sized columns).
Better compression: Similar values group together. A column of dates compresses better than mixed-type rows.
Vectorized processing: Modern query engines process columns in batches (vectors), exploiting CPU SIMD instructions.
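The compression effect is easy to observe per column chunk, because the footer records both compressed and uncompressed sizes. A quick sketch, again assuming a local lineitem.parquet; the ratios will vary with data, types, and codec:

import pyarrow.parquet as pq

# On-disk compressed size vs. uncompressed (still encoded) size
# for each column chunk in the first row group.
rg = pq.ParquetFile("lineitem.parquet").metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    ratio = col.total_uncompressed_size / col.total_compressed_size
    print(f"{col.path_in_schema}: {ratio:.1f}x ({col.compression})")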
Compression in Parquet¶
For quick compression settings, see Format Conversion: Compression Guide.
Algorithm Comparison¶
Parquet supports several compression algorithms:
| Algorithm | Speed | Ratio | When to use |
|---|---|---|---|
| Snappy | Fastest | Good | Low-latency queries, interactive workloads |
| LZ4 | Fast | Good | Balance of speed and compression |
| Zstd | Medium | Best | Storage efficiency, batch workloads |
| Gzip | Slow | Good | Compatibility with older tools |
BenchBox recommendation: Zstd for most benchmarks. It provides the best compression ratio with acceptable decompression speed. At scale factor 10 and above, smaller files mean less I/O, which often outweighs the decompression cost.
Compression Levels¶
Zstd supports levels 1-22. Higher levels provide better compression at the cost of slower writes:
| Priority | Recommended | Reason |
|---|---|---|
| Storage cost | Zstd:9 | Smallest files |
| Query speed | Snappy | Fastest decompression |
| Balanced | Zstd:3 | BenchBox default |
| Debugging | None | No decompression overhead |
For most benchmarks, the default Zstd:3 works well. Higher compression levels (Zstd:9) save storage but add write time. Snappy trades ~20% larger files for faster decompression.
When to Adjust Compression¶
# Default (Zstd level 3)
benchbox run --platform duckdb --benchmark tpch --scale 1
# Higher compression (smaller files, slower write)
benchbox run --platform duckdb --benchmark tpch --compression zstd:9
# Faster compression (larger files, faster write)
benchbox run --platform duckdb --benchmark tpch --compression snappy
# No compression (debugging, baseline)
benchbox run --platform duckdb --benchmark tpch --compression none
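If you want to measure the storage side of this trade-off on your own data without a full benchmark run, rewriting one table with each setting is enough. A sketch using PyArrow, assuming an existing lineitem.parquet as input; write time and query speed still need to be measured separately.

import os
import pyarrow.parquet as pq

table = pq.read_table("lineitem.parquet")          # assumed local input

# (codec, level); the level only applies to zstd here.
settings = [("zstd", 3), ("zstd", 9), ("snappy", None), ("none", None)]

for codec, level in settings:
    label = codec if level is None else f"{codec}{level}"
    out = f"lineitem_{label}.parquet"
    pq.write_table(table, out, compression=codec, compression_level=level)
    print(f"{label}: {os.path.getsize(out) / 1e6:.1f} MB")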
Row Group Tuning¶
When to Increase Row Group Size¶
Running at SF100+ (reduce footer overhead)
Using cloud storage (fewer S3/GCS GET requests)
Memory is not constrained
When to Decrease Row Group Size¶
Maximizing parallelism on many-core systems
Running in memory-constrained environments
Debugging query execution patterns
BenchBox uses the 1M row default, which works for most scenarios. Custom row group sizes require manual data generation with PyArrow.
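A minimal sketch of that manual step, assuming an existing lineitem.parquet to rewrite; PyArrow's write_table takes the maximum rows per group directly:

import pyarrow.parquet as pq

table = pq.read_table("lineitem.parquet")

# Smaller row groups: more independent units for parallelism, more footer metadata.
pq.write_table(table, "lineitem_rg256k.parquet",
               row_group_size=256_000, compression="zstd")

# Larger row groups: fewer object-store GET requests, more memory per task.
pq.write_table(table, "lineitem_rg4m.parquet",
               row_group_size=4_000_000, compression="zstd")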
Best Practices for Benchmarking¶
Recommended Settings¶
# Standard benchmark (reproducible, efficient)
benchbox run --platform duckdb --benchmark tpch --scale 10 --compression zstd:3
# Storage-focused comparison
benchbox run --platform duckdb --benchmark tpch --scale 10 --compression zstd:9
# Performance baseline (measure decompression impact)
benchbox run --platform duckdb --benchmark tpch --scale 10 --compression none
Common Mistakes to Avoid¶
1. Using CSV for benchmarks
CSV parsing adds overhead that skews results. A “fast” platform might just be a better CSV parser, not a better query engine.
2. Ignoring compression settings
Different tools use different defaults: DuckDB writes Snappy unless told otherwise, for example, and other engines make their own choices. Specify compression explicitly for reproducible comparisons (a quick way to verify what a file actually contains is sketched after this list).
3. Comparing across formats
Parquet vs CSV isn’t a fair comparison. If you’re evaluating formats, compare Parquet to Parquet with different settings, or to other columnar formats like ORC or Vortex.
4. Forgetting about statistics
Queries that filter on columns with good min/max separation benefit from predicate pushdown. If your benchmark doesn’t include filtering queries, you’re missing a key Parquet optimization.
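Related to mistakes 2 and 4: the file footer records both the codec and the per-column statistics, so it is cheap to check what a tool actually wrote before comparing numbers. A PyArrow sketch, assuming a local file:

import pyarrow.parquet as pq

col = pq.ParquetFile("lineitem.parquet").metadata.row_group(0).column(0)
print("codec:", col.compression)                     # e.g. ZSTD, SNAPPY
print("min/max stats present:",
      col.statistics is not None and col.statistics.has_min_max)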
See Also¶
Table Format Guides: Overview of all formats
Format Conversion Reference: CLI commands for format conversion
Compression Guide: Data compression strategies
Delta Lake Guide: Table format built on Parquet