Maya just joined TheWorldShop, a global online marketplace selling everything from groceries to electronics. On her first day the CFO complains that the daily revenue dashboard takes seven minutes and costs $40 per refresh. Maya opens the warehouse and finds four years of orders sitting in CSV. The dashboard only needs one column, but every query reads all hundred. Today she learns why CSV makes you pay for data you never use, and how columnar storage flips that.
Overview
Parquet stores each column of a table in a contiguous byte range rather than storing each row together. This single layout decision means a query that touches 3 columns out of 100 reads roughly 3% of the file's data bytes instead of 100%. The tradeoff is write complexity and random-access penalty — Parquet is purpose-built for analytical reads, not point lookups or frequent small updates.
How It Works
A row-oriented format (CSV, Avro, row-major Parquet predecessor) lays out data like this:
To compute SUM(revenue) you must read every field of every row to reach the revenue bytes scattered throughout the file. On a 10-column table with 1 billion rows, that means reading ~10x more data than necessary.
Parquet's columnar layout:
SUM(revenue) seeks directly to the revenue column chunk in each row group and reads only those bytes. Dictionary encoding and min/max statistics let the engine skip entire row groups where no matching values exist.
Three additional properties follow from the layout:
Encoding efficiency. Consecutive values in a column share a type and often share prefixes or low cardinality. PLAIN_DICTIONARY, RLE, and DELTA encodings exploit this; they cannot exploit it across interleaved row bytes.
Compression efficiency. Compressors (Snappy, Zstd) operate on blocks of similar-type data. A column of INT32 timestamps compresses far better than a row of mixed types.
Vectorized execution. CPUs process columnar arrays with SIMD instructions. A tight loop over float64[N] executes in vectorized form; a loop that dereferences struct fields does not.
The Thrift Definition
The top-level file structure is described by FileMetaData:
Worked Example
When to Use / When to Avoid
| Use When | Avoid When |
|---|---|
| Analytical queries aggregate or filter a subset of columns | Queries need every column of every row (e.g. SELECT * on random rows) |
| Data is written once and read many times | Frequent single-row updates or deletes are required |
| Columns have repeated values (low cardinality, sorted IDs) | Each row is unique across all columns — encoding gains vanish |
| Integration with columnar engines (Spark, DuckDB, Athena, BigQuery) | You need human-readable output without a reader library |
| File size and I/O cost matter more than write latency | Append latency is the bottleneck (Parquet requires buffering a full row group) |
Key Takeaway
Parquet's columnar layout earns its keep only when queries are selective over columns — the moment you SELECT * at scale, you pay the full row-group-buffering cost of writes with none of the read savings.