Why Parquet? The Case for Columnar Storage

Explore this on a real file in the Parquet Viewer. Drop any .parquet — or load the built-in sample — to see the schema, row groups, encodings, compression and statistics from this lesson, 100% in your browser. Open the tool →

Maya just joined TheWorldShop, a global online marketplace selling everything from groceries to electronics. On her first day the CFO complains that the daily revenue dashboard takes seven minutes and costs $40 per refresh. Maya opens the warehouse and finds four years of orders sitting in CSV. The dashboard only needs one column, but every query reads all hundred. Today she learns why CSV makes you pay for data you never use, and how columnar storage flips that.


Overview

Parquet stores each column of a table in a contiguous byte range rather than storing each row together. This single layout decision means a query that touches 3 columns out of 100 reads roughly 3% of the file's data bytes instead of 100%. The tradeoff is write complexity and random-access penalty — Parquet is purpose-built for analytical reads, not point lookups or frequent small updates.

How It Works

A row-oriented format (CSV, Avro, row-major Parquet predecessor) lays out data like this:

[row 0: id=1, name="alice", age=30, country="US", revenue=120.5, ...]
[row 1: id=2, name="bob",   age=25, country="DE", revenue=88.0,  ...]

To compute SUM(revenue) you must read every field of every row to reach the revenue bytes scattered throughout the file. On a 10-column table with 1 billion rows, that means reading ~10x more data than necessary.

Parquet's columnar layout:

┌──────────────────────────────────────────────────────────────────┐
│  ROW GROUP 0  (rows 0 – 999,999)                                 │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌─────────────┐   │
│  │  col: id   │ │ col: name  │ │ col: age   │ │col: revenue │   │
│  │ [1,2,3,…]  │ │["alice",…] │ │ [30,25,…]  │ │[120.5,88,…] │   │
│  └────────────┘ └────────────┘ └────────────┘ └─────────────┘   │
└──────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│  ROW GROUP 1  (rows 1,000,000 – 1,999,999)  …                    │
└──────────────────────────────────────────────────────────────────┘
[FileMetaData — Thrift-serialized footer]
[4-byte footer length][PAR1 magic]

SUM(revenue) seeks directly to the revenue column chunk in each row group and reads only those bytes. Dictionary encoding and min/max statistics let the engine skip entire row groups where no matching values exist.

Three additional properties follow from the layout:

Encoding efficiency. Consecutive values in a column share a type and often share prefixes or low cardinality. PLAIN_DICTIONARY, RLE, and DELTA encodings exploit this; they cannot exploit it across interleaved row bytes.

Compression efficiency. Compressors (Snappy, Zstd) operate on blocks of similar-type data. A column of INT32 timestamps compresses far better than a row of mixed types.

Vectorized execution. CPUs process columnar arrays with SIMD instructions. A tight loop over float64[N] executes in vectorized form; a loop that dereferences struct fields does not.

The Thrift Definition

The top-level file structure is described by FileMetaData:

// excerpt from parquet.thrift
struct FileMetaData {
  1: required i32 version                          // always 1 or 2
  2: required list<SchemaElement> schema;          // flattened DFS schema tree
  3: required i64 num_rows                         // total row count across all row groups
  4: required list<RowGroup> row_groups            // one per horizontal partition
  5: optional list<KeyValue> key_value_metadata    // arbitrary writer-defined metadata
  6: optional string created_by                    // writer identity string
  7: optional list<ColumnOrder> column_orders;     // sort order for statistics
  8: optional EncryptionAlgorithm encryption_algorithm
  9: optional binary footer_signing_key_metadata
}

Worked Example

import pyarrow as pa
import pyarrow.parquet as pq
import time, os

# Generate a wide table — 20 columns, 1 million rows
n = 1_000_000
cols = {"id": pa.array(range(n), type=pa.int64())}
for i in range(19):
    cols[f"col_{i}"] = pa.array([float(i) * j for j in range(n)], type=pa.float64())

table = pa.table(cols)

pq.write_table(table, "/tmp/wide.parquet", row_group_size=250_000)

# Simulate analytical query: read only 'id' and 'col_0'
t0 = time.perf_counter()
result = pq.read_table("/tmp/wide.parquet", columns=["id", "col_0"])
t1 = time.perf_counter()

file_size = os.path.getsize("/tmp/wide.parquet")
print(f"File size: {file_size / 1e6:.1f} MB")
print(f"Rows read: {len(result):,}")
print(f"Read time (2 of 20 cols): {(t1-t0)*1000:.1f} ms")
# Expected: read time ~10% of reading all 20 columns

When to Use / When to Avoid

Use WhenAvoid When
Analytical queries aggregate or filter a subset of columnsQueries need every column of every row (e.g. SELECT * on random rows)
Data is written once and read many timesFrequent single-row updates or deletes are required
Columns have repeated values (low cardinality, sorted IDs)Each row is unique across all columns — encoding gains vanish
Integration with columnar engines (Spark, DuckDB, Athena, BigQuery)You need human-readable output without a reader library
File size and I/O cost matter more than write latencyAppend latency is the bottleneck (Parquet requires buffering a full row group)

Key Takeaway

Parquet's columnar layout earns its keep only when queries are selective over columns — the moment you SELECT * at scale, you pay the full row-group-buffering cost of writes with none of the read savings.