Module: Foundations | Duration: 8 min read | Lesson: 1 of 17

Maya just joined TheWorldShop, a global marketplace that sells everything from groceries to electronics. On her first day the CFO complains that the daily revenue dashboard takes seven minutes to refresh and costs $40 each time. Maya opens the warehouse and finds four years of orders sitting in CSV files. The dashboard needs three columns. Every query reads all hundred. Why does CSV make you pay for data you never touch, and what flips that?

2. Concept Explanation

Parquet stores the values of each column together in contiguous byte ranges instead of storing each row together. That single decision means a query touching 3 columns out of 100 reads roughly 3% of the file's data bytes, not 100%, assuming the columns are of similar size. The tradeoff is more complex writes and a penalty for random access to individual rows. Parquet is built for analytical reads, not point lookups or frequent small updates.

A row-oriented format (CSV, Avro) lays data out like this:

[row 0: id=1, name="alice", age=30, country="US", revenue=120.5, ...]
[row 1: id=2, name="bob",   age=25, country="DE", revenue=88.0,  ...]

To compute SUM(revenue) you read every field of every row to reach the revenue bytes scattered through the file. On a 10-column table with a billion rows, that's about 10x more data than you need.
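To make the row-oriented cost concrete, here is a minimal sketch that sums revenue from an in-memory CSV and counts how much of the scanned text was actually revenue. The five-column sample is a hypothetical cut of the rows shown above, and character counts stand in for bytes.

```python
import csv
import io

# Hypothetical 5-column sample; real order tables are far wider
rows = [
    {"id": 1, "name": "alice", "age": 30, "country": "US", "revenue": 120.5},
    {"id": 2, "name": "bob", "age": 25, "country": "DE", "revenue": 88.0},
]

# Serialize to CSV text the way a row-oriented export would
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# Row-oriented scan: every line is parsed in full just to reach one field
total = 0.0
revenue_chars = 0
for record in csv.DictReader(io.StringIO(csv_text)):
    total += float(record["revenue"])
    revenue_chars += len(record["revenue"])

print(f"SUM(revenue) = {total}")
print(f"scanned {len(csv_text)} chars, of which {revenue_chars} were revenue values")
```

The gap between the two counts is the data you paid to read but never used, and it widens with every extra column.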

Parquet's columnar layout groups by column instead:

┌──────────────────────────────────────────────────────────────────┐
│  ROW GROUP 0  (rows 0 – 999,999)                                 │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌─────────────┐   │
│  │  col: id   │ │ col: name  │ │ col: age   │ │col: revenue │   │
│  │ [1,2,3,…]  │ │["alice",…] │ │ [30,25,…]  │ │[120.5,88,…] │   │
│  └────────────┘ └────────────┘ └────────────┘ └─────────────┘   │
└──────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│  ROW GROUP 1  (rows 1,000,000 – 1,999,999)  …                    │
└──────────────────────────────────────────────────────────────────┘
[FileMetaData - Thrift-serialized footer]
[4-byte footer length][PAR1 magic]

SUM(revenue) seeks straight to the revenue column chunk in each row group and reads only those bytes. When the query also filters (say, WHERE revenue > 500), the min/max statistics stored for each column chunk let the engine skip whole row groups where no matching value can exist.
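Row-group skipping can be sketched in a few lines. This is an illustration of the idea, not how any particular engine implements it: RowGroupStats and row_groups_to_read are hypothetical names, and the statistics values are made up.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the per-column statistics kept in the footer."""
    index: int
    min_value: float
    max_value: float


def row_groups_to_read(stats, lower_bound):
    """Return indexes of row groups that may contain a value > lower_bound."""
    # A group whose max is <= lower_bound cannot hold a match, so it is skipped
    return [s.index for s in stats if s.max_value > lower_bound]


# Made-up statistics for the revenue column of three row groups
revenue_stats = [
    RowGroupStats(index=0, min_value=0.5, max_value=410.0),
    RowGroupStats(index=1, min_value=2.0, max_value=95.0),
    RowGroupStats(index=2, min_value=1.0, max_value=1200.0),
]

# WHERE revenue > 500: only row group 2 can contain a match
print(row_groups_to_read(revenue_stats, 500))
```

The check is conservative: a row group that passes may still contain no matching rows, but a row group that fails is guaranteed to contain none.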

Three more wins follow from the layout:

Encoding. Consecutive values in a column share a type and often share prefixes or low cardinality. Dictionary, RLE, and delta encodings exploit that. They can't exploit it across interleaved row bytes.
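A simplified sketch of two of these encodings applied to a low-cardinality column. Parquet's real encodings are more compact (its RLE is a hybrid with bit-packing), so treat the helper functions below as hypothetical illustrations of the principle only.

```python
def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = []
    positions = {}
    indexes = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indexes.append(positions[value])
    return dictionary, indexes


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            runs.append((value, 1))
    return runs


# A country column read top to bottom, as it would sit in a column chunk
country = ["US", "US", "US", "DE", "DE", "US"]
dictionary, indexes = dictionary_encode(country)
runs = run_length_encode(indexes)

print(dictionary)
print(indexes)
print(runs)
```

In a row layout the country values are separated by ids, names, and ages, so there are no consecutive repeats for run-length encoding to collapse.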

Compression. Compressors like Snappy and Zstd work best on runs of similar data. A column of INT64 timestamps compresses far better than a row of mixed types.
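You can see the effect with the same bytes arranged two ways. This sketch uses zlib from the standard library as a stand-in for Snappy or Zstd, and the three columns are hypothetical: two highly repetitive fields plus one random 64-bit field that no compressor can shrink much.

```python
import random
import struct
import zlib

rng = random.Random(0)
n = 10_000

# Hypothetical columns: a sorted low-cardinality code, a constant status flag,
# and a high-entropy 64-bit value
country = [b"US" if i < n // 2 else b"DE" for i in range(n)]
status = [b"P"] * n
noise = [rng.getrandbits(64) for _ in range(n)]

# Row-oriented bytes: the three fields of each row sit next to each other
row_bytes = b"".join(
    country[i] + status[i] + struct.pack("<Q", noise[i]) for i in range(n)
)

# Column-oriented bytes: each column's values sit next to each other
col_bytes = b"".join(country) + b"".join(status) + struct.pack(f"<{n}Q", *noise)

row_size = len(zlib.compress(row_bytes))
col_size = len(zlib.compress(col_bytes))

print(f"raw size in either layout:  {len(row_bytes):,} bytes")
print(f"row-oriented compressed:    {row_size:,} bytes")
print(f"column-oriented compressed: {col_size:,} bytes")
```

Both layouts hold identical values. In the columnar arrangement the repetitive columns collapse to almost nothing, while in the row arrangement the compressor has to re-describe them between every chunk of random bytes.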

Vectorized execution. CPUs process columnar arrays with SIMD instructions. A tight loop over float64[N] runs vectorized. A loop that chases struct fields does not.

The top-level structure is described by FileMetaData in parquet.thrift:

struct FileMetaData {
  1: required i32 version                          // always 1 or 2
  2: required list<SchemaElement> schema;          // flattened DFS schema tree
  3: required i64 num_rows                         // total row count across all row groups
  4: required list<RowGroup> row_groups            // one per horizontal partition
  5: optional list<KeyValue> key_value_metadata    // arbitrary writer-defined metadata
  6: optional string created_by                    // writer identity string
  7: optional list<ColumnOrder> column_orders;     // sort order for statistics
}
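A reader locates this struct by working backwards from the end of the file: the last 8 bytes hold the 4-byte little-endian footer length and the PAR1 magic. Here is a minimal sketch of that first step. The function name read_footer_range is hypothetical, and the demo input is a synthetic byte layout rather than a real Parquet file, so the footer bytes are placeholders and not valid Thrift.

```python
import io
import struct

MAGIC = b"PAR1"


def read_footer_range(f):
    """Return (offset, length) of the serialized FileMetaData in a binary file object."""
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    if file_size < 12:  # leading magic + footer length + trailing magic
        raise ValueError("too small to be a Parquet file")
    f.seek(file_size - 8)
    tail = f.read(8)
    # Last 8 bytes: 4-byte little-endian footer length, then the magic
    footer_length, magic = struct.unpack("<I4s", tail)
    if magic != MAGIC:
        raise ValueError("missing PAR1 magic at end of file")
    footer_offset = file_size - 8 - footer_length
    if footer_offset < len(MAGIC):
        raise ValueError("footer length is larger than the file")
    return footer_offset, footer_length


# Synthetic layout for illustration only: placeholder bytes, not real Thrift
fake_footer = b"\x00" * 37
fake_file = io.BytesIO(
    MAGIC + b"column-chunk-bytes" + fake_footer
    + struct.pack("<I", len(fake_footer)) + MAGIC
)
print(read_footer_range(fake_file))
```

Decoding the footer bytes themselves requires a Thrift compact-protocol reader, which is why in practice you let a library such as pyarrow do it.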

3. Worked Example

import pyarrow as pa
import pyarrow.parquet as pq
import os
import time

# Generate a wide table: 20 columns, 1 million rows
n = 1_000_000
cols = {"id": pa.array(range(n), type=pa.int64())}
for i in range(19):
    # i + 1 keeps col_0 from being all zeros, which would encode to almost nothing
    cols[f"col_{i}"] = pa.array([float(i + 1) * j for j in range(n)], type=pa.float64())

table = pa.table(cols)

pq.write_table(table, "/tmp/wide.parquet", row_group_size=250_000)

# Simulate an analytical query: read only 'id' and 'col_0'
t0 = time.perf_counter()
result = pq.read_table("/tmp/wide.parquet", columns=["id", "col_0"])
t1 = time.perf_counter()

file_size = os.path.getsize("/tmp/wide.parquet")
print(f"File size: {file_size / 1e6:.1f} MB")
print(f"Rows read: {len(result):,}")
print(f"Read time (2 of 20 cols): {(t1-t0)*1000:.1f} ms")
# With similar-sized columns, 2 of 20 is roughly 10% of the column data;
# time a full pq.read_table() on the same file to compare

Reading 2 columns out of 20 only touches the byte ranges for those two, plus the footer. The column chunks for the other 18 columns are never read off disk.

Aha: Parquet's read savings come from what it doesn't read. The moment you write SELECT * at scale, the engine reads every column chunk, pays to stitch the columns back into rows, and gets none of the column-skipping benefit. Columnar storage rewards selectivity and punishes you for asking for everything.

4. Your Turn

Exercise: TheWorldShop's orders table has 100 columns and a billion rows. The revenue dashboard reads 3 columns. Reason about the I/O before you measure it.

Roughly what fraction of the data bytes does the dashboard read from Parquet versus from CSV?
Name one query shape where Parquet would do more work than the CSV equivalent.
The team also wants to fetch a single full order by order_id thousands of times per second. Is Parquet the right format for that path? Why or why not?

5. Real-World Application

Every large analytics stack leans on this layout. BigQuery, Snowflake, Athena, Spark, and DuckDB all read columnar data so a dashboard that touches a few columns scans a few columns. When TheWorldShop moves its CSV exports to Parquet, the seven-minute revenue dashboard drops to seconds because it stops reading 97 columns it never displays. The same move cuts the per-query scan cost on systems billed by bytes scanned, which is why the CFO notices it on the invoice as well as the wall clock.

6. Recap + Bridge

Parquet groups data by column so selective queries read selective bytes, and that one decision unlocks encoding, compression, and vectorized reads. It pays off when queries are column-selective and costs you when you ask for whole rows. Next we crack the file open and look at the containers inside it: row groups, column chunks, and pages.