Part 1 made TheWorldShop's files smaller. Part 2 makes them faster. The "revenue last 7 days" dashboard still scans four years of files. Maya looks closer and finds that each row group already stores its min/max date, but the engine is ignoring it. This lesson shows how statistics let the engine discard 99% of the data before reading it.
Overview
Parquet stores column statistics at two granularities: per-page statistics inside each DataPageHeader and per-column-chunk statistics inside ColumnMetaData. Both use the same Statistics struct, but they serve different purposes. Column-chunk stats (one set per column per row group) enable row group skipping; page-level stats (combined with the Page Index) enable page-level skipping. Many sources conflate the two, leading to incorrect assumptions about when statistics are used.
How It Works
Row-group level (coarser; optional in the format, though writers commonly emit it): Stored in ColumnMetaData.statistics inside the footer. A query with WHERE ts > '2024-01-01' checks min_value/max_value of the ts column chunk in every row group without reading any data pages. If a row group's max timestamp is not after 2024-01-01, no row in it can match, and the entire row group (potentially millions of rows) is skipped.
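The skip decision above reduces to comparing the predicate constant against one bound. The sketch below assumes the min/max have already been decoded into Python dates; ChunkStats, can_skip_gt, and can_skip_lt are illustrative names, not part of any Parquet library.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ChunkStats:
    """Hypothetical, already-decoded min/max for one column chunk."""
    min_value: date
    max_value: date


def can_skip_gt(stats: Optional[ChunkStats], threshold: date) -> bool:
    """True if no row in the chunk can satisfy `col > threshold`."""
    if stats is None:
        return False  # no statistics: the row group must be read
    return stats.max_value <= threshold


def can_skip_lt(stats: Optional[ChunkStats], threshold: date) -> bool:
    """True if no row in the chunk can satisfy `col < threshold`."""
    if stats is None:
        return False
    return stats.min_value >= threshold


old_group = ChunkStats(date(2023, 1, 1), date(2023, 12, 31))
new_group = ChunkStats(date(2023, 12, 15), date(2024, 1, 20))
print(can_skip_gt(old_group, date(2024, 1, 1)))  # old data cannot match
print(can_skip_gt(new_group, date(2024, 1, 1)))  # range overlaps, must read
```

Missing statistics always mean "read the row group": skipping is an optimization, so the safe default is never to skip.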
Page level (finer, stored in each data page header): Stored in DataPageHeader.statistics within each page header, which is embedded in the data stream of each column chunk. Reading page-level stats this way requires walking the column chunk page by page, because each header sits directly in front of its page's data. That is expensive without the Page Index. The Page Index (Chapter 19) solves this by collecting per-page min/max values into separate index structures stored near the footer, so they can be read without touching the data pages.
The Statistics struct fields:
max / min (fields 1, 2): deprecated. Always compared as signed values, regardless of the column's logical type. A UINT32 column would have min/max ordered as signed INT32, which is wrong for values ≥ 2^31. New writers should use max_value/min_value instead and, at most, also set these for signed-order columns to support older readers; readers may use them only when column_orders is absent.
max_value / min_value (fields 5, 6): current. Ordered according to the column's ColumnOrder (typically TypeDefinedOrder). Values are PLAIN-encoded, except that BYTE_ARRAY values omit the 4-byte length prefix.
null_count (field 3): count of null values in the page or column chunk.
distinct_count (field 4): count of distinct values; rarely written and unreliable in practice.
is_max_value_exact / is_min_value_exact (fields 7, 8): if false, the stored min/max is only a bound (e.g. a truncated string), not a value that actually occurs. Engines can still prune with an inexact bound, but must not treat it as a value present in the data, for example to conclude that an equality predicate matches or to answer MIN()/MAX() from metadata alone.
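To see why the deprecated fields break for unsigned columns, the sketch below encodes a few UINT32 values as 4-byte little-endian integers (the PLAIN layout for INT32) and computes min/max under both interpretations. The sample values and helper names are made up for illustration.

```python
import struct


def as_signed(raw: bytes) -> int:
    """Interpret 4 little-endian bytes as a signed INT32 (legacy ordering)."""
    return struct.unpack("<i", raw)[0]


def as_unsigned(raw: bytes) -> int:
    """Interpret 4 little-endian bytes as UINT32 (the column's real type)."""
    return struct.unpack("<I", raw)[0]


# Illustrative UINT32 column values; 2**31 does not fit in a signed INT32.
values = [1, 2**31, 7]
encoded = [struct.pack("<I", v) for v in values]

# Deprecated min/max: chosen by signed comparison.
legacy_min = min(encoded, key=as_signed)
legacy_max = max(encoded, key=as_signed)

# min_value/max_value: chosen by the column's own (unsigned) order.
correct_min = min(encoded, key=as_unsigned)
correct_max = max(encoded, key=as_unsigned)

print("legacy :", as_unsigned(legacy_min), as_unsigned(legacy_max))
print("correct:", as_unsigned(correct_min), as_unsigned(correct_max))
```

Under signed comparison, 2**31 wraps to a negative number and is stored as the minimum, so the legacy max becomes 7. A reader that trusted those bounds for a filter such as x > 100 would wrongly skip a chunk that contains 2**31.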
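Truncated string bounds: A writer may store min_value="B" instead of "Blart Versenwald III" to save space in the footer. This is valid because min_value only needs to be ≤ the true minimum. Symmetrically, a truncated max_value must be rounded up so that it stays ≥ the true maximum. The writer must then set is_min_value_exact=false (or is_max_value_exact=false) so engines do not treat the truncated value as one that actually occurs in the column.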
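A minimal sketch of bound truncation, assuming unsigned byte-wise ordering of the raw bytes. The function names and the (bound, is_exact) return shape are illustrative; this byte-level increment can produce bytes that are not valid UTF-8, so a production writer may prefer to increment at a code point boundary instead.

```python
from typing import Optional, Tuple


def truncate_min(value: bytes, length: int) -> Tuple[bytes, bool]:
    """Return (lower bound, is_exact). A prefix always sorts <= the full value."""
    if len(value) <= length:
        return value, True
    return value[:length], False


def truncate_max(value: bytes, length: int) -> Optional[Tuple[bytes, bool]]:
    """Return (upper bound, is_exact), or None if no shorter bound exists.

    A plain prefix would sort below the true maximum, so the last byte
    that is not 0xFF is incremented after truncation.
    """
    if len(value) <= length:
        return value, True
    prefix = bytearray(value[:length])
    while prefix:
        if prefix[-1] != 0xFF:
            prefix[-1] += 1
            return bytes(prefix), False
        prefix.pop()  # 0xFF cannot be incremented; shorten and retry
    return None  # prefix was all 0xFF: the caller should omit max_value


name = b"Blart Versenwald III"
print(truncate_min(name, 1))
print(truncate_max(name, 1))
```

The second element of each tuple is what a writer would store in is_min_value_exact or is_max_value_exact.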
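NaN handling for floats: NaN values must not be written to min or max, because NaN has no defined position in the ordering and would make the bounds meaningless. Writers should compute min/max over the non-NaN values, or omit min/max entirely if every non-null value is NaN. null_count counts only nulls and must not be used to account for NaN values. Because the bounds say nothing about NaN, a reader evaluating a predicate that looks for NaN must ignore min/max.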
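The writer-side rule can be sketched as follows, assuming the column values arrive as Python floats with None standing in for null. float_min_max is a hypothetical helper, not a library function.

```python
import math
from typing import Iterable, Optional, Tuple


def float_min_max(
    values: Iterable[Optional[float]],
) -> Tuple[Optional[Tuple[float, float]], int]:
    """Return ((min, max) or None, null_count), ignoring NaN for the bounds."""
    null_count = 0
    lo: Optional[float] = None
    hi: Optional[float] = None
    for v in values:
        if v is None:
            null_count += 1  # nulls are counted; NaN is not a null
            continue
        if math.isnan(v):
            continue  # NaN never becomes a bound
        lo = v if lo is None else min(lo, v)
        hi = v if hi is None else max(hi, v)
    bounds = None if lo is None else (lo, hi)
    return bounds, null_count


bounds, nulls = float_min_max([1.5, float("nan"), None, -2.0])
print(bounds, nulls)
```

When every non-null value is NaN, the helper returns None for the bounds, which corresponds to omitting min/max from the statistics.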
The Thrift Definition
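As a reading aid, the struct can be mirrored in Python using the field ids listed earlier. This is a sketch of the shape only, not the IDL itself. usable_bounds is a hypothetical helper that encodes the reader rule described above: prefer min_value/max_value, and fall back to the deprecated pair only when column_orders is absent.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Statistics:
    """Python mirror of the Statistics fields; every field is optional."""
    max: Optional[bytes] = None                # field 1, deprecated
    min: Optional[bytes] = None                # field 2, deprecated
    null_count: Optional[int] = None           # field 3
    distinct_count: Optional[int] = None       # field 4
    max_value: Optional[bytes] = None          # field 5
    min_value: Optional[bytes] = None          # field 6
    is_max_value_exact: Optional[bool] = None  # field 7
    is_min_value_exact: Optional[bool] = None  # field 8


def usable_bounds(
    stats: Statistics, column_orders_present: bool
) -> Optional[Tuple[bytes, bytes]]:
    """Pick the (min, max) pair a reader may trust, or None if there is none.

    Simplification: only handles the case where both bounds are present.
    """
    if stats.min_value is not None and stats.max_value is not None:
        return stats.min_value, stats.max_value
    if (
        not column_orders_present
        and stats.min is not None
        and stats.max is not None
    ):
        return stats.min, stats.max  # legacy file: signed ordering only
    return None


modern = Statistics(min_value=b"a", max_value=b"z", null_count=0)
legacy = Statistics(min=b"a", max=b"z")
print(usable_bounds(modern, column_orders_present=True))
print(usable_bounds(legacy, column_orders_present=True))
print(usable_bounds(legacy, column_orders_present=False))
```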
Worked Example
To demonstrate row group skipping:
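A minimal simulation of the "revenue last 7 days" query, using a synthetic footer rather than a real Parquet file. It assumes the data is sorted by date with exactly one row group per day across four years; the dates, the dictionary layout, and row_groups_to_read are all invented for illustration.

```python
from datetime import date, timedelta

# Synthetic footer: one row group per day, 2021-01-01 through 2024-12-31.
start, end = date(2021, 1, 1), date(2024, 12, 31)
footer = []
day = start
while day <= end:
    footer.append({"row_group": len(footer), "min_date": day, "max_date": day})
    day += timedelta(days=1)


def row_groups_to_read(footer, lo, hi):
    """Keep only row groups whose [min_date, max_date] overlaps [lo, hi]."""
    return [
        rg["row_group"]
        for rg in footer
        if not (rg["max_date"] < lo or rg["min_date"] > hi)
    ]


# "Last 7 days" relative to the newest data in the synthetic footer.
hi = end
lo = end - timedelta(days=6)
selected = row_groups_to_read(footer, lo, hi)
skipped_fraction = 1 - len(selected) / len(footer)

print(f"read {len(selected)} of {len(footer)} row groups "
      f"({skipped_fraction:.1%} skipped)")
# prints: read 7 of 1461 row groups (99.5% skipped)
```

The result depends entirely on the data being clustered by date. If each row group contained dates from all four years, every min/max range would overlap the query window and nothing would be skipped.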
When to Use / When to Avoid
| Use When | Avoid When |
|---|---|
| Always write null_count: readers depend on it even when zero | Writing deprecated min/max fields for columns with unsigned types |
| Write min_value/max_value for any column used in WHERE filters | Trusting distinct_count: it's rarely written accurately |
| Enable statistics for sorted or clustered data, where skipping works best | Writing statistics for high-cardinality random columns: the footer bloats with no skip benefit |
| Set is_max_value_exact=false when truncating string bounds | Omitting null_count for nullable columns: engines may conservatively assume data is present |
Key Takeaway
Page-level statistics in DataPageHeader and row-group-level statistics in ColumnMetaData serve different skip granularities. The Page Index (Chapter 19) lifts page stats out of the data stream into index structures stored near the footer, so engines can use them without scanning through data pages.