Module: Indexing & Pushdown | Duration: 10 min read | Lesson: 1 of 13

Part 1 made TheWorldShop's files smaller. Part 2 makes them faster. The "revenue last 7 days" dashboard still scans four years of files. Maya looks closer and finds each row group already stores its min and max date, but the engine is ignoring them. This lesson shows how statistics let an engine throw away as much as 99% of the data before reading it.

2. Concept Explanation

Parquet stores column statistics at two granularities: per-page statistics inside each DataPageHeader, and per-column-chunk statistics (one set per column per row group) inside ColumnMetaData. These are different structures for different jobs. The column-chunk stats are what people call row-group-level stats, and they enable row-group skipping. Page-level stats, combined with the Page Index (Lesson 2), enable page skipping. Conflating the two leads to wrong assumptions about when statistics actually help.

Row-group level (coarse, usually present). Stored in ColumnMetaData.statistics in the footer. The field is optional, so a writer may omit it, but most writers fill it in by default. A query WHERE ts > '2024-01-01' checks the ts column chunk's min_value/max_value in each row group without reading any data pages. If a row group's max timestamp is at or before the bound, the whole group is skipped.
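The per-row-group check can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any engine's actual code: RowGroupStats and can_skip_gt are hypothetical names, and plain integers stand in for decoded min_value/max_value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one column chunk's decoded footer statistics."""
    min_value: Optional[int]
    max_value: Optional[int]


def can_skip_gt(stats: RowGroupStats, bound: int) -> bool:
    """Return True if no row in the group can satisfy `col > bound`."""
    if stats.max_value is None:
        return False  # statistics missing: the group must be read
    return stats.max_value <= bound


groups = [
    RowGroupStats(0, 9_999),
    RowGroupStats(10_000, 19_999),
    RowGroupStats(None, None),  # writer omitted statistics
]
kept = [i for i, g in enumerate(groups) if not can_skip_gt(g, 9_999)]
print(kept)  # [1, 2]
```

Note the asymmetry: a missing statistic never permits a skip. Statistics can only prove that a group is irrelevant, never that it is relevant.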

Page level (fine, needs the data page headers). Stored in DataPageHeader.statistics, embedded in each column chunk's data stream. Reading them inline means a sequential scan through pages, which is expensive. The Page Index solves that by lifting page stats out of the data stream into separate index structures stored near the footer (Lesson 2).

The Statistics fields:

max / min (fields 1, 2): deprecated, signed-only comparison. A UINT32 column would sort wrong for values above 2^31. Writers should avoid them; readers use them only when column_orders is absent.
max_value / min_value (fields 5, 6): current. Ordered per the column's ColumnOrder. PLAIN-encoded, without the length prefix for BYTE_ARRAY.
null_count (field 3): nulls in the page or chunk.
distinct_count (field 4): approximate, rarely written, unreliable.
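is_max_value_exact / is_min_value_exact (fields 7, 8): if false, the stored bound may be truncated (for example a shortened string) rather than a value that occurs in the data. It is still a valid bound for skipping, but engines must not treat an inexact bound as an actual value, for example to answer MIN/MAX from metadata or to conclude that an equality predicate matches.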
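The signed-only problem with the deprecated fields is easy to reproduce. The sketch below assumes the PLAIN encoding of a 32-bit integer (4 bytes, little-endian) and uses two hypothetical helpers, as_signed and as_unsigned, to decode the same bytes both ways.

```python
import struct


def as_signed(raw: bytes) -> int:
    """Decode 4 little-endian bytes as a signed 32-bit integer."""
    return struct.unpack("<i", raw)[0]


def as_unsigned(raw: bytes) -> int:
    """Decode 4 little-endian bytes as an unsigned 32-bit integer."""
    return struct.unpack("<I", raw)[0]


small = struct.pack("<I", 100)
large = struct.pack("<I", 3_000_000_000)  # above 2^31

print(as_unsigned(small) < as_unsigned(large))  # True
print(as_signed(large))                         # -1294967296
print(as_signed(small) < as_signed(large))      # False
```

Under a signed comparison the large value looks negative, so it could be recorded as the minimum instead of the maximum, and a reader trusting those bounds could skip a row group that contains matching rows.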

Truncated bounds. A writer may store min_value="B" instead of "Blart Versenwald III" to keep the footer small. That's valid as long as is_min_value_exact=false is set, since min_value only needs to be at or below the true minimum.
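A minimal sketch of how a writer might truncate bounds safely, assuming unsigned byte-wise ordering. The function names truncate_min and truncate_max are hypothetical. The upper bound needs the mirror-image rule: a plain prefix would sort below the true maximum, so the last byte of the prefix has to be incremented.

```python
from typing import Optional


def truncate_min(value: bytes, length: int) -> bytes:
    """A prefix sorts at or below the full value, so it is a valid lower bound."""
    return value[:length]


def truncate_max(value: bytes, length: int) -> Optional[bytes]:
    """Truncate, then increment the last byte so the result sorts above `value`."""
    if len(value) <= length:
        return value  # nothing to truncate; the bound stays exact
    prefix = bytearray(value[:length])
    while prefix:
        if prefix[-1] < 0xFF:
            prefix[-1] += 1
            return bytes(prefix)
        prefix.pop()  # 0xFF cannot be incremented; retry on the previous byte
    return None  # no shorter upper bound exists; a writer would omit max_value


name = b"Blart Versenwald III"
print(truncate_min(name, 1))  # b'B'
print(truncate_max(name, 1))  # b'C'
```

Both results still bracket the real value (b"B" <= name <= b"C"), which is all a skip decision needs. Neither is a value that occurs in the column, which is why the exactness flags must be false.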

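NaN handling. NaN must not go into min/max, because NaN compares false against everything and would make the bounds meaningless. If a float column contains NaN, the writer should either compute min/max over the non-NaN values or omit min/max. NaN is not a null, so it does not count toward null_count.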
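A short sketch of why NaN has to be kept out of the bounds. It assumes one approach a writer can take, which is to exclude NaN and nulls when computing min/max. The helper float_min_max is a hypothetical name.

```python
import math
from typing import Iterable, Optional, Tuple


def float_min_max(values: Iterable[Optional[float]]) -> Optional[Tuple[float, float]]:
    """Compute (min, max) over values that are neither null nor NaN."""
    comparable = [v for v in values if v is not None and not math.isnan(v)]
    if not comparable:
        return None  # nothing usable: omit min/max entirely
    return min(comparable), max(comparable)


nan = float("nan")

# A naive min() is order-dependent once NaN is present.
print(min([nan, 1.0, 2.0]))  # nan
print(min([1.0, nan, 2.0]))  # 1.0

print(float_min_max([nan, 1.0, None, 2.0]))  # (1.0, 2.0)
print(float_min_max([nan, None]))            # None
```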

struct Statistics {
   1: optional binary max;           // DEPRECATED: signed-only sort
   2: optional binary min;           // DEPRECATED: signed-only sort
   3: optional i64 null_count;
   4: optional i64 distinct_count;   // approximate, often absent
   5: optional binary max_value;     // upper bound per ColumnOrder
   6: optional binary min_value;     // lower bound per ColumnOrder
   7: optional bool is_max_value_exact;
   8: optional bool is_min_value_exact;
}

3. Worked Example

import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
mask = rng.random(n) < 0.1   # 10% nulls
revenue = [float(v) if not mask[i] else None
           for i, v in enumerate(rng.uniform(0, 10_000, n))]

table = pa.table({
    "ts":      pa.array(range(n), pa.int64()),   # sequential, predictable stats
    "revenue": pa.array(revenue, pa.float64()),
    "country": pa.array(rng.choice(["US","EU","APAC"], n)),
})
pq.write_table(table, "/tmp/stats_demo.parquet", row_group_size=10_000, write_statistics=True)

meta = pq.ParquetFile("/tmp/stats_demo.parquet").metadata
for rg_i in range(meta.num_row_groups):
    rg = meta.row_group(rg_i)
    print(f"\nRow Group {rg_i} ({rg.num_rows} rows):")
    for col_i in range(rg.num_columns):
        col = rg.column(col_i)
        s = col.statistics           # None if the writer recorded no statistics
        if s is not None and s.has_min_max:
            print(f"  {col.path_in_schema:10s}: min={s.min!r:15} max={s.max!r:15} "
                  f"nulls={s.null_count}")

# Row group skipping in action
import duckdb
r = duckdb.execute("""
    SELECT COUNT(*) FROM read_parquet('/tmp/stats_demo.parquet')
    WHERE ts BETWEEN 40000 AND 50000
""").fetchone()
print(f"\nResult: {r[0]} rows (DuckDB skips row groups 0-3 via max_value)")

Because ts is sequential, each row group's min/max is a tight, non-overlapping range. The BETWEEN 40000 AND 50000 query touches only the last group; the others are skipped on statistics alone.

Aha: Statistics are only as useful as your data is sorted. The footer can carry min/max for every row group, but if rows are written in random order, every row group's range spans nearly the whole dataset and overlaps every predicate. The skip machinery is present and doing nothing. Sorting before write is what turns "stats exist" into "stats skip data."
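The effect of ordering can be simulated without writing any file. This sketch uses only the standard library. The helpers row_group_ranges and groups_to_read are hypothetical names that imitate what a writer records and what a reader checks, using the same 50,000 rows and 10,000-row groups as the worked example.

```python
import random


def row_group_ranges(values, row_group_size):
    """(min, max) for each consecutive row group, as a writer would record them."""
    return [
        (min(values[i:i + row_group_size]), max(values[i:i + row_group_size]))
        for i in range(0, len(values), row_group_size)
    ]


def groups_to_read(ranges, lo, hi):
    """Count row groups whose [min, max] overlaps the predicate range [lo, hi]."""
    return sum(1 for mn, mx in ranges if mx >= lo and mn <= hi)


ts_sorted = list(range(50_000))
ts_shuffled = ts_sorted[:]
random.Random(0).shuffle(ts_shuffled)  # fixed seed for repeatability

sorted_reads = groups_to_read(row_group_ranges(ts_sorted, 10_000), 40_000, 50_000)
shuffled_reads = groups_to_read(row_group_ranges(ts_shuffled, 10_000), 40_000, 50_000)
print(sorted_reads, shuffled_reads)  # 1 5
```

Same rows, same statistics fields, same predicate: the sorted layout reads one group out of five, while the shuffled layout reads all five.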

4. Your Turn

Exercise: TheWorldShop writes the events table two ways: sorted by ts, and in random arrival order. Both have 100 row groups.

For a query WHERE ts BETWEEN <one day> AND <next day>, roughly how many row groups can be skipped in each version?
A nullable coupon_code column has null_count omitted by the writer. Why might an engine refuse to push down WHERE coupon_code IS NOT NULL?
A string column stores is_min_value_exact=false. Can an engine use its min_value to satisfy WHERE name = 'Bob'? Why or why not?

5. Real-World Application

Statistics are the cheapest skip mechanism Parquet has, because row-group stats are already in memory the moment the footer is read. Most query engines that read Parquet use them, which is why "sort your fact table by the column you filter on" is among the highest-leverage tuning advice in the lakehouse world. At TheWorldShop, switching the nightly write to sort by ts is what finally lets the 7-day dashboard skip four years of row groups instead of scanning them. On services that bill by bytes scanned, the same move can cut a per-query cost from dollars to cents.

6. Recap + Bridge

Parquet keeps statistics at two levels: row-group stats in the footer for coarse skipping, and page stats for fine skipping. Both do little unless the data is sorted, or at least clustered, on the filter key. Page stats sit inline by default, which makes them too expensive to scan. The next lesson introduces the Page Index, which lifts them into index structures near the footer so engines can skip individual pages cheaply.