Statistics: Min, Max, Null Count, and Distinct Count

Explore this on a real file in the Parquet Viewer. Drop any .parquet — or load the built-in sample — to see the schema, row groups, encodings, compression and statistics from this lesson, 100% in your browser. Open the tool →

Part 1 made TheWorldShop's files smaller. Part 2 makes them faster. The "revenue last 7 days" dashboard still scans four years of files. Maya looks closer and finds each row group already stores its min/max date — but the engine is ignoring them. This lesson is how statistics let the engine throw away 99% of data before reading.


Overview

Parquet stores column statistics at two granularities: per-page statistics inside each DataPageHeader and per-column-chunk statistics inside ColumnMetaData. These are not the same structure serving the same purpose. Row-group-level stats enable row group skipping; page-level stats (combined with the Page Index) enable page-level skipping. Many sources conflate the two, leading to incorrect assumptions about when statistics are used.

How It Works

Row-group level (coarser, always present): Stored in ColumnMetaData.statistics inside the footer. A query with WHERE ts > '2024-01-01' checks min_value/max_value for every column chunk in every row group without reading any data pages. If the row group's max timestamp is before 2024-01-01, the entire row group (potentially millions of rows) is skipped.

Page level (finer, requires Data Page Header): Stored in DataPageHeader.statistics within each page header, which is embedded in the data stream of each column chunk. Reading page-level stats requires seeking through data pages sequentially — expensive without the Page Index. The Page Index (Chapter 19) solves this by aggregating page-level stats into the footer.

The Statistics struct fields:

  • max / min (fields 1, 2): deprecated. Encoded by signed comparison regardless of type. A UINT32 column would have min/max sorted as signed INT32 — wrong for values ≥ 2^31. Writers should not write these; readers may use them only when column_orders is absent.
  • min_value / max_value (fields 5, 6): current. Sorted according to the column's ColumnOrder (typically TypeDefinedOrder). Values are PLAIN-encoded without the 4-byte length prefix for BYTE_ARRAY.
  • null_count (field 3): count of null values in the page or column chunk.
  • distinct_count (field 4): approximate distinct value count, rarely written, unreliable.
  • is_max_value_exact / is_min_value_exact (fields 7, 8): if false, the stored min/max is a bound (e.g. truncated string), not the exact value. Engines must not use inexact bounds for equality predicates.

Truncated string bounds: A writer may store min_value="B" instead of "Blart Versenwald III" to save space in the footer. This is valid — min_value only needs to be ≤ the true minimum. But is_min_value_exact=false must be set so engines don't apply equality predicates using the truncated value.

NaN handling for floats: NaN values must not be written to min or max. If a column contains NaN, the engine cannot determine the true min/max. Writers should set null_count to account for NaN values if needed or omit min/max entirely.

The Thrift Definition

// excerpt from parquet.thrift
struct Statistics {
   1: optional binary max;           // DEPRECATED: signed-only sort order
   2: optional binary min;           // DEPRECATED: signed-only sort order
   3: optional i64 null_count;       // null values in this page/column chunk
   4: optional i64 distinct_count;   // approximate distinct count; often absent
   5: optional binary max_value;     // upper bound per ColumnOrder
   6: optional binary min_value;     // lower bound per ColumnOrder
   7: optional bool is_max_value_exact;  // false = truncated/rounded upper bound
   8: optional bool is_min_value_exact;  // false = truncated/rounded lower bound
}

struct DataPageHeader {
  1: required i32 num_values
  2: required Encoding encoding
  3: required Encoding definition_level_encoding;
  4: required Encoding repetition_level_encoding;
  5: optional Statistics statistics;   // page-level stats
}

struct ColumnMetaData {
  ...
  12: optional Statistics statistics;  // column-chunk-level stats
  ...
}

Worked Example

import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# A table with mixed types including nulls
mask = rng.random(n) < 0.1  # 10% nulls
revenue_vals = rng.uniform(0, 10_000, n)
revenue_vals_with_nulls = [float(v) if not mask[i] else None for i, v in enumerate(revenue_vals)]

table = pa.table({
    "ts":      pa.array(range(n), pa.int64()),       # sequential — predictable stats
    "revenue": pa.array(revenue_vals_with_nulls, pa.float64()),
    "country": pa.array(rng.choice(["US","EU","APAC"], n)),
})

pq.write_table(table, "/tmp/stats_demo.parquet", row_group_size=10_000,
               write_statistics=True)

pf = pq.ParquetFile("/tmp/stats_demo.parquet")
meta = pf.metadata

for rg_i in range(meta.num_row_groups):
    rg = meta.row_group(rg_i)
    print(f"\nRow Group {rg_i} ({rg.num_rows} rows):")
    for col_i in range(rg.num_columns):
        col = rg.column(col_i)
        s = col.statistics
        if s:
            print(f"  {col.path_in_schema:10s}: "
                  f"min={s.min!r:15} max={s.max!r:15} "
                  f"nulls={s.null_count} has_min_max={s.has_min_max}")

To demonstrate row group skipping:

import duckdb
# Only row groups where ts range overlaps [40000, 50000] are read
result = duckdb.execute("""
    SELECT COUNT(*) FROM read_parquet('/tmp/stats_demo.parquet')
    WHERE ts BETWEEN 40000 AND 50000
""").fetchone()
print(f"Result: {result[0]} rows (should be ~10000)")
# DuckDB skips row groups 0-3 (ts 0-39999) based on max_value statistics

When to Use / When to Avoid

Use WhenAvoid When
Always write null_count — readers depend on it even when zeroWriting deprecated min/max fields for columns with unsigned types
Write min_value/max_value for any column used in WHERE filtersTrusting distinct_count — it's rarely written accurately
Enable statistics for sorted or clustered data — skipping works bestWriting statistics for high-cardinality random columns — footer bloats with no skip benefit
Set is_max_value_exact=false when truncating string boundsOmitting null_count for nullable columns — engines may conservatively assume data is present

Key Takeaway

Page-level statistics in DataPageHeader and row-group-level statistics in ColumnMetaData serve different skip granularities — the Page Index (Chapter 19) lifts page stats into the footer so engines can use them without scanning through data pages.