What's Actually Inside Your Parquet File
Parquet has a reputation problem, and it's the good kind: it's so reliably faster than CSV that most engineers stop thinking about it the moment they switch. The file is columnar, it compresses well, the warehouse reads it quickly - what else is there to know?
Quite a lot, as it turns out. A Parquet file isn't a table. It's a layout - a set of decisions about how rows are grouped, how each column's values are encoded, which codec squeezes them, and what summary statistics get written alongside. Those decisions are made by whatever wrote the file, often with defaults nobody chose deliberately. And they're the difference between a query that reads 4 MB and the exact same query, over the exact same rows, reading 4 GB.
This post opens the format up. Not the spec - the parts that change your bill. And because the only convincing way to learn this is on a file you recognize, everything here is something you can see for yourself in the free Parquet Viewer: drop a file in, nothing uploads, and the layout is right there.
A file is a tree, not a grid
The mental model that gets people into trouble is the spreadsheet: rows down, columns across, one flat grid. Parquet's real shape is a tree.
At the top, the file splits into row groups - horizontal slabs of rows, each a self-contained unit. Inside a row group, each column is stored separately as a column chunk. Each chunk is divided into pages, the smallest unit the reader actually decompresses. And pinned to the end of the file is the footer: the schema, the byte offsets of everything, and - this is the part that earns its keep - per-chunk statistics like the min and max value in each column.
Every performance property of Parquet falls out of this structure. The reader opens the footer first, decides which row groups and pages it can skip entirely based on the statistics, and only then touches data. The grid model can't explain why one query is cheap and another expensive. The tree can.
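To make "the reader opens the footer first" concrete, here is a minimal sketch of how a reader locates the footer. It relies only on the file framing: an unencrypted Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The function name `footer_span` and the placeholder payloads are illustrative, and the sketch does not decode the footer itself (which is Thrift-encoded metadata).

```python
import struct

MAGIC = b"PAR1"  # unencrypted Parquet files begin and end with these four bytes


def footer_span(data: bytes) -> tuple[int, int]:
    """Return (offset, length) of the footer metadata inside `data`."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic)")
    # The 4 bytes just before the trailing magic hold the footer length
    # as a little-endian unsigned 32-bit integer.
    (length,) = struct.unpack("<I", data[-8:-4])
    start = len(data) - 8 - length
    if start < 4:
        raise ValueError("footer length is larger than the file")
    return start, length


# Stand-in bytes with the same framing as a real file.
# The payloads are placeholders, not real Parquet data.
fake_footer = b"<schema, offsets, statistics>"
fake_file = (
    MAGIC
    + b"<row group data>"
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + MAGIC
)

start, length = footer_span(fake_file)
print(fake_file[start:start + length])
```

On a real file you would read only the last 8 bytes, then seek back `length` bytes. That is why a reader can plan its skips before touching any row group data.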
Row groups: the unit nobody sizes on purpose
The row group is the granularity of skipping. When your query has a WHERE
clause, the engine checks each row group's column statistics and asks: could
any row in here match? If the answer is no, the whole slab is skipped without
being read. This is predicate pushdown, and it's most of why Parquet feels
fast.
Which means row group size is a genuine tradeoff, not a detail:
- Too large (say, one row group for the whole file) and skipping becomes all-or-nothing. The statistics cover so many rows that the min/max range is wide, almost everything "could match," and you read the file end to end.
- Too small and you drown in overhead - a footer entry per group, a decompression setup per page, and statistics so granular the metadata starts to rival the data.
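The tradeoff can be sketched with a toy model in plain Python. It assumes a sorted column, treats each slice of `rows_per_group` values as one row group, and skips any group whose min/max range misses the predicate. The helper `scan_cost` is hypothetical, not any engine's real planner.

```python
def scan_cost(values, rows_per_group, lo, hi):
    """Return (row_groups, rows_scanned) for the predicate lo <= value <= hi.

    A group is skipped when its [min, max] range cannot contain a match.
    """
    groups = 0
    scanned = 0
    for i in range(0, len(values), rows_per_group):
        group = values[i:i + rows_per_group]
        groups += 1
        if max(group) >= lo and min(group) <= hi:  # group could match
            scanned += len(group)
    return groups, scanned


values = list(range(1_000_000))  # a sorted column, e.g. an increasing id

for rows_per_group in (1_000_000, 100_000, 10_000):
    groups, scanned = scan_cost(values, rows_per_group, 250_000, 250_999)
    print(f"{rows_per_group:>9} rows/group: {groups:>3} groups, {scanned:>9} rows scanned")
```

The query wants 1,000 rows. With one giant group the reader scans all 1,000,000. With 100 groups it scans 10,000, but the footer now carries 100 sets of metadata per column instead of one.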
The usual advice is row groups in the 128 MB–1 GB range, but the number that actually matters is rows-per-group relative to your query patterns. The trap is that you rarely chose your row group size. Spark picked it. Pandas picked it. A COPY INTO picked it. The Parquet Viewer shows you the row group count and the rows in each, which is usually the first time anyone looks.
The classic anti-pattern is the tiny-file, tiny-row-group combo from streaming writers: thousands of 2 MB Parquet files, each one row group, each with its own footer to open. Your engine spends more time on metadata round-trips than on data. If you see hundreds of files where you expected a handful, that's the problem before you even open one.
Encodings: where the columnar magic actually lives
Here's the thing most "Parquet vs CSV" explanations skip. Parquet isn't fast just because it's columnar - it's fast because storing a column together lets it encode that column cleverly, in ways that are impossible when values from different columns are interleaved row by row.
The two that carry most of the weight:
- Dictionary encoding. When a column has low cardinality - a country, a status, an event_type - Parquet builds a dictionary of the distinct values and stores each cell as a small integer index. A column of "checkout" / "view" / "add_cart" repeated a million times collapses to a three-entry dictionary plus a million tiny indices. This is enormous, and it happens automatically - until the dictionary grows too big, at which point Parquet silently falls back to plain encoding for the rest of the chunk.
- Run-length & bit-packing (RLE). Sorted or repetitive columns compress to almost nothing: "the value 1 appears 50,000 times" is a handful of bytes.
That silent dictionary fallback is the one to watch. A high-cardinality column
you thought was dictionary-encoded - a user_id, a request_id, a free-text
field - blows the dictionary, falls back to plain, and your file is suddenly far
larger and slower than you assumed. You can't feel this from the outside. You
have to read the encoding the writer actually chose per column chunk, which is
exactly what the inspector surfaces.
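A toy model shows both encodings and the fallback. It is a sketch under simplifying assumptions: the dictionary limit here is an entry count (`max_dict_entries`, a made-up parameter), whereas real writers limit the dictionary by size in bytes, and the function names are illustrative.

```python
from itertools import groupby


def encode_chunk(values, max_dict_entries):
    """Dictionary-encode one column chunk until the dictionary is full.

    Returns (dictionary, indices, plain_tail). `plain_tail` holds the values
    stored without dictionary encoding after the fallback.
    """
    dictionary = {}  # value -> index, in first-seen order
    indices = []
    for pos, value in enumerate(values):
        if value not in dictionary:
            if len(dictionary) >= max_dict_entries:
                # Dictionary is full: the rest of the chunk is stored plain.
                return list(dictionary), indices, values[pos:]
            dictionary[value] = len(dictionary)
        indices.append(dictionary[value])
    return list(dictionary), indices, []


def run_lengths(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Low cardinality: three dictionary entries, no fallback.
events = ["checkout", "view", "add_cart"] * 4
print(encode_chunk(events, max_dict_entries=1000))

# High cardinality: the dictionary fills up and the tail is stored plain.
user_ids = [f"user-{n}" for n in range(10)]
dictionary, indices, plain_tail = encode_chunk(user_ids, max_dict_entries=4)
print(len(dictionary), len(indices), len(plain_tail))

# A sorted, repetitive column collapses to a couple of runs.
print(run_lengths([1] * 50_000 + [2] * 3))
```

In the second case only the first four values are dictionary-encoded and the remaining six are stored in full, which is the silent size increase described above.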
Compression: the codec is a tradeoff, not a default
After encoding, each page gets compressed with a codec - and "default" is doing a lot of unexamined work here. The common ones:
| Codec | Ratio | Decompress speed | Good for |
|---|---|---|---|
| SNAPPY | Modest | Very fast | Hot data, interactive queries - the safe default |
| ZSTD | Strong | Fast | Cold/archival data, or when storage cost dominates |
| GZIP | Strong | Slow | Legacy interop; rarely the right pick today |
| none | - | - | Already-compressed payloads, or accidental |
The mistake isn't picking the "wrong" codec - it's not knowing which one your files use, and therefore not knowing whether you're paying for ratio you don't need or speed you're not getting. A warehouse serving interactive dashboards off GZIP is leaving latency on the table; a cold archive on SNAPPY is leaving storage money on the table. Both are invisible until you look at the codec recorded in the file.
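The "none" row is the easiest one to demonstrate. The sketch below uses `zlib` from the Python standard library, which implements DEFLATE, the algorithm behind GZIP. It stands in here only because SNAPPY and ZSTD are not in the standard library. Random bytes stand in for an already-compressed payload, and the `ratio` helper is illustrative.

```python
import os
import zlib


def ratio(payload: bytes, level: int = 6) -> float:
    """Uncompressed size divided by DEFLATE-compressed size."""
    return len(payload) / len(zlib.compress(payload, level))


# A repetitive text column: the kind of data codecs are good at.
repetitive = b"checkout,view,add_cart," * 50_000

# High-entropy bytes, standing in for images, encrypted blobs,
# or anything that was compressed before it reached the writer.
high_entropy = os.urandom(1_000_000)

print(f"repetitive:   {ratio(repetitive):.1f}x")
print(f"high entropy: {ratio(high_entropy):.3f}x")
```

The second ratio lands at or just below 1.0: the codec spends CPU on both write and read and saves nothing. Exact figures depend on the data, so treat the output as a shape, not a benchmark.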
Statistics: the feature that does nothing if it's missing
Min/max statistics are what make predicate pushdown work. The engine reads
"this row group's ts ranges from 09:00 to 09:15," sees your query wants
ts > 14:00, and skips the whole slab. No statistics, no skip - the engine
falls back to reading everything and filtering after the fact.
Two ways this quietly breaks:
- The writer didn't emit statistics. Some writers, or some configurations, skip them. Your beautifully partitioned file gets scanned in full because the engine has nothing to prune on.
- The data isn't sorted on the column you filter by. Statistics still exist, but if ts is scattered randomly across every row group, every group's min/max range spans the whole day. Every group "could match." Nothing gets skipped. Sorting on your common filter column before writing is one of the highest-leverage things you can do to a Parquet dataset, and it costs nothing at read time.
Sort order isn't stored as "this file is sorted" - you infer it from the statistics. If each row group's min/max ranges are tight and non-overlapping, the data is well-clustered and pushdown will fly. If they all span the full range, they're effectively useless for skipping no matter how many you have. The Parquet Viewer lays the per-row-group statistics out so you can see overlap at a glance.
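The effect of sort order on statistics is easy to reproduce in plain Python. This sketch writes the same 10,000 values into ten row groups twice, once sorted and once shuffled, and then asks how many groups a `ts > 9000` filter has to read. The integer `ts` values and the helper names are assumptions for illustration.

```python
import random


def group_stats(values, rows_per_group):
    """Per-row-group (min, max), as a writer would record them in the footer."""
    stats = []
    for i in range(0, len(values), rows_per_group):
        group = values[i:i + rows_per_group]
        stats.append((min(group), max(group)))
    return stats


def groups_to_read(stats, threshold):
    """Row groups that cannot be skipped for the predicate value > threshold."""
    return sum(1 for _lo, hi in stats if hi > threshold)


def overlapping_neighbours(stats):
    """Adjacent groups (ordered by min) whose [min, max] ranges overlap."""
    ordered = sorted(stats)
    return sum(1 for a, b in zip(ordered, ordered[1:]) if b[0] <= a[1])


ts_sorted = list(range(10_000))
ts_shuffled = ts_sorted[:]
random.Random(0).shuffle(ts_shuffled)  # fixed seed so the run is repeatable

for name, column in (("sorted", ts_sorted), ("shuffled", ts_shuffled)):
    stats = group_stats(column, rows_per_group=1_000)
    print(name, groups_to_read(stats, 9_000), overlapping_neighbours(stats))
```

Sorted, the filter touches 1 group of 10 and no ranges overlap. Shuffled, it touches all 10 and every neighbouring pair overlaps, even though both files hold identical rows and identical amounts of statistics.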
Look at your own file
You can read every paragraph above and still not know what's true of your data - because all of it is per-file, per-column, and chosen by whatever wrote it. So the honest end to this post is: go look.
The Parquet Viewer opens a file entirely in your browser - the bytes never leave your machine, there's no upload and no account - and shows you the schema, the row groups and their sizing, the encoding and codec per column, and the statistics. Then it runs a findings pass: it flags the tiny-row-group sprawl, the high-cardinality column that lost its dictionary, the missing statistics, the codec mismatch - and links each finding to the lesson that explains the internals behind it.
That's the loop worth building: stop treating Parquet as a black box that's "just fast," open the layout, and fix the three or four decisions that are actually costing you. If you want the full ground-up treatment - encodings, predicate pushdown, the relationship to Arrow and the lakehouse formats - it lives in the Storage & File Formats track. But the fastest way to care about any of it is to point the inspector at a file you already ship and find out what's really in there.