Parquet — Part 2: Indexing, Encryption, and Engines

Indexing, predicate pushdown, encryption, the Variant type, and engine integrations.

Part 2 of the Parquet deep-dive. Picks up where Part 1 ends: row-group and page statistics, the page index, bloom filters, end-to-end predicate pushdown, modular encryption (column- and footer-level), the Variant type and shredding, and how Parquet plugs into Spark, Iceberg, Delta Lake, DuckDB, and Arrow. Includes Docker labs.

Advanced · 13 chapters · 3h 22m · in Storage & File Formats
Explore this course on a real file in the Parquet Viewer. Drop any .parquet file, or load the built-in sample, to see the schema, row groups, encodings, compression, and statistics these lessons describe, entirely in your browser.

Course content

  1. Statistics: Min, Max, Null Count, and Distinct Count (Free)
  2. Page Index: ColumnIndex and OffsetIndex
  3. Bloom Filters: Probabilistic Predicate Pushdown
  4. Predicate Pushdown End-to-End: How Engines Skip Data
  5. Lab: Measure Predicate Pushdown Gains with DuckDB
  6. Encryption: AES-GCM, Column-Level and Footer Keys
  7. Variant Type: Semi-Structured Data in Typed Columns
  8. Variant Shredding: Extracting Fields into Real Columns
  9. Lab: Encrypted Parquet with Python
  10. Parquet in Apache Spark: Reader and Writer Internals
  11. Parquet in Iceberg and Delta Lake
  12. Parquet in DuckDB and Apache Arrow
  13. Format Evolution, Versioning, and Production Best Practices

Prerequisites
