Module: Advanced | Duration: ~25 min | Lesson: 1 of 8
TheWorldShop's orders table has grown to 8 billion rows across 400,000 Parquet files. A query planner needs to answer: "Which of these 400,000 files might contain orders from the US placed in January 2024?"
A naive approach: list all 400,000 files, read each file's footer to check its statistics. That's 400,000 network round trips. Minutes of planning overhead before reading a single byte of data.
Iceberg answers this question in under a second. How? The answer is in the three-level metadata hierarchy. This lesson is where you understand it cold, down to the file format and byte-level details.
2. Concept Explanation
The Full Hierarchy
Each level in the hierarchy serves a different purpose in the scan planning pipeline.
Level 1: Table Metadata (metadata.json)
The entry point. Written in JSON for human readability. Contains everything needed to understand the current and historical state of the table.
Key fields (from format/spec.md in the codebase):
When a query engine starts planning, it reads this single file (~50 KB) and knows: current schema, partition spec, and where to find the manifest list.
Level 2: Manifest List (Avro)
One manifest list per snapshot. This is an Avro file: binary, compact, schema-encoded. Each row represents one manifest file. The manifest list's value comes from its partition summary columns: for each partition field, it stores the lower_bound and upper_bound of values across all files in that manifest.
Key manifest list entry fields (from ManifestFile.java):
This is where partition pruning happens. When a query has WHERE order_date = '2024-01-15', Iceberg:
- Computes
days('2024-01-15')=19737 - Scans the manifest list
- For each manifest entry, checks:
lowerBound ≤ 19737 ≤ upperBound - Skips any manifest where the bound check fails
If a manifest covers only data from 2023-12-01 to 2023-12-31, it's skipped entirely, without reading the manifest file.
Level 3: Manifest Files (Avro)
Each manifest file contains one row per data file or delete file. Critically, each row includes per-column statistics (min, max, null count) for every column in the data file.
Key fields per data file entry (DataFile.java):
With these per-file statistics, the query planner can skip files where a filter predicate is provably false:
WHERE order_total > 1000: skip files whereupperBounds[3] ≤ 1000WHERE country = 'US': skip files wherelowerBounds[6] > 'US' OR upperBounds[6] < 'US'
This is data skipping, and it's based entirely on the metadata already in manifests. No file reads needed.
Sequence Numbers
Every file written to Iceberg gets a monotonically increasing sequence number. This enables incremental scans. You can ask "give me all files added after sequence number N":
This is the foundation of Iceberg's streaming read support.
How Scan Planning Actually Works
Putting it all together, here's the full scan planning pipeline for a query with predicate WHERE order_date = '2024-01-15' AND country = 'US':
Compare to Hive: listing 400,000 files would take minutes.
3. Worked Example
Let's inspect actual metadata files from the Iceberg codebase test fixtures. Here's how you can dump and read them in Spark:
4. Your Turn
Exercise: Given an Iceberg table with these manifest entries:
| Manifest | Date LowerBound | Date UpperBound | Country LowerBound | Country UpperBound | Files |
|---|---|---|---|---|---|
| M1 | 2024-01-01 | 2024-01-31 | 'CA' | 'US' | 500 |
| M2 | 2024-01-01 | 2024-01-31 | 'DE' | 'UK' | 400 |
| M3 | 2024-02-01 | 2024-02-28 | 'CA' | 'US' | 600 |
| M4 | 2024-02-01 | 2024-02-28 | 'AU' | 'JP' | 350 |
Query: WHERE order_date BETWEEN '2024-01-10' AND '2024-01-20' AND country = 'US'
- Which manifests survive the partition (date) pruning step?
- Of those, which survive the file statistics (country) pruning step?
- How many data files does Spark actually need to open?
5. Real-World Application
Netflix's Metacat service (their metadata management platform) is built on top of Iceberg's metadata model. They've described cases where query planning time for their largest tables dropped from 12 minutes (Hive) to 8 seconds (Iceberg) just from the metadata hierarchy redesign.
The Puffin file format (which you'll see in Lesson 5 on performance) extends this even further by allowing pre-computed statistics (like theta sketches for approximate distinct counts) to be stored alongside manifests, enabling query planners to skip not just files but also defer expensive cardinality estimation.
In your career: When a stakeholder says "our Iceberg queries are slow," the first thing to check is scan planning efficiency. Are manifests being pruned? Are per-file statistics being used? The metadata layer is the first place to look, not the compute layer.
Aha: Iceberg's manifest list isn't an index "for" the data. It's an index over per-partition min/max bounds. By the time the query engine touches a manifest file, the partition decision is already made. That's why scan planning stays sub-second even when the file count grows past a million.
6. Recap + Bridge
What we learned: The Iceberg metadata hierarchy (metadata.json → manifest list → manifests → data files) enables sub-second scan planning for billion-row tables. Manifest list entries contain partition bounds for cheap partition pruning. Manifest entries contain per-file column statistics for data skipping. Sequence numbers enable efficient incremental reads for streaming.
Coming up next: Lesson 2 covers the most complex aspect of Iceberg for production engineers: how row-level updates and deletes work. Copy-on-Write (COW) and Merge-on-Read (MOR) are two fundamentally different strategies, and choosing the wrong one can tank your table's performance.