Module: Advanced | Duration: ~30 min | Lesson: 1 of 9

TheWorldShop's orders table has grown to 8 billion rows across 400,000 Parquet files. A query planner needs to answer: "Which of these 400,000 files might contain orders from the US placed in January 2024?"

A naive approach: list all 400,000 files, read each file's footer to check its statistics. That's 400,000 network round trips. Minutes of planning overhead before reading a single byte of data.

Iceberg answers this question in a couple of seconds. How? The answer is in how the table is structured: a tree of immutable files that prunes work at every layer. This lesson is where you learn that structure cold, top to bottom, down to the individual fields the planner reads.

2. Concept Explanation

The Complete Structure: Five Layers

An Iceberg table isn't a directory of data files. It's a tree of immutable files behind a single pointer. Nothing is ever edited in place. Every commit writes new files and moves the pointer. The whole table is five layers, top to bottom:

Layer 0  Catalog        one mutable pointer: prod.orders → .../00042.metadata.json
   │                    (REST / Glue / Hive / JDBC; lives outside the table's files)
   ▼
Layer 1  metadata.json  schemas, partition specs, sort orders, snapshots, refs  (~10-50 KB)
   │
   ▼
Layer 2  manifest list  one per snapshot; manifests + partition summaries  (.avro)
   │
   ▼
Layer 3  manifests      index of data + delete files, with per-column stats  (.avro)
   │
   ▼
Layer 4  data + deletes data files, delete files, deletion-vector Puffin blobs

The catalog (Layer 0) is the only mutable thing in the whole system. Everything beneath it is written once under a unique path and never changes. Layers 1 to 3 are the focus of this lesson because that's where scan planning happens. We cover the two bookends too: Layer 0 here and in the commit protocol below, Layer 4 right after the manifests.

On disk, that same tree is just files in two folders:

s3://theworldshop-warehouse/orders/              <- table location
  metadata/
    00001-....metadata.json                      <- Layer 1: one per commit
    00042-....metadata.json                      <- the one the catalog points at now
    snap-8392648-....avro                         <- Layer 2: a manifest list per snapshot
    9b21-....avro                                 <- Layer 3: manifests
    stats-8392648.puffin                          <- optional: table-level stats
  data/
    order_date_day=2024-01-15/00000-0-....parquet <- Layer 4: data
    order_date_day=2024-01-15/00001-0-....parquet
    dv-8392648.puffin                             <- Layer 4: deletion vectors

Each layer serves a different purpose in the scan planning pipeline, and each one prunes work before the next is even opened.

Layer 1: Table Metadata (`metadata.json`)

The entry point. Written in JSON for human readability. Contains everything needed to understand the current and historical state of the table.

Key fields (from format/spec.md in the codebase):

{
  "format-version": 2,
  "table-uuid": "9c12d441-03fe-4693-9a96-a0705ddf69c3",
  "location": "s3://theworldshop-warehouse/orders",
  "last-sequence-number": 1234,
  "last-updated-ms": 1706227200000,
  "last-column-id": 12,
  
  "schemas": [{
    "schema-id": 0,
    "fields": [
      {"id": 1, "name": "order_id",    "type": "string", "required": true},
      {"id": 2, "name": "order_date",  "type": "date",   "required": true},
      {"id": 3, "name": "order_total", "type": "double",  "required": true}
    ]
  }],
  "current-schema-id": 0,
  
  "partition-specs": [{
    "spec-id": 0,
    "fields": [
      {"source-id": 2, "field-id": 1000, "name": "order_date_day", "transform": "day"}
    ]
  }],
  
  "current-snapshot-id": 3051729675574597004,
  "snapshots": [{
    "snapshot-id": 3051729675574597004,
    "parent-snapshot-id": 8765432109876543210,
    "timestamp-ms": 1706227200000,
    "manifest-list": "s3://theworldshop-warehouse/orders/metadata/snap-3051729.avro",
    "summary": {
      "operation": "append",
      "added-data-files": "4",
      "added-records": "1200000",
      "total-data-files": "400000",
      "total-records": "8000000000"
    }
  }],
  
  "snapshot-log": [...],
  "metadata-log": [...]
}

When a query engine starts planning, it reads this single file (~50 KB) and knows: current schema, partition spec, and where to find the manifest list.
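
As a rough illustration of that first hop, the sketch below resolves the current snapshot's manifest list from a parsed metadata.json. It reuses the IDs and path from the example above; the parent snapshot's manifest-list path and the `current_manifest_list` helper are hypothetical, added only so the lookup has more than one snapshot to choose from.

```python
import json

# Trimmed-down metadata.json. The second snapshot matches the example above;
# the first (parent) snapshot's manifest-list path is a made-up placeholder.
METADATA_JSON = """
{
  "format-version": 2,
  "current-snapshot-id": 3051729675574597004,
  "snapshots": [
    {"snapshot-id": 8765432109876543210,
     "manifest-list": "s3://theworldshop-warehouse/orders/metadata/snap-8765432.avro"},
    {"snapshot-id": 3051729675574597004,
     "parent-snapshot-id": 8765432109876543210,
     "manifest-list": "s3://theworldshop-warehouse/orders/metadata/snap-3051729.avro"}
  ]
}
"""


def current_manifest_list(meta: dict) -> str:
    """Return the manifest-list path of the snapshot the table currently points at."""
    current_id = meta["current-snapshot-id"]
    for snapshot in meta["snapshots"]:
        if snapshot["snapshot-id"] == current_id:
            return snapshot["manifest-list"]
    raise LookupError(f"snapshot {current_id} not found in metadata")


meta = json.loads(METADATA_JSON)
print(current_manifest_list(meta))
```

One small JSON read is enough to know which single Avro file to open next, which is the point of this layer.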

Layer 2: Manifest List (Avro)

One manifest list per snapshot. This is an Avro file: binary, compact, schema-encoded. Each row represents one manifest file. The manifest list's value comes from its partition summary columns: for each partition field, it stores the lower_bound and upper_bound of values across all files in that manifest.

Key manifest list entry fields (from ManifestFile.java):

// api/src/main/java/org/apache/iceberg/ManifestFile.java (simplified excerpt; the real interface has more methods)
interface ManifestFile {
    String path();              // s3://...manifest-1.avro
    long length();              // bytes
    int partitionSpecId();      // which partition spec applies
    long snapshotId();          // which snapshot added this manifest
    int addedFilesCount();
    int existingFilesCount();
    int deletedFilesCount();
    long addedRowsCount();
    
    List<PartitionFieldSummary> partitions(); // ← THE KEY FIELD
}

interface PartitionFieldSummary {
    boolean containsNull();
    ByteBuffer lowerBound();   // min partition value (encoded)
    ByteBuffer upperBound();   // max partition value (encoded)
}

This is where partition pruning happens. When a query has WHERE order_date = '2024-01-15', Iceberg:

1. Computes days('2024-01-15') = 19737
2. Scans the manifest list
3. For each manifest entry, checks: lowerBound ≤ 19737 ≤ upperBound
4. Skips any manifest where the bound check fails

If a manifest covers only data from 2023-12-01 to 2023-12-31, it's skipped entirely, without reading the manifest file.
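
The bound check above can be sketched in a few lines. This is a minimal illustration, not Iceberg's implementation: the real bounds are encoded ByteBuffers, whereas here they are assumed to be already decoded into day numbers, and `ManifestSummary`, `prune_manifests`, and the manifest file names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

EPOCH = date(1970, 1, 1)


def days(d: date) -> int:
    """Day transform: whole days between 1970-01-01 and d."""
    return (d - EPOCH).days


@dataclass
class ManifestSummary:
    path: str
    lower: int  # smallest order_date_day value in the manifest
    upper: int  # largest order_date_day value in the manifest


def prune_manifests(manifests, target_day):
    """Keep only manifests whose [lower, upper] range can contain target_day."""
    return [m for m in manifests if m.lower <= target_day <= m.upper]


manifests = [
    ManifestSummary("m-2023-12.avro", days(date(2023, 12, 1)), days(date(2023, 12, 31))),
    ManifestSummary("m-2024-01.avro", days(date(2024, 1, 1)), days(date(2024, 1, 31))),
]

target = days(date(2024, 1, 15))
kept = prune_manifests(manifests, target)
print(target)                    # 19737
print([m.path for m in kept])    # ['m-2024-01.avro']
```

The December manifest is dropped using only two integers from the manifest list; its Avro file is never opened.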

Layer 3: Manifest Files (Avro)

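Each manifest file contains one row per data file or delete file. Critically, each row includes per-column statistics (min, max, null count) for the columns the writer collected metrics on. That is normally every column, although very wide tables and long string values can be capped or truncated by table properties.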

Key fields per data file entry (DataFile.java):

// api/src/main/java/org/apache/iceberg/DataFile.java (simplified excerpt; most of these methods come from ContentFile)
interface DataFile extends ContentFile<DataFile> {
    String path();               // s3://...part-00000.parquet
    FileFormat format();         // PARQUET, ORC, AVRO
    StructLike partition();      // partition values
    long recordCount();          // row count in this file
    long fileSizeInBytes();
    
    Map<Integer, Long>    columnSizes();   // column ID → bytes
    Map<Integer, Long>    valueCounts();   // column ID → value count
    Map<Integer, Long>    nullValueCounts();
    Map<Integer, ByteBuffer> lowerBounds(); // column ID → min value
    Map<Integer, ByteBuffer> upperBounds(); // column ID → max value
}

With these per-file statistics, the query planner can skip files where a filter predicate is provably false:

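WHERE order_total > 1000: skip files where upperBounds[3] ≤ 1000 (3 is order_total's field ID)
WHERE country = 'US': skip files where lowerBounds[6] > 'US' OR upperBounds[6] < 'US' (assuming country is field ID 6; the schema excerpt above shows only the first three fields)

This is data skipping, and it's based entirely on the metadata already in manifests. No data file is opened to make the decision.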
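
Here is a small sketch of those two skip rules applied together. It assumes the bounds are already decoded into Python values, and that country is field ID 6 as in the rule above; `FileStats`, the helper functions, and the three file entries are invented for illustration.

```python
from dataclasses import dataclass

ORDER_TOTAL = 3  # field ID from the schema example
COUNTRY = 6      # assumed field ID for the country column


@dataclass
class FileStats:
    path: str
    lower: dict  # field ID -> minimum value in the file
    upper: dict  # field ID -> maximum value in the file


def may_match_total_gt(f: FileStats, threshold: float) -> bool:
    # order_total > threshold is possible only if the file's max exceeds it
    return f.upper[ORDER_TOTAL] > threshold


def may_match_country_eq(f: FileStats, value: str) -> bool:
    # country = value is possible only if value lies inside [min, max]
    return f.lower[COUNTRY] <= value <= f.upper[COUNTRY]


files = [
    FileStats("a.parquet", {3: 5.0, 6: "CA"}, {3: 900.0, 6: "US"}),    # max total too low
    FileStats("b.parquet", {3: 20.0, 6: "DE"}, {3: 4500.0, 6: "UK"}),  # 'US' sorts after 'UK'
    FileStats("c.parquet", {3: 10.0, 6: "MX"}, {3: 2500.0, 6: "US"}),  # both checks pass
]

survivors = [
    f.path
    for f in files
    if may_match_total_gt(f, 1000) and may_match_country_eq(f, "US")
]
print(survivors)  # ['c.parquet']
```

Note that a surviving file is only a candidate: the bounds prove a file cannot match, never that it does.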

Layer 4: Data, Delete, and DV Files

The bottom layer is the files a manifest points at, and there's more down here than just data:

Data files (Parquet, ORC, or Avro). The rows themselves. A manifest entry marks these with content = 0.
Position delete files (content = 1). A list of (file_path, position) tuples saying "row N of that data file is gone." Introduced in v2, deprecated from v3 in favor of deletion vectors.
Equality delete files (content = 2). A predicate on column values, like customer_id = 9999, with no file path or position attached. Ideal for streaming upserts.
Deletion vectors (v3+). A compressed bitmap of deleted positions for one data file, stored as a blob inside a Puffin file. One bitmap per data file, replacing piles of small position-delete files.

A single manifest holds either data files or delete files, never both. That split lets scan planning load all the delete manifests first and know what's deleted before it emits a single data file. How those deletes get written and merged at read time is Lesson 3. Here the point is structural: the bottom layer isn't just "your Parquet." It's data plus the side-files that record what's no longer true.

Sequence Numbers

Sequence numbers arrived in format v2 (Lesson 2 tells that story), and they're the quiet engine behind both streaming reads and correct deletes. Every commit gets a monotonically increasing number, threaded through every layer: the table metadata tracks the highest one assigned, each snapshot records its own, the manifest list records each manifest's number, and each manifest entry carries its file's number. Three things ride on this one counter.

Incremental scans. You can ask "give me everything added after sequence number N", which is how a streaming reader tails a table:

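// Used for streaming incremental reads (Flink uses this).
// The scan is bounded by snapshot IDs; sequence numbers order the files underneath.
table.newIncrementalAppendScan()
     .fromSnapshotExclusive(lastProcessedSnapshot) // start after the last snapshot already processed
     .toSnapshot(currentSnapshot)                  // end at this snapshot, inclusive
     .planFiles(); // only files appended between the two snapshots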

Delete ordering. This is the subtle one, and it's why merge-on-read (Lesson 3) stays correct. An equality delete ("any row where customer_id = 9999") applies to a data file only when the delete's sequence number is greater than the file's. So if you delete 9999 and then insert a new 9999, the insert has a higher sequence number and survives. The delete can't reach forward in time.
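
The delete-then-reinsert case can be traced with a tiny sketch. The sequence numbers and file names below are made up for the example, and `equality_delete_applies` is a hypothetical helper that encodes only the rule stated above.

```python
def equality_delete_applies(delete_seq: int, data_file_seq: int) -> bool:
    """An equality delete masks rows only in data files committed strictly before it."""
    return data_file_seq < delete_seq


# Timeline (illustrative sequence numbers):
#   seq 1: data file containing customer_id = 9999 is appended
#   seq 2: equality delete "customer_id = 9999" is committed
#   seq 3: a new data file containing customer_id = 9999 is appended
data_files = {
    "old-9999.parquet": 1,
    "new-9999.parquet": 3,
}
DELETE_SEQ = 2

masked = {
    path: equality_delete_applies(DELETE_SEQ, seq)
    for path, seq in data_files.items()
}
print(masked)  # {'old-9999.parquet': True, 'new-9999.parquet': False}
```

The re-inserted row survives because its file's sequence number is higher than the delete's, with no timestamps or locks involved.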

Inheritance. A manifest entry can leave its snapshot id and sequence number blank in the file, and the reader fills them in from the manifest list. That sounds like a detail, but it's what keeps commits cheap. When a commit loses the pointer-swap race and retries with a new sequence number, only the small manifest list is rewritten. The manifests and data files it points at are untouched.
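
A simplified sketch of inheritance, assuming a manifest entry is a small record whose blank fields are represented as None. `ManifestEntry`, `inherit`, the file path, and the IDs are hypothetical; the real rules have more cases than this (for example, they depend on the entry's status).

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ManifestEntry:
    file_path: str
    snapshot_id: Optional[int] = None      # None means "inherit from the manifest list"
    sequence_number: Optional[int] = None  # None means "inherit from the manifest list"


def inherit(entry: ManifestEntry, list_snapshot_id: int, list_sequence_number: int) -> ManifestEntry:
    """Fill blank fields on read using the values recorded in the manifest list."""
    return replace(
        entry,
        snapshot_id=entry.snapshot_id if entry.snapshot_id is not None else list_snapshot_id,
        sequence_number=(
            entry.sequence_number if entry.sequence_number is not None else list_sequence_number
        ),
    )


# The manifest is written once, with both fields left blank.
entry = ManifestEntry("data/00000-0-new.parquet")

# First attempt plans to commit as sequence number 1235 but loses the race.
first_attempt = inherit(entry, list_snapshot_id=111, list_sequence_number=1235)
# The retry writes a new manifest list with 1236; the manifest itself is reused as-is.
after_retry = inherit(entry, list_snapshot_id=111, list_sequence_number=1236)

print(first_attempt.sequence_number, after_retry.sequence_number)  # 1235 1236
```

The same stored entry yields a different effective sequence number purely because the manifest list changed, which is why the retry never has to rewrite the manifest.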

The Commit Protocol: One Atomic Swap

Everything below the catalog is immutable and written under a unique path, so readers and writers build files with no locking. The only contended operation in the whole system is the catalog pointer. A commit from version V to V+1 is two steps: write the new manifests, manifest list, and metadata.json (all new files, with unique names), then compare-and-swap the catalog pointer from the old metadata path to the new one. If the swap succeeds, the commit is visible. If it fails, someone else committed first, so the writer rebases on the new head and retries.

The rebase isn't always safe to do blindly, so each commit carries an intent that decides how it replays:

An append is always replayable. New files never conflict with someone else's new files.
A replace (compaction) must verify the files it's replacing are still in the table.
A delete by file must verify its target files still exist.
A delete by expression is always replayable.
A schema or spec change must verify no other schema change slipped in first.

This is why two engineers appending to TheWorldShop's orders table at the same moment both succeed, but two compaction jobs fighting over the same files don't silently corrupt each other. One wins the swap. The other rebases, notices its files are already gone, and bails out cleanly.
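
The swap-and-retry loop can be sketched with an in-memory stand-in for the catalog. Everything here is hypothetical (`InMemoryCatalog`, `commit`, the file names), and the writer is an append, so the rebase needs no extra validation; a replace or delete-by-file would re-check its target files before each retry.

```python
import threading


class InMemoryCatalog:
    """Hypothetical catalog: a single pointer with compare-and-swap semantics."""

    def __init__(self, initial: str):
        self._pointer = initial
        self._lock = threading.Lock()

    def current(self) -> str:
        with self._lock:
            return self._pointer

    def compare_and_swap(self, expected: str, new: str) -> bool:
        with self._lock:
            if self._pointer != expected:
                return False  # someone else committed first
            self._pointer = new
            return True


def commit(catalog: InMemoryCatalog, write_metadata, max_retries: int = 3) -> str:
    """Write new metadata on top of the current head, then try to swap the pointer."""
    for _ in range(max_retries + 1):
        base = catalog.current()
        new_path = write_metadata(base)  # step 1: write new immutable files
        if catalog.compare_and_swap(base, new_path):  # step 2: atomic swap
            return new_path
    raise RuntimeError("commit failed after retries")


catalog = InMemoryCatalog("00042.metadata.json")
attempts = []


def append_writer(base: str) -> str:
    attempts.append(base)
    if len(attempts) == 1:
        # Simulate a competing writer winning the race during our first attempt.
        catalog.compare_and_swap(base, "00043-other.metadata.json")
    version = int(base.split("-")[0].split(".")[0]) + 1
    return f"{version:05d}-mine.metadata.json"


result = commit(catalog, append_writer)
print(attempts)  # ['00042.metadata.json', '00043-other.metadata.json']
print(result)    # 00044-mine.metadata.json
```

The first attempt builds on 00042 and loses; the retry rebases on the competitor's 00043 and lands as 00044, so both commits end up in the history.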

How Scan Planning Actually Works

Putting it all together, here's the full scan planning pipeline for a query with predicate WHERE order_date = '2024-01-15' AND country = 'US':

1. Read metadata.json (1 file, ~50ms)
   → get current snapshot ID → get manifest list location

2. Read manifest list (1 Avro file, ~100ms)
   → for each manifest entry:
     → check partition bounds for order_date_day vs 19737
     → SKIP 380 of 400 manifests (wrong date range)
     → pass 20 manifests to next step

3. Read 20 manifest files (~2s)
   → for each data file entry (~20,000 entries at ~1,000 files per manifest):
     → check per-file bounds: country upperBound ≥ 'US' AND lowerBound ≤ 'US'
     → SKIP ~85% of files in those manifests
     → pass ~3,000 data files to Spark executor planning

4. Spark reads ~3,000 Parquet files
   → within each file: Parquet column statistics + row group pruning further reduce I/O

Total planning overhead: ~2.5 seconds for an 8-billion-row, 400,000-file table

Compare to Hive: listing 400,000 files would take minutes.
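
To make the funnel concrete, the arithmetic below uses only the figures from the pipeline above. The per-stage timings are the walkthrough's illustrative estimates, not measurements.

```python
# Figures taken from the planning walkthrough above.
total_manifests, kept_manifests = 400, 20
total_files, kept_files = 400_000, 3_000
stage_seconds = {"metadata.json": 0.05, "manifest list": 0.10, "20 manifests": 2.0}

manifest_fraction = kept_manifests / total_manifests
file_fraction = kept_files / total_files
planning_seconds = sum(stage_seconds.values())
# One metadata.json, one manifest list, plus the manifests that survived pruning.
metadata_reads = 1 + 1 + kept_manifests

print(f"manifests opened: {manifest_fraction:.1%}")          # 5.0%
print(f"data files handed to executors: {file_fraction:.2%}")  # 0.75%
print(f"metadata files read during planning: {metadata_reads}")  # 22
print(f"planning time: ~{planning_seconds:.2f}s")             # ~2.15s
```

Planning touches 22 metadata files instead of 400,000 Parquet footers, and the summed stage times land close to the ~2.5 second total once some overhead is allowed for.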

The Whole Table on One Page

Every layer you've seen, stacked into one mental model:

catalog → metadata.json
  ├── table level: format-version, schemas[], partition-specs[], sort-orders[], refs{}, properties{}
  ├── refs{} → main / branches / tags → snapshot-id
  └── snapshots[]  (full history; expired ones pruned)
        └── one snapshot: snapshot-id, parent, sequence-number, summary{}
              └── manifest list (.avro): [ manifest entry × N ]
                    each: path, partition summary (lower/upper bounds), counts, content (data | deletes)
                      └── manifest (.avro): [ file entry × N ]
                            each: status, snapshot_id, sequence_number, per-column stats
                              └── data file  /  delete file  /  deletion-vector Puffin blob

Read it top to bottom and you have the entire table. The catalog holds one pointer. Everything beneath it is immutable. A new commit appends a new metadata.json plus new manifest-list, manifest, and data files, then swaps the pointer in one atomic step. The old tree stays reachable through snapshot history (that's your time travel) until expiration garbage-collects it.

3. Worked Example

Let's inspect the metadata behind the orders table. Iceberg exposes most of it as queryable metadata tables in Spark, and the raw metadata.json can be read straight from disk:

# In Spark, Iceberg exposes metadata as queryable tables

# 1. Inspect the manifest list for the current snapshot
spark.sql("""
    SELECT 
        path,
        length,
        partition_spec_id,
        added_snapshot_id,
        added_data_files_count,
        partition_summaries
    FROM theworldshop.orders.manifests
""").show(truncate=False)

# 2. Inspect individual data file statistics
spark.sql("""
    SELECT 
        file_path,
        record_count,
        file_size_in_bytes,
        column_sizes,
        value_counts,
        null_value_counts,
        lower_bounds,
        upper_bounds,
        partition
    FROM theworldshop.orders.files
    LIMIT 5
""").show(truncate=False)

# 3. See all snapshots and their manifest lists
spark.sql("""
    SELECT 
        snapshot_id,
        parent_id,
        operation,
        committed_at,
        manifest_list,
        summary
    FROM theworldshop.orders.snapshots
""").show(truncate=False)

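# 4. Look at the raw metadata JSON on disk (assumes a local filesystem warehouse)
import json
import os
import re

warehouse = "/tmp/theworldshop-warehouse"
metadata_path = f"{warehouse}/theworldshop/orders/metadata"

def version_of(name):
    # Handles both "00042-<uuid>.metadata.json" and Hadoop-style "v42.metadata.json"
    match = re.match(r"v?(\d+)", name)
    return int(match.group(1)) if match else -1

# Highest version on disk; the catalog pointer is the real source of truth
metadata_files = [f for f in os.listdir(metadata_path) if f.endswith(".metadata.json")]
latest_metadata = max(metadata_files, key=version_of)

with open(os.path.join(metadata_path, latest_metadata)) as fh:
    meta = json.load(fh)

# Use the schema the table currently points at, not just the first one ever written
current_schema = next(s for s in meta["schemas"] if s["schema-id"] == meta["current-schema-id"])

print(f"Format version: {meta['format-version']}")
print(f"Current snapshot: {meta['current-snapshot-id']}")
print(f"Schema fields: {[field['name'] for field in current_schema['fields']]}")
print(f"Number of snapshots: {len(meta['snapshots'])}")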

4. Your Turn

Exercise: Given an Iceberg table with these four manifests (the country bounds are simplified to the min and max across all data files in each manifest):

Manifest  Date LowerBound  Date UpperBound  Country LowerBound  Country UpperBound  Files
M1        2024-01-01       2024-01-31       'CA'                'US'                500
M2        2024-01-01       2024-01-31       'DE'                'UK'                400
M3        2024-02-01       2024-02-28       'CA'                'US'                600
M4        2024-02-01       2024-02-28       'AU'                'JP'                350

Query: WHERE order_date BETWEEN '2024-01-10' AND '2024-01-20' AND country = 'US'

1. Which manifests survive the partition (date) pruning step?
2. Of those, which survive the file statistics (country) pruning step?
3. How many data files does Spark actually need to open?

5. Real-World Application

Iceberg was created at Netflix, where it runs alongside Metacat, the company's metadata management service. Netflix engineers have described cases where query planning time for their largest tables dropped from roughly 12 minutes (Hive) to about 8 seconds (Iceberg), largely because of the metadata hierarchy redesign.

The Puffin file format (which you'll see in Lesson 6 on performance) extends this further. It lets pre-computed statistics, such as theta sketches for approximate distinct counts, be stored alongside the table's metadata, so a query planner can not only skip files but also avoid expensive cardinality estimation at query time.

In your career: When a stakeholder says "our Iceberg queries are slow," the first thing to check is scan planning efficiency. Are manifests being pruned? Are per-file statistics being used? The metadata layer is the first place to look, not the compute layer.

Aha: Iceberg's manifest list isn't an index "for" the data. It's an index over manifests, keyed by per-partition min/max bounds. By the time the query engine opens a manifest file, the partition decision is already made. That's why scan planning takes seconds rather than minutes, even when the file count grows past a million.

6. Recap + Bridge

What we learned: An Iceberg table is five layers: the catalog pointer (Layer 0), metadata.json (Layer 1), the manifest list (Layer 2), manifests (Layer 3), and the data, delete, and deletion-vector files (Layer 4). Layers 1 to 3 carry the partition bounds and per-file statistics that keep scan planning down to a few seconds on billion-row tables. Sequence numbers order deletes and power incremental reads. A commit writes new files and swaps the single catalog pointer, which is what makes the whole thing atomic.

Coming up next: You've seen the metadata layers as they stand today. But the format reached this shape in four steps, and the version number stamped in every metadata.json decides which of these features a table even has. Lesson 2 walks the evolution from v1 to v4: what each version added, why it was needed, and what it means for a table you have to read.