Lesson 1: Why Iceberg Changes the PII Governance Game

Course: Iceberg Data Governance for PII | Duration: ~20 min | Lesson: 1 of 7


A GDPR erasure request lands on Dev's desk: customer 40021188 wants to be forgotten. Dev knows exactly where her data is because the detection course built him a manifest: it's in events.orders, a Hive table backed by Parquet files in S3. Then he tries to actually delete her rows.

There's no DELETE. Hive tables on raw Parquet don't support it. To remove one customer's rows, he has to find every Parquet file containing them, read each file, filter her out, rewrite the whole file, and swap it in, while hoping no concurrent job is reading. For a table with thousands of files, that's a multi-hour job that rewrites terabytes to delete a few kilobytes, and if it fails halfway, the table is corrupt.

Dev's problem isn't finding the PII. It's that his storage layer was never built to delete anything. This course is about a table format that was: Apache Iceberg, where erasure, audit, and tagging are native operations instead of heroic batch jobs.


2. Concept Explanation

The PII operations you need (delete a subject's rows, prove you deleted them, tag which columns are sensitive, expire old data) are storage-layer operations. Whether they're easy or nearly impossible depends largely on the table format. Three generations:

Raw Parquet files (no table layer)

A directory of Parquet files with no metadata layer. Great for analytics scans, terrible for governance:

  • No row-level delete. To remove rows, rewrite entire files. No transactions.
  • No schema evolution safety. Add a column and old files don't know about it.
  • No history. Once you overwrite a file, the old state is gone. No audit trail.
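To make the cost of "rewrite entire files" concrete, here is a minimal sketch in plain Python. It is a toy model, not a Parquet reader: files are dictionaries of rows held in memory, and the function name rewrite_delete and the file names are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def rewrite_delete(
    files: Dict[str, List[Row]], user_id: str
) -> Tuple[Dict[str, List[Row]], Dict[str, int]]:
    """Delete a user's rows the raw-file way: rewrite every file that holds them."""
    new_files: Dict[str, List[Row]] = {}
    stats = {"files_scanned": 0, "files_rewritten": 0,
             "rows_rewritten": 0, "rows_deleted": 0}
    for name, rows in files.items():
        stats["files_scanned"] += 1  # no index: every file must be read
        kept = [r for r in rows if r["user_id"] != user_id]
        if len(kept) == len(rows):
            new_files[name] = rows  # file did not contain the user
            continue
        # Every surviving row in the file is written again.
        stats["files_rewritten"] += 1
        stats["rows_rewritten"] += len(kept)
        stats["rows_deleted"] += len(rows) - len(kept)
        new_files["rewritten-" + name] = kept
    return new_files, stats


# Hypothetical data: three files of 1,000 rows, three rows belong to the subject.
files = {
    "part-0.parquet": [{"user_id": "40021188"}]
    + [{"user_id": f"u{i}"} for i in range(999)],
    "part-1.parquet": [{"user_id": f"u{i}"} for i in range(1000)],
    "part-2.parquet": [{"user_id": "40021188"}, {"user_id": "40021188"}]
    + [{"user_id": f"u{i}"} for i in range(998)],
}
new_files, stats = rewrite_delete(files, "40021188")
print(stats)
```

In this toy run, removing 3 rows means scanning all 3 files and rewriting 1,997 unrelated rows. Nothing here is atomic either: a crash between writing the new files and removing the old ones leaves both versions on disk.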

Hive tables

Hive adds a metastore that tracks partitions, but the data is still Parquet files, and the metastore tracks directories, not rows. Most engines still can't do a real row-level DELETE on a plain (non-ACID) Hive table, and there's no snapshot history. Hive's own ACID tables are the exception, but they require ORC. Slightly better cataloging, same governance pain.

Apache Iceberg (a real table format)

Iceberg adds a metadata layer over your Parquet (or ORC/Avro) files: a tree of metadata files tracking every snapshot, every data file, and every column. That metadata layer is what makes governance native:

  • ACID transactions. DELETE FROM orders WHERE user_id = ... is a real, atomic, concurrent-safe operation.
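  • Row-level deletes. In merge-on-read mode (format v2), Iceberg can mark specific rows deleted without rewriting whole files (delete files), then compact later. In the default copy-on-write mode it rewrites only the affected files, still inside one transaction. Either way, erasure becomes a query, not a batch job.
  • Snapshots and time travel. Every write creates an immutable snapshot. You can query the table AS OF any snapshot that is still retained, which is an audit trail you get for free.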
  • Schema and partition evolution. Add columns, change partitioning, without rewriting data, and old snapshots stay queryable.
  • Catalog integration. Column-level properties and tags live in the catalog (REST, Glue, Nessie, Unity), so "this column is PII" is metadata the table carries.
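The way snapshots, delete files, and time travel fit together can be sketched with a small in-memory model. This is an illustration of the idea only, assuming a merge-on-read style delete; ToyTable, Snapshot, and delete_where are hypothetical names and are not part of any Iceberg API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: FrozenSet[str]                      # files visible in this snapshot
    deleted_positions: FrozenSet[Tuple[str, int]]   # (file, row position) marked deleted


class ToyTable:
    def __init__(self) -> None:
        self.storage: Dict[str, List[dict]] = {}    # data files are never modified
        self.snapshots: List[Snapshot] = [Snapshot(0, frozenset(), frozenset())]

    def _commit(self, data_files, deleted) -> int:
        # A commit only appends a new immutable snapshot to the log.
        snap = Snapshot(len(self.snapshots), frozenset(data_files), frozenset(deleted))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def append(self, file_name: str, rows: List[dict]) -> int:
        self.storage[file_name] = list(rows)
        cur = self.snapshots[-1]
        return self._commit(cur.data_files | {file_name}, cur.deleted_positions)

    def delete_where(self, predicate: Callable[[dict], bool]) -> int:
        # Record which row positions are deleted instead of rewriting files.
        cur = self.snapshots[-1]
        marks = {(f, i) for f in cur.data_files
                 for i, row in enumerate(self.storage[f]) if predicate(row)}
        return self._commit(cur.data_files, cur.deleted_positions | marks)

    def scan(self, snapshot_id: Optional[int] = None) -> List[dict]:
        # Reading any snapshot applies that snapshot's delete marks.
        snap = self.snapshots[-1] if snapshot_id is None else self.snapshots[snapshot_id]
        return [row for f in sorted(snap.data_files)
                for i, row in enumerate(self.storage[f])
                if (f, i) not in snap.deleted_positions]


table = ToyTable()
before = table.append("part-0.parquet", [
    {"user_id": "40021188"}, {"user_id": "u1"}, {"user_id": "40021188"},
    {"user_id": "u2"}, {"user_id": "40021188"},
])
after = table.delete_where(lambda r: r["user_id"] == "40021188")


def count_subject(rows: List[dict]) -> int:
    return sum(1 for r in rows if r["user_id"] == "40021188")


print(count_subject(table.scan(before)), count_subject(table.scan(after)))
```

The delete wrote no data file: the current snapshot hides the subject's three rows, while the earlier snapshot still returns them. That is what makes time travel possible, and it is also why the rows are still physically present in storage after the delete.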

Why this is the right foundation for PII

Every GDPR/CCPA obligation maps to an Iceberg primitive:

  • "Right to erasure" -> row-level DELETE + snapshot expiration (data actually leaves storage).
  • "Prove the deletion happened" -> time travel (compare before/after snapshots).
  • "Know where sensitive data is" -> column tags in the catalog.
  • "Storage limitation / retention" -> partition expiration + snapshot expiration.

Dev's Hive nightmare (rewrite everything to delete one customer) simply doesn't exist in Iceberg. The delete is a transaction. That's the game change.

The one catch to remember

A DELETE in Iceberg removes rows from the current snapshot, but the old snapshot (with the data) still exists for time travel. To make PII actually gone from storage, you also have to expire the old snapshots and clean up the files they referenced, which is the subject of the whole next course. For now: Iceberg makes the delete easy; making it permanent is a second, deliberate step.
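A small sketch of why the second step matters, assuming a copy-on-write DELETE that replaced one data file. The function files_to_remove and the file names are hypothetical; in Spark the real operation is Iceberg's expire_snapshots procedure, which is not shown here.

```python
from typing import Dict, Set


def files_to_remove(snapshots: Dict[int, Set[str]], keep: Set[int]) -> Set[str]:
    """Return files referenced only by expired snapshots.

    A file can be physically deleted only when no retained snapshot
    still points at it.
    """
    still_needed: Set[str] = set().union(*(snapshots[s] for s in keep))
    all_files: Set[str] = set().union(*snapshots.values())
    return all_files - still_needed


# Hypothetical snapshot-to-file mapping around an erasure.
snapshots = {
    1: {"part-0.parquet", "part-1.parquet"},            # part-0 holds her rows
    2: {"part-0-rewritten.parquet", "part-1.parquet"},  # after the DELETE
}

print(files_to_remove(snapshots, keep={1, 2}))  # both snapshots retained
print(files_to_remove(snapshots, keep={2}))     # snapshot 1 expired
```

While snapshot 1 is retained, nothing can be removed and her rows stay on disk in part-0.parquet. Only once snapshot 1 is expired does that file become unreferenced and eligible for physical deletion.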


3. Worked Example

The same erasure request, Hive versus Iceberg. The difference is the whole course's thesis.

-- HIVE (raw Parquet): there is no real row-level delete.
-- You must rewrite. Conceptually:
--   1. find every Parquet file that might contain user 40021188
--   2. read each, filter out her rows, write a new file
--   3. atomically swap directories, pray no reader is mid-scan
-- Hours of work, terabytes rewritten to delete kilobytes, no audit trail.

-- ICEBERG: erasure is one transaction.
DELETE FROM events.orders WHERE user_id = '40021188';
-- Atomic. Concurrent-safe. Creates a new snapshot. Typically done in seconds to minutes.

And the part Hive can't do at all, proving it happened:

-- Snapshot history is an audit trail you get for free.
SELECT committed_at, snapshot_id, summary['deleted-records']
FROM   events.orders.snapshots
ORDER  BY committed_at DESC;
-- committed_at          snapshot_id   deleted-records
-- 2026-06-10 09:14:02   8841...       3                <- the erasure, recorded

-- Query the table BEFORE the delete to confirm her rows existed,
-- and AFTER to confirm they're gone, both from the same table.
SELECT count(*) FROM events.orders FOR VERSION AS OF <prev_snapshot>
WHERE user_id = '40021188';   -- 3  (she was here)
SELECT count(*) FROM events.orders   -- current snapshot
WHERE user_id = '40021188';   -- 0  (she's gone)

In Iceberg, the erasure and its proof are both ordinary queries against the table and its metadata. In Hive, the erasure is a fragile batch job and the proof doesn't exist. That gap is why serious PII platforms run on Iceberg (or Delta/Hudi, its cousins) and not on raw Parquet or Hive.
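If the rows returned by the snapshots query are pulled into application code, turning them into an audit record is a few lines. The sketch below assumes each row arrives as a dictionary with the same fields as the query above; the snapshot IDs are hypothetical placeholders, and deletion_commits is an illustrative helper, not a library function. Snapshot summary values are strings, so the count is converted before comparing.

```python
from typing import Dict, List


def deletion_commits(snapshot_log: List[Dict]) -> List[Dict]:
    """Pick out the commits that removed records, newest first."""
    evidence = []
    for entry in snapshot_log:
        # Summary is a string-to-string map; a missing key means no deletes.
        deleted = int(entry["summary"].get("deleted-records", "0"))
        if deleted > 0:
            evidence.append({
                "committed_at": entry["committed_at"],
                "snapshot_id": entry["snapshot_id"],
                "deleted_records": deleted,
            })
    return sorted(evidence, key=lambda e: e["committed_at"], reverse=True)


# Hypothetical rows, shaped like the output of the snapshots query.
snapshot_log = [
    {"committed_at": "2026-06-09 18:00:00", "snapshot_id": 101,
     "summary": {"added-records": "5000"}},
    {"committed_at": "2026-06-10 09:14:02", "snapshot_id": 102,
     "summary": {"deleted-records": "3"}},
]

evidence = deletion_commits(snapshot_log)
print(evidence)
```

Storing a record like this outside the table gives an auditor the timestamp and row count of the erasure even after the underlying snapshots are later expired.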

Aha: "Where is the PII" was the detection course's hard problem. "How do I delete it without rewriting the universe" is the storage layer's, and it's invisible until a real erasure request forces you to try. The reason Iceberg matters for privacy isn't that it's a faster query engine. It's that it added a metadata layer that tracks rows and history, and once rows and history are first-class, every GDPR verb (delete, prove, retain, expire) becomes a transaction instead of a heroic, terabyte-rewriting, corruption-prone batch job. The table format quietly decides whether compliance is a query or a crisis.


4. Your Turn

Exercise: TheWorldShop stores events.clickstream as raw Parquet in S3 (no table format) and is moving it to Iceberg. A regulator requires three capabilities: (a) delete a specific user's events on request, (b) prove to an auditor that a past deletion actually happened, (c) know which columns hold PII without opening the files.

  1. For each of the three requirements, explain why the raw-Parquet version fails and how Iceberg satisfies it (name the Iceberg feature).
  2. A teammate says "we ran the Iceberg DELETE, so the user's data is gone from storage." Explain why that's not yet fully true and what additional step is required.
  3. Why is "know which columns hold PII" a catalog/metadata problem rather than a data problem, and how does that connect to the detection manifest from the previous course?

5. Real-World Application

The shift from Hive/raw-Parquet to Iceberg (and its peers Delta Lake and Hudi) is one of the defining data-platform migrations of the last few years, and GDPR/CCPA erasure is one of its biggest drivers. Before table formats, "right to be forgotten" on a data lake was a genuine engineering crisis: companies built elaborate, fragile "forget-me" batch pipelines that rewrote partitions wholesale, ran for hours, and still couldn't prove they'd worked. Netflix (where Iceberg was created) and Apple are among its best-known large-scale users, and row-level deletes and snapshots are a large part of its appeal for compliance work: they turn a batch nightmare into ordinary table operations.

The audit-trail-for-free property is underappreciated until an auditor shows up. Regulators don't just want the data deleted; they want evidence it was deleted, with timestamps. Iceberg's immutable snapshot log provides exactly that: every commit is recorded, so "show me that you erased customer 40021188 on the date you claim" is a query, not a scramble through logs. One caveat: expiring a snapshot also removes it from that log, so capture the evidence (snapshot ID, timestamp, deleted-record count) in a separate audit record before you expire. Teams on raw Parquet have no answer to the auditor's question, which is itself a compliance finding.

The connection to the previous course is the architecture of a real PII platform: detect (PySpark + Presidio produces the manifest of where PII lives), then govern at the storage layer (Iceberg tags those columns, deletes subjects, proves erasure, expires old data). The manifest tells you what and where; Iceberg is how you act on it durably. The rest of this course turns each manifest finding into an Iceberg operation: tagging the columns, deleting the subjects, and proving it with time travel.


6. Recap + Bridge

PII operations (delete a subject, prove it, tag sensitive columns, expire old data) are storage-layer operations, and the table format decides whether they're queries or crises. Raw Parquet and Hive can't do row-level deletes or keep history, so erasure means rewriting terabytes with no audit trail. Iceberg adds a metadata layer that makes rows and snapshots first-class, so DELETE is an atomic transaction, time travel is a free audit trail, and column tags live in the catalog. The one catch: a delete creates a new snapshot but the old one persists, so making PII physically gone needs snapshot expiration (the next course).

Next lesson starts turning the detection manifest into Iceberg governance: column tagging. You'll mark PII columns with catalog properties (pii=true, sensitivity=high, gdpr_erasable=true), read those tags back from the catalog, and see how tagging becomes the hook that access policies and erasure jobs key on: the bridge from "we detected PII here" to "the table itself knows this column is sensitive."