Course: Iceberg Data Governance for PII | Duration: ~20 min | Lesson: 1 of 7
A GDPR erasure request lands on Dev's desk: customer 40021188 wants to be forgotten. Dev knows exactly where her data is, because the detection course built him a manifest: it's in events.orders, a Hive table backed by Parquet files in S3. Then he tries to actually delete her rows.
There's no DELETE. Hive tables on raw Parquet don't support it. To remove one customer's rows, he has to find every Parquet file containing them, read each file, filter her out, rewrite the whole file, and swap it in, while hoping no concurrent job is reading. For a table with thousands of files, that's a multi-hour job that rewrites terabytes to delete a few kilobytes, and if it fails halfway, the table is corrupt.
Dev's problem isn't finding the PII. It's that his storage layer was never built to delete anything. This course is about a table format that was: Apache Iceberg, where erasure, audit, and tagging are native operations instead of heroic batch jobs.
2. Concept Explanation
The PII operations you need (delete a subject's rows, prove you deleted them, tag which columns are sensitive, expire old data) are storage-layer operations. Whether they're easy or nearly impossible depends largely on the table format. Three generations:
Raw Parquet files (no table layer)
A directory of Parquet files with no metadata layer. Great for analytics scans, terrible for governance:
- No row-level delete. To remove rows, rewrite entire files. No transactions.
- No schema evolution safety. Add a column and old files don't know about it.
- No history. Once you overwrite a file, the old state is gone. No audit trail.
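To make the "rewrite entire files" cost concrete, here is a minimal sketch in plain Python. It is a toy, not real Parquet I/O: each "file" is just a list of rows in memory, and the file names and the delete_by_rewrite helper are hypothetical. It only counts how many surviving rows have to be written again to remove one customer.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]


def delete_by_rewrite(
    files: Dict[str, List[Row]], user_id: int
) -> Tuple[Dict[str, List[Row]], int, int]:
    """Remove one user's rows the raw-Parquet way: rewrite every file holding them.

    Returns (new_files, rows_rewritten, rows_removed).
    """
    new_files: Dict[str, List[Row]] = {}
    rows_rewritten = 0
    rows_removed = 0
    for path, rows in files.items():
        kept = [r for r in rows if r["user_id"] != user_id]
        if len(kept) == len(rows):
            new_files[path] = rows      # file never held the user; left alone
            continue
        rows_removed += len(rows) - len(kept)
        rows_rewritten += len(kept)     # every surviving row is written again
        new_files[path + ".rewritten"] = kept
    return new_files, rows_rewritten, rows_removed


# Hypothetical toy data: three "files", two of which contain the customer.
files = {
    "part-0.parquet": [{"user_id": 1}, {"user_id": 40021188}, {"user_id": 2}],
    "part-1.parquet": [{"user_id": 3}, {"user_id": 4}],
    "part-2.parquet": [
        {"user_id": 40021188},
        {"user_id": 5},
        {"user_id": 6},
        {"user_id": 7},
    ],
}
new_files, rewritten, removed = delete_by_rewrite(files, 40021188)
print(f"removed {removed} rows by rewriting {rewritten} others")
```

Even in this tiny case, more rows are rewritten than removed. On real tables the ratio is far worse, and nothing here is transactional: a crash between writing a replacement and swapping it in leaves the directory half-updated.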
Hive tables
Hive adds a metastore that tracks partitions, but the data is still Parquet files, and the metastore tracks directories, not rows. Most engines still can't do a real row-level DELETE on a Hive table, and there's no snapshot history. Slightly better cataloging, same governance pain.
Apache Iceberg (a real table format)
Iceberg adds a metadata layer over your Parquet (or ORC/Avro) files: a tree of metadata files tracking every snapshot, every data file, and every column. That metadata layer is what makes governance native:
- ACID transactions. DELETE FROM orders WHERE user_id = ... is a real, atomic, concurrent-safe operation.
- Row-level deletes. Iceberg can mark specific rows deleted without rewriting whole files (delete files), then compact later. Erasure becomes a query, not a batch job.
- Snapshots and time travel. Every write creates an immutable snapshot. You can query the table AS OF any past point, which is an audit trail you get for free.
- Schema and partition evolution. Add columns, change partitioning, without rewriting data, and old snapshots stay queryable.
- Catalog integration. Column-level properties and tags live in the catalog (REST, Glue, Nessie, Unity), so "this column is PII" is metadata the table carries.
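The snapshot and delete-file ideas in the list above can be sketched with a small in-memory model. This is a deliberately simplified illustration, not Iceberg's actual metadata layout: ToyTable, Snapshot, and the (file, row index) delete marks are hypothetical names invented here. It shows three things: every commit adds an immutable snapshot, a delete records marks instead of rewriting data files, and an older snapshot can still be read afterwards.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

Row = Dict[str, int]
Position = Tuple[str, int]  # (data file, row index inside that file)


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    operation: str
    data_files: Tuple[str, ...]
    deleted: FrozenSet[Position]  # rows masked out, in the spirit of delete files


class ToyTable:
    def __init__(self) -> None:
        self._files: Dict[str, List[Row]] = {}
        self.snapshots: List[Snapshot] = []

    def _commit(self, operation, data_files, deleted) -> None:
        # Each commit appends a new immutable snapshot; older ones are untouched.
        snap = Snapshot(
            len(self.snapshots) + 1, operation, tuple(data_files), frozenset(deleted)
        )
        self.snapshots.append(snap)

    def append(self, path: str, rows: List[Row]) -> None:
        self._files[path] = list(rows)
        if self.snapshots:
            prev = self.snapshots[-1]
            self._commit("append", prev.data_files + (path,), prev.deleted)
        else:
            self._commit("append", (path,), frozenset())

    def delete_where(self, predicate: Callable[[Row], bool]) -> None:
        # Record which rows are gone; the data files themselves are not rewritten.
        cur = self.snapshots[-1]
        marks = {
            (path, i)
            for path in cur.data_files
            for i, row in enumerate(self._files[path])
            if predicate(row)
        }
        self._commit("delete", cur.data_files, cur.deleted | marks)

    def scan(self, snapshot_id: Optional[int] = None) -> List[Row]:
        # snapshot_id=None reads the current state; an id reads the table "as of" it.
        snap = (
            self.snapshots[-1]
            if snapshot_id is None
            else self.snapshots[snapshot_id - 1]
        )
        return [
            row
            for path in snap.data_files
            for i, row in enumerate(self._files[path])
            if (path, i) not in snap.deleted
        ]


table = ToyTable()
table.append("f1.parquet", [{"user_id": 40021188, "total": 10}, {"user_id": 7, "total": 5}])
table.append("f2.parquet", [{"user_id": 40021188, "total": 3}])
table.delete_where(lambda r: r["user_id"] == 40021188)

current_rows = table.scan()                    # the customer is gone from the current view
rows_before_delete = table.scan(snapshot_id=2)  # but snapshot 2 still shows her rows
```

Note that the delete snapshot points at exactly the same data files as the one before it. That is what makes the delete cheap, and it is also why the deleted rows are still physically present.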
Why this is the right foundation for PII
Every GDPR/CCPA obligation maps to an Iceberg primitive:
- "Right to erasure" -> row-level DELETE + snapshot expiration (data actually leaves storage).
- "Prove the deletion happened" -> time travel (compare before/after snapshots).
- "Know where sensitive data is" -> column tags in the catalog.
- "Storage limitation / retention" -> partition expiration + snapshot expiration.
Dev's Hive nightmare (rewrite everything to delete one customer) simply doesn't exist in Iceberg. The delete is a transaction. That's the game changer.
The one catch to remember
A DELETE in Iceberg removes rows from the current snapshot, but the old snapshot (with the data) still exists for time travel. To make PII actually gone from storage, you also have to expire the old snapshots and clean up the files they reference, which is the subject of the whole next course. For now: Iceberg makes the delete easy; making it permanent is a second, deliberate step.
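Why the second step is needed can be shown with a few lines of set arithmetic. This is a toy model under stated assumptions: the delete is imagined as replacing one data file with a rewritten copy, and the file names, snapshot ids, and the unreferenced_files helper are all hypothetical. A file can only be removed from storage once no retained snapshot still points at it.

```python
from typing import Dict, Set


def unreferenced_files(
    snapshot_files: Dict[int, Set[str]], retained: Set[int]
) -> Set[str]:
    """Files that no retained snapshot points at, so they can be physically removed."""
    live: Set[str] = set()
    for snapshot_id in retained:
        live |= snapshot_files[snapshot_id]
    every_file: Set[str] = set()
    for files in snapshot_files.values():
        every_file |= files
    return every_file - live


# Hypothetical snapshot log: which data files each snapshot references.
snapshot_files = {
    1: {"orders-a.parquet", "orders-b.parquet"},            # before the delete
    2: {"orders-a.parquet", "orders-b-rewritten.parquet"},  # after the delete
}

# While snapshot 1 is retained for time travel, nothing can be removed.
still_held = unreferenced_files(snapshot_files, retained={1, 2})

# Once snapshot 1 is expired, the file holding the customer's rows is unreferenced.
removable = unreferenced_files(snapshot_files, retained={2})
```

In Iceberg's Spark integration, this kind of cleanup is exposed through the expire_snapshots procedure; the sketch above only illustrates the reachability idea behind it.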
3. Worked Example
The same erasure request, Hive versus Iceberg. The difference is the whole course's thesis.
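A rough sketch of the contrast, written as Python that only builds a plan and a statement rather than executing anything. The Hive side lists the per-file steps described in the opening scenario; the Iceberg side is a single Spark SQL DELETE. The catalog name lake, the file paths, and both helper functions are assumptions made for illustration, and user_id is taken from the DELETE example earlier in this lesson rather than from a known schema.

```python
from typing import List


def hive_erasure_plan(files_with_user: List[str]) -> List[str]:
    """Per-file steps for removing one customer from a Hive table on raw Parquet."""
    plan: List[str] = []
    for path in files_with_user:
        plan += [
            f"read {path}",
            f"filter the customer's rows out of {path}",
            f"write a replacement for {path}",
            f"swap the replacement in for {path}",
        ]
    return plan


def iceberg_erasure_sql(table: str, user_id: int) -> str:
    """One atomic statement; int() keeps the id from being spliced in as raw text."""
    return f"DELETE FROM {table} WHERE user_id = {int(user_id)}"


# Hypothetical inputs: three files known (from the manifest) to hold the customer.
hive_plan = hive_erasure_plan(
    [
        "s3://bucket/events/orders/part-0.parquet",
        "s3://bucket/events/orders/part-1.parquet",
        "s3://bucket/events/orders/part-2.parquet",
    ]
)
iceberg_stmt = iceberg_erasure_sql("lake.events.orders", 40021188)

print(len(hive_plan), "non-atomic steps on Hive")
print(iceberg_stmt)
# With a SparkSession configured for an Iceberg catalog (assumed, not shown here),
# the statement would be run as: spark.sql(iceberg_stmt)
```

The Hive plan grows with the number of files and can fail between any two steps. The Iceberg statement either commits as a new snapshot or does not happen at all.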
And the part Hive can't do at all, proving it happened:
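One way the proof could look, again as Python that assembles Spark SQL strings without running them. It relies on two Iceberg features in Spark: the snapshots metadata table and VERSION AS OF time travel. The catalog name lake, the snapshot ids 1001 and 1002, the user_id column, and the helper names are hypothetical; real snapshot ids would be read from the commit log query first.

```python
from typing import Dict


def proof_queries(
    table: str, user_id: int, before_id: int, after_id: int
) -> Dict[str, str]:
    """Queries an auditor could run: the commit log plus before/after row counts."""
    uid = int(user_id)
    return {
        "commit_log": (
            f"SELECT snapshot_id, committed_at, operation "
            f"FROM {table}.snapshots ORDER BY committed_at"
        ),
        "rows_before": (
            f"SELECT count(*) FROM {table} VERSION AS OF {int(before_id)} "
            f"WHERE user_id = {uid}"
        ),
        "rows_after": (
            f"SELECT count(*) FROM {table} VERSION AS OF {int(after_id)} "
            f"WHERE user_id = {uid}"
        ),
    }


def erasure_proved(rows_before: int, rows_after: int) -> bool:
    """The data existed in the earlier snapshot and is absent from the later one."""
    return rows_before > 0 and rows_after == 0


# Hypothetical snapshot ids for the commits just before and just after the DELETE.
queries = proof_queries("lake.events.orders", 40021188, before_id=1001, after_id=1002)
for name, sql in queries.items():
    print(f"{name}: {sql}")
```

The commit log supplies the timestamp of the delete commit, and the two counts show the rows present before it and absent after it.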
In Iceberg, the erasure and its proof are both ordinary queries against table metadata. In Hive, the erasure is a fragile batch job and the proof doesn't exist. That gap is why serious PII platforms run on Iceberg (or Delta/Hudi, its cousins) and not on raw Parquet or Hive.
Aha: "Where is the PII" was the detection course's hard problem. "How do I delete it without rewriting the universe" is the storage layer's, and it's invisible until a real erasure request forces you to try. The reason Iceberg matters for privacy isn't that it makes queries faster. It's that it added a metadata layer that tracks rows and history, and once rows and history are first-class, every GDPR verb (delete, prove, retain, expire) becomes a transaction instead of a heroic, terabyte-rewriting, corruption-prone batch job. The table format quietly decides whether compliance is a query or a crisis.
4. Your Turn
Exercise: TheWorldShop stores events.clickstream as raw Parquet in S3 (no table format) and is moving it to Iceberg. A regulator requires three capabilities: (a) delete a specific user's events on request, (b) prove to an auditor that a past deletion actually happened, (c) know which columns hold PII without opening the files.
- For each of the three requirements, explain why the raw-Parquet version fails and how Iceberg satisfies it (name the Iceberg feature).
- A teammate says "we ran the Iceberg DELETE, so the user's data is gone from storage." Explain why that's not yet fully true and what additional step is required.
- Why is "know which columns hold PII" a catalog/metadata problem rather than a data problem, and how does that connect to the detection manifest from the previous course?
5. Real-World Application
The shift from Hive/raw-Parquet to Iceberg (and its peers Delta Lake and Hudi) is one of the defining data-platform migrations of the last few years, and GDPR/CCPA erasure is one of its biggest drivers. Before table formats, "right to be forgotten" on a data lake was a genuine engineering crisis: companies built elaborate, fragile "forget-me" batch pipelines that rewrote partitions wholesale, ran for hours, and still couldn't prove they'd worked. Netflix (which created Iceberg), Apple, and countless others adopted it partly because row-level deletes and snapshots turned compliance from a batch nightmare into ordinary table operations.
The audit-trail-for-free property is underappreciated until an auditor shows up. Regulators don't just want the data deleted; they want evidence it was deleted, with timestamps. Iceberg's immutable snapshot log provides exactly that: every commit is recorded, so "show me that you erased customer 40021188 on the date you claim" is a query, not a scramble through logs. Teams on raw Parquet have no answer to that question, which is itself a compliance finding.
The connection to the previous course is the architecture of a real PII platform: detect (PySpark + Presidio produces the manifest of where PII lives), then govern at the storage layer (Iceberg tags those columns, deletes subjects, proves erasure, expires old data). The manifest tells you what and where; Iceberg is how you act on it durably. The rest of this course turns each manifest finding into an Iceberg operation: tagging the columns, deleting the subjects, and proving it with time travel.
6. Recap + Bridge
PII operations (delete a subject, prove it, tag sensitive columns, expire old data) are storage-layer operations, and the table format decides whether they're queries or crises. Raw Parquet and Hive can't do row-level deletes or keep history, so erasure means rewriting terabytes with no audit trail. Iceberg adds a metadata layer that makes rows and snapshots first-class, so DELETE is an atomic transaction, time travel is a free audit trail, and column tags live in the catalog. The one catch: a delete creates a new snapshot but the old one persists, so making PII physically gone needs snapshot expiration (the next course).
The next lesson starts turning the detection manifest into Iceberg governance with column tagging. You'll mark PII columns with catalog properties (pii=true, sensitivity=high, gdpr_erasable=true), read those tags back from the catalog, and see how tagging becomes the hook that access policies and erasure jobs key on: the bridge from "we detected PII here" to "the table itself knows this column is sensitive."