Lesson 1: The PII Lifecycle

Course: Iceberg PII Lifecycle | Duration: ~20 min | Lesson: 1 of 7


Priya closes a GDPR erasure ticket: customer deleted, before/after time-travel proof attached, audit record written. Clean. Three months later, a security review runs a recovery test on the warehouse and pulls that exact customer's full record back out of an old Iceberg snapshot, intact. Email, address, order history, all of it, sitting in MinIO the whole time.

Priya didn't make a mistake in the erasure. She made a mistake about what "erased" means. The DELETE removed the customer from the live table. It never removed the customer's data from storage, because the snapshot holding that data was never expired. Logically gone, physically present, for ninety days.

The previous course taught you to delete and prove. This course teaches the half nobody sees until a recovery test or an auditor finds it: making PII actually leave the bytes. That's a lifecycle (collect, use, retain, delete, purge), and the delete is only the fourth step of five.


2. Concept Explanation

PII has a lifecycle, and each stage maps to a concrete Iceberg operation. Governance (last course) handled detection, tagging, and the logical delete. Lifecycle handles what happens over time: how long data is kept, and how it's physically removed when its time is up.

The five stages

  • Collect. Data lands (ingestion). The relevant choice: write it with retention metadata so the system knows when it expires.
  • Use. Active querying. Access-controlled and masked (prior courses).
  • Retain. The data sits for a defined window, set by law (GDPR storage limitation, HIPAA minimums) or business need. This is a policy, and it must be encoded somewhere the system can act on.
  • Delete. The logical removal: a subject erasure (DELETE) or a retention-window expiry. Removes data from the live table.
  • Purge. The physical removal: expiring snapshots, dropping old partitions, and cleaning files so the bytes leave storage. This is the step Priya skipped.

The trap is treating "delete" as the end. Delete is logical; purge is physical, and only purge satisfies "the data is gone."

Why delete isn't purge (the recurring caveat, made central)

From the governance course: an Iceberg DELETE writes a delete file (or rewrites the affected data files), but old snapshots still reference the original data files. That is what powers time travel and audit. So after a delete, the data is:

  • absent from the current snapshot (queries don't see it),
  • present in old snapshots (time travel and, crucially, anyone with storage access can recover it).
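
The old snapshots are visible in the table's own metadata. A minimal sketch, assuming the events.orders table from the worked example later in this lesson and a Spark session whose current catalog is the Iceberg catalog:

```sql
-- List every snapshot the table still tracks, oldest first.
-- After a DELETE, the newest row is the delete commit (its operation is
-- typically 'delete' or 'overwrite', depending on the write mode).
-- Every earlier row is a snapshot that time travel can still read.
SELECT snapshot_id, committed_at, operation
FROM events.orders.snapshots
ORDER BY committed_at;
```

Any snapshot_id committed before the delete can be used with FOR VERSION AS OF to read the rows back, which is exactly how a recovery test finds them.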

Purge is the set of operations that remove those old references and the underlying files:

  • Snapshot expiration (expire_snapshots): drop old snapshots and delete the data files that only those snapshots referenced.
  • Partition expiration / drop: remove whole partitions of aged-out data.
  • Orphan-file cleanup (remove_orphan_files): delete files in the table location that no table metadata references at all, such as leftovers from failed or interrupted writes.

Only after these run is the data physically gone.
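
Because these operations delete files irreversibly, it can help to preview the cleanup before running it. A sketch, assuming the same procedure syntax and table name as the worked example in this lesson; the timestamp is illustrative:

```sql
-- Preview only: with dry_run => true the procedure returns the file
-- locations it would delete and leaves storage untouched.
CALL system.remove_orphan_files(
  table => 'events.orders',
  older_than => TIMESTAMP '2026-06-03 00:00:00',
  dry_run => true);
```

Keeping older_than comfortably in the past matters here: files from writes that are still in progress are not yet referenced by any metadata, and the procedure's default cutoff of three days exists to avoid deleting them.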

Retention as the organizing policy

Retention is the spine of the lifecycle. "Keep order data 7 years, then delete" or "purge raw event PII after 30 days" is a policy that must become table configuration and scheduled operations. The lifecycle course is essentially: translate legal/business retention into Iceberg config (next lesson), then run the expiration operations on schedule to enforce it (the rest).
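
Turning a policy into configuration usually starts with unit conversion, since table properties are written as day counts or milliseconds rather than "7 years". A small sketch of the arithmetic (the column aliases are illustrative, and leap days are ignored):

```sql
SELECT
  7 * 365                                  AS seven_years_in_days,  -- 2555
  7 * 24 * 60 * 60 * 1000                  AS seven_days_in_ms,     -- 604800000
  -- 30 days in milliseconds exceeds the 32-bit INT range,
  -- so the arithmetic is done in BIGINT.
  CAST(30 AS BIGINT) * 24 * 60 * 60 * 1000 AS thirty_days_in_ms;    -- 2592000000
```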

The tension you'll manage throughout

Two forces pull against each other, and every lifecycle decision balances them:

  • Keep longer: audit trail, recovery from bad writes, time-travel debugging, legal hold.
  • Purge sooner: storage-limitation law, breach-blast-radius reduction, cost, honoring erasure.

You can't satisfy both with one knob. The resolution (from the governance course, formalized here): capture the evidence you need into durable audit records, then purge the data on a tight schedule. Proof and data get separate retention.
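
One possible shape for that separation is a small audit table that stores counts and snapshot IDs but no personal data. This is a sketch under assumptions: the governance.erasure_audit table, its columns, and the ticket ID are hypothetical, and <pre_delete> / <post_delete> are placeholders for the snapshot IDs on either side of the erasure.

```sql
-- Hypothetical audit table: evidence about the erasure, not the erased data.
CREATE TABLE IF NOT EXISTS governance.erasure_audit (
  ticket_id            STRING,
  table_name           STRING,
  rows_before          BIGINT,
  rows_after           BIGINT,
  pre_delete_snapshot  BIGINT,
  post_delete_snapshot BIGINT,
  recorded_at          TIMESTAMP
) USING iceberg;

-- Record the proof while the pre-delete snapshot still exists,
-- i.e. before the purge removes it.
INSERT INTO governance.erasure_audit
SELECT 'GDPR-0001', 'events.orders', 3, 0,
       <pre_delete>, <post_delete>, current_timestamp();
```

The audit row outlives the snapshots it describes, so it can be kept on the long clock while the data itself is purged on the short one.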

Where this sits in the track

Detect (course 1) -> govern and delete (course 2) -> lifecycle: retain and purge (this course) -> capstone (everything wired together). This course is what makes erasure permanent and retention enforced. That is the difference between a warehouse that claims compliance and one that can prove it under a recovery test.


3. Worked Example

The lifecycle of one customer's data, stage by stage, showing where "deleted" becomes "gone."

-- COLLECT: write with retention intent (table-level policy, next lesson).
ALTER TABLE events.orders SET TBLPROPERTIES (
  'retention.days' = '2555',          -- 7 x 365 days; custom property for your jobs, Iceberg does not act on it
  'history.expire.max-snapshot-age-ms' = '604800000'  -- 7 days; default age limit applied when expire_snapshots runs
);

-- USE: queried under masks/filters (prior courses). [no-op here]

-- DELETE (logical): subject erasure. Removes from the LIVE table only.
DELETE FROM events.orders WHERE user_id = '40021188';
SELECT count(*) FROM events.orders WHERE user_id = '40021188';   -- 0 (live)
SELECT count(*) FROM events.orders FOR VERSION AS OF <pre_delete>
WHERE user_id = '40021188';                                       -- 3 (still in storage!)

-- PURGE (physical): the steps that actually remove the bytes.
CALL system.rewrite_data_files(table => 'events.orders');         -- drop rows from current files
CALL system.expire_snapshots(                                     -- drop old snapshots...
  table => 'events.orders',
  older_than => TIMESTAMP '2026-06-03 00:00:00');                 -- ...that referenced her data
CALL system.remove_orphan_files(table => 'events.orders',         -- clean unreferenced files
  older_than => TIMESTAMP '2026-06-03 00:00:00');

-- NOW she is gone from storage:
SELECT count(*) FROM events.orders FOR VERSION AS OF <pre_delete>
WHERE user_id = '40021188';   -- ERROR / unavailable: snapshot expired, data purged

The before/after pair is the whole lesson. After DELETE, the live count is 0 but the time-travel count is still 3: the data is in storage and recoverable, exactly as in Priya's incident. Only after rewrite_data_files + expire_snapshots + remove_orphan_files does the old snapshot stop returning her, because the data files holding her rows are physically removed. "Deleted" was step one; "gone" took three more.
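
The "after" half of the pair can also be checked from metadata rather than by waiting for a time-travel query to fail. A minimal sketch, assuming the same cutoff timestamp as the purge calls above:

```sql
-- Snapshots committed before the purge cutoff that are still tracked.
-- Expected to be 0 after expire_snapshots, unless the current snapshot
-- itself predates the cutoff (the current snapshot is never expired).
SELECT count(*) AS snapshots_before_cutoff
FROM events.orders.snapshots
WHERE committed_at < TIMESTAMP '2026-06-03 00:00:00';
```

A non-zero result is worth investigating before closing the ticket, since any remaining pre-delete snapshot still makes the erased rows readable.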

Aha: "Delete" and "gone" are different events separated by a retention window, and the gap between them is where your erased customers actually live. Priya's customer was logically deleted and physically present for ninety days, fully recoverable, while the audit record said "erased." The lifecycle exists to close that gap on purpose: not by deleting harder, but by expiring the snapshots that keep the deleted data alive for time travel. The uncomfortable truth is that the feature you love for audit (immutable history) is the same feature that keeps your "deleted" PII in the bucket, and purge is the deliberate act of giving up history to honor erasure.


4. Your Turn

Exercise: TheWorldShop's privacy policy promises that erased customer data is physically removed within 30 days, and that raw event PII is purged 90 days after collection. Currently the team runs subject DELETEs but no expiration jobs.

  1. List the five lifecycle stages and, for each, name the Iceberg operation (or "no-op") that implements it.
  2. Explain, using the delete-vs-purge distinction, why the current setup violates the "physically removed within 30 days" promise even though deletes run correctly.
  3. Describe the minimum set of scheduled operations the team must add to actually honor both promises (subject erasure within 30 days, raw PII purged at 90 days), and why each is needed.

5. Real-World Application

The delete-isn't-purge gap is one of the most common, and most dangerous, misunderstandings in lakehouse privacy, and it surfaces exactly as Priya's incident did: a recovery test, a penetration test, or an auditor pulls "deleted" data out of object storage months later. Organizations have failed audits and triggered breach-notification obligations because data they reported as erased was still physically present and recoverable. The fix is never "delete more carefully." It is adding the purge lifecycle (expiration + cleanup) that physically removes the bytes, on a schedule that matches the retention promises.

Retention-as-configuration is how mature data platforms operationalize storage-limitation law. Rather than a human deciding per-table when to delete, the retention window is encoded in table properties and enforced by scheduled expiration jobs, so "raw event PII is purged after 90 days" becomes a partition-drop-plus-snapshot-expire job that runs nightly, not a quarterly cleanup someone forgets. This is also a cost story: snapshots and orphan files accumulate storage indefinitely without expiration, so the same jobs that honor privacy also control the bill.
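
A nightly job of that kind might look like the sketch below. The events.raw_events table and its event_date column are hypothetical, and the older_than literal stands in for a value the scheduler would compute on each run (the current time minus the snapshot retention).

```sql
-- Step 1 (logical): remove rows older than the 90-day retention window.
-- If the table is partitioned by event_date, a filter that matches whole
-- partitions can be applied without rewriting data files.
DELETE FROM events.raw_events
WHERE event_date < date_sub(current_date(), 90);

-- Step 2 (physical): expire the snapshots that still reference those rows,
-- which also deletes the data files only those snapshots used.
CALL system.expire_snapshots(
  table => 'events.raw_events',
  older_than => TIMESTAMP '2026-06-03 00:00:00');
```

Note that the two windows add up: bytes leave storage only once the snapshot retention has also elapsed after the delete, so a "purged at 90 days" promise has to budget for both.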

The proof-versus-data separation is the governance pattern that makes the whole thing legally coherent, and it's why this course pairs purge with audit. You purge the personal data aggressively (short snapshot retention) to honor erasure and storage limitation, while keeping durable audit records (the erasure evidence from the governance course) long-term to prove you did it. Two clocks: data on a short one, proof on a long one. The rest of this course builds each piece (retention config, partition expiration, snapshot expiration, orphan cleanup) and then wires them into an automated, verifiable purge pipeline.


6. Recap + Bridge

PII has a five-stage lifecycle (collect, use, retain, delete, purge), and the dangerous mistake is stopping at delete. Delete is logical (removes data from the live table); purge is physical (expires snapshots, drops partitions, cleans orphan files so the bytes leave storage). Only purge satisfies "the data is gone," and Priya's recoverable-after-90-days incident is what happens when you skip it. Retention is the organizing policy: translate legal/business windows into table config and scheduled expiration. Manage the keep-longer-vs-purge-sooner tension by separating proof (kept long) from data (purged fast).

Next lesson starts building the lifecycle at stage one: retention policies. You'll translate legal requirements ("keep 7 years," "purge raw PII after 30 days," "honor legal holds") into concrete Iceberg table properties and partition design: the configuration that the expiration jobs in later lessons will enforce.