Lesson 1: Architecture Overview

Course: Capstone: End-to-End PII Pipeline | Duration: ~20 min | Lesson: 1 of 8


Priya's manager makes it official: "Build us a real PII pipeline. Raw data comes in, and I want every piece of personal information detected, protected, catalogued, and erasable on request, end to end, no gaps." It's the whole track in one sentence, and Priya has, scattered across three courses, every piece she needs: a Presidio detector, masking techniques, Iceberg governance, a purge lifecycle.

What she doesn't have yet is the architecture: the order the pieces go in, the zones data flows through, and the decisions that make them one coherent system instead of four disconnected demos. Where does detection run? What gets masked, and by which technique? How does a tag set in one stage drive an erasure three stages later? Get the wiring wrong and you have a pipeline that detects PII it never protects, or masks data it can't erase.

This capstone builds that system. This lesson is the blueprint: the zones, the data flow, and why each component sits where it does.


2. Concept Explanation

The capstone pipeline takes raw data through five stages plus an on-demand erasure path. Each one builds on a course you've already done, now wired into a single flow with metadata carrying decisions forward.

The data flow: Raw -> Detect -> Mask -> Govern -> Store (-> Erase, on request)

[ingest]      raw lands in a restricted BRONZE zone (Iceberg, access-locked)
   |
[detect]      Presidio + regex scan -> PII manifest (what/where, per column)
   |
[mask]        per the manifest + sensitivity tier, apply the right technique
   |          -> SILVER zone (masked, joinable, analytics-safe)
   |
[govern]      tag columns in the catalog, set lineage, set owners
   |
[store]       write Iceberg with PII metadata embedded + retention config
   |
[erase]       subject-erasure + lifecycle purge, end to end, verifiable

Why this order

  • Ingest to a locked bronze zone first. Raw data contains undetected PII, so it lands in a tightly access-controlled zone before anything is masked. You never let raw, unscanned data into a broadly readable place. (Lesson 2.)
  • Detect before masking. You can't mask what you haven't found, and the manifest decides what to mask and how. Detection is the input to every later decision. (Lesson 3.)
  • Mask by tier, not uniformly. The detected sensitivity tier routes each field to the right technique: restricted to tokenization/FPE, internal to keyed hashing (HMAC), public to pass-through. One technique for everything either over-protects low-risk fields and loses analytic value, or under-protects high-risk ones. (Lesson 4, drawing on Track 2.)
  • Govern with tags and lineage. The manifest's findings become catalog tags on the silver table, with lineage from bronze, so policy and erasure can key on them. (Lesson 5.)
  • Store in Iceberg with metadata embedded. The masked, governed data lands in Iceberg with PII tags and retention config baked in, ready for erasure and lifecycle. (Lesson 6.)
  • Erase end to end. A subject request flows through all of it: find the subject across tables, delete/redact by tag, purge physically, verify, audit. (Lesson 7.)
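The "mask by tier" routing above can be sketched in plain Python. This is a minimal illustration, not the capstone's implementation: the key, the in-memory token map, and the choice to treat unlisted fields as restricted are all assumptions made for the example.

```python
import hashlib
import hmac
import secrets

# Hypothetical key; a real pipeline would load this from a secrets manager.
HMAC_KEY = b"demo-key-not-for-production"

# Hypothetical in-memory token map standing in for a token vault.
_value_to_token = {}


def tokenize(value):
    """Return a random, stable token for a value (same value -> same token)."""
    if value not in _value_to_token:
        _value_to_token[value] = "tok_" + secrets.token_hex(8)
    return _value_to_token[value]


def keyed_hash(value):
    """One-way keyed hash (HMAC-SHA256); deterministic, so it stays joinable."""
    return hmac.new(HMAC_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pass_through(value):
    return value


TECHNIQUE_BY_TIER = {
    "restricted": tokenize,
    "internal": keyed_hash,
    "public": pass_through,
}


def mask_record(record, tier_by_field):
    """Mask each string field with the technique its tier calls for."""
    masked = {}
    for field, value in record.items():
        # Assumption: a field missing from the manifest gets the strongest tier.
        tier = tier_by_field.get(field, "restricted")
        masked[field] = TECHNIQUE_BY_TIER[tier](value)
    return masked
```

Calling mask_record({"email": "a@example.com", "country": "DE"}, {"email": "restricted", "country": "public"}) would return a token for the email and leave the country untouched. Because both tokenization and the keyed hash are deterministic here, masked values still join across tables.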

The metadata spine

The thing that makes this one system and not four demos is metadata flowing forward:

  • Detection emits a manifest (column, entity, confidence, tier).
  • The manifest drives masking (which fields, which technique).
  • The manifest becomes catalog tags (sensitivity, gdpr_erasable).
  • The tags drive access policy and erasure scope.
  • Erasure writes an audit record proving it happened.

Each stage's output is the next stage's input. Lose the metadata between stages and you've broken the pipeline into disconnected steps.
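To make the spine concrete, here is a small sketch in which one manifest object feeds masking, tagging, and erasure scope. The Finding class, the technique names, and the rule that every non-public finding is erasable are illustrative assumptions, not the course's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One manifest row: what was detected, where, and how sensitive it is."""
    column: str
    entity: str
    confidence: float
    tier: str


# Assumed mapping from tier to technique name, mirroring the tiers above.
TECHNIQUE = {"restricted": "tokenize", "internal": "hmac", "public": "pass-through"}


def masking_plan(manifest):
    """The manifest drives masking: column -> technique."""
    return {f.column: TECHNIQUE[f.tier] for f in manifest}


def catalog_tags(manifest):
    """The manifest becomes catalog tags: column -> tag dict."""
    # Assumption for this sketch: anything above the public tier is erasable.
    return {
        f.column: {"sensitivity": f.tier, "gdpr_erasable": f.tier != "public"}
        for f in manifest
    }


def erasure_scope(tags):
    """The tags drive erasure scope: which columns a subject request touches."""
    return sorted(col for col, tag in tags.items() if tag["gdpr_erasable"])
```

Nothing is retyped between the three functions. Change a finding's tier in the manifest and the masking plan, the tags, and the erasure scope all change with it.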

The zones (medallion, privacy-flavored)

  • Bronze: raw, unmasked, access-locked. The only place plaintext PII lives, behind the tightest controls.
  • Silver: masked, tagged, analytics-safe. What most of the org reads.
  • Gold (implied): aggregated/generalized, broadly shareable (Track 2's spectrum).

PII protection increases as data flows bronze -> silver -> gold, exactly the masking spectrum from Track 2, now realized as physical zones.
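One way to picture the zone rules is as a small policy table plus two checks. The role names and the column-state labels below are hypothetical; in the capstone these controls would live in the catalog and storage layer rather than in application code.

```python
# Hypothetical zone policy: who may read, and whether plaintext PII may land there.
ZONES = {
    "bronze": {"allows_plaintext_pii": True, "readers": {"pipeline"}},
    "silver": {"allows_plaintext_pii": False, "readers": {"pipeline", "analyst"}},
    "gold": {"allows_plaintext_pii": False, "readers": {"pipeline", "analyst", "partner"}},
}


def can_read(role, zone):
    """True if the role is on the zone's reader list."""
    return role in ZONES[zone]["readers"]


def check_write(zone, column_states):
    """Reject a write that would put plaintext PII outside bronze.

    column_states maps column name -> "plaintext", "masked", or "non_pii".
    """
    if ZONES[zone]["allows_plaintext_pii"]:
        return
    leaked = sorted(c for c, state in column_states.items() if state == "plaintext")
    if leaked:
        raise ValueError(f"plaintext PII not allowed in {zone}: {leaked}")
```

The reader lists widen from bronze to gold while the plaintext rule tightens, which is the same gradient the zones describe.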

The technology

The stack is Spark (the engine that runs detection and transforms), Presidio (detection), Iceberg (storage, governance, lifecycle), and a catalog (tags, policy). The capstone wires together the same stack the track has used throughout: no new tools, just integration.


3. Worked Example

The pipeline as one orchestrated flow, each stage calling the capability a prior course built.

def pii_pipeline(spark, raw_path, run_date):
    # 1. INGEST -> bronze (locked zone). Raw, unmasked, access-restricted.
    bronze = spark.read.json(raw_path)
    bronze.writeTo("lake.bronze.events").append()        # bronze is access-locked

    # 2. DETECT (Course 1): scan free-text/JSON, emit the manifest.
    manifest = scan_for_pii(spark, "lake.bronze.events")  # Presidio + regex UDF
    write_manifest(spark, manifest)                        # column, entity, tier, confidence

    # 3. MASK (Track 2) by tier, driven by the manifest, then land the result in silver.
    #   restricted -> tokenize/FPE, internal -> HMAC, public -> pass-through
    silver = apply_masking(spark, "lake.bronze.events", manifest)
    # The silver table must exist before step 4 can tag it, so it is written here.
    silver.writeTo("lake.silver.events").using("iceberg").createOrReplace()

    # 4. GOVERN (Course 2): tag silver columns from the manifest, record lineage.
    tag_columns(spark, "lake.silver.events", manifest)     # sensitivity, gdpr_erasable
    record_lineage(spark, src="lake.bronze.events", dst="lake.silver.events")

    # 5. STORE (Course 2/3): embed retention config in the Iceberg table.
    set_retention(spark, "lake.silver.events", purge_days=2555, snapshot_days=14)

# 6. ERASE (Courses 2-3), triggered by a subject request, not the nightly run.
def erase_subject(spark, subject_id):
    for tbl in tables_with_subject(subject_id):            # from manifest/lineage
        pre = snapshot_id(spark, tbl)
        redact_or_delete(spark, tbl, subject_id)           # tag-aware (gdpr_erasable)
        write_audit(spark, tbl, subject_id, pre, snapshot_id(spark, tbl))
    purge_and_verify(spark, subject_id)                    # compact, expire, orphans, verify

Read top to bottom and the whole track is there: detection feeds the manifest, the manifest feeds masking and tagging, the tags feed erasure, and erasure feeds the audit. The erase_subject function consumes everything the pipeline produced: it knows which tables hold the subject (from lineage), how to remove their PII (from tags), and how to prove it (audit + verify). That's the difference between four disconnected demos and one system.

Aha: A PII pipeline isn't four tools in a row; it's one metadata spine with tools hanging off it. The manifest that detection produces is the same object that decides masking, becomes the catalog tags, scopes the erasure, and shapes the audit. Cut the metadata between any two stages and the system collapses into disconnected demos: a detector whose findings nobody acts on, a masker guessing what's sensitive, an erasure job that can't find the subject. The architecture is the metadata flow. Spark and Iceberg are just where it runs.
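The erasure path in the worked example can be imitated without Spark to see how tags and audit fit together. The sketch below uses in-memory dictionaries as stand-ins for Iceberg tables and catalog tags. The table names, the sample rows, and the audit fields are invented for illustration, and the physical purge (compaction, snapshot expiry) is out of scope here.

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-ins for silver tables.
TABLES = {
    "silver.events": [
        {"subject_id": "s1", "email_token": "tok_a", "country": "DE"},
        {"subject_id": "s2", "email_token": "tok_b", "country": "FR"},
    ],
    "silver.orders": [
        {"subject_id": "s1", "email_token": "tok_a", "total": 40},
    ],
}

# Stand-in for catalog tags: columns marked gdpr_erasable, per table.
ERASABLE = {
    "silver.events": ["subject_id", "email_token"],
    "silver.orders": ["subject_id", "email_token"],
}

AUDIT_LOG = []


def tables_with_subject(subject_id):
    """Find every table that still holds rows for the subject."""
    return [
        name for name, rows in TABLES.items()
        if any(row["subject_id"] == subject_id for row in rows)
    ]


def erase_subject(subject_id):
    """Redact tagged columns for the subject, audit each table, then verify."""
    for tbl in tables_with_subject(subject_id):
        redacted = 0
        for row in TABLES[tbl]:
            if row["subject_id"] == subject_id:
                for col in ERASABLE[tbl]:
                    row[col] = None  # tag-aware redaction; other columns survive
                redacted += 1
        AUDIT_LOG.append({
            "table": tbl,
            "subject_id": subject_id,
            "rows_redacted": redacted,
            "at": datetime.now(timezone.utc).isoformat(),
        })
    # Verification: no table should still match the subject.
    return not tables_with_subject(subject_id)
```

After erase_subject("s1"), the subject's identifying columns are null in both tables, the non-PII columns (country, total) remain available for analytics, and the audit log holds one record per table touched.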


4. Your Turn

Exercise: Sketch the architecture for TheWorldShop's pipeline. Raw customer events (JSON with nested PII) arrive hourly. The business needs analytics on masked data, and must honor GDPR erasure within 14 days with proof.

  1. List the five stages in order, followed by the erasure path, and for each name the component/course it uses and the key artifact it produces or consumes.
  2. Explain why raw data must land in a locked bronze zone before detection and masking, rather than masking on ingest. What goes wrong if you mask first?
  3. Trace how a single piece of metadata, the detection finding "field payload.email is EMAIL, tier=restricted, erasable", flows through the rest of the pipeline and ultimately enables the 14-day erasure-with-proof.

5. Real-World Application

This architecture (raw to locked bronze, detect, mask by tier, govern with tags, store in Iceberg, erase with proof) is the reference shape of a privacy-aware data platform, and every component maps to something teams run in production. The medallion zones (bronze/silver/gold) are standard lakehouse architecture; the privacy twist is treating bronze as the access-locked home of unscanned PII and increasing protection as data flows toward gold, which is Track 2's masking spectrum realized as physical layers. Companies building GDPR/CCPA-compliant lakehouses tend to converge on this shape because each constraint (detect everything, protect by sensitivity, prove erasure) forces a specific stage.

The metadata-spine insight is what separates platforms that scale from ones that don't. A pipeline where detection findings are manually re-entered as masking rules, then manually re-entered as catalog tags, then manually mapped for erasure, breaks the moment the warehouse grows, because the manual re-entry drifts and the gaps become leaks. The platforms that work make the manifest a first-class artifact that flows automatically: detect once, and the finding propagates to masking, tagging, policy, and erasure scope without a human retyping it. This is exactly why detection (Course 1) produces a durable, queryable manifest rather than a report.

The bronze-first discipline is a real principle, and one that is frequently violated. The tempting shortcut, "mask on ingest so we never store raw PII", fails because you can't correctly mask data you haven't detected, and ingestion is exactly when you know the least about what's in the free-text and JSON fields. Mature pipelines land raw data in a tightly controlled zone, detect on complete data, then mask. They accept that plaintext PII exists somewhere (bronze) under strict controls rather than pretending it doesn't and masking blindly. The rest of this capstone builds each stage in order, starting with that locked ingestion zone next lesson.
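A toy example shows why masking on ingest falls short. It assumes a rule that masks only a known field name, and a deliberately simple email regex standing in for real detection; both are illustrative, not how Presidio works.

```python
import re

# Deliberately simple email pattern, a stand-in for a real detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_on_ingest(event, known_pii_fields=("email",)):
    """The shortcut: mask only top-level fields already known to hold PII."""
    return {k: ("***" if k in known_pii_fields else v) for k, v in event.items()}


def walk_strings(obj, path=""):
    """Yield (path, text) for every string in a nested dict/list structure."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from walk_strings(value, f"{path}.{key}" if path else key)
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from walk_strings(value, f"{path}[{i}]")
    elif isinstance(obj, str):
        yield path, obj


def detect_emails(event):
    """Scan the complete event and report every path containing an email."""
    return sorted(path for path, text in walk_strings(event) if EMAIL_RE.search(text))


event = {
    "email": "ana@example.com",
    "payload": {"note": "contact ana@example.com after 5pm"},
    "amount": "12.50",
}
blind = mask_on_ingest(event)
leaks = detect_emails(blind)  # the free-text note still carries the address
```

The ingest-time rule handles the field it was told about and misses the copy buried in payload.note. Scanning the complete raw event is what finds it, which is the reason detection has to run before masking.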


6. Recap + Bridge

The capstone pipeline flows raw -> detect -> mask -> govern -> store -> erase, each stage a course you've done, wired by a metadata spine: detection's manifest drives masking, becomes catalog tags, scopes erasure, and shapes the audit. Data lands first in a locked bronze zone (un-scanned PII under tight control), gets detected, then masked by sensitivity tier into silver, tagged and stored in Iceberg with retention config, and is erasable end to end with proof. The architecture is the metadata flow; cut it between stages and you get disconnected demos.

Next lesson builds stage one: the ingestion layer. You'll design the bronze zone that lands raw data without leaking it: access-restricted storage, audit logging on raw access, and schema-on-read for unpredictable payloads. Together these form the controlled front door that makes detect-then-mask safe.