Course: PII Fundamentals | Duration: ~20 min | Lesson: 1 of 7

In August 2006, AOL Research published what it thought was an anonymized dataset of search queries from 650,000 users. Names were stripped. Account IDs were replaced with random numbers. The release was meant to help academic researchers study search behavior. User #4417749 was just a number.

Within days, New York Times reporters Michael Barbaro and Tom Zeller Jr. identified user #4417749 as Thelma Arnold, a 62-year-old widow from Lilburn, Georgia. How? They searched for her. Her queries included "landscapers in Lilburn, Ga," "homes sold in shadow lake subdivision gwinnett county georgia," "numb fingers," and "dog that urinates on everything." Cross-referencing the geography, the health concerns, and the personal situation painted an unmistakable portrait. AOL took the dataset down shortly after releasing it, but it had already been mirrored hundreds of times.

This wasn't a hack. Nobody cracked an encryption scheme. The data was intended to be public. The failure was a misunderstanding of what makes data identifying, a misunderstanding that data engineers still make today when they strip "the obvious stuff" and call a dataset clean. This lesson is about developing the intuition to recognize PII in all its forms, including the kinds that don't look like PII at all.

2. Concept Explanation

Direct Identifiers

Direct identifiers are data elements that, on their own, reliably identify a specific individual without needing to be combined with anything else.

Common direct identifiers:

Full name
Social Security Number (SSN) / National ID numbers
Email address
Phone number
Passport or driver's license number
Medical record number
Financial account number
Full postal address
Biometric identifiers (fingerprints, face templates; covered in depth in Lesson 2)

These are the columns your compliance team flags on day one. They're the easy case. An engineer who strips these columns from an export has done something useful, but not enough.

Indirect Identifiers

Indirect identifiers don't point to a person on their own, but they're linkable to other datasets that do. In isolation they look innocuous. In combination they're dangerous.

Common indirect identifiers:

IP address (links to a physical location and often an account)
Device ID / mobile advertising ID (IDFA, GAID)
Cookie ID / session token
User agent string (browser + OS + screen resolution)
MAC address
Employee ID (if the employee directory is accessible)
URL paths containing user-specific slugs (e.g., /users/rjtrana16/orders)

IP addresses are the canonical example. On its own, an internal address like 192.168.1.42 tells you nothing. But combined with your company's DHCP lease logs, it maps to a specific employee at a specific time. A public IP address, combined with an ISP's subscriber records, maps to a household. This is why GDPR treats IP addresses as personal data: Article 4(1) lists "online identifier" among the factors that can identify a person, and Recital 30 names IP addresses explicitly.
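The DHCP lookup described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical lease log with `ip`, `start`, `end`, and `employee` fields; real DHCP servers log in their own formats, and the employee IDs here are invented.

```python
from datetime import datetime

# Hypothetical lease log: the same internal IP is handed to two
# different employees over the course of one day.
leases = [
    {"ip": "192.168.1.42", "start": datetime(2024, 1, 15, 8, 0),
     "end": datetime(2024, 1, 15, 12, 0), "employee": "emp_1001"},
    {"ip": "192.168.1.42", "start": datetime(2024, 1, 15, 12, 0),
     "end": datetime(2024, 1, 15, 18, 0), "employee": "emp_2002"},
]

def holder_of(ip, when, lease_log):
    """Return the employee who held `ip` at time `when`, or None."""
    for lease in lease_log:
        # Lease intervals are treated as half-open: [start, end).
        if lease["ip"] == ip and lease["start"] <= when < lease["end"]:
            return lease["employee"]
    return None

morning = holder_of("192.168.1.42", datetime(2024, 1, 15, 9, 23, 11), leases)
afternoon = holder_of("192.168.1.42", datetime(2024, 1, 15, 14, 5, 33), leases)
```

The IP alone is ambiguous; the IP plus a timestamp plus the lease log is not. That is what makes it an indirect identifier.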

Quasi-Identifiers and Linkage Attacks

Quasi-identifiers are the most dangerous category for data engineers because they look like aggregate or demographic data. Each value individually describes many people. But their intersection is surprisingly narrow.

The foundational research here is Latanya Sweeney's 2000 paper "Simple Demographics Often Identify People Uniquely." Her finding: 87% of Americans can be uniquely identified using only three fields: ZIP code, date of birth, and sex. These are fields routinely published in "de-identified" datasets.
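One way to build intuition for this result is to measure how many rows in a table are unique on a given combination of fields. The sketch below is illustrative only: the function name `unique_fraction` and the toy records are assumptions, and it is not Sweeney's methodology, which used census data.

```python
from collections import Counter

def unique_fraction(records, quasi_identifiers):
    """Share of records whose quasi-identifier combination appears exactly once."""
    if not records:
        return 0.0
    # One tuple key per record, built from the chosen fields.
    keys = [tuple(r[f] for f in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)

# Hypothetical toy records, not real data.
people = [
    {"zip_code": "02139", "dob": "1972-03-14", "sex": "M"},
    {"zip_code": "02139", "dob": "1972-03-14", "sex": "F"},
    {"zip_code": "02139", "dob": "1980-07-01", "sex": "M"},
    {"zip_code": "02142", "dob": "1985-11-02", "sex": "F"},
    {"zip_code": "02142", "dob": "1985-11-02", "sex": "F"},
]

by_zip = unique_fraction(people, ["zip_code"])                # no row is unique
by_all = unique_fraction(people, ["zip_code", "dob", "sex"])  # 3 of 5 rows are unique
```

Running the same check against a candidate export, with the fields an outsider could plausibly know, gives a rough first read on how exposed it is.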

The mechanics of a linkage attack:

A linkage attack combines two or more datasets that were each anonymized independently, but whose quasi-identifiers overlap enough to re-identify individuals when joined.

Dataset A: "Anonymized" hospital discharge records
  patient_id | zip_code | dob        | sex | diagnosis_code
  A001       | 02139    | 1972-03-14 | M   | J18.9
  A002       | 02142    | 1985-11-02 | F   | K21.0

Dataset B: Voter registration records (public in many US states)
  voter_id | zip_code | dob        | sex | full_name
  V8812    | 02139    | 1972-03-14 | M   | James Callahan
  V9031    | 02142    | 1985-11-02 | F   | Maria Ramos

JOIN on (zip_code, dob, sex):
  full_name      | diagnosis_code
  James Callahan | J18.9  (Pneumonia)
  Maria Ramos    | K21.0  (GERD)

Neither dataset contained a name + diagnosis pairing. The join created one. This is exactly the attack Sweeney demonstrated against the Massachusetts Group Insurance Commission health data in 1997. She re-identified the governor's medical records from a "de-identified" public dataset.
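The join in the example above takes very little code. The following is a minimal sketch using only the standard library; the `link` helper is a hypothetical name, and the rows are the illustrative records from the tables above, not real data.

```python
hospital = [
    {"patient_id": "A001", "zip_code": "02139", "dob": "1972-03-14",
     "sex": "M", "diagnosis_code": "J18.9"},
    {"patient_id": "A002", "zip_code": "02142", "dob": "1985-11-02",
     "sex": "F", "diagnosis_code": "K21.0"},
]
voters = [
    {"voter_id": "V8812", "zip_code": "02139", "dob": "1972-03-14",
     "sex": "M", "full_name": "James Callahan"},
    {"voter_id": "V9031", "zip_code": "02142", "dob": "1985-11-02",
     "sex": "F", "full_name": "Maria Ramos"},
]

def link(records_a, records_b, keys):
    """Join two record lists on shared quasi-identifier fields."""
    # Index dataset B by its quasi-identifier tuple.
    index = {}
    for row in records_b:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    linked = []
    for row in records_a:
        matches = index.get(tuple(row[k] for k in keys), [])
        # Keep only unambiguous matches: exactly one candidate in B.
        if len(matches) == 1:
            linked.append({**matches[0], **row})
    return linked

reidentified = [
    {"full_name": r["full_name"], "diagnosis_code": r["diagnosis_code"]}
    for r in link(hospital, voters, ["zip_code", "dob", "sex"])
]
```

Nothing here requires special access or tooling, which is the point: if both tables exist, the attack is a dictionary lookup.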

Why the Boundary Is Fuzzy

PII status is context-dependent, not a fixed property of a column. A name alone on a form is PII. A name in a published book author bio is generally not handled as such, although some regimes (GDPR among them) still count publicly available data as personal data. An email address in an HR system is PII. An email address on a public speaker page is generally not. The key factors:

Linkability: Can this data be joined to something that identifies a person?
Reasonable identifiability: Would a reasonable effort (not nation-state resources) succeed?
Regulatory jurisdiction: GDPR's definition is broader than CCPA's, which is broader than HIPAA's. The strictest applicable law governs.

The engineering implication: you cannot classify a column in isolation. You need to understand the data ecosystem around it.

3. Worked Example

Here's how a linkage attack plays out against a realistic analytics schema. Your company runs an e-commerce platform. You've published a "customer behavior" dataset to a third-party analytics firm, having removed the obvious PII.

What you shared (export_customer_behavior):

  event_ts            | zip_code | age | gender | device_type | purchase_amount | category
  2024-01-15 09:23:11 | 94107    | 34  | F      | mobile      | 87.50           | electronics
  2024-01-15 14:05:33 | 94107    | 34  | F      | mobile      | 23.00           | books
  2024-01-16 11:18:44 | 10001    | 52  | M      | desktop     | 412.00          | furniture

You removed: name, email, user_id, phone, address.

What the analytics firm already has (their ad targeting database):

  cookie_id | zip_code | age_range | gender | device_type | matched_email
  ck_a9f3b2 | 94107    | 30-35     | F      | mobile      | sarah.chen@gmail.com
  ck_d71c44 | 10001    | 50-55     | M      | desktop     | bob.martinez@yahoo.com

The join: zip_code + age (bucketed) + gender + device_type produces a near-perfect match. Your "anonymized" purchase history is now linked to real email addresses.

The columns you kept (zip, age, gender, device_type) are individually meaningless. Together, they're a fingerprint.
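To make the bucketed join concrete, here is a small sketch of how the analytics firm could run it. The helper names (`parse_age_range`, `match_events`) are hypothetical, the rows are the sample rows from this example, and the age ranges are assumed to be inclusive on both ends.

```python
events = [
    {"event_ts": "2024-01-15 09:23:11", "zip_code": "94107", "age": 34,
     "gender": "F", "device_type": "mobile",
     "purchase_amount": 87.50, "category": "electronics"},
    {"event_ts": "2024-01-15 14:05:33", "zip_code": "94107", "age": 34,
     "gender": "F", "device_type": "mobile",
     "purchase_amount": 23.00, "category": "books"},
    {"event_ts": "2024-01-16 11:18:44", "zip_code": "10001", "age": 52,
     "gender": "M", "device_type": "desktop",
     "purchase_amount": 412.00, "category": "furniture"},
]
ad_profiles = [
    {"cookie_id": "ck_a9f3b2", "zip_code": "94107", "age_range": "30-35",
     "gender": "F", "device_type": "mobile",
     "matched_email": "sarah.chen@gmail.com"},
    {"cookie_id": "ck_d71c44", "zip_code": "10001", "age_range": "50-55",
     "gender": "M", "device_type": "desktop",
     "matched_email": "bob.martinez@yahoo.com"},
]

def parse_age_range(text):
    """Turn '30-35' into (30, 35); bounds assumed inclusive."""
    low, high = text.split("-")
    return int(low), int(high)

def match_events(event_rows, profiles):
    """Attach an email to each event that matches exactly one ad profile."""
    matched = []
    for event in event_rows:
        candidates = []
        for profile in profiles:
            low, high = parse_age_range(profile["age_range"])
            if (profile["zip_code"] == event["zip_code"]
                    and profile["gender"] == event["gender"]
                    and profile["device_type"] == event["device_type"]
                    and low <= event["age"] <= high):
                candidates.append(profile)
        # Ambiguous events (zero or several candidates) are skipped.
        if len(candidates) == 1:
            matched.append((candidates[0]["matched_email"],
                            event["category"], event["purchase_amount"]))
    return matched

purchases_by_email = match_events(events, ad_profiles)
```

In a real population several profiles would often share a ZIP, age band, gender, and device type, so the match rate depends on how many people fall into each combination. Checking that group size before export is the practical defense.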

Aha: PII isn't a column type. It's a join key in waiting. The question isn't "does this field name a person?" It's "does the rest of the world hold a table I could join this to?" That reframing is what separates an engineer who ships safe exports from one who becomes the next AOL incident.

4. Your Turn

Exercise: Two situations to think through.

Situation 1: Look at the following list of columns from a ride-sharing app's event log. For each column, label it as Direct, Indirect, Quasi, or Not PII (with justification).

columns: user_id, pickup_lat, pickup_lng, dropoff_lat, dropoff_lng,
         trip_duration_seconds, fare_amount, payment_type, driver_rating,
         device_os, timestamp, surge_multiplier, promo_code_used

Situation 2: A teammate proposes publishing a dataset with {city, age_bucket (5-year), occupation_category, annual_income_bucket} as "fully anonymized." What's your response? What additional information would you need to assess re-identification risk?

5. Real-World Application

Google's Street View cars famously collected Wi-Fi payload data in 37 countries between 2007 and 2010, in addition to street-level imagery. The payload fragments included email snippets, passwords, and URLs. Google characterized this as "fragmentary" and "unusable" data. Regulators in Germany, France, and the UK disagreed. Even fragments are PII under European law if they're linkable to individuals. Google paid over $7M in fines across jurisdictions. The engineering team had correctly stripped the car identifier from the captures; they hadn't considered that the payload data itself was the identifier.

In healthcare, the Safe Harbor de-identification method under HIPAA requires removing all 18 specified identifiers (covered in Lesson 2). But researchers at Harvard demonstrated in 2013 that even after Safe Harbor de-identification, a determined adversary with access to public records could re-identify up to 14% of records in certain datasets. This is why HIPAA also allows "Expert Determination": a statistical assessment by a qualified expert that re-identification risk is "very small." The standard is not zero; it's defensible risk management.

The Netflix Prize dataset fell to an attack in the same spirit as the AOL re-identification, published in 2008. Narayanan and Shmatikov showed that the dataset (where Netflix had replaced usernames with random IDs and released movie ratings) could be de-anonymized by cross-referencing it with public IMDb ratings. A user who had rated even a handful of movies publicly on IMDb could be identified with high confidence by matching those ratings and their approximate dates against the Netflix records. Netflix settled a class-action lawsuit and canceled its second prize competition.
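The matching idea can be illustrated with a deliberately simplified toy. This is not the algorithm from the Narayanan and Shmatikov paper, which uses a weighted similarity score and tolerates errors in the ratings; the function names, the 14-day window, the margin rule, and all data below are assumptions for illustration.

```python
from datetime import date

def match_score(public_ratings, anon_ratings, max_days=14):
    """Count public ratings that reappear in an anonymized record with the
    same star value and a rating date within `max_days`."""
    score = 0
    for movie, (stars, when) in public_ratings.items():
        if movie in anon_ratings:
            anon_stars, anon_when = anon_ratings[movie]
            if anon_stars == stars and abs((anon_when - when).days) <= max_days:
                score += 1
    return score

def best_match(public_ratings, anon_dataset, min_margin=2):
    """Return the anonymized ID that best fits the public ratings, or None
    if the top candidate does not beat the runner-up by `min_margin`."""
    scored = sorted(
        ((match_score(public_ratings, ratings), uid)
         for uid, ratings in anon_dataset.items()),
        reverse=True,
    )
    if not scored:
        return None
    top_score, top_id = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0
    return top_id if top_score - runner_up >= min_margin else None

# Hypothetical anonymized release: random user IDs, movie -> (stars, date).
anon_dataset = {
    "u1": {"Movie A": (5, date(2005, 3, 1)),
           "Movie B": (2, date(2005, 3, 4)),
           "Movie C": (4, date(2005, 4, 10))},
    "u2": {"Movie A": (5, date(2005, 6, 1)),
           "Movie D": (3, date(2005, 6, 2))},
}
# Hypothetical public profile: three ratings posted under a real name.
public_profile = {
    "Movie A": (5, date(2005, 3, 3)),
    "Movie B": (2, date(2005, 3, 5)),
    "Movie C": (4, date(2005, 4, 12)),
}

candidate = best_match(public_profile, anon_dataset)
```

The margin check is there to avoid false positives: a single matching rating is weak evidence, but a few matching ratings with nearby dates are rarely shared by two people.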

6. Recap + Bridge

Direct identifiers (SSN, email, name) are the obvious PII that everyone knows to protect. Indirect identifiers (IP, device ID) require context to be dangerous but are routinely classified as personal data under modern privacy law. Quasi-identifiers (seemingly innocuous demographic fields) are the hidden threat: individually harmless, collectively a fingerprint. The AOL, Netflix, and Sweeney examples all share the same failure mode: assuming that removing obvious identifiers makes data safe, without considering what the remaining fields can be joined with. The key takeaway is that PII is a property of a dataset in its ecosystem, not a property of a column in isolation. In Lesson 2, we'll go deeper into the categories of sensitive PII (PHI, PCI, biometrics, and children's data) where the regulatory stakes are highest and the definitions are most precise.