Course: Anonymization Deep Dive | Duration: ~20 min | Lesson: 1 of 7

In 2006, Netflix published about 100 million movie ratings from roughly 480,000 subscribers and offered a million dollars to anyone who could improve its recommendation engine. The company did the responsible thing first: it stripped every name and every email and replaced each subscriber with a random number. No identifiers left. A clean, anonymous dataset, or so the press release said.

Two researchers at UT Austin, Arvind Narayanan and Vitaly Shmatikov, took that "anonymous" data and matched it against public movie ratings people had posted on IMDb under their real names. The insight was brutal in its simplicity: the pattern of what you rate, and roughly when, is almost as unique as a fingerprint. Knowing a person rated a handful of specific films around specific dates was enough to find their row. They re-identified subscribers and, with them, revealed apparent political leanings and sexual orientation from viewing history.

Netflix had removed every identifier and still leaked identities. This course exists because masking, the subject of the entire last course, is not the same as anonymization, and the gap between them is where lawsuits live.

2. Concept Explanation

The last course gave you techniques to hide or transform values: hashing, tokenization, FPE, masking, generalization. Each protects a field. Re-identification attacks don't go after fields. They go after the combination, and they bring outside information you didn't account for.

The mechanism: linkage attacks

A linkage attack joins your "anonymized" dataset to an external dataset that shares some columns, using the shared columns as a key. The external data carries the identity; your data carries the sensitive attribute; the join marries them.

Netflix: shared columns were (movie, rating, approximate date). External data was IMDb. Identity came from IMDb usernames.
Sweeney's Massachusetts case (you met it last course): shared columns were (birth date, ZIP, gender). External data was a $20 voter roll. Identity came from voter names.
AOL 2006: AOL released 20 million "anonymized" search queries with user IDs replaced by numbers. But people search for their own name, their own address, their own medical conditions. The New York Times identified a specific 62-year-old widow in Georgia from her queries alone. The queries were the identifier.
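The Netflix-style match is fuzzier than an exact join, so a small sketch may help. This is a deliberately simplified illustration, not the researchers' actual scoring algorithm: the subscriber numbers, movies, dates, and the tolerance parameters `max_days` and `max_rating_gap` are all hypothetical.

```python
from datetime import date

# Hypothetical "anonymized" release: subscriber number -> {movie: (rating, date rated)}
released = {
    101: {"Brazil": (5, date(2005, 3, 1)), "Heat": (4, date(2005, 3, 9)), "Alien": (5, date(2005, 4, 2))},
    102: {"Brazil": (2, date(2005, 6, 20)), "Heat": (4, date(2005, 1, 5)), "Clue": (3, date(2005, 2, 11))},
    103: {"Alien": (5, date(2005, 4, 30)), "Clue": (4, date(2005, 5, 15)), "Big": (3, date(2005, 5, 16))},
}

# Hypothetical auxiliary data: two ratings one person posted publicly under a real name.
public_profile = {"Brazil": (5, date(2005, 3, 3)), "Alien": (4, date(2005, 4, 1))}


def match_score(aux, record, max_days=14, max_rating_gap=1):
    """Count auxiliary ratings that appear in the record with a similar rating and date."""
    score = 0
    for movie, (rating, when) in aux.items():
        if movie not in record:
            continue
        rec_rating, rec_when = record[movie]
        close_in_time = abs((when - rec_when).days) <= max_days
        close_in_rating = abs(rating - rec_rating) <= max_rating_gap
        if close_in_time and close_in_rating:
            score += 1
    return score


# Score every released row against the public profile and pick the best candidate.
scores = {sub_id: match_score(public_profile, rec) for sub_id, rec in released.items()}
best = max(scores, key=scores.get)
print(scores, "best candidate:", best)
```

Even with only two approximate data points, one released row stands out from the rest. With real data the attacker would want a clear margin between the best and second-best score before claiming a match.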

Why removing identifiers isn't enough

Three ideas make masked data re-identifiable:

Quasi-identifiers. Fields that aren't identifying alone but are in combination (birth date + ZIP + gender). You can't just drop "the identifiers" because the dangerous ones don't look like identifiers.

High-dimensional uniqueness. The more columns describing a person, the more unique they become. With enough attributes (movies rated, places visited, purchases made), almost everyone is unique. A 2013 study of mobile-phone location traces (de Montjoye et al.) found that four time-stamped location points were enough to uniquely identify 95% of the 1.5 million people in the dataset. Sparse, wide behavioral data is essentially never anonymous.

Auxiliary information. The attacker is not limited to your dataset. They have voter rolls, IMDb, social media, other breaches, public records. You cannot anonymize against a single dataset in isolation, because the attacker will bring a second one you never saw.
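To make the first two ideas concrete, here is a minimal sketch that measures what fraction of rows are unique as you key on more columns. The rows and column choices are invented for illustration, and `unique_fraction` is a hypothetical helper, not a function from any library.

```python
from collections import Counter

# Hypothetical rows: (birth_year, zip3, gender, favorite_category)
rows = [
    (1991, "941", "F", "maternity"),
    (1991, "941", "F", "books"),
    (1991, "941", "M", "electronics"),
    (1984, "100", "M", "electronics"),
    (1984, "100", "M", "garden"),
    (1984, "100", "F", "books"),
]


def unique_fraction(rows, column_indexes):
    """Fraction of rows whose values on the chosen columns appear exactly once."""
    keys = [tuple(row[i] for i in column_indexes) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


# Key on the first column, then the first two, and so on.
by_width = {n: unique_fraction(rows, range(n)) for n in range(1, 5)}
for n, fraction in by_width.items():
    print(f"first {n} column(s): {fraction:.0%} of rows unique")
```

In this toy table no single column singles anyone out, yet all four together make every row unique. Real datasets tend to follow the same curve, only with far more columns to combine.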

The definition that matters

Anonymization is the claim that re-identification is not reasonably possible for anyone, using any available auxiliary data, now or later. That's a far stronger and more fragile claim than "I removed the names." Because you can't enumerate every external dataset an attacker might use, "anonymized by removing identifiers" is not a defensible position. The rest of this course is about replacing that hope with formal models (k-anonymity, l-diversity, differential privacy) that make a mathematical statement about re-identification risk, instead of a vibe.

3. Worked Example

Walk through a linkage attack on TheWorldShop the way an attacker would, so you feel why field-level masking doesn't stop it.

TheWorldShop releases an "anonymized" loyalty dataset for a research partner. They removed name, email, user_id. What's left:

birth_date	zip	gender	top_category	total_spend
1991-03-07	94107	F	maternity	4,210
1984-11-22	10001	M	electronics	1,090
1991-03-07	94107	F	...	...

Now the attacker brings a voter roll (public, cheap) with name, birth_date, zip, gender:

-- The attacker's join. Shared quasi-identifiers are the key.
SELECT v.name, w.top_category, w.total_spend
FROM   worldshop_anon w
JOIN   voter_roll v
  ON   v.birth_date = w.birth_date
 AND   v.zip        = w.zip
 AND   v.gender     = w.gender;

For any row whose (birth_date, zip, gender) combination matches exactly one voter, this join returns exactly one name, and the attacker now holds the link name -> (maternity, $4,210 spent). The "anonymized" dataset just told a stranger which named individuals are likely pregnant. TheWorldShop removed three identifier columns and shipped a re-identification kit in the three they kept.
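If you want to watch the join run, the sketch below loads a toy version of both tables into an in-memory SQLite database. The voter names and the extra voter rows are made up, and only the two fully specified rows from the table above are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE worldshop_anon (
        birth_date TEXT, zip TEXT, gender TEXT, top_category TEXT, total_spend INTEGER
    );
    CREATE TABLE voter_roll (name TEXT, birth_date TEXT, zip TEXT, gender TEXT);

    INSERT INTO worldshop_anon VALUES
        ('1991-03-07', '94107', 'F', 'maternity',   4210),
        ('1984-11-22', '10001', 'M', 'electronics', 1090);

    -- Hypothetical voters: one matches the first row, two match the second.
    INSERT INTO voter_roll VALUES
        ('Voter A', '1991-03-07', '94107', 'F'),
        ('Voter B', '1984-11-22', '10001', 'M'),
        ('Voter C', '1984-11-22', '10001', 'M'),
        ('Voter D', '1975-06-30', '94107', 'F');
""")

# The attacker's join from the lesson, keyed on the shared quasi-identifiers.
matches = conn.execute("""
    SELECT w.top_category, v.name
    FROM   worldshop_anon w
    JOIN   voter_roll v
      ON   v.birth_date = w.birth_date
     AND   v.zip        = w.zip
     AND   v.gender     = w.gender
    ORDER  BY w.top_category, v.name
""").fetchall()

# Group candidate names by released row (top_category stands in for the row here).
candidates = {}
for category, name in matches:
    candidates.setdefault(category, []).append(name)
print(candidates)
```

The maternity row resolves to a single name, while the electronics row comes back with two candidates. That difference between one match and several is exactly what the attacker is hunting for.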

Notice what would have helped: if (birth_date, zip, gender) were generalized so that at least k people shared each combination, the join would return at least k candidate names per row, not one, and the attacker couldn't tell which. That's the seed of k-anonymity, the next lesson.
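A rough sketch of that idea: coarsen each quasi-identifier and check the size of the smallest group before and after. The people listed and the specific generalization (birth date to year, ZIP to its first three digits) are assumptions chosen for illustration; the next lesson covers how to choose them properly.

```python
from collections import Counter

# Hypothetical quasi-identifier tuples: (birth_date, zip, gender)
people = [
    ("1991-03-07", "94107", "F"),
    ("1991-08-19", "94110", "F"),
    ("1991-12-02", "94103", "F"),
    ("1984-11-22", "10001", "M"),
    ("1984-02-14", "10003", "M"),
    ("1984-07-09", "10011", "M"),
]


def generalize(row):
    """Coarsen birth date to birth year and ZIP to its first three digits."""
    birth_date, zip_code, gender = row
    return (birth_date[:4], zip_code[:3], gender)


def smallest_group(rows):
    """Size of the smallest set of rows sharing identical quasi-identifiers."""
    return min(Counter(rows).values())


k_raw = smallest_group(people)
k_generalized = smallest_group([generalize(row) for row in people])
print("smallest group before:", k_raw, "after:", k_generalized)
```

Before generalization every person is alone in their group; afterwards the smallest group has three members, so a join on these columns can no longer point at one person.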

Aha: The attacker never breaks your masking; they go around it. Every technique from the last course protects a column; a linkage attack ignores columns and exploits the combination plus a dataset you don't control. That's why "we removed the identifiers" is a sentence about your data, while "this can't be re-identified" is a claim about every dataset on Earth that might join to it. You can't verify the second by looking at the first. That gap is the entire reason formal anonymization exists.

4. Your Turn

Exercise: A health-tech partner sends TheWorldShop a "fully anonymized" wearable-fitness dataset to analyze. They removed name, email, and device ID. Remaining columns: birth_year, zip3 (first 3 digits), gender, home_gym_name, daily_step_series (365 daily step counts), resting_hr_series (365 daily values).

Identify which columns make this re-identifiable despite the removed identifiers, and name the two mechanisms (from the concept section) at play.
Describe a concrete linkage attack: what auxiliary data would an attacker use, and what would they recover?
Explain why generalizing birth_year/zip3/gender further would not fully solve this dataset's problem.

5. Real-World Application

The Netflix and AOL cases didn't just embarrass two companies; they rewrote policy. The Netflix Prize sequel was cancelled after an FTC complaint and a lawsuit (a closeted individual alleged the data could out her). AOL's release led to the CTO's resignation and to firings, and became the textbook example of why "replace the ID with a number" is not anonymization. Both are routinely cited in privacy policy debates and in the academic case for differential privacy.

The pattern repeats whenever an organization treats masking as anonymization. New York City released "anonymized" taxi trip data in 2014 with poorly hashed medallion numbers; researchers de-anonymized drivers and, by cross-referencing paparazzi photos with timestamps, tracked specific celebrities' trips and inferred home addresses of riders. Australia released "de-identified" medical billing records in 2016 that researchers re-identified, forcing a withdrawal. The common thread: each organization protected fields, not combinations, and ignored auxiliary data.
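Why does a hash fail here? When the set of possible inputs is small, an attacker can hash every one of them and build a reverse lookup. The sketch below assumes an unsalted MD5 hash (which is how the taxi release is commonly described) and a hypothetical medallion format of digit, letter, digit, digit; the real formats and the example value "7B42" are not taken from the dataset.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(text):
    """Unsalted MD5 of an ASCII string, as a hex digest."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def all_medallions():
    """Enumerate every value of the assumed format: digit, letter, digit, digit."""
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        yield f"{d1}{letter}{d2}{d3}"


# Hash the entire keyspace once: 10 * 26 * 10 * 10 = 26,000 entries.
reverse_lookup = {md5_hex(m): m for m in all_medallions()}

# A "hashed" identifier as it might appear in a release (hypothetical value).
published = md5_hex("7B42")
recovered = reverse_lookup[published]
print(len(reverse_lookup), "candidates hashed; recovered:", recovered)
```

The table builds almost instantly on a laptop. A hash only hides a value when the attacker cannot enumerate the inputs, which is never true for short structured identifiers.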

This is why regulators (the EU's Article 29 Working Party, in its 2014 opinion on anonymisation techniques, in particular) explicitly reject "remove identifiers" as sufficient and evaluate anonymization against linkability, inference, and singling-out. It's also why serious data-sharing programs (the US Census, Apple, Google) moved to differential privacy: it is the leading framework whose privacy guarantee holds regardless of what auxiliary data the attacker has. The lessons ahead build to exactly that.

6. Recap + Bridge

Masking protects fields; re-identification attacks exploit the combination of quasi-identifiers plus auxiliary data the masker never saw. Netflix, AOL, the Massachusetts hospital release, and NYC taxis all removed "the identifiers" and still leaked identities, because high-dimensional and quasi-identifying data stays unique. "Anonymized" is a claim about every possible joinable dataset, which is why it can't be verified by inspecting your own.

The fix is to stop hoping and start guaranteeing. The next lesson introduces the first formal model: k-anonymity, which makes the precise promise that every individual is indistinguishable from at least k-1 others on the quasi-identifiers, so the voter-roll join returns a crowd instead of a name. You'll learn exactly what it guarantees, how generalization achieves it, and, just as important, the attacks it still doesn't stop.