The Anonymization That Wasn't

P1 · hard · 10 min · Incident Response

We published a hospital discharge dataset for research, anonymized to k=5 - every record indistinguishable from at least four others on age, ZIP and admission month. A health journalist just emailed our privacy office: they named a specific patient and their diagnosis, using only the public file and public information. The k-anonymity math checks out. The re-identification still happened.

Re-identifiable patients: 1,420 → 0 patients
Classes failing l-diversity: 37 → 0 classes

The incident

It's 14:00 and the privacy office just forwarded an email that every data team dreads. A health journalist says they have re-identified a named individual in the discharge dataset we published for research last month - correctly stating that person's diagnosis - and they intend to run the story when the embargo lifts at 17:00. The dataset was anonymized to k=5: we can re-run the check right now and every record really is indistinguishable from at least four others on the quasi-identifiers we generalized (age band, ZIP prefix, admission month). The generalizer did its job; suppression is within budget; there are no tiny outlier groups. And yet the journalist used nothing but the public file and ordinary public information to put a name and a condition together. So the headline metric - k=5 - is genuinely true, and the dataset is genuinely re-identifiable, at the same time. We need to understand how both can hold, and reduce the re-identification risk to zero before 17:00.
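How k=5 can hold while a diagnosis still leaks is easier to see with a small check. The following is a minimal sketch, not the release pipeline's actual gate: the column names (`age_band`, `zip_prefix`, `admit_month`, `diagnosis`), the helper names and the toy records are all assumptions made for illustration. It groups records into equivalence classes on the quasi-identifiers, reports k as the size of the smallest class, and lists the classes that fail distinct l-diversity for l=2, meaning every member of the class shares one diagnosis.

```python
from collections import defaultdict

# Hypothetical column names; the real release schema may differ.
QUASI_IDENTIFIERS = ("age_band", "zip_prefix", "admit_month")
SENSITIVE = "diagnosis"


def equivalence_classes(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        classes[key].append(record)
    return dict(classes)


def k_of(classes):
    """k-anonymity level: the size of the smallest equivalence class."""
    return min(len(members) for members in classes.values())


def classes_failing_l_diversity(classes, l=2, sensitive=SENSITIVE):
    """Keys of classes holding fewer than l distinct sensitive values."""
    return [
        key
        for key, members in classes.items()
        if len({m[sensitive] for m in members}) < l
    ]


def make(age_band, zip_prefix, admit_month, diagnoses):
    """Build toy records that share one set of quasi-identifier values."""
    return [
        {"age_band": age_band, "zip_prefix": zip_prefix,
         "admit_month": admit_month, "diagnosis": d}
        for d in diagnoses
    ]


# Invented toy data: two classes of five records each.
records = (
    make("40-49", "941**", "2024-03",
         ["asthma", "fracture", "asthma", "sepsis", "fracture"])
    + make("60-69", "100**", "2024-03", ["diabetes"] * 5)
)

classes = equivalence_classes(records)
k = k_of(classes)
failing = classes_failing_l_diversity(classes, l=2)
print("k =", k)
print("classes failing l=2:", failing)
```

In this toy data both classes pass k=5, yet anyone known to fall in the second class has their diagnosis revealed without any single row being picked out. That is the gap between the k-anonymity check and the "classes failing l-diversity" metric.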

Symptoms on the table

  • a journalist named a patient and their diagnosis using only the public file and public info
  • the dataset verifiably satisfies k=5 on age band, ZIP prefix and admission month
  • the generalizer ran cleanly and suppression is within budget; there are no classes smaller than 5
  • the re-identification used no breach and no internal access - public data only
  • the re-id risk gate signed the release off with no warning
  • an earlier Q1 release of the same cohort is already public
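
The last symptom suggests one hypothesis worth checking, though nothing above confirms it is the cause. When two releases cover the same cohort, a person known to appear in both must have a diagnosis that occurs in their equivalence class in each release. The sketch below uses invented class contents and a hypothetical helper, `candidate_diagnoses`, and assumes the person's diagnosis is the same in both releases. It shows how intersecting the candidate sets can narrow the possibilities even when each class looks diverse on its own.

```python
def candidate_diagnoses(*class_diagnoses):
    """Intersect the sensitive values seen in the target's class in each release."""
    candidates = set(class_diagnoses[0])
    for diagnoses in class_diagnoses[1:]:
        candidates &= set(diagnoses)
    return candidates


# Invented data: the target's k=5 class in each release.
q1_class = ["asthma", "sepsis", "fracture", "asthma", "diabetes"]
current_class = ["diabetes", "migraine", "pneumonia", "migraine", "stroke"]

remaining = candidate_diagnoses(q1_class, current_class)
print(remaining)  # {'diabetes'}
```

Each toy class holds five records and four distinct diagnoses, so each release passes on its own. Only the combination leaves a single candidate.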

Systems on the board

The real components in play for this incident — the surface you investigate when the clock starts.

  • Discharge Records: raw EHR cohort
  • QI Generalizer: age/ZIP/date -> k=5
  • Equivalence Classes: k-anon groups
  • Admission Notes: free-text column
  • Q1 Public Release: earlier release, same cohort
  • Published Dataset: assembled k=5 release
  • Research Portal: download endpoint

What you'll practice

This is a timed, hands-on Incident Response challenge. You diagnose the symptom, trace it to a root cause across real components, and ship a fix before the clock runs out. It is the same loop you run on call, without the production blast radius.
