The Anonymization That Wasn't
We published a hospital discharge dataset for research, anonymized to k=5 - every record indistinguishable from at least four others on age band, ZIP prefix and admission month. A health journalist just emailed our privacy office: they named a specific patient and that patient's diagnosis, using only the public file and public information. The k-anonymity math checks out. The re-identification still happened.
The incident
It's 14:00 and the privacy office just forwarded the email every data team dreads. A health journalist says they have re-identified a named individual in the discharge dataset we published for research last month - correctly stating that person's diagnosis - and they intend to run the story when the embargo lifts at 17:00. The dataset was anonymized to k=5: we can re-run the check right now and every record really is indistinguishable from at least four others on the quasi-identifiers we generalized (age band, ZIP prefix, admission month). The generalizer did its job; suppression is within budget; there are no tiny outlier groups. And yet the journalist used nothing but the public file and ordinary public information to put a name and a condition together. So the headline metric - k=5 - is genuinely true, and the dataset is genuinely re-identifiable, at the same time. We need to understand how both can hold, and close the re-identification path before 17:00.
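One way both statements can hold at once is that k-anonymity only counts rows per quasi-identifier group and says nothing about the diagnoses inside each group. The sketch below is a minimal illustration of that gap, not the release pipeline: the rows, field order, and function names are all hypothetical. It computes k the way the check describes, then counts distinct diagnoses per group.

```python
from collections import defaultdict

# Hypothetical toy rows: (age_band, zip_prefix, admission_month, diagnosis).
ROWS = [
    ("40-49", "021", "03", "asthma"),
    ("40-49", "021", "03", "fracture"),
    ("40-49", "021", "03", "asthma"),
    ("40-49", "021", "03", "pneumonia"),
    ("40-49", "021", "03", "migraine"),
    ("60-69", "946", "03", "diabetes"),
    ("60-69", "946", "03", "diabetes"),
    ("60-69", "946", "03", "diabetes"),
    ("60-69", "946", "03", "diabetes"),
    ("60-69", "946", "03", "diabetes"),
]


def classes(rows):
    """Group diagnoses by the quasi-identifier tuple (the first three fields)."""
    groups = defaultdict(list)
    for *quasi, diagnosis in rows:
        groups[tuple(quasi)].append(diagnosis)
    return groups


def k_value(rows):
    """Size of the smallest equivalence class: the k in k-anonymity."""
    return min(len(diagnoses) for diagnoses in classes(rows).values())


def min_distinct_diagnoses(rows):
    """Fewest distinct diagnoses found in any single equivalence class."""
    return min(len(set(diagnoses)) for diagnoses in classes(rows).values())


print(k_value(ROWS))                 # 5
print(min_distinct_diagnoses(ROWS))  # 1
```

In this toy file k is 5, yet anyone known to be in the second group has their diagnosis disclosed without being singled out as a row. Whether that is what happened here is for the investigation to establish.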
Symptoms on the table
- a journalist named a patient and their diagnosis using only the public file and public info
- the dataset verifiably satisfies k=5 on age band, ZIP prefix and admission month
- generalization did its job and suppression is within budget; there are no equivalence classes smaller than 5
- the re-identification used no breach and no internal access - public data only
- the re-id risk gate signed the release off with no warning
- an earlier Q1 release of the same cohort is already public
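The last symptom suggests one hypothesis worth testing: two releases of the same cohort can each satisfy k=5 and still narrow a target down when combined. The sketch below assumes, purely for illustration, that the two releases used different generalization boundaries, that the target appears in both, and that the diagnosis is the same in both. The release contents, class keys, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical class keys the target falls into in each release.
CURRENT_KEY = ("40-49", "021", "2024-03")
Q1_KEY = ("45-54", "0213", "2024-Q1")

# Hypothetical releases: each row is (equivalence_class_key, diagnosis).
CURRENT_RELEASE = [
    (CURRENT_KEY, d) for d in ("asthma", "copd", "fracture", "gout", "lupus")
]
Q1_RELEASE = [
    (Q1_KEY, d)
    for d in ("appendicitis", "lupus", "migraine", "pneumonia", "sepsis")
]


def smallest_class(release):
    """k for one release: the size of its smallest equivalence class."""
    return min(Counter(key for key, _ in release).values())


def candidates(release, key):
    """Diagnoses an outsider cannot rule out for someone in the given class."""
    return {diagnosis for k, diagnosis in release if k == key}


# Each release passes k=5 when checked on its own.
print(smallest_class(CURRENT_RELEASE), smallest_class(Q1_RELEASE))  # 5 5

# Intersecting the two candidate sets leaves a single diagnosis.
survivors = candidates(CURRENT_RELEASE, CURRENT_KEY) & candidates(Q1_RELEASE, Q1_KEY)
print(survivors)  # {'lupus'}
```

A gate that evaluates each release in isolation would pass both files here, which would be consistent with a sign-off that raised no warning.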
Systems on the board
The real components in play for this incident — the surface you investigate when the clock starts.
What you'll practice
This is a timed, hands-on Incident Response challenge. You diagnose the symptom, trace it to a root cause across real components, and ship a fix before the clock runs out: the same loop you run on call, without the production blast radius.