Stop Hand-Writing PII Masking Policies

There's a particular kind of false confidence that comes from a column named email_masked. The pipeline ran, the values look scrambled, the ticket is closed. And then someone with a laptop and a rainbow table turns the whole column back into email addresses in an afternoon, because the "masking" was an unsalted SHA-256 and there are only so many email addresses in the world.

Masking PII is rarely a SQL problem. The CASE WHEN role IN (...) THEN ... ELSE mask(...) skeleton is the easy 10%. The hard 90% is four decisions that have nothing to do with syntax:

Which columns are even sensitive - and at what tier.
Which technique fits each column, given how it'll be used.
The gotchas that quietly break the guarantee you think you have.
Which warehouse dialect you target, because they're all different.

Get those right and the DDL writes itself. Get them wrong and you ship re-identifiable data with a reassuring column name. This post walks all four - and the PII Masking Policy Generator does the mechanical parts (detection, preview, dialect-specific DDL) entirely in your browser, so nothing you paste ever leaves the page.

Step 1: finding PII is a tiering problem, not a yes/no

The instinct is to look for the obvious fields - email, ssn, phone - mask those, and move on. But PII isn't binary, and the dangerous columns are usually the ones nobody flagged.

What matters is the sensitivity tier, because it drives how aggressively you treat the column:

Direct identifiers - SSN, passport, email, full name. Identify a person on their own. Mask hard.
Quasi-identifiers - ZIP, birth date, gender, job title. Harmless alone, but three of them together re-identify the majority of people. This is the category teams miss, because no single column looks scary.
Sensitive attributes - health condition, salary, religion. Not identifying, but the thing an attacker wants linked to an identity.

The re-identification risk lives in the quasi-identifiers, and you can't see it column by column - it's a property of the combination. A masking pass that nukes name and ssn but leaves birth_date + zip + gender in the clear has anonymized nothing. The generator's detection pass flags likely-PII columns with a tier precisely so the quasi-identifiers don't slip through as "not really PII."

Step 2: pick the technique for how the column will be used

There are four broad techniques, and choosing between them is entirely about what the masked column still needs to do.

Technique	What it does	Keeps joinable?	Reversible?	Use when
Hashing	One-way digest (salted)	Yes - same input → same hash	No	You need to join/group by identity but never read it
Tokenization	Swap value for a random token, mapping in a vault	Yes	Yes (with vault)	You need reversibility for authorized re-identification
Partial redaction	Show some, hide the rest (`****1234`)	No	No	Humans need a recognizable hint (last-4 of a card)
Generalization	Bucket into ranges (age `34` → `30–40`)	Aggregates only	No	Analytics needs the shape, not the individual

The mistake is reaching for redaction by reflex because it "looks masked." If analysts need to count distinct customers, redaction breaks the join and a salted hash is what you wanted. If a fraud team occasionally needs the real value back, only tokenization gives you that without storing the plaintext. The technique follows the use case - pick it backwards from what the column has to support.

Step 3: the gotchas that quietly undo everything

This is the part that separates masking that works from masking that looks like it works. Every technique has a trap, and the trap is silent.

!Warning

Hash without salt = a lookup table. An unsalted hash is deterministic and public-algorithm: an attacker precomputes the hash of every plausible email or every nine-digit SSN once, and your "masked" column becomes a join key back to identity. Always salt, and salt per-column (ideally per-tenant) so one cracked table doesn't unlock the rest. A salted hash is a real control; an unsalted one is decoration.

The others, briefly:

Tokenization is only as separated as its vault. The token-to-value mapping has to live somewhere the people querying the masked table cannot reach - different database, different access boundary. Store the vault next to the tokens and you've built an elaborate way to write the data twice.
Generalization has a granularity floor. Bucketing age into 30–40 is fine; bucketing into 34–35 re-identifies. The buckets must be coarse enough that many people share each one - that's the entire point.
Redaction leaks through the part you keep. Last-4 of a card is fine. But a redacted ZIP that keeps five digits, or an email that keeps the full domain on a tiny company, can re-identify on its own.

And the compliance-specific one people trip on constantly:

iNote

HIPAA Safe Harbor and the ZIP rule. Under the Safe Harbor de-identification standard, you generally truncate ZIP codes to the first three digits - and even then, if a three-digit ZIP region contains fewer than 20,000 people, those digits must be changed to 000. "I masked the last two digits of the ZIP" is not de-identification; it's a common and confident way to stay identifiable. Rules like this are why a generator that knows the standard beats hand-rolled SUBSTRING every time.

Step 4: the dialect tax

Once the decisions are made, you still have to write the policy in your warehouse - and they share almost no syntax. The same role-conditional masking looks completely different across platforms:

Snowflake - a MASKING POLICY object, attached to a column with ALTER TABLE … SET MASKING POLICY, branching on CURRENT_ROLE().
Databricks Unity Catalog - a column MASK function applied via ALTER TABLE … SET MASK, branching on is_account_group_member(...).
BigQuery - data masking through policy tags on the column plus a Data Masking IAM role, a fundamentally different model from inline DDL.
Portable SQL / PySpark - when you need it inside a pipeline rather than as a warehouse-native policy.

This is pure, error-prone busywork. The logic is identical; only the spelling changes, and each platform has its own footguns (Snowflake policies are schema-scoped objects you reuse; BigQuery masking is governed by IAM, not SQL). Hand-translating the same intent four ways is exactly the kind of mechanical work worth handing to a tool.

Let the mechanical parts be mechanical

The judgment in masking - which columns, which technique, which gotcha applies - is yours, and this post is mostly about making those calls well. But detection, the masked preview, and four dialects of DDL are mechanical, and that's what the PII Masking Policy Generator is for:

Paste a JSON or CSV sample (nothing uploads - it runs in your browser); it flags likely-PII columns with a sensitivity tier.
Choose a technique per column - hash, tokenize, redact, generalize - and see a live masked preview so you catch over- or under-masking before it ships.
Read the gotchas advisor - salt your hashes, separate your vault, the Safe-Harbor ZIP rule - surfaced against the choices you actually made.
Copy role-conditional masking DDL for Snowflake, Databricks Unity Catalog, BigQuery, or portable SQL/PySpark, with each technique explained and linked to the lesson behind it.

✓Tip

The real win isn't the generated SQL - it's that the gotchas are checked against your config, so the unsalted hash gets flagged before it becomes a lookup table, not after an incident. The DDL is the last 10%; the generator exists to make sure the first 90% is right.

Masking well is a governance skill wearing a SQL costume. Make the four decisions deliberately - tier the columns, match technique to use, respect the gotchas, target the right dialect - and "masked" finally means what everyone assumed it meant. The fastest way to pressure-test your own data is to drop a sample into the generator and watch which columns light up; the governance internals behind every technique are one click away in the curriculum.