Luke Angel
2025

PII Masking Starter Kit

Year: 2025
Stack: python · pyspark · aws-glue · aws-databrew · aws-s3
Outcome: Open source · MIT · github.com/drlukeangel/PII-Masking-Starter-Kit-Product-Management

The kit is the smallest credible PII masking stack a data-engineering team needs to ship a compliant pipeline: a four-bucket rubric, a runnable PySpark job, an AWS DataBrew recipe, and a verify script that fails CI whenever the masked output drifts from the rubric.


Why this exists

Most teams handle PII in one of three ways: ignore it (illegal), hash everything (useless), or argue about it for six weeks before a single byte moves (expensive). The kit is the minimal opinionated alternative: a rubric the team agrees on once, then code that enforces it.

The rubric, in one paragraph

PII isn't one thing. It's four.

  • Direct identifiers (email, device serial) → hashed with a rotating salt
  • Quasi-identifiers (name, employee ID) → tokenized to a stable random string so joins still work
  • Sensitive attributes (location, biometric, health) → generalized (GPS snapped to a 0.01° grid; ages bucketed into five-year bins)
  • Behavioral data (battery level, usage minutes) → kept

That's the whole rubric. Print it. Tape it next to your monitor.
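
As a rough illustration, the four treatments can be sketched with the Python standard library alone. This is not the kit's glue/pii_masking_job.py: the field names, the HMAC-SHA-256 construction, the in-memory TokenVault, and every helper name below are assumptions made for the example.

```python
import hashlib
import hmac
import secrets


def hash_direct(value: str, salt: bytes) -> str:
    """Direct identifier: keyed SHA-256; rotating the salt changes every digest."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


class TokenVault:
    """Quasi-identifier: same input always maps to the same random token,
    so joins on the tokenized column still line up (hypothetical helper)."""

    def __init__(self) -> None:
        self._tokens: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        if value not in self._tokens:
            self._tokens[value] = secrets.token_hex(8)
        return self._tokens[value]


def snap_gps(coord: float) -> float:
    """Sensitive attribute: snap a coordinate to a 0.01 degree grid."""
    return round(coord, 2)


def bucket_age(age: int, width: int = 5) -> str:
    """Sensitive attribute: replace an exact age with its five-year bin."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def mask_record(record: dict, salt: bytes, vault: TokenVault) -> dict:
    """Apply one treatment per rubric bucket to a single row."""
    return {
        "email": hash_direct(record["email"], salt),            # direct
        "employee_id": vault.tokenize(record["employee_id"]),   # quasi
        "lat": snap_gps(record["lat"]),                         # sensitive
        "lon": snap_gps(record["lon"]),                         # sensitive
        "age": bucket_age(record["age"]),                       # sensitive
        "battery_level": record["battery_level"],               # behavioral: kept
    }
```

A production job would persist the token mapping and manage the salt outside the code; the in-memory dictionary here only shows why tokenized values remain joinable.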

What's in the box

File                            Job
rubric.md                       The four-bucket rubric, one page
data/generate_synthetic.py      Generate fake tool-telemetry data
data/sample_tool_telemetry.csv  20 rows of synthetic data, ready to run
glue/pii_masking_job.py         PySpark Glue job, production path
databrew/recipe.json            DataBrew recipe, analyst-friendly path
verify.py                       Post-mask invariants check; fails CI on drift
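
The kit's verify.py is not reproduced here. The following is only a sketch of what a post-mask invariants check might look like, assuming hypothetical column names (email, lat, lon, age) and invariants read off the rubric: hashed emails look like SHA-256 hex digests, coordinates sit on the 0.01° grid, and ages are five-year bins.

```python
import re
import sys

HEX64 = re.compile(r"^[0-9a-f]{64}$")    # shape of a SHA-256 hex digest
AGE_BIN = re.compile(r"^(\d+)-(\d+)$")   # e.g. "35-39"


def on_grid(value: float, decimals: int = 2) -> bool:
    """True when the value carries no precision finer than the grid."""
    return abs(value - round(value, decimals)) < 1e-9


def check_row(row: dict) -> list[str]:
    """Return the invariant violations found in one masked row."""
    problems = []
    if not HEX64.match(row["email"]):
        problems.append("email is not a SHA-256 hex digest")
    for axis in ("lat", "lon"):
        if not on_grid(float(row[axis])):
            problems.append(f"{axis} is finer than the 0.01 degree grid")
    m = AGE_BIN.match(row["age"])
    if not m or int(m.group(1)) % 5 != 0 or int(m.group(2)) - int(m.group(1)) != 4:
        problems.append("age is not a five-year bin")
    return problems


def verify(rows) -> int:
    """Print every violation to stderr; return a process exit code."""
    failures = [(i, p) for i, row in enumerate(rows) for p in check_row(row)]
    for i, problem in failures:
        print(f"row {i}: {problem}", file=sys.stderr)
    return 1 if failures else 0
```

In CI, a wrapper would load the masked output (for example with csv.DictReader) and call sys.exit(verify(rows)), so a single violation fails the build.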

How teams use it

  • Engineering managers fork it as a starting template for the data-pipeline repo
  • Product managers read rubric.md and stop there
  • Data engineers lift the Glue job structure and swap in their own schema
  • Privacy / Legal audit rubric.md and verify.py; the verify script is the contract

Paired with

The data shape used throughout the examples is industrial tool telemetry — operator IDs, GPS readings, job-site addresses — which is also the shape the Connected Products Starter Kit emits. The two kits work together: one ingests the data, the other masks it before it goes anywhere downstream.