I get that you can look up or de-anonymize an event by its timestamp, and the same is true of ID numbers. But ID numbers are worse because they're often permanent and reused across multiple events.
But yeah, the risk with anonymized data is that it's never truly both anonymous and useful. Truly anonymous data might as well be junk or random data.
Anonymized data has some utility purpose to fulfil. Perhaps "realistic" analytics are required, or you want to troubleshoot a production issue without revealing to engineers who did what. So you anonymize the fields they shouldn't see, and create a subset of data that reproduces the issue.
Anonymized data is almost always a bad approach compared to generating data from algorithmic or random sources, but sometimes we need anonymized or restricted data to start that process.
A good example is: https://gretel.ai/blog/gretel-ai-illumina-using-ai-to-create...
Full disclosure, I work at Gretel, but I thought this was relevant enough to mention.
I might have missed it, but I need to know exactly where our PII is stored (so not on a dev laptop). How do you know what to replace, and what do you do with any info you do replace?
Edit: To answer my own question: via transformers. But that seems to suggest each dev has to keep them up to date with any schema changes, etc.
(Also some links are broken on GitHub)
Yes, transformers are the way to go. I plan to add a way to detect schema changes and, at least, not try to create a dump when the schema has changed. I don't think it can be done safely without a human admin check.
(Thank you for your PR)
We identify PII by introspecting your database, suggest fields to transform, and provide a JavaScript runtime for writing transformations.
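To make the "JavaScript runtime for writing transformations" concrete, here is a minimal sketch of what a column transformer could look like. The function name, the `{ value }` argument shape, and the return-the-replacement convention are my assumptions for illustration, not this project's actual API:

```javascript
// Hypothetical column transformer (illustrative signature, not the
// tool's real API): receives the original cell value and returns a
// replacement for the snapshot.
function maskEmail({ value }) {
  // Derive the fake address deterministically from the original, so the
  // same input always maps to the same output (joins and uniqueness are
  // preserved) without leaking the original local part or domain.
  let hash = 0;
  for (const ch of String(value)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return `user-${hash.toString(16)}@example.com`;
}
```

Determinism matters here: if `alice@corp.com` maps to a different fake address in every table, foreign-key joins on email break in the snapshot.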
Besides transforming data, you can also reduce and generate data. We are most excited about data generation!
The configuration lives in your repository, and you can capture the snapshots in GitHub Actions, so you get a "GitOps workflow" for data.
A typical GitOps workflow:
1. Add a schema migration for a new column.
2. Add a JS function to generate new data for that column.
3. Add code to use the new column.
4. Later, once you have data, use the same function to transform the original value. (Or just keep generating it.)

Sure, passwords and credit card info are obscured with your methodology, but names, dates of birth, sexual orientation, telephone numbers, email addresses, and IPs will remain unique. This uniqueness is what allows you to potentially identify a person, given enough data.
Even that's problematic, because there may be code that depends on the data being somewhat "real". Credit card numbers, for example, may need to pass Luhn checks, or have valid BIN sections, etc.
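To illustrate the point, here is a sketch of a Luhn check plus a replacement strategy that keeps the BIN (first six digits) and fixes up the check digit, so code that validates card numbers keeps working on the obfuscated data. The helper names are mine, not from any of the tools mentioned:

```javascript
// Luhn checksum: walking from the rightmost digit, double every second
// digit, subtract 9 if the doubled digit exceeds 9, and sum everything.
// Valid numbers sum to a multiple of 10.
function luhnValid(number) {
  const digits = String(number).replace(/\D/g, "");
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return digits.length > 0 && sum % 10 === 0;
}

// Replace the account portion of a 16-digit card number with random
// digits, but keep the BIN (first six digits) and recompute the check
// digit so the result still passes Luhn validation.
function fakeCardKeepingBin(original) {
  let body = String(original).slice(0, 6); // preserve the BIN
  while (body.length < 15) {
    body += Math.floor(Math.random() * 10); // random account digits
  }
  // Exactly one check digit 0-9 makes the full number Luhn-valid.
  for (let check = 0; check <= 9; check++) {
    if (luhnValid(body + check)) return body + check;
  }
}
```

The trade-off is the one raised upthread: the more "real" structure you preserve (valid BIN, valid checksum), the more the obfuscated value resembles, and potentially narrows down, the original.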
Of course, given enough changed data, you can potentially deduce how that data was changed and thus revert it, at which point it becomes PII again and you have a problem… but that's probably a fringe scenario.
[0]https://docs.gretel.ai/gretel.ai/transforms/transforms-model...
[1] https://www.tokenex.com/resource-center/what-is-tokenization
Installation (single binary; Linux/macOS/FreeBSD):

    curl https://clickhouse.com/ | sh
    ./clickhouse obfuscator --help
Docs: https://clickhouse.com/docs/en/operations/utilities/clickhou...
It's not always available in a professional context, or might be considered data extraction.
Keeping everything local and detailing exactly what goes where and how would be helpful.