I get that you can look up or de-anonymize an event by its timestamp, and the same is true of ID numbers. But ID numbers are worse because they're often permanent and reused across multiple events.
But yeah, the risk with anonymized data is that it's never truly both anonymous and useful. Truly anonymous data might as well be junk or random data.
Anonymized data has some utility purpose to fulfil. Perhaps "realistic" analytics are required, or you want to troubleshoot a production issue without revealing to engineers who did what. So you anonymize the fields they shouldn't see, and create a subset of data that reproduces the issue.
Anonymized data is almost always a bad approach compared to generating data from algorithmic or random sources, but sometimes we need anonymized or restricted data to start that process.
A good example is: https://gretel.ai/blog/gretel-ai-illumina-using-ai-to-create...
Full disclosure, I work at Gretel, but I thought this was relevant enough to mention.
I might have missed it, but I need to know exactly where our PII is stored (so not on a dev laptop). How do you know what to replace, and what do you do with any info you do replace?
Edit: To answer my own question: via transformers. But that seems to suggest each dev has to keep them up to date with any schema changes, etc.
(Also some links are broken on GitHub)
Yes, transformers are the way to go. I plan to add a way to detect schema changes and, at least, not try to create a dump when the schema has changed. I don't think it can be done safely without a human admin check.
(Thank you for your PR)
We identify PII by introspecting your database, suggest fields to transform, and provide a JavaScript runtime for writing transformations.
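To make the "JavaScript runtime for writing transformations" concrete, here is a minimal sketch of what a column transformer could look like. The function name, the `{ value }` argument shape, and the return-the-replacement convention are my assumptions for illustration, not this project's actual API:

```javascript
// Hypothetical column transformer (illustrative signature, not the
// tool's real API): receives the original cell value and returns a
// replacement for the snapshot.
function maskEmail({ value }) {
  // Derive the fake address deterministically from the original, so the
  // same input always maps to the same output (joins and uniqueness are
  // preserved) without leaking the original local part or domain.
  let hash = 0;
  for (const ch of String(value)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return `user-${hash.toString(16)}@example.com`;
}
```

Determinism matters here: if `alice@corp.com` maps to a different fake address in every table, foreign-key joins on email break in the snapshot.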
Besides transforming data, you can also reduce and generate data. We are most excited about data generation!
The configuration lives in your repository, and you can capture the snapshots in GitHub Actions, so you get a "GitOps workflow" for data.
A typical GitOps workflow:
1. Add a schema migration for a new column.
2. Add a JS function to generate new data for that column.
3. Add code to use the new column.
4. Later, once you have data, use the same function to transform the original value. (Or just keep generating it.)

Sure, passwords and credit card info are obscured with your methodology, but names, dates of birth, sexual orientation, telephone numbers, email addresses, and IPs will remain unique. This uniqueness is what allows you to potentially identify a person, given enough data.
Even that's problematic, because there may be code that depends on the data being somewhat "real". Credit card numbers, for example, may need to pass Luhn checks, or have valid BIN sections, etc.
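To illustrate the point, here is a sketch of a Luhn check plus a replacement strategy that keeps the BIN (first six digits) and fixes up the check digit, so code that validates card numbers keeps working on the obfuscated data. The helper names are mine, not from any of the tools mentioned:

```javascript
// Luhn checksum: walking from the rightmost digit, double every second
// digit, subtract 9 if the doubled digit exceeds 9, and sum everything.
// Valid numbers sum to a multiple of 10.
function luhnValid(number) {
  const digits = String(number).replace(/\D/g, "");
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return digits.length > 0 && sum % 10 === 0;
}

// Replace the account portion of a 16-digit card number with random
// digits, but keep the BIN (first six digits) and recompute the check
// digit so the result still passes Luhn validation.
function fakeCardKeepingBin(original) {
  let body = String(original).slice(0, 6); // preserve the BIN
  while (body.length < 15) {
    body += Math.floor(Math.random() * 10); // random account digits
  }
  // Exactly one check digit 0-9 makes the full number Luhn-valid.
  for (let check = 0; check <= 9; check++) {
    if (luhnValid(body + check)) return body + check;
  }
}
```

The trade-off is the one raised upthread: the more "real" structure you preserve (valid BIN, valid checksum), the more the obfuscated value resembles, and potentially narrows down, the original.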
Of course, given enough changed data, you can potentially deduce how that data was changed and thus revert it, at which point it becomes PII again and you have a problem… but that's probably a fringe scenario.
[0]https://docs.gretel.ai/gretel.ai/transforms/transforms-model...
[1] https://www.tokenex.com/resource-center/what-is-tokenization
Installation (single binary; Linux/macOS/FreeBSD):

    curl https://clickhouse.com/ | sh
    ./clickhouse obfuscator --help
Docs: https://clickhouse.com/docs/en/operations/utilities/clickhou...
It's not always available in a professional context, or might be considered data extraction.
Keeping everything local and detailing exactly what goes where and how would be helpful.