Features:
- Support data backup and restore for PostgreSQL, MySQL and MongoDB
- Replace sensitive data with fake data
- Works on large databases (> 10GB) (read Design)
- Database Subsetting: Scale down a production database to a more reasonable size
- Start a local database with the prod data in a single command
- On-the-fly data (de)compression (Zlib)
- On-the-fly data encryption/decryption (AES-256)
- Fully stateless (no server, no daemon) and lightweight binary
- Use custom transformers
My motivation: As a developer, creating a fake dataset for running tests is tedious. Moreover, it does not reflect real-world data and is painful to keep updated. If you prefer to run your app tests with production data, then RepliByte is for you as well.
Available for macOS, Linux, and Windows.
Or at least include "AWS_SESSION_TOKEN" in that setup (if it is present) so that "aws sts assume-role" works, or allow `AWS_PROFILE`, or just use the AWS SDK's normal credential discovery mechanism, which on their main SDKs falls back through a list of sources. I couldn't follow the docs.rs soup well enough to tell whether their Rust SDK is up to speed.
I run all AWS commands through an assumed role (STS) via aws-vault.
I can now see how my wording was confusing, I'll try to be more precise in the future
I have no idea what implementations there are for this: https://docs.rs/aws-sdk-s3/latest/aws_sdk_s3/struct.Credenti... and its official page is even worse: https://docs.aws.amazon.com/sdk-for-rust/latest/dg/credentia...
Going all the way down to the GH readme seems to back up the investigation that, no, they really seem to have forgotten about "AWS_SESSION_TOKEN": https://github.com/awslabs/aws-sdk-rust#getting-started-with...
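For reference, the conventional credential triple looks like this; a Python sketch (not the Rust SDK) of what "picking up the session token" means. The function name is my own; the environment variable names are the standard AWS ones:

```python
import os

def env_credentials():
    """Collect the standard AWS credential environment variables.

    AWS_SESSION_TOKEN only exists for temporary credentials, e.g. those
    minted by `aws sts assume-role` or aws-vault; static IAM user keys
    omit it. A client that drops it sends a key pair that STS rejects
    as incomplete.
    """
    return {
        "access_key_id": os.environ.get("AWS_ACCESS_KEY_ID"),
        "secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY"),
        # Forgetting this field is exactly the gap being complained about.
        "session_token": os.environ.get("AWS_SESSION_TOKEN"),
    }
```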
The best way to store ephemeral secrets is in an environment variable or /dev/shm. This locks the secret behind the scope of the parent process (shell instance) and the user.
Another big reason is it’s much nicer to deploy on any AWS service and have the SDK use the metadata host, which will automatically provide you with a temporary access token with the permissions of the role you set for it.
There are also plenty of organizations (not mine) with review boards for database changes. Those folks could also have a process to make sure that new, sensitive columns get added to the configuration file.
21. Production data in test environments (Hold)

We continue to perceive production data in test environments as an area for concern. Firstly, many examples of this have resulted in reputational damage, for example, where an incorrect alert has been sent from a test system to an entire client population. Secondly, the level of security, specifically around protection of private data, tends to be less for test systems. There is little point in having elaborate controls around access to production data if that data is copied to a test database that can be accessed by every developer and QA. Although you can obfuscate the data, this tends to be applied only to specific fields, for example, credit card numbers.

Finally, copying production data to test systems can break privacy laws, for example, where test systems are hosted or accessed from a different country or region. This last scenario is especially problematic with complex cloud deployments.

Fake data is a safer approach, and tools exist to help in its creation. We do recognize there are reasons for specific elements of production data to be copied, for example, in the reproduction of bugs or for training of specific ML models. Here our advice is to proceed with caution.
> Fake data is a safer approach, and tools exist to help in its creation.
Because the tool presented is exactly what this quote says.
However, there may be times when data masking must be nuanced. Suppose fully random email/domain pairs won't do, and you would rather replace every "example.com" domain with "fake.com" rather than with "random1.com", "random2.com", etc. (e.g., for ML or third-party analysis). Out of the box I don't see this provided; however, I see that you can write a custom transformer (https://github.com/Qovery/replibyte/tree/main/examples/wasm) to fulfill your needs.
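RepliByte's custom transformers are compiled to WASM, so this is purely an illustration of the kind of logic such a transformer could implement. The function name and the deterministic-hash scheme are my own invention, not anything RepliByte ships:

```python
import hashlib

def mask_email(email, keep_domains=None):
    """Pseudonymize the local part deterministically while mapping
    known domains to fixed replacements.

    `keep_domains` maps real domains to fakes, e.g.
    {"example.com": "fake.com"}; unknown domains get a hashed stand-in.
    """
    keep_domains = keep_domains or {}
    local, _, domain = email.partition("@")
    # Deterministic pseudonym: the same input always yields the same
    # output, so joins across tables keep working after masking.
    fake_local = hashlib.sha256(local.encode()).hexdigest()[:10]
    fake_domain = keep_domains.get(
        domain, hashlib.sha256(domain.encode()).hexdigest()[:8] + ".invalid"
    )
    return f"{fake_local}@{fake_domain}"
```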
Excellent :)
Honestly, I'm kinda surprised by the lack of comments advocating against doing this.
I think this tool looks great!
I appreciate the time and effort you put into releasing a free and open-source tool to help solve a real problem.
Keep up the great work!
- There's a transformer which appears to retain the first char on string fields. That's not safe if you're dealing with customer data.
- Remove telemetry. That it's claimed to be anonymized and togglable is meaningless where sensitive data is concerned.
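To make the first bullet concrete, here is a toy Python model of a keep-first-char mask (not RepliByte's actual transformer code) and why it leaks on low-cardinality columns:

```python
import random
import string

def mask_keep_first(value):
    """Keep the first character, randomize the rest (illustrative only)."""
    if not value:
        return value
    tail = "".join(random.choice(string.ascii_lowercase) for _ in value[1:])
    return value[0] + tail

# On a low-cardinality column the first character alone can fully
# identify the original value: among ["Mr", "Mrs", "Ms", "Dr"], any
# masked value starting with "D" must have been "Dr". Short free-text
# fields (initials, rare surnames) leak in the same way.
```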
1. What do you mean it is not safe?
2. Telemetry can be removed with the option --no-telemetry, and you can inspect the code: https://github.com/Qovery/replibyte/blob/main/replibyte/src/telemetry.rs

With regards to telemetry, I'm aware that it can be disabled. But in my experience that would still result in a veto from the security teams I've worked with.
Anyway, if you are a single developer or a handful of developers where everyone has access to prod, you may not care. Still, data hygiene and risk mitigation shouldn't be overlooked.
> "Personal data is any information that relates to an identified or identifiable living individual. Different pieces of information, which collected together can lead to the identification of a particular person, also constitute personal data."
So it's not enough to, for example, replace all names, addresses etc. when you can still see which products someone has interacted with, when their account was created (which in the production DB would relate back to their actual account!) or any other unexpected pieces of information that links back to their identity.
In practice, this means that any realistic production-derived data is either very likely to be still considered PII (and therefore much more demanding to handle safely and securely) or has to be mangled so much that it is no longer representative of production data.
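A toy illustration of the re-identification risk (all rows here are made up): even with the name replaced, a unique combination of quasi-identifiers links the "anonymized" row straight back to the production record.

```python
# A masked row: name pseudonymized, but signup date, zip code, and
# birth year copied through unchanged from production.
masked = [
    {"name": "u_91ac", "zip": "10115", "birth_year": 1984, "signup": "2019-03-02"},
]

# The production table the masked data was derived from.
production = [
    {"name": "Alice Example", "zip": "10115", "birth_year": 1984, "signup": "2019-03-02"},
    {"name": "Bob Example",   "zip": "80331", "birth_year": 1990, "signup": "2021-07-14"},
]

def reidentify(row, reference):
    """Join on quasi-identifiers; a unique match defeats the masking."""
    keys = ("zip", "birth_year", "signup")
    matches = [r for r in reference if all(r[k] == row[k] for k in keys)]
    return matches[0]["name"] if len(matches) == 1 else None
```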
https://github.com/Qovery/replibyte/blob/main/replibyte/src/...
Not a single comment to say what anything does. Sigh. It's the same for the other drivers too.
- Works on large database (> 10GB) (read Design)
Can anyone explain to me how this works in RepliByte? The design document only talks about Postgres. For example, let's say I have a MySQL database; how does RepliByte copy that database into S3?
Does it use mysqldump, or are they copying the database index files? We have a script that automatically backs up our production database at intervals to S3, and then a program to download the latest backup and scrub the data.
It takes a heck of a long time to download and impacts the server when it happens... it's been on my todo list to replace it with Percona's XtraBackup [1], but it doesn't look like that's what these guys are doing?
- Database Subsetting: Scale down a production database to a more reasonable size
What about this? Does the database need foreign keys to prevent related rows in tables from being lost, or are they just randomly deleting rows, as the config seems to indicate [2]?

[1] https://www.percona.com/software/mysql-database/percona-xtra...
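For context on the foreign-key question: subsetting has to keep the sample referentially consistent. A naive Python sketch (toy tables, not RepliByte's algorithm) of what "FK-aware" means, as opposed to deleting rows at random:

```python
# Toy tables: orders.user_id is a foreign key into users.id.
users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 10, "user_id": 1}, {"id": 11, "user_id": 2}]

def subset_fk(users, orders, keep_user_ids):
    """Keep a sample of parent rows, then keep only child rows whose
    foreign key points at a kept parent, so no child row dangles.
    Randomly deleting rows table-by-table would orphan children.
    """
    kept_users = [u for u in users if u["id"] in keep_user_ids]
    kept_orders = [o for o in orders if o["user_id"] in keep_user_ids]
    return kept_users, kept_orders
```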
https://github.com/jssprasanna/redgene
ReDGene (Relational Data Generator) is a tool aimed at taking control over data generation: it can generate column vectors in a table with the required type, interval, length, cardinality, skewness, and constraints such as Primary Key, Foreign Key, Foreign Key (unique 1:1 mapping), Composite Primary Key, and Composite Foreign Key.
And it is DB agnostic: it generates data as flat files, which can be imported into any database that supports importing data from flat files.
I worked on a gov app years ago that required anonymized databases, and I remember thinking then: why isn't this available out of the box? Everyone must need it from time to time.
- staging databases that hold data generated from production databases should be considered production data, with the same level of consideration for security and access as production.
- staging databases that hold production data are a GDPR violation waiting to happen. Make sure your data controller / lawyers knows exactly what you're doing with production data.
- ask yourself why you need production data in staging in the first place. What are you gaining over a script that generates data? If you want data at scale you can generate it randomly. If you want data that covers all edge cases you can generate it non-randomly. If you want "real-looking" data then maybe this tool is useful.
People copying data from production to staging and then failing to look after it properly is a nightmare. It shouldn't be encouraged except in very unusual circumstances. In my experience of dev, your development and staging data should be covering the weird edge cases that you need to handle far more than the nice "happy path" data you get in production.
Another major problem with tools like RepliByte is that people use them properly at first, and then the database schema changes, but nobody updates the script to anonymize the new tables or columns. A few months later someone notices sensitive data has made its way into staging, and into the backups, and into the database dumps devs made to debug things, because "it's only staging data, who cares!"
Protecting user data is something that you need to be extremely vigilant about. In my experience, the less access I have to production data the happier I am. Copying it and using it in staging, even if you're careful about it, fills me with dread.
I have a docker container running postgres and I just want to take the snapshot and seed it into that. How exactly do I do this?
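I'm not sure of RepliByte's exact restore subcommand, but the generic route is to stream a plain SQL dump into the container over stdin with `docker exec`. A sketch that just builds the command; the container, user, and database names are placeholders:

```python
def restore_command(container, db="postgres", user="postgres"):
    """Command that pipes a plain-SQL dump into a dockerized Postgres.

    Run it with the dump on stdin, e.g.:
        subprocess.run(restore_command("my_pg"),
                       stdin=open("dump.sql", "rb"), check=True)
    The -i flag keeps stdin open so psql can read the dump.
    """
    return ["docker", "exec", "-i", container, "psql", "-U", user, "-d", db]
```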