Features:
- Support data backup and restore for PostgreSQL, MySQL and MongoDB
- Replace sensitive data with fake data
- Works on large databases (> 10GB) (read Design)
- Database Subsetting: Scale down a production database to a more reasonable size
- Start a local database with the prod data in a single command
- On-the-fly data (de)compression (Zlib)
- On-the-fly data encryption/decryption (AES-256)
- Fully stateless (no server, no daemon) and lightweight binary
- Use custom transformers
My motivation: As a developer, creating a fake dataset for running tests is tedious. Moreover, it does not reflect real-world data and is painful to keep updated. If you prefer to run your app tests with production data, then RepliByte is for you as well.
Available for macOS, Linux, and Windows.
Or at least include "AWS_SESSION_TOKEN" in that setup (if it is present) so that "aws sts assume-role" works, or allow `AWS_PROFILE`, or just use the AWS SDK's normal credential discovery mechanism, which on their main SDKs falls back through a list of sources. I couldn't follow the docs.rs soup well enough to tell whether their Rust SDK is up to speed.
I run all AWS commands through an assumed role (STS) via aws-vault.
I can now see how my wording was confusing, I'll try to be more precise in the future
I have no idea what implementations there are for this: https://docs.rs/aws-sdk-s3/latest/aws_sdk_s3/struct.Credenti... and its official page is even worse: https://docs.aws.amazon.com/sdk-for-rust/latest/dg/credentia...
Going all the way down to the GH readme seems to back up the investigation that, no, they really seem to have forgotten about "AWS_SESSION_TOKEN": https://github.com/awslabs/aws-sdk-rust#getting-started-with...
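For reference, the conventional credential triple looks like this; a Python sketch (not the Rust SDK) of what "picking up the session token" means. The function name is my own; the environment variable names are the standard AWS ones:

```python
import os

def env_credentials():
    """Collect the standard AWS credential environment variables.

    AWS_SESSION_TOKEN only exists for temporary credentials, e.g. those
    minted by `aws sts assume-role` or aws-vault; static IAM user keys
    omit it. A client that drops it sends a key pair that STS rejects
    as incomplete.
    """
    return {
        "access_key_id": os.environ.get("AWS_ACCESS_KEY_ID"),
        "secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY"),
        # Forgetting this field is exactly the gap being complained about.
        "session_token": os.environ.get("AWS_SESSION_TOKEN"),
    }
```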
The best way to store ephemeral secrets is in an environment variable or /dev/shm. This locks the secret behind the scope of the parent process (shell instance) and the user.
Another big reason is it’s much nicer to deploy on any AWS service and have the SDK use the metadata host, which will automatically provide you with a temporary access token with the permissions of the role you set for it.
There are also plenty of organizations (not mine) with review boards for database changes. Those folks could also have a process to make sure that new, sensitive columns get added to the configuration file.
21. Production data in test environments (Hold)

We continue to perceive production data in test environments as an area for concern. Firstly, many examples of this have resulted in reputational damage, for example, where an incorrect alert has been sent from a test system to an entire client population. Secondly, the level of security, specifically around protection of private data, tends to be less for test systems. There is little point in having elaborate controls around access to production data if that data is copied to a test database that can be accessed by every developer and QA. Although you can obfuscate the data, this tends to be applied only to specific fields, for example, credit card numbers.

Finally, copying production data to test systems can break privacy laws, for example, where test systems are hosted or accessed from a different country or region. This last scenario is especially problematic with complex cloud deployments.

Fake data is a safer approach, and tools exist to help in its creation. We do recognize there are reasons for specific elements of production data to be copied, for example, in the reproduction of bugs or for training of specific ML models. Here our advice is to proceed with caution.
> Fake data is a safer approach, and tools exist to help in its creation.
Because the tool presented is exactly what this quote says.
However, there may be times when data masking must be nuanced. Suppose fully random email/domain pairs won't do, and you would rather replace every "example.com" domain with "fake.com" rather than with "random1.com", "random2.com", etc. (e.g., for ML or third-party analysis). Out of the box I don't see this provided; however, I see that you can write a custom transformer (https://github.com/Qovery/replibyte/tree/main/examples/wasm) to fulfill your needs.
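RepliByte's custom transformers are compiled to WASM, so this is purely an illustration of the kind of logic such a transformer could implement. The function name and the deterministic-hash scheme are my own invention, not anything RepliByte ships:

```python
import hashlib

def mask_email(email, keep_domains=None):
    """Pseudonymize the local part deterministically while mapping
    known domains to fixed replacements.

    `keep_domains` maps real domains to fakes, e.g.
    {"example.com": "fake.com"}; unknown domains get a hashed stand-in.
    """
    keep_domains = keep_domains or {}
    local, _, domain = email.partition("@")
    # Deterministic pseudonym: the same input always yields the same
    # output, so joins across tables keep working after masking.
    fake_local = hashlib.sha256(local.encode()).hexdigest()[:10]
    fake_domain = keep_domains.get(
        domain, hashlib.sha256(domain.encode()).hexdigest()[:8] + ".invalid"
    )
    return f"{fake_local}@{fake_domain}"
```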
Excellent :)
Honestly, I'm kinda surprised by the lack of comments advocating against doing this.
I think this tool looks great!
I appreciate the time and effort you put into releasing a free and open-source tool to help solve a real problem.
Keep up the great work!
- There's a transformer which appears to retain the first char on string fields. That's not safe if you're dealing with customer data.
- Remove telemetry. That it's claimed to be anonymized and togglable is meaningless where sensitive data is concerned.
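To make the first bullet concrete, here is a toy Python model of a keep-first-char mask (not RepliByte's actual transformer code) and why it leaks on low-cardinality columns:

```python
import random
import string

def mask_keep_first(value):
    """Keep the first character, randomize the rest (illustrative only)."""
    if not value:
        return value
    tail = "".join(random.choice(string.ascii_lowercase) for _ in value[1:])
    return value[0] + tail

# On a low-cardinality column the first character alone can fully
# identify the original value: among ["Mr", "Mrs", "Ms", "Dr"], any
# masked value starting with "D" must have been "Dr". Short free-text
# fields (initials, rare surnames) leak in the same way.
```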
1. What do you mean it is not safe?
2. Telemetry can be removed with the option --no-telemetry, and you can inspect the code: https://github.com/Qovery/replibyte/blob/main/replibyte/src/telemetry.rs

With regards to telemetry, I'm aware that it can be disabled. But in my experience that would still result in a veto from the security teams I've worked with.
Anyway, if you are a single developer or a handful of developers where everyone has access to prod, you may not care. Still, data hygiene and risk mitigation shouldn't be overlooked.
> "Personal data is any information that relates to an identified or identifiable living individual. Different pieces of information, which collected together can lead to the identification of a particular person, also constitute personal data."
So it's not enough to, for example, replace all names, addresses etc. when you can still see which products someone has interacted with, when their account was created (which in the production DB would relate back to their actual account!) or any other unexpected pieces of information that links back to their identity.
In practice, this means that any realistic production-derived data is either very likely to be still considered PII (and therefore much more demanding to handle safely and securely) or has to be mangled so much that it is no longer representative of production data.
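A toy illustration of the re-identification risk (all rows here are made up): even with the name replaced, a unique combination of quasi-identifiers links the "anonymized" row straight back to the production record.

```python
# A masked row: name pseudonymized, but signup date, zip code, and
# birth year copied through unchanged from production.
masked = [
    {"name": "u_91ac", "zip": "10115", "birth_year": 1984, "signup": "2019-03-02"},
]

# The production table the masked data was derived from.
production = [
    {"name": "Alice Example", "zip": "10115", "birth_year": 1984, "signup": "2019-03-02"},
    {"name": "Bob Example",   "zip": "80331", "birth_year": 1990, "signup": "2021-07-14"},
]

def reidentify(row, reference):
    """Join on quasi-identifiers; a unique match defeats the masking."""
    keys = ("zip", "birth_year", "signup")
    matches = [r for r in reference if all(r[k] == row[k] for k in keys)]
    return matches[0]["name"] if len(matches) == 1 else None
```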
https://github.com/Qovery/replibyte/blob/main/replibyte/src/...
Not a single comment to say what anything does. Sigh. It's the same for the other drivers too.
- Works on large database (> 10GB) (read Design)
Can anyone explain to me how this works in RepliByte? The design document only talks about Postgres. For example, let's say I have a MySQL database; how does RepliByte copy that database into S3?
Does it use mysqldump, or are they copying the database index files? We have a script that automatically backs up our production database at intervals to S3, and then a program to download the latest backup and scrub the data.
It takes a heck of a long time to download and impacts the server when it happens... it's been on my todo list to replace it with Percona's XtraBackup [1], but it doesn't look like that's what these guys are doing?
- Database Subsetting: Scale down a production database to a more reasonable size
What about this? Does the database need foreign keys to prevent related rows in tables from being lost, or are they just randomly deleting rows, as the config seems to indicate [2]?

[1] https://www.percona.com/software/mysql-database/percona-xtra...
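For context on the foreign-key question: subsetting has to keep the sample referentially consistent. A naive Python sketch (toy tables, not RepliByte's algorithm) of what "FK-aware" means, as opposed to deleting rows at random:

```python
# Toy tables: orders.user_id is a foreign key into users.id.
users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 10, "user_id": 1}, {"id": 11, "user_id": 2}]

def subset_fk(users, orders, keep_user_ids):
    """Keep a sample of parent rows, then keep only child rows whose
    foreign key points at a kept parent, so no child row dangles.
    Randomly deleting rows table-by-table would orphan children.
    """
    kept_users = [u for u in users if u["id"] in keep_user_ids]
    kept_orders = [o for o in orders if o["user_id"] in keep_user_ids]
    return kept_users, kept_orders
```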
https://github.com/jssprasanna/redgene
ReDGene (Relational Data Generator) is a tool aimed at taking control over data generation: it can generate column vectors in a table with the required type, interval, length, cardinality, skewness, and constraints such as Primary Key, Foreign Key, Foreign Key (unique 1:1 mapping), Composite Primary Key, and Composite Foreign Key.
And it is DB agnostic: it generates data as flat files, which can be imported into any database that supports importing data from flat files.
I worked on a gov app years ago that required anonymized databases, and I remember thinking then: why isn't this available out of the box? Everyone must need it from time to time.
- staging databases that hold data generated from production databases should be considered production data, with the same level of consideration for security and access as production.
- staging databases that hold production data are a GDPR violation waiting to happen. Make sure your data controller / lawyers knows exactly what you're doing with production data.
- ask yourself why you need production data in staging in the first place. What are you gaining over a script that generates data? If you want data at scale you can generate it randomly. If you want data that covers all edge cases you can generate it non-randomly. If you want "real-looking" data then maybe this tool is useful.
People copying data from production to staging and then failing to look after it properly is a nightmare. It shouldn't be encouraged except in very unusual circumstances. In my experience of dev, your development and staging data should be covering the weird edge cases that you need to handle far more than the nice "happy path" data you get in production.
Another major problem with tools like RepliByte is that people use them properly at first, and then the database schema changes, but nobody updates the script to anonymize the new tables or columns. A few months later someone notices sensitive data has made its way into staging, and into the backups, and into the database dumps devs made to debug things, because "it's only staging data, who cares!"
Protecting user data is something that you need to be extremely vigilant about. In my experience, the less access I have to production data the happier I am. Copying it and using it in staging, even if you're careful about it, fills me with dread.
I have a docker container running postgres and I just want to take the snapshot and seed it into that. How exactly do I do this?
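I'm not sure of RepliByte's exact restore subcommand, but the generic route is to stream a plain SQL dump into the container over stdin with `docker exec`. A sketch that just builds the command; the container, user, and database names are placeholders:

```python
def restore_command(container, db="postgres", user="postgres"):
    """Command that pipes a plain-SQL dump into a dockerized Postgres.

    Run it with the dump on stdin, e.g.:
        subprocess.run(restore_command("my_pg"),
                       stdin=open("dump.sql", "rb"), check=True)
    The -i flag keeps stdin open so psql can read the dump.
    """
    return ["docker", "exec", "-i", container, "psql", "-U", user, "-d", db]
```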