You start with a database? Great. But wait, you need bulk storage now, so you start sticking data in a cloud bucket (making sure to use a separate namespace for it). Then Team 2 introduces a new service you need to spin up in a separate container, so you pull their repo. Then there's a production issue that proper A/B testing could have caught, so you adopt a third-party solution that offers it. The party continues, and soon your simple one-click setup is so complicated that you need a full-time person just to keep it alive. Whoops! Someone got the cloud namespace wrong on their desktop instance, and production data got hosed. Etc.
The dev/staging sandboxes are essentially the pragmatic hack to create these snapshots. Ugly sacrifices are made to construct the writable snapshot across disparate pieces. It becomes a headache when there is too much contention to use these sandboxes, or too much manual effort to reset them to a desired testing state. Also, if the sandbox's copy-on-write mechanism differs too much from production storage, you change the test environment so much that you are no longer emulating how it will behave in production. So the old-school approach is a replica of the full environment on redundant hardware with the same characteristics as production.
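For concreteness, here's roughly what the "writable snapshot" hack looks like when the storage layer cooperates. This is a sketch assuming ZFS, with hypothetical dataset names (tank/prod, tank/sandbox); it prints the commands as a dry run rather than executing them:

```shell
# A ZFS snapshot is read-only; cloning it gives a cheap copy-on-write
# writable sandbox, and "reset to a known state" is just destroy + re-clone.
sandbox_reset_cmds() {
  local snap="$1" clone="$2"
  echo "zfs snapshot $snap"        # freeze production state
  echo "zfs clone $snap $clone"    # writable COW sandbox on top of it
  echo "zfs destroy $clone"        # reset: discard all sandbox writes
  echo "zfs clone $snap $clone"    # back to the frozen baseline
}

sandbox_reset_cmds "tank/prod@sandbox-base" "tank/sandbox"
```

The catch the parent describes is exactly this: once your state spans a database, a cloud bucket, and a third-party service, there's no single layer where one clone/destroy cycle covers everything.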
But before I read the linked article, I was expecting a different anti-pattern to be discussed: where people forget that the dev/staging processes are for software testing, to prepare for when you deploy high-quality software to production. They are not for data preparation. Your deployment eventually needs to combine new software with the existing production data, and not depend on accumulated state of the sandbox data. I've seen people twist themselves into pretzels conflating software and data, and trying to somehow move data from the sandbox into production in a misguided "upgrade".
Software flows from developer, through the sandbox(es), to eventually be in production use by users. Data flows the opposite direction, from production users into snapshots loaded into sandboxes, and eventually into developers' hands with their experimental code. Ignoring, of course, situations where developers are not authorized to see real user data...
> Yeah, there's a lot of hidden magic/assumptions in having a "writable snapshot of a specific version" of production data.
That's absolutely a huge assumption. This technology has been a game changer for us: https://lakefs.io/
> It becomes a headache when there is too much contention to use these sandboxes, or too much manual effort to reset them to a desired testing state.
This is exactly the situation we were encountering.
From what you've described, you're doing the right thing for you and your team. Keep it simple as long as you possibly can. My only advice is to keep weighing the time needed to maintain your approach against the return you get from it.
The dev/stage/prod approach only pays off at sufficient scale, and with proper discipline among teams, each maintaining their own version of their product at the dev and stage points. This has plenty of headaches, but you at least get a chance to exercise your work in something as close to production as possible without actually being there. It tends to work 'best' when you only start holding other teams accountable at the stage point.
Cloud dependencies are where I've seen things get the weirdest and most volatile. There are all kinds of limitations that can crop up even if you try to maintain the highest level of separation and discipline.
For example, did you know that AWS limits a single account to 5 Elastic IP addresses per region by default, and that there's an upper limit on how many Elastic Network Interfaces an account can hold in a region? [1] It sounds stupid, but I've actually seen these limits hit even after politely asking AWS to make them as large as possible; keeping developers empowered to deploy their own, compartmentalized version of the product became a real pain.
[1] https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-...
Anyway, hope that perspective is helpful.
We found the key was associating our code state (git branch) with our data state (lakeFS branch). We make this association during our branch deployment process.
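Concretely, the association can be as simple as deriving the lakeFS branch name from the git branch at deploy time. A sketch, assuming a lakeFS repo named my-repo and one data branch per code branch (all names here are hypothetical, not our actual setup):

```shell
# Map the current code branch to its data branch URI. We flatten '/' to '-'
# so a branch like feature/ab-testing maps to a single name component.
data_branch_uri() {
  echo "lakefs://my-repo/$(echo "$1" | tr '/' '-')"
}

# At deploy time, the pipeline would then branch the data alongside the code:
#   git_branch=$(git rev-parse --abbrev-ref HEAD)
#   lakectl branch create "$(data_branch_uri "$git_branch")" --source lakefs://my-repo/main

data_branch_uri "feature/ab-testing"
```

Tearing down a branch deployment then deletes both branches together, so the code and its data snapshot always come and go as a unit.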
Let me know if this helps at all. I was planning to write a follow up post about what we learned about managing the logical state of a data pipeline. If you have suggestions for a different topic to dive into, I'd love to hear about it.