This section is totally wrong IMO. What is the alternative? "Hard" deleting records from a table is usually a bad idea (unless it is for legal reasons), especially if that table's primary key is a foreign key in another table - imagine deleting a user and then having no idea who made an order. Setting a deleted/inactive flag is by far the lesser of two evils.
>when multiplied across all the analytics queries that you’ll run, this exclusion quickly starts to become a serious drag
I disagree, modern analytics databases filter cheaply and easily. I have scaled data orgs 10-50x and never seen this become an issue. And if this is really an issue, you can remove these records in a transform layer before it hits your analytics team, e.g. in your data warehouse.
>soft deletes introduce yet another place where different users can make different assumptions
Again, you can transform these records out.
The record of an order is not intrinsically PII and thereby subject to rights of erasure. It may well be equally unlawful in some jurisdictions to irrevocably destroy it entirely, it being necessary for accounting or tax audits, or even simply for mundane follow-up processes, such as returns, that arise from actionable consumer rights. Ergo, such documents must fundamentally survive the erasure/redaction of any PII they do include.
Is it always? If that data is immutable, for example?
There is also a more sinister side, which is that the ability to hard delete something forever means that bad actors can fabricate old "deleted" documents and accuse someone of having created and then deleted them.
Assuming you have constraints set up correctly (on delete no action or on delete restrict) then how could this ever happen? If you don’t have constraints set up correctly…
So if you want to delete a user but keep the records of their orders and still know who made those orders, then some form of soft delete is probably your best option. I believe that's the point rm999 was making (in response to the article asserting that soft deletes are a "data model mistake"). Properly configured constraints can prevent an "oops" but don't really do anything to solve the problem of this sort of delete from some contexts but not others.
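As a concrete illustration of the constraint point above, here is a minimal sketch (SQLite via Python for concreteness; the same ON DELETE RESTRICT clause works in Postgres and MySQL, and the table/column names are made up) of how a properly configured foreign key blocks the "oops" hard delete:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this opt-in per connection

conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE RESTRICT
    )
""")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.execute("INSERT INTO orders VALUES (100, 1)")

# Hard-deleting a user who still has orders is rejected outright.
try:
    conn.execute("DELETE FROM users WHERE id = 1")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True
```

Note that this only prevents the accidental delete; as the comment says, it does nothing to let you delete the user from some contexts but not others - that still needs some flavor of soft delete.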
As long as your primary key has no business meaning, you should never have to delete a row from a table.
If you ever hear anyone bragging that their data model is entirely metadata driven, and can be used to model anything - without changing the database - that's a huge red flag, as is looking in and seeing tables called "element", "business object" and the like.
Unfortunately, for most serious Enterprise systems, a degree of flexibility is essential. It's being able to pick the right balance between hard coding first class domain objects into the database and allowing for extensibility that IMO marks the truly expert system designer.
E.g.: It should be possible to take a query definition, request its columns ("schema only" execution), and then insert or merge the columns into a table definition somewhere. Something like:
SELECT SCHEMA( SELECT * FROM "blah" )
INTO "tablename"
When a black market is formed, it's a sign that there is an unmet demand. When you see the exact same "wrong" design pattern turn up over and over, it's a sign that the underlying system isn't meeting the needs of the developers.

Some early mistakes just can't be solved without a do-over, and in my recent experience, the do-over ended up being less work than maintaining the flawed schema.
At this point a ground-up rebuild is probably going to be no slower than trying to update the existing app. Neither will be cheap.
There comes a time to refactor and fix your architecture but it's usually not at the beginning.
You can't design a data model if you don't know what you're building. And no startup really knows what they're building.
That can be said about any cost centre, but you don’t have to drag managers kicking and screaming to get them to buy fire insurance.
Practically what it does is allow the company to keep up velocity and not be distracted putting out fires everywhere.
Of course building features is the team's entire reason for existing. But there is no advantage in deferring refactoring to some later date. The longer you wait, the more painful it gets.
Chances are the time never comes, once progress stalls and the company isn’t out of business yet someone will have the brilliant idea to rewrite everything from scratch, which is just lighting money on fire with extra steps.
No matter what, startups break as they grow. You will need to fix things. Just make sure they're not sooo bad that you can't do it in a timely/affordable way.
Of course they have to iterate, the problem is that there is no deliberate effort anywhere, it’s just piling more crap on top of old crap and deluding themselves that they are some kind of lean, agile visionaries because of it.
Even when we spent a bunch of time planning out our data, we still got a lot of things wrong in hindsight. The reality is we didn't know enough about our product direction to make any truly informed decisions.
In general, poor decisions seem to stem from working in ambiguity about product, rather than poor technical decisions.
It's baffling to me that for many companies I've worked for, their data model is basically 100% tech debt that can never be fixed because the cost is too high.
For example:
> 1. Polluting your database with test or fake data
> [...] By polluting your database with test data, you’ve introduced a tax on all analytics (and internal tool building) at your company.
I feel like I'm missing something because that seems insane to me.
From my experience with Metabase, this makes it easier to use anyway, but it means you have to maintain an ETL.
I once saw something a little similar to this, except with one flavor of DB rather than several. A company you've likely heard of went hard for a certain Java graph database product, due to a combination of an internal advocate who seemed determined to be The GraphDB Guy and an engineering manager who was weirdly susceptible to marketing material. This was because some of their data could be represented as graphs, so clearly a graph database is a good idea.
However: the data for most of their products was tiny, rarely written, not even read that much really, even less commonly written concurrently, and was naturally sharded (with hard boundaries) among clients. Their use of that graph database product was plainly contributing to bugginess, operational pain, mediocre performance (it was reasonably fast... as long as you didn't want to both traverse a graph and fetch data related to that graph, then it was laughably slow) and low development velocity on multiple projects.
Meanwhile, the best DB to deliver the features they wanted quickly & with some nice built-in "free" features for them (ability to control access via existing file sharing tools they had, for instance) was probably... SQLite.
Nobody fully knew how operations, schemas, indexing or queries in any of them worked. Usually someone had managed to hack something together in a week and then the rest of the team just did minor changes to existing queries. Joining between the databases was also a fun exercise.
I blame it all on Docker. It's so easy to just docker-compose run grafana:latest, then dust off your hands and claim you have a database running. Articles on HN about how fancy a setup Netflix has also contribute to this; you don't have the same ops capacity to replicate a FAANG stack.
In the end all of it got replaced with just MongoDB, and firefighting went down to zero. Everybody on the team knew how to do everything, from new queries to migrations and backup/recovery. It's probably worse in every respect at each task the specialized databases were solving, but it works well enough, and often bringing one really good Swiss Army knife is better than a caravan of specialized machines that each require special expertise.
You can use partial indexes to only index non-deleted rows. If you are worried about having to remember to exclude deleted rows from queries: Use a view to abstract away the implementation detail from your analytics queries.
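A minimal sketch of both techniques (SQLite via Python here for concreteness; the partial index and view syntax is essentially the same in Postgres, and all names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT)")

# Partial index: only live rows are indexed, so the index stays small
# and lookups on active users don't pay for soft-deleted ones.
conn.execute("CREATE INDEX idx_users_live ON users (name) WHERE deleted_at IS NULL")

# View: analytics queries hit active_users and never need to remember
# the deleted_at filter themselves.
conn.execute("""
    CREATE VIEW active_users AS
    SELECT id, name FROM users WHERE deleted_at IS NULL
""")

conn.execute("INSERT INTO users VALUES (1, 'alice', NULL)")
conn.execute("INSERT INTO users VALUES (2, 'bob', '2024-01-01')")  # soft-deleted

active = [name for _, name in conn.execute("SELECT id, name FROM active_users")]
```

Point your BI tool at the view rather than the base table and the soft-delete flag becomes invisible to analysts.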
This is a good alternative to moving deleted records from an active table to a deleted table.
- Having informal metric and dimension definitions: you throw together something quick and dirty and then realize there's something semantically broken or uneven about your data definitions. For example, your Android and iOS apps report "countries" differently, or they have meaningfully different notions of "active users"
- Not anticipating backfill/restatement needs. Bugs in logging and analytics stacks happen as much as anywhere else, so it's important to plan for backfills. Without a plan, backfills can be major fire drills or impossible.
- Being over-attentive to ratio metrics (CTR, conversion rates), which are typically difficult to diagnose (step 1: figure out whether the numerator or the denominator is the problem). Ratio metrics can be useful for ranking N alternatives (e.g., campaign keywords), but absolute metrics are usually more useful for overall day-to-day monitoring.
- Overlooking the usefulness of very simple basic alerting. It's common for bugs to cause a metric to go to zero, or to be double counted, or to not be updated with recent data, but often times even these highly obvious problems don't get detected until manual inspection.
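The "very simple basic alerting" in the last point can be as little as a few sanity checks on each day's metric value. A sketch, with hypothetical thresholds (the doubling heuristic in particular is an assumption you'd tune):

```python
def basic_metric_alerts(today, yesterday):
    """Return alert strings for the most obvious failure modes:
    a metric missing (stale pipeline), dropping to zero (logging bug),
    or more than doubling overnight (possible double counting)."""
    alerts = []
    if today is None:
        alerts.append("metric missing: pipeline may not have run")
    elif today == 0:
        alerts.append("metric is zero: likely a logging bug")
    elif yesterday and today > 2 * yesterday:
        alerts.append("metric more than doubled: possible double counting")
    return alerts
```

Crude as it is, a check like this catches exactly the "highly obvious problems" that otherwise sit undetected until someone eyeballs a dashboard.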
This matches my experience. Building tools that allow you to rebuild some or all of a dataset with minimal headache makes any individual task much easier. Both in terms of safety, and in terms of things like branching/dev environments.
also, is there a good resource on how to backfill?
Just take care to use the subscription service provider's data model how it is intended. It is possible to design your integration in a way that goes against the grain and end up with gaps in your data. For example, by re-using a single subscription instance per customer and changing its properties when the customer down/upgrades rather than creating a new Subscription instance.
And no amount of "are you really really really sure you want to delete this?" confirmations are going to fix this. You could require the whole Spongebob Squarepants ravioli ravioli give me the formuoli song and dance and people will still delete hundreds or thousands of records by accident.
This way you can do post mortems, restores, etc.
AND it's not soft delete since the data is really gone from the production table, therefore no query tweaking
Only thing: you need to really delete when GDPR related deletion is requested.
The problem with this is that it gets really cumbersome if you have a complex system of tables that depend on the main table: you'll end up having to make deleted/archived versions of all those tables. In that case it's easier to have a deleted/archived flag in the main table.
Maybe I've been spoiled, but isn't it common to have dev, test, and prod instances? Possibly multiples of the former 2?
I'm not sure how to get around this, actually. Any production service of a certain scale is going to have some amount of fake activity caused by debugging, monitoring, testing, feature demos to clients/investors/internal stakeholders... It seems naive to tell an engineering team "no test accounts in prod ever because it makes analytics harder."
To be fair, the above description paints a better picture than we have in reality. There are nuances and edge cases. But prod is kept pretty clean. Most of the problems we have are related to upgrades - these are enterprise apps that all use Oracle, and the latest updates for one might require a particular version of Oracle, but another app will be in conflict with that version. So a lot of the DBA work involves wrangling support from vendors on how to work around these. You'd think an app using Oracle 12c would run fine if you upgrade to 13c, but no it doesn't.
It is possible to get all the pieces needed to build a data server for an enterprise pre-built from cloud providers, then plumb them together so they mostly work.
When the heat comes on and people are using it for real and it must scale (even a little), it blows up horribly.
The "Lego bricks" save a lot of time and money, and mean that people with only half a clue can build large, impressive-looking systems, but in the end people like me are picking up the pieces.
I guess if your read model is based on RDBMS then it makes sense, otherwise it depends on the database system in question (i.e. some NoSQL databases like C*[1] and Riak[2] are implementing deletes by writing special tombstone values, which is kind of soft-delete but on the implementation level - but you can't easily restore the data like in case of RDBMS).
[1] https://thelastpickle.com/blog/2016/07/27/about-deletes-and-...
[2] https://docs.riak.com/riak/kv/latest/using/reference/object-...
Technically, in Postgres you can (kind of) enforce arbitrary schemas for semi-structured data using CHECK constraints. Unfortunately this isn't well-documented and NoSQL DBs often don't support similar mechanisms.
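A sketch of the CHECK-constraint approach (shown here in SQLite, whose JSON1 functions are built into most modern builds; in Postgres you'd typically use a `jsonb` column with similar expressions - the table and key names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Enforce that `payload` is valid JSON and always carries a "type" key.
conn.execute("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        payload TEXT CHECK (
            json_valid(payload)
            AND json_extract(payload, '$.type') IS NOT NULL
        )
    )
""")

conn.execute("""INSERT INTO events (payload) VALUES ('{"type": "click"}')""")

# A malformed payload is rejected by the constraint, not by app code.
try:
    conn.execute("INSERT INTO events (payload) VALUES ('not json')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

It's not a full JSON Schema validator, but it's enough to guarantee structural invariants your queries depend on.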
>The exact definition of what comprises a session typically changes as the app itself changes.
Isn't this an argument for post-hoc reconstruction? You can consistently re-run your analytics. If the definition changes in code, your persisted data becomes inconsistent, no?
A simple but useful thing is setting the database default time zone to match the one where most of your team is (instead of UTC). This reduces the chance your metrics are wrong because you forgot to set the time zone when extracting the date of a timestamp.
Build tooling around this, warn users, hell, educate them, but don't set up foot-guns like non-UTC.
If I see a timestamp without a timezone, it must always be UTC. To do anything else is to introduce insanity.
I insisted that all the tools that were going to be installed under my watch would be UTC, and never experienced any time issue on them.
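The "store UTC, convert only at the edge" discipline looks like this in Python (zoneinfo is stdlib since 3.9; the timestamp and zone are arbitrary examples):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Store and compute in UTC...
created_at = datetime(2024, 3, 10, 1, 30, tzinfo=timezone.utc)

# ...and convert only at display time, per user or per team.
local = created_at.astimezone(ZoneInfo("America/New_York"))

# The instant is the same; only the rendering differs.
same_instant = (local == created_at)
```

Because the conversion happens at render time, a DST change or a teammate in another zone never corrupts what's stored.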
https://dba.stackexchange.com/questions/12991/ready-to-use-d...
Instead of soft deletes, move records to a history table
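One way to automate the history-table pattern is a delete trigger, so application code never has to remember the archival step. A sketch (SQLite via Python; Postgres trigger syntax differs but the idea is identical, and the schema is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE orders_history (id INTEGER, total REAL, deleted_at TEXT);

    -- On delete, copy the row into the history table first.
    CREATE TRIGGER orders_archive AFTER DELETE ON orders
    BEGIN
        INSERT INTO orders_history VALUES (OLD.id, OLD.total, datetime('now'));
    END;
""")
conn.execute("INSERT INTO orders VALUES (1, 9.99)")
conn.execute("DELETE FROM orders WHERE id = 1")

live = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM orders_history").fetchone()[0]
```

The production table stays clean (no query tweaking), while the history table keeps the audit trail - though as noted elsewhere in the thread, a GDPR erasure request still has to reach the history table too.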
I agree with the session issue. Had to rebuild sessions before, and it's a pain compared to just recording them at the source.
An index for every column in the database, then wondering why inserts are slow.
seriously?
Even most of these worries, such as soft deletes, disappear if you're not trying to keep every scrap of data you can.
Focus on the core business requirements and competencies, and you likely don't need to store the minutiae of every interaction forever.