This section is totally wrong IMO. What is the alternative? "Hard" deleting records from a table is usually a bad idea (unless it is for legal reasons), especially if that table's primary key is a foreign key in another table - imagine deleting a user and then having no idea who made an order. Setting a deleted/inactive flag is by far the lesser of two evils.
>when multiplied across all the analytics queries that you’ll run, this exclusion quickly starts to become a serious drag
I disagree, modern analytics databases filter cheaply and easily. I have scaled data orgs 10-50x and never seen this become an issue. And if this is really an issue, you can remove these records in a transform layer before it hits your analytics team, e.g. in your data warehouse.
>soft deletes introduce yet another place where different users can make different assumptions
Again, you can transform these records out.
The record of an order is not intrinsically PII and thereby subject to rights of erasure. It may well be equally unlawful in some jurisdictions to irrevocably destroy it entirely, it being necessary for accounting or tax audits, or even simply for mundane follow-up processes, such as returns, that arise from actionable consumer rights. Ergo, such documents must fundamentally survive the erasure/redaction of any PII they do include.
Is it always? If that data is immutable, for example?
There is also a more sinister side, which is that the ability to hard delete something forever means that bad actors can fabricate old "deleted" documents and accuse someone of having created and then deleted them.
Assuming you have constraints set up correctly (on delete no action or on delete restrict) then how could this ever happen? If you don’t have constraints set up correctly…
So if you want to delete a user but keep the records of their orders and still know who made those orders, then some form of soft delete is probably your best option. I believe that's the point rm999 was making (in response to the article asserting that soft deletes are a "data model mistake"). Properly configured constraints can prevent an "oops" but don't really do anything to solve the problem of this sort of delete from some contexts but not others.
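As a concrete illustration of the constraint point above, here is a minimal sketch (SQLite via Python for concreteness; the same ON DELETE RESTRICT clause works in Postgres and MySQL, and the table/column names are made up) of how a properly configured foreign key blocks the "oops" hard delete:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this opt-in per connection

conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE RESTRICT
    )
""")
conn.execute("INSERT INTO users VALUES (1, 'alice')")
conn.execute("INSERT INTO orders VALUES (100, 1)")

# Hard-deleting a user who still has orders is rejected outright.
try:
    conn.execute("DELETE FROM users WHERE id = 1")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True
```

Note that this only prevents the accidental delete; as the comment says, it does nothing to let you delete the user from some contexts but not others - that still needs some flavor of soft delete.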
As long as your primary key has no business meaning, you should never have to delete a row from a table.
If you ever hear anyone bragging that their data model is entirely metadata driven, and can be used to model anything - without changing the database - that's a huge red flag, as is looking in and seeing tables called "element", "business object" and the like.
Unfortunately, for most serious Enterprise systems, a degree of flexibility is essential. It's being able to pick the right balance between hard coding first class domain objects into the database and allowing for extensibility that IMO marks the truly expert system designer.
E.g.: It should be possible to take a query definition, request its columns ("schema only" execution), and then insert or merge the columns into a table definition somewhere. Something like:
SELECT SCHEMA( SELECT * FROM "blah" )
INTO "tablename"
When a black market is formed, it's a sign that there is an unmet demand. When you see the exact same "wrong" design pattern turn up over and over, it's a sign that the underlying system isn't meeting the needs of the developers.

Some early mistakes just can't be solved without a do-over, and in my recent experience, the do-over ended up being less work than maintaining the flawed schema.
At this point a ground-up rebuild is probably going to be no slower than trying to update the existing app. Neither will be cheap.
There comes a time to refactor and fix your architecture but it's usually not at the beginning.
You can't design a data model if you don't know what you're building. And no startup really knows what they're building.
That can be said about any cost centre, but you don’t have to drag managers kicking and screaming to get them to buy fire insurance.
Practically what it does is allow the company to keep up velocity and not be distracted putting out fires everywhere.
Of course building features is the team's entire reason for existing. But there is no advantage in deferring refactoring to some later date. The longer you wait, the more painful it gets.
Chances are the time never comes, once progress stalls and the company isn’t out of business yet someone will have the brilliant idea to rewrite everything from scratch, which is just lighting money on fire with extra steps.
No matter what, startups break as they grow. You will need to fix things. Just make sure they're not sooo bad that you can't do it in a timely/affordable way.
Of course they have to iterate, the problem is that there is no deliberate effort anywhere, it’s just piling more crap on top of old crap and deluding themselves that they are some kind of lean, agile visionaries because of it.
Even when we spent a bunch of time planning out our data, we still got a lot of things wrong in hindsight. The reality is we didn't know enough about our product direction to make any truly informed decisions.
In general, poor decisions seem to stem from working in ambiguity about product, rather than poor technical decisions.
It's baffling to me that for many companies I've worked for, their data model is basically 100% tech debt that can never be fixed because the cost is too high.
For example:
> 1. Polluting your database with test or fake data
> [...] By polluting your database with test data, you’ve introduced a tax on all analytics (and internal tool building) at your company.
I feel like I'm missing something because that seems insane to me.
From my experience with Metabase, this makes it easier to use anyway, but it means you have to maintain an ETL.
I once saw something a little similar to this, except with one flavor of DB rather than several. A company you've likely heard of went hard for a certain Java graph database product, due to a combination of an internal advocate who seemed determined to be The GraphDB Guy and an engineering manager who was weirdly susceptible to marketing material. This was because some of their data could be represented as graphs, so clearly a graph database is a good idea.
However: the data for most of their products was tiny, rarely written, not even read that much really, even less commonly written concurrently, and was naturally sharded (with hard boundaries) among clients. Their use of that graph database product was plainly contributing to bugginess, operational pain, mediocre performance (it was reasonably fast... as long as you didn't want to both traverse a graph and fetch data related to that graph, then it was laughably slow) and low development velocity on multiple projects.
Meanwhile, the best DB to deliver the features they wanted quickly & with some nice built-in "free" features for them (ability to control access via existing file sharing tools they had, for instance) was probably... SQLite.
Nobody fully knew how operations, schemas, indexing or queries in any of them worked. Usually someone had managed to hack something together in a week and then the rest of the team just did minor changes to existing queries. Joining between the databases was also a fun exercise.
I blame it all on Docker. It's so easy to just docker-compose run grafana:latest, then dust off your hands and claim you have a database running. Articles on HN about how fancy a setup Netflix has also contribute to this; you don't have the same ops capacity to replicate a FAANG stack.
In the end all of it got replaced with just MongoDB, and firefighting went down to zero. Everybody on the team knew how to do everything, from new queries to migrations and backup/recovery. It's probably worse in every respect at each task the specialized databases were solving, but it works well enough, and often bringing one really good Swiss Army knife is better than a caravan of specialized machines that each require special expertise.
You can use partial indexes to only index non-deleted rows. If you are worried about having to remember to exclude deleted rows from queries: Use a view to abstract away the implementation detail from your analytics queries.
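A minimal sketch of both techniques (SQLite via Python here for concreteness; the partial index and view syntax is essentially the same in Postgres, and all names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT)")

# Partial index: only live rows are indexed, so the index stays small
# and lookups on active users don't pay for soft-deleted ones.
conn.execute("CREATE INDEX idx_users_live ON users (name) WHERE deleted_at IS NULL")

# View: analytics queries hit active_users and never need to remember
# the deleted_at filter themselves.
conn.execute("""
    CREATE VIEW active_users AS
    SELECT id, name FROM users WHERE deleted_at IS NULL
""")

conn.execute("INSERT INTO users VALUES (1, 'alice', NULL)")
conn.execute("INSERT INTO users VALUES (2, 'bob', '2024-01-01')")  # soft-deleted

active = [name for _, name in conn.execute("SELECT id, name FROM active_users")]
```

Point your BI tool at the view rather than the base table and the soft-delete flag becomes invisible to analysts.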
This is a good alternative to moving deleted records from an active table to a deleted table.
- Having informal metric and dimension definitions: you throw together something quick and dirty and then realize there's something semantically broken or uneven about your data definitions. For example, your Android and iOS apps report "countries" differently, or they have meaningfully different notions of "active users"
- Not anticipating backfill/restatement needs. Bugs in logging and analytics stacks happen as much as anywhere else, so it's important to plan for backfills. Without a plan, backfills can be major fire drills or impossible.
- Being over-attentive to ratio metrics (CTR, conversion rates), which are typically difficult to diagnose (step 1: figure out whether the numerator or the denominator is the problem). Ratio metrics can be useful for ranking N alternatives (e.g., campaign keywords), but absolute metrics are usually more useful for overall day-to-day monitoring.
- Overlooking the usefulness of very simple basic alerting. It's common for bugs to cause a metric to go to zero, or to be double counted, or to not be updated with recent data, but often times even these highly obvious problems don't get detected until manual inspection.
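The "very simple basic alerting" in the last point can be as little as a few sanity checks on each day's metric value. A sketch, with hypothetical thresholds (the doubling heuristic in particular is an assumption you'd tune):

```python
def basic_metric_alerts(today, yesterday):
    """Return alert strings for the most obvious failure modes:
    a metric missing (stale pipeline), dropping to zero (logging bug),
    or more than doubling overnight (possible double counting)."""
    alerts = []
    if today is None:
        alerts.append("metric missing: pipeline may not have run")
    elif today == 0:
        alerts.append("metric is zero: likely a logging bug")
    elif yesterday and today > 2 * yesterday:
        alerts.append("metric more than doubled: possible double counting")
    return alerts
```

Crude as it is, a check like this catches exactly the "highly obvious problems" that otherwise sit undetected until someone eyeballs a dashboard.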
This matches my experience. Building tools that allow you to rebuild some or all of a dataset with minimal headache makes any individual task much easier. Both in terms of safety, and in terms of things like branching/dev environments.
also, is there a good resource on how to backfill?
Just take care to use the subscription service provider's data model how it is intended. It is possible to design your integration in a way that goes against the grain and end up with gaps in your data. For example, by re-using a single subscription instance per customer and changing its properties when the customer down/upgrades rather than creating a new Subscription instance.
And no amount of "are you really really really sure you want to delete this?" confirmations are going to fix this. You could require the whole Spongebob Squarepants ravioli ravioli give me the formuoli song and dance and people will still delete hundreds or thousands of records by accident.
This way you can do post mortems, restores, etc.
AND it's not soft delete since the data is really gone from the production table, therefore no query tweaking
Only thing: you need to really delete when GDPR related deletion is requested.
The problem with this is that it gets really cumbersome if you have a complex system of tables that depend on the main table: you'll end up having to make deleted/archived versions of all those tables. In that case it's easier to have a deleted/archived flag in the main table.
Maybe I've been spoiled, but isn't it common to have dev, test, and prod instances? Possibly multiples of the former 2?
I'm not sure how to get around this, actually. Any production service of a certain scale is going to have some amount of fake activity caused by debugging, monitoring, testing, feature demos to clients/investors/internal stakeholders... It seems naive to tell an engineering team "no test accounts in prod ever because it makes analytics harder."
To be fair, the above description paints a better picture than we have in reality. There are nuances and edge cases. But prod is kept pretty clean. Most of the problems we have are related to upgrades - these are enterprise apps that all use Oracle, and the latest updates for one might require a particular version of Oracle, but another app will be in conflict with that version. So a lot of the DBA work involves wrangling support from vendors on how to work around these. You'd think an app using Oracle 12c would run fine if you upgrade to 13c, but no it doesn't.
It is possible to get all the pieces needed to build a data server for an enterprise pre-built from cloud providers, then plumb them together so they mostly work.
When the heat comes on and people are using it for real and it must scale (even a little), it blows up horribly.
The "Lego bricks" save a lot of time and money, and mean that people with only half a clue can build large, impressive-looking systems, but in the end people like me are picking up the pieces.
I guess if your read model is based on RDBMS then it makes sense, otherwise it depends on the database system in question (i.e. some NoSQL databases like C*[1] and Riak[2] are implementing deletes by writing special tombstone values, which is kind of soft-delete but on the implementation level - but you can't easily restore the data like in case of RDBMS).
[1] https://thelastpickle.com/blog/2016/07/27/about-deletes-and-...
[2] https://docs.riak.com/riak/kv/latest/using/reference/object-...
Technically, in Postgres you can (kind of) enforce arbitrary schemas for semi-structured data using CHECK constraints. Unfortunately this isn't well-documented and NoSQL DBs often don't support similar mechanisms.
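A sketch of the CHECK-constraint approach (shown here in SQLite, whose JSON1 functions are built into most modern builds; in Postgres you'd typically use a `jsonb` column with similar expressions - the table and key names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Enforce that `payload` is valid JSON and always carries a "type" key.
conn.execute("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        payload TEXT CHECK (
            json_valid(payload)
            AND json_extract(payload, '$.type') IS NOT NULL
        )
    )
""")

conn.execute("""INSERT INTO events (payload) VALUES ('{"type": "click"}')""")

# A malformed payload is rejected by the constraint, not by app code.
try:
    conn.execute("INSERT INTO events (payload) VALUES ('not json')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

It's not a full JSON Schema validator, but it's enough to guarantee structural invariants your queries depend on.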
>The exact definition of what comprises a session typically changes as the app itself changes.
Isn't this an argument for post-hoc reconstruction? You can consistently re-run your analytics. If the definition changes in code, your persisted data becomes inconsistent, no?
A simple but useful thing is setting the database default time zone to match the one where most of your team is (instead of UTC). This reduces the chance your metrics are wrong because you forgot to set the time zone when extracting the date of a timestamp.
Build tooling around this, warn users, hell, educate them, but don't set up foot-guns like non-UTC.
If I see a timestamp without a timezone, it must always be UTC. To do anything else is to introduce insanity.
I insisted that all the tools that were going to be installed under my watch would be UTC, and never experienced any time issue on them.
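The "store UTC, convert only at the edge" discipline looks like this in Python (zoneinfo is stdlib since 3.9; the timestamp and zone are arbitrary examples):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Store and compute in UTC...
created_at = datetime(2024, 3, 10, 1, 30, tzinfo=timezone.utc)

# ...and convert only at display time, per user or per team.
local = created_at.astimezone(ZoneInfo("America/New_York"))

# The instant is the same; only the rendering differs.
same_instant = (local == created_at)
```

Because the conversion happens at render time, a DST change or a teammate in another zone never corrupts what's stored.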
https://dba.stackexchange.com/questions/12991/ready-to-use-d...
Instead of soft deletes, move records to a history table
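One way to automate the history-table pattern is a delete trigger, so application code never has to remember the archival step. A sketch (SQLite via Python; Postgres trigger syntax differs but the idea is identical, and the schema is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE orders_history (id INTEGER, total REAL, deleted_at TEXT);

    -- On delete, copy the row into the history table first.
    CREATE TRIGGER orders_archive AFTER DELETE ON orders
    BEGIN
        INSERT INTO orders_history VALUES (OLD.id, OLD.total, datetime('now'));
    END;
""")
conn.execute("INSERT INTO orders VALUES (1, 9.99)")
conn.execute("DELETE FROM orders WHERE id = 1")

live = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM orders_history").fetchone()[0]
```

The production table stays clean (no query tweaking), while the history table keeps the audit trail - though as noted elsewhere in the thread, a GDPR erasure request still has to reach the history table too.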
I agree with the session issue. Had to rebuild sessions before, and it's a pain compared to just recording them at the source.
An index for every column in the database, then wondering why inserts are slow.
seriously?
Even most of these worries, such as soft deletes, disappear if you're not trying to keep every scrap of data you can.
Focus on the core business requirements and competencies, and you likely don't need to store the minutiae of every interaction forever.