I mean, every time you write to the database, you open a transaction, do your stuff, and close it. It either works or it doesn't. If it doesn't, then it rolls back. Isn't that enough?
I guess I am saying that for complex integrity checks, it should be the responsibility of the application's data layer.
On the other hand, what is meant by "complex"? Is it "complex" because of numerous, ever-changing applications that touch the database, or is it just one application and a database with a lot of tables?
In the case of wild, numerous applications, I suppose that database-side integrity controls are needed. If you really have ONE application touching the database, it might be better to have the application police integrity. Perhaps there's more than one answer?
One way to resolve it with triggers is to have a stored procedure fetch all the records that intersect the new booking's interval, and procedurally count the number of concurrent bookings at each record's start and end times; the constraint would be violated if and only if the count at any point plus the pax of the new booking exceeds capacity.
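The procedural check described above can be sketched in plain Python (the booking/capacity model, field names, and sample numbers are all invented for illustration; a real trigger would do this in the database's procedural language over the actual bookings table):

```python
# Sketch of the trigger's logic: among existing bookings that intersect the
# new booking's interval, count concurrent pax at each interval boundary and
# reject the new booking if any point would exceed capacity.

def booking_allowed(existing, new_start, new_end, new_pax, capacity):
    """existing: list of (start, end, pax) tuples with comparable times."""
    # Only bookings that intersect the new interval can affect the check.
    overlapping = [(s, e, p) for (s, e, p) in existing
                   if s < new_end and e > new_start]
    # Concurrency can only peak at a start or end time of some booking.
    points = {s for (s, _, _) in overlapping} | {e for (_, e, _) in overlapping}
    points |= {new_start}
    for t in points:
        if not (new_start <= t < new_end):
            continue  # only points inside the new booking's interval matter
        concurrent = sum(p for (s, e, p) in overlapping if s <= t < e)
        if concurrent + new_pax > capacity:
            return False  # constraint would be violated at time t
    return True

# Two overlapping 3-pax bookings already occupy 6 of capacity 8 at 10:00,
# so another 3-pax booking covering 10:00-11:00 must be rejected.
existing = [(10, 12, 3), (9, 11, 3)]
print(booking_allowed(existing, 10, 11, 3, 8))  # False
print(booking_allowed(existing, 12, 13, 3, 8))  # True
```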
I am not sure that I would call these "tricks"
There are two major problems with handling these on the application side:
1. You can no longer guarantee consistency.
2. When a check has to be done procedurally, your application can no longer ask the database a single yes/no query to determine whether a constraint is violated; it has to pull intermediate results over the wire, which is a performance penalty compared to doing the whole check inside the database server.
edit: of course, when you need to ensure consistency with some data source outside the database, there's no choice but to do it application-side.
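The yes/no query pattern from point 2 can be illustrated with a toy sqlite3 example (the `seats` table and function names are made up): the "is the constraint violated?" question is answered inside the database, so only a single boolean crosses the wire rather than a set of intermediate rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seats (room TEXT, seat INTEGER)")
conn.executemany("INSERT INTO seats VALUES (?, ?)",
                 [("101", 1), ("101", 2), ("102", 1)])

def seat_taken(conn, room, seat):
    # EXISTS collapses the check to a single 0/1 value; no rows are shipped
    # to the application for it to loop over.
    row = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM seats WHERE room = ? AND seat = ?)",
        (room, seat)).fetchone()
    return bool(row[0])

print(seat_taken(conn, "101", 2))  # True
print(seat_taken(conn, "101", 3))  # False
```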
shudder
Triggers are almost always the wrong level at which to handle consistency. Having to handle consistency problems after the fact flies in the face of 'fail fast', and if you need 'last line' protection, why not bake it into the structure instead?
If you put the validation in the database, any application can leverage it. If you want to fail fast, you can still do validation in the application layer while relying on the DB layer as the ultimate validator.
- You check for consistency issues on the client-side UI level, and give the user instant feedback when something is not right.
- You also create a db trigger to enforce the rule in the database itself. This way you, as a developer, can write your code with the assumption that the data is gonna be consistent when it comes from the database, even if something went wrong on the client UI level at some point.
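A minimal sqlite3 sketch of the second bullet (the table and rule are invented for illustration; real systems would use their database's trigger syntax): the UI can pre-validate all it wants, but the trigger makes the database itself the backstop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, quantity INTEGER NOT NULL);

-- The rule lives in the database, so no code path can bypass it.
CREATE TRIGGER orders_quantity_positive
BEFORE INSERT ON orders
WHEN NEW.quantity <= 0
BEGIN
    SELECT RAISE(ABORT, 'quantity must be positive');
END;
""")

conn.execute("INSERT INTO orders (quantity) VALUES (5)")      # accepted
try:
    conn.execute("INSERT INTO orders (quantity) VALUES (0)")  # trigger fires
except sqlite3.IntegrityError as e:
    print(e)  # quantity must be positive
```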
With more complex checks it might be necessary to build a check mechanism that can be run periodically. Running it at startup might be a good approach, or some sort of cron job. I can imagine that in larger systems there might even be separate services to check data integrity across other services.
Sometimes I add the same or similar constraints in the application layer because it leads to better error/validation messages.
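One way this duplication can look in practice, sketched with sqlite3 (the `users` table and messages are made up): the CHECK constraint is the backstop, while the application repeats the rule to produce a friendlier message than a raw SQL error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER CHECK (age >= 0))")

def add_user(conn, name, age):
    if age < 0:                      # application-layer copy of the rule
        return "Age cannot be negative - please check your input."
    try:
        conn.execute("INSERT INTO users VALUES (?, ?)", (name, age))
        return "OK"
    except sqlite3.IntegrityError:   # database backstop, terse error
        return "Database rejected the record."

print(add_user(conn, "Ada", 36))   # OK
print(add_user(conn, "Bob", -1))   # friendly validation message
```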
IMO having data integrity enforced in the database instead of application code is similar to having a type system enforcing code integrity at compile time instead of at runtime. It's about catching errors as early and as close to the root as possible.
There's also a neat little trick with indexed views and cross join[1] that you can use for more complex constraints. I've never used it for anything more than reference tables, so I can't comment on its ability to scale.
[1] https://spaghettidba.com/2011/08/03/enforcing-complex-constr...
Reference tables for things that don't change or change very seldom are a valuable tool if data across other tables must agree. Check constraints otherwise.
But that's just the bones -- the last line of defense. Having a framework that can push such constraints through the application is very helpful so you are not waiting for SQL errors to enforce integrity.
It is easy to go overboard, though, and it is worth thinking about how important data integrity and agreement actually is before writing it in stone, so to speak.
In fact, the foreign side of the FK needs to be covered by a unique constraint (including a PK, which is simply a UNIQUE constraint with slight differences).
I've also heard them called 'Verification Tables'. They are basically check constraints, but stored in a table for reference. If you are using a check constraint in more than one place, and it goes beyond a simple 'Y'/'N', it's probably worth thinking about using a reference table instead.
These tables will generally not be changed very often and can be considered 'read only' for the most part. At their most pure, an application can count on certain values to exist in a reference table and might even verify this at startup or populate them with application properties. They could be application level properties if not for the referential integrity needs.
It is often useful, however, to have some reference tables that are only 'mostly read only'. Changeable but largely static.
As a trivial example, a table like state_abbreviations. It is a finite list of data that is unlikely to change.
Data that comes from the government is a pretty good candidate for this kind of thing. It changes, but not often, and you don't want to give users the option to put whatever they want. e.g. Federal School Codes.
Other good candidates are structural aspects of your application. Say, a list of modules that users can navigate to in order to perform a certain job function. Perhaps you want to give them the ability to compose security roles around those and then apply those roles to users. Also, say you want to use the same names for menus or reference them elsewhere in a consistent way: a reference table might be the way to go.
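A minimal sketch of the pattern with sqlite3 (schema invented; the state_abbreviations example is from the comment above): the codes live in their own table, other tables point at it with a foreign key, and the application can verify the values it counts on at startup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # sqlite needs this per connection
conn.executescript("""
CREATE TABLE state_abbreviations (
    code TEXT PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE addresses (
    id    INTEGER PRIMARY KEY,
    city  TEXT,
    state TEXT NOT NULL REFERENCES state_abbreviations(code)
);
INSERT INTO state_abbreviations VALUES ('OH', 'Ohio'), ('TX', 'Texas');
""")

# Startup check: the values the application counts on must exist.
codes = {row[0] for row in conn.execute("SELECT code FROM state_abbreviations")}
assert {"OH", "TX"} <= codes

conn.execute("INSERT INTO addresses (city, state) VALUES ('Austin', 'TX')")
try:
    conn.execute("INSERT INTO addresses (city, state) VALUES ('Nowhere', 'ZZ')")
except sqlite3.IntegrityError:
    print("rejected: 'ZZ' is not in the reference table")
```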
So basically a normalised database
In Microsoft SQL Server, you can often achieve the desired result with check constraints, primary and foreign keys, indexed views, and unique indexes. When more logic is required, I usually restrict write access from the application at the table or column level and force modifications to occur through a stored procedure that enforces integrity.
Unfortunately, unlike in compiled, typed languages, database foreign keys are just straight-up runtime overhead.
Using types correctly, such as storing numbers, dates, and UUIDs in the actual database type rather than as strings, is an obvious one that typically improves space and performance; there is normally not much trade-off here. VARCHAR with a specific length I go back and forth on: it has saved me now and then by catching something not properly validated for length in the app, but it does get in the way when the length needs to expand, which happens quite often.
For coded values I like a code table per code type with foreign keys rather than, say, EAV; this has saved me many times in data quality and allows extra attributes to be added to specific types. It matches more closely to the type system in the client too, which will have a specific type for a specific coded value. It also allows reflection-like qualities where you can query the schema and follow the types. The downside is that in a large app you can have many hundreds or more of code tables, and it's more work to cross-cut features, but the benefits outweigh the downsides for me. This also goes for more full-blown "entities", which are more obvious.
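A hedged illustration of "a code table per code type" (schema and names invented), sketched with sqlite3: each code type gets its own table, so type-specific attributes can be added to one without polluting the others, unlike a single generic EAV/codes table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- One table per code type, not one generic 'codes' table.
CREATE TABLE order_status (
    code        TEXT PRIMARY KEY,
    description TEXT NOT NULL,
    is_terminal INTEGER NOT NULL DEFAULT 0  -- attribute specific to this type
);
CREATE TABLE shipping_method (
    code        TEXT PRIMARY KEY,
    description TEXT NOT NULL
);
CREATE TABLE orders (
    id       INTEGER PRIMARY KEY,
    status   TEXT NOT NULL REFERENCES order_status(code),
    shipping TEXT NOT NULL REFERENCES shipping_method(code)
);
INSERT INTO order_status VALUES ('OPEN', 'Open', 0), ('DONE', 'Completed', 1);
INSERT INTO shipping_method VALUES ('STD', 'Standard'), ('EXP', 'Express');
""")

conn.execute("INSERT INTO orders (status, shipping) VALUES ('OPEN', 'EXP')")
# A typo'd code is rejected by the foreign key, not silently stored:
try:
    conn.execute("INSERT INTO orders (status, shipping) VALUES ('OEPN', 'STD')")
except sqlite3.IntegrityError:
    print("rejected: unknown status code")
```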
I don't normally use more complicated constraints or triggers; they tend to get in the way of things like ETL loads, and many complex rules end up working out much better in the app code. I am torn about it, but the procedural languages in the DB are usually less expressive, offer less reuse, and can't do the other things needed, such as alerting and messaging.
Bottom line, though: don't be dogmatic. Use the database as you can, test performance, and make the trade-offs. Sometimes constraints and triggers are better due to their locality to the data; sometimes leaving off foreign keys is better due to the overhead. Sometimes EAV is better for flexibility (or JSON).
In an ideal world we would run the same code in any tier (client, app server, database) for this stuff, choosing the best location to run it based on the needs (or run it in multiple tiers: client for user experience, database for transactional correctness). You can almost get there with JavaScript and PL/V8 nowadays, and sort of with .NET and SQL Server, but it's really not there yet.
- I use Postgres. This is important. Postgres has long been one of the most stable databases. It has multiversion concurrency control and even lets you make changes to tables within transactions, which makes such changes easier. It has many features, like check constraints, triggers, procedural languages, and JSON, which make it easy to move more stuff into your database.
- I have shell access to Linux virtual machines. I don't have layers of frameworks. I use the psql command-line tool and can log in as the superuser, postgres, when I need to. So my use of the database is not inhibited in any way.
- I write internal business apps. They can have complex business rules, but it is not web scale --- though I would be comfortable serving hundreds of requests per second for tens of thousands of users with my set-up, which is just a single virtual machine. I'm being conservative, and it could probably handle 100 times that.
I began 14 years ago as most web programmers do: keeping the database as a dumb data store and writing everything in the middle layer. But I have found maintenance easier, and the overall codebase much, much smaller, by keeping my database in the know as much as possible.
- Are the values in a column supposed to be from a limited set of choices? Tell it! Put those choices in a table, and link the column to that table with a foreign key.
- Is a number supposed to be within a certain range? Is the date in a column supposed to be after a date in another column? Tell it! Use CHECK constraints.
- Is the data supposed to be sorted or summed or transformed before being printed on the page? Tell it! Put the gob of SQL into a view, so that all your middle layer does is "select * from view".
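The three "tell the database" bullets above can be sketched in one go with sqlite3 (the tickets schema and names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- 1. Limited set of choices: put them in a table, link with a foreign key.
CREATE TABLE ticket_status (status TEXT PRIMARY KEY);
INSERT INTO ticket_status VALUES ('open'), ('closed');

-- 2. Ranges and cross-column rules: CHECK constraints.
CREATE TABLE tickets (
    id       INTEGER PRIMARY KEY,
    status   TEXT NOT NULL REFERENCES ticket_status(status),
    priority INTEGER NOT NULL CHECK (priority BETWEEN 1 AND 5),
    opened   TEXT NOT NULL,
    closed   TEXT,
    CHECK (closed IS NULL OR closed >= opened)
);

-- 3. Sorting/summarising lives in a view; the middle layer just selects it.
CREATE VIEW open_tickets AS
    SELECT id, priority FROM tickets
    WHERE status = 'open'
    ORDER BY priority DESC;
""")

conn.execute("INSERT INTO tickets (status, priority, opened) "
             "VALUES ('open', 3, '2024-01-01')")
print(conn.execute("SELECT * FROM open_tickets").fetchall())  # [(1, 3)]
```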
Some might suggest that you take a performance hit by not using stored procs etc, but in reality it's never been a problem for my apps (the largest one has over 150k active users - https://usebx.com).
If the application is the only way to access the data, it would appear consistent until it wasn't. Assuming two application processes, process one would not immediately see the mistake it made, but process two would see it on startup. So, an audit table might be a good idea, to see which process (or user) made which change, including the original erroneous entry and any change needed to remedy the failed constraint. If the first process is still active, the second process could warn it, but that leaves a lot to chance.

In comparison, once in place, relational database constraints are always "on" and enforced globally. Enforcement on a per-transaction basis might impose a consistent slowdown, but still be preferable to a long startup time spent checking the integrity of a large dataset with potentially many errors.

To keep startup times from growing under boot-time integrity checking, you may have to partition the data into more recently and less recently accessed data. In that case, you may find inconsistencies in reporting where the application only touched an earlier set of data once and didn't enforce integrity checking. On the other hand, you could be presented with an opportunity to remedy the data before you receive the report. If the permitted entries change over time, you would need some kind of date ranges on those values, so that the person receiving the report doesn't enter historically inaccurate data while fixing their own report.
I've opted to build an "API" within the database consisting of views for read-access and stored procedures that can validate relationships upon insert/update. It's pretty clean so far as it allows the data needed for integrity checks to remain in the database.
This hasn't made it to production yet, so take it with a grain of salt :)
Columns should be NOT NULL WITH DEFAULT; nulls introduce unnecessary additional WHERE-clause checks.
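The extra WHERE-clause check in question, shown with a throwaway sqlite3 table (names invented): a nullable column silently drops rows from a simple predicate, forcing an `OR col IS NULL` that a NOT NULL column with a default would make unnecessary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flags (name TEXT, archived TEXT)")  # archived is nullable
conn.executemany("INSERT INTO flags VALUES (?, ?)",
                 [("a", "N"), ("b", "Y"), ("c", None)])

# Intent: "everything not archived". The NULL row vanishes, because
# NULL <> 'Y' evaluates to unknown, not true:
print(conn.execute(
    "SELECT name FROM flags WHERE archived <> 'Y'").fetchall())
# [('a',)]  -- 'c' is missing

# The extra check that a NOT NULL DEFAULT 'N' column would avoid:
print(conn.execute(
    "SELECT name FROM flags "
    "WHERE archived <> 'Y' OR archived IS NULL").fetchall())
# [('a',), ('c',)]
```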
SQL requires defining integrity in advance, and if your database gets corrupted, there is little hope of recovery. Cleaning up your data is nearly impossible because your schemas and logic are tightly coupled.
Writing garbage collection scripts is easy if you make a practice of it, and makes you think about your schema in the ways that are important and that the ease of triggers hides from you. You can monitor and metricize it in ways that you can't with triggers.
These problems probably exist, but they’re very rare. Much more rare than people THINKING they have a problem like this, hacking around it with poorly implemented application code, and ending up with data that has integrity issues, when it could have been solved well using the DB properly.
There’s no “room” there for foreign key constraints nor pervasive use of transactions. We used limited transactions in payments area only (and eventually even retired the last of those) and no foreign keys.
Reasonably careful coding, doing a lot of work in stored procedures, and “fix it if it breaks” was the order of the day. I won’t claim it was academically pure, but it sure did work based on the scoreboard (cash register).
The main reason is change. As I evolve the code base, I might want to do something today which is principled, but seemed unprincipled yesterday. Database constraints have tended to hit me the hardest at the worst times too, e.g. being unable to make an emergency fix in prod. Or prod can't come up because the flyway script (which succeeded in test/stage) got snagged on a constraint when it ran on prod data.
Could you not then simply use a NoSQL solution?
I keep all of my logic in the application and only one “service” can write to related tables.
I also think that you should always write software as domain specific “microservices”. Not necessarily out of process, separately released microservices, but the Domain Driven Design concept of “services” that can just as easily be part of a monolithic project with different modules or if necessary shared between teams via versioned modules.
It is so much easier to deploy, rollback, and version code than databases.
Complex constraints sound like an app/domain responsibility: much easier to test, version, troubleshoot, and evolve.
When a solution cannot be tested quickly and easily, it impacts overall agility (small iterations become small waterfalls). Having complex constraints in the database could mean slower release cycles, heavier processes (schema-change reviews, etc.), and more migrations.