Show me flowcharts and/or tables, it doesn't matter; I'll continue to be mystified.
I love this statement. It's true too, having seen a decades-old database that needed to be converted to Postgres. The old application was going to be thrown away, but the data was still relevant :).
Databases in heavy use will not just outlast your application; they have a strong chance of outlasting your career, and they may very well outlast you as a person.
It's going to be interesting when this same problem occurs years from now when people are trying to reverse schemas from NoSQL databases or if they become difficult to extract.
The only sticking point is when business logic is put into stored procedures. On one hand if you're building an app on top of it, there's a temptation to extract and optimize that logic in your new back-end. On the other hand, it is kind of nice to even have it at all should the legacy app go poof.
Also, in my experience, the database is almost always the main cause of any performance issues. I would much rather hire someone who is very good at making the database perform well than making the front end perform well. If you are seeking to be a full stack developer, devote much more time to the database layer than anything else.
I would be careful with the term "cause". There is a symbiotic relationship between the application and the database. Or, if talking to a DBA... a database and its applications. Most databases can store arbitrary sets of information... but how the data is stored (read: structure) must take into account how it is to be used. When the database designer is told up-front (by the app dev team), considerations can be made to optimize performance along whatever vector is most desired (e.g. read speed, write speed, consistency, concurrency, etc.). Most database performance issues result when these considerations are left out. Related: just because a query works (i.e. returns the right data) does not mean it's the best query.
If your database is great, at least you have the option of a fast backend.
More generically, state stores are almost always bottlenecks (they tend to be harder to scale without some tradeoff)
I’ve been working with and on databases for a long, long time, and I’ve even written about things I think people should know about if they want to do this, yet I never came up with such great insight. This is so true it should be engraved somewhere. Hats off!
1. If you store data in rows, it's quite fast to insert/update/delete individual rows. Moreover, it's easy to do it concurrently. However, reads can be very slow, because you read the entire table even if you only scan a single column. That's why OLAP databases use column storage.
2. If you store data sorted in the table, reading ranges based on the sort key(s) is very fast. On the other hand, inserts may spray writes over the entire table, (eventually) forcing writes to all blocks, which is very slow. That's why many OLTP databases use heap (unsorted) row organization.
In small databases you don't notice the differences, but they become dominant as volume increases. I believe this fact alone explains a lot of the proliferation of DBMS types as enterprise datasets have grown larger.
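A toy Python sketch of the layout difference in point 1 (the table and sizes here are made up): with row storage, summing one column still touches every record, while column storage keeps each column contiguous so the scan reads only the values it needs.

```python
# Hypothetical table of 1000 user records.
rows = [{"id": i, "name": f"user{i}", "age": i % 90} for i in range(1000)]

# Row-oriented: to sum "age" we iterate over whole records,
# dragging "id" and "name" through memory along the way.
row_sum = sum(r["age"] for r in rows)

# Column-oriented: each column is stored contiguously on its own,
# so summing "age" reads only that column's values.
columns = {
    "id": [r["id"] for r in rows],
    "name": [r["name"] for r in rows],
    "age": [r["age"] for r in rows],
}
col_sum = sum(columns["age"])

assert row_sum == col_sum  # same answer, very different I/O pattern at scale
```

In a real engine the difference shows up as pages read from disk rather than Python iteration, but the shape of the trade-off is the same.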
Edit: minor clarification
I’ve migrated ORM several times, and the only thing that changes is the entity definition. The database remains the same.
I don't know if there is a single soul who believes this. If you are designing a database, it is much cooler than front-end apps.
https://users.ece.utexas.edu/~adnan/pike.html
Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
I'm curious: what's the alternative to NULL? I'm struggling to think of a database where NULL wouldn't be super useful. It feels like NULL as a concept is almost required, but I think you're suggesting that's a faulty assumption.
Would love to hear more about this.
But if a column truly has unknown values, NULLs are the best way to represent it. It is sometimes suggested to use "sentinel values" like an empty string or -1 to represent missing values, but IMHO this is much worse than NULLs, since sentinels are treated as regular values by operators. When you have missing values, you want three-valued logic.
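A quick sqlite3 sketch of the difference (the table and values are made up): NULL participates in three-valued logic, while a sentinel like -1 is just another number as far as comparison operators are concerned.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, age INTEGER)")
con.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("ann", 30), ("bob", None), ("cho", 25)],  # bob's age is unknown
)

# Three-valued logic: NULL < 28 evaluates to UNKNOWN, so bob is
# excluded from both the predicate and its negation.
(young,) = con.execute("SELECT COUNT(*) FROM people WHERE age < 28").fetchone()
(old,) = con.execute("SELECT COUNT(*) FROM people WHERE NOT age < 28").fetchone()
assert young == 1 and old == 1  # 1 + 1 != 3 rows: bob is in neither bucket

# A sentinel is treated as a regular value by operators:
con.execute("UPDATE people SET age = -1 WHERE age IS NULL")
(young,) = con.execute("SELECT COUNT(*) FROM people WHERE age < 28").fetchone()
assert young == 2  # bob now silently counts as "young"
```

The sentinel version gives a confidently wrong answer, which is exactly the failure mode three-valued logic exists to avoid.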
Fixing up a database step by step is a painful process.
I really like how he (Martin Kleppmann) in the book starts with a primitive data structure for constructing a database design, and then evolves the system slowly and describes the various trade-offs in building a database from the ground up.
If you're choosing something other than an RDBMS - you should rethink why.
Because unless you're at massive scale (which still doesn't justify it), choosing something else is rarely the right decision.
Except the most important problem: A pleasant API. Which is, no doubt, why 95% of those considering something other than an RDBMS are making such considerations.
RDBMS can have pleasant APIs. It is not a fundamental limitation. We have built layers upon layers upon layers of abstraction over popular RDBMSes to provide nice APIs and they work well enough. But those additional layers come with a lot of added complexity and undesirable dependencies that most would prefer to see live in the DBMS itself instead.
At least among the RDBMSes we've heard of, there does not seem to be much interest in improving the APIs at the service level to make them more compelling to use natively like alternative offerings outside of the relational space have done.
You either model things in a very domain specific and classic fashion. Here you get the benefit of being quite declarative and ad-hoc queries are natural. Also your schema is stronger, as in it can catch more misuse by default. But this kind of schema tends to have _logical_ repetition and is so specific that change/evolution is quite painful, because every new addition or use-case needs a migration.
Or you model things very generically, more data driven than schema driven. You lose schema strength and you definitely lose sensible ad-hoc queries. But you gain flexibility and generality and can cover much more ground.
You can kind of get around this dichotomy with views, perhaps triggers and such. In an ideal world you'd want the former to be your views and the latter to be your foundational schema.
But now you get into another problem, which is that homogeneous tables are just _super_ rigid as result sets. There are plenty of very common types you cannot cover. For example tagged unions, or any kind of even shallowly nested result (extremely common use case), or multiple result groups in one query. All of these things usually mean you want multiple queries (read transaction) or you use non-SQL stuff like building JSONs (super awkward).
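For illustration, here is a small sqlite3 sketch (the tables are made up) of the "multiple queries" route for a shallowly nested result: SQL hands back two flat, homogeneous result sets, and the nesting has to be stitched together in application code.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE items  (order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO items  VALUES (1, 'a1'), (1, 'a2'), (2, 'b1');
""")

# A single flat result set can't express "each order with its list of
# items", so we issue two queries and assemble the nesting ourselves.
orders = {oid: {"customer": c, "items": []}
          for oid, c in con.execute("SELECT id, customer FROM orders")}
for oid, sku in con.execute("SELECT order_id, sku FROM items"):
    orders[oid]["items"].append(sku)

assert orders[1]["items"] == ["a1", "a2"]
```

A single joined query would work too, but then the order's columns repeat on every item row and you deduplicate in the app instead, which is the same awkwardness in a different place.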
If you can afford to use something like SQLite, then some of the concerns go away. The DB is right there so it's fine to query it repeatedly in small chunks.
I wonder if we're generally doing it wrong though, especially in web development. Shouldn't the backend code quite literally live on the database? I wish my backend language would be a data base query language first and a general purpose language second, so to speak. Clojure and its datalog flavors come close. But I'm thinking of something even more integrated and purpose built.
A pleasant API is clearly not the most important business problem a database is there to solve.
The data in it is presumably the life and blood of the business, whereas the API is something only developers need to deal with.
But that aside, the interface will be SQL which is quite powerful, long-lived (most important) and, fortunately, very pleasant.
For that, there are stored procedures.
I'm with you on using an RDBMS for almost everything, but worked on quite a few projects where alternatives were needed.
One involved a lot of analytics queries (aggregations, filters, grouping etc.) on ~100-200GB of data. No matter what we tried, we couldn't get enough performance from Postgres (column-based DBs / Parquet alternatives gave us 100x speedups for many queries).
Another was for storing ~100M rows of data in a table with ~70 columns or so of largely text based data. Workload was predominantly random reads of subsets of 1M rows and ~20 columns at a time. Performance was also very poor in Postgres/MySQL. We ended up using a key/value store, heavily compressing everything before storing, and got a 30x speedup compared to using an RDBMS using a far smaller instance host size.
I wouldn't call either of them massive scale, more just data with very specific query needs.
Kimball's dimensional modelling helps a lot in cases like this, since probably there is a lot of repeated data in these columns.
Every day, I see people struggling with problems that would be easy to understand if they had one. You don't even need to have an RDBMS; they are good just to model how things are related to each other.
> choosing something else is rarely the right decision
I think this is a little bit of a 'We always did it this way' statement.
More often than not, it is worth some time spent thinking and planning to work out at least the core requirements in that area, to save yourself a lot of refactoring (or throwing away and restarting) later, and to avoid hitting bugs in production that a relational DB with well-defined constraints could have caught while still in dev.
Programming is brilliant. Many weeks of it sometimes save you whole hours of up-front design work.
Contrary to what people seem to assume, you actually can change the schema of a database and migrate the existing data to the new schema. There's a learning curve, but it's doable.
If you go schema-less, you run into another problem: not knowing the past shape of your data. When you try to load old records (from previous years), you may find that they don't look like the ones you wrote recently. And, if your code was changed, it may fail to handle them.
This makes it hard to safely change code that handles stored data. You can avoid changing that code, you can accept breakage, or you can do a deep-dive research project before making a change.
If you have a schema, you have a contract about what the data looks like, and you can have guarantees that it follows that contract.
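As a minimal sketch of what such a migration looks like (the table and column names are made up), using sqlite3: add the column, then backfill the existing rows so every record, old and new, follows the new contract.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO users (name) VALUES ('ann'), ('bob')")

# Migration: add a column, then backfill existing rows so old data
# conforms to the new shape. After this, code reading "status" never
# has to guess what records from before the change look like.
con.execute("ALTER TABLE users ADD COLUMN status TEXT")
con.execute("UPDATE users SET status = 'active' WHERE status IS NULL")

rows = con.execute("SELECT name, status FROM users ORDER BY id").fetchall()
assert rows == [("ann", "active"), ("bob", "active")]
```

Real migrations on large tables need more care (batching the backfill, deploying code that tolerates both shapes during the rollout), but the contract at the end is the point.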
I think a valid reason for not choosing a relational database is if your business plan requires that you grow to be a $100B+ company with hundreds of millions of users. Otherwise, you will probably be fine with RDBMS, even if it will require some optimizing in the future.
and that you ruled out using a JSON string column (or columns) as a dump for the uncertain parts, de-normalization with indexing, and the EAV schema as potential solutions to your problems.
The point is nothing is free, and you have to be sure it's a price you're willing to pay.
Are you ready to give up joins? To have your data be modeled after the exact queries you're going to make? For your data to be duplicated across many places? Etc...
That's a very good reason for going with a RDBMS even if looks like it's not the clearest winner for your use case.
If you invert any of those conditions, it may become interesting to study alternatives.
I’m curious how you handle this with less engineering effort without using an RDBMS.
What's interesting is query performance, and a RDBMS supports explicit control over indexing (usually including analyzing execution plans to find out which queries are going to work well). Where do you see "a terrible waste of engineering effort"?
If that assumption is true, then it follows that the same argument used in the last statement also applies: if you're not at massive scale, then the aforementioned tradeoff of not using an RDBMS is likely de minimis.
(This assumes that the tradeoffs are of the magnitude that they only manifest impact at scale, hard to address that without concrete examples though)
The tradeoff is usually flexibility. You run into flexibility problems anytime requirements change. Scale doesn't factor in.
I have used both and have never regretted NOT using an RDBMS. Maybe it's a taste thing, but I'd rather use a simple K/V database than a relational database any day.
I want to be able to treat the servers in my database tier as cattle instead of pets and RDBMSs don't fit this paradigm well. Either NoSQL or NewSQL databases are, in my opinion, a much better place to start.
I feel like RDBMSs being the "default" option is because most people have worked with them in the past and already understand them. It doesn't mean they are the best tool for the job or even the tool most likely to solve the unknown problems you'll encounter in the future.
It's even worse than this with MS SQL Server. When using the READ UNCOMMITTED isolation level it's actually possible to read corrupted data, e.g. you might read a string while it's being updated, so the result row you get contains a mix of the old value and new value of the column. SQL Server essentially does the "we got a badass over here" Neil deGrasse Tyson meme and throws data at you as fast as it can. Unfortunately I've worked on several projects where someone apparently thought that READ UNCOMMITTED was a magic "go fast" button for SQL and used it all throughout the app.
Dirty reads incidentally weren't supported for quite some time in the Sybase architecture (which forked to MS SQL Server in 1992). There was a Sybase effort to add dirty read support around 1995 or so. The project name was "Lolita."
Indexes are a nightmare to get right. Often performance optimizations of SQL databases include removing indexes as much as adding indexes.
If you are seeing performance gains from removing indexes, then I'm assuming your workload is very heavy on writes/updates compared to reads.
But yes, the issue with too many indexes is more often that they harm write performance.
A related issue is indexes that are too wide, either covering many columns or "including" them. As well as eating disk space, they also eat extra memory (and potentially cause extra IO load) when used: fewer rows per page, so more pages loaded into RAM for the same query.
Both problems together (too many indexes, many of which are too wide) usually come from blindly accepting recommendations from automated tools, particularly when the tools are right that there is a problem and that a given index would solve it, but fixing the queries so existing indexes are useful would have a much greater effect than adding new ones.
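A small sqlite3 illustration of the read side of this trade (the table and index names are made up): the planner only switches from a full scan to an index search once a suitable index exists, and every such index is extra work the engine must do on each write.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)"
)
con.executemany(
    "INSERT INTO events (user_id, payload) VALUES (?, ?)",
    [(i % 100, "x") for i in range(1000)],
)

def plan(sql):
    # EXPLAIN QUERY PLAN's last column is a human-readable description.
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Without an index on user_id, the lookup scans the whole table.
assert "SCAN" in plan("SELECT * FROM events WHERE user_id = 7")

# A narrow, targeted index turns the same query into an index search;
# the ongoing cost is that every INSERT/UPDATE now maintains it too.
con.execute("CREATE INDEX ix_events_user ON events (user_id)")
assert "USING INDEX ix_events_user" in plan(
    "SELECT * FROM events WHERE user_id = 7"
)
```

Checking the plan like this, rather than guessing, is how you find out whether an index is actually being used before deciding to keep or drop it.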
One of my most popular StackOverflow questions to this day is about how to handle one million rows in a single MySQL table (shudder).
The product I work on now collects more rows than that a day in a number of tables.
> Therefore, if the price isn’t an issue, SSDs are a better option — especially since modern SSDs are just about as reliable as HDDs
This needs a tiny extra bit of detail: if you're buying random IO (IOPS) or throughput (MB/s), SSDs are significantly (orders of magnitude!) cheaper than HDDs. HDDs are only cheaper on space, and only if your need for throughput or IO doesn't cause you to "strand" space.
> Consistency can be understood after a successful write, update, or delete of a row. Any read request immediately receives the latest value of the row.
This isn't the ACID definition of C, and is closer to the distributed systems (CAP) one. I can't fault the article for getting this wrong, though - it's super confusing!
I have a post in draft to discuss disk trade-offs which digs into this aspect; it's impossible to dig into everything at this level of a post.
I know they’re trying to simplify, but this is confusing. If the first part is true, the second part can’t be. In reality the database does execute the queries concurrently, but will try to make it seem like they were done one by one. If it can’t manage that, a query will fail and have to be retried by the application.
I do appreciate the feedback and will look to add some more color here! Thank you!
I send those 2 links to coworkers all the time
I highly recommend reading https://jepsen.io/consistency and clicking on each model on the map. This is the best resource I found so far for understanding databases, especially distributed ones.
I am an expert on the subject matter, and I don't think that the overall approach is questionable. The approach that the author took seems fine to me.
The definition of certain basic concepts like 'consistency' is even confusing to experts at times. This is made all the more confusing by introducing concepts from the distributed systems world, where consistency is often understood to mean something else.
Here's an example I'm familiar with, where an expert admits to confusion about the basic definition of consistency in the sense that it appears in ACID:
https://queue.acm.org/detail.cfm?id=3469647
This is a person who is a longtime peer of the people who invented the concepts!
Not trying to rigorously define these things makes a great deal of sense in the context of a high level overview. Getting the general idea across is far more important.
I think your level of abstraction is quite good for the absolute "what on earth are people talking about when they use that 'database' word?". With an extremely high level understanding, when they encounter more detail they'll have a "place to put it".
There are at least two ways (that I'm aware of) that this can be violated. For example, if you run an update statement like this:
UPDATE foo SET bar = bar + 1
Then the read of "bar" will always use the latest value, which may be different from the value other statements in the same transaction saw.

> Unlike SQL, it forms a logical pipeline of transformations, and supports abstractions such as variables and functions. It can be used with any database that uses SQL, since it transpiles to SQL.
An ironic caveat to this is that balanced trees don't scale well, only offering good performance across a relatively narrow range of data size. This is a side-effect of being "balanced", which necessarily limits both compactness and concurrency.
That said, concurrent B+trees are an absolute classic and provide important historical context for the tradeoffs inherent in indexing. Modern hardware has evolved to the point where B+trees will often offer disappointing results, so their use in indexing has dwindled with time.
This is pure nonsense. B+Trees are used extensively and by default by 5 out of 5 of the top database systems, according to db-engines.com.
If your database engine is an old design or your data is small by modern standards, then a B+tree will be one of the few indexing algorithms available and if the data is small it will probably work. Modern database kernels targeting modern hardware and storage densities typically aren't using B+trees and the reasons why are well-understood. No one with any sense is using a B+tree to index e.g. a trillion records, which is a pretty ordinary thing to do on a single server in 2022.
You can't just swap out indexing architectures due to their dependency on storage engine and scheduling behavior, so older databases like PostgreSQL will be using B+trees for the indefinite future even if suboptimal.
The transition away from B+tree based architectures in new databases engines started about 10-15 years ago. Back then I used them ubiquitously but I honestly don't remember the last time I've seen one in a new design.
The most you've said is "yeah, do what ScyllaDB does, but you will still suck".
It just feels like an advertisement and doesn't really add anything to the discussion, I believe. All your comments are the same in all database threads.
endrant
Radically improving index compactness is achieved by loosening design constraints on B+trees: the indexes represent a partial order which only converges on a total order at the limit and the search structure is unbalanced. In the abstract these appear slightly less efficient but it enables the use of selectivity-maximizing succinct representations of the key space that can get pretty close to the information theoretic limits. Scalability gains result from the radical reduction in cache footprint when represented this way.
Optimal compressive indexes are not computable (being equivalent to AI), so the efficient approximation strategies people come up with tend to be diverse, colorful, and sometimes impractical. Tangentially, some flavors have excellent write performance. It is not a trivial algorithm problem but there are a few design families that generalize well to real databases engines. I wouldn't describe this as a fully solved problem but many ordinary cases are covered.
There isn't much incentive to design a relational database engine that can use these types of indexes, since the types of workloads and data models that recommend them usually aren't relational. Someone could, there just isn't much incentive. It is more de rigueur for graph, spatiotemporal, and some types of analytical databases, where there is no other practical option if scalability matters at all.
BigQuery would be one example.
Which is to say they're frequently good enough such that the human working with them on whatever level can safely not know a lot of these details and get a LOT done. Kudos to whoever deserves them here.
As opposed to aspirationally discrete classifications that end up being porous, e.g. MVC, "Object Oriented" etc.