Show me flowcharts and/or tables, it doesn't matter; I'll continue to be mystified.
I love this statement. It's true too, having seen a decades-old database that needed to be converted to Postgres. The old application was going to be thrown away, but the data was still relevant :).
Databases in heavy use will not just outlast your application; they have a strong chance of outlasting your career, and they may very well outlast you as a person.
It's going to be interesting when this same problem occurs years from now when people are trying to reverse schemas from NoSQL databases or if they become difficult to extract.
The only sticking point is when business logic is put into stored procedures. On one hand if you're building an app on top of it, there's a temptation to extract and optimize that logic in your new back-end. On the other hand, it is kind of nice to even have it at all should the legacy app go poof.
Also, in my experience, the database is almost always the main cause of any performance issues. I would much rather hire someone who is very good at making the database perform well than making the front end perform well. If you are seeking to be a full stack developer, devote much more time to the database layer than anything else.
I would be careful with the term "cause". There is a symbiotic relationship between the application and the database. Or, if talking to a DBA... a database and its applications. Most databases can store arbitrary sets of information... but how the data is stored (read: structure) must take into account how it is to be used. When the database designer is told up-front (by the app dev team), considerations can be made to optimize performance along whatever vector is most desired (e.g. read speed, write speed, consistency, concurrency, etc.). Most database performance issues result when these considerations are left out. Related: just because a query works (i.e. returns the right data) does not mean it's the best query.
If your database is great, at least you have the option of a fast backend.
More generically, state stores are almost always bottlenecks (they tend to be harder to scale without some tradeoff)
I’ve been working with and on databases for a long, long time, and I’ve even written about things I think people should know about if they want to do this, yet I never came up with such great insight. This is so true it should be engraved somewhere. Hats off!
1. If you store data in rows, it's quite fast to insert/update/delete individual rows. Moreover, it's easy to do it concurrently. However, reads can be very slow, because you read the entire table even if you only scan a single column. That's why OLAP databases use column storage.
2. If you store data sorted in the table, reading ranges based on the sort key(s) is very fast. On the other hand, inserts may spray writes over the entire table, (eventually) forcing writes to all blocks, which is very slow. That's why many OLTP databases use heap (unsorted) row organization.
In small databases you don't notice the differences, but they become dominant as volume increases. I believe this fact alone explains a lot of the proliferation of DBMS types as enterprise datasets have grown larger.
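A toy Python sketch of the layout difference in point 1 (the table and sizes here are made up): with row storage, summing one column still touches every record, while column storage keeps each column contiguous so the scan reads only the values it needs.

```python
# Hypothetical table of 1000 user records.
rows = [{"id": i, "name": f"user{i}", "age": i % 90} for i in range(1000)]

# Row-oriented: to sum "age" we iterate over whole records,
# dragging "id" and "name" through memory along the way.
row_sum = sum(r["age"] for r in rows)

# Column-oriented: each column is stored contiguously on its own,
# so summing "age" reads only that column's values.
columns = {
    "id": [r["id"] for r in rows],
    "name": [r["name"] for r in rows],
    "age": [r["age"] for r in rows],
}
col_sum = sum(columns["age"])

assert row_sum == col_sum  # same answer, very different I/O pattern at scale
```

In a real engine the difference shows up as pages read from disk rather than Python iteration, but the shape of the trade-off is the same.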
Edit: minor clarification
I’ve migrated ORM several times, and the only thing that changes is the entity definition. The database remains the same.
I don't know if there is a single soul who believes this. If you are designing a database, it is much cooler than front-end apps.
https://users.ece.utexas.edu/~adnan/pike.html
Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
I'm curious: what's the alternative to NULL? I'm struggling to think of a database where NULL wouldn't be super useful. It feels like NULL as a concept is almost required, but I think you're suggesting that's a faulty assumption.
Would love to hear more about this.
But if a column truly has unknown values, NULLs are the best way to represent it. It is sometimes suggested to use "sentinel values" like an empty string or -1 to represent missing values, but IMHO this is much worse than NULLs, since sentinels are treated as regular values by operators. When you have missing values, you want three-valued logic.
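A quick sqlite3 sketch of the difference (the table and values are made up): NULL participates in three-valued logic, while a sentinel like -1 is just another number as far as comparison operators are concerned.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, age INTEGER)")
con.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("ann", 30), ("bob", None), ("cho", 25)],  # bob's age is unknown
)

# Three-valued logic: NULL < 28 evaluates to UNKNOWN, so bob is
# excluded from both the predicate and its negation.
(young,) = con.execute("SELECT COUNT(*) FROM people WHERE age < 28").fetchone()
(old,) = con.execute("SELECT COUNT(*) FROM people WHERE NOT age < 28").fetchone()
assert young == 1 and old == 1  # 1 + 1 != 3 rows: bob is in neither bucket

# A sentinel is treated as a regular value by operators:
con.execute("UPDATE people SET age = -1 WHERE age IS NULL")
(young,) = con.execute("SELECT COUNT(*) FROM people WHERE age < 28").fetchone()
assert young == 2  # bob now silently counts as "young"
```

The sentinel version gives a confidently wrong answer, which is exactly the failure mode three-valued logic exists to avoid.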
Fixing up a database step by step is a painful process.
I really like how he (Martin Kleppmann) in the book starts with a primitive data structure for constructing a database design, and then evolves the system slowly and describes the various trade-offs in building a database from the ground up.
If you're choosing something other than an RDBMS - you should rethink why.
Because unless you're at massive scale (which still doesn't justify it), choosing something else is rarely the right decision.
Except the most important problem: A pleasant API. Which is, no doubt, why 95% of those considering something other than an RDBMS are making such considerations.
RDBMS can have pleasant APIs. It is not a fundamental limitation. We have built layers upon layers upon layers of abstraction over popular RDBMSes to provide nice APIs and they work well enough. But those additional layers come with a lot of added complexity and undesirable dependencies that most would prefer to see live in the DBMS itself instead.
At least among the RDBMSes we've heard of, there does not seem to be much interest in improving the APIs at the service level to make them more compelling to use natively like alternative offerings outside of the relational space have done.
You either model things in a very domain specific and classic fashion. Here you get the benefit of being quite declarative and ad-hoc queries are natural. Also your schema is stronger, as in it can catch more misuse by default. But this kind of schema tends to have _logical_ repetition and is so specific that change/evolution is quite painful, because every new addition or use-case needs a migration.
Or you model things very generically, more data driven than schema driven. You lose schema strength and you definitely lose sensible ad-hoc queries. But you gain flexibility and generality and can cover much more ground.
You can kind of get around this dichotomy with views, perhaps triggers and such. In an ideal world you'd want the former to be your views and the latter to be your foundational schema.
But now you get into another problem, which is that homogeneous tables are just _super_ rigid as result sets. There are plenty of very common types you cannot cover. For example tagged unions, or any kind of even shallowly nested result (extremely common use case), or multiple result groups in one query. All of these things usually mean you want multiple queries (read transaction) or you use non-SQL stuff like building JSONs (super awkward).
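For illustration, here is a small sqlite3 sketch (the tables are made up) of the "multiple queries" route for a shallowly nested result: SQL hands back two flat, homogeneous result sets, and the nesting has to be stitched together in application code.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE items  (order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO items  VALUES (1, 'a1'), (1, 'a2'), (2, 'b1');
""")

# A single flat result set can't express "each order with its list of
# items", so we issue two queries and assemble the nesting ourselves.
orders = {oid: {"customer": c, "items": []}
          for oid, c in con.execute("SELECT id, customer FROM orders")}
for oid, sku in con.execute("SELECT order_id, sku FROM items"):
    orders[oid]["items"].append(sku)

assert orders[1]["items"] == ["a1", "a2"]
```

A single joined query would work too, but then the order's columns repeat on every item row and you deduplicate in the app instead, which is the same awkwardness in a different place.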
If you can afford to use something like SQLite, then some of the concerns go away. The DB is right there so it's fine to query it repeatedly in small chunks.
I wonder if we're generally doing it wrong though, especially in web development. Shouldn't the backend code quite literally live on the database? I wish my backend language would be a data base query language first and a general purpose language second, so to speak. Clojure and its datalog flavors come close. But I'm thinking of something even more integrated and purpose built.
A pleasant API is clearly not the most important business problem a database is there to solve.
The data in it is presumably the life and blood of the business, whereas the API is something only developers need to deal with.
But that aside, the interface will be SQL which is quite powerful, long-lived (most important) and, fortunately, very pleasant.
For that, there are stored procedures.
I'm with you on using an RDBMS for almost everything, but worked on quite a few projects where alternatives were needed.
One involved a lot of analytics queries (aggregations, filters, grouping etc.) on ~100-200GB of data. No matter what we tried, we couldn't get enough performance from Postgres (column-based DBs / Parquet alternatives gave us 100x speedups for many queries).
Another was for storing ~100M rows of data in a table with ~70 columns or so of largely text based data. Workload was predominantly random reads of subsets of 1M rows and ~20 columns at a time. Performance was also very poor in Postgres/MySQL. We ended up using a key/value store, heavily compressing everything before storing, and got a 30x speedup compared to using an RDBMS using a far smaller instance host size.
I wouldn't call either of them massive scale, more just data with very specific query needs.
Kimball's dimensional modelling helps a lot in cases like this, since probably there is a lot of repeated data in these columns.
Every day, I see people struggling with problems that would be easy to understand if they had one. You don't even need to have an RDBMS; they are good just to model how things are related to each other.
> choosing something else is rarely the right decision
I think this is a little bit of a 'We always did it this way' statement.
More often than not, it is worth some time spent thinking and planning to work out at least the core requirements in that area, to save yourself a lot of refactoring (or throwing away and restarting) later, and to avoid hitting bugs in production that a relational DB with well-defined constraints could have caught while still in dev.
Programming is brilliant. Many weeks of it sometimes save you whole hours of up-front design work.
Contrary to what people seem to assume, you actually can change the schema of a database and migrate the existing data to the new schema. There's a learning curve, but it's doable.
If you go schema-less, you run into another problem: not knowing the past shape of your data. When you try to load old records (from previous years), you may find that they don't look like the ones you wrote recently. And, if your code was changed, it may fail to handle them.
This makes it hard to safely change code that handles stored data. You can avoid changing that code, you can accept breakage, or you can do a deep-dive research project before making a change.
If you have a schema, you have a contract about what the data looks like, and you can have guarantees that it follows that contract.
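As a minimal sketch of what such a migration looks like (the table and column names are made up), using sqlite3: add the column, then backfill the existing rows so every record, old and new, follows the new contract.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO users (name) VALUES ('ann'), ('bob')")

# Migration: add a column, then backfill existing rows so old data
# conforms to the new shape. After this, code reading "status" never
# has to guess what records from before the change look like.
con.execute("ALTER TABLE users ADD COLUMN status TEXT")
con.execute("UPDATE users SET status = 'active' WHERE status IS NULL")

rows = con.execute("SELECT name, status FROM users ORDER BY id").fetchall()
assert rows == [("ann", "active"), ("bob", "active")]
```

Real migrations on large tables need more care (batching the backfill, deploying code that tolerates both shapes during the rollout), but the contract at the end is the point.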
I think a valid reason for not choosing a relational database is if your business plan requires that you grow to be a $100B+ company with hundreds of millions of users. Otherwise, you will probably be fine with RDBMS, even if it will require some optimizing in the future.
and that you ruled out using a JSON string column (or columns) as a dump for the uncertain parts, de-normalization with indexing, and the EAV schema as potential solutions to your problems.
The point is nothing is free, and you have to be sure it's a price you're willing to pay.
Are you ready to give up joins? To have your data be modeled after the exact queries you're going to make? For your data to be duplicated across many places? Etc...
That's a very good reason for going with a RDBMS even if looks like it's not the clearest winner for your use case.
If you invert any of those conditions, it may become interesting to study alternatives.
I’m curious how you handle this with less engineering effort without using an RDBMS.
What's interesting is query performance, and a RDBMS supports explicit control over indexing (usually including analyzing execution plans to find out which queries are going to work well). Where do you see "a terrible waste of engineering effort"?
If that assumption is true, then it follows that the same argument used in the last statement also applies: if you're not at massive scale, then the aforementioned tradeoff of not using an RDBMS is likely de minimis.
(This assumes that the tradeoffs are of the magnitude that they only manifest impact at scale, hard to address that without concrete examples though)
The tradeoff is usually flexibility. You run into flexibility problems anytime requirements change. Scale doesn't factor in.
I have used both and have never regretted NOT using an RDBMS. Maybe it's a taste thing, but I'd rather use a simple K/V database than a relational database any day.
I want to be able to treat the servers in my database tier as cattle instead of pets and RDBMSs don't fit this paradigm well. Either NoSQL or NewSQL databases are, in my opinion, a much better place to start.
I feel like RDBMSs being the "default" option is because most people have worked with them in the past and already understand them. It doesn't mean they are the best tool for the job or even the tool most likely to solve the unknown problems you'll encounter in the future.
It's even worse than this with MS SQL Server. When using the READ UNCOMMITTED isolation level it's actually possible to read corrupted data, e.g. you might read a string while it's being updated, so the result row you get contains a mix of the old value and new value of the column. SQL Server essentially does the "we got a badass over here" Neil deGrasse Tyson meme and throws data at you as fast as it can. Unfortunately I've worked on several projects where someone apparently thought that READ UNCOMMITTED was a magic "go fast" button for SQL and used it all throughout the app.
Dirty reads incidentally weren't supported for quite some time in the Sybase architecture (which forked to MS SQL Server in 1992). There was a Sybase effort to add dirty read support around 1995 or so. The project name was "Lolita."
Indexes are a nightmare to get right. Often performance optimizations of SQL databases include removing indexes as much as adding indexes.
If you are seeing performance gains from removing indexes, then I'm assuming your workload is very heavy on writes/updates compared to reads.
But yes, the issue with too many indexes is more often that they harm write performance.
A related issue is indexes that are too wide, either covering many columns or "including" them. As well as eating disk space, they also eat extra memory (and potentially cause extra IO load) when used: fewer rows per page, so more pages loaded into RAM for the same query.
Both problems together (too many indexes, many of which are too wide) usually come from blindly accepting recommendations from automated tools, particularly when the tools are right that there is a problem and that a given index would solve it, but fixing the queries so existing indexes are useful would have a much greater effect than adding new ones.
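A small sqlite3 illustration of the read side of this trade (the table and index names are made up): the planner only switches from a full scan to an index search once a suitable index exists, and every such index is extra work the engine must do on each write.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)"
)
con.executemany(
    "INSERT INTO events (user_id, payload) VALUES (?, ?)",
    [(i % 100, "x") for i in range(1000)],
)

def plan(sql):
    # EXPLAIN QUERY PLAN's last column is a human-readable description.
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Without an index on user_id, the lookup scans the whole table.
assert "SCAN" in plan("SELECT * FROM events WHERE user_id = 7")

# A narrow, targeted index turns the same query into an index search;
# the ongoing cost is that every INSERT/UPDATE now maintains it too.
con.execute("CREATE INDEX ix_events_user ON events (user_id)")
assert "USING INDEX ix_events_user" in plan(
    "SELECT * FROM events WHERE user_id = 7"
)
```

Checking the plan like this, rather than guessing, is how you find out whether an index is actually being used before deciding to keep or drop it.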
One of my most popular StackOverflow questions to this day is about how to handle one million rows in a single MySQL table (shudder).
The product I work on now collects more rows than that a day in a number of tables.
> Therefore, if the price isn’t an issue, SSDs are a better option — especially since modern SSDs are just about as reliable as HDDs
This needs a tiny extra bit of detail: if you're buying random IO (IOPS) or throughput (MB/s), SSDs are significantly (orders of magnitude!) cheaper than HDDs. HDDs are only cheaper on space, and only if your need for throughput or IO doesn't cause you to "strand" space.
> Consistency can be understood after a successful write, update, or delete of a row. Any read request immediately receives the latest value of the row.
This isn't the ACID definition of C, and is closer to the distributed systems (CAP) one. I can't fault the article for getting this wrong, though - it's super confusing!
I have a post in draft to discuss disk trade-offs which digs into this aspect; it's impossible to dig into everything at this level of a post.
I know they’re trying to simplify, but this is confusing. If the first part is true, the second part can’t be. In reality the database does execute the queries concurrently, but will try to make it seem like they were done one by one. If it can’t manage that, a query will fail and have to be retried by the application.
I do appreciate the feedback and will look to add some more color here! Thank you!
I send those 2 links to coworkers all the time
I highly recommend reading https://jepsen.io/consistency and clicking on each model on the map. This is the best resource I found so far for understanding databases, especially distributed ones.
I am an expert on the subject matter, and I don't think that the overall approach is questionable. The approach that the author took seems fine to me.
The definition of certain basic concepts like 'consistency' is even confusing to experts at times. This is made all the more confusing by introducing concepts from the distributed systems world, where consistency is often understood to mean something else.
Here's an example I'm familiar with, where an expert admits to confusion about the basic definition of consistency in the sense that it appears in ACID:
https://queue.acm.org/detail.cfm?id=3469647
This is a person who is a longtime peer of the people who invented the concepts!
Not trying to rigorously define these things makes a great deal of sense in the context of a high level overview. Getting the general idea across is far more important.
I think your level of abstraction is quite good for the absolute "what on earth are people talking about when they use that 'database' word?". With an extremely high level understanding, when they encounter more detail they'll have a "place to put it".
There are at least two ways (that I'm aware of) that this can be violated. For example, if you run an update statement like this:
UPDATE foo SET bar = bar + 1
Then the read of "bar" will always use the latest value, which may be different from the value other statements in the same transaction saw.

> Unlike SQL, it forms a logical pipeline of transformations, and supports abstractions such as variables and functions. It can be used with any database that uses SQL, since it transpiles to SQL.
An ironic caveat to this is that balanced trees don't scale well, only offering good performance across a relatively narrow range of data size. This is a side-effect of being "balanced", which necessarily limits both compactness and concurrency.
That said, concurrent B+trees are an absolute classic and provide important historical context for the tradeoffs inherent in indexing. Modern hardware has evolved to the point where B+trees will often offer disappointing results, so their use in indexing has dwindled with time.
This is pure nonsense. B+Trees are used extensively and by default by 5 out of 5 of the top database systems, according to db-engines.com.
If your database engine is an old design or your data is small by modern standards, then a B+tree will be one of the few indexing algorithms available and if the data is small it will probably work. Modern database kernels targeting modern hardware and storage densities typically aren't using B+trees and the reasons why are well-understood. No one with any sense is using a B+tree to index e.g. a trillion records, which is a pretty ordinary thing to do on a single server in 2022.
You can't just swap out indexing architectures due to their dependency on storage engine and scheduling behavior, so older databases like PostgreSQL will be using B+trees for the indefinite future even if suboptimal.
The transition away from B+tree based architectures in new databases engines started about 10-15 years ago. Back then I used them ubiquitously but I honestly don't remember the last time I've seen one in a new design.
The most you've said is "yeah, do what ScyllaDB does, but you will still suck".
It just feels like an advertisement and doesn't really add anything to the discussion, I believe. All your comments are the same in all database threads.
endrant
Radically improving index compactness is achieved by loosening design constraints on B+trees: the indexes represent a partial order which only converges on a total order at the limit and the search structure is unbalanced. In the abstract these appear slightly less efficient but it enables the use of selectivity-maximizing succinct representations of the key space that can get pretty close to the information theoretic limits. Scalability gains result from the radical reduction in cache footprint when represented this way.
Optimal compressive indexes are not computable (being equivalent to AI), so the efficient approximation strategies people come up with tend to be diverse, colorful, and sometimes impractical. Tangentially, some flavors have excellent write performance. It is not a trivial algorithm problem but there are a few design families that generalize well to real databases engines. I wouldn't describe this as a fully solved problem but many ordinary cases are covered.
There isn't much incentive to design a relational database engine that can use these types of indexes, since the types of workloads and data models that recommend them usually aren't relational. Someone could, there just isn't much incentive. It is more de rigueur for graph, spatiotemporal, and some types of analytical databases, where there is no other practical option if scalability matters at all.
BigQuery would be one example.
Which is to say they're frequently good enough such that the human working with them on whatever level can safely not know a lot of these details and get a LOT done. Kudos to whoever deserves them here.
As opposed to aspirationally discrete classifications that end up being porous, e.g. MVC, "Object Oriented" etc.