How SQL Database Engines Work, by the Creator of SQLite (2008) [video]

(youtube.com)

813 points | zbaylin | 7y ago | 129 comments

ryanworl7y ago

There is a more recent lecture on the same topic from 2015 at CMU: https://youtu.be/gpxnbly9bz4

_wmd7y ago

this video is a 100ft overview, original link is about SQL internals

netgusto7y ago

Detailed explanations start at 26m40s:

https://youtu.be/gpxnbly9bz4?t=26m40s

okket7y ago

Much better quality, thank you.

prudhvis7y ago

At the 34:30 mark he goes on to say that there are some buggy implementations of mmap. Is that in any way related to how Linux handles pages marked as free?

codetrotter7y ago

Mods please update link to this and change title to reflect the year of the replacement video

dang7y ago

Does it cover the same material though? I don't want to deprive people of the original survey.

provlem7y ago

This needs more upvotes to remain on top.

dmoreno7y ago

I recently created a database engine (exosql [1]), only query and no storage. It uses postgres-like foreign data wrappers to get all data.

It's not valid for big datasets, as it stores everything in memory (or maybe it is?), but as a learning experience it has been amazing to think through and develop a real database: planner, executor, choosing algorithms, implementing features such as lateral joins, and so on.

I will definitely listen very carefully to these talks.

[1] https://gitHub.com/Serverboards/exosql

suj1th7y ago

A similar learning experience for me was when I was exploring Apache Calcite. That again is only query, and no storage. It has a concept of 'adapters' which, I assume, is similar to the postgres-like foreign data wrappers you mention.

zerr7y ago

Any books/lectures/articles you followed?

dmoreno7y ago

I mostly followed the Postgres and SQLite documentation. For some specific areas I checked their source code. But mainly I used EXPLAIN from Postgres as a reference for which algorithms (seq scan, hash scan and so on) to use for specific queries.

pipu7y ago

I truly recommend the video lectures by CMU's Andy Pavlo on the topic (and also on more advanced stuff)

https://www.youtube.com/playlist?list=PLSE8ODhjZXjYutVzTeAds...

PretzelFisch7y ago

I found Prof. Dr. Jens Dittrich's database playlists interesting and pleasant to watch. https://www.youtube.com/channel/UCC9zrtAkl6yY4dpcnWrCHjA

manigandham7y ago

Yes, they have put out several courses over the years and they're all great: https://www.youtube.com/channel/UCHnBsf2rH-K7pn09rb3qvkA/pla...

swinghu7y ago

very good

logicallee7y ago

Any video filter experts here?

Request to any video filter expert

------------------------------------

I started watching this. The slides are unreadable, but the camera is perfectly still and each slide stays up for several "key frames", at which the compression algorithm decides to replace one set of compression artifacts with another.

For example try to read the first keyword under "Translates into:":

https://www.youtube.com/watch?v=Z_cX3bzkExE&t=2m14s

The keyword is unreadable at the start but as you keep looking at it over 50 keyframes it becomes readable to me.

Since the camera is in a fixed position it should be possible to combine the data from those artifacts into a single superresolution video with very small assumptions. (i.e. the assumption that the image is the same image until at least 5% change or something). There's not even anyone moving in front of it.

-> Can someone who actually knows this stuff apply a superresolution interlacing filter to this video and post the superresolution version somewhere?

I hope this is not too much work, and I am sure we would all appreciate the results since the slides are not human-readable before applying some kind of superresolution!

reilly30007y ago

There was a similar issue with detecting text at an angle when people were trying to work out whether the first lady was wearing a jacket with an insensitive, dog-whistle message on it. The original source image was at an angle, so determining its authenticity was challenging until other images emerged.

peterwwillis7y ago

I'm sorry you got downvoted for this comment. HN voting is the worst.

logicallee7y ago

Perhaps I was naive about the state of the art. After the now-dead reply I received, I searched, and found a couple of papers like this

https://arxiv.org/abs/1801.04590 - "Frame-Recurrent Video Super-Resolution"

but if you look at p. 8, I think many of the algorithms still wouldn't end up with readable text. This paper is from this year, so it is an area of active research.

I wrote a quick mail to the authors to see if they would put the video through their setup (since the last paper update was just 3 months ago) and share their results.

2. Trying it myself...

After my downvotes I tried this small piece of software:

http://www.infognition.com/VideoEnhancer/

Which shows a before/after. Here is their page on their super-resolution algorithm:

http://www.infognition.com/articles/what_is_super_resolution...

I used their plugin on virtualdub on a sample of the video. The results weren't useable. Here is a picture which shows the before and after:

https://imgur.com/a/0rhy7q7

(The diagonal lines are a watermark because I didn't pay to register video enhancer.) Also note that though it might look like a sharpen mask was applied, in fact it was not: this is just the superresolution that video enhancer came up with.

Now granted I don't think that this particular site uses state of the art algorithms (its references on the page I linked are decades old) but it's the first one I found.

The site also has a page explaining when it doesn't work:

http://www.infognition.com/articles/when_super_resolution_do...

It specifically calls out "If your video is compressed to a low bitrate, in many cases this is very bad for super-resolution."

This certainly seems to be the case here. On my comparison picture above you can see that it certainly is an improvement, it is just not enough. I still can't read most of the lines. I think this also doesn't use as many keyframes as it could. (Which makes sense - it is rare that a static image is up for, in this case, 25 full seconds!)

There are at least 14 full keyframes there so I think there is more detail to be extracted, but it would, obviously, take longer analysis. I'll let you know if I find anything better or get an answer from the paper authors.

dicroce7y ago

so I was trying to figure out why a query was slow the other day... it was a nasty query with like 14 joins... I used explain and saw that it was a mess... now in my case I was able to switch to outer joins and nest related joins and got it fast.. but I had some interesting thoughts.

In SQL, indexes are implicit.. they are used if available but it's easy to get a large query to scan sometimes when it shouldn't... what if there was a different query language with explicit index syntax.. I think you'd get a lot more predictable performance.

DenisM7y ago

Two reasons to not have indexes in the query:

1. Query expresses the result it produces, not the method that was used to obtain it. Semantic vs implementation. It may be a pain to write, but it will be easier to read later.

2. DBA could add/drop indexes on the fly to tune performance of a live system without making any changes to the application code. And being 100% certain he is not changing the semantics of what's going on.

As others noted, if you must, you can use query hints to force a particular index to be used for a particular operation. MSSQL also lets you pin down a query plan you like for a given query so that it doesn't drift away later due to environment changes.

I agree it is sometimes a pain to force SQL to use the index you wanted it to use.
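
To make the trade-off around hints concrete, here is a minimal sketch using SQLite's `INDEXED BY` and `NOT INDEXED` clauses from Python's standard `sqlite3` module. The `orders` table and `orders_customer_idx` index are made-up names for illustration, and other engines spell their hints differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    CREATE INDEX orders_customer_idx ON orders (customer_id);
""")

def plan(sql):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the readable detail

# Hinted form: the query names the index it expects to use.
HINTED = "SELECT total FROM orders INDEXED BY orders_customer_idx WHERE customer_id = 42"
hinted_plan = plan(HINTED)

# NOT INDEXED forbids index use, so the table itself is scanned.
scan_plan = plan("SELECT total FROM orders NOT INDEXED WHERE customer_id = 42")

# The cost of hinting: once the index is dropped, the hinted query stops working.
conn.execute("DROP INDEX orders_customer_idx")
try:
    conn.execute(HINTED)
    hinted_query_survived = True
except sqlite3.OperationalError:
    hinted_query_survived = False
```

The last step is the point made in reason 2 above: with the hint in the query text, dropping or renaming an index is no longer a change that leaves the application untouched.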

AmericanChopper7y ago

I’ve never worked anywhere where I had to worry about DBAs running around dropping indexes. The main reasons not to build an index are usually storage and write overhead. Every index a table has means you have to do another write operation on every insert, which can really start to add up. They can also add significant overhead to any migration operation that happens to require an index rebuild.

In my experience, the most common reason for an optimizer choosing not to use an existing index, is out of date statistics. For those who aren’t aware, the database collects table statistics for things like cardinality, number of distinct values, etc... This is the information the optimizer uses when it’s building a plan. If they get out of date the optimizer will start to come up with nonsense plans. Even worse, if your stats get too out of date, you can become scared to update them, because a new set of stats can potentially change the plans built for every single query in ways that are hard to predict.

As others have stated, you can put index hints directly into your queries, but this should be avoided as they’re hard to maintain. Most ‘enterprise’ RDBMS also have some form of plan management, but this should be avoided even more, as managed plans permanently bypass the optimizer, which is even harder to maintain.
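
As a small illustration of the statistics point, the sketch below uses SQLite, where `ANALYZE` writes its findings to the `sqlite_stat1` table. The `users` table and its contents are invented for the example; other engines keep their statistics elsewhere and may refresh them automatically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT)")
conn.execute("CREATE INDEX users_country_idx ON users (country)")

def add_users(n):
    """Insert n rows; one in ten is 'IS', the rest are 'NL'."""
    conn.executemany(
        "INSERT INTO users (country) VALUES (?)",
        [("IS" if i % 10 == 0 else "NL",) for i in range(n)],
    )

def index_stats():
    """Return the stored statistics for the country index."""
    return conn.execute(
        "SELECT stat FROM sqlite_stat1 WHERE idx = 'users_country_idx'"
    ).fetchall()

add_users(1_000)
conn.execute("ANALYZE")
fresh = index_stats()

add_users(9_000)           # the table grows tenfold...
stale = index_stats()      # ...but the stored statistics still describe the old table

conn.execute("ANALYZE")
refreshed = index_stats()  # only now do the numbers reflect the current data
```

Between the second `add_users` call and the second `ANALYZE`, the planner is working from numbers that describe a table one tenth of the real size, which is the situation described above.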

cryptonector7y ago

So, the way I think this should work is that there should be a way of addressing the table sources in a query from an application that has them parsed, and then externally (i.e., in the application) provide planning hints to the compiler.

Something like (in some terrible pseudocode):

    q = parse_query("...");
    q.hint(FIRST_TABLE, "a");
    q.hint(INDEX, "b", "b_idx1");
    c = q.compile();
    r = c.run(...);

vidarh7y ago

MySQL has or had a way of forcing index use, and while it was very occasionally a life-saver, it was much less useful than you might think, as often when a DB engine falls back to a sequential scan, it's for a reason (e.g. the query planner might have found that the indexes don't cover the columns you need, and the indexed lookups into the table that are needed are costly enough that a sequential scan might end up being faster).

It was useful occasionally "back in the day" when MySQL's query planner was really bad, but today it will mostly appear useful if there's something subtly wrong with your config. I don't use MySQL much any more, but on Postgres one typical mistake might be that the costs configured for the query planner don't match your hardware (e.g. if your seek cost is configured to be high enough relative to sequential reads, sequential scans start to look really good even when you need only a small portion of the data; the principle will be the same on MySQL, but I don't know whether you have the same control over the query planner costs).

wgjordan7y ago

> what if there was a different query language with explicit index syntax..

There is, it's a feature in MySQL called Index Hints [1].

[1] https://dev.mysql.com/doc/refman/8.0/en/index-hints.html

brianwawok7y ago

Or oracle has it

A DBA can even sit there as queries fly past, and add hints on the fly.

And then you change a query from "select" (lowercase) to "SELECT" (uppercase), and query plans break and you break production.

Fun times

Amezarak7y ago

I would suggest that there's potentially something you need to look at with your database schema - a couple dozen joins shouldn't be causing any problems you have to think about.

Part of this is because of the way a well-normalized database is organized. Most databases have a few large tables and many smaller tables. So in the general case, most of your joins will be against smaller tables. Joins with larger tables are usually very fast, as long as the fields you join on are indexed (and you're not doing a CROSS JOIN or something.) The other thing that helps (which it sounds like you did by "nesting related joins") is to always think about limiting (filtering) the datasets you're joining against at as many stages as possible; that way you're always doing the least amount of work necessary, and it's usually conceptually simpler to read and understand.

As others have said, most databases do have index hinting as part of the query language. However, in my (long) experience, you should almost never use it. Index hints should be a huge code smell.

kimdotcom7y ago

I use UUIDs as primary keys, you insensitive clod!

api7y ago

SQLite is incredible. Tiny, and usually used for small stuff, but I have heard of 1TB+ databases with acceptable performance.

stevoski7y ago

If you are a Java programmer and want to learn how an SQL database engine works, take a look at the source code of H2.

Even better, try to add a basic feature to H2 (eg. a new built-in function). It is surprisingly easy, and you come away with a decent understanding of the basics of building an SQL database engine.

A_Person7y ago

Gosh, I must say there seems to be some misunderstanding of RDBMS concepts in some posts in this thread!

I was writing database systems professionally, back in the days before the RDBMS concept was even a thing. So here's my (enormously long and convoluted) 2 cents worth. Please be sure to pack a sandwich and get hydrated before you continue.

Say you were dealing with doctors and patients, and needed to store that information in a database. Back in the day, you'd typically use a so-called hierarchical database. To design one of those, you need to decide, what is the most common access method expected to be: getting the patients for a given doctor, or the doctors for a given patient? You'd design the schema accordingly. Then, the preferred access method was easy to code, and efficient to run. The "other" access method was still possible, but harder to code, and slower to run. The database schema depended on how you thought the users would access the data.

But that is absolutely NOT what to do with an RDBMS. Certainly you look at the users' screens, reports, and so on - but that's just to determine what unique entities the system must handle - in this case, doctors and patients. Then you ignore how the users will access the data, and work out what are the inherent logical relationships between all the entities.

Your initial answer might be this. A doctor can have many patients, and a patient can have many doctors. As any competent relational designer will instantly know, this means you need a resolving table whose primary key is a composite key comprising the primary keys of the other two tables. So if Mary was a patient of Tom, you'd add Mary to the patients table (if not already there), Tom to the doctors table (ditto), then add a Mary/Tom record to the resolving table. By that means, a doctor could have any number of patients, a patient could have any number of doctors, and it's trivially easy to write simple, performant SQL to access that data however you want.

But then you'd have a ghastly thought: patients can also be doctors, and doctors can also be patients! Say Tom was also a patient of Mary! Now you need a Tom record in the patients table, but that would inappropriately duplicate all his data from the doctors table! Something's clearly wrong. You'd soon see that from a data modelling viewpoint, you don't want doctors and patients as separate entities - you want a single entity Person, and a resolving table to relate arbitrary people in specified ways.
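
One possible shape for that design, sketched with Python's `sqlite3` module. The table and column names (`person`, `treats`, `doctor_id`, `patient_id`) are assumptions chosen for the example, not anything prescribed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE person (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- Resolving table: one row per (doctor, patient) pair, keyed by both ids.
    CREATE TABLE treats (
        doctor_id  INTEGER NOT NULL REFERENCES person (id),
        patient_id INTEGER NOT NULL REFERENCES person (id),
        PRIMARY KEY (doctor_id, patient_id)
    );
    INSERT INTO person (id, name) VALUES (1, 'Tom'), (2, 'Mary');
    -- Tom treats Mary, and Mary treats Tom; neither person is stored twice.
    INSERT INTO treats (doctor_id, patient_id) VALUES (1, 2), (2, 1);
""")

def patients_of(doctor_name):
    """Names of everyone treated by the given person."""
    rows = conn.execute("""
        SELECT patient.name
        FROM treats
        JOIN person AS doctor  ON doctor.id  = treats.doctor_id
        JOIN person AS patient ON patient.id = treats.patient_id
        WHERE doctor.name = ?
        ORDER BY patient.name
    """, (doctor_name,)).fetchall()
    return [name for (name,) in rows]

def doctors_of(patient_name):
    """Names of everyone who treats the given person."""
    rows = conn.execute("""
        SELECT doctor.name
        FROM treats
        JOIN person AS doctor  ON doctor.id  = treats.doctor_id
        JOIN person AS patient ON patient.id = treats.patient_id
        WHERE patient.name = ?
        ORDER BY doctor.name
    """, (patient_name,)).fetchall()
    return [name for (name,) in rows]
```

Both directions of access use the same two tables and the same join, which is the contrast with the hierarchical design, where one direction had to be chosen as the cheap one up front.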

So what?!!

So this. IMHO, many developers using relational databases have absolutely no idea about any of that. They design hopelessly unnormalised schemas, which then need reams of ghastly SQL to get anything out. The query planner can barely parse all that crap, let alone optimise it. The database has to stop every five minutes to wet its brow and take a drink.

So here's my advice to any inexperienced relational database designers who have actually managed to get this far!! If you can answer the following questions off the top of your head, you're probably on the right track. If you can't, you're lacking basic knowledge that you need in order to use an RDBMS properly:

- what is a primary key?
- what is a foreign key?
- what is an important difference between primary keys and foreign keys?
- what is a composite key? When would you use one?
- what are the three main relations?
- what is normalisation?
- what is denormalization?
- what is a normal form?

and so on.

Just my 2c! :-)

vram227y ago

I'm guessing A_Person is making up some of what he/she said (to be entertaining) (I don't mean the DB facts - which are right, of course, and the questions at the end), but it was an amusing post anyway :)

Well done.

A_Person7y ago

Thanks :-)

mmjaa7y ago

I imagine you've written more than your fair share of PROGRESS 4GL code in the past .. your qualifications questions are pretty much straight out of the PROGRESS 4GL user guide .. ;)

A_Person7y ago

Nope! I've never used postgres at all. Most of my RDBMS work was done on HN NSFW ALERT - oracle :-) But your comment supports my general point, which is, that data modelling and relational schema design skills are product agnostic; normalization is normalization, as it were.

kimdotcom7y ago

So, were you writing DB software before Codd's research at IBM was available?

A_Person7y ago

Further to my other reply, I've just checked the intertubes, and found that Codd's paper "A Relational Model of Data for Large Shared Data Banks" was published in 1970, and Dijkstra's "Go To Statement Considered Harmful" in 1968. So I think my fading memories of all this are accurate (for once!).

A_Person7y ago

Yes indeed. I haven't checked his research dates, but I was writing database software in the late (uh) 60s and early 70s. I think that was even before "goto harmful". I still remember our standards officer saying, let's try some of this structured programming stuff!

okket7y ago

Sadly bad audio (room mic with all the ambient noise) and bad video quality (slides are almost unreadable). But great content.

randop7y ago

Thank you. Very educational. Interesting to know that ORDER BY incurs a significant performance penalty without LIMIT.
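
One way to see why a LIMIT can help: with a bound of k rows, an engine only needs to remember the k best rows seen so far instead of sorting everything. The sketch below shows that idea in plain Python with `heapq`; it is an illustration of the general technique, not SQLite's actual sorter, and whether a given engine applies it is something to check in its documentation.

```python
import heapq
import random

random.seed(0)
rows = [random.random() for _ in range(100_000)]

def order_by(rows):
    """ORDER BY with no LIMIT: every row has to be sorted, O(n log n)."""
    return sorted(rows)

def order_by_limit(rows, k):
    """ORDER BY ... LIMIT k: track only the k smallest rows, O(n log k)."""
    return heapq.nsmallest(k, rows)

top10 = order_by_limit(rows, 10)
```

Both functions still read every row; the saving is in how much has to be kept and ordered along the way.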

serioushaha7y ago

slides : https://www.slideshare.net/VikasBansal23/how-sqlite-works

angelfreak7y ago

Really great, thanks for posting.

bitmapbrother7y ago

This is a much better talk with better video and sound.

https://www.youtube.com/watch?v=Jib2AmRb_rk

cup-of-tea7y ago

What is that acronym he keeps saying? MBCC?

ntonozzi7y ago

MVCC - Multi version concurrency control.

anothergoogler7y ago

Love how he started. People who don't stop their conversations for a presenter are the worst. People who don't stop their conversations for a presentation by Richard Hipp deserve a spell of laryngitis.

blackrock7y ago

I'm not going to call anyone out here, but why do people keep using the word orthogonal?

It doesn't even compute. It doesn't even make any sense, in how they use it in relation to the topic.

Are the issues at right angles of one another? No.

Are the issues statistically independent of one another? Perhaps.

I suggest using a more appropriate descriptive word to describe the situation.

You folks should read the urban meaning of orthogonal, to understand how people roll their eyes at you, when you inappropriately use the term.

https://www.urbandictionary.com/define.php?term=orthogonal

Just another friendly PSA.

ternaryoperator7y ago

That's how languages evolve. Words that meant one very distinct thing come to mean something only partially like the original. People decry the misuse. And finally the new meaning becomes the one true meaning and the original sense is marked in dictionaries as "archaic."

For a word that's gone through that exact cycle, have a stare at "artificial," which was the adjective for "artifice," which at one time meant craftsmanship. When St. Paul's Cathedral was first shown to King Charles II, he praised it for being "very artificial" -- a compliment. [1]

In the meantime, I agree that it can be frustrating to see words apparently misused. But I think this is hardly the mark of an "idiot," as you put it.

[1] https://quoteinvestigator.com/2012/10/31/st-pauls-cathedral/

sanderjd7y ago

Oh wow I totally disagree. I have always found the computer science usage of "orthogonal" to be very analogous to orthogonal sets in linear algebra. It is only in two dimensions that it boils down to right angles; that's the uninteresting case! The more general concept is one of independence, of something being broken down into its constituent parts, such that each part is pulling its weight in some way that even all the other parts combined could not. Which is exactly how I see it being used here. I still remember the moment my programming languages professor introduced the concept of orthogonality in that field; I intuitively grasped the meaning and was awed by that power of analogy.

wwweston7y ago

What does it mean for something to be at a right angle to something else?

There's a euclidean geometric answer to that statement, but it's hardly the only correct answer.

When people use it to mean that they're speaking of two issues that have a range of independent possibilities, it's not wrong to invoke linearly independent bases.

abiox7y ago

is english a second language for you? 'orthogonal' is frequently used to indicate two things are not directly related or dependent.

> You folks should read the urban meaning of orthogonal

nope, nope, nope. that site's a hive of scum and villainy, and a massive number of entries are just random nonsense.

i'd rather go to wiktionary[0], which includes:

"Of two or more problems or subjects, independent of or irrelevant to each other."

[0] https://en.wiktionary.org/wiki/orthogonal

Izkata7y ago

> You folks should read the urban meaning of orthogonal, to understand how people roll their eyes at you, when you inappropriately use the term.

If that mattered at all, then we'd have stopped using other remapped words first, like "tree".

jokoon7y ago

I don't like to use SQL engines because I don't understand how they work, I never really know if my query will be O(1), O(log(n)), O(n), etc, or what kind of algorithm will optimize my query.

Who really does understand how a SQL engine works? Don't you usually need to understand how something works before you start using it? Which SQL analyst or DB architect really knows about the internals of a SQL engine? Do they know about basic data structures? Advanced data structures? Backtracking?

That's why I tend to avoid systematically using a SQL engine unless the data schema is very very simple, and manage and filter the data case by case in code. SQL is good for archiving and storing data, and works as an intermediary, but I don't think it should drive how a piece of software works. Databases can be very complex, and unfortunately, since developers like to make things complicated, it becomes hairy.

I think SQL was designed when RAM was scarce and expensive, so to speed up data access, it has to be properly indexed with a database engine. I really wonder who, today, has data that cannot fit in RAM, apart from big actors.

I tend to advocate for simple designs and avoid complexity as much as I can, so I might be biased, but many languages already offer things like sets, maps, multimaps, etc. Tailoring data structures might yield good results too.

Databases still scare me.

blattimwind7y ago

You're not scared, you're just too lazy to learn the tools of your trade.

Databases are not very complex and use pretty much only textbook data structures and algorithms. Understanding how they process a given query and how a query will probably perform/scale (even without EXPLAIN ANALYZE) is not hard to learn. You do need to learn it (at some point; you don't for small data, which is most). But it's far from difficult.

> That's why I tend to avoid systematically using a SQL engine unless the data schema is very very simple, and manage and filter the data case by case in code.

And that's the mentality that gives us webshops where applying a simple filter results in a couple seconds load time and uses hundreds of MB of RAM per request, server side.

barrkel7y ago

Databases are amongst the most complex systems you will ever use as a developer. At the limit, they rival operating systems for complexity - distributed concurrent systems with lots of low-level memory and filesystem action, along with a parser, optimizing compiler and often a code generator too.

aardvark2917y ago

>Databases are not very complex and use pretty much only textbook data structures and algorithms

skeptical expression

coldtea7y ago

>I don't like to use SQL engines because I don't understand how they work, I never really know if my query will be O(1), O(log(n)), O(n), etc, or what kind of algorithm will optimize my query.

Unless you're generating totally dynamic queries that's a moot point.

You can always try it and measure it -- just like you know, you would profile a program in any programming language. And you can trivially have the database show you the query plan as well.
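
A minimal sketch of that "try it, measure it, read the plan" loop, using SQLite through Python's `sqlite3` module. The `events` table, its size, and the index name are invented for the example, and the timings will differ from machine to machine.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, payload) VALUES (?, ?)",
    ((i % 5000, "x" * 20) for i in range(200_000)),
)

QUERY = "SELECT count(*) FROM events WHERE user_id = 7"

def plan():
    """Ask the engine how it intends to run QUERY (last column is the description)."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]

def measure(repeats=20):
    """Return (result, seconds) for running QUERY `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = conn.execute(QUERY).fetchone()[0]
    return result, time.perf_counter() - start

plan_before = plan()
count_before, secs_before = measure()

conn.execute("CREATE INDEX events_user_idx ON events (user_id)")

plan_after = plan()
count_after, secs_after = measure()

print(plan_before, f"{secs_before:.3f}s")
print(plan_after, f"{secs_after:.3f}s")
```

The plan tells you which shape the work has (a scan of the whole table versus a search through an index), and the timing tells you whether the difference matters for your data.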

Do you also not use APIs because you don't know a priori if a call is O(1) or O(N) or O(Nlog(N)) etc?

>I think SQL was designed when RAM was scarce and expensive, so to speed up data access, it has to be properly indexed with a database engine. I really wonder who, today, has data that cannot fit in RAM, apart from big actors.

That's really orthogonal.

Speed and indexes still matter today with big data (or plain "tens of thousands of web users" loads), where we often have to denormalize or use indexed non-sql stores just to get more speed for the huge data we still need to be able to query fast.

Besides, something indexed will be faster, whether it is on disk or in RAM, than something in the same storage that's not indexed.

So unless we're coding something trivial, server side we still want all the speed we can get from our data, rather than just keeping it in the simple structures RAM provides.

You wouldn't use a linked list as opposed to a hash table just because your data "fit in RAM". Even in RAM ~O(1) vs ~ O(N) matters [1].

SQL was invented and caught on because: companies had tried and were burned by no-sql stores with incompatible storage standards, lax reliability guarantees, no interoperability between client programs, no all-encompassing logical abstraction (compared to relational algebra) but ad-hoc reinventions of the square wheel, and so on. Ancient issues the wave of Mongo fanboys brought back all over again.

[1] unless the linked list is so tiny as to fit in cache and avoid the hash table indirection, but I digress
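
A rough way to see the in-RAM difference for yourself. This sketch uses a plain Python list as a stand-in for the unindexed sequence (not a true linked list) and a `dict` as the hash table; the row count and names are arbitrary.

```python
import timeit

N = 100_000
rows = [(i, f"user{i}") for i in range(N)]        # unindexed: a plain sequence
by_id = {row_id: name for row_id, name in rows}   # "indexed": a hash table keyed by id

def scan(target):
    """O(n): walk the rows until the id matches."""
    for row_id, name in rows:
        if row_id == target:
            return name
    return None

def lookup(target):
    """~O(1): a single hash probe."""
    return by_id.get(target)

target = N - 1  # worst case for the scan: the last row
scan_s = timeit.timeit(lambda: scan(target), number=20)
lookup_s = timeit.timeit(lambda: lookup(target), number=20)
print(f"scan: {scan_s:.4f}s  lookup: {lookup_s:.6f}s")
```

Everything here lives in memory, yet the two access paths still differ by the same O(N) versus O(1) gap that an index closes on disk.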

ludsan7y ago

Your concern about the opaque and abstract layers below you applies to language compilers as well (which I think is the "better" alternative you seem to prefer).

There is literature legion on the implementation of the database that would assuage you, should you concern yourself with reading it. I don't think you need to. Trust that many many smart people have engineered many many decades of excellent software.

That is, not to say, that you won't need to peek below or concern yourself with certain choices -- indices, commits, columnar-vs-row, etc. as your performance or access patterns dictate.

More importantly, the relational model is still the gem that shines as a beacon to model logic and data and is too often undervalued due to its association with 'enterprise software' and the implementation language (SQL is a bit warty).

glhaynes7y ago

100% agreed - it's exactly the same pattern as other tools. Write it in the most simple/obvious/maintainable way; then, if you have a performance issue (quite rare IME when building something that doesn't obviously need a database FTE from the outset), spend a few minutes semi-educatedly poking at it to see if you can stumble onto a drastic improvement; then, if not, dive deeper.

nostalgeek7y ago

> There is literature legion on the implementation of the database that would assuage you, should you concern yourself with reading it. I don't think you need to. Trust that many many smart people have engineered many many decades of excellent software.

can you recommend some material?

greenyoda7y ago

> I really wonder who, today, has data that cannot fit in RAM, apart from big actors.

Keeping all your data in RAM has significant problems, even if it all fits. For example, would you want to lose all your customers' orders and billing information if your code crashed?

In addition to the relational database model, SQL databases offer ACID transactions, which are useful if you want to have consistent and reliable data:

https://en.wikipedia.org/wiki/ACID
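
For a feel of what the atomicity part buys you, here is a small sketch with Python's `sqlite3` module. The `account` table and the `transfer` helper are made up for the example; the `CHECK` constraint stands in for whatever failure might interrupt a multi-step update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(src, dst, amount):
    """Move `amount` from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if the block raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances():
    """Current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM account"))
```

A transfer that would overdraw the source fails on the second `UPDATE`, and the credit already made by the first one is rolled back with it, so no money appears out of nowhere.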

slow_donkey7y ago

To be fair, using redis or elasticsearch as a main datastore is doable. Although I'm not sure they're much better choices in terms of understanding how they work.

You could summon Antirez I guess

PuercoPop7y ago

> For example, would you want to lose all your customers' orders and billing information if your code crashed?

There are things like WAL and snapshots. Having your dataset in RAM and querying directly doesn't exclude persisting it to disk. Read Stonebraker's "The End of an Architectural Era"[0]. Basically the OP is right in that SQL DBs were designed assuming that RAM was scarce and that assumption is no longer valid. They are inefficient for every common use case. By at least an order of magnitude.

[0]: http://cs-www.cs.yale.edu/homes/dna/papers/vldb07hstore.pdf
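
A toy version of the "data in RAM, log on disk" idea, to show the shape of it. The `LoggedStore` class and its JSON-lines log format are invented for this sketch; it ignores snapshots, log compaction, and torn writes at the end of the file, all of which a real system has to handle.

```python
import json
import os
import tempfile

class LoggedStore:
    """In-memory dict whose writes are appended to a log file before being applied."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as log:
                for line in log:  # replay earlier writes after a restart
                    key, value = json.loads(line)
                    self.data[key] = value
        self.log = open(path, "a", encoding="utf-8")

    def put(self, key, value):
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())  # ask the OS to persist the record first
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)  # reads never touch the disk

    def close(self):
        self.log.close()

# Simulate a crash and restart: a second instance rebuilds its state from the log.
log_path = os.path.join(tempfile.mkdtemp(), "store.log")
store = LoggedStore(log_path)
store.put("order:1", {"customer": "alice", "total": 30})
store.put("order:1", {"customer": "alice", "total": 45})
store.close()

recovered = LoggedStore(log_path)
recovered_order = recovered.get("order:1")
recovered.close()
```

Queries run against the dictionary in memory, while durability comes from the sequential log, which is the separation the comment above is pointing at.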

jnwatson7y ago

For Postgres at least, you can literally ask it how a query works, via EXPLAIN. Now, there’s a skill to understanding the output of that, but at least it isn’t a black box.

wolf550e7y ago

All RDBMS engines have some kind of EXPLAIN, otherwise it's impossible to troubleshoot performance issues. The differences are in the amount of detail you get from the optimizer/query planner, whether you get a profile of the actual run side-by-side with the plan, etc.

sebojanko7y ago

There's something similar for SQL Server too.

ebikelaw7y ago

And the query plan can literally change out from under you at any time. SQL sucks. You should be able to dictate the query plan to the engine directly. If SQL exists as a tool to create and serialize such plans via exploration and experimentation, that’s fine. As a runtime query system it is completely unsuitable.

dragonwriter7y ago

> Who really does understand how a SQL engine works?

Presumably, at a minimum, all the people who work on such engines, including committers to the various open-source ones.

But also lots of other people.

> Don't you usually need to understand how something works before you start using it?

No. Very few programmers understand how compilers work before they start using them. I'd say it's more common to require working on something to really understand how it works than the reverse.

> I think SQL was designed when RAM was scarce and expensive, so to speed up data access, it has to be properly indexed with a database engine. I really wonder who, today, has data that cannot fit in RAM, apart from big actors.

Indexing is no less important for in-memory data access.

josteink7y ago

This is what you get when developers are afraid to touch anything which doesn't look like Javascript or JSON.

The incompetence and ignorance shown in this post is simply astounding.

dagw7y ago

> I really wonder who, today, has data that cannot fit in RAM, apart from big actors.

It's also a question of price. Once you get above about 256 GB of RAM server prices start to go up really really fast. And while there are systems with dozens of TB of RAM they are stupidly expensive.

So even if, in theory, most databases could fit in RAM, most people cannot afford that. And at the end of the day, 100+TB isn't that large a database in the grand scheme of things and you're not easily fitting that into RAM.

mmt7y ago

> Once you get above about 256 GB of RAM

I think it might be as high as 1TB these days, though with what's going on with DDR4 prices, the situation is strange at the moment.

Of course, I don't dispute your point that a 100+TB database isn't all that large, especially with indexes.

I suspect that it's this false dichotomy of "fit in RAM" and "big data" that has resulted in many needless forays into distributed computing.

1 more reply

collyw7y ago

Understanding the relational model properly will allow you to write simple, performant code. Get rid of the relational database and you will instead have to write, for yourself, replacements for all that well-tested code.

dmoreno7y ago

I completely disagree.

Unless your data requirements are very specific, memory-only, or not adapted to the relational paradigm, any SQL engine will provide you with the best and most efficient algorithms to manipulate your data in the most common situations.

I think that, if anything, databases should be used more.

PuercoPop7y ago

> Unless your data requirements are very specific, mem only, or not adapted to the relational paradigm any SQL engine will provide you with the best and more efficient algorithms to manipulate your data in the most common situations.

Too bad that Michael Stonebraker, Turing Award winner, disagrees with you. SQL databases are not the best solution for any common use case from a performance perspective.

Never mind what they do to the design of an application. IMHO fewer people should default to using a database upfront, at least while prototyping the idea.

https://cs.brown.edu/~ugur/fits_all.pdf

2 more replies

1stranger7y ago

How do you prefer to persist your data?

tobyhinloopen7y ago

In-memory database, dumped to file? Not saying that's how I would do it, just thinking of alternatives.

I usually use PostgreSQL.
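
As a rough sketch of the "in-memory database, dumped to file" idea, the following uses SQLite through Python's standard `sqlite3` module and its `Connection.backup` method. The `notes` table and the `dump_to_file` helper are invented for the example, and SQLite is only one possible choice of engine.

```python
import os
import sqlite3
import tempfile


def dump_to_file(mem_conn, path):
    """Copy the whole in-memory database into a file at `path`."""
    disk_conn = sqlite3.connect(path)
    try:
        mem_conn.backup(disk_conn)  # SQLite's online backup API
    finally:
        disk_conn.close()


# Work entirely in memory...
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE notes (body TEXT)")
mem.execute("INSERT INTO notes VALUES ('hello')")
mem.commit()

# ...and persist a snapshot when convenient.
path = os.path.join(tempfile.mkdtemp(), "snapshot.db")
dump_to_file(mem, path)
```

The obvious trade-off is that anything written after the most recent dump is lost if the process crashes.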

2 more replies

zzzcpan7y ago

> I never really know if my query will be O(1), O(log(n)), O(n), etc, or what kind of algorithm will optimize my query

Aka leaky abstractions. SQL just wasn't designed for performance. A query language that takes performance into account should definitely ignore any ideas from SQL. Maybe have declared data structures instead of tables with operations that use known algorithms, explicitly chosen.

gaius7y ago

> SQL just wasn't designed for performance.

That is strictly true in the literal sense that SQL is just a textual representation of relational algebra and calculus, and no one says a mathematical notation is "designed for performance" or otherwise.

But in a more practical, useful sense, it's the language most designed for performance, since the query planner has so much leeway to perform optimisation. It can do more dramatic transformations of the parse tree even than a C compiler.

1 more reply

pipu7y ago

I truly recommend CMU's Andy Pavlo's video lectures on the topic (and also more advanced stuff)

https://www.youtube.com/playlist?list=PLSE8ODhjZXjYutVzTeAds...

PretzelFisch7y ago

I found Prof. Dr. Jens Dittrich's database playlists interesting and pleasant to watch. https://www.youtube.com/channel/UCC9zrtAkl6yY4dpcnWrCHjA

manigandham7y ago

Yes, they have put out several courses over the years and they're all great: https://www.youtube.com/channel/UCHnBsf2rH-K7pn09rb3qvkA/pla...

swinghu7y ago

very good

logicallee7y ago

Any video filter experts here?

Request to any video filter expert

------------------------------------

For example try to read the first keyword under "Translates into:":

https://www.youtube.com/watch?v=Z_cX3bzkExE&t=2m14s

The keyword is unreadable at the start but as you keep looking at it over 50 keyframes it becomes readable to me.

-> Can someone who actually knows this stuff apply a superresolution interlacing filter to this video and post the superresolution version somewhere?

I hope this is not too much work, and I am sure we would all appreciate the results since the slides are not human-readable before applying some kind of superresolution!

reilly30007y ago

peterwwillis7y ago

I'm sorry you got downvoted for this comment. HN voting is the worst.

logicallee7y ago

Perhaps I was naive about the state of the art. After the now-dead reply I received, I searched, and found a couple of papers like this

https://arxiv.org/abs/1801.04590 - "Frame-Recurrent Video Super-Resolution"

but if you look at p. 8, I think many of the algorithms still wouldn't end up with readable text. This paper is from this year, so it is an area of active research.

I wrote a quick mail to the authors to see if they would put the video through their setup (since the last paper update was just 3 months ago) and share their results.

2. Trying it myself...

After my downvotes I tried this small piece of software:

http://www.infognition.com/VideoEnhancer/

Which shows a before/after. Here is their page on their super-resolution algorithm:

http://www.infognition.com/articles/what_is_super_resolution...

I used their plugin in VirtualDub on a sample of the video. The results weren't usable. Here is a picture which shows the before and after:

https://imgur.com/a/0rhy7q7

Now granted I don't think that this particular site uses state of the art algorithms (its references on the page I linked are decades old) but it's the first one I found.

The site also has a page explaining when it doesn't work:

http://www.infognition.com/articles/when_super_resolution_do...

It specifically calls out "If your video is compressed to a low bitrate, in many cases this is very bad for super-resolution."

1 more reply

dicroce7y ago

DenisM7y ago

Two reasons to not have indexes in the query:

1. A query expresses the result it produces, not the method used to obtain it. Semantics vs. implementation. It may be a pain to write, but it will be easier to read later.

I agree it is sometimes a pain to force SQL to use the index you wanted it to use.

AmericanChopper7y ago

cryptonector7y ago

Something like (in some terrible pseudocode):

    q = parse_query("...");
    q.hint(FIRST_TABLE, "a");
    q.hint(INDEX, "b", "b_idx1");
    c = q.compile();
    r = c.run(...);

vidarh7y ago

wgjordan7y ago

> what if there was a different query language with explicit index syntax..

There is, it's a feature in MySQL called Index Hints [1].

[1] https://dev.mysql.com/doc/refman/8.0/en/index-hints.html
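
MySQL is not the only engine with explicit index syntax. As a small runnable sketch, SQLite has an `INDEXED BY` clause (and `NOT INDEXED`), shown here through Python's standard `sqlite3` module. The table `b` and index `b_idx1` borrow the names from the pseudocode upthread; the columns and the `plan` helper are invented for the example. Note that SQLite's documentation describes `INDEXED BY` as a requirement that makes the statement fail if the index cannot be used, intended mainly for catching regressions rather than for tuning.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE b (id INTEGER PRIMARY KEY, x INTEGER, y INTEGER);
    CREATE INDEX b_idx1 ON b (x);
    CREATE INDEX b_idx2 ON b (y);
""")


def plan(sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Insist on b_idx1 even though b_idx2 would also be a candidate.
hinted = plan("SELECT * FROM b INDEXED BY b_idx1 WHERE x = 1 AND y = 2")

# Forbid the use of any index on this table.
unindexed = plan("SELECT * FROM b NOT INDEXED WHERE x = 1 AND y = 2")

print(hinted)
print(unindexed)
```

The exact wording of the plan text varies between SQLite versions, so the example prints it instead of asserting a fixed string.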

brianwawok7y ago

Oracle has it too.

A DBA can even sit there as queries fly past, and add hints on the fly.

And then you change a query from "select" (lowercase) to "SELECT" (uppercase), and query plans break and you break production.

Fun times

1 more reply

Amezarak7y ago

I would suggest that there's potentially something you need to look at with your database schema - a couple dozen joins shouldn't be causing any problems you have to think about.

As others have said, most databases do have index hinting as part of the query language. However, in my (long) experience, you should almost never use it. Index hints should be a huge code smell.

kimdotcom7y ago

I use UUIDs as primary keys, you insensitive clod!

api7y ago

SQLite is incredible. Tiny and usually used for small stuff but I have heard of 1TB+ databases with acceptable performance.

stevoski7y ago

If you are a Java programmer and want to learn how an SQL database engine works, take a look at the source code of H2.

Even better, try to add a basic feature to H2 (eg. a new built-in function). It is surprisingly easy, and you come away with a decent understanding of the basics of building an SQL database engine.

A_Person7y ago

Gosh, I must say there seems to be some misunderstanding of RDBMS concepts in some posts in this thread!

So what?!!

Just my 2c! :-)

vram227y ago

Well done.

A_Person7y ago

Thanks :-)

mmjaa7y ago

I imagine you've written more than your fair share of PROGRESS 4GL code in the past .. your qualifications questions are pretty much straight out of the PROGRESS 4GL user guide .. ;)

A_Person7y ago

1 more reply

kimdotcom7y ago

So, were you writing DB software before Codd's research at IBM was available?

A_Person7y ago

okket7y ago

Sadly bad audio (room mic with all the ambient noise) and bad video quality (slides are almost unreadable). But great content.

randop7y ago

Thank you. Very educational. Interesting to know that ORDER BY carries a significant performance penalty without LIMIT.
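
One general way to see why a LIMIT can help, sketched in plain Python: without a limit every row has to be sorted, whereas with `LIMIT k` it is enough to keep track of the best k rows seen so far. This illustrates the idea only and is not a description of how SQLite or any other particular engine implements it; the data is random and the variable names are invented for the example.

```python
import heapq
import random

random.seed(0)
values = [random.random() for _ in range(50_000)]

# ORDER BY without LIMIT: all n rows are sorted, O(n log n).
fully_sorted = sorted(values)

# ORDER BY ... LIMIT 10: a bounded heap keeps only 10 candidates, O(n log k).
top_ten = heapq.nsmallest(10, values)
```

An index on the sort column is the other common way to avoid the sort, since the rows can then be read in order directly.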

serioushaha7y ago

slides : https://www.slideshare.net/VikasBansal23/how-sqlite-works

angelfreak7y ago

Really great, thanks for posting.

bitmapbrother7y ago

This is a much better talk with better video and sound.

https://www.youtube.com/watch?v=Jib2AmRb_rk

cup-of-tea7y ago

What is that acronym he keeps saying? MBCC?

ntonozzi7y ago

MVCC - Multi-version concurrency control.

anothergoogler7y ago

1 more reply

blackrock7y ago

I'm not going to call anyone out here, but why do people keep using the word orthogonal?

It doesn't even compute. It doesn't even make any sense, in how they use it in relation to the topic.

Are the issues at right angles of one another? No.

Are the issues statistically independent of one another? Perhaps.

I suggest using a more appropriate word to describe the situation.

You folks should read the urban meaning of orthogonal, to understand how people roll their eyes at you, when you inappropriately use the term.

https://www.urbandictionary.com/define.php?term=orthogonal

Just another friendly PSA.

ternaryoperator7y ago

In the meantime, I agree that it can be frustrating to see words apparently misused. But I think this is hardly the mark of an "idiot," as you put it.

[1] https://quoteinvestigator.com/2012/10/31/st-pauls-cathedral/

sanderjd7y ago

wwweston7y ago

What does it mean for something to be at a right angle to something else?

There's a euclidean geometric answer to that statement, but it's hardly the only correct answer.

When people use it to mean that they're speaking of two issues that have a range of independent possibilities, it's not wrong to invoke linearly independent bases.

abiox7y ago

is english a second language for you? 'orthogonal' is frequently used to indicate two things are not directly related or dependent.

> You folks should read the urban meaning of orthogonal

nope, nope, nope. that site's a hive of scum and villainy, and a massive number of entries are just random nonsense.

i'd rather go to wiktionary[0], which includes:

"Of two or more problems or subjects, independent of or irrelevant to each other."

[0] https://en.wiktionary.org/wiki/orthogonal

Izkata7y ago

> You folks should read the urban meaning of orthogonal, to understand how people roll their eyes at you, when you inappropriately use the term.

If that mattered at all, then we'd have stopped using other remapped words first, like "tree".

jokoon7y ago

I don't like to use SQL engines because I don't understand how they work, I never really know if my query will be O(1), O(log(n)), O(n), etc, or what kind of algorithm will optimize my query.

Databases still scare me.

blattimwind7y ago

You're not scared, you're just too lazy to learn the tools of your trade.

> That's why I tend to avoid systematically using a SQL engine unless the data schema is very very simple, and manage and filter the data case by case in code.

And that's the mentality that gives us webshops where applying a simple filter results in a couple of seconds of load time and uses hundreds of MB of RAM per request, server side.

barrkel7y ago

2 more replies

aardvark2917y ago

>Databases are not very complex and use pretty much only textbook data structures and algorithms

skeptical expression

3 more replies

coldtea7y ago

>I don't like to use SQL engine because I don't understand how they work, I never really know if my query will be O(1), O(log(n)), O(n), etc, or what kind of algorithm will optimize my query.

Unless you're generating totally dynamic queries that's a moot point.

You can always try it and measure it -- just like, you know, you would profile a program in any programming language. And you can trivially have the database show you the query plan as well.

Do you also not use APIs because you don't know a priori if a call is O(1) or O(N) or O(Nlog(N)) etc?

That's really orthogonal.

Besides, something indexed will be faster, whether it is on disk or in RAM, than something in the same storage that's not indexed.

So unless we're coding something trivial, on the server side we still want all the speed we can get from our data, beyond what simply holding it in plain in-RAM structures provides.

You wouldn't use a linked list as opposed to a hash table just because your data "fit in RAM". Even in RAM ~O(1) vs ~O(N) matters [1].

[1] unless the linked list is so tiny as to fit in cache and avoid the hash table indirection, but I digress

ludsan7y ago

Your concern about the opaque and abstract layers below you applies to language compilers as well (which I think is the "better" alternative you seem to prefer).

That is not to say that you won't need to peek below or concern yourself with certain choices -- indices, commits, columnar-vs-row, etc. -- as your performance or access patterns dictate.

glhaynes7y ago

nostalgeek7y ago

can you recommend some material?

1 more reply

greenyoda7y ago

> I really wonder who, today, have data that cannot fit in RAM, apart from big actors.

Keeping all your data in RAM has significant problems, even if it all fits. For example, would you want to lose all your customers' orders and billing information if your code crashed?

In addition to the relational database model, SQL databases offer ACID transactions, which are useful if you want to have consistent and reliable data:

https://en.wikipedia.org/wiki/ACID
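
A minimal sketch of the "A" in ACID, using SQLite through Python's standard `sqlite3` module. The `accounts` table, the `transfer` helper and the CHECK constraint are invented for the example. It uses an in-memory database for brevity, so it demonstrates atomicity only, not durability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(src, dst, amount):
    """Move money atomically: both updates are committed, or neither is."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was undone too.
        return False


def balances():
    """Return the current balances as a dict of name -> balance."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A failed transfer leaves both balances exactly as they were, even though the credit to the destination account had already been applied inside the transaction.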

slow_donkey7y ago

To be fair, using redis or elasticsearch as a main datastore is doable. Although I'm not sure they're much better choices in terms of understanding how they work.

You could summon Antirez I guess

1 more reply

PuercoPop7y ago

> For example, would you want to lose all your customers' orders and billing information if your code crashed?

[0]: http://cs-www.cs.yale.edu/homes/dna/papers/vldb07hstore.pdf

2 more replies

jnwatson7y ago

For Postgres at least, you can literally ask it how a query works, via EXPLAIN. Now, there’s a skill to understanding the output of that, but at least it isn’t a black box.
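
The same kind of question can be asked of SQLite with `EXPLAIN QUERY PLAN`. The sketch below uses Python's standard `sqlite3` module as a stand-in for Postgres's `EXPLAIN`; the `orders` table, the index name and the `query_plan` helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER)"
)


def query_plan(sql):
    """Join the 'detail' column of EXPLAIN QUERY PLAN into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[3] for row in rows)


QUERY = "SELECT total FROM orders WHERE customer = 'acme'"

before = query_plan(QUERY)  # no usable index yet: a full table scan
conn.execute("CREATE INDEX orders_customer ON orders (customer)")
after = query_plan(QUERY)   # the planner can now search the index

print("before:", before)
print("after: ", after)
```

The plan text differs slightly between SQLite versions, but the first plan reports a scan of the table and the second names the new index.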

wolf550e7y ago

sebojanko7y ago

There's something similar for SQL Server too.

ebikelaw7y ago

8 more replies
