SQLite: Past, Present, and Future (opens in new tab)

(vldb.org)

282 pointschrstr3y ago143 comments

143 comments

dang3y ago

The pdf: https://www.vldb.org/pvldb/vol15/p3535-gaffney.pdf

I shared some notes on this on my blog, because I'm guessing a lot of people aren't quite invested enough to read through the whole paper: https://simonwillison.net/2022/Sep/1/sqlite-duckdb-paper/

airstrike3y ago

Thank you for this. Big fan of your blog and all your contributions to Django

thunderbong3y ago

That's a very comprehensive review. Thank you.

badgerdb3y ago

Indeed, an excellent summary.

sph3y ago

I waited for a tl;dr but this is even better. Much appreciated.

polyrand3y ago

Regarding hash joins, the SQLite documentation mentions the absence of real hash tables [0]

  SQLite constructs a transient index instead of a hash table in this instance 
  because it already has a robust and high performance B-Tree implementation at 
  hand, whereas a hash-table would need to be added. Adding a separate hash table 
  implementation to handle this one case would increase the size of the library 
  (which is designed for use on low-memory embedded devices) for minimal 
  performance gain.

It's already linked in the paper, but here's the link to the code used in the paper [1]

The paper mentions implementing Bloom filters for analytical queries an explains how they're used. I wonder if this is related to the query planner enhancements that landed on SQLite 3.38.0 [2]

  Use a Bloom filter to speed up large analytic queries.

[0]: https://www.sqlite.org/optoverview.html#hash_joins

[1]: https://github.com/UWHustle/sqlite-past-present-future

[2]: https://www.sqlite.org/releaselog/3_38_0.html

kpgaffney3y ago

That's correct, the optimizations from this paper became available in SQLite version 3.38.0.

As we were writing the paper, we did consider implementing hash joins in SQLite. However, we ultimately went with the Bloom filter methods because they resulted in large performance gains for minimal added complexity (2 virtual instructions, a simple data structure, and a small change to the query planner). Hash joins may indeed provide some additional performance gains, but the question (as noted above) is whether they are worth the added complexity.

gorjusborg3y ago

I came for SQLite, got sold DuckDB.

manimino3y ago

TFA appears to be about adapting SQLite for OLAP workloads. I do not understand the rationale. Why try to adapt a row-based storage system for OLAP? Why not just use a column store?

Comevius3y ago

SQLite is significantly better at OLTP and being a blob strorage than DuckDB, and it doesn't want to sacrifice those advantages and compatibility if OLAP performance can be improved independently. In my experience for diverse workloads it is more practical to start with a row-based structure and incrementally transform it into a column-based one. Indeed in the paper there is a suggested approach that trades space for improved OLAP performance.

didgetmaster3y ago

It is certainly possible to have a single system that can effectively process high volumes of OLTP traffic while at the same time performing OLAP operations. While there are systems that are designed to do one or the other type of operation well, very few are able to do both. https://www.youtube.com/watch?v=F6-O9v4mrCc

satyrnein3y ago

It seems like one idea in there is to store it both ways automatically (the HE variant)! That might be better then manually continually copying between your row store and your column store.

badgerdb3y ago

Great discussion here. As one of the co-authors of the paper, here is some additional information.

If you need both transactions and OLAP in the same system, the prevalent way to deliver high performance on this (HTAP) workload is to make two copies of the data. This is what we did in the SQLite3/HE work (paper: https://www.cidrdb.org/cidr2022/papers/p56-prammer.pdf; talk: https://www.youtube.com/watch?v=c9bQyzm6JRU). That was quite clunky. This two copy approach not only wasted storage but makes the code complicated, and it would be very hard to maintain over time (we did not want to fork the SQLite code -- that is not nice).

So, we approached it in a different way and started to look for how we could get higher performance on OLAP queries working as closely with SQLite's native query processing and storage framework.

We went through a large number of options (many of them taken from the mechanisms we developed in an earlier Quickstep project (https://pages.cs.wisc.edu/~jignesh/publ/Quickstep.pdf) and concluded that the Bloom filter method (inspired by a more general technique called Look-ahead Information Passing https://www.vldb.org/pvldb/vol10/p889-zhu.pdf) gave us the biggest bang for the buck.

There is a lot of room for improvement here, and getting high OLAP and transaction performance in a single-copy database system is IMO a holy grail that many in the community are working on.

BTW - the SQLite team, namely Dr. Hipp (that is a cool name), Lawrence and Dan are amazing to work with. As an academic, I very much enjoyed how deeply academic they are in their thinking. No surprise that they have built an amazing data platform (I call it a data platform as it is much more than a database system, as it has many hooks for extensibility).

spaniard892773y ago

I've been learning SQL recently with PostgreSQL and MySQL in an online bootcamp here in Spain. So far very comprehensive. We've touched indexing and partitioning with EXPLAIN ANALYZE for optimizing performance, and I've implemented this strategies successfully onto an active forum I own.

The SQL course has almost no love by the students but so far it has been the most useful and interesting to me.

I was able to create some complex views (couldn't understand how to make materialized views in MySQL), but they were still very slow.

I decided to copy most of this forum DB to DuckDB (with Knime now, until I know better), and optimization with DuckDB seems pointless. It's very, very fast. Less energy usage for my brain, and less time waiting. That's a win for me.

My current dataset is about 40GB, so It's not HUGE, and sure people here in HN would laugh at my "complex" views, but so far I've reduced all my concerns from optimizing to how to download the data I need without causing problems to the server.

dkjaudyeqooe3y ago

In the real world a relational database is the single most useful tool short of a compiler/interpreter. SQL is anachronistic but still works well even if its a pain.

My advice: avoid MySQL like the plague. PgSQL and SQLite is all you ever need and all you ever want.

aljgz3y ago

> The SQL course has almost no love by the students

This is a big early career mistake. I've seen experienced developers use NoSql in a project where Sql is clearly a great fit, then waste lots of manpower to emulate things you get with Sql for free.

Of course one's career can fall into a success path that never depends on SQL, but not learning SQL deeply is not a safe bet.

dkjaudyeqooe3y ago

Most "programmers" just can't understand relational databases and SQL. It's too hard.

I've seen things you wouldn't believe. Random deadlocks in multi-billion transaction reporting systems. Atomic transactions split into multiple commits in banking applications. Copying rows between tables instead of setting a flag on a row. All because highly paid programmers are scared of RDBs.

3 more replies

twh2703y ago

I've got some developer PTSD from a previous project where the solution architect decided to use CosmosDB for the entire domain model that was very relational and very transactional, all because "NoSQL is easy to learn and allows rapid development".

Yeah it is, until you're trying to manually create and maintain relations between documents in different schemas owned by different microservices.

spaniard892773y ago

I've read this again and again in this forum and other dev communities, so I didn't hesitate. I can't say I love SQL, but it's not that bad, Databases look interesting to me.

wodenokoto3y ago

Don’t sell yourself short. I’m sure the minority here knows what a complex view is

spaniard892773y ago

I have no real world experience, I've seen things in Stack Overflow that I hardly manage to understand.

js83y ago

Why cannot SQLite have two different table storage engines for different tables, one row and the other column oriented?

manigandham3y ago

The same reasoning in the article applies: it's a lot of added complexity that isn't related to its core use as a general purpose in-process SQL database.

Usually OLAP at these scales is fast enough with SQLite or you can use DuckDB if you need a portable format before graduating to a full on distributed OLAP system.

ryanworl3y ago

Storage layout is not the primary issue here because IO throughput on commodity hardware has increased significantly in the last 10 years.

DuckDB is significantly faster than SQLite because it has a vectorized execution engine which is more CPU efficient and uses algorithms which are better suited for analytical queries. If you implemented a scan operator for DuckDB to read SQLite files, it would still have better performance.

1egg0myegg03y ago

We have one of those! :-)

And yes it is fast!

https://github.com/duckdblabs/sqlite_scanner

mwish3y ago

I'm confused that why in Figure3, seems in Raspberry Pi, latency is slower than same queries' latency in cloud server. Did I missed something?

rafale3y ago

SQLite vs Postgres for a local database (on disk, not over the network): who wins? (Each in their most performance oriented configuration)

thomascgalvin3y ago

This is basically the exact use case SQLite was designed for; PostgreSQL is a marvel, and at the end of the day presents a much more robust RDBMS, but it's never going to beat SQLite at the thing SQLite was designed for.

bob10293y ago

>most performance oriented configuration

I am 99% sure SQLite is going to win unless you actually care about data durability at power loss time. Even if you do, I feel I could defeat Postgres on equal terms if you permit me access to certain ring-buffer-style, micro-batching, inter-thread communication primitives.

Sqlite is not great at dealing with a gigantic wall of concurrent requests out of the box, but using a little bit of innovation in front of SQLite can solve this problem quite well. The key is resolve the write contention outside of the lock that is baked into the SQLite connection. Writing batches to SQLite on a single connection with WAL turned on and Sync set to normal is pretty much like operating at line speed with your IO subsystem.

prirun3y ago

> I am 99% sure SQLite is going to win unless you actually care about data durability at power loss time.

SQLite will handle a power loss just fine.

From https://www.sqlite.org/howtocorrupt.html:

"An SQLite database is highly resistant to corruption. If an application crash, or an operating-system crash, or even a power failure occurs in the middle of a transaction, the partially written transaction should be automatically rolled back the next time the database file is accessed. The recovery process is fully automatic and does not require any action on the part of the user or the application."

From https://www.sqlite.org/testing.html:

"Crash testing seeks to demonstrate that an SQLite database will not go corrupt if the application or operating system crashes or if there is a power failure in the middle of a database update. A separate white-paper titled Atomic Commit in SQLite describes the defensive measure SQLite takes to prevent database corruption following a crash. Crash tests strive to verify that those defensive measures are working correctly.

It is impractical to do crash testing using real power failures, of course, and so crash testing is done in simulation. An alternative Virtual File System is inserted that allows the test harness to simulate the state of the database file following a crash."

samatman3y ago

Postgres obviously.

Sorry, just thought I'd buck the trend and assume a very write-heavy workload with like 64 cores.

If you don't have significant write contention, SQLite every time.

ledgerdev3y ago

Here's sqlite doing 100 million inserts in 33 seconds which should fit into nearly every workload, though it is batched. https://avi.im/blag/2021/fast-sqlite-inserts/

So write contention from multiple connections is what you're talking about, versus a single process using sqlite?

3 more replies

dinosaurdynasty3y ago

If you can have one "database" thread and 63 "worker" threads, send messages back and forth, and don't hold open transactions, this would probably work with sqlite. Aka treat sqlite like redis.

1 more reply

innocenat3y ago

Where is write contention coming from if it's operated locally?

2 more replies

RedShift13y ago

SQLite is always going to win in that category just from the fact that there are less layers of code to be worked through to execute a query.

remram3y ago

Latency-wise maybe, but throughput can be more important for a lot of applications or bigger databases.

I say "maybe" because even there, SQLite is much more limited in terms of query-planning (very simple statistics) and the use of multiple indexes.

That's assuming we're talking about reads, PostgreSQL will win for write-heavy workloads.

electroly3y ago

As long as you turn it into a throughput race instead of a latency race, PostgreSQL can definitely win. SQLite has a primitive query builder and a limited selection of query execution steps to choose from. For instance, all joins in SQLite are inner loop joins. It can't do hash or merge joins. It can't do GIN or columnstore indexes. If a query needs those things, PostgreSQL can provide them and can beat SQLite.

1 more reply

sophacles3y ago

> just from the fact that there are less layers of code to be worked through

This is not an invariant. I've seen be true, and I've seen it be false. Sometimes that extra code is just cruft yes. Other times though it is worth it to set up your data (or whatever) to take advantage of mechanical sympathies in hot paths, or filter the data before the expensive processing step, etc.

1 more reply

lvass3y ago

SQLite. The most performant configuration is unsuited to most usage, and may lead to database corruption on a system crash.

rafale3y ago

Should have said the most performance oriented setting that's also safe from data corruption.

1 more reply

kpgaffney3y ago

I think the (unsatisfying) answer is "it depends". There's a huge amount of diversity in database workloads, even among the workloads served by SQLite as we mention in the paper.

For read-mostly to read-only OLTP workloads, read latency is the most important factor, so I predict SQLite would have an edge over PostgreSQL due to SQLite's lower complexity and lack of interprocess communication.

For write-heavy OLTP workloads, coordinating concurrent writes becomes important, so I predict PostgreSQL would provide higher throughput than SQLite because PostgreSQL allows more concurrency.

For OLAP workloads, it's less clear. As a client-server database system, PostgreSQL can afford to be more aggressive with memory usage and parallelism. In contrast, SQLite uses memory sparingly and provides minimal intra-query parallelism. If you pressed me to make a prediction, I'd probably say SQLite would generally win for smaller databases. PostgreSQL might be faster for some workloads on larger databases. However, these are just guesses and the only way to be sure is to actually run some benchmarks.

ergocoder3y ago

Functionality-wise, SQLite's dialect is really lacking...

simonw3y ago

Is it the SQL dialect there lacking or is it the built-in functions?

I agree that SQLite default functionality is very thin compared to PostgreSQL - especially with respect to things like date manipulation - but you can extend it with more SQL functions (and table-valued functions) very easily.

2 more replies

bob10293y ago

The entire point is to bring your own functions to SQLite, since it is presumably running in-proc and can be integrated with trivially.

https://sqlite.org/appfunc.html

We currently use this path to offer a domain-specific SQL-based scripting language for our product.

nikeee3y ago

The documentation offers some advice on this:

https://www.sqlite.org/whentouse.html

youngtaff3y ago

Why do people have to publish papers in a weird two column academic format instead of something that's more easily readable?

badgerdb3y ago

Ha ha .. that is what the conference requires. Turns out that there is research that shows that when you are reading paper printed on paper this 2-column format is good for readability and not wasting paper. Conferences still insist on this format even though most people print papers.

Now the good news is that these days, conferences have an accompanying video associated with the paper, and that may be a good place to start for many. That video will be published on the conference website (https://vldb.org/2022/) in about a week.

youngtaff3y ago

Thanks, will lookout for the video

(I tend to read most things on a screen and find two columns of small text tiring)

oxff3y ago

I read papers most of the time on phone and these two column papers are such a PITA to read lol

Kalanos3y ago

i wish it had an optional server for more concurrent and networked transactions in the cloud

bityard3y ago

You may be interested in rqlite: https://github.com/rqlite/rqlite

jjtheblunt3y ago

you could make one pretty easily, no?

axelthegerman3y ago

I'd like to see that. I also think the single write situation is not great for web applications, but I don't see an easy way around it without sacrificing things like consistency

2 more replies

oaiey3y ago

After seeing one diagram: Why the hack are we still talking to databases with SQL strings and not directly specifying the Query-AST? Admins, sure (a fancy UI could help there as well) but why in our code?

samwillis3y ago

I believe most sql engines cache the query plans for parameterized queries, which would cover the majority of requests.

Caching the query plan is also going to go further in performance optimisations than just “precompiling” the SQL to a AST.

speed_spread3y ago

Because query parsing time is totally insignificant compared to query IO?

I mean, I get it but the chances that it makes a noticeable difference are zero in almost every case. Also you'd have to change a lot of the existing tooling, at which point you might as well send a compiled agent or use stored procs?

robertlagrant3y ago

> Because query parsing time is totally insignificant compared to query IO?

The problem there is that the SQL query string is not parsed at compile time of the host program, so things that could be caught at compile-time are not, and things like appending strings to SQL strings in an unsafe way are much too easy to do.

2 more replies

okennedy3y ago

We are... Spark's DataFrame is essentially a relational algebra AST-builder. Microsoft's LINQ interprets SQL directly in at compile-time. All of these, however, run queries more or less directly in the system in which they're specified.

It helps to think of SQL strings as an untrusted wire format. Yes, parsing is a pain, but it comes with two main benefits: (i) The wire format is human writable/interpretable, with all the accompanying benefits, and (ii) The wire format is easily extensible in a predictable way.

That latter one is particularly useful in keeping SQL's ecosystem open. Take a front-end library like SQLAlchemy or ScalikeJDBC for example. It's not practical for any one such library to support every extension provided by every database engine. SQL provides a fall-back for when you need a back-end feature that hasn't been implemented in any given front-end.

oaiey3y ago

C# LINQ does pass an expression tree into the abstraction before it will then serialize it SQL and then the database deserialize it into an AST. LINQ-to-Objects is in memory and works on the AST directly.

Also both LINQ language syntax and library methods are a builder paradigm for the expression tree. Valid, but still far from ideal representation of an AST.

bayindirh3y ago

> Why the hack are we still talking to databases with SQL strings and not directly specifying the Query-AST?

The same reason language servers took off. Instead of one to one mapping, SQL enables one to many mapping with minor tweaks, allowing everyone to do whatever they want over a well known, well defined, mature abstraction.

In the same spirit, I may ask why we're not writing assembly or even machine code, and we have programming languages? Testers, sure, abstraction means clarity up to an extent, but why the developers themselves still use programming languages?

keybored3y ago

You’re missing the point if you think that that is “in the same spirit”.

SQL-as-strings and SQL-as-AST are still the same thing. What is being proposed it not to write procedural code for record retrieval instead of declarative SQL.

Timpy3y ago

Isn't that what a parameterized query does? It separates the sql logic from the inputs so it can cache the query and then it accepts inputs separately. Safer and more optimal at the same time, the engine doesn't have to re-optimize the same query again for the life of the connection. If my understanding is wrong somebody please correct me, it's kind of hard to get good information on what's going on under the hood with these things.

manigandham3y ago

A SQL query is an AST, but represented in a compact portable form. It also supports functions, procedures, and parameterization for flexible and safe query construction.

Your code would get incredibly large and complicated if you had to specify any serious SQL query as a raw AST.

euroderf3y ago

Check out prql[1], it might be a conceptual model you'd like.

[1] https://prql-lang.org/

oaiey3y ago

That is just modern SQL with horrible LINQ memories ;)

leprechaun10663y ago

Most queries to kdb+ do exactly what you are asking for.

AndrewDucker3y ago

Can you give an example of what you mean, and what we'd gain from it?

dkjaudyeqooe3y ago

When you're accessing a SQLite database in code you have to generate a query string. Parameters ameliorate that somewhat, but in many cases you still have to regenerate a new string for each new query. It's inefficient to translate your query into a string only to have it parsed back into something structured by SQLite.

3 more replies

rprospero3y ago

Not the OP, but I can give two gains. First off, by passing an AST, instead of just an SQL string, we cut out a huge number of possibly SQL injection attacks. Second, in most of the projects where I've used of SQL, there's been some kind of database object that builds the actual query, which it then converts to a string. The database then takes that string and parses it into an AST. There's some performance gains to be made by skipping the middle man and just creating the AST directly.

Many years ago, I was on a project that needed to add rows to a database with a hard 10µs limit. Each rows was just four integers, so the writing part was trivial. However, allocating the string, formatting the integers to strings, then parsing the resulting string often put us over the time limit. Every time the time limit was breached, we lost about five grand. Why we were using an SQL database for this at all is a story for a different time.

1 more reply

danielheath3y ago

Having application build a string and pass it to a library which parses the string into an AST cannot be as efficient as just building the AST, right?

1 more reply

pif3y ago

Type safety at compile time to begin with.

emfax3y ago

I’ve been trying to do exactly this. It must be possible.

liveoneggs3y ago

PREPARE?

3233y ago

Didn't Microsoft LINQ try something like that? Was not particularly successful.

manigandham3y ago

LINQ stands for Language-Integrated Query and is incredibly successful at its purpose of providing powerful querying functionality and extensions baked into the C#/.NET language space itself.

This querying framework is what powers translations and compilation into SQL and several other languages (depending on the datastore provider used).

EntityFramework is one of the most advanced ORMs out there and is supremely productive because of LINQ.

Semaphor3y ago

LINQ is stunningly successful, LINQ Query syntax (what you probably are referring to) with LINQ to SQL less so.

And yes, as usual, we have the amazing confusion of Microsoft Naming.

But query syntax is essentially just a way to use an ORM that looks closer to SQL but is strongly typed.

oaiey3y ago

LINQ creates an AST in the .NET land, however, before passing it to an actual SQL database, it serializes the expression tree to SQL.

So no, it does not talk AST to the database.

Ducki3y ago

No, it never tried that.

afavour3y ago

Why not?

izacus3y ago

Same reason you don't read files by concating string commands into another different language and then posting them to the OS.

naikrovek3y ago

my ipad won’t let me search through the PDF, but i couldn’t find where “SSB” was defined, if anywhere. i did not see it defined in the first paragraph, which is where it is first used.

everyone: not all of your readers are domain experts. omissions like this are infuriating.

ryanworl3y ago

Star Schema Benchmark https://www.cs.umb.edu/~poneil/StarSchemaB.PDF

glhaynes3y ago

Just fyi: if you’re viewing the PDF in Safari on your iPad, you can search by typing into Safari’s Location Bar and then choosing “Find ‘xyz’” from the popup that appears.

jrochkind13y ago

came here to ask this. I wondered if it was a typo for SSD!

Thaxll3y ago

"While it continues to be the most widely used database engine in the world"

It realy depends what do you mean by that, yes it's shipping in every phones and browser, but I don't consider that as a database. Is the windows registry a database?

Oracle, MySQL, PG, MSSQL are the most widly used DB in the world, the web runs on those not SQLite.

adamrezich3y ago

there are far, far more sqlite instances than Windows Registry instances in the world.

"SQLite is likely used more than all other database engines combined. Billions and billions of copies of SQLite exist in the wild. [...] Since SQLite is used extensively in every smartphone, and there are more than 4.0 billion (4.0e9) smartphones in active use, each holding hundreds of SQLite database files, it is seems likely that there are over one trillion (1e12) SQLite databases in active use."

https://www.sqlite.org/mostdeployed.html

stonemetal123y ago

>SQLite is primarily designed for fast online transaction processing (OLTP), employing row-oriented execution and a B-tree storage format.

I found that claim to be fairly surprising, SQLite is pretty bad when it comes to transactions per second. SQLite even owns up to it in the FAQ:

>it will only do a few dozen transactions per second.

tiffanyh3y ago

> SQLite is pretty bad when it comes to transactions per second. SQLite even owns up to it in the FAQ: "it will only do a few dozen transactions per second."

That is an extremely poor quote taken way out of context.

The full quote is:

FAQ: "[Question] INSERT is really slow - I can only do few dozen INSERTs per second. [Answer] Actually, SQLite will easily do 50,000 or more INSERT statements per second on an average desktop computer. But it will only do a few dozen transactions per second. Transaction speed is limited by the rotational speed of your disk drive. A transaction normally requires two complete rotations of the disk platter, which on a 7200RPM disk drive limits you to about 60 transactions per second."

https://www.sqlite.org/faq.html#q19

hu33y ago

Yeah my GP got me confused. I remember doing 40k inserts/s in a trading strategy backtesting program with Go and SQLite. Reads were on the same magnitude, I want to say around 90k/s. My bottleneck was CPU.

hnfong3y ago

Given the prevalence of SSDs these days the figure might be out of date as well.

kpgaffney3y ago

Thanks for the clarification. It's true that transaction latency is limited by the write speed of the storage medium. However, an "average desktop computer" these days has an SSD that can support tens of thousands of transactions per second, depending on the workload.

stonemetal123y ago

What is your point? If I need transactions, not just bulk loading inserts, then SQLite isn't the bees knees. PG can handle at least an order of magnitude more transactions per second on the same hardware.

jessermeyer3y ago

Please quote the entire statement. And stop the needless "even owns up to it" FUD.

> Actually, SQLite will easily do 50,000 or more INSERT statements per second on an average desktop computer. But it will only do a few dozen transactions per second. Transaction speed is limited by the rotational speed of your disk drive. A transaction normally requires two complete rotations of the disk platter, which on a 7200RPM disk drive limits you to about 60 transactions per second.

j / k navigate · click thread line to collapse

143 comments

dang3y ago

The pdf: https://www.vldb.org/pvldb/vol15/p3535-gaffney.pdf

simonw3y ago

I shared some notes on this on my blog, because I'm guessing a lot of people aren't quite invested enough to read through the whole paper: https://simonwillison.net/2022/Sep/1/sqlite-duckdb-paper/

airstrike3y ago

Thank you for this. Big fan of your blog and all your contributions to Django

thunderbong3y ago

That's a very comprehensive review. Thank you.

badgerdb3y ago

Indeed, an excellent summary.

sph3y ago

I waited for a tl;dr but this is even better. Much appreciated.

polyrand3y ago

Regarding hash joins, the SQLite documentation mentions the absence of real hash tables [0]

  SQLite constructs a transient index instead of a hash table in this instance 
  because it already has a robust and high performance B-Tree implementation at 
  hand, whereas a hash-table would need to be added. Adding a separate hash table 
  implementation to handle this one case would increase the size of the library 
  (which is designed for use on low-memory embedded devices) for minimal 
  performance gain.

It's already linked in the paper, but here's the link to the code used in the paper [1]

The paper mentions implementing Bloom filters for analytical queries an explains how they're used. I wonder if this is related to the query planner enhancements that landed on SQLite 3.38.0 [2]

  Use a Bloom filter to speed up large analytic queries.

[0]: https://www.sqlite.org/optoverview.html#hash_joins

[1]: https://github.com/UWHustle/sqlite-past-present-future

[2]: https://www.sqlite.org/releaselog/3_38_0.html

kpgaffney3y ago

That's correct, the optimizations from this paper became available in SQLite version 3.38.0.

gorjusborg3y ago

I came for SQLite, got sold DuckDB.

manimino3y ago

TFA appears to be about adapting SQLite for OLAP workloads. I do not understand the rationale. Why try to adapt a row-based storage system for OLAP? Why not just use a column store?

Comevius3y ago

didgetmaster3y ago

satyrnein3y ago

It seems like one idea in there is to store it both ways automatically (the HE variant)! That might be better then manually continually copying between your row store and your column store.

badgerdb3y ago

Great discussion here. As one of the co-authors of the paper, here is some additional information.

So, we approached it in a different way and started to look for how we could get higher performance on OLAP queries working as closely with SQLite's native query processing and storage framework.

There is a lot of room for improvement here, and getting high OLAP and transaction performance in a single-copy database system is IMO a holy grail that many in the community are working on.

spaniard892773y ago

The SQL course has almost no love by the students but so far it has been the most useful and interesting to me.

I was able to create some complex views (couldn't understand how to make materialized views in MySQL), but they were still very slow.

dkjaudyeqooe3y ago

In the real world a relational database is the single most useful tool short of a compiler/interpreter. SQL is anachronistic but still works well even if its a pain.

My advice: avoid MySQL like the plague. PgSQL and SQLite is all you ever need and all you ever want.

aljgz3y ago

> The SQL course has almost no love by the students

This is a big early career mistake. I've seen experienced developers use NoSql in a project where Sql is clearly a great fit, then waste lots of manpower to emulate things you get with Sql for free.

Of course one's career can fall into a success path that never depends on SQL, but not learning SQL deeply is not a safe bet.

dkjaudyeqooe3y ago

Most "programmers" just can't understand relational databases and SQL. It's too hard.

3 more replies

twh2703y ago

Yeah it is, until you're trying to manually create and maintain relations between documents in different schemas owned by different microservices.

spaniard892773y ago

I've read this again and again in this forum and other dev communities, so I didn't hesitate. I can't say I love SQL, but it's not that bad, Databases look interesting to me.

wodenokoto3y ago

Don’t sell yourself short. I’m sure the minority here knows what a complex view is

spaniard892773y ago

I have no real world experience, I've seen things in Stack Overflow that I hardly manage to understand.

js83y ago

Why cannot SQLite have two different table storage engines for different tables, one row and the other column oriented?

manigandham3y ago

The same reasoning in the article applies: it's a lot of added complexity that isn't related to its core use as a general purpose in-process SQL database.

Usually OLAP at these scales is fast enough with SQLite or you can use DuckDB if you need a portable format before graduating to a full on distributed OLAP system.

ryanworl3y ago

Storage layout is not the primary issue here because IO throughput on commodity hardware has increased significantly in the last 10 years.

1egg0myegg03y ago

We have one of those! :-)

And yes it is fast!

https://github.com/duckdblabs/sqlite_scanner

mwish3y ago

I'm confused that why in Figure3, seems in Raspberry Pi, latency is slower than same queries' latency in cloud server. Did I missed something?

rafale3y ago

SQLite vs Postgres for a local database (on disk, not over the network): who wins? (Each in their most performance oriented configuration)

thomascgalvin3y ago

bob10293y ago

>most performance oriented configuration

prirun3y ago

> I am 99% sure SQLite is going to win unless you actually care about data durability at power loss time.

SQLite will handle a power loss just fine.

From https://www.sqlite.org/howtocorrupt.html:

From https://www.sqlite.org/testing.html:

samatman3y ago

Postgres obviously.

Sorry, just thought I'd buck the trend and assume a very write-heavy workload with like 64 cores.

If you don't have significant write contention, SQLite every time.

ledgerdev3y ago

Here's sqlite doing 100 million inserts in 33 seconds which should fit into nearly every workload, though it is batched. https://avi.im/blag/2021/fast-sqlite-inserts/

So write contention from multiple connections is what you're talking about, versus a single process using sqlite?

3 more replies

dinosaurdynasty3y ago

If you can have one "database" thread and 63 "worker" threads, send messages back and forth, and don't hold open transactions, this would probably work with sqlite. Aka treat sqlite like redis.

1 more reply

innocenat3y ago

Where is write contention coming from if it's operated locally?

2 more replies

RedShift13y ago

SQLite is always going to win in that category just from the fact that there are less layers of code to be worked through to execute a query.

remram3y ago

Latency-wise maybe, but throughput can be more important for a lot of applications or bigger databases.

I say "maybe" because even there, SQLite is much more limited in terms of query-planning (very simple statistics) and the use of multiple indexes.

That's assuming we're talking about reads, PostgreSQL will win for write-heavy workloads.

electroly3y ago

1 more reply

sophacles3y ago

> just from the fact that there are less layers of code to be worked through

1 more reply

lvass3y ago

SQLite. The most performant configuration is unsuited to most usage, and may lead to database corruption on a system crash.

rafale3y ago

Should have said the most performance oriented setting that's also safe from data corruption.

1 more reply

kpgaffney3y ago

I think the (unsatisfying) answer is "it depends". There's a huge amount of diversity in database workloads, even among the workloads served by SQLite as we mention in the paper.

For write-heavy OLTP workloads, coordinating concurrent writes becomes important, so I predict PostgreSQL would provide higher throughput than SQLite because PostgreSQL allows more concurrency.

ergocoder3y ago

Functionality-wise, SQLite's dialect is really lacking...

simonw3y ago

Is it the SQL dialect there lacking or is it the built-in functions?

2 more replies

bob10293y ago

The entire point is to bring your own functions to SQLite, since it is presumably running in-proc and can be integrated with trivially.

https://sqlite.org/appfunc.html

We currently use this path to offer a domain-specific SQL-based scripting language for our product.

nikeee3y ago

The documentation offers some advice on this:

https://www.sqlite.org/whentouse.html

youngtaff3y ago

Why do people have to publish papers in a weird two column academic format instead of something that's more easily readable?

badgerdb3y ago

youngtaff3y ago

Thanks, will lookout for the video

(I tend to read most things on a screen and find two columns of small text tiring)

oxff3y ago

I read papers most of the time on phone and these two column papers are such a PITA to read lol

Kalanos3y ago

i wish it had an optional server for more concurrent and networked transactions in the cloud

bityard3y ago

You may be interested in rqlite: https://github.com/rqlite/rqlite

jjtheblunt3y ago

you could make one pretty easily, no?

axelthegerman3y ago

I'd like to see that. I also think the single write situation is not great for web applications, but I don't see an easy way around it without sacrificing things like consistency

2 more replies

oaiey3y ago

samwillis3y ago

I believe most sql engines cache the query plans for parameterized queries, which would cover the majority of requests.

Caching the query plan is also going to go further in performance optimisations than just “precompiling” the SQL to a AST.

speed_spread3y ago

Because query parsing time is totally insignificant compared to query IO?

robertlagrant3y ago

> Because query parsing time is totally insignificant compared to query IO?

2 more replies

okennedy3y ago

oaiey3y ago

Also both LINQ language syntax and library methods are a builder paradigm for the expression tree. Valid, but still far from ideal representation of an AST.

bayindirh3y ago

> Why the hack are we still talking to databases with SQL strings and not directly specifying the Query-AST?

keybored3y ago

You’re missing the point if you think that that is “in the same spirit”.

SQL-as-strings and SQL-as-AST are still the same thing. What is being proposed it not to write procedural code for record retrieval instead of declarative SQL.

Timpy3y ago

manigandham3y ago

A SQL query is an AST, but represented in a compact portable form. It also supports functions, procedures, and parameterization for flexible and safe query construction.

Your code would get incredibly large and complicated if you had to specify any serious SQL query as a raw AST.

euroderf3y ago

Check out prql[1], it might be a conceptual model you'd like.

[1] https://prql-lang.org/

oaiey3y ago

That is just modern SQL with horrible LINQ memories ;)

leprechaun10663y ago

Most queries to kdb+ do exactly what you are asking for.

AndrewDucker3y ago

Can you give an example of what you mean, and what we'd gain from it?

dkjaudyeqooe3y ago

3 more replies

rprospero3y ago

1 more reply

danielheath3y ago

Having application build a string and pass it to a library which parses the string into an AST cannot be as efficient as just building the AST, right?

1 more reply

pif3y ago

Type safety at compile time to begin with.

emfax3y ago

I’ve been trying to do exactly this. It must be possible.

liveoneggs3y ago

PREPARE?

3233y ago

Didn't Microsoft LINQ try something like that? Was not particularly successful.

manigandham3y ago

LINQ stands for Language-Integrated Query and is incredibly successful at its purpose of providing powerful querying functionality and extensions baked into the C#/.NET language space itself.

This querying framework is what powers translations and compilation into SQL and several other languages (depending on the datastore provider used).

EntityFramework is one of the most advanced ORMs out there and is supremely productive because of LINQ.

Semaphor3y ago

LINQ is stunningly successful, LINQ Query syntax (what you probably are referring to) with LINQ to SQL less so.

And yes, as usual, we have the amazing confusion of Microsoft Naming.

But query syntax is essentially just a way to use an ORM that looks closer to SQL but is strongly typed.

oaiey3y ago

LINQ creates an AST in the .NET land, however, before passing it to an actual SQL database, it serializes the expression tree to SQL.

So no, it does not talk AST to the database.

Ducki3y ago

No, it never tried that.

afavour3y ago

Why not?

izacus3y ago

Same reason you don't read files by concating string commands into another different language and then posting them to the OS.

naikrovek3y ago

my ipad won’t let me search through the PDF, but i couldn’t find where “SSB” was defined, if anywhere. i did not see it defined in the first paragraph, which is where it is first used.

everyone: not all of your readers are domain experts. omissions like this are infuriating.

ryanworl3y ago

Star Schema Benchmark https://www.cs.umb.edu/~poneil/StarSchemaB.PDF

glhaynes3y ago

Just fyi: if you’re viewing the PDF in Safari on your iPad, you can search by typing into Safari’s Location Bar and then choosing “Find ‘xyz’” from the popup that appears.

jrochkind13y ago

came here to ask this. I wondered if it was a typo for SSD!

Thaxll3y ago

"While it continues to be the most widely used database engine in the world"

It realy depends what do you mean by that, yes it's shipping in every phones and browser, but I don't consider that as a database. Is the windows registry a database?

Oracle, MySQL, PG, MSSQL are the most widly used DB in the world, the web runs on those not SQLite.

adamrezich3y ago

there are far, far more sqlite instances than Windows Registry instances in the world.

https://www.sqlite.org/mostdeployed.html

stonemetal123y ago

>SQLite is primarily designed for fast online transaction processing (OLTP), employing row-oriented execution and a B-tree storage format.

I found that claim to be fairly surprising, SQLite is pretty bad when it comes to transactions per second. SQLite even owns up to it in the FAQ:

>it will only do a few dozen transactions per second.

tiffanyh3y ago

> SQLite is pretty bad when it comes to transactions per second. SQLite even owns up to it in the FAQ: "it will only do a few dozen transactions per second."

That is an extremely poor quote taken way out of context.

The full quote is:

https://www.sqlite.org/faq.html#q19

hu33y ago

hnfong3y ago

Given the prevalence of SSDs these days the figure might be out of date as well.

kpgaffney3y ago

stonemetal123y ago

jessermeyer3y ago

Please quote the entire statement. And stop the needless "even owns up to it" FUD.

j / k navigate · click thread line to collapse