I've seen it so many times over the years with users and email configurations. I can't count the number of times I've dropped into someone's badly behaving mail configuration, found they had MySQL hosting the users, and been told it was for "performance" reasons. Somehow they didn't grasp that /etc/passwd fits entirely in memory, and that the map files Postfix uses for its various lookups are already a database (one specifically designed for the task at hand) that also fits entirely in memory. Putting a MySQL layer in there is disastrous if performance matters; it is orders of magnitude slower in every case I've seen...still plenty fast for most cases, but it's ridiculous that the idea gets cargo-culted around that storing your mail users in MySQL will make your mail server faster.
A little knowledge is a dangerous thing, is what I'm trying to say, and people who know MySQL is "fast" may not know enough to know that it's not the right tool for the job in a lot of cases...and is probably slower for many use cases. I'm pretty confident this is one of those cases. SQLite is wicked fast on small data sets, and being smaller means more of it will fit in memory; I can't think of any way MySQL could be a more performant choice for this workload.
Also, I don't even want to try to imagine shipping an installable desktop application for non-technical users that relies on MySQL!
It's really not that hard to imagine. There is an embedded library version of MySQL called libmysqld which is made for this very purpose. Of course as you point out it's quite overkill for something that would work perfectly well with Berkeley DBs.
Probably because it runs in the same address space as each server process and there is no message passing (TCP/unix socket) overhead.
That is a good opportunity to ask questions and learn about the person, their knowledge and their use case.
Yeah, sometimes the answer is that they don't know how stuff works underneath and are just following a blog post, or a trend, or something they heard someone talk about at a meetup.
Programmers love programming tools / frameworks / languages. There are so many new shiny things out there; new ones pop up every month. New language, new build system, new DB, new UI library, new message queue, etc. Interestingly, over the years, lots of those have brought great performance improvements. Maybe they took advantage of a new kernel API or some new algorithm and so on. That feeds the hype -- put new shiny things in and performance doubles.
So people have been trained to expect all the new things to just magically be better. But like I said, you have to probe and find out the reason they propose the change; only then can you see whether it is justified or not (one of the reasonable responses could be "I've actually measured with a production load and it does look better").
Say you have an MVC web app: could you have asynchronous calls to a thread performing the actual DB writes? From your app's perspective, concurrent writes would be placed into a queue that performs them serially.
Are there any functional issues to this approach? Thanks in advance for your thoughts or info.
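The pattern being described might look like this in Python with the stdlib sqlite3 module (a sketch; all names here are illustrative, not from any real app). One thread owns the connection and drains a queue, so writes are serialized regardless of how many handlers enqueue them:

```python
import os
import queue
import sqlite3
import tempfile
import threading

def writer_loop(db_path, write_queue):
    # Single thread owns the connection; all writes are applied serially.
    conn = sqlite3.connect(db_path)
    while True:
        item = write_queue.get()
        if item is None:  # sentinel: shut down
            break
        sql, params = item
        conn.execute(sql, params)
        conn.commit()
    conn.close()

db_path = os.path.join(tempfile.mkdtemp(), "app.db")
write_queue = queue.Queue()
worker = threading.Thread(target=writer_loop, args=(db_path, write_queue))
worker.start()

# Request handlers enqueue writes instead of touching the DB directly:
write_queue.put(("CREATE TABLE IF NOT EXISTS events (msg TEXT)", ()))
write_queue.put(("INSERT INTO events (msg) VALUES (?)", ("user signed up",)))

write_queue.put(None)  # drain the queue and stop the writer
worker.join()
```

The functional caveats are the obvious ones: a handler returns before its write has landed, so an immediate read may not see it, and write errors surface asynchronously rather than in the request that caused them.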
Anyway, I'm not saying "never use MySQL for mail users" (though I think the percentage of deployments where it makes sense is closer to zero than it is to one), I'm just trying to make the point that MySQL is, in some folks' minds, a magical solution to performance problems. Often it not only introduces needless complexity, it won't even improve performance. It's a classic example of "when all you have is a hammer, everything starts to look like a nail". MySQL is a very fine hammer. It just isn't the right tool for every job.
The article we're talking about is another case where a little knowledge is a dangerous thing. A desktop app serving one user with a tiny data set (as I understand it, we're talking about the metadata for a person's music collection) is exactly the right workload for SQLite. I'd be shocked if a naive port to MySQL were faster (though they acknowledge that there's room for query optimization), and not at all surprised if it were slower. And, I know it'll require more memory and disk space for the same working set.
Shouldn't user groups give a bit more accountability there?
The Amarok music player in the KDE Linux desktop environment (at least on version 3.5) kept the music collection metadata in a database. You could choose SQLite, MySQL, or Postgres during installation, if I remember correctly. It worked OK because of the package system.
I guess if you used separately installed databases there would be some conflicts, though.
He (they?) picked SQLite for all the correct reasons:
- best tool for the job for their situation;
- write-light and read-heavy;
- zero configuration;
- easy to embed;
- understanding that optimizing queries gives by far the best performance gain in the shortest amount of time.
As an aside, I'm currently using SQLite for Bacula and Postfix, and it's a joy to use; the only drawback I've found so far is the lack of REGEXP REPLACE in the SQL dialect the database supports (REGEXP must be loaded with .load /path/to/lib/libpcre.so; it is not part of the language). I used the Oracle RDBMS for my PowerDNS deployments, but in retrospect, given the way PowerDNS works, SQLite would have been an even better match. All in all, it is great to read that someone picked it for all the correct reasons, rather than following some fashion trend, as is often the case in the computer industry.
In my bitcoin implementation, as an experiment, I tried storing the blockchain in SQLite, Postgres, and LevelDB. I gathered up a bunch of data from the first ~200k blocks of the blockchain and benchmarked all three databases, querying for something like 30,000 utxos out of a set of a couple million. What took 300-400ms in LevelDB took 1.6 seconds in Postgres (in the REPL; in my actual node it would have taken longer due to deserialization of the utxos). What took 1.6 seconds in Postgres took over 30 seconds in SQLite.
Now, you can tell me I did the benchmarks wrong, and "oh, if you just did this it would be faster!", but 30+ seconds is slower to an absolutely insane level. Needless to say, I went the key-value store route, but I was still astounded at how slow sqlite was once it got a few million records in the database.
I actually like SQLite, but when you know you're going to be dealing with 70 GB of data and over 10 million records, preemptive optimization is key. If I were the author, I would consider switching to Postgres if more than 500k-1m records are expected. That said, if they're partial to SQLite, SQLightning (https://github.com/LMDB/sqlightning) looks pretty interesting (SQLite with an LMDB backend).
edit: To clarify, these weren't particularly scientific benchmarks. This was me timing a very specific query to get an idea of the level of data management I was up against. Don't take my word for it.
It was a benchmark I ran just to personally give me a general idea of what I was up against in terms of data management. This is primarily why I didn't post them in the first place. This was several months ago. I'll try to find the code that ran them and put them up on github if I can, but I doubt it will be that useful without the actual data, which isn't easy to upload.
So yes, to clarify, don't take my word for it. This is just my experience.
fsync()ing a couple hundred thousand individual INSERTs isn't fast.
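For illustration, in Python's sqlite3 the difference is one transaction (and thus roughly one fsync) for the whole batch versus one per row. A sketch, with a made-up table and made-up sizes loosely matching the benchmark described above:

```python
import os
import sqlite3
import tempfile

db = os.path.join(tempfile.mkdtemp(), "bench.db")
conn = sqlite3.connect(db)
conn.execute("CREATE TABLE utxos (id TEXT PRIMARY KEY, data BLOB)")

rows = [(str(i), b"\x00" * 60) for i in range(100_000)]

# Slow pattern: each INSERT is its own implicit transaction, so each
# one waits for its own fsync:
#   for row in rows:
#       conn.execute("INSERT INTO utxos VALUES (?, ?)", row)
#       conn.commit()

# Fast pattern: one transaction for the whole batch -- one fsync total.
with conn:
    conn.executemany("INSERT INTO utxos VALUES (?, ?)", rows)
```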
I'm sorry to say but you're almost certainly doing something wrong then.
If I remember right: Querying for leveldb was iteration from [20-byte-hash][lo-utxo-id] to [20-byte-hash][hi-utxo-id] and subsequent lookups on the keys containing the actual data. The sql table setup was `id` as a char primary key and `address` as a char with an index. The query was a simple `select data from utxos where address = ?`. The actual `data` blob was on average maybe 60 bytes in size.
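For concreteness, here's a minimal reconstruction of the SQL side as described (the table and column names are my guesses from the description, not the actual benchmark code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE utxos (id TEXT PRIMARY KEY, address TEXT, data BLOB)")
conn.execute("CREATE INDEX idx_utxos_address ON utxos (address)")

# ~60-byte data blobs, many utxos per address, as in the description above
conn.executemany(
    "INSERT INTO utxos VALUES (?, ?, ?)",
    [("tx%d" % i, "addr%d" % (i % 100), b"x" * 60) for i in range(10_000)],
)
conn.commit()

rows = conn.execute(
    "SELECT data FROM utxos WHERE address = ?", ("addr7",)
).fetchall()
```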
Maybe there was something I could have done better there. I'm not sure. This was just my experience. This is all beside the point anyway. My point was whatever database was better, it was worth testing each one, and I don't consider it to be premature optimization.
Can you share what the table schema is? What the queries are? And what indices built for the Postgres and Sqlite tables?
https://engineering.microsoft.com/2015/10/29/sqlite-in-windo...
Between that, and packages readily available on most Linux and BSD distros out there (and, in most cases, installed by default), it's well on its way to becoming a de facto standard system API for relational storage.
The nice thing about SQLite is how little it assumes about the world outside. That made it easy to run in WinRT application sandbox with minimal changes; and for quite a while, it was the only local DB readily available to WinRT code written in C# or C++ (JS got IndexedDB).
So I am speaking about embeddable DBs here.
Konstantin Knizhnik has implemented an impressive set of embedded DBs: http://garret.ru/databases.html
His POST++, for example, maps directly to C++ classes, so if you use C++ you don't need any ORM.
In my Sciter[1] engine I am using his DyBase library [3] as the built-in persistence for Sciter's script [2] (JavaScript++).
With DyBase, in script you get features similar to MongoDB (a NoSQL free-form DB) but without any need for an ORM or DAL: you can declare some root object as persistable and access those data trees as if they were ordinary JavaScript objects. The engine pumps objects from the DB into memory as they are needed:
    var storage = Storage.open(...);
    var dataRoot = storage.root; // all things inside are persistable
    dataRoot.topics = []; // flat persistable list
    dataRoot.topics.push({ foo:1, bar:2 }); // storing an object
    // create an indexed collection with string keys; keys can be unique or not
    dataRoot.titles = storage.createIndex(#string);
DyBase has Python bindings too.

[1] http://sciter.com - multiplatform HTML/CSS UI Engine for Desktop and Mobile Application
[2] TIScript - http://www.codeproject.com/Articles/33662/TIScript-Language-...
[3] DyBase - http://www.garret.ru/dybase.html
It took until the very last paragraph for the blog post to make that point.
I'm actually surprised WordPress hasn't ever moved over to it for ease of installation/deployment - WordPress and PHP themselves seem more likely to trip over in most deployments I've seen than SQLite would.
For the same reasons WordPress hasn't moved to Postgres or doesn't include feature X, Y, or Z: WordPress has to remain relatively stable. A change in the database means breaking the WordPress ecosystem, as many plugins add tables to the DB. There is very little abstraction in the WordPress API when it comes to its interaction with the DB; WordPress doesn't ship with an ORM.
What problems should I be expecting in a multi-access scenario? I've never had that question answered adequately.
Wouldn't a "full" RDBMS like Mysql/Postgres offer a ton of benefits over SQLite (like the features to handle edge cases as they arise) to the point where, even if SQLite would work, SQLite still wouldn't be the ideal choice?
Does wordpress completely abstract the database or do third-party plugins use their own interfaces to the database, in which case a migration to SQLite would break a lot of them?
Providing there aren't too many updates at once, such that locking becomes a problem, it works really really well.
[1] - http://dhcp.io/
It's not just "internal" tests, like unit tests or things that can be run from client C programs, but tests of all kinds of hard-to-simulate external situations like out-of-memory and power failure situations.
I also noted that the assert section essentially does Design by Contract. This is a subset of formal specification whose prime value is in preventing interface errors (the vast majority in big apps) and supporting formal verification. It's been done in the design/spec phase in high-assurance work since both Edsger Dijkstra and Margaret Hamilton independently discovered the technique's value, and at the language level since Eiffel, due to Bertrand Meyer. Good to see that, even if they're not using all the techniques, they're applying the 80/20 rule to get the most benefit out of what formal specs they do use. Contracts also let you easily enable run-time checks if you can accept the performance hit. Nice.
> we have almost no indices, no principled denormalization
Sounds like an easy win. People are probably suggesting a database switch because they're finding issues with the current speed, but they're not using their current database to its full potential yet.
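For example (a generic sketch, not beets' actual schema), SQLite's EXPLAIN QUERY PLAN makes it easy to check whether an index would turn a full table scan into an index search:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, artist TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO items (artist, title) VALUES (?, ?)",
    [("artist%d" % (i % 50), "title%d" % i) for i in range(1000)],
)

query = "EXPLAIN QUERY PLAN SELECT * FROM items WHERE artist = ?"

before = conn.execute(query, ("artist7",)).fetchall()  # plan reports a SCAN

conn.execute("CREATE INDEX idx_items_artist ON items (artist)")
after = conn.execute(query, ("artist7",)).fetchall()   # plan now uses the index

print(before)
print(after)
```

The exact wording of the plan rows varies between SQLite versions, but the SCAN-vs-SEARCH distinction is stable and is usually all you need to confirm the win.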
> The main case when beets writes to its database is on import, and import performance is dominated by I/O to the network and to music files.
Small file I/O is universally slow, even on SSDs, and if you're hitting rate-limited MusicBrainz servers for each file, database performance is almost irrelevant.
IMO, instead of writing about how SQLite is the best choice for the project, it would be better to add support for multiple backends. Something being good for you and the majority of users doesn't mean it's good for everyone.
Sure, for larger databases or concurrent user access we use other databases, but nothing beats the 'zero footprint' install of SQLite. I even enjoyed the fact that earlier versions of SQLite forced me to think about and optimise my queries due to the lack of nested SELECTs.
SQLite still kind of reminds me of my early days in MySQL. I was recently trying to downgrade MySQL 5.7 to 5.6 after a slew of issues, which forced me to reminisce about how simple things used to be when MySQL was a lot like SQLite still is now...
I'm glad to see this post; one of the reasons that I like beet so much is that everything is self-contained.
Another nice advantage of it is if you are distributing something that requires a small dataset[0][1]. If I give you both the code and the data already imported into a sqlite database, then you can use the code right away, or you can dump the data to a different database very easily.
Desktop apps are like the sweet spot for sqlite. It's practically made for them.
Of course I can add locking or something in my code, but I'd prefer to handle at a lower level — for example, have SQLite take the latest write without corrupting. I'm hoping someone has solved this problem with SQLite elegantly.
Relevant: https://www.sqlite.org/howtocorrupt.html
You should enable WAL (PRAGMA journal_mode=wal); I believe you need to do this on every connection.
Now, whenever you access the data, do it within transactions (https://www.sqlite.org/lang_transaction.html). SQLite will generally acquire the proper locks when needed; it minimizes the amount of locked time by deferring locks until they are required, but you can use the IMMEDIATE or EXCLUSIVE modifiers. If you use the EXCLUSIVE modifier, only one program will be able to read/write at a time. You will generally want the DEFERRED (default) behavior, since it still provides safety while letting multiple applications access the database at the same time.
Also note that locking might not work if you keep the database on network file system.
As the developers behind the project, I'd have to think the authors are in the best position to make the determination about which tool is appropriate.
Still, we should recall that the utility gained from using the right tool is finite and variable. In some cases it can be like night and day, while in other cases a moderate or negligible improvement might be all you see. Under some circumstances, familiarity with a given tool may outweigh the advantages of a better-suited one, especially if a team lacks well-rounded expertise.
Reinforcing the other side of the argument is the fact that programmers are often subject to constraints placed on them from on high that restrict the choices they might otherwise make.
Given the specific situation, the author seems to build a pretty good case for SQLite at least by my own limited understanding of the facts.
What the browser vendors agreed on is that they would rather have IndexedDB. I've never used it, but they say it is a lower-level API than SQL, and using it you could build your own SQL abstraction layer above it.
So we have as standards the key-value-based localStorage and then, halfway between that and SQL, is IndexedDB.
More: https://hacks.mozilla.org/2010/06/beyond-html5-database-apis...
The author doesn't say a word about synchronization when writing to SQLite.
You don't even need to read the comments to know what people will say:
"SQLite is a great fit for this type of application. It's a replacement for fopen, people. fopen."
"What about ALTER TABLE?"
"It's just a toy database, it doesn't even support concurrent writers"
----- "WAL mode"
"Hey, golang, rqlite"
----- "Whoa I wrote something similar for a ..."
----- "Why would you use this? Just use postgres"
"SQLite is the best database ever"
"SQLite is the worst database ever"
If you want tangible info you can actually use, read SQLite's documentation. There's a wealth of information there.
Here are some posts, for the Python crowd:
http://charlesleifer.com/blog/five-reasons-you-should-use-sq...
http://charlesleifer.com/blog/using-the-sqlite-json1-and-fts...
http://charlesleifer.com/blog/my-list-of-python-and-sqlite-r...