Are there huge differences in performance, features or search quality? At which scale does using Postgres for full text search still make sense?
IMO, the only time to reach for an outside system is when the data isn't being written to PG first (like log ingestion with elastic search) or when search is such a central part of your app that it mandates a separate dedicated system.
As in: it works well enough, and not having to add another piece of tech makes it a no-brainer. I've had zero support issues or customer complaints, and most of my applications use full-text search heavily.
The big advantage over other approaches: because it's SQL, living in the same database where I also store users and permissions, I can permission-limit my full-text searches.
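A minimal sketch of what that looks like, assuming a hypothetical schema with a `documents` table (carrying a precomputed `body_tsv` tsvector column) and a per-user `document_acl` table:

```sql
-- The FTS match and the permission check happen in one query,
-- so a user only ever sees hits they are allowed to see.
SELECT d.id, d.title
FROM documents d
JOIN document_acl a ON a.document_id = d.id
WHERE a.user_id = 42
  AND d.body_tsv @@ plainto_tsquery('english', 'quarterly report');
```

Doing this in an external search engine typically means either duplicating the ACL data into the index or post-filtering results, both of which are easy to get wrong.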
We recently tried loading our data into an Elasticsearch index for one particular use case (a weighted sum over the 20M rows based on an FTS criterion) where we felt Postgres was underperforming. On the same hardware, using all available RAM and CPUs, ES took 6s while PG took 0.7s.
So far, across the 30+ queries of our dashboard tool, we have yet to find a use case that Postgres didn't handle better than Lucene-based solutions.
"ZomboDB is a Postgres extension that enables efficient full-text searching via the use of indexes backed by Elasticsearch. In order to achieve this, ZomboDB implements Postgres' Access Method API.
In practical terms, a ZomboDB index appears to Postgres as no different than a standard btree index. As such, standard SQL commands are fully supported, including SELECT, BEGIN, COMMIT, ABORT, INSERT, UPDATE, DELETE, COPY, and VACUUM."
Feel free to email the mailing list (zombodb@googlegroups.com). I'd be happy to help answer any questions you might have
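A sketch of what using such an index looks like; the exact syntax has varied across ZomboDB releases (this follows the newer form), and the table and URL here are placeholders:

```sql
CREATE TABLE products (
    id    serial PRIMARY KEY,
    name  text,
    descr text
);

-- The index is backed by the Elasticsearch cluster at the given URL;
-- to Postgres it behaves like any other index.
CREATE INDEX idx_products_zdb ON products
    USING zombodb ((products.*))
    WITH (url = 'http://localhost:9200/');

-- ZomboDB's ==> operator sends the query through to Elasticsearch.
SELECT * FROM products WHERE products ==> 'descr:waterproof';
```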
The time came to replace our website search (tens of thousands of pages), and we decided to try rolling our own. Someone suggested ElasticSearch, and as I read through it, it seemed to do less than PostgreSQL. I still had the hard problems of (1) spidering the site and (2) converting all the file formats (.doc, .xls, .pdf, etc.).
I ended up just putting wget on a daily cron job to spider the site. Then I ran the saved files through a hodgepodge of scripts to extract the plain text and put it into PostgreSQL.
Once it's there, it's far easier to do the rest. Postgres has its own functions to search for matches, rank the matches, give you snippets, and even highlight the search words in the snippets. It's amazing.
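The functions mentioned map onto SQL roughly like this (a sketch, assuming a hypothetical `pages` table with the raw text in `body` and a precomputed tsvector in `body_tsv`):

```sql
-- @@ matches, ts_rank ranks, ts_headline produces a snippet with
-- the search terms highlighted.
SELECT title,
       ts_rank(body_tsv, q)            AS rank,
       ts_headline('english', body, q) AS snippet
FROM pages,
     to_tsquery('english', 'postgres & search') AS q
WHERE body_tsv @@ q
ORDER BY rank DESC
LIMIT 10;
```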
Searches run in a split second. Well, at first, when I was testing, they often took a few seconds. But the weird thing is that after go-live it ran faster. My best guess is that so many users caused Postgres to cache more and more of itself into RAM. The whole server is still using less than 1 GB though, and it's running Apache and Postgres for the website and all its apps.
In my experience performance is great if you're just doing text search, but if you combine that with other operators in the same SELECT it can be much slower than Elasticsearch since in many of those cases Postgres needs to fall back to a full table scan.
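For example (hypothetical table and column names), a GIN index covers the text-match clause on its own, but once you add an unrelated filter the planner has to combine strategies, and depending on selectivity it may decide a sequential scan is cheaper:

```sql
CREATE INDEX pages_body_tsv_idx ON pages USING gin (body_tsv);

-- Inspect the plan: the extra range condition can change
-- which access path the planner picks.
EXPLAIN
SELECT id
FROM pages
WHERE body_tsv @@ to_tsquery('english', 'invoice')
  AND created_at > now() - interval '30 days';
```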
The first issue concerns https://www.postgresql.org/docs/9.6/static/textsearch-parser...: The documentation says
> At present PostgreSQL provides just one built-in parser, which has been found to be useful for a wide range of applications
and it really means it: changing the behaviour of this component is not possible unless you write a completely different parser in C, which, while possible, is no fun.
We're using the full text feature over product data, and we have to work around the parser too eagerly detecting email addresses and URLs, which interferes with properly detecting brand names that contain some of these special characters.
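You can see the built-in parser's token classification with `ts_debug`; anything containing an `@` tends to come back with the `email` token type, which is the behaviour being worked around here (the sample strings are made up):

```sql
-- Shows, for each token, which token type the default parser assigned.
SELECT alias, token
FROM ts_debug('english', 'Contact sales@acme.de about the ACME-2000');
```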
The other problem is the compound support. A lot of our data is in German which like other languages likes to concatenate nouns.
For example, you'd absolutely want to find the term "Weisswürste" for the query "wurst" (note the compound, and the umlaut the plural adds to "wurst").
Traditionally, you do this using a dictionary and while Postgres has support for ispell and hunspell dictionaries, only hunspell has acceptable compound support, which in turn isn't supported by Postgres.
So we've ended up using a hacked ispell dictionary where we have to mark all known compounds which is annoying and error-prone.
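Registering such a custom dictionary looks roughly like this (a sketch; the dictionary name and file names are placeholders, and the `.dict`/`.affix` files must live in the server's `$SHAREDIR/tsearch_data` directory):

```sql
-- Register the hacked Ispell dictionary...
CREATE TEXT SEARCH DICTIONARY german_compound (
    TEMPLATE  = ispell,
    DictFile  = german_compound,
    AffFile   = german_compound,
    StopWords = german
);

-- ...and wire it into the text search configuration, falling back
-- to the stemmer for words the dictionary doesn't know.
ALTER TEXT SEARCH CONFIGURATION german
    ALTER MAPPING FOR asciiword, word
    WITH german_compound, german_stem;
```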
Also, once you have to use a dictionary, you end up with a further issue: loading the dictionary takes time, and due to the way Postgres currently works, it has to happen per connection. In our case, with the 20MB hacked German ispell dictionary, this takes ~0.5s, which is way too long.
The solution for this is to use a connection pooler in front of Postgres. This works fine but, of course, adds more overhead.
The other solution is http://pgxn.org/dist/shared_ispell/, but I've had multiple postmaster crashes due to corrupted shared memory (thank you, Postgres, for crashing instead of corrupting data) related to that extension, so I would not recommend this for production use.
Lucene, and by extension Elasticsearch, has much better built-in text analysis, so it could probably fix both the parser and the compound issue. But that would mean even more additional infrastructure, and probably performance problems too: we can't simply return all the FTS matches, we have to check each one for other reasons why it must not be shown, which hits the database again, and I'm wary of somehow pushing all that logic into ES as well.
This is why we currently live with the Postgres tsearch limitations. But sooner or later, we'll probably have to bite the bullet and move to a dedicated solution.
Postgres is really reliable, and I think a lot of the performance difference comes from robust transactions. For some use cases you can use both and replicate data or query one + the other in sequence.
http://www.sai.msu.su/~megera/postgres/talks/pgopen-2016-rum...
This makes migrating off Heroku for Postgres a PITA and requires downtime.
Everyone speaks about InnoDB and how performant and reliable it is... and multiple firms even use it as a KV-store (Uber/Pinterest/AWS) bypassing MySQL entirely. I have never heard much about storage engines in Postgres, why could this be so?
Wikipedia has a (stub) article on InnoDB, but nothing on Postgres' storage engines... just wondering why that is.
We (Pinterest, I wrote most of the MySQL automation) make heavy use of MySQL replication which is vastly simpler to manage than PG. All queries still flow through SQL and unlike PG, we can force whatever execution plan we need. We do lots of PK lookups, and InnoDB is really good at that. In InnoDB all the data is stored in the PK while in PG it is just a pointer.
This is just a consequence of the PK being a clustered index in InnoDB which has both pros and cons. One of the big cons is that all of the columns of the PK are implicitly added to every secondary index as the row identifier. That isn't a big problem if your PK is a single column int, but if it's multiple columns, that often results in unnecessary bloat in your secondary indexes. Ideally (as in, dare I say, MS SQL Server), you'd have the option of a clustered or non-clustered PK for your table so you could choose the optimal index structure for your workload on a per-table basis.
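For illustration, the T-SQL for the SQL Server option mentioned (table and column names are made up): the PK is declared `NONCLUSTERED`, and the clustered index goes on whichever column best fits the access pattern:

```sql
CREATE TABLE orders (
    order_id    int       NOT NULL,
    customer_id int       NOT NULL,
    placed_at   datetime2 NOT NULL,
    -- The PK enforces uniqueness but does not dictate row order...
    CONSTRAINT pk_orders PRIMARY KEY NONCLUSTERED (order_id)
);

-- ...so the physical clustering can follow the dominant query pattern
-- (e.g. range scans by time) instead.
CREATE CLUSTERED INDEX cx_orders_placed_at ON orders (placed_at);
```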
Being able to force the execution plan is more useful in MySQL than PostgreSQL because MySQL's optimizer is not very good at planning queries.
If you do a lot of PK lookups, then you don't need to force the execution plan.
https://www.postgresql.org/docs/9.5/static/postgres-fdw.html
For example, Citus Data provides a column store for Postgres via the fdw api.
Some earlier (2013) discussion on the same topic: https://wiki.postgresql.org/wiki/2013UnconfPluggableStorage
Here is as well some documentation on the matter: https://wiki.postgresql.org/wiki/HeapamRefactoring
Having "CREATE ACCESS METHOD [...] ON STORAGE|TABLE" to create a custom access method, or storage engine, and extending CREATE TABLE to be able to pass a storage method with the table definition could become a quite powerful combination. The main challenge is to come up with an interface solid enough to be able to handle problems related to MVCC, like VACUUM cleanup.
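PostgreSQL 9.6 already ships the index half of this command; the proposal above is essentially to extend it to table storage. A sketch of the existing form (the access method name and handler function here are hypothetical placeholders for a handler implemented in C):

```sql
-- 9.6 syntax for registering a custom index access method.
-- myam_handler would be a C function returning an index AM handler.
CREATE ACCESS METHOD myam TYPE INDEX HANDLER myam_handler;
```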
Because PG isn't designed around pluggable storage engines, it's not really practical to take the storage engine out and use it separately, and it doesn't make much sense to talk about the storage engine separately from the whole system.
Even then, there aren't a lot of published cases of people using these alternative access methods at scale yet. AFAIK, all of the large kv use-cases you've mentioned still go through traditional SQL queries. Despite the overhead of SQL parsing, it provides more control and visibility. The ecosystem around alternative access methods isn't nearly as mature.
What? Everyone speaks about how unreliable it is and how many major data corruption problems it has.
>I have never heard much about storage engines in Postgres, why could this be so?
Because they didn't take the approach of having multiple storage engines, they just made one that works and is not easily removed from the database.
InnoDB is, and always has been, a very reliable and durable storage engine with solid performance characteristics.
This one is huge for my company. Almost every single query of ours could use an index-only scan, but the planner would never choose to perform one because of the weirdness around partial indexes. We're expecting a several-x speedup once we upgrade to 9.6. All they need to improve now is a way to keep the visibility map up to date without relying on vacuums.
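A sketch of the pattern in question (hypothetical table): a partial index that covers all the columns a query needs, so that with an up-to-date visibility map the planner can answer it from the index alone:

```sql
-- Partial covering index: only rows matching the predicate are indexed.
CREATE INDEX orders_open_idx ON orders (customer_id, total)
WHERE status = 'open';

-- If every referenced column is in the index and the visibility map is
-- current, 9.6 can plan this as an index-only scan.
EXPLAIN
SELECT customer_id, total
FROM orders
WHERE status = 'open';
```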
I don't think that's that unlikely to change. There are two major avenues: writing it during HOT pruning (which is done on page accesses), and performing a "lower impact" vacuum on insert-only tables more regularly.
> but hopefully the changes in 9.6 will make it a non-issue on large tables.
You mean the freeze map? That doesn't really change the picture for regular vacuums, it changes how bad anti-wraparound vacuums are. The impact of the table vacuum itself is most of the time not that bad these days (due to the visibility map), what's annoying is usually the corresponding index scans. They have to scan the whole index, which is quite expensive.
Curious about this:
> parallelism can speed up big data queries by as much as 32 times faster
Why would it be only 32 times faster? The sky's the limit if there aren't major bottlenecks on the way.
For a sequential full table scan I could process about 2000 MB/s of data (only 125 MB/s was read from each SSD); I was limited by CPU power.
Anyway, the same query took about 25 minutes on PostgreSQL 9.5 and is now down to 2 minutes and 30 seconds. For comparison, SQL Server 2012 spent 7 minutes on the same dataset on the same hardware.
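For reference, the 9.6 parallelism is mainly controlled per query by one setting; a sketch (the table name is a placeholder, and the planner still caps workers based on table size and `max_worker_processes`):

```sql
-- Allow up to 8 parallel workers per Gather node for this session.
SET max_parallel_workers_per_gather = 8;

-- The plan should now show a Gather node over a Parallel Seq Scan.
EXPLAIN
SELECT count(*) FROM big_table;
```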
I realize I'm asking a stranger on the internet to do something for free for me. If you don't have time or inclination to do this, no worries, but it seems like you've got a nice setup to be able to play with this. I'm sure I'm not the only one curious to see such a comparison.
(You can probably get more memory level parallelism with random access, but your overall bandwidth will likely be lower... fully exploiting memory bandwidth is complicated and difficult to do for real applications).
I'm running pg on ec2 with a hot standby slave. I need the postgis extension but am not doing anything particularly esoteric. Ideally I'd like to have the certainty of aws handling backups for me.
I was researching moving to RDS today and would love to hear thoughts on whether it's a good general solution or not. What happens about downtime during upgrades or swapping instance sizes?
This is one of my favorite features of RDS: You can set a maintenance window and have the option to not have changes take effect until that window. So if I want to upgrade Postgres or change the instance size, I set it up and the downtime happens when I'm fast asleep and nobody is using the site.
I also think (but not 100% sure) that if you have Multi-AZ enabled, changes are done by upgrading the slave, failing over, and then upgrading the ex-master, so downtime is limited to the failover period.
One major issue is that you're restricted in what you can do with it; not all options are available. You can only use extensions that they provide. (I'm a bit fuzzy on this, but) changing the disk size made the service unavailable for ~30 minutes (proportional to the new disk size). You couldn't configure replication; replication only happens to the backup node. You couldn't even set up replication across regions.
The replication is kind of a bummer, because if you ever want to move your data (perhaps to a VM or outside of AWS) you'd need an outage. Also, if I remember correctly, there was no way to do a major version upgrade in place.
There was also another incident (it was caused by a bug, so hopefully it's been fixed and won't happen to anyone else). We had a cluster set up with a backup. One day, out of nowhere, the service stopped working and was unavailable for 1.5 hours. That was a pretty big issue because we used it for monitoring (Zabbix), so any outage makes us blind to other problems. It turned out that due to the bug, their backup routine mistakenly started the backup on the master server (normally it's supposed to run on the slave).
Sounds good!
I think more like the former -- as I recall, the recent articles have mostly been about specific work going on for the 9.6 release, prereleases of 9.6, and now the actual release of 9.6.
Well, due to the delayed 9.5 release (January 7th), there have been two this year ;)
http://blog.2ndquadrant.com/bdr-is-coming-to-postgresql-9-6/
TL;DR: It is not in mainline, but it does not need a patch anymore. You need to bring your own conflict resolution logic.
Index-only scans for partial indexes