I've been playing around with full text search in Postgres, and I took this sort of approach when starting out, but then realized I could just have the index be an expression.
So instead of (per the example)
CREATE INDEX tsv_idx ON documents USING gin(tsv);
doing something like CREATE INDEX tsv_idx ON documents USING gin(to_tsvector('english', text));
Is there any reason you wouldn't do this? For multi-language you'd have to detect the language of the text, but there's no reason you couldn't parameterize that too.

I have a table with 30 million documents using pgsql's full text index. Creating the index takes ages, and search performance is generally very poor. The difference between creating the index with the precomputed column versus creating the index with the expression in the index itself (which is how I originally did it) was substantial.
Of course you can always just index your preprocessing/aggregating function and call it every time you want to search, but depending on how expensive that is, it might be in your interest to do it upfront and make your searches a bit quicker.
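To make the precomputed-column approach concrete, here's a minimal sketch, assuming a hypothetical `documents` table with a `text` column. On PostgreSQL 12+ a stored generated column keeps the tsvector up to date automatically; on older versions you'd use a trigger (e.g. `tsvector_update_trigger`) instead:

```sql
-- Sketch: precompute the tsvector once per write (PostgreSQL 12+),
-- so index builds and searches don't re-run to_tsvector every time.
ALTER TABLE documents
  ADD COLUMN tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', text)) STORED;

CREATE INDEX tsv_idx ON documents USING gin (tsv);
```

Queries then filter on `tsv @@ plainto_tsquery(...)` directly, paying the preprocessing cost at write time rather than at search time.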
Using an index means the read queries become a little more complex (they have to exactly match the index expression).
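A sketch of what "exactly match the index expression" means, again assuming a hypothetical `documents(text)` table: the planner will only consider the expression index if the query repeats the indexed expression verbatim, including the configuration argument.

```sql
CREATE INDEX tsv_expr_idx ON documents USING gin (to_tsvector('english', text));

-- Matches the index expression, so the GIN index can be used:
SELECT id
FROM documents
WHERE to_tsvector('english', text) @@ plainto_tsquery('english', 'search terms');

-- A bare to_tsvector(text) (default config) would NOT use this index.
```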
The way you have written the query, it has to extract the title from all of the JSON documents before applying the LIMIT.
Perhaps do the work to get the identifiers, order by rank, and apply the limit... all in a subquery. So that only the outer query actually touches the JSON document and extracts the titles.
Try it and report back on the speed improvement.
SELECT id, meta->>'title' as title, meta FROM(
SELECT id, meta FROM (
SELECT id, meta, tsv
FROM data_rows, plainto_tsquery('search query') AS q
WHERE (tsv @@ q)
) AS t1 ORDER BY ts_rank_cd(t1.tsv, plainto_tsquery('search query')) DESC LIMIT 5
) AS t2;
Please let me know if this isn't what you meant. The above did improve the speed, but only a little. I think for our use case this approach isn't strictly necessary, but if we had a larger data set then it'd be worth implementing.

SELECT d.id
,d.meta->>'title' as title
,d.meta
FROM (
-- Only select the identifier column
SELECT id
FROM data_rows
,plainto_tsquery('search query') AS q
WHERE tsv @@ q
ORDER BY ts_rank_cd(tsv, q) DESC
LIMIT 5
) AS ids
-- Join back onto the data_rows, this fetches the documents
INNER JOIN data_rows d ON d.id = ids.id;
You'll want to go through the explain plan, but treat identifier retrieval as a step by itself. Things like ts_headline (not shown) or meta->>'title' (shown) should only be performed for the few rows you are actually retrieving, and you should make the data you pass around as small as possible at all points.

Note that you don't need to call plainto_tsquery twice; you can reuse the one you've already declared.
And I notice you are returning the JSON document whole, have you checked whether it's faster to extract the title here or later in whatever your programming language is?
The giveaway clue in your post btw is this: "We experimented with returning the document text with results and found that if we returned the full text of documents, it added about 350ms to the query time which seems to be network overhead rather than a slower query.".
As soon as you notice a linear slowdown when you're only returning 5 rows no matter what, then internally you must be doing something in the query that touches all rows.
PostgreSQL fulltext scales nicely; I use it on a platform driving over 300 web sites with tens of millions of comments. I originally used it thinking it would buy me time before I had to add a dedicated search engine, and it helps now as I store permissions knowledge in the database too... but the truth is, it just hasn't been an issue yet. I'm not needing to consider anything else as it's performing great. The slowest part of my query isn't fulltext search, it's my permissions stuff (because I've gone down the rabbit hole of users, groups, roles, etc., and this is calculated for all content), but still the 95th percentile for searches is typically under 10ms.
I've never tried with PostgreSql, but I noticed the same drop in performance in MySql at the 1-2 million documents count, although full text search queries in MySql are slower (from what I remember, I haven't tested in a couple of years).
The power of Solr and other tools out there is all the other stuff you can do, like synonyms processing, faceting, range queries, etc, etc... as well as give you more flexibility on what data you load and how you process it. But if you don't need it, there's no reason to use a different tool.
It's certainly more than just full text search. Check out this article [1] that talks about Postgres' support for stemming, ranking / boost, multiple languages, fuzzy search, and accent support.
[1] http://blog.lostpropertyhq.com/postgres-full-text-search-is-...
Still, I would describe most of those features as 'basic', as you really need some support for all that to use full-text search in real life and I wonder how easy it is to add/remove stopwords or other tweaks like that.
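For what it's worth, adding or removing stopwords is done by defining a custom text search configuration. A rough sketch, assuming a custom stopword file `english_custom.stop` placed in the server's `$SHAREDIR/tsearch_data/` directory (all names here are hypothetical):

```sql
-- Sketch: an English config with a custom stopword list.
CREATE TEXT SEARCH DICTIONARY english_custom (
  TEMPLATE = snowball,
  Language = english,
  StopWords = english_custom   -- reads tsearch_data/english_custom.stop
);

CREATE TEXT SEARCH CONFIGURATION english_custom (COPY = english);

ALTER TEXT SEARCH CONFIGURATION english_custom
  ALTER MAPPING FOR asciiword, word WITH english_custom;
```

You'd then pass `'english_custom'` instead of `'english'` to to_tsvector/to_tsquery. It works, but it does require filesystem access to the server, which is less convenient than a config flag.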
Anyway, thank you very much for the link, very interesting! I've been looking at PostGIS as an option for a project I'm working on and using PostgreSql as the main db is getting more and more points... Thanks!
Isn't this a concern as the main objective of search is to provide accurate results?
This should be a concern to the author, but there's no reason to think the search ranking is not working as documented.
[0] http://www.postgresql.org/docs/9.4/static/textsearch-control...
If it provides 99% of the results and misses some documents because of some weird bug or encoding issue, then it could very well be good enough for their purposes. Heck, even 90% could be good depending on what they do (e.g. serve articles in an online site).
For other uses, like police or medical records obviously they'd need 100% results.
Mine covers tags and other aspects of faceted search as well.
We were able to build a really simple search facility for our marketplace using Postgres in a very short period of time and for cheap. It was incredibly helpful for a small startup short on people and resources.
Solr/Lucene is the right answer for a full-blown solution, but it incurs the cost of having to maintain a second service besides your webapp, as well as learning and gaining expertise in a second system. All things that can and should be done assuming you see growth, but as a quick solution that will work for several months, you can knock out a reasonable search in a couple of weeks with tech your app is already using.
Also, lack of faceted search, phrase search, partial word search (although you could use pg_trgm)
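On the pg_trgm point: a minimal sketch of partial-word matching via trigram indexes, assuming a hypothetical `documents(title)` column.

```sql
-- Sketch: trigram index supporting ILIKE/substring search.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX title_trgm_idx ON documents USING gin (title gin_trgm_ops);

-- Unlike tsvector search, this can match inside words,
-- and the GIN trigram index accelerates the ILIKE scan:
SELECT id, title
FROM documents
WHERE title ILIKE '%artial wor%';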
Previous 2 related discussions: - https://news.ycombinator.com/item?id=8381748 - https://news.ycombinator.com/item?id=8714477
Googling the error led me to a mailing list where someone was "working on it" or "going to revert the problem soon".
Ended up using AWS CloudSearch which costs a minimum of $50/mo where I was already paying for Postgres.
Would be useful to know how many rows you had in the table (?)
I wonder if there are any specific limitations to search performance in PostgreSQL stemming from its underlying data model and structures, or whether it could reach similar performance and scale as, say, Solr (or Lucene, for that matter) without too much of an overhaul. Just thinking single node here.
While Postgres can do full text search and it may help you in your case, you can set up Elasticsearch for this sort of quantity of data in like an afternoon, if that.
Also, these 'advanced features' people are throwing around are really trivial; they're the sort of things that would crop up in lecture 2 of an information retrieval course. Good search is really reliant on proper weighting etc (not like 'we weight the title more highly than the body' but more 'foogle is a far more important word than bar, if we have both in a query, we care about foogle more'). This generally requires a tonne of experimentation; information retrieval is not easy and it's very subtle.
Postgres seems to try and make everyone design their own ranking function, which is a recipe for disaster. I haven't used Elasticsearch for a while, but I'd be surprised if they didn't have a bunch of relatively good presets.
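To be fair, Postgres does give you the basic "title beats body" weighting out of the box via setweight; it's the query-term-level tuning that's left to you. A sketch, assuming hypothetical `title`/`body` columns and a precomputed `tsv` column:

```sql
-- Sketch: label title tokens 'A' and body tokens 'B'; by default
-- ts_rank weights {D,C,B,A} as {0.1, 0.2, 0.4, 1.0}.
UPDATE documents
SET tsv = setweight(to_tsvector('english', coalesce(title, '')), 'A')
       || setweight(to_tsvector('english', coalesce(body,  '')), 'B');

SELECT id, ts_rank(tsv, q) AS rank
FROM documents, plainto_tsquery('english', 'search terms') AS q
WHERE tsv @@ q
ORDER BY rank DESC
LIMIT 5;
```

Per-term importance weighting (the "foogle vs bar" kind) is indeed on you, though.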
Postgres is good for loads of things, but Elasticsearch or Solr are state-of-the-art and Postgres isn't (so far as I know); if you aren't just hacking around then please consider the extra time to set up Elasticsearch and it'll make your life easier.
Seriously though, none of my servers have a JVM installed. Most of them have nginx and ruby on them. And all of them talk to a handful of postgres servers. If a client wants to add some kind of full-text search to an app, this is a great approach to doing it in a cost-effective way. I don't particularly want to have to add more infrastructure just to be able to search through data. That's an additional monthly cost, along with a new environment to monitor for critical security updates etc.
If someone needs high performance search, sure, I'd cost out elasticsearch or solr. If they just need something basic, postgres full-text looks like the perfect low-effort solution.