In PostgreSQL, powerful Full Text Search is available out of the box (opens in new tab)

(admcpr.com)

96 pointssequence73y ago19 comments

19 comments

PostgreSQL FTS is mostly great - I wrote a tutorial on using it to build faceted search with Django a few years ago: https://simonwillison.net/2017/Oct/5/django-postgresql-facet...

It does have one surprising limitation: it calculates relevance based on just the current row, rather than being able to take statistics across the whole corpus into account.

Most search engines use TF/IDF or BM25 for relevance calculations, which consider the relative common-ness of terms in comparison to the rest of the corpus. PostgreSQL FTS can't do that as far as I know.

SQLite's built-in FTS CAN do relevance calculations like this! Surprising to see a feature as significant as that show up in SQLite but not in PostgreSQL.

imaurer3y ago

I hope someone implements BM25 and combines it with Pgvector to bring hybrid search to Postgres. I feel like that is the jsonb of the next couple of years.

justinclift3y ago

> Most search engines use TF/IDF or BM25 for relevance calculations ...

Ahhh. That's what's SQLite's "bm25()" function is for.

Was white listing the SQLite FTS functions a few days ago for DBHub.io (eg so people can use them), but don't really understand (yet) how most of them are actually used. Some day I'll get around to learning about them properly. :)

sequence7OP3y ago

I often see complicated search solutions implemented when PostgreSQL is just sitting there with its incredible FTS powers being ignored. To help some of the teams I'm working with get an idea of the power available to them I wrote up this two part article with a github repo for spinning up a db to follow along.

pmarreck3y ago

I use it; it is indeed awesome, although the “powerful” version of the query DSL can be intimidating (I implemented some complex-ish regex transformations to utilize it).

It’s fantastic having 1 less dependency though!

One caveat- if you migrate any field in the indexed table, you will likely have to drop and recreate all your triggers and stored procs again

po3y ago

I read these articles and am always tempted, but I've found they often don't do well with Chinese, Japanese, Korean or other non-latin languages. I understand that isn't the focus for most people but it's very hard to find good data/information on how to best support these languages.

hnfong3y ago

I find it dishonest to call it "full text search" whereas it's actually just "English/Indo-European full text search" that uses language-specific features to achieve its goals.

Instead of pretending to have solved the string searching problem by using "language hacks", I'd really like see an open source database that provides easy to use interfaces to suffix trees instead.

The even more infuriating thing is that apparently some databases actually do have suffix tree implementations, but because of assumptions that the data is English/European, other languages work half-assedly on it.

Imagine i18n implications for projects that are based on them. And the users would have no clue how f*cked up things are.

nr2x3y ago

With tsvector you have to declare the language, and if you are ingesting a diverse range of web documents you just end up applying English as a guess.

bomewish3y ago

There's a C++ library for tokenising Chinese for sqlite FTS: https://github.com/wangfenjin/simple

koromak3y ago

I've had problems with performance over large data sets, maybe 1M records and a few < 300 char fields per record.

I've never implemented full text search using a dedicated database so maybe 1M is too much to ask for.

nr2x3y ago

Yup, at a certain point you hit a wall.

nonethewiser3y ago

Great article and underlying point. Reminds me of https://www.amazingcto.com/postgres-for-everything/ which also highlights FTS by linking to this article: https://supabase.com/blog/postgres-full-text-search-vs-the-r...

leonardo22043y ago

The only problem I face is partial querying. For instance, if the user queries rus (and it does not match any lexemme) nothing returns. Any workarounds?

simonw3y ago

I've had good results for partial match searching using PostgreSQL trigram indexes: https://about.gitlab.com/blog/2016/03/18/fast-search-using-p...

nbashaw3y ago

The example here only covers short text fields like titles — does anyone know if it performant for full body search of long documents?

sequence7OP3y ago

Yes, it works well but there is a limit of 16k words. See this reddit comment thread [0] for more details, but if your documents are anywhere near that big then I'd say Postgres is definitely not the right tool for you.

[0] https://www.reddit.com/r/programming/comments/12yhhcg/commen...

puika3y ago

Yes, and if it falls short you can also use the RUM extension if you don't mind the extra index size (can install via package manager or maybe you db provider like supabase has it available)

simonw3y ago

Yes, it works great for that.

nr2x3y ago

Depends on the length of the documents as well. I’ve used psql with smaller datasets and it’s fine, but it starts to have issues when you go bigger.

j / k navigate · click thread line to collapse

19 comments

simonw3y ago

PostgreSQL FTS is mostly great - I wrote a tutorial on using it to build faceted search with Django a few years ago: https://simonwillison.net/2017/Oct/5/django-postgresql-facet...

It does have one surprising limitation: it calculates relevance based on just the current row, rather than being able to take statistics across the whole corpus into account.

SQLite's built-in FTS CAN do relevance calculations like this! Surprising to see a feature as significant as that show up in SQLite but not in PostgreSQL.

imaurer3y ago

I hope someone implements BM25 and combines it with Pgvector to bring hybrid search to Postgres. I feel like that is the jsonb of the next couple of years.

justinclift3y ago

> Most search engines use TF/IDF or BM25 for relevance calculations ...

Ahhh. That's what's SQLite's "bm25()" function is for.

sequence7OP3y ago

pmarreck3y ago

I use it; it is indeed awesome, although the “powerful” version of the query DSL can be intimidating (I implemented some complex-ish regex transformations to utilize it).

It’s fantastic having 1 less dependency though!

One caveat- if you migrate any field in the indexed table, you will likely have to drop and recreate all your triggers and stored procs again

po3y ago

hnfong3y ago

I find it dishonest to call it "full text search" whereas it's actually just "English/Indo-European full text search" that uses language-specific features to achieve its goals.

Instead of pretending to have solved the string searching problem by using "language hacks", I'd really like see an open source database that provides easy to use interfaces to suffix trees instead.

Imagine i18n implications for projects that are based on them. And the users would have no clue how f*cked up things are.

nr2x3y ago

With tsvector you have to declare the language, and if you are ingesting a diverse range of web documents you just end up applying English as a guess.

bomewish3y ago

There's a C++ library for tokenising Chinese for sqlite FTS: https://github.com/wangfenjin/simple

koromak3y ago

I've had problems with performance over large data sets, maybe 1M records and a few < 300 char fields per record.

I've never implemented full text search using a dedicated database so maybe 1M is too much to ask for.

nr2x3y ago

Yup, at a certain point you hit a wall.

nonethewiser3y ago

leonardo22043y ago

The only problem I face is partial querying. For instance, if the user queries rus (and it does not match any lexemme) nothing returns. Any workarounds?

simonw3y ago

I've had good results for partial match searching using PostgreSQL trigram indexes: https://about.gitlab.com/blog/2016/03/18/fast-search-using-p...

nbashaw3y ago

The example here only covers short text fields like titles — does anyone know if it performant for full body search of long documents?

sequence7OP3y ago

[0] https://www.reddit.com/r/programming/comments/12yhhcg/commen...

puika3y ago

Yes, and if it falls short you can also use the RUM extension if you don't mind the extra index size (can install via package manager or maybe you db provider like supabase has it available)

simonw3y ago

Yes, it works great for that.

nr2x3y ago

Depends on the length of the documents as well. I’ve used psql with smaller datasets and it’s fine, but it starts to have issues when you go bigger.

j / k navigate · click thread line to collapse