Indeed, looking at the benchmark source code (thanks for providing it!), it completely lacks an index for the native case, leading to the false claim that the native full-text search indexes Postgres provides (usually GIN indexes on tsvector columns) are slow.
https://github.com/paradedb/paradedb/blob/bb4f2890942b85be3e... – here the tsvector is being built. But this is not an index. You need CREATE INDEX ... USING gin(search_vector);
This mistake could have been avoided if the benchmarks included query plans collected with EXPLAIN (ANALYZE, BUFFERS). It would quickly become clear that in the "native" case we're dealing with a Seq Scan, not an Index Scan.
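To make the comparison fair, the native setup would need something along these lines (a sketch; the table and column names here are assumed, not taken from the benchmark):

```sql
-- Assumed schema: a table "items" with a precomputed tsvector column.
CREATE INDEX idx_items_search ON items USING gin (search_vector);

-- Verify the planner actually uses the index, not a sequential scan:
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM items
WHERE search_vector @@ to_tsquery('english', 'shoes & running');
-- The plan should show a Bitmap Index Scan on idx_items_search,
-- not a Seq Scan on items.
```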
GIN indexes are designed to be very fast for search – but they can suffer from slower UPDATEs in some cases.
Another point: fuzzy search also exists, via pg_trgm. Of course, dealing with these things requires understanding and tuning – and usually a "lego game" to be played. Building products out of the existing (or new) "bricks" totally makes sense to me.
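For example, a minimal pg_trgm setup for fuzzy matching might look like this (table and column names are illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram GIN index to accelerate similarity and LIKE/ILIKE searches.
CREATE INDEX idx_items_name_trgm ON items USING gin (name gin_trgm_ops);

-- Fuzzy match: rows whose name is similar to a (possibly misspelled) term.
SELECT name, similarity(name, 'postgers') AS score
FROM items
WHERE name % 'postgers'          -- % is the pg_trgm similarity operator
ORDER BY score DESC
LIMIT 10;
```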
https://www.crunchydata.com/blog/postgres-full-text-search-a...
How does pg_bm25 compare here in terms of index maintenance & performance?
> There are two kinds of indexes that can be used to speed up full text searches: GIN and GiST. Note that indexes are not mandatory for full text searching, but in cases where a column is searched on a regular basis, an index is usually desirable.
I was looking for a comparison against a GIN index specifically; without it, the pros/cons are unclear.
pg_bm25 is our first step in building an Elasticsearch alternative on Postgres. We built it as a result of working on hybrid search in Postgres and becoming frustrated with Postgres' sparse feature set when it comes to full text search.
To address a few of the discussion points, today pg_bm25 can be installed on self-hosted Postgres instances. Managed Postgres providers like RDS are pretty restrictive when it comes to the Postgres extension ecosystem, which is why we're currently working on a managed Postgres database called ParadeDB which comes with pg_bm25 preinstalled. It'll be available in private beta next week and there's a waitlist on our website (https://www.paradedb.com/).
I understand that it becomes very hard to monetize if you're not able to offer your own hosted service, and I don't have a solution for that, but not supporting RDS is going to really diminish the product for many people.
Of course if you are 100% attached to AWS RDS itself (rather than the convenience of AWS RDS, which is replicable by ParadeDB), then there's not much we can do here, as we also need to eat :')
Also, it would be possible to set up a logical PG replica.
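A minimal sketch of that approach, assuming a primary on RDS and a self-hosted replica where pg_bm25 can be installed (table, publication, and connection details here are illustrative):

```sql
-- On the RDS primary (requires rds.logical_replication = 1):
CREATE PUBLICATION search_pub FOR TABLE documents;

-- On the self-hosted replica (the same table schema must already exist):
CREATE SUBSCRIPTION search_sub
    CONNECTION 'host=primary.example.com dbname=app user=repl password=...'
    PUBLICATION search_pub;
-- pg_bm25 indexes can then be built on the replica's copy of the data.
```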
I don't really know much about Solr, but I just started using it while helping with a project for openlibrary.org. It seems pretty alright, but I'm still not totally sure I understand what makes it popular.
My understanding of the spirit of the license is that it should be fine as long as modifications are made available. Anyone know of any existing extensions in RDS that are AGPL?
If your product is Elasticsearch built into Postgres, as a repackaged and direct competitor to this search plug-in, that's where, in my understanding, it crosses the line.
Awesome to see so many high quality extensions come out of it.
This was the top reason that made us (Segmed.ai) give up on PostgreSQL FTS -- our folks require an exact count of matches for medical conditions present in 20M reports, and doing COUNT() in PostgreSQL was crazy, crazy slow. If your extension could do a simple len(invertedindex[word]), that would already be a great improvement.
ELK has it immediately, but at a cost of being one more thing to maintain, and the whole Logstash thing is clunky. I'd love to use FTS inside of PostgreSQL.
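For context, the pattern that tends to be slow is a plain match count: even with an index, Postgres has to visit the matching heap tuples rather than reading a count straight out of the inverted index (table and column names assumed):

```sql
-- Even with a GIN index on search_vector, counting all matches
-- requires rechecking candidate rows, which is slow on 20M+ rows.
SELECT count(*)
FROM reports
WHERE search_vector @@ to_tsquery('english', 'pneumothorax');
```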
It might be possible to do a separate function though, like:
SELECT pg_bm25_direct_count('term');
We released support for metrics aggregations a few days ago, including count: https://docs.paradedb.com/aggregations/metrics#count.
We haven't gotten around to benchmarking aggregations - that's the focus for next week and we'll publish them once they're done. I would suspect that it's a lot faster than Postgres aggregates since it leverages Tantivy Columnar.
When it comes to the business model: it seems an acqui-hire by Supabase/Neon/etc. would be the best bet. It ensures the team's focus is on the core product instead of the litany of things to figure out when creating a pg hosting service (payments, downtime, upgrades, customer support, ...) in this highly competitive and demanding market.
1) switching search engines is hard when you’ve built your information needs around one. I’ve led lots of search engine migrations and they’re not fun. I even gave a talk on the problems companies face when doing so. https://haystackconf.com/us2020/search-migration-circus/
2) lots of the new search startups don’t offer full feature coverage. So just because a company is the new hotness it doesn’t mean it can fill the need of someone entrenched in Solr/elastic
3) why risk going to a startup when they haven’t proven they’ll be around in 3 to 5 years?
4) incumbent search engines eventually catch up at the speed of the enterprise market. Why spend a year migrating when the engine you're using will implement the feature for you within that timeframe?
Building a classic text search engine is way harder than building a KNN engine, and bolting a KNN engine into a term search engine is easier than the other way around.
BTW, if you're one of the market leaders, you don't need to continuously innovate – just wait, let your competitors do the research, and implement only once the feature is mature.
Sorry, my question was about the quality of the results. Simply put: how do players with good semantic search fare against "legacy" players who had good text search?
Also, bm25 holds up well against vector search. A well-tuned model can outperform it, but many off-the-shelf models struggle to do that. Vector search is a useful tool, but so far it's not a one-size-fits-all solution that "just works" – it can work really well if you know what you're doing, with a lot of tuning. With something like Elasticsearch you can try both approaches.
Other people have brought up great points for why or why not to switch. Our vision for this is that ParadeDB is not merely "better" than Elastic, but rather different. Elastic will never be a PostgreSQL database, and we'll never be a NoSQL search engine. If you want one or the other, you'll pick either ParadeDB or Elastic.
You can compare Lucene to Tantivy, and Elasticsearch to pg_bm25 or ParadeDB.