Indeed, looking at the benchmark source code (thanks for providing it!), it completely lacks an index for the native case, leading to the false claim that the native full-text search indexes Postgres provides (usually GIN indexes on tsvector columns) are slow.
https://github.com/paradedb/paradedb/blob/bb4f2890942b85be3e... – here the tsvector is being built. But this is not an index. You need CREATE INDEX ... USING gin(search_vector);
This mistake could have been avoided if the benchmarks included query plans collected with EXPLAIN (ANALYZE, BUFFERS). It would quickly become clear that in the "native" case we're dealing with a Seq Scan, not an Index Scan.
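To make the comparison fair, the native setup would need something along these lines (a sketch; the table and column names here are assumed, not taken from the benchmark):

```sql
-- Assumed schema: a table "items" with a precomputed tsvector column.
CREATE INDEX idx_items_search ON items USING gin (search_vector);

-- Verify the planner actually uses the index, not a sequential scan:
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM items
WHERE search_vector @@ to_tsquery('english', 'shoes & running');
-- The plan should show a Bitmap Index Scan on idx_items_search,
-- not a Seq Scan on items.
```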
GIN indexes are designed to be very fast for search – but they can suffer from slower UPDATEs in some cases.
Another point: fuzzy search also exists, via pg_trgm. Of course, dealing with these things requires understanding and tuning – and usually a "lego game" to be played. Building products out of the existing (or new) "bricks" totally makes sense to me.
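For example, a minimal pg_trgm setup for fuzzy matching might look like this (table and column names are illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram GIN index to accelerate similarity and LIKE/ILIKE searches.
CREATE INDEX idx_items_name_trgm ON items USING gin (name gin_trgm_ops);

-- Fuzzy match: rows whose name is similar to a (possibly misspelled) term.
SELECT name, similarity(name, 'postgers') AS score
FROM items
WHERE name % 'postgers'          -- % is the pg_trgm similarity operator
ORDER BY score DESC
LIMIT 10;
```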
https://www.crunchydata.com/blog/postgres-full-text-search-a...
How does pg_bm25 compare here in terms of index maintenance & performance?
> There are two kinds of indexes that can be used to speed up full text searches: GIN and GiST. Note that indexes are not mandatory for full text searching, but in cases where a column is searched on a regular basis, an index is usually desirable.
I was looking for a comparison against a GIN index specifically; without it, the pros/cons are unclear.
pg_bm25 is our first step in building an Elasticsearch alternative on Postgres. We built it as a result of working on hybrid search in Postgres and becoming frustrated with Postgres' sparse feature set when it comes to full text search.
To address a few of the discussion points, today pg_bm25 can be installed on self-hosted Postgres instances. Managed Postgres providers like RDS are pretty restrictive when it comes to the Postgres extension ecosystem, which is why we're currently working on a managed Postgres database called ParadeDB which comes with pg_bm25 preinstalled. It'll be available in private beta next week and there's a waitlist on our website (https://www.paradedb.com/).
I understand that it becomes very hard to monetize if you're not able to offer your own hosted service, and I don't have a solution for that, but not supporting RDS is going to really diminish the product for many people.
Of course if you are 100% attached to AWS RDS itself (rather than the convenience of AWS RDS, which is replicable by ParadeDB), then there's not much we can do here, as we also need to eat :')
Also, it would be possible to set up a logical PG replica.
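A minimal sketch of that approach, assuming a primary on RDS and a self-hosted replica where pg_bm25 can be installed (table, publication, and connection details here are illustrative):

```sql
-- On the RDS primary (requires rds.logical_replication = 1):
CREATE PUBLICATION search_pub FOR TABLE documents;

-- On the self-hosted replica (the same table schema must already exist):
CREATE SUBSCRIPTION search_sub
    CONNECTION 'host=primary.example.com dbname=app user=repl password=...'
    PUBLICATION search_pub;
-- pg_bm25 indexes can then be built on the replica's copy of the data.
```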
I don't really know much about Solr, but I just started using it while helping with a project for openlibrary.org. It seems pretty alright, but I'm still not totally sure I understand what makes it popular.
My understanding of the spirit of the license is that it should be fine as long as modifications are made available. Anyone know of any existing extensions in RDS that are AGPL?
If your product is Elasticsearch built into Postgres, as a repackaged and direct competitor to this search plug-in, that's where, in my understanding, it crosses the line.
Awesome to see so many high quality extensions come out of it.
This was the top reason that made us (Segmed.ai) give up on PostgreSQL FTS -- our folks require an exact count of matches for medical conditions present in 20M reports, and doing COUNT() in PostgreSQL was crazy, crazy slow. If your extension could do a simple len(invertedindex[word]), that would already be a great improvement.
ELK has it immediately, but at a cost of being one more thing to maintain, and the whole Logstash thing is clunky. I'd love to use FTS inside of PostgreSQL.
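For context, the pattern that tends to be slow is a plain match count: even with an index, Postgres has to visit the matching heap tuples rather than reading a count straight out of the inverted index (table and column names assumed):

```sql
-- Even with a GIN index on search_vector, counting all matches
-- requires rechecking candidate rows, which is slow on 20M+ rows.
SELECT count(*)
FROM reports
WHERE search_vector @@ to_tsquery('english', 'pneumothorax');
```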
It might be possible to do a separate function though, like:
SELECT pg_bm25_direct_count('term');
We released support for metrics aggregations a few days ago, including count: https://docs.paradedb.com/aggregations/metrics#count.
We haven't gotten around to benchmarking aggregations - that's the focus for next week and we'll publish them once they're done. I would suspect that it's a lot faster than Postgres aggregates since it leverages Tantivy Columnar.
When it comes to the business model: it seems an acqui-hire by Supabase/Neon/etc. would be the best bet. It ensures the team's focus is on the core product instead of the litany of things to figure out when creating a pg hosting service (payments, downtime, upgrades, customer support, ...) in this highly competitive and demanding market.
1) switching search engines is hard when you’ve built your information needs around one. I’ve led lots of search engine migrations and they’re not fun. I even gave a talk on the problems companies face when doing so. https://haystackconf.com/us2020/search-migration-circus/
2) lots of the new search startups don’t offer full feature coverage. So just because a company is the new hotness it doesn’t mean it can fill the need of someone entrenched in Solr/elastic
3) why risk going to a startup when they haven’t proven they’ll be around in 3 to 5 years?
4) incumbent search engines eventually catch up at the speed of the enterprise market. Why spend a year migrating when the engine you're using will implement the feature for you within that timeframe?
Building a classic text search engine is way harder than building a KNN engine, and bolting a KNN engine into a term search engine is easier than the other way around.
BTW, if you're one of the market leaders, you don't need to continuously innovate – just wait, let your competitors do the research, and implement only once the feature is mature.
Sorry, my question was about the quality of the results. Simply put: how do players with good semantic search fare against "legacy" players who had good text search?
Also, bm25 holds up well against vector search. A well-tuned model can outperform it, but many off-the-shelf models struggle to do that. Vector search is a useful tool, but so far it's not a one-size-fits-all solution that "just works" – it can work really well if you know what you're doing, with a lot of tuning. With something like Elasticsearch you can try both approaches.
Other people have brought up great points for why or why not to switch. Our vision for this is that ParadeDB is not merely "better" than Elastic, but rather different. Elastic will never be a PostgreSQL database, and we'll never be a NoSQL search engine. If you want one or the other, you'll pick either ParadeDB or Elastic.
You can compare Lucene to Tantivy, and Elasticsearch to pg_bm25 or ParadeDB.