Show HN: Retake – Open-Source Hybrid Search for Postgres

(github.com)

88 points | philippemnoel | 2y ago | 23 comments

Hey HN! We're Phil and Ming, co-founders of Retake (https://github.com/getretake/retake). Retake is an open source tool that adds keyword and semantic (i.e., hybrid) search to databases. We’ve started by extending the capabilities of Postgres with an SDK for lightning-fast queries.

We built Retake to fix two issues: keeping vectors in sync with Postgres in real time is difficult, and most vector databases aren’t built for hybrid search.

A quick refresher: “keyword search” refers to a technique where results are scored based on the appearance of exact words or terms. “Semantic search” uses vector embeddings to understand the meaning behind those words. Hybrid search combines these two approaches to enhance the precision and relevance of results.
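As a rough illustration of "combining" the two approaches, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a keyword ranking with a semantic ranking. This is an assumption for illustration only: the post does not say how Retake fuses results, and the function name, the document ids, and the constant k=60 are all hypothetical choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by both searches rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id to keep output stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword search and a semantic search.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
semantic_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Documents that appear in both lists (doc_a and doc_c here) outrank documents that only one of the searches found.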

To implement semantic or hybrid search today, most organizations run batch jobs that update their search engine or vector database using ETL tools or custom data pipelines. We’ve seen from firsthand experience how time-consuming and costly this can be, as moving vectors often requires re-embedding the entire data source.

We’ve also seen how many vector databases lack crucial features of “traditional” search: keyword-based (BM25) search, faceting/aggregations, highlighting, efficient filtering, etc.
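For readers unfamiliar with BM25, the following is a minimal sketch of the scoring formula over a tiny in-memory corpus. It assumes whitespace tokenization, a non-empty corpus, and the commonly used defaults k1=1.2 and b=0.75; it is a toy illustration of the technique, not the OpenSearch implementation that Retake relies on.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "postgres hybrid search",
    "vector database for embeddings",
    "keyword search in postgres with postgres extensions",
]
scores = bm25_scores("postgres search", docs)
print(scores)
```

The second document shares no terms with the query, so it scores zero. That exact-term behavior is what pure vector search does not provide on its own.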

Here’s how Retake works: our core is built on top of OpenSearch, which acts as a search engine and vector database. We leverage logical-replication-based Change Data Capture (CDC) to stay in sync with Postgres, so documents and vectors are updated incrementally and in real time. Finally, Python and TypeScript SDKs make it easy to integrate Retake into your application. There’s no need to manage separate vector databases and search engines, upload and embed documents, or run expensive reindexing jobs. All you need to think about is writing search queries.
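To make the incremental-sync idea concrete, here is a minimal sketch of applying a stream of change events to a search index one row at a time. Everything in it is hypothetical: the ChangeEvent shape, the apply_change helper, and the dict standing in for the index are illustrative assumptions, not Retake's or OpenSearch's actual API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeEvent:
    """A hypothetical row-level change, as a CDC stream might emit it."""
    op: str                      # "insert", "update", or "delete"
    key: int                     # primary key of the changed row
    row: Optional[dict] = None   # new row contents (None for deletes)


def apply_change(index: dict, event: ChangeEvent) -> None:
    """Apply a single change to the index without touching other rows."""
    if event.op in ("insert", "update"):
        index[event.key] = event.row
    elif event.op == "delete":
        index.pop(event.key, None)
    else:
        raise ValueError(f"unknown operation: {event.op}")


# A plain dict stands in for the search index in this sketch.
index = {}
events = [
    ChangeEvent("insert", 1, {"title": "hello"}),
    ChangeEvent("insert", 2, {"title": "world"}),
    ChangeEvent("update", 1, {"title": "hello again"}),
    ChangeEvent("delete", 2),
]
for event in events:
    apply_change(index, event)

print(index)
```

Only the changed rows are written, which is why this style of sync avoids re-embedding or reindexing the whole data source.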

The easiest way to get started with Retake is by running our Docker Compose stack:

  git clone https://github.com/getretake/retake.git
  cd retake/docker && docker compose up

Retake is Apache licensed and our repo is here: https://github.com/getretake/retake. For next steps, see our quick start guide: https://docs.getretake.com/quickstart

We’d love your feedback on our solution to hybrid search. Our focus right now is on nailing the basics, but we’d also love to hear what you think we should focus on next.

23 comments

pk19238 2y ago

Does the sync handle deletes? That is, if we delete data from our Postgres database, will it be deleted from your database as well? I can see this integrating well with our pipeline, since we're currently syncing data from Postgres to our own vector database.

pnoel 2y ago

Yes it does :)

jph 2y ago

Clever idea, good work!

You asked for feedback: I see opportunities for you to nail the basics, by focusing on the value proposition so business-oriented people understand why/how to buy, and on the middle-tier architecture so technical people understand that you're akin to OpenSearch with Faiss & vectors that auto-update.

My understanding (and please clarify as you wish) of what I've read on your site is this: you're selling the hosted version for enterprises at a price to be discussed with your sales team, and the architecture is something like this...

  ┌──────────────┐    ┌───────────────┐    ┌──────────────┐
  │Search SDK    │    │Search Engine  │    │Data Source   │
  │• Typescript  │    │• OpenSearch   │    │• Postgres    │
  │• Python      │    │• Faiss, KNN   │    │• MySQL (?)   │
  │• Java (soon) │◀──▶│• Keyword, BM25│◀──▶│• Oracle (?)  │
  │• Go (soon)   │    │• Auto-update  │    │• Mongo (?)   │
  │• Etc.        │    │• Etc.         │    │• Etc.        │
  └──────────────┘    └───────────────┘    └──────────────┘

pnoel 2y ago

That's a nice diagram! Yeah, that's roughly it. We'll be adding support for more sources of truth in the future to expand coverage, like the ones you mention but also NoSQL stores like MongoDB.

isaacfung 2y ago

So are you guys using Faiss instead of the vector search of Postgres?

I think Vespa also supports hybrid search (it can also use late-interaction models like ColBERT). How does Retake compare to Vespa?

Will Retake support sparse vector models like SPLADE? (I heard they solve the vocabulary mismatch problem of keyword search.)

How do you guys implement filtering?

noodlesUK 2y ago

Unrelated to OP, but how did you create that diagram?

gregsadetsky 2y ago

  ╔═════════════════════════════════════════════════════╗
  ║                                                     ║
  ║ https://en.wikipedia.org/wiki/Box-drawing_character ║
  ║                                                     ║
  ╚═════════════════════════════════════════════════════╝

:-)

there was a related thread a few days ago: https://news.ycombinator.com/item?id=37040883

you could use https://asciiflow.com/ (web) or https://monodraw.helftone.com/ (mac) to make such diagrams and paste them here.

use the "code" formatting ("Text after a blank line that is indented by two or more spaces") -- see https://news.ycombinator.com/formatdoc

Palmik 2y ago

I would like something that can keep Postgres (or another source of truth) in sync with an existing search database (like Elastic, Meili, or Qdrant).

But the catch is that it's rare that there's 1:1 mapping between the source of truth and what is indexed. The simplest example would be: You have a document table, but you actually index document chunks.

Therefore I would like something that accepts a preprocessing function and keeps the search data in sync when the source changes. Ideally, it should not reinvent full-text / vector based search and plug in with existing solutions.
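A minimal sketch of what such a preprocessing hook could look like, assuming a one-to-many mapping from source rows to indexed entries. All names here (DerivedIndex, chunk_by_words, the "id" and "body" fields) are hypothetical, and a dict stands in for the external search database.

```python
from typing import Callable, Dict, List

# A preprocessing function turns one source row into zero or more entries.
Preprocess = Callable[[dict], List[dict]]


class DerivedIndex:
    """Keeps derived entries (e.g. chunks) in sync with their source rows."""

    def __init__(self, preprocess: Preprocess):
        self.preprocess = preprocess
        self.entries: Dict[str, dict] = {}         # entry id -> indexed entry
        self.by_source: Dict[int, List[str]] = {}  # source id -> entry ids

    def upsert(self, row: dict) -> None:
        # Drop whatever was derived from the old version of this row,
        # then index the entries derived from the new version.
        self.delete(row["id"])
        entry_ids = []
        for i, entry in enumerate(self.preprocess(row)):
            entry_id = f"{row['id']}:{i}"
            self.entries[entry_id] = entry
            entry_ids.append(entry_id)
        self.by_source[row["id"]] = entry_ids

    def delete(self, source_id: int) -> None:
        for entry_id in self.by_source.pop(source_id, []):
            del self.entries[entry_id]


def chunk_by_words(row: dict, size: int = 3) -> List[dict]:
    """Example preprocessing function: split a document body into chunks."""
    words = row["body"].split()
    return [
        {"text": " ".join(words[i:i + size])}
        for i in range(0, len(words), size)
    ]


idx = DerivedIndex(chunk_by_words)
idx.upsert({"id": 1, "body": "one two three four five"})
chunks_after_insert = len(idx.entries)

idx.upsert({"id": 1, "body": "one two"})   # source row changed
entries_after_update = dict(idx.entries)

idx.delete(1)                              # source row deleted
print(chunks_after_insert, entries_after_update, idx.entries)
```

Tracking which entries came from which source row is what lets an update or delete of one document clean up all of its stale chunks.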

mdaniel 2y ago

https://github.com/getretake/retake/pull/198 is a refreshing change given the recent rug pulls, so thank you for that

retakeming 2y ago

Thanks! We debated what the right decision was in the beginning but are glad to have settled on Apache.

seemaze 2y ago

From the product landing page: "By connecting to your sources of truth, Retake unlocks real-time keyword and semantic search over siloed data"

I misread 'siloed data' as 'soiled data' and was like, this product gets me!

pnoel 2y ago

Must be some dirty ETL pipelines :')

dev1l 2y ago

Awesome project! It would be interesting to see performance tests. I know that a scientific experiment is very difficult, so approximate numbers are enough. Anyway, thanks for your work :)

benjaminsanborn 2y ago

Thanks for sharing; this project looks very promising!

What precipitated your fork of pgsync and how do you foresee maintaining compatibility with that project?

retakeming 2y ago

Thanks, appreciate it!

We forked pgsync for the silly reason that they hadn't published to PyPI in months, and some of their dependencies were out of date. We haven't made any modifications to pgsync, so maintaining compatibility shouldn't be an issue, and we'll likely revert to the main library once their dependencies are brought up to date.

jaequery 2y ago

Can’t you just use OpenSearch? What is the point of going through Postgres when you already have OpenSearch?

spleen7777 2y ago

How does it handle JSONB fields? Do I need to define all keys in a JSONB field for them to be indexed?

retakeming 2y ago

You don't need to define all keys in a JSON object - by default, new keys will automatically be added to the index mapping when a JSON document containing that key is added to the index.

Details on how to query JSON objects can be found in our docs: https://docs.getretake.com/search/object
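As a toy illustration of that behavior, the sketch below grows a flat field mapping as documents with new keys arrive. It is an assumption-laden simplification: update_mapping and infer_type are hypothetical helpers, nested keys are flattened to dotted paths, and the type names are simplified stand-ins rather than the exact mapping OpenSearch would produce.

```python
def infer_type(value):
    """Guess a simplified field type from a JSON value."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    return "text"                 # simplification: everything else is text


def update_mapping(mapping, doc, prefix=""):
    """Add any keys in doc that the mapping has not seen yet."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested objects are flattened into dotted paths.
            update_mapping(mapping, value, prefix=path + ".")
        else:
            # Existing fields keep their type; only new keys are added.
            mapping.setdefault(path, infer_type(value))
    return mapping


mapping = {}
update_mapping(mapping, {"title": "hello", "meta": {"views": 3}})
update_mapping(mapping, {"meta": {"draft": True}})   # introduces a new key
print(mapping)
```

The second document introduces meta.draft without any schema change up front, which mirrors the "new keys are added automatically" behavior described in the reply.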

nravic 2y ago

how often are these batch jobs run? I'm curious to know what the absolute maximum sync frequency can be.

retakeming 2y ago

We don't run any batch jobs - Retake streams changes in real time via CDC (change data capture). The only batch job you would need to run is to populate an index when it's first created.

ccleve 2y ago

How does it differ from ZomboDB?

pnoel 2y ago

Good question -- the primary difference is the method of integration with Postgres. ZomboDB is a Postgres extension, which limits its compatibility with managed Postgres services like AWS RDS, while Retake is compatible with any service where you can enable logical replication.
