We've extended Postgres w/ open source models from Hugging Face, as well as vector search and classical ML algorithms, so that everything can happen in the same process. It's significantly faster and cheaper, which leaves a large latency budget available for more complex models and algorithms. In addition, open source models have already surpassed OpenAI's text-embedding-ada-002 in quality, not just speed. [1]
Here is a series of posts explaining how to implement a typical ML-powered application as a single SQL query that runs in a single process, with memory shared between models and feature indexes, including learned embeddings and reranking models.
- Generating LLM embeddings with open source models in the database[2]
- Tuning vector recall [3]
- Personalize embedding results with application data [4]
This allows a single SQL query to accomplish what would normally be an entire application w/ several model services and databases.
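As a minimal sketch of the idea, here is what embedding generation plus vector search looks like inside one query. The table `documents`, its `embedding` column, the query text, and the model name are all illustrative assumptions, not taken from the posts:

```sql
-- Embed the user's question with an open source model running in-process,
-- then find the 5 nearest documents by cosine distance with pgvector.
-- No round trips to an embedding service or a separate vector database.
SELECT id, body
FROM documents
ORDER BY embedding <=> pgml.embed(
    'intfloat/e5-small',        -- hypothetical embedding model choice
    'How do I tune vector recall?'
)::vector
LIMIT 5;
```

The embedding never leaves the database process; the ANN index and the model share the same memory space.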
e.g. for a modern chatbot built across various services and databases:
-> application sends user input data to embedding service
<- embedding model generates a vector to send back to application
-> application sends vector to vector database
<- vector database returns associated metadata found via ANN
-> application sends metadata for reranking
<- reranking model prunes less helpful context
-> application sends finished prompt w/ context to generative model
<- model produces final output
-> application streams response to user
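The whole round-trip dance above can collapse into roughly one query. This is a hedged sketch, not code from the posts: the model names, table, prompt format, and the decision to skip the reranking step are my assumptions.

```sql
-- Retrieve context via in-database embedding + ANN search, then feed it
-- to an in-database generative model. Reranking could be added as another
-- stage between retrieval and generation.
WITH context AS (
    SELECT body
    FROM documents
    ORDER BY embedding <=> pgml.embed(
        'intfloat/e5-small',                 -- hypothetical embedding model
        'What is the return policy?'
    )::vector
    LIMIT 5
)
SELECT pgml.transform(
    task   => '{"task": "text-generation",
                "model": "tiiuae/falcon-7b-instruct"}'::jsonb,  -- hypothetical
    inputs => ARRAY[
        'Answer using this context:' || E'\n'
        || string_agg(body, E'\n')
        || E'\nQuestion: What is the return policy?'
    ]
) AS answer
FROM context;
```

Every hop in the arrow diagram above becomes a stage of one query plan, so there is no serialization or network latency between the embedding, retrieval, and generation steps.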
[1]: https://huggingface.co/spaces/mteb/leaderboard
[2]: https://postgresml.org/blog/generating-llm-embeddings-with-o...
[3]: https://postgresml.org/blog/tuning-vector-recall-while-gener...
[4]: https://postgresml.org/blog/personalize-embedding-vector-sea...