Some gotchas I experienced (but I might be using the wrong embedding/vector DB: spaCy/FAISS):
- Short user questions might result in a low-signal query vector, e.g. user: "Who is Keanu Reeves?" -> false positives on Wikipedia articles that only contain "Who is"
- Typos and formatting affect the vectorization; a small difference can lead to a miss, e.g. "Who is Keanu Reeves?" -> match, "Who is keanu Reeves?" -> no match, and no match with any other capitalization.
If there's only a single document, a simple keyword search might lead to better results.
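A minimal sketch of that keyword fallback, in pure Python (the stopword list and chunks are just illustrative). Note that lowercasing everything also sidesteps the capitalization misses mentioned above:

```python
import re
from collections import Counter

def keyword_score(query: str, chunk: str) -> float:
    """Fraction of non-stopword query terms that appear in the chunk."""
    stop = {"who", "is", "what", "the", "a", "an", "of", "in", "on"}
    q_terms = [t for t in re.findall(r"\w+", query.lower()) if t not in stop]
    if not q_terms:
        return 0.0
    chunk_terms = Counter(re.findall(r"\w+", chunk.lower()))
    return sum(1 for t in q_terms if chunk_terms[t]) / len(q_terms)

chunks = [
    "Who is the author of this page?",    # shares only stopwords with the query
    "Keanu Reeves is a Canadian actor.",  # shares the actual content words
]
scores = [keyword_score("Who is keanu Reeves?", c) for c in chunks]
```

Here the "Who is" chunk scores 0 and the content-bearing chunk scores 1, the opposite of the false positive the embedding produced.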
In my experience, false positives (retrieving an irrelevant text and generating a completely wrong answer) are a bigger problem than false negatives (not retrieving the text and possibly being unable to answer the question).
Does anybody have experience with Apache Lucene / Solr or Elasticsearch?
I've been working on a RAG system with Solr and quickly hit some of the issues you describe when dealing with real-world messy data and user input. E.g. using all-MiniLM-L6-v2 and cosine similarity, "Can you summarize Immanuel Kant's biography?" matched a chunk containing just the word "Biography" rather than one which started "Immanuel Kant, born in 1724...", and "How high is Ben Nevis?" matched a chunk of text about someone called Benjamin rather than a chunk about mountains containing the words "Ben Nevis" and its height[0]. Switching embedding model has helped, but I'm still not convinced that vector search alone is the silver bullet some claim it is. Still lots more to try though, e.g. hybrid search[1], query expansion[2], knowledge graphs etc.
[0] https://www.michael-lewis.com/posts/vector-search-and-retrie...
[1] https://sease.io/2023/12/hybrid-search-with-apache-solr.html
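One common way to combine the lexical and vector result lists from a hybrid setup like [1] is reciprocal rank fusion. A minimal sketch in pure Python (the doc IDs and rankings are made up to mirror the Kant example above; real systems would feed in actual Solr result lists):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Combine several ranked result lists (best first) into one fused ranking.

    Each doc gets 1/(k + rank) from every list it appears in; k=60 is the
    commonly used constant from the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists: vector search latched onto a bare "Biography"
# header, while keyword search found the chunk actually about Kant.
vector_hits  = ["biography_header", "kant_bio", "unrelated"]
keyword_hits = ["kant_bio", "unrelated", "biography_header"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

The fused ranking puts the chunk that both searches rank reasonably well ahead of the one only the vector search liked.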
It has the downside that an LLM (rather than just an embedding model) is used in the query path, but it has helped me multiple times in the past to strongly reduce RAG problems like the ones you outlined, where the search likes to latch onto individual words.
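A sketch of what that query-rewriting step can look like. The `llm` callable and the stubbed completion are stand-ins so the example runs; in practice you'd plug in a real chat-model call:

```python
def rewrite_query(user_query: str, llm=None) -> str:
    """Ask an LLM to restate the question as a keyword-rich, self-contained
    search query before it ever hits the retriever."""
    prompt = (
        "Rewrite the following question as a short, keyword-rich search "
        f"query, dropping filler words:\n{user_query}"
    )
    if llm is None:
        # Stand-in completion so this sketch is runnable without a model.
        llm = lambda p: "Immanuel Kant biography life summary"
    return llm(prompt)

rewritten = rewrite_query("Can you summarize Immanuel Kant's biography?")
```

The rewritten query keeps the content words ("Immanuel Kant", "biography") and drops the conversational framing ("Can you...") that embeddings tend to latch onto.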
I have no opinion on your products or your post, but some % of people steer away from companies for such things.
I'll support you, mnd999. I don't work for a graph DB company. We don't use graph DBs, but I'm considering it. Graph DBs are a legitimate source to feed data into your RAG system. Our RAG system currently uses hybrid search: lexical and semantic. We need to expand our sources, too. I would like to see us use LLMs to rephrase our content (we have a lot of code) and index on that. I think we should build a KG on content quality (we have millions of docs) and filter out the things no one likes.
I also think a KG on "learning journeys" would be valuable, but really difficult.
It's important we get through the trough of disillusionment quickly. There's a lot of market education needed to know when they're truly needed.
A month in I realize I'm trying to reinvent a search engine. Kinda wonder if I should have just used something like elasticsearch instead.
About the "bad news" section.
You can do that today by just asking the LLM, using the ReAct pattern. Give it the database schema and a few-shot prompt, and it will happily decide to build a query, read titles, run more queries if the titles aren't relevant enough, fetch the content of the relevant titles, and use those to form an opinion.
This may not seem fast, but there are 7B models that can do it today at 150+ tokens/second.
it's a ReAct loop with search and retrieve actions, where I'm simulating the tool by hand. in prod, you'd pick up the output of the Action, run the callback with the LLM input, get the result, and pass the result back as 'Observation:'. for the sake of this demo, I'm doing exactly that, but manually copy-pasting out of Wikipedia.
works more or less with any backend, and the LLM is smart enough to change direction if a search doesn't produce relevant results (and you can see it in the demo). here the loop is cut short because I was running it manually, but you can see the important bits.
just implement a retrieve and a search function for whatever data source you have, vector or full text, and a couple of regexes to extract actions and the final answer.
pro tip: use an expensive LLM to run the ReAct loop, and a cheaper LLM to summarize article content after retrieval, before putting it in as an observation. ideally you'd run something like "this is a document {document} on this topic: {last_thought}, extract the information relevant to the user question: {question}" through a cheap LLM, so you feed the least amount of tokens into the ReAct loop.
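A minimal sketch of the regex-driven side of such a loop: parse one LLM turn, dispatch to a tool, or stop on a final answer. The `Action: name[arg]` / `Final Answer:` formats and the stub search tool are illustrative assumptions, not a fixed ReAct spec:

```python
import re

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)

def react_step(llm_output: str, tools: dict):
    """Parse one LLM turn: run the requested tool, or return the final answer."""
    final = FINAL_RE.search(llm_output)
    if final:
        return ("answer", final.group(1).strip())
    action = ACTION_RE.search(llm_output)
    if action:
        name, arg = action.groups()
        return ("observation", tools[name](arg))
    return ("error", "no action or final answer found")

# Stub tool standing in for a real search backend.
tools = {"search": lambda q: f"Top titles for '{q}': ..."}

kind, out = react_step(
    "Thought: I should look this up.\nAction: search[Ben Nevis height]", tools
)
kind2, answer = react_step("Final Answer: Ben Nevis is 1,345 m high.", tools)
```

In production, `out` would go back into the prompt as the `Observation:` line and the loop would continue until the final-answer branch fires.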
I have so far mostly failed in trying to explain 1/ why search matters and 2/ that not all "search" functionality is equal, and that building good search is an art form.
Yeah, it takes an absurd amount of tuning to make search work well. Given how poorly the average search field works in almost anything, it's fair to say this crucial step isn't happening.
I suspect a lot of organizations just don't have workflows that would tolerate someone spending a month tweaking search algorithm parameters. It doesn't look enough like work.
You can use analogies like:
1. Imagine the world before Google. Web search was a pain. <<Search for your company>> would be similarly transformative.
2. Every company has an encyclopedia: the person who knows about past efforts and is consulted whenever people try something new. Search makes that role redundant and saves time.
3. Same with repetitive work, done because employees cannot find where the work was done previously.
search is a feature, and unless you address the central pain point that search solves (in terms of revenue), no one will go for it. When you do, you will end up solving the second problem: leaders never have the issue, but employees do.
https://aclanthology.org/2023.newsum-1.10/
Happy to see that David's excellent work is getting the love that it deserves!
ChatGPT certainly set the tone for the year. Though I will say you haven't heard the last of semantic graphs, semantic paths and some of that work that did happen in late 2022 right before ChatGPT. A bit of a detour? Yes. Perhaps the combination is something that will lead to features even more interesting - time will tell.
Yes, it did. Companies that offer competitive search or recommendation feeds were all using these text models in production.
Using the LLM to mutate the input so it can be used better for search is a path that works very well (ignoring added latency and cost).
I think you can do the same with the data you store… summarize it to a fixed number of tokens, then get an embedding for that to save with the original text.
Test! Different combinations of summarizing LLM and embedding-generation LLM can give different results. But once you decide, you are locked into the summarizer as much as the embedding generator.
Not sure if this is what the parent meant though.
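A minimal sketch of that summarize-then-embed indexing step. The first-sentence "summarizer" and the keyword-count "embedder" are trivial stand-ins so it runs; a real pipeline would call an LLM and an embedding model:

```python
def index_chunk(text, summarize, embed):
    """Store the original text, but search on the embedding of its summary."""
    summary = summarize(text)
    return {"text": text, "summary": summary, "vector": embed(summary)}

# Stand-ins for the sketch; swap in real summarizer and embedding models.
summarize = lambda t: t.split(".")[0] + "."  # crude first-sentence "summary"
embed = lambda t: [t.lower().count(w) for w in ("kant", "mountain", "actor")]

entry = index_chunk(
    "Immanuel Kant, born in 1724, was a philosopher. He wrote three Critiques.",
    summarize,
    embed,
)
```

The point of the pattern: the stored `vector` reflects the summary's focus, while `text` keeps the full original for the generation step.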
But be careful: the output is not guaranteed, which means you have to take care to provide the schema and what you're trying to do within the context window, and then validate the output. There is non-trivial overhead to this.
I have a company finding buyers for commercial real estate. One of the search features is the locations of the buyers (usually family offices etc.; they are always companies, so they have headquarters, preferences on where to buy, etc.). You can then, for example, calculate the distance to those locations.
LLMs are extremely useful in creating these features from unstructured info on the companies. But just throwing an embedding on this and hoping it works doesn't.
However, embeddings work super well in other parts of the search.
Where vector search excels is that it can encode a complex question as a vector and do a good job bringing back the top n results. It's not impossible to do some of this with keyword search (term expansion, stopwords and so forth); vector search just makes it easy.
In the end, yes, this is a better search system, and thinking about this step is a good point. I would go a step further and say it's also worth thinking about the RAG framework. Lots of examples use an OpenAI/LangChain/Chroma stack, but it's also worth evaluating other RAG framework options. There might be frameworks that are easier to integrate and perform better for your use case.
Disclaimer: I am the author of txtai: https://github.com/neuml/txtai
One way of doing it is to embed each message with the added context of the previous messages, up until the topic changes; otherwise, a simple similarity search on the user-prompt embedding would return messages from irrelevant topics, since the context was included from the start.
Then embed the user prompt and perform a similarity search using either the user's query alone, or a hypothetical statement created from the prompt, the so-called HyDE approach: you ask an LLM to generate a hypothetical response given the query, and then use its vector along with the query vector to enhance search quality.
For example, if the user query is "find me who is interested in playing Minecraft on Tuesday", the LLM will generate a response like "I play Minecraft on Tuesdays", and we can search for the vector of the LLM output in the vector DB, which holds all the messages along with their context.
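A minimal sketch of that HyDE flow, averaging the query vector and the hypothetical-answer vector into one search vector. The `llm` stub and the keyword-count `embed` are stand-ins so it runs; a real setup would use a chat model and an embedding model:

```python
def hyde_vector(query, llm, embed):
    """HyDE: embed both the query and an LLM-generated hypothetical answer,
    then average them into a single search vector."""
    hypothetical = llm(f"Write a short chat message that would answer: {query}")
    qv, hv = embed(query), embed(hypothetical)
    return [(a + b) / 2 for a, b in zip(qv, hv)]

# Stand-ins for the sketch.
llm = lambda prompt: "I play Minecraft on Tuesdays"
embed = lambda t: [t.lower().count(w) for w in ("minecraft", "tuesday", "yes")]

vec = hyde_vector(
    "find me who is interested in playing Minecraft on Tuesday", llm, embed
)
```

Because the hypothetical answer is phrased like the stored chat messages, its vector sits closer to them than the question's vector does, which is the whole point of HyDE.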
However, I am not sure how this will work in scenarios where the user has sent a message asking "Will you play Minecraft on Tuesday" and person A has responded with "Yes". How can we have the model find person A? Shall we make a summary of each person based on their conversations with the user?
Also, the whole process might be computationally slow. How do we enhance the speed and performance?
(a noob here who wanted to build a similar solution)
RAG is often helpful and easy to add, but it's fundamentally search - not magic.
I find it helpful to look at the search results before feeding them into the model. Just like the "I'm Feeling Lucky" button on Google doesn't always give the perfect answer, you may have to tweak your search query to improve the result.
I wish I had time to mess with it more. Job and life have taken over. My first goal with AI would be to use it for keyword and phrase extraction, and also for analyzing all the links I pull in hourly to see if there is a larger story I could make visible.
Vector DBs are critical components in retrieval systems. What most applications need are retrieval systems, rather than building blocks of retrieval systems. That doesn't mean the building blocks are not important.
As someone working on a vector DB, I find many users struggling to build their own retrieval systems from building blocks such as an embedding service (OpenAI, Cohere), a logic-orchestration framework (LangChain/LlamaIndex) and vector databases, some even with reranker models. Putting them together is not as easy as it looks; it's fairly challenging systems work, let alone the quality tuning and devops.
The struggle is no surprise to me, as the tech companies who are experts in this (Google, Meta) all have dedicated teams working on retrieval systems alone, making tons of optimizations and developing a whole feedback loop for evaluating and improving quality. Most developers don't get access to such resources.
No one size fits all. I think there should be a service that democratizes AI-powered retrieval: in simple words, the know-how of using embeddings + a vector DB and a bunch of tricks to achieve SOTA retrieval quality.
With this idea I built a Retrieval-as-a-service solution, and here is its demo:
https://github.com/milvus-io/bootcamp/blob/master/bootcamp/R...
Or using it in LlamaIndex:
https://github.com/run-llama/llama_index/blob/main/docs/exam...
Curious to learn your thoughts.
https://thenewstack.io/the-transformative-fusion-of-probabil...
Reranking also provides a significant improvement to response quality.
Another way to improve results for domain specific RAG systems is to use some heuristics to boost results. E.g., penalize results that contain certain negative keywords or boost results with certain patterns.
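A minimal sketch of that kind of heuristic boosting, applied on top of whatever base relevance score the retriever returns. The keyword lists and weights are illustrative, not a recommendation:

```python
def adjust_score(base_score, text, boosts=None, penalties=None):
    """Nudge a retrieval score up or down based on domain keyword heuristics."""
    boosts = boosts or {}
    penalties = penalties or {}
    lowered = text.lower()
    score = base_score
    for word, bonus in boosts.items():
        if word in lowered:
            score += bonus
    for word, malus in penalties.items():
        if word in lowered:
            score -= malus
    return score

# Hypothetical domain rule: bury deprecated docs even if they match well.
s = adjust_score(0.70, "DEPRECATED: old API docs", penalties={"deprecated": 0.3})
```

You'd apply this after retrieval but before picking the top-k chunks to put in the prompt, so a well-matching but known-bad chunk gets demoted.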
For RAG, given the limited context size and potential hallucinations, the best prompt plus the best data will give you the best response.
Prompts can be improved greatly to get the LLM to produce a good response with reduced hallucinations. A lot of techniques can be seen on Twitter and explored to find a good fit.
I improve my prompts using a GPT assistant that significantly improves response quality. https://chat.openai.com/g/g-haH111AXX-prompt-optimizer
I feel that a big part of the solution will simply be in the form of increased speeds. If you can ask the model for a strategy and then let it search/process a few times in a loop, responses will improve vastly.
My current solution is to have an NLP pipeline that does so as tokens are returned. Not quite as precise yet, but it shows promise.
Should be open source sooner rather than later.
https://github.com/langroid/langroid/blob/main/langroid/agen...
Since it usually deals with PDFs and other docs that can be quite big, do they take only the first N tokens? Are abstractive summarisation techniques used?
https://python.langchain.com/docs/modules/data_connection/do...