Isn’t it basically a traditional search (either keyword based, vector based (embeddings have been around for years), or a combination of both) where you take the top N results (usually not even full docs, but chunks, due to context-length limitations) and pass them to an LLM to regurgitate a response (hopefully without hallucinations), instead of simply listing the results right away? I think some implementations also ask the LLM to rewrite the user query to “capture the user intent”.
What am I missing here? What makes it so useful?
One example is in finance: you have a lot of 45-page PDFs lying around and you're pretty sure one of them has the reg, or the info you need. You aren't sure which, so you open them one by one and search for a word, then jump through a bunch of those results and decide it's not this PDF. You do that till you find the "one". A non-trivial number of executive-level jobs consist of pretty much this for half the work week.
RAG purports to let you search one time.
When most people mention RAG, they’re using a vector store to surface results that are semantically similar to the user’s query (the retrieval part). They then pass these results to an LLM for summary (the generation part).
In practice, the problems with RAG are similar to the traditional problems of search: indices, latency, and correctness.
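To make the retrieve-then-generate loop above concrete, here is a minimal sketch. It uses a toy bag-of-words similarity as a stand-in for a real embedding model; the chunk texts and function names are illustrative, not from any particular system.

```python
import math
import re
from collections import Counter

# Toy bag-of-words "embedding"; a real system would call an embedding model.
def embed(text):
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # The "R" in RAG: surface the top-k chunks most similar to the query.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "The refund policy allows returns within 30 days.",
    "Shipping takes 5 to 7 business days.",
    "Our headquarters are in Berlin.",
]
top = retrieve("how many days for a refund", chunks)
# The "G" step: hand the retrieved chunks to an LLM as grounding context.
prompt = "Answer using only this context:\n" + "\n".join(top)
```

The difference from plain search is only that last line: instead of showing `top` to the user, you pass it to the model for synthesis.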
The most useful and verifiable RAG setup I've seen is hooking up an RDBMS to an LLM and asking questions about the table data in plain English. You can do it in several steps.
1. Extract the metadata of the tables, e.g. table names, columns of each table, related columns of the tables, indexed columns, etc. This is your RAG data.
2. Build the RAG context with the metadata, i.e. listing each table, its columns, relationships, etc.
3. Feed the RAG context and the user's question to the LLM. Tell the LLM to generate SQL for the question given the RAG context.
4. Run the SQL query on the database.
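Steps 1 and 2 can be sketched against SQLite's introspection pragmas; the schema here is a made-up example, and a real setup would point at your actual database.

```python
import sqlite3

# In-memory demo schema standing in for a real RDBMS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY,
                     customer_id INTEGER REFERENCES customers(id),
                     total REAL);
""")

# Step 1: extract table metadata; step 2: render it as the RAG context.
def schema_context(conn):
    lines = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for t in tables:
        cols = [f"{c[1]} {c[2]}" for c in conn.execute(f"PRAGMA table_info({t})")]
        lines.append(f"table {t}({', '.join(cols)})")
        # Record declared relationships between tables.
        for fk in conn.execute(f"PRAGMA foreign_key_list({t})"):
            lines.append(f"  {t}.{fk[3]} -> {fk[2]}.{fk[4]}")
    return "\n".join(lines)

context = schema_context(conn)
# Step 3 would prepend this context to the user's question and ask the
# LLM for SQL; step 4 runs the generated SQL with conn.execute(...).
```

Because the model only ever sees the schema, not the rows, this also keeps the data itself out of the prompt.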
It's uncannily good. And it can be easily verified given the SQL.
Everyone talks about "reducing hallucinations", but from a system perspective, everything an LLM emits is equally hallucinated.
Putting the relevant data in context gets around this and provides actual provenance of information, something that is absolutely required for real "knowledge" and which we often take for granted in practice.
Of course, the ability to do so is entirely reliant on the retrieval's search quality. Tradeoffs abound. But with enough clever tricks it does seem possible to take advantage of both the LLM's broad but unsubstantiated content and specific fact claims.
It's abstractive (new) versus extractive (old) summarization.
What makes it useful is that it does the work of synthesizing the information. Imagine you ask a question that involves bits and pieces of numerous articles. In the past you had to read them all and mentally synthesize them.
The overall system suggests degrees of freedom in search that might not have been available. This is by having a knowledge store in a format (vectors) primed for search, then having it be accessible in full or in partitions, by agents, working on one or more concurrent flows around a query.
I also see value in having a full circuit of native-format components that can be pieced together to make higher order constructs. Agents is just the most recent one to emerge and i can easily see a mixture of fine tuned experts alongside stores of relevant material.
/2c
1) are there filters we need to build? 2) do we have inventory?
We run as many methods as practical in parallel (sql, vector, full text, other methods, etc.) and return the first one that meets our threshold. Vector search is almost never the winner relative to full text.
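A sketch of that race-the-backends pattern, assuming each backend returns a confidence score along with its hits; the backend functions and threshold here are hypothetical stand-ins, not the commenter's actual code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-ins for real retrieval backends; each returns
# (score, hits). Names and scores are illustrative only.
def full_text_search(q):
    return 0.92, [f"fts hit for {q}"]

def vector_search(q):
    return 0.55, [f"vector hit for {q}"]

def sql_search(q):
    return 0.10, []

THRESHOLD = 0.8

def retrieve(query, methods=(full_text_search, vector_search, sql_search)):
    # Launch every method in parallel; return the first result that
    # clears the confidence threshold, else the best we saw.
    best = (0.0, [])
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(m, query) for m in methods]
        for fut in as_completed(futures):
            score, hits = fut.result()
            if score >= THRESHOLD:
                return hits
            if score > best[0]:
                best = (score, hits)
    return best[1]

hits = retrieve("max ripple voltage for part 4711")
```

The fallback to the best sub-threshold result matters in practice: you want an answer even when no method is confident.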
Instead, I see a lot of people in sister companies using the most robust models they can find and having agents to do chain of thought, while their users are wondering when, if ever, they’ll get a response back.
Full text search is certainly the winner in the time dimension, but can it compete in quality? Presumably which method is likely to provide relevant results depends greatly on the query. Invoking LLMs to pre-process the query and select a retrieval method is going to be quite expensive compared to each of the search methods.
We also have a lot of numbers in our customer requests, which do not typically play to the strengths of the vector searches.
COGS is not a large concern, as our audience is internal-facing along with a few of our partners, so inference and infrastructure costs are nothing compared to engineering time, and we don’t have a way to amortize our costs across a bunch of customers.
It is also a very high value use case for us.
The other factor is that we’re using fast and cheap models like haiku and mixtral to do the pre processing before we hand things to the retrieval steps, so it’s not much of a cost driver.
We have just found that vector search does not play well with numbers and does not provide consistent results, so we end up needing more chunks, which compounds token usage, slows responses, and raises the chance of incorrect responses due to the customer-facing model getting confused by similar results. I’m sure we could optimize our approach, but full text has worked far more reliably than expected, so we have invested more resources into how we handle documents, latency reduction, and pulling in structured data.
For reference our subject matter is engineering specs for high precision electronics manufacturing. We have ~100k products and a lot of them have identical documentation except for a few figures (which make all the difference in the world), so it’s a challenging use case that is very unforgiving. Totally doable though and the basis for a lot of capabilities we’ll be investing in moving forward.
Happy to share as I think we’re ahead in a few areas but believe others will catch up and we’ve learned so much from others willing to share info, so we always try to pay forward.
The options, as far as I can tell, are:
- Re-embed lazily as needed at prompt-time. This should be the cheapest as it minimizes the number of embedding calls, but it's the most expensive in terms of latency.
- Re-embed eagerly after updates (perhaps with some delay and throttling to avoid rapid-fire embed calls). Great for latency, but can get very expensive.
- Some combination of the above two options. This seems to be what many IDE-based AI tools like GH Copilot are doing. An issue with this approach is that it's hard to ever know for sure what's updated in the RAG index and what's stale, and what exactly is getting added to context at any given time.
I'm leaning toward the first option (lazy on-demand embedding) and letting the user decide whether the latency cost is worth it for their task vs. just manually selecting the exact context they want to load.
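That first option can be sketched as a content-hash cache: at prompt time, re-embed only the documents whose content has changed. The `embed()` function below is a hypothetical stand-in (a real one would call an embedding model, which is exactly the latency cost being weighed).

```python
import hashlib

calls = {"embed": 0}

# Hypothetical embed(); a real system would call an embedding model here.
def embed(text):
    calls["embed"] += 1
    return [float(b) for b in hashlib.sha256(text.encode()).digest()[:4]]

class LazyIndex:
    """Sketch of lazy on-demand embedding: re-embed a document only at
    prompt time, and only when its content hash has changed."""
    def __init__(self):
        self._cache = {}  # doc_id -> (content_hash, vector)

    def vector_for(self, doc_id, content):
        h = hashlib.sha256(content.encode()).hexdigest()
        cached = self._cache.get(doc_id)
        if cached and cached[0] == h:
            return cached[1]      # still fresh: no embedding call
        vec = embed(content)      # stale or new: pay the latency now
        self._cache[doc_id] = (h, vec)
        return vec

idx = LazyIndex()
v1 = idx.vector_for("main.py", "print('hi')")
v2 = idx.vector_for("main.py", "print('hi')")   # cache hit, no re-embed
```

One nice property for the "what's stale?" worry: a vector returned by `vector_for` is, by construction, never stale, because freshness is checked against the current content on every call.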
I've been using this as a starter: https://developers.cloudflare.com/workers-ai/tutorials/build... I put in text, but I feel like my conception of what should get high relevancy scores doesn't match the percentages that come out.
The article talks about full text search and metadata, so maybe that's the path I should be taking instead of vector search? Where would I store the metadata in this case? A regular db?
I wish articles like this would go into more details about the nitty gritty. But I appreciate high level overview in the article as well.
A good overview is chapter 6 of the Stanford NLP group's IR book [0].
Engineering LLMs still requires a good foundation in the basics of ML/NLP so it's worth the time to catch up a bit.
0. https://web.archive.org/web/20231207074155/https://nlp.stanf...
I'd recommend taking a look at lancedb as they support text, vectors, and sql.
High relevancy scores are not percentages; they only make sense for ordering. A 0.7 does not mean "relevant", but 0.9 vs 0.7 means "maybe more relevant".
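A tiny worked example of why the absolute value is meaningless: cosine similarity over toy 3-dimensional vectors (real embeddings have hundreds of dimensions, but the point is the same).

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy vectors standing in for real embeddings.
query = [1.0, 2.0, 0.5]
doc_a = [1.1, 1.9, 0.4]   # points roughly the same direction as the query
doc_b = [0.2, 0.1, 3.0]   # points a very different direction

s_a = cosine(query, doc_a)
s_b = cosine(query, doc_b)
# s_a > s_b tells you doc_a is the better match; the absolute values
# are not probabilities or percentages of relevance.
```

Note that a score like 0.35 could be the best match in one corpus and a terrible one in another; only the ranking within one query's results carries meaning.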
It generates synthetic questions, tests different embedding models, chunking strategies, etc. You end up with clear data that shows you what will give you the optimal results for your RAG app: https://platform.vectorize.io/public/experiments/ca60ce85-26...
An implementation: github.com/infiniflow/ragflow
Maybe not what’s happening in this case, but it’s what springs to mind.
But yes, this isn't a good HN submission without detail.
With larger-scale real-world enterprise RAG-based applications, you soon realize the enormous time and effort required to experiment with all these levers to optimize the RAG pipeline: which vector DB to use and how, which embedding model to use, pure vector search or hybrid search, chunking strategies, and on and on...
With Vectara's RAG-as-a-service (www.vectara.com) we try to help address exactly this issue: you get an optimized, high-performance, secure, and scalable RAG pipeline, so you don't need to go through this massive hyper-parameter tuning exercise. Yes, there are still some very useful levers you can experiment with, but only where it really matters.