Isn’t it basically a traditional search (either keyword based, vector based (embeddings have been around for years), or a combination of both) where you take the top N results (usually not even full docs, but chunks, due to context-length limitations) and pass them to an LLM to regurgitate a response (hopefully without hallucinations), instead of simply listing the results right away? I think some implementations also ask the LLM to rewrite the user query to “capture the user intent”.
What am I missing here? What makes it so useful?
One example is in finance: you have a lot of 45-page PDFs lying around and you're pretty sure one of them has the reg, or the info you need. You aren't sure which, so you open them one by one and search for a word, then jump through a bunch of those results and decide it's not this PDF. You do that till you find the "one". A non-trivial number of executive-level jobs consist of pretty much this for half the work week.
RAG purports to let you search one time.
When most people mention RAG, they’re using a vector store to surface results that are semantically similar to the user’s query (the retrieval part). They then pass these results to an LLM for summary (the generation part).
In practice, the problems with RAG are similar to the traditional problems of search: indices, latency, and correctness.
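To make the retrieve-then-generate loop above concrete, here is a minimal sketch. It uses a toy bag-of-words similarity as a stand-in for a real embedding model; the chunk texts and function names are illustrative, not from any particular system.

```python
import math
import re
from collections import Counter

# Toy bag-of-words "embedding"; a real system would call an embedding model.
def embed(text):
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # The "R" in RAG: surface the top-k chunks most similar to the query.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "The refund policy allows returns within 30 days.",
    "Shipping takes 5 to 7 business days.",
    "Our headquarters are in Berlin.",
]
top = retrieve("how many days for a refund", chunks)
# The "G" step: hand the retrieved chunks to an LLM as grounding context.
prompt = "Answer using only this context:\n" + "\n".join(top)
```

The difference from plain search is only that last line: instead of showing `top` to the user, you pass it to the model for synthesis.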
The most useful and verifiable RAG setup I've seen is hooking up an RDBMS to an LLM and asking questions about the table data in plain English. You can do it in several steps.
1. Extract the metadata of the tables, e.g. table names, columns of each table, related columns of the tables, indexed columns, etc. This is your RAG data.
2. Build the RAG context with the metadata, i.e. listing each table, its columns, relationships, etc.
3. Feed the RAG context and the user's question to the LLM. Tell the LLM to generate SQL for the question given the RAG context.
4. Run the SQL query on the database.
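Steps 1 and 2 can be sketched against SQLite's introspection pragmas; the schema here is a made-up example, and a real setup would point at your actual database.

```python
import sqlite3

# In-memory demo schema standing in for a real RDBMS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY,
                     customer_id INTEGER REFERENCES customers(id),
                     total REAL);
""")

# Step 1: extract table metadata; step 2: render it as the RAG context.
def schema_context(conn):
    lines = []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for t in tables:
        cols = [f"{c[1]} {c[2]}" for c in conn.execute(f"PRAGMA table_info({t})")]
        lines.append(f"table {t}({', '.join(cols)})")
        # Record declared relationships between tables.
        for fk in conn.execute(f"PRAGMA foreign_key_list({t})"):
            lines.append(f"  {t}.{fk[3]} -> {fk[2]}.{fk[4]}")
    return "\n".join(lines)

context = schema_context(conn)
# Step 3 would prepend this context to the user's question and ask the
# LLM for SQL; step 4 runs the generated SQL with conn.execute(...).
```

Because the model only ever sees the schema, not the rows, this also keeps the data itself out of the prompt.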
It's uncannily good. And it can be easily verified given the SQL.
Everyone talks about "reducing hallucinations", but from a system perspective, everything an LLM emits is equally hallucinated.
Putting the relevant data in context gets around this and provides actual provenance of information, something that is absolutely required for real "knowledge" and which we often take for granted in practice.
Of course, the ability to do so is entirely reliant on the retrieval's search quality. Tradeoffs abound. But with enough clever tricks it does seem possible to take advantage of both the LLM's broad but unsubstantiated content and specific fact claims.
It's abstractive (new) versus extractive (old) summarization.
What makes it useful is that it does the work of synthesizing the information. Imagine you ask a question that involves bits and pieces of numerous articles. In the past you had to read them all and mentally synthesize them.
The overall system suggests degrees of freedom in search that might not have been available. This is by having a knowledge store in a format (vectors) primed for search, then having it be accessible in full or in partitions, by agents, working on one or more concurrent flows around a query.
I also see value in having a full circuit of native-format components that can be pieced together to make higher order constructs. Agents is just the most recent one to emerge and i can easily see a mixture of fine tuned experts alongside stores of relevant material.
/2c
1) are there filters we need to build? 2) do we have inventory?
We run as many methods as practical in parallel (sql, vector, full text, other methods, etc.) and return the first one that meets our threshold. Vector search is almost never the winner relative to full text.
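A sketch of that race-the-backends pattern, assuming each backend returns a confidence score along with its hits; the backend functions and threshold here are hypothetical stand-ins, not the commenter's actual code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-ins for real retrieval backends; each returns
# (score, hits). Names and scores are illustrative only.
def full_text_search(q):
    return 0.92, [f"fts hit for {q}"]

def vector_search(q):
    return 0.55, [f"vector hit for {q}"]

def sql_search(q):
    return 0.10, []

THRESHOLD = 0.8

def retrieve(query, methods=(full_text_search, vector_search, sql_search)):
    # Launch every method in parallel; return the first result that
    # clears the confidence threshold, else the best we saw.
    best = (0.0, [])
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(m, query) for m in methods]
        for fut in as_completed(futures):
            score, hits = fut.result()
            if score >= THRESHOLD:
                return hits
            if score > best[0]:
                best = (score, hits)
    return best[1]

hits = retrieve("max ripple voltage for part 4711")
```

The fallback to the best sub-threshold result matters in practice: you want an answer even when no method is confident.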
Instead, I see a lot of people in sister companies using the most robust models they can find and having agents to do chain of thought, while their users are wondering when, if ever, they’ll get a response back.
Full text search is certainly the winner in the time dimension, but can it compete in quality? Presumably which method is likely to provide relevant results depends greatly on the query. Invoking LLMs to pre-process the query and select a retrieval method is going to be quite expensive compared to each of the search methods.
We also have a lot of numbers in our customer requests, which do not typically play to the strengths of the vector searches.
COGS is not a large concern, as our audience is internal-facing along with a few of our partners, so inference and infrastructure costs are nothing compared to engineering time, and we don’t have a way to amortize our costs across a bunch of customers.
It is also a very high value use case for us.
The other factor is that we’re using fast and cheap models like haiku and mixtral to do the pre processing before we hand things to the retrieval steps, so it’s not much of a cost driver.
We have just found that vector search does not play well with numbers and does not provide consistent results, so we end up needing more chunks, which compounds token usage, slows responses, and raises the chance of incorrect responses due to the customer-facing model getting confused by similar results. I’m sure we could optimize our approach, but full text has worked far more reliably than expected, so we have invested more resources into how we handle documents, latency reduction, and pulling in structured data.
For reference our subject matter is engineering specs for high precision electronics manufacturing. We have ~100k products and a lot of them have identical documentation except for a few figures (which make all the difference in the world), so it’s a challenging use case that is very unforgiving. Totally doable though and the basis for a lot of capabilities we’ll be investing in moving forward.
Happy to share as I think we’re ahead in a few areas but believe others will catch up and we’ve learned so much from others willing to share info, so we always try to pay forward.
The options, as far as I can tell, are:
- Re-embed lazily as needed at prompt-time. This should be the cheapest as it minimizes the number of embedding calls, but it's the most expensive in terms of latency.
- Re-embed eagerly after updates (perhaps with some delay and throttling to avoid rapid-fire embed calls). Great for latency, but can get very expensive.
- Some combination of the above two options. This seems to be what many IDE-based AI tools like GH Copilot are doing. An issue with this approach is that it's hard to ever know for sure what's updated in the RAG index and what's stale, and what exactly is getting added to context at any given time.
I'm leaning toward the first option (lazy on-demand embedding) and letting the user decide whether the latency cost is worth it for their task vs. just manually selecting the exact context they want to load.
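That first option can be sketched as a content-hash cache: at prompt time, re-embed only the documents whose content has changed. The `embed()` function below is a hypothetical stand-in (a real one would call an embedding model, which is exactly the latency cost being weighed).

```python
import hashlib

calls = {"embed": 0}

# Hypothetical embed(); a real system would call an embedding model here.
def embed(text):
    calls["embed"] += 1
    return [float(b) for b in hashlib.sha256(text.encode()).digest()[:4]]

class LazyIndex:
    """Sketch of lazy on-demand embedding: re-embed a document only at
    prompt time, and only when its content hash has changed."""
    def __init__(self):
        self._cache = {}  # doc_id -> (content_hash, vector)

    def vector_for(self, doc_id, content):
        h = hashlib.sha256(content.encode()).hexdigest()
        cached = self._cache.get(doc_id)
        if cached and cached[0] == h:
            return cached[1]      # still fresh: no embedding call
        vec = embed(content)      # stale or new: pay the latency now
        self._cache[doc_id] = (h, vec)
        return vec

idx = LazyIndex()
v1 = idx.vector_for("main.py", "print('hi')")
v2 = idx.vector_for("main.py", "print('hi')")   # cache hit, no re-embed
```

One nice property for the "what's stale?" worry: a vector returned by `vector_for` is, by construction, never stale, because freshness is checked against the current content on every call.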
I've been using this as a starter: https://developers.cloudflare.com/workers-ai/tutorials/build... I put in text, but I feel like my conception of what should get high relevancy scores doesn't match the percentages that come out.
The article talks about full text search and metadata, so maybe that's the path I should be taking instead of vector search? Where would I store the metadata in this case? A regular db?
I wish articles like this would go into more details about the nitty gritty. But I appreciate high level overview in the article as well.
A good overview is chapter 6 of the Stanford NLP group's IR book [0].
Engineering LLMs still requires a good foundation in the basics of ML/NLP so it's worth the time to catch up a bit.
0. https://web.archive.org/web/20231207074155/https://nlp.stanf...
I'd recommend taking a look at lancedb as they support text, vectors, and sql.
High relevancy scores are not percentages; they only make sense for ordering. A 0.7 does not mean "relevant", but 0.9 vs 0.7 means "maybe more relevant".
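A tiny worked example of why the absolute value is meaningless: cosine similarity over toy 3-dimensional vectors (real embeddings have hundreds of dimensions, but the point is the same).

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy vectors standing in for real embeddings.
query = [1.0, 2.0, 0.5]
doc_a = [1.1, 1.9, 0.4]   # points roughly the same direction as the query
doc_b = [0.2, 0.1, 3.0]   # points a very different direction

s_a = cosine(query, doc_a)
s_b = cosine(query, doc_b)
# s_a > s_b tells you doc_a is the better match; the absolute values
# are not probabilities or percentages of relevance.
```

Note that a score like 0.35 could be the best match in one corpus and a terrible one in another; only the ranking within one query's results carries meaning.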
It generates synthetic questions, tests different embedding models, chunking strategies, etc. You end up with clear data that shows you what will give you the optimal results for your RAG app: https://platform.vectorize.io/public/experiments/ca60ce85-26...
An implementation: github.com/infiniflow/ragflow
Maybe not what’s happening in this case, but it’s what springs to mind.
But yes, this isn't a good HN submission without detail.
With larger-scale real-world enterprise RAG-based applications, you soon realize the enormous time and effort required to experiment with all these levers to optimize the RAG pipeline: which vector DB to use and how, which embedding model to use, pure vector search or hybrid search, chunking strategies, and on and on...
With Vectara's RAG-as-a-service (www.vectara.com) we try to help address exactly this issue: you get an optimized, high-performance, secure, and scalable RAG pipeline, so you don't need to go through this massive hyper-parameter tuning exercise. Yes, there are still some very useful levers you can experiment with, but only where it really matters.