This comes at the cost of significantly higher latency and cost. But for us, answer quality is a much higher priority.
Or, at least it seems to in the limited amount of testing I did in a weekend. I'm an embedded dev without any real AI experience or an actual use case for building a RAG at the moment.
Companies are being sold the idea that they can augment their LLM with their massive unstructured dataset, but it's all wishful thinking.
I wonder whether this would benefit from a fine-tuned LLM for that specific step, or even from providing a set of examples in the prompt of when to use which tool?
If so, then I would suggest that you run it ahead of time and have the LLM generate possible questions based on the context of each semantically split chunk.
That way you only need to compare the embeddings at query time and it will already be pre-sorted and ranked.
The trick, of course, is chunking it correctly and generating the right questions. But in both cases I would look to the LLM to do that.
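The idea above can be sketched as two phases: an offline pass that embeds LLM-generated questions per chunk, and an online pass that only embeds the query and ranks against the precomputed vectors. This is a toy sketch, not the commenter's actual code: `embed()` is a bag-of-words stand-in for a real embedding model, and the per-chunk questions are hand-written where a real system would have the LLM produce them out-of-band.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: lower-cased bag of words. A real system would
    # call an embedding model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Offline step: questions generated per chunk (hand-written here;
# the suggestion above is to have an LLM generate these ahead of time).
chunk_questions = {
    "chunk-1": ["how do I reset my password", "where is the login page"],
    "chunk-2": ["what is the refund policy", "how long do refunds take"],
}
index = [(cid, embed(q)) for cid, qs in chunk_questions.items() for q in qs]

def retrieve(query: str, k: int = 1) -> list:
    # Online step: embed only the query, then rank the pre-embedded
    # questions and return the chunks behind the best matches.
    qv = embed(query)
    scored = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    seen, out = set(), []
    for cid, _ in scored:
        if cid not in seen:
            seen.add(cid)
            out.append(cid)
        if len(out) == k:
            break
    return out

print(retrieve("how do refunds work"))
```

The query-time cost is one embedding call plus similarity lookups, which is the "pre-sorted and ranked" property the comment is after; in practice the linear scan would be replaced by an approximate-nearest-neighbor index once chunk counts grow.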
Happy to recommend some tips on semantically splitting documents using the LLM with really low token usage if you're interested.
Go on please :)
Possible but very compute intensive. Imagine if you have hundreds of thousands of chunks...
The generation of questions can be done out-of-band by a cheaper model.
Their current implementation seems to require some computation per request. It would be a balancing act to see which strategy provides the most value.
Responses overall would be faster.
For internal use cases that require user-level permissions, that's a freaking rabbit hole. I recently heard someone describe Glean as a "permissions company" more so than a search company for that reason. :)
I am curious whether fine-tuning on specific use cases would outperform RAG approaches, assuming the data is static (say, company documentation). I know there have been lots of posts on this, but I have yet to see quantifications, especially with o3-mini.
There are no programs online which do this (lots of viewers, but no interpreters/converters), and I had actually gotten a quote for proprietary software that can do it, but it's $1k/yr to use.
I _did not_ think Claude would be able to do it, but thought I would give it a shot. It took 3 prompts to get 95% of the way there. The last 5% was done by o3-mini because Claude ran out of capacity for me.
I was able to get them to answer very simple questions without any vector database or pre-indexing: just expanding the search query to synonyms, then using normal full-text search, using embeddings to match article titles to the query, plus adding a few "personality documents" that are always in every result set no matter what.

Then I do chunking on the fly based on similarity to the query.
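The retrieval part of that pipeline can be sketched roughly as follows. This is a guess at the shape of it, not the commenter's code: the synonym table, the document store, and the "personality documents" list are all invented for illustration, and the embedding-based title matching and on-the-fly chunking steps are omitted.

```python
# Hypothetical synonym table and document store for illustration only.
SYNONYMS = {"car": ["auto", "vehicle"], "fix": ["repair"]}
PERSONALITY_DOCS = ["style-guide"]  # always included in every result set

DOCS = {
    "style-guide": "answer politely and cite sources",
    "brakes": "how to repair brakes on your vehicle",
    "engine": "engine maintenance schedule",
}

def expand(query: str) -> set:
    # Expand each query term with its synonyms before full-text search.
    terms = set(query.lower().split())
    for t in list(terms):
        terms.update(SYNONYMS.get(t, []))
    return terms

def fulltext_search(query: str) -> list:
    # Naive full-text match: a document hits if it shares any expanded term.
    terms = expand(query)
    hits = [doc_id for doc_id, text in DOCS.items()
            if terms & set(text.lower().split())]
    # Personality documents ride along in every result set, matched or not.
    return PERSONALITY_DOCS + [d for d in hits if d not in PERSONALITY_DOCS]

print(fulltext_search("fix my car"))
```

In a real setup the naive term-overlap scan would be a proper full-text engine (e.g. SQLite FTS or Postgres `tsvector`), with the matched documents then re-chunked by similarity to the query as described above.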
Retrieval takes about 1 second on a CPU, but then the actual LLM call takes 10 to 40 seconds, because you need about 1,500 bytes of context to consistently get something that has the answer in it... Not exactly useful at the moment on cheap consumer hardware, but still very interesting.