This comes at the cost of significantly higher latency and cost. But for us, answer quality is a much higher priority.
Or, at least it seems to in the limited amount of testing I did in a weekend. I'm an embedded dev without any real AI experience or an actual use case for building a RAG at the moment.
Companies are being sold the idea that they can augment their LLM with their massive unstructured dataset, but it's all wishful thinking.
I wonder whether this would benefit from a fine-tuned LLM for that specific step, or even from providing a set of examples in the prompt of when to use which tool?
If so, then I would suggest that you run it ahead of time and have the LLM generate possible questions based on the context of each semantically split chunk.
That way you only need to compare the embeddings at query time and it will already be pre-sorted and ranked.
The trick, of course, is chunking it correctly and generating the right questions. But in both cases I would look to the LLM to do that.
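The idea above can be sketched as two phases: an offline pass that embeds LLM-generated questions per chunk, and an online pass that only embeds the query and ranks against the precomputed vectors. This is a toy sketch, not the commenter's actual code: `embed()` is a bag-of-words stand-in for a real embedding model, and the per-chunk questions are hand-written where a real system would have the LLM produce them out-of-band.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: lower-cased bag of words. A real system would
    # call an embedding model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Offline step: questions generated per chunk (hand-written here;
# the suggestion above is to have an LLM generate these ahead of time).
chunk_questions = {
    "chunk-1": ["how do I reset my password", "where is the login page"],
    "chunk-2": ["what is the refund policy", "how long do refunds take"],
}
index = [(cid, embed(q)) for cid, qs in chunk_questions.items() for q in qs]

def retrieve(query: str, k: int = 1) -> list:
    # Online step: embed only the query, then rank the pre-embedded
    # questions and return the chunks behind the best matches.
    qv = embed(query)
    scored = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    seen, out = set(), []
    for cid, _ in scored:
        if cid not in seen:
            seen.add(cid)
            out.append(cid)
        if len(out) == k:
            break
    return out

print(retrieve("how do refunds work"))
```

The query-time cost is one embedding call plus similarity lookups, which is the "pre-sorted and ranked" property the comment is after; in practice the linear scan would be replaced by an approximate-nearest-neighbor index once chunk counts grow.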
Happy to recommend some tips on semantically splitting documents using the LLM with really low token usage if you're interested.
Go on please :)
Possible but very compute intensive. Imagine if you have hundreds of thousands of chunks...
The generation of questions can be done out-of-band by a cheaper model.
Their current implementation seems to require some computation per request. It would be a balancing act to see which strategy provides the most value.
Responses overall would be faster.
For internal use cases that require user-level permissions, that's a freaking rabbit hole. I recently heard someone describe Glean as a "permissions company" more so than a search company for that reason. :)
I am curious whether fine-tuning on specific use cases would outperform RAG approaches, assuming the data is static (say, company documentation). I know there have been lots of posts on this, but I have yet to see quantifications, especially with o3-mini.
There are no programs online which do this (lots of viewers, but no interpreters/converters), and I had actually gotten a quote for proprietary software that can do it, but it's $1k/yr to use.
I _did not_ think Claude would be able to do it, but thought I would give it a shot. It took 3 prompts to get 95% of the way there. The last 5% was done by o3-mini because Claude ran out of capacity for me.
I was able to get them to answer very simple questions without any vector database or pre-indexing: just expanding the search query to synonyms, then using normal full-text search, using embeddings to match article titles to the query, plus adding a few "personality documents" that are always in every result set no matter what.

Then I do chunking on the fly based on similarity to the query.
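The retrieval part of that pipeline can be sketched roughly as follows. This is a guess at the shape of it, not the commenter's code: the synonym table, the document store, and the "personality documents" list are all invented for illustration, and the embedding-based title matching and on-the-fly chunking steps are omitted.

```python
# Hypothetical synonym table and document store for illustration only.
SYNONYMS = {"car": ["auto", "vehicle"], "fix": ["repair"]}
PERSONALITY_DOCS = ["style-guide"]  # always included in every result set

DOCS = {
    "style-guide": "answer politely and cite sources",
    "brakes": "how to repair brakes on your vehicle",
    "engine": "engine maintenance schedule",
}

def expand(query: str) -> set:
    # Expand each query term with its synonyms before full-text search.
    terms = set(query.lower().split())
    for t in list(terms):
        terms.update(SYNONYMS.get(t, []))
    return terms

def fulltext_search(query: str) -> list:
    # Naive full-text match: a document hits if it shares any expanded term.
    terms = expand(query)
    hits = [doc_id for doc_id, text in DOCS.items()
            if terms & set(text.lower().split())]
    # Personality documents ride along in every result set, matched or not.
    return PERSONALITY_DOCS + [d for d in hits if d not in PERSONALITY_DOCS]

print(fulltext_search("fix my car"))
```

In a real setup the naive term-overlap scan would be a proper full-text engine (e.g. SQLite FTS or Postgres `tsvector`), with the matched documents then re-chunked by similarity to the query as described above.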
Retrieval takes about 1 second on a CPU, but then the actual LLM call takes 10 to 40 seconds, because you need about 1,500 bytes of context to consistently get something that has the answer in it... Not exactly useful at the moment on cheap consumer hardware, but still very interesting.