- Hybrid retrieval (semantic + vector) followed by LLM-based reranking made no significant difference using synthetic eval-questions
- HyDE severely decreased answer quality and retrieval quality when measured with RAGAS using synthetic eval-questions
(we still have to do a RAGAS eval using expert and real user questions)
So yes, hybrid retrieval is always good - that's no news to anyone building production-ready or enterprise RAG solutions. But one method doesn't always win. We found the semantic search in Azure AI Search to be sufficient as a second method next to vector similarity. Others might find BM25 great, or a fine-tuned query post-processing SLM. It depends on the use case. Test, test, test.
Next things we're going to try:
- RAPTOR
- SelfRAG
- Agentic RAG
- Query Refinement (expansion and sub-queries)
- GraphRAG
Learning so far:
- Always use a baseline and an experiment to try to refute your null hypothesis using measures like RAGAS or others.
- Use three types of evaluation questions/answers: 1. Expert written q&a, 2. Real user questions (from logs), 3. Synthetic q&a generated from your source documents
RAPTOR, for example, is a technique that clusters documents, summarizes each cluster, and embeds those summaries, recursively building a sort of tree. Paper: https://arxiv.org/html/2401.18059v1
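For intuition, here is a minimal, single-level sketch of the RAPTOR idea. The real method uses soft clustering (Gaussian mixtures over reduced embeddings) and an LLM as the summarizer; here a tiny hand-rolled k-means and a string-joining `summarize` stand in, and all names and numbers are illustrative:

```python
import random

def l2sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20, seed=0):
    """Tiny k-means; returns a cluster index per vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    assign = [0] * len(vectors)
    for _ in range(iters):
        for i, v in enumerate(vectors):
            assign[i] = min(range(k), key=lambda c: l2sq(centroids[c], v))
        for c in range(k):
            members = [vectors[i] for i, a in enumerate(assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def raptor_level(chunks, embeddings, k, summarize):
    """One level of the tree: cluster the chunks, summarize each cluster.
    The summaries would themselves be embedded and clustered at the next level."""
    assign = kmeans(embeddings, k)
    nodes = []
    for c in range(k):
        members = [chunks[i] for i, a in enumerate(assign) if a == c]
        if members:
            nodes.append({"summary": summarize(members), "children": members})
    return nodes

# Toy run: 2-D stand-ins for embeddings, summarizer just joins the text.
chunks = ["apnea intro", "apnea details", "tax law intro", "tax law details"]
embs = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
tree = raptor_level(chunks, embs, k=2, summarize=" | ".join)
```

At query time you can then search leaves and summary nodes together, which is what lets RAPTOR answer questions that span many chunks.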
Agentic RAG means building an agent that can decide to augment "conversations" (or other LLM tools) with RAG searches and analyze the relevance of what comes back. Pretty useful, but hard to implement right.
You can google the others; they're all more or less techniques to improve an old-fashioned RAG search.
This is akin to a HN comment asking someone to search the Internet for something on their behalf, while discussing search engine algorithms!
That's priced at around 1/10th of what the prompts would normally cost if they weren't cached, which means that tricks like this (running every single chunk against a full copy of the original document) become feasible where previously they wouldn't have made financial sense.
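A back-of-the-envelope sketch of why that 1/10th price matters. The per-million-token price and the document/chunk counts below are hypothetical; only the 10x cache discount comes from the observation above:

```python
# Hypothetical pricing: $3 per million input tokens, cached reads at 1/10th.
PRICE_PER_MTOK = 3.00
CACHED_FRACTION = 0.10

doc_tokens = 100_000   # full document re-sent as context for every chunk
n_chunks = 500         # every chunk is contextualized against the whole doc

uncached = n_chunks * doc_tokens / 1e6 * PRICE_PER_MTOK
cached = uncached * CACHED_FRACTION

print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
```

With these made-up numbers, a $150 indexing pass drops to $15 - the kind of difference that moves a trick from "too expensive" to "routine".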
I bet there are all sorts of other neat tricks like this which are opened up by caching cost savings.
My notes on contextual retrieval: https://simonwillison.net/2024/Sep/20/introducing-contextual... and prompt caching: https://simonwillison.net/2024/Aug/14/prompt-caching-with-cl...
I spend $20/month on ChatGPT plus and $20/month on Claude Pro. I get GitHub Copilot for free as an open source maintainer.
I come from a traditional search background. It's quite obvious to me that RAG is a bit of a naive strategy if you limit it to just vector search with some off-the-shelf embedding model. Vector search simply isn't that good. You need additional information retrieval strategies if you want to improve the context you provide to the LLM. That is effectively what they are doing here.
Microsoft published an interesting paper on graph RAG some time ago where they combine RAG with vector search based on a conceptual graph that they construct from the indexed data using entity extraction. This allows them to pull in contextually relevant information for matching chunks.
I have a hunch that you could probably get quite far without doing any vector search at all. It would be a lot cheaper too. Simply use a traditional search engine and some tuned query. The trick is of course query tuning. Which may not work that well for general purpose use cases but it could work for more specialized use cases.
For question answering, vector/semantic search is clearly a better fit in my mind, and I can see how the contextual models can enable and bolster that. However, because I’ve implemented and used so many keyword based systems, that just doesn’t seem to be how my brain works.
An example I’m thinking of is finding a sushi restaurant near me with availability this weekend around dinner time. I’d love to be able to search for this exactly as I’ve written it. In practice, I’d search for "sushi restaurant", sort by distance, and hope the application does a proper job of surfacing time filtering.
Conversely, this is mostly how I would build this system. Perhaps with a layer to determine user intention to pull out restaurant type, location sorting, and time filtering.
I could see using semantic search for filtering down the restaurants to related to sushi, but do we then drop back into traditional search for filtering and sorting? Utilize function calling to have the LLM parameterize our search query?
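A sketch of that last option, with everything (schema, field names, toy data) made up for illustration: the LLM's only job is to fill in structured parameters via function calling; the filtering and sorting stay traditional:

```python
# Hypothetical tool schema an LLM would fill in via function calling
# from a query like "sushi near me with a table this Saturday evening".
search_restaurants_tool = {
    "name": "search_restaurants",
    "parameters": {
        "cuisine": {"type": "string"},          # semantic intent, e.g. "sushi"
        "max_distance_km": {"type": "number"},  # distance filter
        "open_at": {"type": "string"},          # desired slot, ISO timestamp
    },
}

def search_restaurants(restaurants, cuisine, max_distance_km, open_at):
    """Traditional filter + sort, run after the LLM has extracted parameters."""
    hits = [
        r for r in restaurants
        if cuisine in r["tags"]
        and r["distance_km"] <= max_distance_km
        and open_at in r["available_slots"]
    ]
    return sorted(hits, key=lambda r: r["distance_km"])

# Toy data standing in for a real restaurant index.
restaurants = [
    {"name": "Sushi A", "tags": ["sushi"], "distance_km": 1.2,
     "available_slots": ["2024-09-21T19:00"]},
    {"name": "Sushi B", "tags": ["sushi"], "distance_km": 0.4,
     "available_slots": ["2024-09-21T12:00"]},
    {"name": "Taco C", "tags": ["mexican"], "distance_km": 0.1,
     "available_slots": ["2024-09-21T19:00"]},
]
hits = search_restaurants(restaurants, "sushi", 5.0, "2024-09-21T19:00")
```

Semantic search could still replace the exact `cuisine in r["tags"]` match, but the time and distance constraints remain ordinary structured filters.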
As stated, perhaps I’m not thinking about these the right way because of my experience with existing systems, which, when well built, seem to give me better results.
The article claimed other context augmentation fails and that you are better off paying Anthropic to run an LLM over all your data, but it seems quite handwavy. What vector+text search nuance does a full-document-cache LLM rewrite catch that cheapo methods miss? Reminds me of "It is difficult to get a man to understand something when his salary depends on his not understanding it". (We process enough data that we try to limit LLMs to the retrieval step, and only embeddings & light LLMs to the indexing step, so it's a $$$ distinction for our customers.)
The context caching is neat in general, so I have to wonder if this use case is more about paying for ease than quality, and its value for quality is elsewhere.
I'm not sure what Anthropic is introducing here. I looked at the cookbook code and it's just showing the process of producing said context, but there's no actual change to their API regarding "contextual retrieval".
The one change is prompt caching, introduced a month back, which allows you to very cheaply add better context to individual chunks by providing the entire (long) document as context. Caching is an awesome feature to expose to developers and I don't want to take anything away from that.
However, other than that, the only thing introduced is a cookbook on how to do a particular RAG workflow.
As an aside, Cohere may be my favorite API to work with. (no affiliation) Their RAG API is a delight, and unlike anything else provided by other providers. I highly recommend it.
The usual dose for adults is one or two 200mg tablets or
capsules 3 times a day.
It is now something like:

# Fever
## Treatment
---
The usual dose for adults is one or two 200mg tablets or capsules 3 times a day.
This seems to work pretty well, and it doesn't require any LLMs when indexing documents.
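One way to implement that LLM-free indexing step, assuming markdown-style headings and blank-line-separated chunks (the function and its conventions are a sketch, not the commenter's actual code):

```python
def contextualize_chunks(markdown_text, chunk_separator="\n\n"):
    """Prepend the active heading path (e.g. '# Fever ## Treatment') to each chunk."""
    headings = {}  # heading level -> heading text
    out = []
    for block in markdown_text.split(chunk_separator):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            level = len(block) - len(block.lstrip("#"))
            headings[level] = block
            # A new heading invalidates any deeper headings below it.
            for deeper in [l for l in headings if l > level]:
                del headings[deeper]
        else:
            path = " ".join(headings[l] for l in sorted(headings))
            out.append(f"{path}\n---\n{block}" if path else block)
    return out

doc = ("# Fever\n\n## Treatment\n\n"
       "The usual dose for adults is one or two 200mg tablets or\n"
       "capsules 3 times a day.")
chunks = contextualize_chunks(doc)
```

Each emitted chunk now carries its section context, so both the embedding and any keyword index see "Fever" and "Treatment" alongside the dosage text.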
Vector embeddings have bag-of-words compression properties and can over-index on the first newline-separated text block, to the extent that certain indices in the resulting vector end up much closer to 0 than they otherwise would. With quantization, they can eventually become 0, causing you to lose a lot of precision in the dense vectors. IDF search overcomes this to some extent, but not enough.
You can "semantically boost" embeddings such that they move closer to your document's title, summary, abstract, etc. and get the recall benefits of this "context" prepend without polluting the underlying vector. Implementation wise it's a weighted sum. During the augmentation step where you put things in the context window, you can always inject the summary chunk when the doc matches as well. Much cleaner solution imo.
Description of "semantic boost" in the Trieve API[1]:
>semantic_boost: Semantic boost is useful for moving the embedding vector of the chunk in the direction of the distance phrase. I.e. you can push a chunk with a chunk_html of "iphone" 25% closer to the term "flagship" by using the distance phrase "flagship" and a distance factor of 0.25. Conceptually it's drawing a line (euclidean/L2 distance) between the vector for the innerText of the chunk_html and distance_phrase, then moving the vector of the chunk_html distance_factor * L2Distance closer to or away from the distance_phrase point along the line between the two points.
[1]:https://docs.trieve.ai/api-reference/chunk/create-or-upsert-...
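The weighted sum described above is just linear interpolation toward the phrase vector. A sketch (not Trieve's actual code; the parameter name follows their API description):

```python
def semantic_boost(chunk_vec, phrase_vec, distance_factor):
    """Move chunk_vec `distance_factor` of the way toward phrase_vec
    along the straight (L2) line between the two points."""
    return [c + distance_factor * (p - c) for c, p in zip(chunk_vec, phrase_vec)]

# Toy 2-D vectors: boost the chunk 25% of the way toward the phrase.
boosted = semantic_boost([1.0, 0.0], [0.0, 1.0], 0.25)
```

A negative `distance_factor` would push the chunk away from the phrase instead.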
A vector database is there to store vectors and compute distances between them.
The embedding model is the model you pick to generate these vectors from a string or an image.
You give "bart simpson" to an embeddings model and it becomes (43, -23, 2, 3, 4, 843, 34, 230, 324, 234, ...)
You can imagine them like geometric points in space (well, vectors, technically), except that instead of living in 2D or 3D space, they typically have a much higher number of dimensions (e.g. 768).
When you want to find similar entries, you just generate a new vector "homer simpson" (64, -13, 2, 3, 4, 843, 34, 230, 324, 234, ...) and send it to the vector database and it will return you all the nearest neighbors (= the existing entries with the smallest distance).
To generate these vectors, you can use any model that you want, however, you have to stay consistent.
It means that once you are using one embedding model, you are "forever" stuck with it, as there is no practical way to project from one vector space to another.
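The lookup described above, in miniature (the vectors reuse the toy numbers from the example, truncated to four dimensions; a real embedding model produces them):

```python
import math

# Pretend "vector store": string -> embedding (illustrative numbers).
store = {
    "bart simpson": [43, -23, 2, 3],
    "lisa simpson": [40, -20, 2, 3],
    "darth vader":  [-500, 12, 90, 7],
}

# Embedding of the query "homer simpson" (same made-up model).
query = [64, -13, 2, 3]

# Nearest neighbors = smallest Euclidean (L2) distance to the query vector.
nearest = sorted(store, key=lambda k: math.dist(store[k], query))
```

The Simpsons entries end up far closer to the query than the unrelated one, which is the whole trick, and also why you must embed queries with the same model you indexed with.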
My understanding is you want to know "are vector DBs compatible with specific LLMs, or are we stuck with a specific LLM if we want to do RAG once we've adopted a specific vector store?"
And the answer to that is that the LLM never sees the vectors from your DB. Your LLM only sees what you submit as context (ie the "system" and "user" prompts in chat-based models).
The way RAG works is:
1 - end-user submits a query
2 - this query is embedded (with the same model that was used to compile the vector store) and compared (in the vector store) with existing content, to retrieve relevant chunks of data
3 - and then this data (in the form of text segments) is passed to the LLM along with the initial query.
So, in a sense you're "locked in" in the sense that you need to use the same embedding model for storage and for retrieval. But you can definitely swap out the LLM for any other LLM without reindexing.
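The three steps above as a skeleton. `embed`, `vector_store`, and `llm` are placeholders for whatever embedding model, vector database, and LLM you have picked; the fake store below just shows the wiring:

```python
def answer(query, embed, vector_store, llm, top_k=3):
    # 1. embed the user query with the SAME model used at indexing time
    qvec = embed(query)
    # 2. retrieve the nearest chunks - as text, not vectors
    chunks = vector_store.search(qvec, top_k)
    # 3. the LLM only ever sees text, so swapping LLMs needs no reindexing
    prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}"
    return llm(prompt)

# Toy stubs to demonstrate the wiring end to end.
class FakeStore:
    def search(self, qvec, top_k):
        return ["The capital of France is Paris."]

reply = answer("capital of France?", embed=lambda q: [0.0],
               vector_store=FakeStore(), llm=lambda prompt: prompt)
```

Note that only `embed` is coupled to the store; the `llm` argument can be replaced freely.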
An easy way to try this behavior out as a layperson is AnythingLLM, an open-source desktop client that lets you embed your own documents and use RAG locally with open-weight models, or swap in any of the LLM APIs.
Another way to look at it, comments. Imagine every comment under this post is a pointer back to the original post. Some will be close in distance, and others will be farther, due to the perception of the authors of the comments themselves. But if you assign each comment a “parent_id”, your access to the post multiplies.
You can see an example of this technique here [1]. I don’t attempt to mind read what the end user will query for, I simply let them tell me, and then index that as a pointer. There are only a finite number of options to represent a given object. But some representations are very, very, very far from the semantic meaning of the core object.
[1] - https://x.com/yourcommonbase/status/1833262865194557505
An example: let's suppose you're using an LLM to play a multi user dungeon. In the past your character has behaved badly with taxis so that the game has decided to create a rule that says that whenever you try to enter a taxi you're kicked out: "we know who you are, we refuse to have you as a client until you formally apologize to the taxi company director". Upon apologizing, the rule is removed. Note that the director of the taxi company could be another player and be the one who issued the rule in the first place, to be enforced by his NPC fleet of taxis.
I'm wondering how well this could scale (with respect of number of active rules) and to which extent traditional RAG could be applied. It seems deciding whether a rule applies or not is a problem that is more abstract and difficult than deciding whether a chunk of knowledge is relevant or not.
In particular, the main problem I have identified that makes it harder is the following dependency loop, which doesn't appear with knowledge retrieval: you need to retrieve a rule before you can decide whether it applies. Does anyone know how this problem could be solved?
Example query, with some help from LLama 3.1 8B:
As the dark elven horde closes in on his position, Grimgold Ironfist finds himself in a desperate predicament. His sturdy bearded face is set with determination, but his worn leather apron and mismatched socks are a far cry from the battle-hardened armor he once donned as a proud member of the Dwarven Militia. Now, his tunic is stained with ale and oil from a recent session at the local tavern, and his boots are scuffed from countless miles of adventuring. His health bar, once a proud 100%, now teeters on 35% due to a nasty encounter with a giant spider earlier that day. In his inventory, Grimgold has: a rusty iron pickaxe (degraded), a waterskin (half-full), a chunk of stale bread (half-eaten), and a small pouch containing 17 gold pieces. His trusty hammer, "Mithrilcrusher", lies forgotten in the nearby underbrush, having been left behind in his haste to flee the elven army. With no time to lose, Grimgold spots a lone taxi cab rattling down the road - its golden horse emblem a beacon of hope in this desperate hour. He sprints towards it, hoping against hope that he can somehow sweet-talk the driver into taking him on, despite his...ahem...' checkered past' with the Taxi Guild.
Example rule that would be fetched from the vector store (because of vector proximity caused by the character name/attributes and by the mentions of taxis and the Taxi Guild): The Taxi Guild has imposed a strict penalty upon Grimgold: whenever he attempts to hail a cab, he is summarily ejected from the vehicle. The Guild’s decree, inscribed on a parchment of shame, reads:
“Grimgold Ironfist, bearded dwarf of ill repute, henceforth shall not be granted passage in any taxi operated by our members until he has formally apologized to Thorgrim Stonebeard, Director of the Golden Horse Cab Company. Failure to comply with this edict shall result in perpetual exclusion from our services.”

I would prefer that Anthropic just release their tokeniser so we don't have to make guesses.
"study your data before indexing it"
Does anyone know if the datasets they used for the evaluation are publicly available or if they give more information on the datasets than what's in appendix II?
There are standard publicly available datasets for this type of evaluation, like MTEB (https://github.com/embeddings-benchmark/mteb). I wonder how this technique does on the MTEB datasets.
I wonder how it would work if you generated the contexts yourself algorithmically. Depending on how well structured your docs are this could be quite trivial (eg for an html doc insert the title > h1 > h2 > chunk).
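A sketch of that algorithmic approach using only the standard library: track the title > h1 > h2 path while parsing the HTML and prepend it to each paragraph chunk. The class and output format are made up for illustration:

```python
from html.parser import HTMLParser

class HeadingTracker(HTMLParser):
    """Emit each <p> chunk prefixed with its current title > h1 > h2 path."""
    def __init__(self):
        super().__init__()
        self.path = {}       # tag name -> heading text
        self.current = None  # tag we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.current = tag

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current in ("title", "h1", "h2"):
            self.path[self.current] = text
            if self.current == "h1":
                self.path.pop("h2", None)  # a new h1 resets deeper levels
        elif self.current == "p":
            prefix = " > ".join(self.path[t] for t in ("title", "h1", "h2")
                                if t in self.path)
            self.chunks.append(f"{prefix}: {text}")

parser = HeadingTracker()
parser.feed("<title>Dosage guide</title><h1>Fever</h1><h2>Treatment</h2>"
            "<p>One or two 200mg tablets, 3 times a day.</p>")
```

For well-structured docs this generates per-chunk context at zero LLM cost, with the obvious caveat that it only captures what the heading hierarchy already says.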
This example is well written and documented, easy to understand. Well done.
What exactly is a "failure rate" and how is it computed?
NotebookLM is currently free to use and was so good I almost immediately started paying Google $20 a month to get access to their pro version of Gemini.
I still think the Groq APIs for open weight models are the best value for the money, but the way OpenAI, Google, Anthropic, etc. are productizing LLMs is very impressive.
"Chunk boundaries: Consider how you split your documents into chunks. The choice of chunk size, chunk boundary, and chunk overlap can affect retrieval performance."
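For concreteness, a minimal fixed-size chunker showing how overlap interacts with boundaries (the sizes are arbitrary, and real systems usually split on sentence or section boundaries rather than raw characters):

```python
def chunk(text, size=200, overlap=50):
    """Fixed-size sliding window. The overlap means any sentence that
    straddles a boundary still appears whole in at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# 500 characters of varied text -> three chunks with 50-char shared edges.
text = "".join(str(i % 10) for i in range(500))
chunks = chunk(text)
```

Larger overlap buys boundary robustness at the cost of index size and duplicated retrievals, which is exactly the trade-off the quoted advice is pointing at.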