From zero to a RAG system: successes and failures (opens in new tab)

(en.andros.dev)

322 pointsandros2mo ago103 comments

103 comments

>After several weeks, between 2 and 3, the indexing process finished without failures. ... we could finally shut down the virtual machine. The cost was 184 euros on Hetzner, not cheap.

184euro is loose change after spending 3 man weeks working on the process!

mrits2mo ago

That's the budget I'd have for the coffee shop with the team to discuss budget

EGreg2mo ago

That's the budget to discuss and approve the above coffee shop budget

nicbou2mo ago

This one is, oddly enough, higher

davidwritesbugs2mo ago

Pffft, that's the budget for the paperclips to hold the meeting notes together

1 more reply

closeparen2mo ago

184 Euro is about a day's wages for a software developer over there (40k EUR / 220 working days).

_the_inflator2mo ago

I implemented many RAGs and feel sorry for anyone proclaiming "RAG is dead". These folks have never implemented one, maybe followed a tutorial and installed a "Hello World!" project but that's it.

I don't want to go into detail but 100% agree with the author's conclusion: data is key. Data ingestion to be precisely. Simply using docling and transforming PDFs to markdown and have a vector database doing the rest is ridiculous.

For example, for a high precision RAG with 100% accuracy in pricing as part of the information that RAG provided, I took a week to build a ETL for a 20 page PDF document to separate information between SQL and Graph Database.

And this was a small step with all the tweaking that laid ahead to ensure exceptional results.

What search algorithm or: how many? Embeddings, which quality? Semantics, how and which exactly?

Believe me, RAG is the finest of technical masterpiece there is. I have so many respect for the folks at OpenAI and Anthropic for the ingestion processes and tools they use, because they operate on a level, I will never touch with my RAG implementations.

RAG is really something you should try for yourself, if you love to solve tricky fundamental problems that in the end can provide a lot of value to you or your customers.

Simply don't believe the hype and ignore all "install and embed" solutions. They are crap, sorry to say so.

RansomStark2mo ago

I have proclaimed RAG is dead many times, and I stand by it.

RAG is Dead! Long Live Agentic RAG! || Long Live putting stuff in databases where it damn well belongs!

I think you agree with the people saying RAG is Dead, or at least you agree with me and I say RAG is Dead, when you say "Simply using docling and transforming PDFs to markdown and have a vector database doing the rest is ridiculous."

I fully agree, but that was the promise of RAG, chunk your documents into little bits and find the bit that is closet to the users query and add it to the context, maybe leave a little overlap on the chunks, is how RAG was initially presented, and how many vendors implement RAG, looking at tools like Amazon Bedrock Knowledge Bases here.

When I want to know the latest <important financial number>, I want that pulled that from the source of truth for that data, not hopefully get the latest and not last years number from some document chunk.

So, when people, or at least when I say RAG is Dead, it's short hand for: this is really damn complex, and vector search doesn't replace decades of information theory, storage and retrieval patterns.

Hell, I've worked with teams trying to extract everything from databases to push it into vector stores so the LLM can use the data. First, it often failed as they had chunks with multiple rows of data, and the LLM got confused as to which row actually mattered, they hadn't realized that the full chunk would be returned and not just the row they were interested in. Second, the use cases being worked on by these teams were usually well defined, that is, the required data could be deterministically defined before going to the LLM and pulled from a database using a simple script, no similarity required, but that's not the cool way to do it.

joefourier2mo ago

I agree with you that simple vector search + context stuffing is dead as a method, but I think it's ridiculous to reserve the term "RAG" for just the earliest most basic implementation. The definition of Retrieval Augmented Generation is any method that tries to give the LLM relevant data dynamically as opposed to relying purely on it memorising training data, or giving it everything it could possibly need and relying on long context windows.

The RAG system you mentioned is just RAG done badly, but doing it properly doesn't require a fundamentally different technique.

hbrn1mo ago

> it's ridiculous to reserve the term "RAG" for just the earliest most basic implementation

Whether we like it or not, dumb semantic search became the colloquial definition of RAG.

And when you hear someone saying "we use RAG here" 95% of the time this is exactly what they mean.

When you inject user's name into the system prompt, technically you're doing RAG - but nobody thinks about it that way. I think it's one of those case where colloquial definition is actually more useful that the formal one.

> doing it properly doesn't require a fundamentally different technique

But agentic RAG is fundamentally different.

1 more reply

whakim2mo ago

I don't think we should undersell that transformers and semantic search are really powerful information retrieval tools, and they are extremely potent for solving search problems. That being said, I think I agree with you that RAG is fundamentally just search, and the hype (like any hype) elides the fact that you still have to solve all of the normal, difficult search problems.

maCDzP2mo ago

Do you have any good resources for what you are describing?

financltravsty2mo ago

I simply have no idea what you're babbling about. I'm not trying to be rude, but I really cannot parse what you're saying.

Simple RAG is fine for very simple workflows, but semantic similarity vector search has a lot of edge cases and isn't the best tool out there. RIG or even recursive LLMs work better in the general case.

Whatever you're saying, it does not really mesh with my experience.

brianykim2mo ago

Good company-ready RAG benefits a lot from some basic pre-processing/labeling of the data instead of solely dumping unstrucuted data into a vector database and calling it a day. Different heuristics and different schemas of embedded data go a long way in ensuring quality and flexibility of querying.

Then you can do ReAG, which let's you reason on top of the top K intelligently.

And things like memory knowledge graph services as well, can help reduce your search space, and provide extra context over time that gets updated, beyond just treating static docs as sources of truth. You can give it more context as to how it should interpret older docs, vs. newer docs, and allowing users (based on correctness or not) to help audit the what is embedded in your RAG systems.

I appreciate the thorough write up, but doing RAG systems seriously requires much more than just embeddings and a basic chromadb set up.

Happy to share any thoughts here or on a call if anyone wants to chat.

leflob2mo ago

I agree, I attempted a similar project a year ago and the retrival part is so critical. To work half decent you need some serious strategy for metadata, chunking, etc. E.g. how do you deal with tim series data? Like i am not looking for any quarterly numbers but the ones from Q2 2025, Or the research report from 4 weeks ago... And how do you deal with images. We had heaps of companiy knowledge in pptx which you can convert to text but what about pictures in the presentations. Our analyst presentations sometimes consist mostly of charts and visuals, how are they embedded? Also imo for 90% of the time companies dont need a RAG system but a good search / retrival system.

minikomi1mo ago

Yep. Semantically distinct and meaningful chunks wins every time over any kind of windowing or slice and dicing.

Unfortunately, many people are looking for a fire and forget solution over an existing rats nest of documentation debt..

gverrilla2mo ago

> Happy to share any thoughts here

Please do.

maxperience2mo ago

This article is interesting cause of its scale, but does not touch on how to properly use RAG best practices. We wrote up this blog post on how to actually build a smart enterprise AI RAG based on the latest research if it's interesting to anyone: https://bytevagabond.com/post/how-to-build-enterprise-ai-rag...

It's based on different chunking strategies that scale cheaply and advanced retrieval

1 more reply

JKCalhoun2mo ago

And some have been saying that RAGs are obsolete—that the context window of a modern LLM is adequate (preferable?). The example I recently read was that the contexts are large enough for the entire "The Lord of the Rings" books.

That may be, but then there's an entire law library, the entirety of Wikipedia (and the example in this article of 451 GB). Surely those are at least an order of magnitude larger than Tolkien's prose and might still benefit from a RAG.

menaerus2mo ago

The success of the model responding to you with a correct information is a function of giving it proper context too.

That hasn't changed nor I think it will, even with the models having very large context windows (eg Gemini has 2M). It is observed that having a large context alone is not enough and that it is better to give the model sufficiently enough and quality information rather than filling it with virtually everything. Latter is also impossible and does not scale well with long and complicated tasks where reaching the context limit is inevitable. In that case you need to have the RAG which will be smart enough to extract the sufficient information from previous answers/context, and make it part of the new context, which in turn will make it possible for the model to keep its performance at satisfactory level.

alansaber2mo ago

RAG is nowhere near obselete. Model performance on enormous sequences degrades hugely as they are not well represented in training and non quadratic attention approximations are not amazing

Nihilartikel2mo ago

I'm not super deep on LLM development, but with ram being a material bottleneck and from what I've read about DeepSeek's results with offloading factual knowledge with 'engrams' I think that the near future will start moving towards the dense core of LLMs focusing much more on a distillation of universal reasoning and logic while factual knowledge is pushed out into slower storage. IIRC Nvidia's Nemotron Cascade is taking MoE even further in that direction too.

I don't need a coding model to be able to give me an analysis of the declaration of independence in urdu from 'memory' and the price in ram for being able to do that, impressive as it is, is an inefficiency.

axus2mo ago

Were he still corporeal, L. Ron would be all over this AI stuff.

Nihilartikel2mo ago

Very relatedly, I've just started reading the 'Culture' series of sci-fi space operas by Iain M Banks, and the notion of ubiquitous sentient, super-intelligent spacecraft and appliances hits differently than it would have before being faced with the reality of their existence in everyday life.

1 more reply

dgb232mo ago

Also the thing with context is that you want to keep it focused on the task at hand.

For example there's evidence that typical use of AGENTS.md actually doesn't improve outcomes but just slows the LLMs down and confuses them.

In my personal testing and exploration I found that small (local) LLMs perform drastically better, both in accuracy and speed, with heavily pruned and focused context.

Just because you can fill in more context, doesn't mean that you should.

The worry I have is that common usage will lead to LLMs being trained and fined tuned in order to accommodate ways of using them that doesn't make a lot of sense (stuffing context, wasting tokens etc.), just because that's how most people use them.

ravikirany222mo ago

This matches what we've been seeing empirically. The issue isn't just quantity of context — it's staleness. AGENTS.md and CLAUDE.md that reference renamed functions, deleted interfaces, or outdated patterns actively mislead the model with confident but wrong information.We've been auditing TypeScript repos and finding 10-84% of symbol references in AI config files are stale. A model reading a CLAUDE.md that says "use UserService.createUser()" when that function was renamed three weeks ago isn't just getting irrelevant context — it's getting a confident lie.The quality problem is probably as significant as the quantity problem, maybe more so.

dgb232mo ago

Interesting. It seems to me that the right approach is to have a structured way to navigate a codebase and useful, validated docs (with examples that need to pass tests) rather than ad-hoc markdown prompts laying around and are always read. We already have solutions for this like doc comments/strings, meta data etc. The codebase itself needs to be well-maintained.

btown2mo ago

I do think that what we think of as RAG will change!

When any given document can fit into context, and when we can generate highly mission-specific summarization and retrieval engines (for which large amounts of production data can be held in context as they are being implemented)... is the way we index and retrieve still going to be based on naive chunking, and off-the-shelf embedding models?

For instance, a system that reads every article and continuously updates a list of potential keywords with each document and the code assumptions that led to those documents being generated, then re-runs and tags each article with those keywords and weights, and does the same to explode a query into relevant keywords with weights... this is still RAG, but arguably a version where dimensionality is closer tied to your data.

(Such a system, for instance, might directly intuit the difference in vector space between "pet-friendly" and "pets considered," or between legal procedures that are treated differently in different jurisdictions. Naive RAG can throw dimensions at this, and your large-context post-processing may just be able to read all the candidates for relevance... but is this optimal?)

I'm very curious whether benchmarks have been done on this kind of approach.

whakim2mo ago

For technical domains, stuffing the context full of related-and-irrelevant or possibly-conflicting information will lead to poor results. The examples of long-context retrieval like finding a fact in a book really aren't representative of the types of context you'd be working with in a RAG scenario. In a lot of cases the problem is information organization, not retrieval, e.g. "What is the most authoritative type of source for this information?" or "How do these 100 documents about X relate to each other?"

joefourier2mo ago

Some previous techniques for RAG, like directly using a user message’s embedding to do a vector search and stuffing the results in the prompt, are probably obsolete. Newer models work much better if you use tool calls and let them write their own search queries (on an internal database, and perhaps with multiple rounds), and some people consider that “agentic AI” as opposed to RAG. It’s still augmenting generation with retrieved information, just in a more sophisticated way.

esafak2mo ago

How can it be obsolete? Maybe if you only have toy data you picked to write your blog post. Companies have gigabytes, petabytes of data to draw from.

magospietato2mo ago

It's not that the context window is adequate, but rather an agentic LLM can search the source of truth using appropriate tools (SQL, term search, etc.)

RAG made sense when the semantic search was based on human input and happening as a workflow step before populating context. Now it happens inside the agentic loop and the LLM already implicitly has the semantics of the user input.

jgalt2122mo ago

> some have been saying that RAGs are obsolete

I suspect the people saying that have not been transparent with their incentives.

gopalv2mo ago

> Surely those are at least an order of magnitude larger than Tolkien's prose and might still benefit from a RAG.

At some point, this is a distributed system of agents.

Once you go from 1 to 3 agents (1 router and two memory agents), it slowly ends up becoming a performance and cost decision rather than a recall problem.

_the_inflator2mo ago

I have two surprises for you:

1. Don't believe the pundits of RAG. They never implemented one.

I did many times, and boy, are they hard and have so many options that decide between utterly crappy results or fantastic scores on the accuracy scale with a perfect 100% scoring on facts.

In short: RAG is how you fill the context window. But then what?

2. How does a superlarge context window solve your problem? Context windows ain't the problem, accurate matching requirements is. What do your inquiry expect to solve? Greatest context window ever, but what then? No prompt engineering is coming to save you if you don't know what you want.

RAG is in very simple terms simply a search engine. Context window was never the problem. Never. Filling the context window, finding the relevant information is one problem, but also only part of the solution.

What if your inquiry needs a combination of multiple sources to make sense? There is no 1:1 matching of information, never.

"How many cars from 1980 to 1985 and 1990 to 1997 had between 100 and 180PS without Diesel in the color blue that were approved for USA and Germany from Mercedes but only the E unit?"

Have fun, this is a simple request.

joefourier2mo ago

> What if your inquiry needs a combination of multiple sources to make sense? There is no 1:1 matching of information, never.

I don't see the problem if you give the LLM the ability to generate multiple search queries at once. Even simple vector search can give you multiple results at once.

> "How many cars from 1980 to 1985 and 1990 to 1997 had between 100 and 180PS without Diesel in the color blue that were approved for USA and Germany from Mercedes but only the E unit?"

I'm a human and I have a hard time parsing that query. Are you asking only for Mercedes E-Class? The number of cars, as in how many were sold?

mickeyp2mo ago

It doesn't help that academia loooves ColBERT and will happily tell you how amazing -- and, look, for how tiny the models are, 20M params and super fast on a CPU, it is -- they are at seemingly everything if only you...

- Chunk properly;

- Elide "obviously useless files" that give mixed signals;

- Re-rank and rechunk the whole files for top scoring matches;

- Throw in a little BM25 but with better stemming;

- Carry around a list of preferred files and ideally also terms to help re-rank;

And so on. Works great when you're an academic benchmaxing your toy Master's project. Try building a scalable vector search that runs on any codebase without knowing anything at all about it and get a decent signal out of it.

Ha.

mentos2mo ago

I assume it’s not possible to get the same results by fine tuning a model with the documents instead?

notglossy2mo ago

You will still get hallucinations. With RAG you use the vectors to aid in finding things that are relevant, and then you typically also have the raw text data stored as well. This allows you to theoretically have LLM outputs grounded in the truth of the documents. Depending on implementation, you can also make the LLM cite the sources (filename, chunk, etc).

thisischayan1mo ago

The approach that has worked for us in production is correction during generation, not after.

The model verifies its output against the rules in the prompt as it generates and corrects itself within the same API call — no retries, no external validator. If there are still failures the model cannot fix at runtime, those are explicitly flagged instead of silently producing wrong output.

This does not mean hallucinations are completely solved. It turns them into a measurable engineering problem. You know your error rate, you know which outputs failed, and you can drive that rate down over time with better rules. The system can also self-learn and self-improve over time to deliver better accuracy.

tren_hard2mo ago

I’m still learning this advantages and differences between them, would there be benefits to SFT and RAG? Or does RAG make SFT redundant?

1 more reply

pussyjuice2mo ago

> The example I recently read was that the contexts are large enough for the entire "The Lord of the Rings" books.

Not really, though. Not in practice at least, e.g. code writing.

Paste a 200 line React component into your favorite LLM, ask it to fix/add/change something and it will do it perfectly.

Paste a 2000 line one though, and it starts omitting, starts making mistakes, assumptions, re-writing what it already has, and so-on.

So what's going on? It's supposed to be able to hold 1000s of lines in context, but in practice it's only like 200.

What happens is the accuracy and agency drops significantly as you need to pan larger and larger context windows.

And it's not that it's most accurate when the window is smallest either - but there is a sweet spot.

Outside that sweet spot, you will get "unacceptable responses" - slop you can't use.

That's what happens when you paste the 2000 line React component for example. You get a response you can't quite use. Yet the 200 line one is typically perfect.

What would make the 2000 line one usually perfect every time?

We need a way to increase that "accurate window size" lets call it "working memory", so that we can generate more code, more writing, more pixels at acceptable levels of quality. You'd also have enough language space for agents to operate and collaborate sans the amnesia they have today.

RAG is basically the interim workaround for all this. Because you can put everything in a vector DB and search/find what you need in the context when you need it.

So, RAG is a great solution for today's problems: Say you have a bunch of Python code files written in a certain style and the main use case of your LLM is writing Python code in specified ways, with this setup you can probably deliver "better Python code" than your competitor because of RAG - because you have this deterministic supplement to your LLMs outputs to basically do research and augment the output in predetermined ways every time it responds to a prompt.

But eventually, if I don't have to upload "The Lord of the Rings" documents, and vector search to find different areas in order to generate responses, if I can just paste the entire txt into the input, it can generate the answer considering "all of it" not just that little area, it would presumably be a better quality response.

charcircuit2mo ago

It's nonsense as all frontier models are integrated with retrieval engines hooked up to various search engines / their own.

z02d2mo ago

Maybe a bit off-topic: For my PhD, I wanted to leverage LLMs and AI to speed up the literature review process*. Due to time constraints, this never really lifted off for me. At the time I checked (about 6 months ago), several tools were already available (NotebookLM, Anara, Connected Papers, ZotAI, Litmaps, Consensus, Research Rabbit) supporting Literature Review. They have all pros and cons (and different scopes), but my biggest requirement would be to do this on my Zotero bibliographic collection (available offline as PDF/ePub).

ZotAI can use LMStudio (for embeddings and LLM models), but at that time, ZotAI was super slow and buggy.

Instead of going through the valley of sorrows (as threatofrain shared in the blog post - thanks for that), is there a more or less out-of-the-box solution (paid or free) for the demand (RAG for local literature review support)?

*If I am honest, it was rather a procrastination exercise, but this is for sure relatable for readers of HN :-D

bee_rider2mo ago

I tried to do RAG on my laptop just by setting it all up myself, but the actual LLM gave poor results (I have a small thin-and-light fwiw, so I could only run weak models). The vector search itself, actually, ended up being a little more useful.

teleforce2mo ago

Recently there's HN discussions on the topic of local AI/LLM being utilized by researchers from IEEE Spectrum magazine, probably worth a look up [1], [2].

[1] Local AI is driving the biggest change in laptops in decades (260 comments):

https://news.ycombinator.com/item?id=46360856

[2] Your Laptop Isn’t Ready for LLMs. That’s About to ChangeLocal AI is driving the biggest change in laptops in decades:

https://spectrum.ieee.org/ai-models-locally

sthimons2mo ago

Oh! Same! I made an R / Shiny powered RAG/ Researching app that hooks into OpenAlex (for papers) and allows you to generate NotebookLM like outputs. Just got slides with from-paper images to be injected in, super fun. Takes an OpenRouter or local LLMs (if that's your thing). Network graphs too! https://github.com/seanthimons/serapeum/

oceansweep2mo ago

If you don’t mind a little instability while I work out the bugs, might be interested in my project: https://github.com/rmusser01/tldw_server ; it’s not quite fully ready yet but the backend api is functional and has a full RAG system with a customizable and tweakable local-first ETL so you can use it without relying on any third party services.

lukewarm7072mo ago

onyx is good for this, it is standard doc ingestion -> chunk -> embedding -> index -> query -> rerank -> answer.

there are a few other local apps with simple knowledge base type things you can use with pdfs. cherry studio is nice, no reranking though.

maxperience2mo ago

If you want to build a prod ready RAH architecture with decent benchmark scores I can recommend this blog post based on our experiences what techniques actually work: https://bytevagabond.com/post/how-to-build-enterprise-ai-rag...

nithril2mo ago

thank you for sharing your experience

shepherdjerred2mo ago

Is there a 'sqlite equivalent' for RAG? e.g. something I could give Claude w/o a backend and say use command X to add a document, command Y to search, all in a flat file?

tk901mo ago

closest thing I've seen is https://github.com/tobi/qmd, though it requires installing 3 local embedding models (300MB, 640MB, 1.1GB).

mettamage2mo ago

51 visitors in real-time.

I love those site features!

In a submission of a few days ago there was something similar.

I love it when a website gives a hint to the old web :)

1 more reply

abd78942mo ago

What ended up being the main bottleneck in your pipeline—embedding throughput, cost, or something else? Did you explore parallelizing vectorization (e.g., multiple workers) or did that not help much in practice?

whakim2mo ago

I'd argue the author missed a trick here by using a fancy embedding model without any re-ranking. One of the benefits of a re-ranker (or even a series of re-rankers!) is that you can embed your documents using a really small and cheap model (this also often means smaller embeddings).

trgn2mo ago

Odd to me that Elasticsearch isn't finding a second breath in these new ecosystems. It basically is that now, a RAG engine with model integration.

sailfast2mo ago

It’s definitely a use case for this and would’ve saved a lot of pain IMO but also seems like it would have added confusing technology to what was a VERY Python-heavy stack that would’ve benefitted from other elements.

Hardest part is always figuring out your company’s knowledge management has been dogsh!t for years so now you need to either throw most of it away or stick to the authoritative stuff somehow.

Elastic plus an agent with MCP may have worked as a prototype very quickly here, but hosting costs for 500GB worth of indexes sounds too expensive for this person’s use case if $185 is a lot.

trgn2mo ago

ah got it! thanks for the color

mickeyp2mo ago

The old joke Zawinski made about picking regex "and now you have two problems" applies here.

If you pick Elasticsearch, useful as it is, you now have more than two problems. You have Elastic the company; Elasticsearch the tool; and also the clay-footed colossus, Java, to contend with.

mrits2mo ago

The people that survived it aren't willing to give it anymore of their breathing left

trgn2mo ago

haha! it's been ok for me, but a lot of song and dance is required. the saas-version is a black box (in a bad way).

dprkh2mo ago

Why did you opt for semantic search, and not plain old full text search? I built an "AI Agent for a Commerce Website" as a take-home exercise yesterday, and I chose to simply give the model a tool that does a full text search over products, powered by MiniSearch, and I think it works reasonably well. I believe this is also what Claude Code does.

https://github.com/dprkh/fufus/

pussyjuice2mo ago

After a couple years of multi-modal LLM proving out product, I now consider RAG to be essentially "AI Lite", or just AI-inspired vector search.

It isn't really "AI" in the way ongoing LLM conversations are. The context is effectively controlled by deterministic information, and as LLMs continue improve through various context-related techniques like re-prompting, running multiple models, etc. that deterministic "re-basing" of context will stifle the output.

So I say over time it will be treated as less and less "AI" and more "AI adjacent".

The significance is that right now RAG is largely considered to be an "AI pipeline strategy" in its own right compared others that involve pure context engineering.

But when the context size of LLMs grows much larger (with integrity), when it can, say, accurately hold thousands and thousands of lines of code in context with accuracy, without having to use RAG to search and find, it will be doing a lot more for us. We will get the agentic automation they are promising and not delivering (due to this current limitation).

KPGv22mo ago

This article came just in the nick of time. I'm in fandoms that lean heavily into fanfiction, and there's a LOT out there on Ao3. Ao3 has the worst search (and yo can't even search your account's history!), so I've been wanting to create something like this as a tool for the fandom, where we can query "what was the fic about XYZ where ABC happened?" and get hopefully helpful responses. I'm very tired of not being able to do this, and it would be a fun learning experience.

I've already got the data mostly structured because I did some research on the fandom last year, charting trends and such, so I don't even need to massage the data. I've got authors, dates, chapters, reader comments, and full text already in a local SQLite db.

mileycyrusXOXO1mo ago

I did something similar to this for all the Cosmere stuff. I wanted to be able to find answers but only with the information I had read been exposed to so far. I didn't want to risk going to the wiki and getting spoilers for things I haven't read yet. It wasn't anything fancy, it was just giving the agent access to all the text I had read up to my current chapter. Probably too much context for it to handle efficiently - would be awesome to take it one step further and do it proper

nossid2mo ago

If you didn't already see https://news.ycombinator.com/item?id=44878151 (Building a web search engine from scratch in two months with 3 billion neural embeddings), then you might enjoy it, even if it's way overkill for your use case.

overtaxed2mo ago

Reading this blog post scared me a bit. The use case I proposed was building a "simple" RAG chatbot for some (~50 confluence docs and somewhat growing) on elasticsearch and another process that my team handles. I was just planning on using a stack like streamlit, text-embedding-3-small,FAISS for the vector store and it to be driven by a python script.

Didn't seem too expensive or too hard based on the handful of queries my team would be using it for, and it was a "low hanging fruit" pain point for my team that I thought could be improved by a RAG chatbot. That on top of the fact that Atlassian Rovo did not do a good job of not going to external sources when we had the answer in our existing internal docs.

Am I still on the right path?

netghost2mo ago

I think you're operating in a scale that is small enough that there's little risk.

You'll be able to iterate if you run into anything that doesn't work. You should however be clear on what problem you and your team are solving, and not just "get some rag".

overtaxed1mo ago

Sure - I neglected to include the pain point itself. Right now we spend a large amount of time during troubleshooting of a problem (incident) or when working features related to these two systems, and heavily rely on our existing internal documentation. Rather than combing through tons of those docs, a RAG chatbot made sense to me and the team seems to agree. Will move forward- thanks for the input.

civeng2mo ago

Great write-up. Thank you! I’m contemplating a similar RAG architecture for my engineering firm, but we’re dealing with roughly 20x the data volume (estimating around 9TB of project files, specs, and PDFs). I've been reading about Google's new STATIC framework (sparse matrix constrained decoding) and am really curious about the shift toward generative retrieval for massive speedups well beyond this approach. For those who have scaled RAG into the multi-terabyte range: is it actually worth exploring generative retrieval approaches like STATIC to bypass standard dense vector search, or is a traditional sharded vector DB (Milvus, Pinecone, etc.) still the most practical path at this scale?

I would guess the ingestion pain is still the same.

This new world is astounding.

lukewarm7072mo ago

9tb should be fine for vectordb, for sure. google search is many petabytes of index with vector+semantic search, that is using ScaNN.

you could probably use the hybrid search in llamaindex; or elasticsearch. there is an off the shelf discovery engine api on gcp. vertex rag engine is end to end for building your own. gcp is too expensive though. alibaba cloud have a similar solution.

physicsguy2mo ago

We did it in an engineering setting and had very mixed results. Big 800 page machine manuals are hard to contextualise.

te_chris2mo ago

There’s turbopuffer

Horatius772mo ago

Great writeup but ... pretty sure ChromaDB is open source and not "Google's database"?

nalinidash2mo ago

ChromaDB is open source with Apache-2.0 license.

https://github.com/chroma-core/chroma

threatofrain2mo ago

I'm afraid this hits the credibility of the article for me, that's a pretty weird mistake to make. It's like paying for a Model 3 while thinking it comes from Ford.

androsOP2mo ago

Thank you for your feedback!

sota_pop2mo ago

Nice writeup. I’m curious why you went with chromadb and not pgvector. I haven’t built a rag system myself, but I’ve always understood the initial doc parsing to be a major challenge alone, so kudos there!

Additionally, I also thought it was customary to store a pointer to the source in the same row as the vector (i.e. vector+ doc path + page#/paragraph/etc.) OR just store the original text chunk (though based on your disk reqs doesn’t sound like it would have been feasible).

Glad you’re having good results! Maybe you’ve inspired me to finally try out a similar setup myself!

smrtinsert2mo ago

What would it look like to regularly react to source data changes? Seems like a big missing piece. Event based? regular cadence? Curious what people choose. Great post though.

ravikirany222mo ago

For code specifically this is the hardest part — the "source data" (the codebase) changes constantly with every commit, but the AI config files that describe it don't update automatically.The approach that works best is AST-diffing rather than hash-based reindexing — you can detect semantic changes (function renamed, interface deleted) rather than just textual changes, which gives you much more precise invalidation signals.

stingraycharles2mo ago

Depends on the use case, ie frequency and impact of changes.

Typically you would have a reindex process, and you keep track of hashes of chunks to check if you’ve already calculated this exact block before to avoid extra costs. And then run such a reindex process pretty frequently as it’s cheap / costs nothing when there are no changes.

smrtinsert2mo ago

makes great sense, thanks!

lucfranken2mo ago

Cool work! Would be so interested in what would happen if you would put the data and you plan / features you wanted in a Claude Code instance and let it go. You did carefully thinking, but those models now also go really far and deep. Would be really interested in seeing what it comes up with. For that kind of data getting something like a Mac mini or whatever (no not with OpenClaw) would be damn interesting to see how fast and far you can go.

tom13372mo ago

But where is the fun with that?

lucfranken2mo ago

Being curious is always fun right.

ozim1mo ago

So 95% of the post is „regular software engineering” like, yes you cannot just process 1TB of data, you need to split it up then even if you split it up you might have limited budget for processing so think how you fit in that, make checkpoints and make sure you have logs.

Not dismissing the value of the blog post. Just underlining for „non engineers”.

alansaber2mo ago

Think that's the first time i've seen someone write about checkpointing, definitely worth doing for similar projects.

aledevv2mo ago

I made something similar in my project. My more difficult task has been choice the right approach to chunking long documents. I used both structural and semantic chunking approach. The semantic one helped to better store vectors in vectorial DB. I used QDrant and openAi embedding model.

supermooka2mo ago

Thanks for an interesting read! Are you monitoring usage, and what kind of user feedback have you received? Always curious if these projects end up used because, even with the perfect tech, if the data is low quality, nobody is going to bother

throw8312mo ago

Can anyone suggest a RAG pipeline that is production ready?

Also I wonder if it's now better to use Claude Agent SDK instead of RAG. If anyone has tried this, I would be interested in hearing more.

fb032mo ago

Quick Q: OP told he used Llama 3.2:3b which is a pretty old model. What would be a good modern model to substitute it? Qwen3.5:4b or something?

redwood2mo ago

Cool to see Nomic embeddings mentioned. Though surpriser you didn't land on Voyage.

Did you look at Turbopuffer btw?

austinthetaco2mo ago

i assume based on their concerns of the hetzner pricing that they didnt want to pay for voyage/turbopuffer. unless there are free versions of those products that I'm unaware of, but I'm only seeing paid.

brcmthrowaway2mo ago

What was the system prompt?

j / k navigate · click thread line to collapse

103 comments

diarmuidc2mo ago