wehadit on Hacker News

Stop Blaming Embeddings, Most RAG Failures Come from Bad Chunking

Everyone keeps arguing about embeddings, vector DBs, and model choice, but in real systems, those aren’t the things breaking retrieval. Chunking drift is. And almost nobody monitors it. A tiny formatting change in a PDF or HTML file silently shifts chunk boundaries. Overlaps become inconsistent. Semantic units get split mid-thought. Headings flatten. Cross-format differences explode. By the time retrieval quality drops, people start tweaking the model… while the actual problem happened upstream. If you diff chunk boundaries across versions or track chunk-size variance, the drift is obvious. But most teams don’t even version their chunking logic, let alone validate segmentation or check adjacency similarity. The industry treats chunking like a trivial preprocessing step. It’s not. In my experience it’s the single biggest source of retrieval collapse, and it’s usually invisible. Before playing with new embeddings, fix your segmentation pipeline. Chunking is repetitive, undifferentiated engineering, but if you don’t stabilize it, the rest of your RAG stack is built on sand.
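To make "diff chunk boundaries across versions" and "track chunk-size variance" concrete, here is a rough Python sketch. It assumes you can get the chunks from two pipeline runs as plain lists of strings; the function names (fingerprint, boundary_drift, size_stats) are made up for illustration and don't come from any particular library.

```python
import hashlib
import statistics


def fingerprint(chunk: str) -> str:
    """Hash a chunk after collapsing whitespace, so pure spacing changes are ignored."""
    normalized = " ".join(chunk.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def boundary_drift(old_chunks, new_chunks) -> float:
    """Fraction of old chunks with no identical counterpart in the new run.

    0.0 means every old chunk still exists unchanged; 1.0 means none do.
    """
    if not old_chunks:
        return 0.0
    new_prints = {fingerprint(c) for c in new_chunks}
    missing = sum(1 for c in old_chunks if fingerprint(c) not in new_prints)
    return missing / len(old_chunks)


def size_stats(chunks):
    """Return (mean, population standard deviation) of chunk lengths in characters."""
    if not chunks:
        return 0.0, 0.0
    sizes = [len(c) for c in chunks]
    return statistics.mean(sizes), statistics.pstdev(sizes)


if __name__ == "__main__":
    old = ["Intro text.", "Section one body.", "Section two body."]
    # Same content, but one boundary moved mid-sentence.
    new = ["Intro text.", "Section one", "body. Section two body."]
    print("drift:", boundary_drift(old, new))
    print("old sizes:", size_stats(old))
    print("new sizes:", size_stats(new))
```

In the example, one shifted boundary makes two of the three old chunks disappear even though the underlying text is identical, which is the kind of silent change that is worth alerting on. What threshold counts as "too much drift" is something you would have to tune for your own corpus.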

Most Agentic AI failures I've debugged turned out to be ingestion drift

Over the last few months, we’ve been working on creating an autonomous Agentic AI, and something unexpected kept showing up. I went in thinking the issues were with embeddings or the retriever, but the root cause was usually ingestion drifting upstream.

Some patterns that kept repeating:
• PDFs extracting differently after a small template or export tool change
• headings collapsing or shifting levels
• hidden characters creeping into tokens
• tables losing their structure
• documents updated without being re-ingested
• different converters producing slightly different text layouts

We only noticed the drift once we started diffing extraction output week-to-week and tracking token count variance. Running two extractors on the same file also revealed inconsistencies that weren’t obvious from looking at the text.
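A rough sketch of that kind of check, using Python's standard difflib, is below. It compares two extraction outputs for the same document (last week versus this week, or extractor A versus extractor B). The function names and both thresholds are hypothetical placeholders, and "tokens" here are just whitespace-separated words rather than real model tokens.

```python
import difflib


def token_count(text: str) -> int:
    """Crude token count: whitespace-separated words (a stand-in for a real tokenizer)."""
    return len(text.split())


def extraction_report(previous: str, current: str,
                      max_token_change: float = 0.05,
                      min_similarity: float = 0.95) -> dict:
    """Compare two extraction outputs of the same source document."""
    prev_tokens = token_count(previous)
    cur_tokens = token_count(current)
    # Relative change in token count between the two outputs.
    token_change = abs(cur_tokens - prev_tokens) / max(prev_tokens, 1)
    # Word-level similarity in [0, 1]; 1.0 means identical word sequences.
    similarity = difflib.SequenceMatcher(
        None, previous.split(), current.split()
    ).ratio()
    # Line-level diff for a human to read when something is flagged.
    diff = list(difflib.unified_diff(
        previous.splitlines(), current.splitlines(),
        fromfile="previous", tofile="current", lineterm="",
    ))
    return {
        "token_change": token_change,
        "similarity": similarity,
        "drifted": token_change > max_token_change or similarity < min_similarity,
        "diff": diff,
    }


if __name__ == "__main__":
    last_week = "Title\nBody text here"
    this_week = "Title Body text here extra words"
    report = extraction_report(last_week, this_week)
    print(report["token_change"], report["similarity"], report["drifted"])
    print("\n".join(report["diff"]))
```

The word-level similarity catches reordering and dropped text that a token count alone would miss, while the line-level diff shows layout changes such as a heading merging into the body.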

Even with pinned extractor versions, mixed-format sources (Google Docs, Word, Confluence exports, scanned PDFs) still drifted subtly over time. The retriever was doing exactly what it was told; the input data just wasn’t consistent anymore.

Curious if others have seen this. How do you keep ingestion stable in production RAG/Agentic AI systems?

Is generation of reliable tailored code helpful?

Hey devs, I am building an agentic AI to be the AI tool for developers. I was wondering whether it would be helpful if the agentic AI generated the first working version of an app from one prompt, with reusable code tailored to project specs from Figma/Motiff, Postman, or requirements docs. Additionally, what if the agentic AI also helped elevate coding skills on every project, did code reviews, created unit test cases, and helped with task management?

Does this type of agentic AI tool help developers?

We need more than AI auto complete to do what matters most

The reality is that since the 90s, we've all gone from spending 70% of our time on things we enjoy to 30%. This shift is due to info overload and a multitude of apps and processes that have made mundane tasks a necessity, taking up 40% of our lives. "9 to 5" is now such a phased-out term that it irks people, because we struggle to make time for the things we enjoy. The same has happened in our personal lives as well. There's definitely a solution in AI.

If corporations can profit from it by automating their most complex processes, then why can't we as individuals take advantage as well? It's time to empower ourselves by making the power of AI accessible to us for work and personal tasks. The power of the desktop is now old.

Imagine having a personal AI that recommends vacation plans by automatically working with your preferences, budget, and prior reservations, so that the 5 hours of travel research time can be spent with friends and family instead.

Or a personal AI that extracts UI elements from Figma, the UI requirements from a doc, and the API from Postman to generate the first working version of your app using your coding standards. The time saved lets you spend time on what you enjoy the most: applying your hard-earned skills to complex, custom, or challenging tasks.

It's not only an AI tool for developers, it's also a tool for a developer who is a traveler. This is the tech world we should have. Let's not settle for AI autocomplete; let's empower ourselves with this kind of personal AI. Do you agree that this gap exists, and that the power of personal AI (not generic AI) will fill it?
