There's a similar thing happening with RAG, where people think building the chat interaction is the hard part. The hard part is extracting and searching to get relevant context. A lot of founders I talk to suddenly realize this at the last minute, right before shipping, much like search back in the day. It's harder than just throwing chunks into a vector DB: it potentially involves many different backend data sources, and is in many ways harder than a standard search relevance problem (which is hard enough on its own).
It's just going to evolve into recreating the search and ranking pipelines of old, on top of a bit more semantic understanding with some smarter NLG layered in :). It won't be just LLMs; we'll have intent classification, named entity recognition, a personalization layer, reranking, all that fun stuff again.
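To make that pipeline concrete, here is a minimal sketch of the "search stack of old" layered on top of retrieval: intent classification, entity recognition, then reranking. Every component here is a toy stand-in (keyword heuristics instead of trained models), and all function and document names are hypothetical.

```python
import re

def classify_intent(query):
    # Toy intent classifier; a real system would use a trained model
    return "question" if query.endswith("?") else "lookup"

def extract_entities(query):
    # Toy NER: capitalized tokens stand in for named entities
    return re.findall(r"\b[A-Z][a-z]+\b", query)

def rerank(results, entities):
    # Boost retrieved documents that mention recognized entities
    def score(doc):
        bonus = sum(1 for e in entities if e.lower() in doc["text"].lower())
        return doc["score"] + 0.5 * bonus
    return sorted(results, key=score, reverse=True)

# Pretend these came back from a vector index with similarity scores
docs = [
    {"text": "Acme quarterly revenue report", "score": 0.6},
    {"text": "general revenue accounting guide", "score": 0.7},
]

query = "What was Acme revenue?"
entities = extract_entities(query)
top = rerank(docs, entities)[0]
print(classify_intent(query), "->", top["text"])
```

The point is that the raw similarity scores alone would have ranked the generic guide first; the entity-aware rerank step flips the order, which is exactly the kind of classic relevance machinery that ends up getting rebuilt.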
It becomes a very frustrating experience trying to match the inherent chaos of a conversation.
In many ways it makes the chat more Siri-like than ChatGPT-like, which may not be what users actually expect.
They spent so much time on the UI and basically left the actual search to the last minute, and it was a hilarious failure on launch.
Last week I finished building my third RAG stack, for legal document retrieval. Almost-vanilla RAG got me 90-95% of the way there. The only drawback is cost, still 10x-100x above the ideal price point, but that will only improve in the future.
The standard RAG-U uses vector embeddings of chunks, which are fetched from a vector index. An envisioned role of knowledge graphs is to improve standard RAG-U by explicitly linking the chunks through the entities they mention. This is a promising idea, but one that needs to be subjected to rigorous evaluation, as done in prominent IR publications, e.g., SIGIR.
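The entity-linking idea above can be sketched in a few lines: invert a chunk-to-entity map into an entity index, then treat chunks sharing an entity as linked, so a vector hit can be expanded with connected context. The chunk IDs and entities here are made up, and the entity extraction itself (NER or an existing KG) is assumed to have already happened.

```python
from collections import defaultdict

# Hypothetical chunks and the entities each one mentions
chunks = {
    "c1": {"Acme Corp", "Jane Doe"},
    "c2": {"Acme Corp"},
    "c3": {"Widget Ltd"},
}

# Invert to an entity -> chunks index
entity_index = defaultdict(set)
for chunk, entities in chunks.items():
    for entity in entities:
        entity_index[entity].add(chunk)

def linked_chunks(chunk):
    # Chunks reachable from this one through any shared entity
    linked = set()
    for entity in chunks[chunk]:
        linked |= entity_index[entity]
    linked.discard(chunk)
    return linked

# A vector hit on c1 can now pull in entity-linked context
print(linked_chunks("c1"))
```

Here a retrieval hit on `c1` would also surface `c2` (both mention "Acme Corp"), while the unrelated `c3` stays out; the rigorous-evaluation question is whether such expansion actually improves relevance.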
The post then discusses the scenario where an enterprise does not have a knowledge graph, and discusses the idea of automatically extracting knowledge graphs from unstructured PDFs and text documents. It covers the recent work that uses LLMs for this task (they're not yet competitive with specialized models) and highlights many interesting open questions.
Hope this is interesting to people who are interested in the area but intimidated by the flood of activity (don't be; I think the area is easier to digest than it may look).
https://neuml.hashnode.dev/introducing-the-semantic-graph
https://github.com/neuml/txtai
Disclaimer: I'm the primary author of txtai
What's awesome about them is that, in my mind, they essentially form the "extractive" analogue to LLMs' "generative" nature.
Semantic Graphs give every single graph theory algorithm a unique epistemological twist given any particular dataset. In my case, I've built and released pre-trained semantic graphs for my debate evidence. I observe that path traversals form "debate cases", and that graph centrality in this case finds the most "generic/universally applicable" evidence. Given a different dataset, the same algorithms will have different interpretations.
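The two interpretations mentioned above (path traversals as "debate cases", centrality as "generic" evidence) can be illustrated on a tiny hand-built graph. In txtai the edges would come from embedding similarity; here they are hard-coded stand-ins, and the node names are hypothetical.

```python
from collections import deque

# Toy semantic graph: nodes are evidence cards, edges connect
# semantically similar ones (hand-written here for illustration)
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}

def shortest_path(start, goal):
    # BFS traversal: the hops between two cards read like a line
    # of argument connecting them, i.e., a "debate case"
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]] - seen:
            seen.add(nxt)
            queue.append(path + [nxt])
    return None

def most_central(g):
    # Highest degree ~ most "generic/universally applicable" evidence
    return max(g, key=lambda n: len(g[n]))

print(shortest_path("D", "C"))  # ['D', 'B', 'C']
print(most_central(graph))      # 'B'
```

Swap in a different dataset and the same two algorithms yield different readings, which is the "epistemological twist" being described.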
What makes txtai so awesome is that it creates a synchronized interface between an underlying vector DB, SQL DB, and a semantic knowledge graph. The flexibility and power this offers compared to other vector DB solutions is simply unparalleled. I have seen zero meaningful competition from a vector DB industry that is flooded with money despite little product differentiation.
Disclaimer: I wrote an NLP paper with dmezzetti as my co-author about semantic graphs: https://aclanthology.org/2023.newsum-1.10.pdf
Most RAG tools seem to start with the LLM and add Vector building and retrieval around it, while this tool seems like it started with Vector / Graph building and retrieval, then added LLM support later.
According to the article, it is either costly (when using OpenAI) or slow (when using open source models). In both cases, predicting the quality of the KG generated by LLMs is hard.
As some others here have pointed out, information extraction and searching with relevant context are the hardest parts of any search system, and it's clear that simply chunking documents and throwing the vectors into a vector DB has limitations, no matter what the vector DB vendors tell you. Just like this article says, I hope that 2024 is the year where we actually get some papers that perform more rigorous evaluations of systems that use vector DBs, graph DBs, or a combination of them for building RAG.