Earlier this year, we started developing a RAG-powered app that enables companies to safely query their free-text data.
During our experimentation, however, we realized that because the approach was so new, there were no industry standards for evaluating the accuracy of a RAG system. We built Tonic Validate Metrics (tvalmetrics, for short) to easily calculate the benchmarks we needed to meet while building our own RAG system.
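To give a feel for what a RAG evaluation metric measures, here is a deliberately simple toy example: a token-overlap score between a generated answer and the retrieved context. This is just an illustrative sketch of the general idea, not the tvalmetrics API; the function name and scoring logic below are our own invention for this post.

```python
import string


def retrieval_overlap_score(answer: str, retrieved_context: str) -> float:
    """Toy RAG metric: fraction of answer tokens found in the retrieved context.

    A higher score loosely suggests the answer is grounded in the context.
    (Illustrative only -- not how tvalmetrics scores responses.)
    """
    def tokens(text: str) -> set[str]:
        # Lowercase and strip surrounding punctuation for a rough comparison.
        return {w.strip(string.punctuation) for w in text.lower().split()}

    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(retrieved_context)) / len(answer_tokens)


score = retrieval_overlap_score(
    "Paris is the capital of France",
    "France's capital city is Paris, located on the Seine",
)
```

Real-world metrics go well beyond token overlap (for example, using an LLM to judge answer similarity or context relevance), but the shape is the same: a numeric score you can track while iterating on your RAG pipeline.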
We’re sharing this Python package in the hope that it will be as useful for you as it has been for us and will become a key part of the toolset you use to build LLM-powered applications. We also made Tonic Validate Metrics open source so that it can thrive and evolve with your contributions!
Please take it for a spin and let us know what you think in the comments.
Docs: https://docs.tonic.ai/validate
Repo: https://github.com/TonicAI/tvalmetrics
Tonic Validate: https://validate.tonic.ai