
jonathan-adly 1y ago

So, there is a whole world of vision-based RAG/search.

We have a good open-source repo here with a ColPali implementation: https://github.com/tjmlabs/ColiVara

ResearchAtPlay 1y ago

Thanks for the link to the ColPali implementation - interesting! I am specifically interested in evaluation benchmarks for different image embedding models.

I see the ColiVara-Eval repo in your link. If I understand correctly, ColQwen2 is the current leader followed closely by ColPali when applying those models for RAG with documents.

But how do those models compare to each other and to the llama3.2-vision embeddings when applied to, for example, sentiment analysis for photos? Do benchmarks like that exist?

jonathan-adly (OP) 1y ago

The “equivalent” here would be Jina-Clip, architecture-wise; not necessarily in performance.

The ColPali paper (1) does a good job explaining why you don’t really want to use vision embeddings directly, and why you are much better off optimizing for RAG with a ColPali-like setup. Basically, a plain vision embedding is not optimized for textual understanding. It works if you are searching for the word “bird” and want images of birds, but it doesn’t work well for pulling up a document that is a paper about birds.

1. https://arxiv.org/abs/2407.01449
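
To make the “ColPali-like setup” a bit more concrete: ColPali keeps many embeddings per page (one per image patch) and scores a query with ColBERT-style late interaction, rather than comparing one query vector against one image vector. Below is a rough sketch of that scoring idea only. The random vectors stand in for real model outputs, and every name in it (`maxsim`, `page_about_birds`, the toy dimension) is made up for illustration, not taken from ColPali or ColiVara.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    """Single-vector comparison, the way a CLIP-style embedding is typically used."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def maxsim(query_tokens, page_patches):
    """Late interaction: each query token keeps its best-matching patch, then sum."""
    return sum(
        max(cosine(token, patch) for patch in page_patches)
        for token in query_tokens
    )


rng = random.Random(0)
dim = 16  # toy size, assumed for the example; real models use larger vectors


def random_vectors(n):
    """Stand-in for model output: n random embeddings of size dim."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]


query_tokens = random_vectors(3)

# Hypothetical pages: one contains patches that match the query tokens exactly,
# the other is unrelated noise.
page_about_birds = random_vectors(20) + query_tokens
page_unrelated = random_vectors(23)

scores = {
    "page_about_birds": maxsim(query_tokens, page_about_birds),
    "page_unrelated": maxsim(query_tokens, page_unrelated),
}
best = max(scores, key=scores.get)
print(best, scores)
```

The point of the sketch is that every query token gets to find its own best patch on the page, so a page that mentions the query terms somewhere in its text can score well even if the page as a whole does not “look like” the query.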

ResearchAtPlay 1y ago

Makes sense. My main takeaway from the ColPali paper (and your comments) is that ColPali works best for document RAG, whereas vision model embeddings are best used for image similarity search or sentiment analysis. So to answer my own question: The best model to use depends on the application.
