Hey, pretty nice work!
Are you using any CLIP-like model for image retrieval? If so, would you try FG-CLIP 2 (https://360cvgroup.github.io/FG-CLIP) and see whether it improves the search results?
We just open-sourced this model, which excels at fine-grained image-text understanding, and I came across your post right before posting our work on HN.