I recently built a similarity search application that recommends channels for new Pinterest users to follow, based on their liked images, using Milvus (https://github.com/milvus-io/milvus) as a backend (disclaimer: I've made some contributions to it). Similarity learning is a huge part of it, and I'm glad more and more tools like Quaterion are being released to help make this kind of tech ubiquitous.
What is a realistic minimum viable dataset for an approach like this? When is it not advisable? How does it compare to other more basic approaches?
I could see it making sense for complex unstructured data — Qdrant seems to point in that direction.
More specifically, I'm interested in deriving distances between authors' writing styles, arguing styles, etc.
Basically, you can collect text from different authors and then use the authors' names as labels to train a similarity learning model. My suggestion would be to fine-tune a Transformer model with a specific head and an ArcFace loss.
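To make the ArcFace part concrete, here's a minimal NumPy sketch of how the loss modifies the classification logits: embeddings and per-author weight vectors are L2-normalized, and an angular margin is added to the ground-truth class before scaling. The function name and default hyperparameters (`s=30.0`, `m=0.5`, common values from the original paper) are illustrative, not from any specific library:

```python
import numpy as np

def arcface_logits(embeddings, weights, labels, s=30.0, m=0.5):
    """Compute ArcFace logits: s * cos(theta + m) for the target class,
    s * cos(theta) for every other class.

    embeddings: (batch, dim) text embeddings from the Transformer
    weights:    (classes, dim) one learned vector per author
    labels:     (batch,) integer author indices
    """
    # L2-normalize so the dot product is a cosine similarity
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)   # (batch, classes)
    theta = np.arccos(cos)
    # add the angular margin m only at each sample's true author
    margin = np.zeros_like(cos)
    margin[np.arange(len(labels)), labels] = m
    return s * np.cos(theta + margin)
```

The returned logits would then go into a standard cross-entropy loss; the margin forces embeddings of the same author to cluster more tightly on the hypersphere, which is what makes the resulting distances useful at inference time.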
> From 0.5.0, Finetuner computing is hosted on Jina Cloud. The last local version is 0.4.1, one can install it via pip or check out git tags/releases here.
But there are some cool ideas implemented there as well, I encourage you to try both!