I recently built a similarity search application that recommends channels for new Pinterest users to follow, based on their liked images, using Milvus (https://github.com/milvus-io/milvus) as a backend (disclaimer: I've made some contributions to it). Similarity learning is a huge part of it, and I'm glad more and more tools like Quaterion are being released to help make this kind of tech ubiquitous.
What is a realistic minimum viable dataset for an approach like this? When is it not advisable? How does it compare to other more basic approaches?
I could see it making sense for complex unstructured data — Qdrant seems to point in that direction.
More specifically, I'm interested in deriving distances between authors' writing styles, arguing styles, etc.
Basically, you can collect text from different authors and then use the authors' names as labels to train a similarity learning model. My suggestion would be to fine-tune a Transformer model with a specific head and an ArcFace loss.
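To make the ArcFace part concrete, here's a minimal NumPy sketch of how the loss modifies the classification logits: embeddings and per-author weight vectors are L2-normalized, and an angular margin is added to the ground-truth class before scaling. The function name and default hyperparameters (`s=30.0`, `m=0.5`, common values from the original paper) are illustrative, not from any specific library:

```python
import numpy as np

def arcface_logits(embeddings, weights, labels, s=30.0, m=0.5):
    """Compute ArcFace logits: s * cos(theta + m) for the target class,
    s * cos(theta) for every other class.

    embeddings: (batch, dim) text embeddings from the Transformer
    weights:    (classes, dim) one learned vector per author
    labels:     (batch,) integer author indices
    """
    # L2-normalize so the dot product is a cosine similarity
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)   # (batch, classes)
    theta = np.arccos(cos)
    # add the angular margin m only at each sample's true author
    margin = np.zeros_like(cos)
    margin[np.arange(len(labels)), labels] = m
    return s * np.cos(theta + margin)
```

The returned logits would then go into a standard cross-entropy loss; the margin forces embeddings of the same author to cluster more tightly on the hypersphere, which is what makes the resulting distances useful at inference time.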
> From 0.5.0, Finetuner computing is hosted on Jina Cloud. The last local version is 0.4.1, one can install it via pip or check out git tags/releases here.
But there are some cool ideas implemented there as well, I encourage you to try both!