The only complication is that you then need to choose a suitable T-Norm / T-Conorm pair, which express the concept of intersection and union for fuzzy sets, and there's an infinite family of them. But that's a good thing, since you get to pick the pair with your desired semantics.
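For instance, with the min/max (Gödel) pair the fuzzy Jaccard index reduces to a sum of element-wise minima over a sum of element-wise maxima. A minimal sketch of the idea (function name and the toy masks are mine, not taken from the papers below):

    import numpy as np

    def fuzzy_jaccard(a, b, t_norm=np.minimum, t_conorm=np.maximum):
        # a, b: fuzzy sets as arrays of membership values in [0, 1].
        # The T-norm plays the role of intersection, the T-conorm of union.
        # Default is the min/max (Godel) pair; the product pair
        # (x*y, x + y - x*y) gives different semantics.
        intersection = t_norm(a, b).sum()
        union = t_conorm(a, b).sum()
        return intersection / union if union > 0 else 1.0

    # toy probabilistic segmentation vs. fuzzy ground truth
    seg = np.array([0.9, 0.2, 0.7, 0.0])
    gt  = np.array([1.0, 0.1, 0.5, 0.0])
    print(fuzzy_jaccard(seg, gt))                                  # min/max pair
    print(fuzzy_jaccard(seg, gt,
                        t_norm=lambda x, y: x * y,
                        t_conorm=lambda x, y: x + y - x * y))      # product pair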
I wrote about this ([0][1]) in the context of validating medical image segmentations when both the segmentation and ground truth are probabilistic/fuzzy rather than binary masks.
Otherwise, what most people do is simply threshold at 0.5 to obtain binary sets for use with the binary variants of the Jaccard/Dice coefficients, which apparently decreases the precision of your validation operator by two orders of magnitude. It's like publishing your algorithm claiming it's better than SOTA by 0.001 while ignoring the fact that your validation operator has an error margin of 0.1 ...
[0] https://link.springer.com/chapter/10.1007/978-3-319-46723-8_...
[1] https://ora.ox.ac.uk/objects/uuid:dc352697-c804-4257-8aec-08...
Of course, nowadays I would probably just tell them to use datasketch (https://pypi.org/project/datasketch/).
With this trip down memory lane, I looked around a little and noticed people are still creating new stuff on the topic, e.g.:
https://pypi.org/project/rensa/
It's basically a more specialized but faster version of datasketch's MinHash, written in Rust with a little Python on top.
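For anyone who hasn't used it, the basic datasketch workflow looks roughly like this (the parameters are just illustrative):

    from datasketch import MinHash, MinHashLSH

    def minhash_of(tokens, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for t in tokens:
            m.update(t.encode("utf8"))
        return m

    m1 = minhash_of("the quick brown fox jumps over the lazy dog".split())
    m2 = minhash_of("the quick brown fox jumped over a lazy dog".split())
    print(m1.jaccard(m2))          # estimated Jaccard similarity

    # LSH index for finding candidates above a similarity threshold
    lsh = MinHashLSH(threshold=0.5, num_perm=128)
    lsh.insert("doc1", m1)
    print(lsh.query(m2))           # -> ["doc1"] if the estimate clears the threshold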
I've also written up some interactive tutorials on how the method works [1] if anyone's interested
[0] https://github.com/moj-analytical-services/splink
[1] https://www.robinlinacre.com/intro_to_probabilistic_linkage/
What would you say are the main differences with other approaches?
The problem is to find (pseudo) inverses of many different proper submatrices of a big square matrix.
Certain matrix identities (Woodbury, Banachiewicz) allow you to update the inverse of a “close” submatrix to cheaply calculate a new one.
So you store the inverses you’ve calculated already, with their row/column indices as keys. Then for each new submatrix you want to find a close already-computed inverse to update from.
This is solved with minhashing. You minwise hash the indices so that close matrices are likely to have the same hash.
In my particular implementation I did a “multi-resolution” hash so that I could change the selectivity of the search as the number of previously computed inverses grows.
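A rough sketch of the lookup side (single-resolution, and using datasketch for the min-wise hashing rather than whatever the original implementation did), which just finds the closest already-computed inverse to update from:

    import numpy as np
    from datasketch import MinHash, MinHashLSH

    NUM_PERM = 32
    lsh = MinHashLSH(threshold=0.7, num_perm=NUM_PERM)   # index over already-computed inverses
    cache = {}                                           # key -> (index tuple, inverse)

    def sig(indices):
        m = MinHash(num_perm=NUM_PERM)
        for i in indices:
            m.update(str(i).encode())
        return m

    def closest_cached_inverse(indices):
        # Return (prev_indices, prev_inverse) of the most similar cached submatrix,
        # or None if nothing clears the LSH threshold. The caller then applies the
        # Woodbury / Banachiewicz update starting from that inverse.
        m = sig(indices)
        candidates = lsh.query(m)
        if not candidates:
            return None
        best = max(candidates, key=lambda k: m.jaccard(sig(cache[k][0])))
        return cache[best]

    def remember_inverse(indices, inverse):
        key = ",".join(map(str, sorted(indices)))
        if key not in cache:
            cache[key] = (tuple(indices), inverse)
            lsh.insert(key, sig(indices))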
you can find the relevant material for free by searching "chapter 3 pdf mmds ullman"
enjoy!
edit: oh no! i'm wrong! according to wikipedia it was invented at dec for altavista. https://en.wikipedia.org/wiki/MinHash either way there's a nice description in the ullman book and they do describe how it was used at google as well.
https://websla.sh/tools/minhash
It's not really complete; I'd like to show Jaccard similarity calculations among other things, but it already lets you enter multiple strings and see with your own eyes what a "minhash" actually is.
There is a nice Google project using a 0.5M-parameter RETSim model and the USearch engine for that: https://github.com/google/unisim
1. Concatenate and split all text fields into an array of ngrams (2...n chars)
2. Declare two global arrays A & B of length-n and fill them with random 32-64 bit integers
3. Hash each ngram to a 32-64 bit integer, multiply the resulting hash by each random value in array A, and take it modulo the corresponding random value in array B; keep the minimum value per (A, B) pair across all ngrams. Per row, your aim is to end up with an array of 'minhashed' integers of the same length as the arrays in step 2: if you declare the global arrays to have length 64, the minhashed array for each row will also have length 64.
4. Bucketise the resulting hash arrays by summing N consecutive minhash values using a window function (e.g. sum each 4 consecutive values).
If all went well, you can now unnest these arrays (call them 'source rows') and self join the dataset on each bucketed minhash value (resulting in an additional column of 'target rows'). Group by the source and target columns and count the occurrences to get an indication of how likely it is that two rows are similar.
In essence, the more often two items hash to the same bucket, the more similar they are, and it's up to you to define the cut-off beyond which to compute actual pairwise Jaccard (or cosine) similarities between items.
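The description above reads like SQL; here's roughly the same idea in Python (hash function, array length and band size are arbitrary choices, not necessarily the ones above):

    import random
    import zlib

    NUM_HASHES = 64      # length of the global arrays A and B (step 2)
    BAND_SIZE = 4        # consecutive minhash values summed per bucket (step 4)

    random.seed(0)
    A = [random.randrange(1, 2**32) for _ in range(NUM_HASHES)]
    B = [random.randrange(1, 2**32) for _ in range(NUM_HASHES)]

    def ngrams(text, n=3):
        # step 1: character n-grams of the concatenated text fields
        return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

    def minhash_row(text):
        # step 3: for each (a, b) pair, min over all ngrams of (hash(g) * a) % b
        hashes = [zlib.crc32(g.encode()) for g in ngrams(text)]
        return [min((h * a) % b for h in hashes) for a, b in zip(A, B)]

    def buckets(signature):
        # step 4: sum each window of BAND_SIZE consecutive minhash values
        return [sum(signature[i:i + BAND_SIZE])
                for i in range(0, len(signature), BAND_SIZE)]

    # two rows sharing many ngrams will collide on some (position, bucket) pairs,
    # which is what the self join / group by then counts
    b1 = set(enumerate(buckets(minhash_row("john smith, 12 main st"))))
    b2 = set(enumerate(buckets(minhash_row("jon smith, 12 main street"))))
    print(len(b1 & b2))   # number of matching buckets; higher = more likely similar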
If you think two items share highly similar _text_ then give Levenshtein distance a try.
Here's the repo: https://github.com/NVIDIA/NeMo-Curator/
Some documentation on the fuzzy dedup scripts: https://docs.nvidia.com/nemo-framework/user-guide/latest/dat...
And a Python example: https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/fu...
Would be interested in any feedback from the folks here.
I first learned about this technique from Douglas Eck (https://research.google/people/douglas-eck/), who used it for song clustering at Google. I remember him saying something about hashing and random vectors, and was puzzled because I figured that a less random optimization would work better.
Add in the banding stuff and some simple probability and voila! A cheap and embarrassingly parallel way to approximate Jaccard similarity across huge datasets.
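The banding math is the nice part: with b bands of r rows each, two items with true Jaccard similarity s become a candidate pair (collide in at least one band) with probability 1 - (1 - s^r)^b, which is the S-curve you tune to your similarity threshold. For example:

    def collision_probability(s, bands=16, rows=4):
        # chance that two items with Jaccard similarity s share at least one band
        return 1 - (1 - s ** rows) ** bands

    for s in (0.2, 0.4, 0.6, 0.8, 0.9):
        print(f"s={s:.1f}  P(candidate)={collision_probability(s):.3f}")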
I don't think the vector DB adds much. You could use it to speed up the lookup of the min-hash sketches if you have hundreds of millions of documents, but it is probably overkill.
Each time you embed a document you search for approximate nearest neighbours before adding it, so it is O(N) like MinHash. Vector indexes like HNSW and PQ have better performance/quality tradeoffs than SimHash LSH which is the analogue of MinHash for cosine distance.
The quality depends on what you mean by near duplicate and the embedding model you use. Current models work well, and if you have labelled data you can fine tune them to be better.
The main drawback is the additional cost of embedding all the documents, especially for longer documents. But this cost has dropped really quickly with smaller models, better optimisations, and faster hardware.
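A sketch of that incremental loop, assuming the embedding step happens elsewhere and using hnswlib merely as an example ANN index (the threshold and dimension are made up):

    import numpy as np
    import hnswlib

    DIM = 384             # depends on the embedding model
    THRESHOLD = 0.1       # cosine distance below which we call it a near-duplicate

    index = hnswlib.Index(space="cosine", dim=DIM)
    index.init_index(max_elements=1_000_000, ef_construction=200, M=16)

    def is_near_duplicate(doc_id: int, vector: np.ndarray) -> bool:
        # Check the ANN index before inserting; only new-looking documents get added.
        if index.get_current_count() > 0:
            _, distances = index.knn_query(vector, k=1)
            if distances[0][0] < THRESHOLD:
                return True
        index.add_items(vector.reshape(1, -1), np.array([doc_id]))
        return False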
Just the pure compute cost of needing to run an ML encoder over petabytes of data?
Or maybe because for their use-case — eliminating redundancy to reduce total dataset size and therefore training time — a non-domain-specific vectorization with a high-false-negative cluster-discovery rate was acceptable, because it just meant they'd "compress" the dataset slightly less well, and so get slightly more training time? (At the expense of increased bias in training toward the saliency of the features that weren't dedup'ed out; but that was going to happen regardless, and they likely already had a fully-general technique later in the pipeline for countering that.)
Typo:
It’s worth noticing that this definition is not in general transitive: we may well have three documents A,B,C such that S(A,B)≥S_{crit} and S(B,C)≥S_{crit} but S(A,B)<S_{crit}
That last S(A,B) should be S(A,C).
(Unless I'm very much mistaken ... and I could be)
This simple shuffle was a poor man's similarity sort that I used to find sound-alike words. The thing about the quality of similarity is that it's hard to know for sure whether it worked, but this was sufficient.