The only complication is that you then need to choose a suitable T-Norm / T-Conorm pair, which express the concept of intersection and union for fuzzy sets, and there's an infinite family of them. But that's a good thing, since you get to pick the pair with your desired semantics.
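For instance, with the min/max (Gödel) pair the fuzzy Jaccard index reduces to a sum of element-wise minima over a sum of element-wise maxima. A minimal sketch of the idea (function name and the toy masks are mine, not taken from the papers below):

    import numpy as np

    def fuzzy_jaccard(a, b, t_norm=np.minimum, t_conorm=np.maximum):
        # a, b: fuzzy sets as arrays of membership values in [0, 1].
        # The T-norm plays the role of intersection, the T-conorm of union.
        # Default is the min/max (Godel) pair; the product pair
        # (x*y, x + y - x*y) gives different semantics.
        intersection = t_norm(a, b).sum()
        union = t_conorm(a, b).sum()
        return intersection / union if union > 0 else 1.0

    # toy probabilistic segmentation vs. fuzzy ground truth
    seg = np.array([0.9, 0.2, 0.7, 0.0])
    gt  = np.array([1.0, 0.1, 0.5, 0.0])
    print(fuzzy_jaccard(seg, gt))                                  # min/max pair
    print(fuzzy_jaccard(seg, gt,
                        t_norm=lambda x, y: x * y,
                        t_conorm=lambda x, y: x + y - x * y))      # product pair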
I wrote about this ([0][1]) in the context of validating medical image segmentations when both the segmentation and ground truth are probabilistic/fuzzy rather than binary masks.
Otherwise, what most people do is simply threshold at 0.5 to obtain binary sets for use with the binary variants of the Jaccard/Dice coefficients, which apparently decreases the precision of your validation operator by two orders of magnitude. It's like publishing your algorithm claiming it's better than SOTA by 0.001 while ignoring the fact that your validation operator has an error margin of 0.1 ...
[0] https://link.springer.com/chapter/10.1007/978-3-319-46723-8_...
[1] https://ora.ox.ac.uk/objects/uuid:dc352697-c804-4257-8aec-08...
Of course, nowadays I would probably just tell them to use datasketch (https://pypi.org/project/datasketch/).
With this trip down memory lane, I looked around a little and noticed people are still creating new stuff on the topic, e.g.:
https://pypi.org/project/rensa/
It's basically a more specialized but faster version of datasketch's MinHash, written in Rust with a little Python on top.
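For anyone who hasn't used it, the basic datasketch workflow looks roughly like this (the parameters are just illustrative):

    from datasketch import MinHash, MinHashLSH

    def minhash_of(tokens, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for t in tokens:
            m.update(t.encode("utf8"))
        return m

    m1 = minhash_of("the quick brown fox jumps over the lazy dog".split())
    m2 = minhash_of("the quick brown fox jumped over a lazy dog".split())
    print(m1.jaccard(m2))          # estimated Jaccard similarity

    # LSH index for finding candidates above a similarity threshold
    lsh = MinHashLSH(threshold=0.5, num_perm=128)
    lsh.insert("doc1", m1)
    print(lsh.query(m2))           # -> ["doc1"] if the estimate clears the threshold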
I've also written up some interactive tutorials on how the method works [1] if anyone's interested
[0] https://github.com/moj-analytical-services/splink
[1] https://www.robinlinacre.com/intro_to_probabilistic_linkage/
What would you say are the main differences with other approaches?
The problem is to find (pseudo) inverses of many different proper submatrices of a big square matrix.
Certain matrix identities (Woodbury, Banachiewicz) allow you to update the inverse of a “close” submatrix to cheaply calculate a new one.
So you store the inverses you’ve calculated already, with their row/column indices as keys. Then for each new submatrix you want to find a close already-computed inverse to update from.
This is solved with minhashing. You minwise hash the indices so that close matrices are likely to have the same hash.
In my particular implementation I did a “multi-resolution” hash so that I could change the selectivity of the search as the number of previously computed inverses grows.
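A rough sketch of the lookup side (single-resolution, and using datasketch for the min-wise hashing rather than whatever the original implementation did), which just finds the closest already-computed inverse to update from:

    import numpy as np
    from datasketch import MinHash, MinHashLSH

    NUM_PERM = 32
    lsh = MinHashLSH(threshold=0.7, num_perm=NUM_PERM)   # index over already-computed inverses
    cache = {}                                           # key -> (index tuple, inverse)

    def sig(indices):
        m = MinHash(num_perm=NUM_PERM)
        for i in indices:
            m.update(str(i).encode())
        return m

    def closest_cached_inverse(indices):
        # Return (prev_indices, prev_inverse) of the most similar cached submatrix,
        # or None if nothing clears the LSH threshold. The caller then applies the
        # Woodbury / Banachiewicz update starting from that inverse.
        m = sig(indices)
        candidates = lsh.query(m)
        if not candidates:
            return None
        best = max(candidates, key=lambda k: m.jaccard(sig(cache[k][0])))
        return cache[best]

    def remember_inverse(indices, inverse):
        key = ",".join(map(str, sorted(indices)))
        if key not in cache:
            cache[key] = (tuple(indices), inverse)
            lsh.insert(key, sig(indices))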
you can find the relevant material for free by searching "chapter 3 pdf mmds ullman"
enjoy!
edit: oh no! i'm wrong! according to wikipedia it was invented at dec for altavista. https://en.wikipedia.org/wiki/MinHash either way there's a nice description in the ullman book and they do describe how it was used at google as well.
https://websla.sh/tools/minhash
It's not really complete; I'd like to show Jaccard similarity calculations among other things, but it already lets you enter multiple strings and see with your own eyes what a "minhash" actually is.
There is a nice Google project using a 0.5M-parameter RETSim model and the USearch engine for that: https://github.com/google/unisim
1. Concatenate and split all text fields into an array of ngrams (2...n chars)
2. Declare two global arrays A & B of length-n and fill them with random 32-64 bit integers
3. Hash each ngram to a 32-64 bit integer, multiply the resulting hash by each random value in array A, and take it modulo the corresponding random value in array B; keep the minimum value per (A, B) pair across all ngrams. Per row, your aim is to end up with an array of 'minhashed' integers of the same length as the arrays in step 2: if you declare the global arrays to have length 64, the minhashed array for each row will also have length 64.
4. Bucketise the resulting hash arrays by summing N consecutive minhash values using a window function (e.g. sum each 4 consecutive values).
If all went well, you can now unnest these arrays (call them 'source rows') and self join the dataset on each bucketed minhash value (resulting in an additional column of 'target rows'). Group by the source and target columns and count the occurrences to get an indication of how likely it is that two rows are similar.
In essence, the more often two items hash to the same bucket, the more similar they are, and it's up to you to define the cut-off beyond which to compute actual pairwise Jaccard (or cosine) similarities between items.
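The description above reads like SQL; here's roughly the same idea in Python (hash function, array length and band size are arbitrary choices, not necessarily the ones above):

    import random
    import zlib

    NUM_HASHES = 64      # length of the global arrays A and B (step 2)
    BAND_SIZE = 4        # consecutive minhash values summed per bucket (step 4)

    random.seed(0)
    A = [random.randrange(1, 2**32) for _ in range(NUM_HASHES)]
    B = [random.randrange(1, 2**32) for _ in range(NUM_HASHES)]

    def ngrams(text, n=3):
        # step 1: character n-grams of the concatenated text fields
        return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

    def minhash_row(text):
        # step 3: for each (a, b) pair, min over all ngrams of (hash(g) * a) % b
        hashes = [zlib.crc32(g.encode()) for g in ngrams(text)]
        return [min((h * a) % b for h in hashes) for a, b in zip(A, B)]

    def buckets(signature):
        # step 4: sum each window of BAND_SIZE consecutive minhash values
        return [sum(signature[i:i + BAND_SIZE])
                for i in range(0, len(signature), BAND_SIZE)]

    # two rows sharing many ngrams will collide on some (position, bucket) pairs,
    # which is what the self join / group by then counts
    b1 = set(enumerate(buckets(minhash_row("john smith, 12 main st"))))
    b2 = set(enumerate(buckets(minhash_row("jon smith, 12 main street"))))
    print(len(b1 & b2))   # number of matching buckets; higher = more likely similar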
If you think two items share highly similar _text_ then give Levenshtein distance a try.
Here's the repo: https://github.com/NVIDIA/NeMo-Curator/
Some documentation on the fuzzy dedup scripts: https://docs.nvidia.com/nemo-framework/user-guide/latest/dat...
And a Python example: https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/fu...
Would be interested in any feedback from the folks here.
I first learned about this technique from Douglas Eck (https://research.google/people/douglas-eck/), who used it for song clustering at Google. I remember him saying something about hashing and random vectors, and was puzzled because I figured that a less random optimization would work better.
Add in the banding stuff and some simple probability and voila! A cheap and embarrassingly parallel way to approximate Jaccard similarity across huge datasets.
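The banding math is the nice part: with b bands of r rows each, two items with true Jaccard similarity s become a candidate pair (collide in at least one band) with probability 1 - (1 - s^r)^b, which is the S-curve you tune to your similarity threshold. For example:

    def collision_probability(s, bands=16, rows=4):
        # chance that two items with Jaccard similarity s share at least one band
        return 1 - (1 - s ** rows) ** bands

    for s in (0.2, 0.4, 0.6, 0.8, 0.9):
        print(f"s={s:.1f}  P(candidate)={collision_probability(s):.3f}")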
I don't think the vector DB adds much. You could use it to speed up the lookup of the min-hash sketches if you have hundreds of millions of documents, but it is probably overkill.
Each time you embed a document you search for approximate nearest neighbours before adding it, so it is O(N) like MinHash. Vector indexes like HNSW and PQ have better performance/quality tradeoffs than SimHash LSH which is the analogue of MinHash for cosine distance.
The quality depends on what you mean by near duplicate and the embedding model you use. Current models work well, and if you have labelled data you can fine tune them to be better.
The main drawback is the additional cost of embedding all the documents, especially for longer documents. But this cost has dropped really quickly with smaller models, better optimisations, and faster hardware.
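A sketch of that incremental loop, assuming the embedding step happens elsewhere and using hnswlib merely as an example ANN index (the threshold and dimension are made up):

    import numpy as np
    import hnswlib

    DIM = 384             # depends on the embedding model
    THRESHOLD = 0.1       # cosine distance below which we call it a near-duplicate

    index = hnswlib.Index(space="cosine", dim=DIM)
    index.init_index(max_elements=1_000_000, ef_construction=200, M=16)

    def is_near_duplicate(doc_id: int, vector: np.ndarray) -> bool:
        # Check the ANN index before inserting; only new-looking documents get added.
        if index.get_current_count() > 0:
            _, distances = index.knn_query(vector, k=1)
            if distances[0][0] < THRESHOLD:
                return True
        index.add_items(vector.reshape(1, -1), np.array([doc_id]))
        return False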
Just the pure compute cost of needing to run an ML encoder over petabytes of data?
Or maybe because for their use-case — eliminating redundancy to reduce total dataset size and therefore training time — a non-domain-specific vectorization with a high-false-negative cluster-discovery rate was acceptable, because it just meant they'd "compress" the dataset slightly less well, and so get slightly more training time? (At the expense of increased bias in training toward the saliency of the features that weren't dedup'ed out; but that was going to happen regardless, and they likely already had a fully-general technique later in the pipeline for countering that.)
Typo:
It’s worth noticing that this definition is not in general transitive: we may well have three documents A,B,C such that S(A,B)≥S_{crit} and S(B,C)≥S_{crit} but S(A,B)<S_{crit}
That last S(A,B) should be S(A,C).
(Unless I'm very much mistaken ... and I could be)
This simple shuffle was a poor man's similarity sort that I used to find sound-alike words. The thing about the quality of similarity is that it's hard to know for sure whether it worked, but this was sufficient.