Similarity join (Min-hash) (opens in new tab)

(blog.yellowflash.in)

60 pointsyellowflash3y ago9 comments

9 comments

Brilliant stuff. Isn't the XORing essentially just equivalent to a 1-time pad -- which isn't very hash-like. I'd think using a PRNG [1] with the initial hash value as the seed, to generate more values, would be more effective.

https://en.wikipedia.org/wiki/Pseudorandom_number_generator

davesque3y ago

Yeah, I had a similar intuition about that approach possibly being weak or problematic. However, if my cursory reading of the algorithm is at all correct, the goal may simply be to consistently choose another hash value (a different min hash) and it may work fine. In other words, the same hash values would be considered the minimum after XORing with a random number and they will be different from the initial hash that was selected as the minimum without XORing. Those are the behaviors that matter and not so much that the hashes that result from the XORing are cryptographically secure.

davesque3y ago

Can't edit the post. I meant to say "the same two hash values."

kwillets3y ago

MinHash is a CPU hog due to precisely this step -- cheap rehashes are the order of the day.

I don't think xor works, but simple arithmetic, IIRC h_i(key) = i*h(key) + C mod 2^32, is viable and SIMD-friendly. Look up minwise-independent hash functions.

maest3y ago

> The probability of C also having true in the row would be equal to Jaccard’s similarity!

That's clear, call this P1

> The probability that two documents A and B having the same representative token is, equal again to Jaccard’s similarity

That's less clear (call this P2) and not equivalent to the first statement, afaict. In fact, this probability seems lower than the previous one. Consider the table:

    token     A     B
        a False  True
        b  True  True

This counts as matching under P1, but not under P2.

What am I missing here?

In order words, the number of cases where `reptoken(A) = reptoken(B)` is a subset of cases where `reptoken(A) is in B`

yellowflashOP3y ago

The probability of the `b` being chosen for A is 1 (No other choice) and for B it is 1/2. The probability of `b` being chosen for both (The only common token), is |{(b,b)}| / |{(a, a), {(a, b)}| = 1/2 which is jaccard's similarity |{b}|/|{a, b}|

I could have explained that a bit better I suppose.

kwillets3y ago

The net result of the hashing etc. is to shuffle the unique elements of A union B. In that shuffled union, the first element is in at least one of A or B; if it's in both it's in the intersection. The chance of that is J.

maest3y ago

Maybe I'm missing something, but I don't think that addresses my concerns around P1 and P2 being different.

sulam3y ago

I understand the theory here, and it seems like it works, but it sure would have been nice seeing some actual examples from a corpus of docs.

j / k navigate · click thread line to collapse

9 comments

goldenkey3y ago

https://en.wikipedia.org/wiki/Pseudorandom_number_generator

davesque3y ago

Can't edit the post. I meant to say "the same two hash values."

kwillets3y ago

MinHash is a CPU hog due to precisely this step -- cheap rehashes are the order of the day.

I don't think xor works, but simple arithmetic, IIRC h_i(key) = i*h(key) + C mod 2^32, is viable and SIMD-friendly. Look up minwise-independent hash functions.

maest3y ago

> The probability of C also having true in the row would be equal to Jaccard’s similarity!

That's clear, call this P1

> The probability that two documents A and B having the same representative token is, equal again to Jaccard’s similarity

That's less clear (call this P2) and not equivalent to the first statement, afaict. In fact, this probability seems lower than the previous one. Consider the table:

    token     A     B
        a False  True
        b  True  True

This counts as matching under P1, but not under P2.

What am I missing here?

In order words, the number of cases where `reptoken(A) = reptoken(B)` is a subset of cases where `reptoken(A) is in B`

yellowflashOP3y ago

I could have explained that a bit better I suppose.

kwillets3y ago

maest3y ago

Maybe I'm missing something, but I don't think that addresses my concerns around P1 and P2 being different.

sulam3y ago

I understand the theory here, and it seems like it works, but it sure would have been nice seeing some actual examples from a corpus of docs.

j / k navigate · click thread line to collapse