I don't think xor works, but simple arithmetic, IIRC h_i(key) = (i*h(key) + C) mod 2^32, is viable and SIMD-friendly. Look up min-wise independent hash functions.
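For anyone who wants to play with it, here is a rough Python sketch of that arithmetic family. The names (`base_hash`, `h`, `minhash_signature`, `estimate_jaccard`), the use of `zlib.crc32` as the base hash, and the constant `C` are all my own placeholders, not anything from the article. I start i at 1 because i = 0 would map every key to C, and note that for even i the map is not a bijection on 32-bit values. Whether this family is close enough to min-wise independent for real data is not something the sketch checks.

```python
import zlib

MASK32 = 0xFFFFFFFF   # reduce mod 2^32 with a bit mask
C = 0x9E3779B9        # arbitrary additive constant (assumption)

def base_hash(key: str) -> int:
    # Stand-in 32-bit base hash h(key); crc32 is only a placeholder choice.
    return zlib.crc32(key.encode("utf-8")) & MASK32

def h(i: int, key: str) -> int:
    # h_i(key) = (i * h(key) + C) mod 2^32
    return (i * base_hash(key) + C) & MASK32

def minhash_signature(tokens, k: int):
    # One minimum per hash function h_1 .. h_k; tokens must be non-empty.
    return [min(h(i, t) for t in tokens) for i in range(1, k + 1)]

def estimate_jaccard(sig_a, sig_b) -> float:
    # Fraction of positions where the two signatures agree.
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Usage would be something like `estimate_jaccard(minhash_signature(doc_a, 128), minhash_signature(doc_b, 128))`, where `doc_a` and `doc_b` are sets of tokens.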
That's clear, call this P1
> The probability that two documents A and B having the same representative token is, equal again to Jaccard’s similarity
That's less clear (call this P2) and not equivalent to the first statement, afaict. In fact, this probability seems lower than the previous one. Consider the table:
token A B
a False True
b True True
This counts as matching under P1, but not under P2. What am I missing here?
In other words, the set of cases where `reptoken(A) = reptoken(B)` is a subset of the cases where `reptoken(A) is in B`.
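To make the gap concrete, here is a quick brute-force check in Python for the two-token table above. It assumes `reptoken(X)` means the token of X that comes first under a random ordering of all tokens, which is my reading of the article rather than something it spells out; the function and variable names are mine.

```python
from fractions import Fraction
from itertools import permutations

A = {"b"}          # token b only
B = {"a", "b"}     # tokens a and b
universe = sorted(A | B)

def reptoken(doc, order):
    # First token in the ordering that belongs to the document.
    return next(t for t in order if t in doc)

# Enumerate every ordering of the tokens, each equally likely.
orders = list(permutations(universe))

# P1: the representative token of A appears somewhere in B.
p1 = Fraction(sum(reptoken(A, o) in B for o in orders), len(orders))
# P2: A and B have the same representative token.
p2 = Fraction(sum(reptoken(A, o) == reptoken(B, o) for o in orders), len(orders))
jaccard = Fraction(len(A & B), len(A | B))

print(p1, p2, jaccard)  # 1 1/2 1/2
```

For this example P2 comes out equal to the Jaccard similarity, while P1 is strictly larger.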
I could have explained that a bit better I suppose.