How many hash functions are required in a minhash algorithm?

I am keen to try and implement minhashing to find near-duplicate content. http://blog.cluster-text.com/tag/minhash/ has a nice write-up, but there is the question of just how many hashing algorithms you need to run across the shingles in a document to get reasonable results.

The blog post above mentioned something like 200 hashing algorithms. http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx lists 100 as a default.

Obviously there is an increase in the accuracy as the number of hashes increases, but how many hash functions is reasonable?

To quote from the blog

It is tough to get the error bar on our similarity estimate much smaller than [7%] because of the way error bars on statistically sampled values scale — to cut the error bar in half we would need four times as many samples.

Does this mean that decreasing the number of hashes to something like 12 (200 / 4 / 4) would result in an error rate of 28% (7 * 2 * 2)?

Answers


Pretty much, but 28% would be the "error estimate", meaning reported measurements would frequently be inaccurate by +/- 28%.

That means that a reported measurement of 78% could easily come from only 50% similarity, or that 50% similarity could easily be reported as 22%. That doesn't sound accurate enough for business expectations, to me.
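The scaling in the question can be checked with a little arithmetic. This is only a sketch, assuming the error bar shrinks as 1/sqrt(n) with the number of hash functions n (which is what "four times as many samples to halve the error" implies) and taking the blog's 7% at 200 hashes as the reference point. The function name `scaled_error` is made up for illustration.

```python
import math

def scaled_error(ref_error, ref_hashes, hashes):
    """Scale a known error bar to a different number of hash functions,
    assuming the error is proportional to 1 / sqrt(number of hashes)."""
    return ref_error * math.sqrt(ref_hashes / hashes)

# Reference point from the blog quote: about 7% with 200 hash functions.
for n in (800, 200, 50, 12):
    error_percent = 100 * scaled_error(0.07, 200, n)
    print(n, "hashes ->", round(error_percent, 1), "% error bar")
```

Quartering the number of hashes twice (200 to 50 to about 12) doubles the error bar twice, which is where the figure of roughly 28% comes from.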

Mathematically, if you're reporting two digits the second should be meaningful.

Why do you want to reduce the number of hash functions to 12? What "200 hash functions" really means is: calculate a decent-quality hashcode for each shingle/string once, then apply 200 cheap and fast transformations to emphasise certain factors and bring certain bits to the front.

I recommend combining bitwise rotations (or shuffling) and an XOR operation. Each hash function can combine a rotation by some number of bits with an XOR against a randomly generated integer.

This "spreads" the selectivity of the min() function around the bits, and also varies which value min() ends up selecting.

The rationale for rotation is that "min(Int)" will, 255 times out of 256, select only within the 8 most-significant bits. Only if all the top bits are the same do the lower bits have any effect on the comparison, so spreading can be useful to avoid undue emphasis on just one or two characters in the shingle.

The rationale for XOR is that, on its own, bitwise rotation (ROTR) can 50% of the time (when 0 bits are shifted in from the left) converge towards zero. That would cause "separate" hash functions to display an undesirable tendency to coincide towards zero together, and thus an excessive tendency to end up selecting the same shingle rather than independent shingles.

There's a very interesting "bitwise" quirk of signed integers, where the MSB is negative but all following bits are positive, that renders the tendency of rotations to converge much less visible for signed integers -- where it would be obvious for unsigned. XOR must still be used in these circumstances, anyway.
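The rotate-then-XOR idea can be sketched roughly as follows. This is an illustration rather than a reference implementation: `zlib.crc32` merely stands in for the "decent-quality hashcode", the values are treated as 32-bit unsigned integers, and the names `rotr32`, `make_transforms` and `minhash_signature` are hypothetical.

```python
import random
import zlib

MASK32 = 0xFFFFFFFF

def rotr32(x, r):
    """Rotate a 32-bit unsigned value right by r bits."""
    r %= 32
    return ((x >> r) | (x << (32 - r))) & MASK32

def make_transforms(count, seed=0):
    """Build `count` pseudo-random (rotation, xor_mask) pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(32), rng.getrandbits(32)) for _ in range(count)]

def minhash_signature(shingle_hashes, transforms):
    """For each transform, keep the minimum transformed hash over all shingles."""
    return [min(rotr32(h, r) ^ mask for h in shingle_hashes)
            for r, mask in transforms]

# Hash each shingle once with the base hash (an assumption: crc32 as stand-in),
# then derive 200 values per shingle from cheap rotate + XOR transforms.
shingles = ["the quick brown", "quick brown fox", "brown fox jumps"]
base_hashes = [zlib.crc32(s.encode("utf-8")) for s in shingles]
signature = minhash_signature(base_hashes, make_transforms(200))
print(len(signature))
```

The expensive hash runs once per shingle; each of the 200 "hash functions" costs only a rotation and an XOR.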

Java has 32-bit hashcodes built in. If you use the Google Guava libraries, 64-bit hashcodes are also available.

Thanks to @BillDimm for his input & persistence in pointing out that XOR was necessary.


One way to generate 200 hash values is to generate one hash value using a good hash algorithm and generate 199 values cheaply by XORing the good hash value with 199 sets of random-looking bits having the same length as the good hash value (i.e. if your good hash is 32 bits, build a list of 199 32-bit pseudo random integers and XOR each good hash with each of the 199 random integers).
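A rough sketch of the XOR approach described above, under some assumptions: `zlib.crc32` is used as a stand-in for the "good" 32-bit hash, the pseudo-random masks come from a seeded `random.Random`, and the function names are invented for this example.

```python
import random
import zlib

def xor_masks(count, seed=0):
    """Generate `count` pseudo-random 32-bit masks (same width as the hash)."""
    rng = random.Random(seed)
    return [rng.getrandbits(32) for _ in range(count)]

def minhash_xor(shingles, masks):
    """One good hash per shingle, then one extra min-hash per XOR mask."""
    base = [zlib.crc32(s.encode("utf-8")) for s in shingles]
    signature = [min(base)]  # value 1: the good hash itself
    signature.extend(min(h ^ m for h in base) for m in masks)
    return signature

def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

masks = xor_masks(199)  # 1 good hash + 199 derived ones = 200 values
doc_a = minhash_xor(["a b c", "b c d", "c d e"], masks)
doc_b = minhash_xor(["a b c", "b c d", "c d f"], masks)
print(estimate_similarity(doc_a, doc_b))
```

Both documents must be hashed with the same list of masks, otherwise the signature positions are not comparable.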

Do not simply rotate bits to generate hash values cheaply if you are using unsigned integers (signed integers are fine) -- that will often pick the same shingle over and over. Rotating the bits down by one is the same as dividing by 2 and copying the old low bit into the new high bit location. Roughly 50% of the good hash values will have a 1 in the low bit, so they will have huge hash values with no prayer of being the minimum hash when that low bit rotates into the high bit location. The other 50% of the good hash values will simply equal their original values divided by 2 when you shift by one bit. Dividing by 2 does not change which value is smallest. So, if the shingle that gave the minimum hash with the good hash function happens to have a 0 in the low bit (50% chance of that) it will again give the minimum hash value when you shift by one bit. As an extreme example, if the shingle with the smallest hash value from the good hash function happens to have a hash value of 0, it will always have the minimum hash value no matter how much you rotate the bits.

This problem does not occur with signed integers because minimum hash values have extreme negative values, so they tend to have a 1 at the highest bit followed by zeros (100...). So, only hash values with a 1 in the lowest bit will have a chance at being the new lowest hash value after rotating down by one bit. If the shingle with minimum hash value has a 1 in the lowest bit, after rotating down one bit it will look like 1100..., so it will almost certainly be beat out by a different shingle that has a value like 10... after the rotation, and the problem of the same shingle being picked twice in a row with 50% probability is avoided.
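The two cases above can be illustrated with a few hand-picked 32-bit values. This is a small demonstration rather than a statistical test; the sample hash values and the helper names (`rotr32`, `to_signed32`, `argmin`) are chosen purely for illustration.

```python
MASK32 = 0xFFFFFFFF

def rotr32(x, r):
    """Rotate a 32-bit unsigned value right by r bits."""
    r %= 32
    return ((x >> r) | (x << (32 - r))) & MASK32

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a two's-complement signed integer."""
    return x - (1 << 32) if x & 0x80000000 else x

def argmin(values):
    """Index of the smallest value (the shingle that min() would pick)."""
    return min(range(len(values)), key=lambda i: values[i])

# Unsigned case: every hash here has a 0 in the low bit, so rotating down by
# one bit just halves each value and the same shingle stays the minimum.
unsigned_hashes = [0x00000010, 0x00000400, 0x7FFFFFF0]
rotated = [rotr32(h, 1) for h in unsigned_hashes]
print(argmin(unsigned_hashes), argmin(rotated))  # prints "0 0"

# Signed case: the minimum has bit pattern 100...01; after rotation it becomes
# 1100...0 and loses to the value that rotates to 100...0.
signed_hashes = [0x80000001, 0x00000001]
before = [to_signed32(h) for h in signed_hashes]
after = [to_signed32(rotr32(h, 1)) for h in signed_hashes]
print(argmin(before), argmin(after))  # prints "0 1"
```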


What you want can be easily obtained from universal hashing. Popular textbooks like Cormen et al. have very readable information on this in section 11.3.3, pp. 265-268. In short, you can generate a family of hash functions using the following simple equation:

h(x,a,b) = ((ax+b) mod p) mod m
  • x is the key you want to hash
  • a is any number you can choose between 1 and p-1 inclusive.
  • b is any number you can choose between 0 and p-1 inclusive.
  • p is a prime number that is greater than the maximum possible value of x
  • m is the maximum value you want for the hash code, plus 1

By selecting different values of a and b you can generate many hash codes that are independent of each other.
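As a rough sketch of the formula above: the prime is assumed here to be the Mersenne prime 2^61 - 1, which is larger than any 32-bit or 60-bit key, and a and b are drawn pseudo-randomly from the ranges 1..p-1 and 0..p-1. The names `make_hash_family` and `universal_hash` are hypothetical.

```python
import random

P = (1 << 61) - 1  # assumed prime: 2^61 - 1, larger than any key used here

def make_hash_family(count, seed=0):
    """Draw `count` (a, b) pairs, each pair defining one hash function."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(count)]

def universal_hash(x, a, b, m):
    """h(x, a, b) = ((a*x + b) mod p) mod m"""
    return ((a * x + b) % P) % m

# 200 hash functions producing 16-bit hash codes (m = 2^16) for the key 12345.
family = make_hash_family(200)
codes = [universal_hash(12345, a, b, 1 << 16) for a, b in family]
print(codes[:5])
```

Python integers do not overflow, so a*x + b is computed exactly; in a fixed-width language the product would need a wider type.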

An optimized version of this formula can be implemented as follows in C/C++/C#/Java:

(unsigned) (a*x+b) >> (w-M)

Here:
  • w is the size of the machine word in bits (typically 32)
  • M is the size of the hash code you want, in bits
  • a is any odd integer that fits into a machine word
  • b is any integer less than 2^(w-M)
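The multiply-shift expression above can be sketched as follows. Python has no fixed-width unsigned type, so the `(unsigned)` cast is emulated here by masking to w = 32 bits; that masking and the function name `multiply_shift` are assumptions of this sketch.

```python
W = 32                   # machine word size in bits
MASK_W = (1 << W) - 1    # emulates unsigned wrap-around at 2^W

def multiply_shift(x, a, b, M):
    """Return the top M bits of (a*x + b) mod 2^W.

    a should be an odd integer below 2^W, b an integer below 2^(W-M).
    """
    return ((a * x + b) & MASK_W) >> (W - M)

# With a = 1 and b = 0 the result is just the top 16 bits of the key.
print(hex(multiply_shift(0x12345678, 1, 0, 16)))  # prints 0x1234
```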

The above works for hashing a number. To hash a string, first get a hash code from a built-in function such as GetHashCode, then use that value in the formula above.

For example, let's say you need 200 16-bit hash codes for a string s. The following code is one possible implementation:

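public int[] GetHashCodes(string s, int count, int seed = 0)
{
    var hashCodes = new int[count];
    // sizeof(int) is measured in bytes; the formula needs the word size in bits.
    const int machineWordSize = sizeof(int) * 8;
    const int hashCodeSize = machineWordSize / 2;   // 16-bit hash codes
    const int hashCodeSizeDiff = machineWordSize - hashCodeSize;
    // Work in unsigned arithmetic so that >> is a logical (not arithmetic) shift.
    var hstart = unchecked((uint)s.GetHashCode());
    var bmax = 1 << hashCodeSizeDiff;
    var rnd = new Random(seed);

    for (var i = 0; i < count; i++)
    {
        var a = (uint)(i * 2 + 1);          // always odd
        var b = (uint)rnd.Next(0, bmax);    // 0 <= b < 2^(w-M)
        // The multiplication is allowed to wrap around modulo 2^32.
        hashCodes[i] = (int)(unchecked(hstart * a + b) >> hashCodeSizeDiff);
    }

    return hashCodes;
}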

Notes:

  1. I'm using a hash code size that is half of the machine word size, which in most cases would be 16 bits. This is not ideal and has a far greater chance of collision. It can be improved by upgrading all arithmetic to 64-bit.
  2. Normally you want to select both a and b randomly within the ranges given above.

Just use 1 hash function! (and save the 1/(f ε^2) smallest values.)

Check out this article for the state of the art practical and theoretical bounds. It has a nice graph explaining why you probably want to use just one 2-independent hash function and save the k smallest values.

When estimating set sizes the paper shows that you can get a relative error of approximately ε = 1/sqrt(f k), where f is the Jaccard similarity and k is the number of values kept. So if you want error ε, you need k = 1/(f ε^2). For example, if your sets have similarity around 1/3 and you want a 10% relative error, you should keep the 300 smallest values.
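The "one hash function, keep the k smallest values" idea can be sketched like this. It is a simplified illustration, not the paper's construction: `zlib.crc32` stands in for the 2-independent hash function, the estimator shown is the common bottom-k one (look at the k smallest values of the combined sketches and count those present in both), and all function names are hypothetical. Non-empty input sets are assumed.

```python
import heapq
import zlib

def bottom_k(items, k):
    """Keep the k smallest distinct hash values of the items."""
    hashes = {zlib.crc32(s.encode("utf-8")) for s in items}
    return heapq.nsmallest(k, hashes)

def estimate_jaccard(sketch_a, sketch_b, k):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    smallest = heapq.nsmallest(k, set(sketch_a) | set(sketch_b))
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in smallest if h in shared) / len(smallest)

def values_needed(similarity, rel_error):
    """k = 1 / (f * eps^2), the rule of thumb quoted above."""
    return 1.0 / (similarity * rel_error ** 2)

set_a = ["a", "b", "c", "d"]
set_b = ["c", "d", "e", "f"]
k = 10
print(estimate_jaccard(bottom_k(set_a, k), bottom_k(set_b, k), k))
print(values_needed(1 / 3, 0.10))  # approximately 300
```

When the sets hold fewer than k items, as in this toy example, the sketches contain every hash and the estimate equals the exact Jaccard similarity.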


It seems like another way to get N good hashed values would be to salt the same hash with N different salt values.

In practice, if applying the salt second, it seems you could hash the data, then "clone" the internal state of your hasher, add the first salt and get your first value. You'd reset this clone to the clean cloned state, add the second salt, and get your second value. Rinse and repeat for all N items.

This is likely not as cheap as XORing against N values, but it seems there is a possibility of better quality results at minimal extra cost, especially if the data being hashed is much larger than the salt value.
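The clone-and-salt idea might look roughly like this with Python's `hashlib`, whose hash objects provide a `copy()` method. The choice of SHA-256, the truncation to 64 bits, and the function name `salted_hashes` are assumptions made only for this sketch.

```python
import hashlib

def salted_hashes(data, salts):
    """Hash `data` once, then finish a copy of the hasher with each salt."""
    base = hashlib.sha256()
    base.update(data)                # the expensive part, done only once
    values = []
    for salt in salts:
        hasher = base.copy()         # clone the state reached after the data
        hasher.update(salt)          # append the salt to the clone only
        values.append(int.from_bytes(hasher.digest()[:8], "big"))
    return values

salts = [i.to_bytes(4, "big") for i in range(200)]
values = salted_hashes(b"some shingle text", salts)
print(len(values))
```

Each value equals the hash of the data followed by the salt, but the data itself is only fed through the hasher once.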

