Improving the distribution of hash function values
Suppose I have a very large number of strings (say 10 billion strings of ~50 characters each). I want to distribute the strings into exactly 10 buckets. Each bucket should hold about 10% of the strings. With a hash function h() I can do:
int bucket_for_s = h(s) % 10
However this provides no guarantee about the evenness of the distribution. Suppose I do the above for all strings and find that 30% go to bucket 1, 5% go to bucket 2 and so on. My question is:
Given h() distribution, is there a way to generate a new hash function h2() that will distribute the strings more evenly?
Alternatively, is there a process that can generate a series of hash functions h2(), h3()... so that 1: each hash function is better than the previous one and 2: I only have to generate a reasonable number of hash functions?
I should also mention that unfortunately I can't simply split the input into 10 parts because my input is spread across several machines. I am looking for a deterministic solution I can apply to each machine separately and get the same results (so eventually "hello" would go to bucket x, no matter on which machines it was stored).
Cryptographically solid hash functions should already have a very even distribution across all bits of the hash output.
If you are using something like Java's hashCode() which I believe looks like
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
you may well see a less than ideal hash distribution.
Try using a cryptographic hash such as SHA-256 as a basis.
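As a minimal sketch of that idea (the class and method names are mine): hash the string with SHA-256 via the standard MessageDigest API, treat the digest as a non-negative integer, and reduce it modulo 10.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha256Bucket {
    // Map a string to one of 10 buckets using SHA-256.
    static int bucketFor(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            // Interpret the digest as a non-negative integer, reduce mod 10.
            return new BigInteger(1, digest).mod(BigInteger.TEN).intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    public static void main(String[] args) {
        // Deterministic: "hello" lands in the same bucket on every machine.
        System.out.println(bucketFor("hello"));
    }
}
```

Because the function depends only on the string's bytes, every machine computes the same bucket for the same string, which satisfies the determinism requirement from the question.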
Chaining hash functions or generating a series of hash functions would be unnecessarily computationally expensive. Instead, use a hash function that already has the required properties out of the box.
From what you described, the hash function should be deterministic (your "hello" example) - this is true for all hash functions - and should generate an even distribution.
A cryptographic hash such as SHA-256 should meet your requirements, as it outputs completely different hashes even for only slightly different inputs like "hello" and "hallo". By applying the modulo (%) operation to the hash, you can then have as many buckets as you like (though not more than the number of possible hash values, of course).
However, cryptographic hash functions are built for security and checksums and involve some complex computation. In your case, it is very likely that you will not need the strong security-related properties they provide.
You may rather want to look at so-called "non-cryptographic hash functions", which have relaxed properties and are designed for lookups, so they are optimized for speed. Java's hashCode(), MurmurHash and CityHash (see Google's announcement) might be a good start.
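As a sketch of the fast, non-cryptographic route (class name is mine): Java's built-in hashCode() can return negative values, so Math.floorMod rather than % is needed to get a bucket index in [0, 10).

```java
public class FastBucket {
    // String.hashCode() may be negative; floorMod always returns a
    // non-negative remainder, so the result is a valid index in [0, 10).
    static int bucketFor(String s) {
        return Math.floorMod(s.hashCode(), 10);
    }

    public static void main(String[] args) {
        System.out.println(bucketFor("hello"));
    }
}
```

This is much cheaper per string than SHA-256, at the cost of a weaker (though often adequate) distribution.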
Deterministic nature of hash functions vs. even distribution of hashes
That said, as hash functions are deterministic with regard to their input, the hash for a certain input such as "hello" will always be the same, even if you call the hash function multiple times. If your data set contains elements with many exact duplicates (e.g. "a" and "the" are the usual suspects for tokenized texts), this can easily lead to unevenly sized buckets, no matter which hash function you use.
Assuming you want to use the even distribution of hashes for even distribution of workload, this can be overcome using the following strategy. Think of each bucket as a work package or job that can be processed by any of the available machines. If you have more work packages than machines (let's say 20 or 30 packages for 10 machines), you can evenly distribute the workload as long as you allow for flexible scheduling. When machine A gets one of the oversized packages and takes some time to process it, machine B could process two small or medium-sized packages in the same time, thus the overall performance impact of the oversized package is reduced.
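The flexible-scheduling strategy above can be sketched as a greedy longest-processing-time assignment (class and method names are mine; the package sizes would come from your measured bucket shares): sort the work packages by size, then repeatedly hand the largest remaining package to the least-loaded machine.

```java
import java.util.Arrays;

public class PackageScheduler {
    // Greedy LPT scheduling: assign each package, largest first,
    // to the machine with the smallest current load.
    // Returns the resulting load per machine.
    static long[] assign(long[] packageSizes, int machines) {
        long[] load = new long[machines];
        long[] sorted = packageSizes.clone();
        Arrays.sort(sorted);
        for (int i = sorted.length - 1; i >= 0; i--) {
            int min = 0;
            for (int m = 1; m < machines; m++) {
                if (load[m] < load[min]) min = m;
            }
            load[min] += sorted[i];
        }
        return load;
    }
}
```

With more packages than machines, one oversized package is absorbed by giving that machine fewer of the remaining ones, exactly as described above.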
A direction on how to solve it, simplified to 2 buckets instead of 10 (or N).
Suppose you get a distribution h() with allocations p for bucket 1 and q for bucket 2, and of course p + q = 1.
Now, the goal is to find a distribution h2() with parameters p1, q1, p2, q2 such that: given bucket 1 it uses chances p1, q1 (p1+q1=1), and given bucket 2 it uses chances p2, q2 (p2+q2=1):
         h()                h2()

                     / bucket1   p*p1
       bucket1  p --
      /              \ bucket2   p*q1
    x -
      \              / bucket1   q*p2
       bucket2  q --
                     \ bucket2   q*q2
where our goal is to even chances for all 2 buckets:
p*p1 + q*p2 = 1/2 (total chance for bucket 1 after h2())
p*q1 + q*q2 = 1/2 (total chance for bucket 2 after h2())
and as before:
p1 + q1 = 1 p2 + q2 = 1
This is a linear system of 4 equations in 4 variables (the parameters p1, q1, p2, q2 of distribution h2()). Strictly speaking, only three of the equations are independent: the two normalizations together with one bucket equation imply the other, so one parameter remains free to choose.
Note: With 10 buckets we'd have h() with p1, p2, ..., p10 where p1 + p2 + ... + p10 = 1. When the number of buckets is > 2, there are fewer equations than unknowns: for each allocation pi you get a component of h2() with pi1 + pi2 + ... + pi10 = 1. Thus for 10 buckets there are 100 unknown parameters of h2() and just 20 equations. This means that arbitrary (but feasible) values can be given to 80 of the parameters of h2() before solving the equations for the remaining ones. Not pretty, but still a solution.
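To make the 2-bucket case concrete, here is a sketch under assumed numbers (the skew p = 0.7, the salt string and the class name are all mine). Since one parameter is free, a natural choice when bucket 1 is oversized (p > 1/2) is p2 = 0 (bucket-2 items stay put) and p1 = 1/(2p), which gives bucket 1 a total of p * 1/(2p) = 1/2. A second, independent hash of the string serves as a deterministic "coin", so every machine makes the same decision for the same string.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TwoBucketFix {
    // Assumed measured skew of h(): fraction p = 0.7 lands in bucket 0.
    static final double P = 0.7;

    // Deterministic "coin" in [0, 1): a second, independent hash of s.
    static double coin(String s) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256")
                    .digest(("rebalance:" + s).getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, d).mod(BigInteger.valueOf(1_000_000))
                    .doubleValue() / 1_000_000.0;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    // h2(): items h() sent to the oversized bucket 0 stay there with
    // chance p1 = 1/(2p) and move to bucket 1 otherwise; bucket-1 items
    // stay put (p2 = 0, q2 = 1).
    static int h2(String s, int originalBucket) {
        if (originalBucket == 0) {
            return coin(s) < 1.0 / (2 * P) ? 0 : 1;
        }
        return 1;
    }
}
```

Both new buckets then receive 1/2 of the mass: p * 1/(2p) = 1/2 for bucket 0, and p * (1 - 1/(2p)) + q = 1 - 1/2 = 1/2 for bucket 1.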
Hash functions are designed to produce a uniform distribution. If this is not the case with your data, then your data is somehow a "partial" inverse of that particular hash function, and the problem should go away with different data or a different hash function.
Given this is a theoretical question, one approach would be:
Whitening colored noise
You can play with the bucket_for_s computation:

int bucket_for_s = put_in_bucket(s)

put_in_bucket(s):
    # use the last two digits of h(s) for a finer, 100-way split
    x = h(s) % 10 + 10*((h(s)/10) % 10)
    if ( 0 <= x <=  2) return 0
    if ( 3 <= x <=  5) return 1
    if ( 6 <= x <=  9) return 2
    # the previous bucket_1 (30%) is now split into 3 buckets
    if (10 <= x <= 27) return 3
    # the previous bucket_2 (5%) is now enlarged to incorporate
    # more of the small old buckets (or parts of buckets);
    # this bucket is now bucket_4
    # ... more of the same
    if (83 <= x <= 99) return 9
You can extend this idea by another digit until you are happy with your "resolution"
You can take the logic out of put_in_bucket and put it into h2(s) using h1(s).
This approach is used to color white noise (or to whiten colored noise, as in this case), hence the name.
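The remapping above can be generalized, assuming you have already measured the share of strings falling into each of the 100 fine buckets (the class and method names are mine): build a fixed table assigning fine buckets, in order, to 10 coarse buckets so each coarse bucket receives roughly 10% of the mass.

```java
public class RemapTable {
    // fineShare[x] is the measured fraction of strings whose two-digit
    // fine bucket equals x (shares sum to 1). Returns a table mapping
    // each fine bucket to a coarse bucket in 0..9 such that each coarse
    // bucket receives roughly 10% of the total mass.
    static int[] build(double[] fineShare) {
        int[] remap = new int[fineShare.length];
        double cum = 0;
        for (int x = 0; x < fineShare.length; x++) {
            // the midpoint of this fine bucket's mass decides where it goes
            remap[x] = (int) Math.min(9, (cum + fineShare[x] / 2) * 10);
            cum += fineShare[x];
        }
        return remap;
    }
}
```

Since the table depends only on the (agreed-upon) measured shares, it can be distributed to every machine once and applied deterministically, preserving the "same bucket everywhere" requirement. A fine bucket larger than 10% still cannot be split this way; only a finer resolution (another digit) helps there.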