The Rolling Hash
Now that there is a way to remove the first character and append a new last character (in constant time) to a string hash value, we can also use this hash function suite to perform substring searches in linear time.
To find a substring of length m in a string of length n you might proceed as follows:
Start at the first character of the string
Compare the substring to the string for m characters.
If same, then the substring is found.
else, repeat from the next character of the string (stop when there are less than m characters left in the string)
In practice this works well since it is unlikely that more than a few character will match each round so the effective time efficiency is much less than O(N x M). But for some strings, such as DNA which have small alphabets, the number of matches in each round could be large.
By using a rolling hash this can be reduced to linear-time.
Calculate the hash of the substring we are searching for.
Start with a hash of the first m characters of the string
If the hash values match then the substring is found - probably.
else remove the first character of the string from the hash and add the m+1 th character to the hash.
Now the hash is of m characters from the next character of the string.
This process is O(N).
The catch is that this finds a substring with high probability. To ensure that it is a match we need to check each character.
A Good Hash Function
My next question was 'is this hash good enough?'.
After some research I found that hashing is similar to a problem called Balls and Bins. The idea is to analyse mathematically the result of randomly throwing m balls into n bins.
It turns out that if a hash function is 'good' then it should produce results similar to randomly throwing balls (strings) into bins (string hash numbers).
I am no mathematician so this was slow going and a bit frustrating since I don't know all the tricks used to simplify problems.
I found that for a good hash function, the number of unused hash numbers after generating hash numbers from n random-like strings should be n/e. This is easy to test: I made an array of size n, generated random strings, hashed them, and incremented the element of the array indexed by the hash number.
(To keep the array size small, I first reduced the 28 bit hash number to a 16 bit number by mod 65536).
The probability that a hash number was used once is also n/e. And likewise it is easy to calculate the probabilities of a hash number being used k times:
Pr[ k ] = 1/ek!
And the probability that a hash number is used k or more times is roughly
Pr[ >=k ] ~ ((e/k)**k)/e
(to be completed)