Saturday, 17 April 2010

Simple Rolling Hash- Part 3


The Rolling Hash

Now that there is a way to remove the first character and append a new last character (in constant time) to a string hash value, we can also use this hash function suite to perform substring searches in linear time.

To find a substring of length m in a string of length n you might proceed as follows:
Start at the first character of the string
Compare the substring to the string for m characters.
If same, then the substring is found.
else, repeat from the next character of the string (stop when there are less than m characters left in the string)

In practice this works well since it is unlikely that more than a few character will match each round so the effective time efficiency is much less than O(N x M). But for some strings, such as DNA which have small alphabets, the number of matches in each round could be large.

By using a rolling hash this can be reduced to linear-time.

Calculate the hash of the substring we are searching for.
Start with a hash of the first m characters of the string
If the hash values match then the substring is found - probably.
else remove the first character of the string from the hash and add the m+1 th character to the hash.
Now the hash is of m characters from the next character of the string.

This process is O(N).

The catch is that this finds a substring with high probability. To ensure that it is a match we need to check each character.

A Good Hash Function

My next question was 'is this hash good enough?'.

After some research I found that hashing is similar to a problem called Balls and Bins. The idea is to analyse mathematically the result of randomly throwing m balls into n bins.

It turns out that if a hash function is 'good' then it should produce results similar to randomly throwing balls (strings) into bins (string hash numbers).

I am no mathematician so this was slow going and a bit frustrating since I don't know all the tricks used to simplify problems.

I found that for a good hash function, the number of unused hash numbers after generating hash numbers from n random-like strings should be n/e. This is easy to test: I made an array of size n, generated random strings, hashed them, and incremented the element of the array indexed by the hash number.

(To keep the array size small, I first reduced the 28 bit hash number to a 16 bit number by mod 65536).

The probability that a hash number was used once is also n/e. And likewise it is easy to calculate the probabilities of a hash number being used k times:

Pr[ k ] = 1/ek!

And the probability that a hash number is used k or more times is roughly

Pr[ >=k ] ~ ((e/k)**k)/e

(to be completed)