Friday, 16 April 2010

Simple Rolling Hash- Part 1



To process strings quickly I decided to try some hashing techniques to quickly append strings and make sub-strings (or slices) in time O(1) - constant time.


I also wanted to be able to search a string for some usually smaller string or word in linear time. This technique is called a rolling hash.




Background

A 'hash' is a number that is generated from some data in such a way that the distribution of the numbers looks random and there is a low probability that the same number is re-used.

For example, a hash of a string could be to multiply the length of the string by the sum of all the characters (this probably isn't a good hash method, but it explains the concept).

eg.

hash(abc) = 3 * 97+98+99 = 588

My Requirement - O(1) String Comparison

To speed-up string comparisons I was looking to use a 28-bit hash.

My aim was to have string comparison performed in roughly constant-time. To do this, I couldn't simply scan both strings looking for a mis-match as this would take O(N) time where N is the length of the two strings (unequal length strings can never match).

Instead I had to use some form of a hash of each string and only scan the strings if the hashes were equal. 

I looked at the DJB and SDBM hash functions for their simplicity. SDBM seems the better one by all accounts, but both would do for my simple purpose.

My Strings

My strings are not simply an array of characters. 

I divide each string into non-mutable fixed-length strings - currently 64 x 8-bit characters or 16/21 Unicode characters.

(My aim is to pick a fixed-length size that is 'just right' most of the time. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39.6999&rep=rep1&type=pdf )

This design, I was hoping, would reduce the cost of appending strings by allowing the new string to be formed as a linked list of the two source strings.

So, 

"abc" + "defg" -> ( "abc" , "defg" ) 

As a side benefit, this also allowed for a more straight-forward memory management and garbage collection scheme - if all memory allocations are the same size, then they can be easily and cheaply recycled, and memory is much less likely to become fragmented.

Although this would be fast - constant-time O(1) - I would have to re-calculate the hash for the new string which would result in linear-time O(N) behaviour and therefore negate any real performance gains: appending strings would still be O(N).

What I needed was a hash that could be generated from the individual hashes of the two source strings. 

With a little modulo arithmetic I was able to determine that if the hash functions were of the form:

hash(n) = k * hash(n-1) + char(n), and h(0)=0

then the resulting string would have a hash of

hash( s1+s2 ) = k ** length( s2 ) * hash( s1 ) + hash( s2 )

So now, I could append two strings and calculate the hash of the new string in constant-time.

Continued in Part 2.