Friday, 16 April 2010

Simple Rolling Hash- Part 2


String Slicing

I also wanted to extract substrings of appended or fixed-length strings quickly as well. 

Since most strings would be less than 64 characters (I reasoned), this too would be virtually constant-time. 

In the case of appended strings, the slicing would be proportional to the number of appended strings which would be roughly O(N) but with a very small constant factor in the order of at least 1/10 th on average.

Again, to ensure that the hash calculation would not harm performance, I needed a way to adjust the hash of a string by the hash of the leading or trailing substrings.

eg. 

hash( "def" ) = some_function( hash ( "abcdef" ) , hash( "abc" ) )
hash( "abc" ) = some_function( hash ( "abcdef" ) , hash( "def" ) )
To do this I needed to choose carefully the value of k such that k * inv(k) = 2**28 + 1, where inv(k) is the multiplicative inverse of k - ie. k and inv(k) are factors of 2**28+1.

Lucky Break

This part is simply a fluke. I needed my hash to be 28 bits so that I can store type information in the other 4 bits of a 32-bit value. 

I could use a hash that is less that 28 bits, but not more. Fortunately, 2**28 +1, which is 268,435,457 has two prime factors: 17 and 15790321.

It is also fortunate that 17 is a reasonable multiplier for a hash as well since, I reasoned, it preserves the low 4 bits of each character which contain the most information for the letters and numbers and punctuation characters.

I have found another hash function that uses k=17.


Removing Characters and Calculating the New Hash in O(1)

To remove a leading character from a string of length n, you need to remove it's component of the hash which has been multiplied by k, n-1 times. So the new hash is

hash( s' ) = hash( s ) - k**(n-1) * c, where c is the first character of the string s

This can be done in constant-time, but removing M characters takes linear-time, O(M).

In the more general case, removing a leading substring of m characters follows the same pattern: here the leading substring has been multiplied by k, n - m times.

hash( s' ) = hash( s ) - k**(n-m) * hash( f ), where f represents the leading m characters of the string s.

To remove the trailing character from a string of length n, you need to subtract it from the hash and then divide the hash of the string by k. Division can be performed by multiplying by the inverse of k which is 15790321.

hash( s' ) = ( hash( s ) - c ) * inv( k ), where c is the last character of the string s.

To remove the trailing m characters from a string of length n, you need to subtract the hash of the m characters and then divide the hash of the string by k**m. Again, division can be performed by multiplying by the inverse of k which is 15790321.

hash( s' ) = ( hash( s ) - hash( t ) ) * inv( k )**m, where t is the last m characters of the string s.

Unfortunaely in the general case, to remove substrings from a string, I need the hash of each substring which takes linear-time.

Reducing Substring Hash Time

Here is where the fixed-length strings help: The only substrings that need to be calculated are the first and last fixed-length strings of a slice. And these substrings are between 1 and 63 characters long. 


eg. Assume the fixed-length strings are 3 characters long and slice a string to return 6 characters from the second.
( "abc" , "def" , "ghi" ) :2:6 -> ( "bc" , "def" , "g" )
In this case, the hash for "bc" and "g" have to be calculated, but the hash for "def" remains the same.

We can always guarantee minimal time by observing that if the substring to be removed is greater than half the length of the fixed-length string then it is better to re-calculate the hash of the resulting substring. This way, worst case is calculating the hash of fixed-length characters, or 64 in my case.

If it happens that the hash of the substrings is known then we get back to constant-time.