Phil at Warrimoo: April 2010

Saturday, 17 April 2010

Simple Rolling Hash- Part 3

The Rolling Hash

Now that there is a way to remove the first character and append a new last character (in constant time) to a string hash value, we can also use this hash function suite to perform substring searches in linear time.

To find a substring of length m in a string of length n you might proceed as follows:

Start at the first character of the string

Compare the substring to the string for m characters.

If same, then the substring is found.

else, repeat from the next character of the string (stop when there are less than m characters left in the string)

In practice this works well since it is unlikely that more than a few character will match each round so the effective time efficiency is much less than O(N x M). But for some strings, such as DNA which have small alphabets, the number of matches in each round could be large.

By using a rolling hash this can be reduced to linear-time.

Calculate the hash of the substring we are searching for.

Start with a hash of the first m characters of the string

If the hash values match then the substring is found - probably.

else remove the first character of the string from the hash and add the m+1 th character to the hash.

Now the hash is of m characters from the next character of the string.

This process is O(N).

The catch is that this finds a substring with high probability. To ensure that it is a match we need to check each character.

A Good Hash Function

My next question was 'is this hash good enough?'.

After some research I found that hashing is similar to a problem called Balls and Bins. The idea is to analyse mathematically the result of randomly throwing m balls into n bins.

It turns out that if a hash function is 'good' then it should produce results similar to randomly throwing balls (strings) into bins (string hash numbers).

I am no mathematician so this was slow going and a bit frustrating since I don't know all the tricks used to simplify problems.

I found that for a good hash function, the number of unused hash numbers after generating hash numbers from n random-like strings should be n/e. This is easy to test: I made an array of size n, generated random strings, hashed them, and incremented the element of the array indexed by the hash number.

(To keep the array size small, I first reduced the 28 bit hash number to a 16 bit number by mod 65536).

The probability that a hash number was used once is also n/e. And likewise it is easy to calculate the probabilities of a hash number being used k times:

Pr[ k ] = 1/ek!

And the probability that a hash number is used k or more times is roughly

Pr[ >=k ] ~ ((e/k)**k)/e

(to be completed)

Friday, 16 April 2010

Simple Rolling Hash- Part 2

String Slicing

I also wanted to extract substrings of appended or fixed-length strings quickly as well.

Since most strings would be less than 64 characters (I reasoned), this too would be virtually constant-time.

In the case of appended strings, the slicing would be proportional to the number of appended strings which would be roughly O(N) but with a very small constant factor in the order of at least 1/10 th on average.

Again, to ensure that the hash calculation would not harm performance, I needed a way to adjust the hash of a string by the hash of the leading or trailing substrings.

eg.

hash( "def" ) = some_function( hash ( "abcdef" ) , hash( "abc" ) )

hash( "abc" ) = some_function( hash ( "abcdef" ) , hash( "def" ) )

To do this I needed to choose carefully the value of k such that k * inv(k) = 2**28 + 1, where inv(k) is the multiplicative inverse of k - ie. k and inv(k) are factors of 2**28+1.

Lucky Break

This part is simply a fluke. I needed my hash to be 28 bits so that I can store type information in the other 4 bits of a 32-bit value.

I could use a hash that is less that 28 bits, but not more. Fortunately, 2**28 +1, which is 268,435,457 has two prime factors: 17 and 15790321.

It is also fortunate that 17 is a reasonable multiplier for a hash as well since, I reasoned, it preserves the low 4 bits of each character which contain the most information for the letters and numbers and punctuation characters.

I have found another hash function that uses k=17.

http://www.codeproject.com/KB/recipes/hash_functions.aspx

Removing Characters and Calculating the New Hash in O(1)

To remove a leading character from a string of length n, you need to remove it's component of the hash which has been multiplied by k, n-1 times. So the new hash is

hash( s' ) = hash( s ) - k**(n-1) * c, where c is the first character of the string s

This can be done in constant-time, but removing M characters takes linear-time, O(M).

In the more general case, removing a leading substring of m characters follows the same pattern: here the leading substring has been multiplied by k, n - m times.

hash( s' ) = hash( s ) - k**(n-m) * hash( f ), where f represents the leading m characters of the string s.

To remove the trailing character from a string of length n, you need to subtract it from the hash and then divide the hash of the string by k. Division can be performed by multiplying by the inverse of k which is 15790321.

hash( s' ) = ( hash( s ) - c ) * inv( k ), where c is the last character of the string s.

To remove the trailing m characters from a string of length n, you need to subtract the hash of the m characters and then divide the hash of the string by k**m. Again, division can be performed by multiplying by the inverse of k which is 15790321.

hash( s' ) = ( hash( s ) - hash( t ) ) * inv( k )**m, where t is the last m characters of the string s.

Unfortunaely in the general case, to remove substrings from a string, I need the hash of each substring which takes linear-time.

Reducing Substring Hash Time

Here is where the fixed-length strings help: The only substrings that need to be calculated are the first and last fixed-length strings of a slice. And these substrings are between 1 and 63 characters long.

eg. Assume the fixed-length strings are 3 characters long and slice a string to return 6 characters from the second.

( "abc" , "def" , "ghi" ) :2:6 -> ( "bc" , "def" , "g" )

In this case, the hash for "bc" and "g" have to be calculated, but the hash for "def" remains the same.

We can always guarantee minimal time by observing that if the substring to be removed is greater than half the length of the fixed-length string then it is better to re-calculate the hash of the resulting substring. This way, worst case is calculating the hash of fixed-length characters, or 64 in my case.

If it happens that the hash of the substrings is known then we get back to constant-time.

Simple Rolling Hash- Part 1

To process strings quickly I decided to try some hashing techniques to quickly append strings and make sub-strings (or slices) in time O(1) - constant time.

I also wanted to be able to search a string for some usually smaller string or word in linear time. This technique is called a rolling hash.

http://en.wikipedia.org/wiki/Rolling_hash

Background

A 'hash' is a number that is generated from some data in such a way that the distribution of the numbers looks random and there is a low probability that the same number is re-used.

For example, a hash of a string could be to multiply the length of the string by the sum of all the characters (this probably isn't a good hash method, but it explains the concept).

eg.

hash(abc) = 3 * 97+98+99 = 588

My Requirement - O(1) String Comparison

To speed-up string comparisons I was looking to use a 28-bit hash.

My aim was to have string comparison performed in roughly constant-time. To do this, I couldn't simply scan both strings looking for a mis-match as this would take O(N) time where N is the length of the two strings (unequal length strings can never match).

Instead I had to use some form of a hash of each string and only scan the strings if the hashes were equal.

I looked at the DJB and SDBM hash functions for their simplicity. SDBM seems the better one by all accounts, but both would do for my simple purpose.

My Strings

My strings are not simply an array of characters.

I divide each string into non-mutable fixed-length strings - currently 64 x 8-bit characters or 16/21 Unicode characters.

(My aim is to pick a fixed-length size that is 'just right' most of the time. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39.6999&rep=rep1&type=pdf )

This design, I was hoping, would reduce the cost of appending strings by allowing the new string to be formed as a linked list of the two source strings.

So,

"abc" + "defg" -> ( "abc" , "defg" )

As a side benefit, this also allowed for a more straight-forward memory management and garbage collection scheme - if all memory allocations are the same size, then they can be easily and cheaply recycled, and memory is much less likely to become fragmented.

Although this would be fast - constant-time O(1) - I would have to re-calculate the hash for the new string which would result in linear-time O(N) behaviour and therefore negate any real performance gains: appending strings would still be O(N).

What I needed was a hash that could be generated from the individual hashes of the two source strings.

With a little modulo arithmetic I was able to determine that if the hash functions were of the form:

hash(n) = k * hash(n-1) + char(n), and h(0)=0

then the resulting string would have a hash of

hash( s1+s2 ) = k ** length( s2 ) * hash( s1 ) + hash( s2 )

So now, I could append two strings and calculate the hash of the new string in constant-time.

Continued in Part 2.

Saturday, 10 April 2010

Apple's new iPhone license agreement

ars technica covers it well

http://arstechnica.com/apple/news/2010/04/apple-takes-aim-at-adobe-or-android.ars?utm_source=rss&utm_medium=rss&utm_campaign=rss

Apple's new API license includes the following:

The new version of 3.3.1 reads:

3.3.1 — Applications may only use Documented APIs in the manner prescribed by Apple and must not use or call any private APIs. Applications must be originally written in Objective-C, C, C++, or JavaScript as executed by the iPhone OS WebKit engine, and only code written in C, C++, and Objective-C may compile and directly link against the Documented APIs (e.g., Applications that link to Documented APIs through an intermediary translation or compatibility layer or tool are prohibited).

This seems to mean that developers can not use tools that translate programs from one platform to the iPhone, or frameworks that try to make all smart-phones look-alike.

This is clearly a closed, dictatorial stance.

There are already many iPhone apps and games that are openly built this way - what do they do?

xmlvm, howto, xokoban, legions

But how can they test that a code-generator/cross compiler was used?

Perhaps this will spawn a new breed of code-generators that actually translate code from one language to another with same variable names, comments, parameters, etc.

Now that would be cool.