Phil at Warrimoo: 2010

Thursday, 21 October 2010

Barlow St. Sydney Tram Tracks

Energy Australia are doing some work in Barlow St. Sydney.

Their contractors(?) have uncovered some old tram tracks. I managed to get a photo of one of the rails and a partly visible sleeper.

I asked the friendly contractors for a small piece of rail - worth a try I think.

Wednesday, 20 October 2010

Blue-Green Water Results

After almost 4 weeks, our blue-green stain has not returned.

Treated Tank Water Alkalinity

The addition of 1000 mL of lime raised the pH to nearly 11. Over time this has reduced to below 10 since about 10,000 L of new rain has been collected.

Rain Water Alkalinity

I also noted that the rain water had a pH of about 6. This was a little surprising since our tank water was pH 4.5.
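Because pH is a base-10 logarithmic scale, a gap of 1.5 pH units is larger than it looks. A small sketch of the arithmetic (the function name ph_ratio is made up for illustration):

```bash
# ph_ratio HIGH LOW: how many times more hydrogen ions the LOW-pH sample
# has than the HIGH-pH sample, since each pH unit is a factor of 10.
ph_ratio() {
  awk -v hi="$1" -v lo="$2" 'BEGIN { printf "%.1f\n", 10 ^ (hi - lo) }'
}

ph_ratio 6 4.5    # rain at pH 6 against tank water at pH 4.5, prints 31.6
```

So the tank water had roughly 32 times the hydrogen ion concentration of the fresh rain.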

Stale Treated Tank Water Alkalinity

In monitoring the treated tank water I have found that its pH decreases to nearly 7 when left for 24 hours or more. I suspect that this is due to the water absorbing CO2 from the air. This may also explain why rain water has a higher pH than our tank water: if left, the pH decreases as the water absorbs CO2.

CO2 Absorption of Large and Small Bodies of Water

To help confirm this theory, I left two containers of treated tank water to stand. One was over 300 mL (pH=9.94) and the other was less than 60 mL (pH=9.83). Both containers had similar surface areas. The idea is that given the similar surface areas, the two systems would absorb CO2 at the same rate but due to the different amounts of water, the larger body would change pH more slowly. This is what I observed: in less than 24 hours the pH of the smaller body had decreased to 7.17, while the larger body had a pH of 7.36. Not much difference, but the readings over time indicate that the larger body of water lags the smaller.

Note: Although the two samples should have started with the same pH I suspect the smaller sample had already absorbed significant CO2 before I measured it. The delay was about 15 minutes.

Boiled Water

I also boiled some treated water, let it cool and checked its pH. The water was originally about 9.6 pH. After boiling it dropped to about 8 pH. I have no theory for this at present.

Saturday, 25 September 2010

Blue-Green Water

We collect rainwater from our roof into two large 30,000 litre tanks. A pump takes the water from near the bottom of one tank and feeds the house.

From the time all the down-pipes drained into the tanks, we have not run out of water. Previously less than one third of the roof was connected to the tanks and we ran out of water three times in about three years (all in the August - October months).

The Stain

At some point we noticed that our bath would become stained with a blue-green film. Over time the thickness of the film increases. The stain was on the taps, tiles, bath-base, shower curtain, and anything left in the bath for a long time. Being white, it was most noticeable on the bath-base.

The Experiment

Late last year, we ran an experiment: we cleaned the blue-green film from the bath and switched to town water.

The stain did not return.

I had a number of theories.

1. It was a bio-slime similar to that which would grow in a sand filter. I had experimented with filtering washing machine water through a sand filter and it develops a blue-grey slime on the surface of the sand.

2. It was some other mould.

I initially thought that the town water - being high in chlorine - was keeping the mould/bio-film under control. But bleach/Exit Mould would not shift it so these theories did not make sense.

After a recent bath cleaning, I did some more research. The only mention of blue-green stains was in the context of copper stains - but we had rainwater which was 'pure' water I thought.

Rainwater is Acidic

It turns out that rainwater is acidic due to CO2 - carbonic acid.

Acidic water dissolves the protective copper oxide layer inside the pipes and in an alkaline environment, copper hydroxide will come out of solution.

The bath is probably an alkaline environment due to the soaps and shampoos.

I checked the toilet cistern - no blue-green stain. Probably because it is not an alkaline environment.
I checked the washing machine - no blue-green stain, but there is a yellow-brown stain (could this be some other copper product?).

Data Collection

I needed to measure pH. So I purchased a pH meter, buffer solutions and de-ionised water.
I also purchased some universal indicator paper, but this does not seem to work very well.

All samples were taken after running the taps for a period long enough to ensure all stagnant water was flushed.

Tank Water

Water in tanks: 4.5 pH - acidic
Water from kitchen tap: 4.9-5.0 pH
Water from bath tap: 4.89 pH
Hot water from kitchen tap: 5.35 pH at 50 degrees C
Hot water from bath tap: 5.4 pH at 47 degrees C

Our rainwater is very acidic, but the more copper pipe it travels through the less acidic it gets.
Also, the hot water is less acidic than the cold water.

In both cases it seems that the acidity is being reduced as copper is being dissolved and this process is accelerated by high temperature.

Town Water

At meter: 7.75 pH

Water from kitchen tap: 7.75 pH

Town water is alkaline.

Stagnant Water

0 min: 5.11 pH
11 min: 5.56
13 min: 5.7
15 min: 5.7
17 min: 5.73
20 min: 5.82

It again looks like the acidity drops when the water is sitting in the pipe. I suspect that this is because the acidity is dissolving the copper pipe and it seems to happen quickly.

Research

I found that blue-green water is a mystery. There does not seem to be a single cause, but acidity, O2, CO2 and temperature are all suspects.

The CSIRO has done some research.

They suggested that Microbiologically Induced Corrosion (MIC) should be researched but they did not seem to offer a cause.

A massive Thesis by Owais E. Farooqi might be interesting to others as it contains a lot of data, statistical analysis and mathematical modelling of various forms of copper corrosion.

I did find one paper on copper corrosion in particle accelerators.

The paper shows that:

Increasing temperature increases copper solubility.
Very low O2 (less than 30 ppb) and very high O2 concentrations (greater than 1000 ppb) decrease copper corrosion, but lower than 30 ppb is best.
pH less than 7 (caused by CO2) increases copper corrosion.

So, I need to increase pH to 7.5 or higher. A pH of 9 seems ideal. Town water seems to be above 7.5, so this indicates that the town water minimises copper corrosion.

At pH greater than 7, the corrosion due to temperature is minimised.

A pH greater than 8.5 and less than 9.5 minimises corrosion.

Treatment

My inlet is very low in the tank. This is supposed to be anaerobic so it should be low in O2 - but how low? I lowered the inlet pipe some time ago. Could this be contributing to copper corrosion by inadvertently ending up with an O2 concentration between 30 and 1000 ppb?

According to Wikipedia, fresh water has 6 mL of O2 per litre, or 6,000,000 ppb by volume.
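The ppb figure above is a volume ratio. If the paper's 30 and 1000 ppb thresholds are by mass (micrograms of O2 per litre of water), which is an assumption here and not something checked against the paper, the same 6 mL per litre converts to a much smaller number. A rough sketch of the conversion, using an O2 gas density of about 1.429 mg per mL at 0 degrees C and 1 atmosphere (the function name is illustrative):

```bash
# o2_ppb_by_mass ML_PER_LITRE: convert dissolved O2 from mL per litre of water
# to ppb by mass, taking 1 litre of water as 1 kg so that 1 ug/L is 1 ppb.
o2_ppb_by_mass() {
  awk -v ml="$1" 'BEGIN {
    mg_per_ml = 1.429                      # density of O2 gas at 0 C, 1 atm
    printf "%.0f\n", ml * mg_per_ml * 1000 # mg/L to ug/L
  }'
}

o2_ppb_by_mass 6    # prints 8574
```

That is about 8,600 ppb by mass, which is still well above the 1000 ppb threshold.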

After some experimentation I decided that 500 mL of lime powder added to one tank would be my first step.

This seemed to take the pH from about 5 to 5.8 almost immediately. But I needed the pH to be over 7.5 and ideally 9. So I added more lime and the pH went to 10.8.

I cannot take the lime out, so I will monitor the pH over the next few weeks and see what it does.

Testing

The bath is virtually clear of the blue-green stain so time will tell if the stain re-appears. If it does not return I may have solved our problem.

Thursday, 26 August 2010

How to remove commas from quoted strings in csv files

I needed to remove commas (,) from within double quotes (") in a Comma Separated Values (CSV) file.

For example, you can use cut -d, -f2,4,5 to extract fields 2, 4 and 5.

But if there is a comma in the text of a field like this "hello, world" you are generally stuck.

SQLite can also import csv files, but again commas in quotes cause problems.

I was stuck until I managed to create this sed script that will work in many cases.

# This bash function uses sed to remove up to 4 individual comma
# sequences from a quoted string in a csv file.

# eg. "Hello,, World, nice day." -> "Hello World nice day."

function removeCommas(){
  # IFS= and -r keep leading whitespace and backslashes in each line intact.
  while IFS= read -r data; do
   echo "$data" | sed -e 's/^/,/g' | sed -e "s/$/,/g" \
   | sed -e 's/\(,\"[^,]*\),*\([^,]*\),*\([^,]*\),*\([^,]*\),*\(.*\",\)/\1\2\3\4\5/g' \
   | sed -e 's/^,//g' | sed -e 's/,$//g'
  done
}

A friend mentioned that the sed commands can be combined as follows:

   echo "$data" \
   | sed \
   -e 's/^/,/g' \
   -e "s/$/,/g" \
   -e 's/\(,\"[^,]*\),*\([^,]*\),*\([^,]*\),*\([^,]*\),*\(.*\",\)/\1\2\3\4\5/g' \
   -e 's/^,//g' \
   -e 's/,$//g'

How it works

The engine uses sed.

This sed statement takes input from stdin and replaces the regular expression 'from' with 'to' for any and all occurrences of 'from'.

sed -e 's/from/to/g'

First, I add a comma to the start and end of each line with this:

sed -e 's/^/,/g' | sed -e "s/$/,/g"

These make the function work for special cases and they get removed after the commas have been removed.

The 'from' string starts off by finding the beginning of a quoted string ',\"' then all non-comma characters '[^,]*'.

This pattern is enclosed in escaped brackets '\(' and '\)' to assign the matching pattern to, in this case, part 1.

Then it matches the next zero or more commas ',*'. This is not in brackets since we don't want them.

The next part of the pattern '\([^,]*\)' is like the first: it finds a string of non-comma characters and keeps the values as part 2.

Then we skip over any commas again.

This sequence can be repeated as many times as you like. I did it 3 times.

The pattern ends like it starts. But this time it reads any character including commas up to the trailing quote using '.*'. Then it reads the trailing quote '\"' and comma ','. All of this makes part 5.

This means that it will only filter the first 4 sequences of commas. To make it do more, repeat the middle pattern ',*\([^,]*\)' and increase the output parts (below).

The 'to' expression is simply the concatenation of the 5 parts of the quoted string that are guaranteed to not contain a comma (well, except possibly for the last part), where '\N' is the nth part.

\1\2\3\4\5
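The fixed number of groups is what limits the sed script to 4 comma sequences. A different technique, not the one used above, is to split each line on the double quote character in awk: the quoted text then lands in the even-numbered fields and every comma in those fields can be removed. This is only a minimal sketch; it assumes quotes are balanced on each line and that fields contain no escaped ("") quotes, and the function name is made up for illustration.

```bash
# strip_quoted_commas: remove every comma that sits inside double quotes.
# Splitting on the quote character puts quoted text in fields 2, 4, 6, ...
# so only those fields are edited; OFS puts the quotes back on output.
strip_quoted_commas() {
  awk 'BEGIN { FS = OFS = "\"" }
       { for (i = 2; i <= NF; i += 2) gsub(/,/, "", $i); print }'
}

printf '%s\n' '1,2,"3,,4,,5,,6,,7,8",8,9,10' | strip_quoted_commas
# prints 1,2,"345678",8,9,10
```

Unlike the sed version, this also removes the fifth comma sequence, so its output differs from the sed function for strings with more than 4 sequences.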

Testing

I tested the function with this.

function testthis(){
  echo -n "\"$1\" --> \"$2\"..."
  echo "$1" | removeCommas | grep -qE "$2" && echo "Pass" \
  || echo "FAIL"
}

# test removeCommas

testthis "this test should fail to test the 'tester'" "anything but this"

testthis "" ""

testthis "," ","

testthis ",," ",,"

testthis ",\"\"," ",\"\","

testthis ",\",\"," ",\"\","

testthis "1,2,\"3,4 5 6 7\",8,9,10" "1,2,\"34 5 6 7\",8,9,10"

testthis "1,2,\"3,4,5 6 7\",8,9,10" "1,2,\"345 6 7\",8,9,10"

testthis "1,2,\"3,4,5,6 7\",8,9,10" "1,2,\"3456 7\",8,9,10"

testthis "1,2,\"3,4,5,6,7\",8,9,10" "1,2,\"34567\",8,9,10"

testthis "1,2,\"3,,4,,5,,6,,7\",8,9,10" "1,2,\"34567\",8,9,10"

testthis "1,2,\"3,,4,,5,,6,,7,8\",8,9,10" "1,2,\"34567,8\",8,9,10"

testthis "\"tricky one where the quoted string starts the line 3,,4,,5,,6,,7,8\",8,9,10" "\"tricky one where the quoted string starts the line 34567,8\",8,9,10"

testthis "1,2,\"3,,4,,5,,6,,7,8 tricky one where the quoted string ends the line \"" "1,2,\"34567,8 tricky one where the quoted string ends the line \""

Saturday, 17 July 2010

Firewall Rule Testing with BASH and TCPTraceRoute

I wanted to block the use of any DNS server except those that I select (google, openDNS and my router).

I also wanted to make sure that these DNS servers work and that others do not.

So I made a BASH script to verify my firewall rules.

#!/bin/bash

# only particular DNS servers are allowed to be contacted.

# this tests that this is so

LOCAL="the.IP.address.of.your.router"

ALLOW="8.8.8.8 8.8.4.4 208.67.220.220 208.67.222.222"

BLOCK="220.233.0.4 61.88.88.88 202.139.83.3 61.9.194.49 61.9.195.193 61.9.133.193 61.9.134.49 203.161.158.2"

echo "Testing DNS servers via UDP that are allowed to work..."

for d in $LOCAL $ALLOW; do

dig @$d somename +time=1 +tries=1 +notcp > /dev/null

[ $? -ne 0 ] && echo "FAIL: Failed to get response from $d via UDP." && exit 1

echo "PASS: $d responded via UDP."

done

echo "Testing DNS servers via TCP that are allowed to work..."

for d in $LOCAL $ALLOW; do

dig @$d somename +time=1 +tries=1 +tcp > /dev/null

[ $? -ne 0 ] && echo "FAIL: Failed to get response from $d via TCP." && exit 1

echo "PASS: $d responded via TCP."

done

echo "Testing DNS servers via UDP that are NOT allowed to work..."

for d in $BLOCK; do

dig @$d somename +time=1 +tries=1 +notcp > /dev/null

[ $? -eq 0 ] && echo "FAIL: Got response from blocked DNS server $d via UDP." && exit 1

echo "PASS: DNS server $d via UDP was correctly blocked."

done

echo "Testing DNS servers via TCP that are NOT allowed to work..."

for d in $BLOCK; do

dig @$d somename +time=1 +tries=1 +tcp > /dev/null

[ $? -eq 0 ] && echo "FAIL: Got response from blocked DNS server $d via TCP." && exit 1

echo "PASS: DNS server $d via TCP was correctly blocked."

done

echo ""

echo "Firewall PASSed."

Then I thought that I might be able to test other TCP blocking rules by setting the IP packet's time-to-live (TTL) to a small number and looking for ICMP time expired packets. To do this I needed to use tcptraceroute to get the core functionality. On the Mac I got this from fink.

If I get some response, the firewall is NOT blocking an outgoing port. If I get stars (* * *) then it probably is blocking the port.

#!/bin/bash

echo "Testing blocked outgoing port..."

# set TTL to 2 hops: host to ADSL router, ADSL router to ISP gateway

# if the ISP responds to TCP TTL timeouts then a blocked port should get '2 *'

# whereas an open outgoing port should get something more complicated like this:

# '2 37.1.233.220.static.exetel.com.au (220.233.1.37) 23.176 ms'

#sudo tcptraceroute -q 1 -w 1 -f 2 -m 2 www.google.com 79 | grep '2 *' && echo "blocked"

BLOCK="135 136 137 138 139 445 593 1863 110 9000 5190 23 1503 1720 53"

VICTIM="www.some.real.site.com"

# The victim should not get any packets if the firewall rules are right.

for p in $BLOCK; do

sudo tcptraceroute -q 1 -w 1 -f 2 -m 2 $VICTIM $p | grep '2 \*'

[ $? -ne 0 ] && echo "FAIL: Port $p is open for outgoing traffic." && exit 1

echo "PASS: Port $p is blocked for outgoing traffic."

echo ""

done

ALLOW="80 8080 443 25 21 119 22 123"

VICTIM="www.some.real.site.com"

# again, the victim should not get any packets since TTL is so small.

for p in $ALLOW; do

sudo tcptraceroute -q 1 -w 1 -f 2 -m 2 $VICTIM $p | grep '2 \*'

[ $? -eq 0 ] && echo "FAIL: Port $p is blocked for outgoing traffic." && exit 1

echo "PASS: Port $p is open for outgoing traffic."

echo ""

done

echo ""

echo "Firewall PASSed."

You will need to modify the script to suit your firewall rules.
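The four DNS loops in the first script differ only in the server list, the protocol and whether an answer is expected, so they could be folded into one helper. The following is a minimal sketch of that idea, not a replacement I have run against a real firewall; the function names probe and check and the words "allow" and "block" are made up for this example, while the dig options are the ones used above.

```bash
#!/bin/bash

# probe SERVER PROTO: succeeds when SERVER answers a DNS query.
# PROTO is "tcp" or "notcp" and is passed to dig as +tcp or +notcp.
probe() {
  dig "@$1" somename +time=1 +tries=1 "+$2" > /dev/null
}

# check EXPECT SERVER PROTO: EXPECT is "allow" or "block".
# Prints PASS or FAIL and returns non-zero on FAIL.
check() {
  local expect=$1 server=$2 proto=$3 answered=no
  probe "$server" "$proto" && answered=yes
  case "$expect:$answered" in
    allow:yes|block:no)
      echo "PASS: $server ($proto) answered=$answered" ;;
    *)
      echo "FAIL: $server ($proto) answered=$answered, expected $expect"
      return 1 ;;
  esac
}

# Example use with the variables from the script above:
#   for d in $LOCAL $ALLOW; do
#     check allow "$d" notcp || exit 1
#     check allow "$d" tcp   || exit 1
#   done
#   for d in $BLOCK; do
#     check block "$d" notcp || exit 1
#     check block "$d" tcp   || exit 1
#   done
```

Keeping the dig call in its own function also means the pass/fail logic can be tried out with a stand-in for probe before pointing it at real servers.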

Tuesday, 29 June 2010

A Post from the command line using GoogleCL

One small step for Google. One giant leap for scripting.

Sunday, 20 June 2010

Google Captured Data and Passwords in Australia - an Estimate

Most would have heard by now that Google's StreetView cars have been taking pictures of our streets and at the same time collecting WiFi SSIDs. But they have also been collecting other data that has got them into trouble with Governments and privacy groups.

http://googleblog.blogspot.com/search/label/privacy

Google sponsored report

http://www.google.com/googleblogs/pdfs/friedberg_sourcecode_analysis_060910.pdf

In Australia it is no different. Senator Conroy has been very vocal about this.

http://www.abc.net.au/news/stories/2010/05/25/2908415.htm

I was chatting to a friend and a quick calculation seemed to indicate that the amount of data and passwords captured must be small.

So, with lots of assumptions I have looked at two cases: the 'best' case and the 'worst' case.

To do any estimation I needed numbers. Fortunately our Bureau of Statistics (ABS) provided what I needed.

Here are my assumptions:

1. From the ABS December 2009, 5.2M ADSL and Cable subscribers. http://abs.gov.au/ausstats/abs@.nsf/mf/8153.0/

2. 114,400 TB downloaded data per year (ABS)

3. Uploaded data is 5 - 20% of downloaded data.

4. Between 50 and 80% of all subscribers use encrypted WiFi.

5. WiFi range from an indoor household Access Point is +-100m to +-250m.

6. The StreetView car samples 5 channels per second (See report).

7. Data in overlapping channels can be received in channels 1, 6, 11.

8. StreetView car travels at 30 - 50 km/h between 8:00 and 16:00. The car needs bright light to take photos and for safety (fatigue) reasons they would only do 8 hour shifts - probably with breaks every 2 hours or so.

9. Uploaded data is sent evenly throughout the day from domestic homes and that between 8:00 and 16:00 the upload data rates are average.

10. Households send 1 - 10 non-secured passwords per day.

11. Households use the internet between 16 and 20 hours per day.

WiFi Reception Performance

One important variable that I did not model was the error rate of received frames that is related to distance: the further away from a WiFi access point, the lower the chance of receiving a frame without error. The assumption that you can receive all frames within 100m (worst case) or within 250m (best case) is, frankly, silly and unrealistic. This will mean that any result is going to establish an upper limit for both cases.

WiFi Channels

So basically I assume the street car samples 3 channels at the rate of 5 per second. I conservatively assume that data on all the other WiFi channels can be read using just these 3 channels. I doubt this is correct so again it will establish an upper limit for both cases.

Reception Period

I calculate the time the car is in range of a WiFi base is between 14 and 60 seconds and can sample between 5 and 20 seconds worth of data. The car probably travels at a speed between 30 and 50 km/h and can collect WiFi frames from any given access point for one third of the time.
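These figures can be reproduced with a short calculation, assuming the car drives in a straight line directly past the access point so that it stays in range for twice the WiFi radius. This is a sketch of that arithmetic only and the function name is illustrative:

```bash
# in_range RANGE_M SPEED_KMH: prints the seconds spent in range, then the
# seconds actually sampled (one third of that time, as stated above).
in_range() {
  awk -v r="$1" -v v="$2" 'BEGIN {
    t = (2 * r) / (v / 3.6)   # path length in metres over speed in m/s
    printf "%.1f %.1f\n", t, t / 3
  }'
}

in_range 100 50   # short range, fast car: prints 14.4 4.8
in_range 250 30   # long range, slow car: prints 60.0 20.0
```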

Data Collected

By assuming that households transmit data continuously, and knowing the average amount of data sent each second I can estimate the amount of data collected on average per WiFi access point.

Australians downloaded 114,000 TB of data in 2009. Most web browsing is downloading, especially when we are talking about reading bank accounts and email.

Assuming uploaded data is between 5 and 20% of the amount we download, I arrived at an upload data amount of between 5700 and 23,000 TB per year or 400 - 1400 bps.

This means that they can record between 250 and 3400 Bytes of data per WiFi access point (SSID).
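The last step is just sampled seconds multiplied by the upload rate, divided by 8 to turn bits into bytes. A sketch using the rounded figures quoted above (5 to 20 seconds, 400 to 1400 bps); the function name is made up for illustration:

```bash
# bytes_captured SECONDS BITS_PER_SECOND: bytes seen while sampling one
# access point for SECONDS at an average upload rate of BITS_PER_SECOND.
bytes_captured() {
  awk -v s="$1" -v bps="$2" 'BEGIN { printf "%.0f\n", s * bps / 8 }'
}

bytes_captured 5 400     # low end: prints 250
bytes_captured 20 1400   # high end: prints 3500
```

The high end comes out at 3500 rather than 3400, presumably because the original figure was worked out from an unrounded data rate.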

Password Capture

I have assumed that passwords are sent in the clear (unencrypted) between 1 and 10 times per day. Most sites (such as banks) use HTTPS and email hosting services generally support encrypted SMTP/POP/IMAP to send account and password information. But some sites may still allow access to mail and other discussion sites using unencrypted passwords. It is these passwords that Google could have captured on unencrypted WiFi access points.

The probability of capturing a password is between 0.002% and 0.14% so this means that between 17 and 3600 passwords would be captured Australia-wide as the StreetView car drove by.
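One reading that is consistent with these totals, and it is only an assumption about how the figures combine, is subscribers multiplied by the unencrypted share (20% to 50%, from assumption 4) multiplied by the capture probability. A sketch with the rounded probabilities quoted above; the function name is illustrative:

```bash
# passwords_captured SUBSCRIBERS UNENCRYPTED_FRACTION PROBABILITY
# Expected number of passwords captured across all subscribers.
passwords_captured() {
  awk -v n="$1" -v frac="$2" -v p="$3" 'BEGIN { printf "%.0f\n", n * frac * p }'
}

passwords_captured 5200000 0.20 0.00002   # 0.002% low end: prints 21
passwords_captured 5200000 0.50 0.0014    # 0.14% high end: prints 3640
```

These land close to the 17 and 3600 quoted; the gap is what you would expect if the totals were computed from unrounded probabilities.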

Resulting Estimate

Data collected: 250 - 3400 Bytes per WiFi access point
Passwords collected: 17 - 3600 Australia-wide

The difference between my best and worst case is just over 1 order of magnitude for bytes collected and just over 2 orders of magnitude for passwords collected. This reflects the level of uncertainty in my estimates.

I think the real values will be much lower for these reasons:

1. During the day is not the peak time for downloads from households. On weekdays, on average, many family members will be at school or at work so it stands to reason that less internet activity will take place.

2. The share of encrypted WiFi access points seems to be closer to 80% than 50%. People are more aware of security and ISPs have done a lot to encourage the security of wireless networks.

3. Few services use unencrypted passwords - I cannot think of any except for POP based email. All banks use some form of encryption - to do otherwise would be incompetent. Unfortunately it may be the small businesses that allow staff to access their email via unencrypted POP that are letting their employees down. GMail and Yahoo only seem to allow encrypted authentication so your account name and password are safe.

4. WiFi range is probably not even 100m and hardly 250m and the ability to pick up a transmission from a laptop/mobile at these distances is low. The further away from a WiFi access point, the higher the probability of receiving an errored frame that would contain no reliable data.

5. Uploaded data is probably less than 5% of downloaded data. A packet containing an account name and password is small compared to the resulting page that gets downloaded.

6. Even when a WiFi access point is unencrypted, traffic to sites that require privacy is typically encrypted. So the actual unencrypted data is publicly available web pages, images, video and javascript code.

My Guess

The above best and worst cases set an upper limit. But the worst case is too optimistic regarding the amount of uploaded traffic, WiFi range and the number of unencrypted access points. So I would suggest the number of passwords collected to be around 17 - say 10 to 100 - Australia-wide and the amount of unencrypted personal data to be much less - say 10 to 100 bytes per household.

iPhoto Script to Remove Missing Photos

For some reason when you try to open photos you get a big exclamation mark! This seems to happen when the actual photo is missing. Perhaps it has been deleted through the file system or perhaps iPhoto is confused or perhaps because iPhoto crashed during some operation. Who knows.

I wrote this script to find these 'photos' and to move them to the trash.

tell application "iPhoto"

set curPhotos to selection

if (count of curPhotos) ≤ 0 then

display alert "You need to select the photos you want me to process."

else

set countPhotos to count of items in curPhotos

repeat with i from 1 to countPhotos

set thisPhoto to item i of curPhotos

try

set t to info for (image path of thisPhoto as POSIX file)

on error eStr number eNum partial result rList from badObj to expectedType

log eStr

select thisPhoto

remove thisPhoto

end try

end repeat

end if

end tell

Open Script Editor, cut and paste the above script into the editor, Compile it to check for errors, and save it to a file - perhaps on your desktop but anywhere is fine.

To run it, open iPhoto, select the Photos Library, Select all photos you want to process (Edit - Select All is what I usually do), and then switch back to the Script Editor and press Run.

When it finishes you may have some missing photos in your trash. You can decide what to do with them at this point - I just empty the trash.

If you use it, write a short comment about whether it was helpful or not or whether it worked or not. If you improve the script, let me know as well.

Saturday, 1 May 2010

Steve Jobs on Flash

Steve Jobs has posted his thoughts on Adobe's Flash and why Apple have not allowed Flash to be installed on iPhones and iPads.

Adobe have, probably accidentally, developed something like HTML5 years (8?) ahead of W3C. They saw the need for a standard, OS-agnostic platform for the development of applications and the presentation of content including video, audio, animation and interactivity.

Adobe's Actionscript was also fast. It seems that the browser makers felt that there was no need to work on Javascript speed because CPUs were getting faster each year - as a consequence of Moore's Law.

But something happened early this decade - the growth in CPU speed (clock rates) started to slow and, to compensate, manufacturers began to introduce multi-core processors.

Web pages, however, seem to be hard to render using multiple threads and so Javascript performance began to stagnate.

Enter Google. They believe in open standards, an open web and everything running in the browser, sourced from the internet. To make this possible, Javascript needed to be fast so they started the Chrome browser project which incorporated a new and fast V8 Javascript engine. Shortly afterwards, it seemed, Webkit (Apple) and Mozilla began to pick up their Javascript performance as well. And now, we see Microsoft is also working on Javascript performance and standards compliance for IE9.

Adobe have had a good run, but standardisation has caught them up. (In a similar way, standardisation caught Lotus Notes which, for the time it was developed, was - or appeared to be - visionary: Tabbed workspace, forms, separation of data from presentation, security, encryption, signed applications...).

Back to Steve Jobs' posting. Steve thinks that Flash is closed and that Apple is the exact opposite - meaning open.

Open

First, there’s “Open”.

Adobe’s Flash products are 100% proprietary. They are only available from Adobe, and Adobe has sole authority as to their future enhancement, pricing, etc. While Adobe’s Flash products are widely available, this does not mean they are open, since they are controlled entirely by Adobe and available only from Adobe. By almost any definition, Flash is a closed system.

Apple has many proprietary products too. Though the operating system for the iPhone, iPod and iPad is proprietary, we strongly believe that all standards pertaining to the web should be open. Rather than use Flash, Apple has adopted HTML5, CSS and JavaScript – all open standards. Apple’s mobile devices all ship with high performance, low power implementations of these open standards. HTML5, the new web standard that has been adopted by Apple, Google and many others, lets web developers create advanced graphics, typography, animations and transitions without relying on third party browser plug-ins (like Flash). HTML5 is completely open and controlled by a standards committee, of which Apple is a member.

Apple even creates open standards for the web. For example, Apple began with a small open source project and created WebKit, a complete open-source HTML5 rendering engine that is the heart of the Safari web browser used in all our products. WebKit has been widely adopted. Google uses it for Android’s browser, Palm uses it, Nokia uses it, and RIM (Blackberry) has announced they will use it too. Almost every smartphone web browser other than Microsoft’s uses WebKit. By making its WebKit technology open, Apple has set the standard for mobile web browsers.

What if we take what Steve wrote and swap 'Apple' with 'Adobe' and 'Flash' with 'Mac OS X'? This is what we get:

Apple’s Mac OS X products are 100% proprietary. They are only available from Apple, and Apple has sole authority as to their future enhancement, pricing, etc. While Apple’s Mac OS X products are widely available, this does not mean they are open, since they are controlled entirely by Apple and available only from Apple. By almost any definition, Mac OS X is a closed system.

It isn't perfect, but it is very close to the truth. Apple use open source, contribute to and develop with open source software, but they produce very proprietary software. You cannot run OS X on any hardware other than Apple hardware. To write well-integrated OS X applications, you need to use Apple's proprietary interfaces - Carbon or Cocoa. These applications will not run on other OSs (Windows or Linux) and so some developers choose to use frameworks that allow developers to write applications that will run on any OS platform - or they use Java. Steve doesn't like this.

And now, you have to use Apple's APIs directly to write applications for the iPhone and iPad. This has upset Adobe (and probably a number of other organisations that make cross-platform frameworks such as XMLVM).

I think Steve accepts an open web, but everything else should be closed, and Apple is certainly insisting on this path. The iTunes store can only really be used with iTunes and iTunes can only be used with iPods and iPhones and now iPads.

I wonder when iTunes will stop supporting Windows?

iPhoto, iDVD and iMovie only work on OS X too. Keeping your photos on a Mac using Apple's software does tend to lock you in to using Apple hardware and software for a long time.

Full Web

Second, there’s the “full web”.

Adobe has repeatedly said that Apple mobile devices cannot access “the full web” because 75% of video on the web is in Flash. What they don’t say is that almost all this video is also available in a more modern format, H.264, and viewable on iPhones, iPods and iPads. YouTube, with an estimated 40% of the web’s video, shines in an app bundled on all Apple mobile devices, with the iPad offering perhaps the best YouTube discovery and viewing experience ever. Add to this video from Vimeo, Netflix, Facebook, ABC, CBS, CNN, MSNBC, Fox News, ESPN, NPR, Time, The New York Times, The Wall Street Journal, Sports Illustrated, People, National Geographic, and many, many others. iPhone, iPod and iPad users aren’t missing much video.

Another Adobe claim is that Apple devices cannot play Flash games. This is true. Fortunately, there are over 50,000 games and entertainment titles on the App Store, and many of them are free. There are more games and entertainment titles available for iPhone, iPod and iPad than for any other platform in the world.

Adobe claim that by not having Flash, iPhone users are missing out on the full web experience. It is true that Flash content cannot be displayed on the iPhone/iPad, but I think most web sites will develop special versions of their sites specifically for iPhone/Android devices that have limited screen sizes and limited user input interfaces: mice and touch pads offer very fine pointing and clicking controls whereas fingers are a little less accurate and cover up what you are touching. On-screen keyboards are great but they are no match for a reasonably large physical keyboard.

In time, keyboards may well disappear, but they will be replaced with something that works as well as the real thing, not something that slows you down.

So iPhone and Android users alike are already missing some of the full web experience, but they have the advantage of mobility and newer customised web sites that will only make the experience better.

The existence or lack of H.264 is not really an issue. Flash now supports H.264 and any video will be in a format that iPhone users will be able to view. Interestingly H.264 is proprietary (and Apple has some interest in the patents associated with it) so Steve is not pushing for an open web experience here - he wants royalties and refuses to add the open source Ogg/Theora audio and video formats to Safari to help make the web truly open and free.

Security

Third, there’s reliability, security and performance.

Symantec recently highlighted Flash for having one of the worst security records in 2009. We also know first hand that Flash is the number one reason Macs crash. We have been working with Adobe to fix these problems, but they have persisted for several years now. We don’t want to reduce the reliability and security of our iPhones, iPods and iPads by adding Flash.

In addition, Flash has not performed well on mobile devices. We have routinely asked Adobe to show us Flash performing well on a mobile device, any mobile device, for a few years now. We have never seen it. Adobe publicly said that Flash would ship on a smartphone in early 2009, then the second half of 2009, then the first half of 2010, and now they say the second half of 2010. We think it will eventually ship, but we’re glad we didn’t hold our breath. Who knows how it will perform?

Steve is also worried about Adobe's Flash reliability and security. He has the statistics and claims that Flash is the number one cause of Mac crashes. I wonder what the number two cause is?

So, he says that it is best to keep Flash away from the iPhone and iPad.

Google has taken a different, seemingly more rational approach. They have decided to include Flash in Chrome and have made plans to address reliability and security issues.

Google seeks to eliminate problems rather than add layers to reduce risk. Their Native Client does just this: an architecture to allow any plugin to run so long as it can be validated that it complies with hard rules that prevent software doing anything harmful. This has to be better than validated compiler tool chains, signed applications, layers of malware filtering and heuristic code analysis.

Battery Life

Fourth, there’s battery life.

To achieve long battery life when playing video, mobile devices must decode the video in hardware; decoding it in software uses too much power. Many of the chips used in modern mobile devices contain a decoder called H.264 – an industry standard that is used in every Blu-ray DVD player and has been adopted by Apple, Google (YouTube), Vimeo, Netflix and many other companies.

Although Flash has recently added support for H.264, the video on almost all Flash websites currently requires an older generation decoder that is not implemented in mobile chips and must be run in software. The difference is striking: on an iPhone, for example, H.264 videos play for up to 10 hours, while videos decoded in software play for less than 5 hours before the battery is fully drained.

When websites re-encode their videos using H.264, they can offer them without using Flash at all. They play perfectly in browsers like Apple’s Safari and Google’s Chrome without any plugins whatsoever, and look great on iPhones, iPods and iPads.

Steve ignores other Flash applications here (which may or may not be kind to battery life) and focuses on H.264 video playback. Again, if Flash now supports H.264 then this is a non-issue (except that web sites would need to re-encode their content which they have to do for the iPhone/iPad anyway).

Touch

Fifth, there’s Touch.

Flash was designed for PCs using mice, not for touch screens using fingers. For example, many Flash websites rely on “rollovers”, which pop up menus or other elements when the mouse arrow hovers over a specific spot. Apple’s revolutionary multi-touch interface doesn’t use a mouse, and there is no concept of a rollover. Most Flash websites will need to be rewritten to support touch-based devices. If developers need to rewrite their Flash websites, why not use modern technologies like HTML5, CSS and JavaScript?

Even if iPhones, iPods and iPads ran Flash, it would not solve the problem that most Flash websites need to be rewritten to support touch-based devices.

Steve says that touch interfaces don't work the same as mice/touchpads and therefore Flash applications won't work anyway. Interestingly, Flash started out as a drawing application for PenPoint OS, a pen-based system which may have had similar behaviour to a touch interface, but I don't know.

What Steve fails to mention is that many web sites also make use of mouseover events to show text and graphics as your mouse pointer hovers over a particular word, link or image. Blogger uses tooltips which help a little in explaining the function of a button. Even Apple's web store for the iPhone uses 'rollovers' to display the help, account and cart menus! I guess Apple had to re-write these sites for the iPhone.

Actually, I just checked - the store is virtually unusable on an iPod touch. You can click on the help menu and a menu will be displayed so you can then double-touch to zoom in. Wouldn't this work for Flash too?

All web sites that need mouseover events to operate will have to be re-written for the iPhone so banning Flash does not fix this - the web site owner needs to do some work to make their sites more accessible for iPhone and iPad users. So why is this an issue Steve? This looks like hypocrisy to me.

The 'real' reason

Sixth, the most important reason.

Besides the fact that Flash is closed and proprietary, has major technical drawbacks, and doesn’t support touch based devices, there is an even more important reason we do not allow Flash on iPhones, iPods and iPads. We have discussed the downsides of using Flash to play video and interactive content from websites, but Adobe also wants developers to adopt Flash to create apps that run on our mobile devices.

We know from painful experience that letting a third party layer of software come between the platform and the developer ultimately results in sub-standard apps and hinders the enhancement and progress of the platform. If developers grow dependent on third party development libraries and tools, they can only take advantage of platform enhancements if and when the third party chooses to adopt the new features. We cannot be at the mercy of a third party deciding if and when they will make our enhancements available to our developers.

This becomes even worse if the third party is supplying a cross platform development tool. The third party may not adopt enhancements from one platform unless they are available on all of their supported platforms. Hence developers only have access to the lowest common denominator set of features. Again, we cannot accept an outcome where developers are blocked from using our innovations and enhancements because they are not available on our competitor’s platforms.

Flash is a cross platform development tool. It is not Adobe’s goal to help developers write the best iPhone, iPod and iPad apps. It is their goal to help developers write cross platform apps. And Adobe has been painfully slow to adopt enhancements to Apple’s platforms. For example, although Mac OS X has been shipping for almost 10 years now, Adobe just adopted it fully (Cocoa) two weeks ago when they shipped CS5. Adobe was the last major third party developer to fully adopt Mac OS X.

Our motivation is simple – we want to provide the most advanced and innovative platform to our developers, and we want them to stand directly on the shoulders of this platform and create the best apps the world has ever seen. We want to continually enhance the platform so developers can create even more amazing, powerful, fun and useful applications. Everyone wins – we sell more devices because we have the best apps, developers reach a wider and wider audience and customer base, and users are continually delighted by the best and broadest selection of apps on any platform.

Steve simply wants to make it hard for any developer to write applications for multiple platforms.

It may be true that frameworks limit the features, but equally it may be true that the Apple iPhone/iPad OS is the lowest common denominator - why do you think that Apple will always have more features than your competitors? Can you merge directories of the same name in OS X Finder yet?

If this is a real reason, then why not specify that any framework must support the whole API? This surely would address your concerns about having all the features available to the developer.

Conclusion

I am not a fan of Flash, but it is generally required for YouTube for PC and laptop users.

I agree that HTML5 is the future but I disagree that it should only include a patented and proprietary H.264.

Apple could allow Flash on the iPhone since web sites have to be re-written anyway for iPhone users.

Battery life while playing videos may not be an issue if Flash on the iPhone/iPad used H.264 and Flash had access to Apple's H.264 API.

Security is solvable and Google and Adobe seem to be about to demonstrate this.

Postscript

The person who sent me the link to Steve's posting owns a Mac and an iPod (that I know of for certain). They are going to say 'bye bye' to Apple based on Steve's compelling argument against proprietary software:

Jobs makes a compelling case for not trusting a proprietary company who has sole control over their proprietary products.

So, bye bye Apple.

I have 3 MacBook Pros, an iPod Nano, an iPod Touch and a Time Capsule, and have been influential in at least the purchase of a Mac Mini over the last 5 years (about $15,000 worth at time of purchase). I am re-considering my use of Apple software and hardware and I will certainly not purchase anywhere near the amount of Apple products in the next 5 years - if any.

I will also be removing the shackles of proprietary Apple software by moving my photo and music collections to Open Source software and online services such as Google Docs. Not just because of this open letter about Flash, but because Apple is removing people's freedom to develop and use software the way they choose to.

Saturday, 17 April 2010

Simple Rolling Hash- Part 3

The Rolling Hash

Now that there is a way to remove the first character and append a new last character (in constant time) to a string hash value, we can also use this hash function suite to perform substring searches in linear time.

To find a substring of length m in a string of length n you might proceed as follows:

Start at the first character of the string

Compare the substring to the string for m characters.

If same, then the substring is found.

else, repeat from the next character of the string (stop when there are fewer than m characters left in the string)

In practice this works well since it is unlikely that more than a few characters will match each round, so the effective time is much less than O(N x M). But for some strings, such as DNA which have small alphabets, the number of matches in each round could be large.
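The simple search described above can be sketched in Python as follows. The function name naive_find is only for illustration and does not come from the original code.

```python
def naive_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    # Stop when fewer than m characters remain in the string.
    for start in range(n - m + 1):
        # Compare up to m characters, giving up at the first mismatch.
        i = 0
        while i < m and text[start + i] == pattern[i]:
            i += 1
        if i == m:
            return start
    return -1
```

The inner loop usually stops after a character or two, which is why the O(N x M) worst case rarely shows up on ordinary text.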

By using a rolling hash this can be reduced to linear-time.

Calculate the hash of the substring we are searching for.

Start with a hash of the first m characters of the string

If the hash values match then the substring is found - probably.

else remove the first character of the string from the hash and add the m+1 th character to the hash.

Now the hash is of m characters from the next character of the string.

This process is O(N).

The catch is that this finds a substring with high probability. To ensure that it is a match we need to check each character.
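Here is a minimal sketch of the rolling search, including the character-by-character check on a hash match. It assumes the hash from the earlier parts of this series is h = h * 17 + character code, kept to 28 bits and starting from 0; that form is my reading of the Part 2 formulas, and the names poly_hash and rolling_find are made up for this example.

```python
K = 17          # multiplier (assumed from Part 2)
MOD = 2 ** 28   # hash values are kept to 28 bits


def poly_hash(s: str) -> int:
    """Assumed hash: h = h * K + character code, reduced to 28 bits."""
    h = 0
    for ch in s:
        h = (h * K + ord(ch)) % MOD
    return h


def rolling_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    target = poly_hash(pattern)
    window = poly_hash(text[:m])
    top = pow(K, m - 1, MOD)  # weight of the first character in the window
    for start in range(n - m + 1):
        # A hash match is only "probably" a match, so confirm it.
        if window == target and text[start:start + m] == pattern:
            return start
        if start + m < n:
            # Remove the first character, then append the next one.
            window = (window - top * ord(text[start])) % MOD
            window = (window * K + ord(text[start + m])) % MOD
    return -1
```

Each step of the loop does a fixed amount of hash arithmetic, and the full comparison only runs when the hashes agree.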

A Good Hash Function

My next question was 'is this hash good enough?'.

After some research I found that hashing is similar to a problem called Balls and Bins. The idea is to analyse mathematically the result of randomly throwing m balls into n bins.

It turns out that if a hash function is 'good' then it should produce results similar to randomly throwing balls (strings) into bins (string hash numbers).

I am no mathematician so this was slow going and a bit frustrating since I don't know all the tricks used to simplify problems.

I found that for a good hash function, the number of unused hash numbers after generating hash numbers from n random-like strings should be n/e. This is easy to test: I made an array of size n, generated random strings, hashed them, and incremented the element of the array indexed by the hash number.

(To keep the array size small, I first reduced the 28 bit hash number to a 16 bit number by mod 65536).
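The test can be sketched roughly like this. The hash form (h = h * 17 + character code, kept to 28 bits) is assumed from the other parts of this series, and the string length, alphabet and seed are arbitrary choices for the example rather than the values I actually used.

```python
import math
import random

K = 17
MOD = 2 ** 28
BINS = 65536  # 16-bit table, as described in the text


def poly_hash(s: str) -> int:
    """Assumed hash: h = h * K + character code, reduced to 28 bits."""
    h = 0
    for ch in s:
        h = (h * K + ord(ch)) % MOD
    return h


def bin_counts(num_strings: int, length: int = 8, seed: int = 1) -> list:
    """Hash random lowercase strings and count how often each bin is hit."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    counts = [0] * BINS
    for _ in range(num_strings):
        s = "".join(rng.choice(alphabet) for _ in range(length))
        counts[poly_hash(s) % BINS] += 1
    return counts


counts = bin_counts(BINS)  # n strings into n bins
unused = counts.count(0)
print("unused bins:", unused, "n/e is about", round(BINS / math.e))
```

If the hash behaves like random throws, the two printed numbers should be close.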

The number of hash numbers used exactly once should also be about n/e (a probability of 1/e for each hash number). Likewise it is easy to calculate the probability of a hash number being used k times:

Pr[ k ] = 1 / ( e * k! )

And the probability that a hash number is used k or more times is roughly

Pr[ >=k ] ~ ((e/k)**k)/e
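As a quick check of these two formulas, the sketch below prints the exact 'k or more' probability (one minus the sum of the smaller cases) next to the rough estimate. The function names are made up for this example, and the estimate is only meant for k of 1 or more.

```python
import math


def pr_exactly(k: int) -> float:
    """Pr[ k ] = 1 / ( e * k! )"""
    return 1.0 / (math.e * math.factorial(k))


def pr_at_least_estimate(k: int) -> float:
    """Pr[ >=k ] ~ ((e/k)**k)/e, for k >= 1."""
    return (math.e / k) ** k / math.e


for k in range(1, 6):
    # Exact tail: 1 minus the probability of fewer than k uses.
    tail = 1.0 - sum(pr_exactly(j) for j in range(k))
    print(k, pr_exactly(k), tail, pr_at_least_estimate(k))
```

The estimate sits above the exact tail, so it is better read as an upper limit than as a close approximation.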

(to be completed)

Friday, 16 April 2010

Simple Rolling Hash- Part 2

String Slicing

I also wanted to extract substrings of appended or fixed-length strings quickly as well.

Since most strings would be less than 64 characters (I reasoned), this too would be virtually constant-time.

In the case of appended strings, the slicing would be proportional to the number of appended strings which would be roughly O(N) but with a very small constant factor in the order of at least 1/10 th on average.

Again, to ensure that the hash calculation would not harm performance, I needed a way to adjust the hash of a string by the hash of the leading or trailing substrings.

eg.

hash( "def" ) = some_function( hash ( "abcdef" ) , hash( "abc" ) )

hash( "abc" ) = some_function( hash ( "abcdef" ) , hash( "def" ) )

To do this I needed to choose the value of k carefully so that k * inv(k) = 2**28 + 1, where inv(k) is the multiplicative inverse of k modulo 2**28 - ie. k and inv(k) are factors of 2**28 + 1, so their product is 1 once it is reduced to 28 bits.

Lucky Break

This part is simply a fluke. I needed my hash to be 28 bits so that I can store type information in the other 4 bits of a 32-bit value.

I could use a hash that is less than 28 bits, but not more. Fortunately, 2**28 + 1, which is 268,435,457, has two prime factors: 17 and 15790321.
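The arithmetic is easy to check. This small sketch confirms the product and shows that multiplying by 15790321 undoes a multiplication by 17 when values are kept to 28 bits; the constant names are just for this example.

```python
K = 17
INV_K = 15790321
MOD = 2 ** 28  # 28-bit hash values

# The two factors multiply to 2**28 + 1 ...
print(K * INV_K == MOD + 1)

# ... so their product is 1 once reduced to 28 bits.
print((K * INV_K) % MOD)

# Multiplying by INV_K therefore "divides" by K.
x = 123456789
print(((x * K) % MOD * INV_K) % MOD == x)
```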

It is also fortunate that 17 is a reasonable multiplier for a hash as well since, I reasoned, it preserves the low 4 bits of each character which contain the most information for the letters and numbers and punctuation characters.

I have found another hash function that uses k=17.

http://www.codeproject.com/KB/recipes/hash_functions.aspx

Removing Characters and Calculating the New Hash in O(1)

To remove a leading character from a string of length n, you need to remove its component of the hash, which has been multiplied by k, n-1 times. So the new hash is

hash( s' ) = hash( s ) - k**(n-1) * c, where c is the first character of the string s

This can be done in constant-time, but removing M characters takes linear-time, O(M).

In the more general case, removing a leading substring of m characters follows the same pattern: here the leading substring has been multiplied by k, n - m times.

hash( s' ) = hash( s ) - k**(n-m) * hash( f ), where f represents the leading m characters of the string s.
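A sketch of the leading-substring case, assuming the hash is h = h * 17 + character code kept to 28 bits (the form these formulas imply) and starting from 0. The names poly_hash and drop_prefix are made up for this example. Removing a single leading character is the same call with m = 1 and the character code as the prefix hash.

```python
K = 17
MOD = 2 ** 28


def poly_hash(s: str) -> int:
    """Assumed hash: h = h * K + character code, reduced to 28 bits."""
    h = 0
    for ch in s:
        h = (h * K + ord(ch)) % MOD
    return h


def drop_prefix(h_s: int, n: int, h_f: int, m: int) -> int:
    """hash( s' ) = hash( s ) - k**(n-m) * hash( f ), kept to 28 bits.

    h_s is the hash of s (length n); h_f is the hash of its leading
    m characters.
    """
    return (h_s - pow(K, n - m, MOD) * h_f) % MOD


print(drop_prefix(poly_hash("abcdef"), 6, poly_hash("abc"), 3) == poly_hash("def"))
```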

To remove the trailing character from a string of length n, you need to subtract it from the hash and then divide the hash of the string by k. Division can be performed by multiplying by the inverse of k which is 15790321.

hash( s' ) = ( hash( s ) - c ) * inv( k ), where c is the last character of the string s.

To remove the trailing m characters from a string of length n, you need to subtract the hash of the m characters and then divide the hash of the string by k**m. Again, division can be performed by multiplying by the inverse of k which is 15790321.

hash( s' ) = ( hash( s ) - hash( t ) ) * inv( k )**m, where t is the last m characters of the string s.
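The trailing case can be sketched the same way, again assuming the hash is h = h * 17 + character code kept to 28 bits and starting from 0. The names poly_hash and drop_suffix are made up for this example; removing one trailing character is the call with m = 1 and the character code as the suffix hash.

```python
K = 17
INV_K = 15790321  # inverse of K for 28-bit arithmetic
MOD = 2 ** 28


def poly_hash(s: str) -> int:
    """Assumed hash: h = h * K + character code, reduced to 28 bits."""
    h = 0
    for ch in s:
        h = (h * K + ord(ch)) % MOD
    return h


def drop_suffix(h_s: int, h_t: int, m: int) -> int:
    """hash( s' ) = ( hash( s ) - hash( t ) ) * inv( k )**m, kept to 28 bits.

    h_s is the hash of s; h_t is the hash of its trailing m characters.
    """
    return ((h_s - h_t) * pow(INV_K, m, MOD)) % MOD


print(drop_suffix(poly_hash("abcdef"), poly_hash("def"), 3) == poly_hash("abc"))
```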

Unfortunately in the general case, to remove substrings from a string, I need the hash of each substring, which takes linear-time.

Reducing Substring Hash Time

Here is where the fixed-length strings help: The only substrings that need to be calculated are the first and last fixed-length strings of a slice. And these substrings are between 1 and 63 characters long.

eg. Assume the fixed-length strings are 3 characters long and slice a string to return 6 characters from the second.

( "abc" , "def" , "ghi" ) :2:6 -> ( "bc" , "def" , "g" )

In this case, the hash for "bc" and "g" have to be calculated, but the hash for "def" remains the same.
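The example above can be sketched as follows. Each fixed-length string is held as a (text, hash) pair, which is an assumed layout for illustration and not necessarily how my implementation stores them. For simplicity this sketch always re-hashes a partial piece rather than adjusting the old hash, and it takes a 1-based start position as in the example.

```python
K = 17
MOD = 2 ** 28


def poly_hash(s: str) -> int:
    """Assumed hash: h = h * K + character code, reduced to 28 bits."""
    h = 0
    for ch in s:
        h = (h * K + ord(ch)) % MOD
    return h


def slice_chunks(chunks, start, length):
    """Slice a list of (text, hash) pairs; start is 1-based."""
    begin = start - 1
    end = begin + length
    out = []
    pos = 0  # offset of the current chunk within the whole string
    for text, h in chunks:
        lo = max(begin, pos) - pos
        hi = min(end, pos + len(text)) - pos
        if lo < hi:
            if lo == 0 and hi == len(text):
                out.append((text, h))  # whole chunk kept: reuse its hash
            else:
                part = text[lo:hi]
                out.append((part, poly_hash(part)))  # partial: re-hash
        pos += len(text)
    return out


chunks = [(t, poly_hash(t)) for t in ("abc", "def", "ghi")]
print([t for t, _ in slice_chunks(chunks, 2, 6)])
```

Only the first and last pieces of the result are ever re-hashed; every whole chunk in between keeps the hash it already had.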

We can always guarantee minimal time by observing that if the substring to be removed is greater than half the length of the fixed-length string then it is better to re-calculate the hash of the resulting substring. This way, the worst case is hashing one fixed-length string's worth of characters, or 64 in my case.

If it happens that the hash of the substrings is known then we get back to constant-time.