Sunday, 21 June 2009

Browser Benchmarks - too many variables

Laptop Battery Benchmark

Slashdot has a link to an article which presents a claim by AMD that the MobileMark 2007 battery benchmarking specification does not represent typical laptop use - in fact, AMD claims that the test basically runs the laptop at idle with the screen dimmed and with wifi turned off.

I know I don't get anything near the battery life Apple claimed for my MacBook Pro 2008 - and I run with a dimmed screen, Bluetooth off and with the under-volted processor tweaked down to 600MHz on idle. My laptop typically runs between 15 and 20 degrees C above ambient. For example, it is 18 degrees C inside and the CPU is running at 34 degrees C.

Whether the claims are true or not, I wonder how the current browser benchmark tests relate to typical use. Is there such a thing as typical use? And how do a laptop's or desktop's energy settings affect the result?

Browser Benchmark Tests

There are a number of browser JavaScript benchmark tests online: V8, SunSpider and Dromaeo. Dromaeo takes too long to run so I have only used V8 and SunSpider.

The V8 benchmark is set up by Google. There are currently 4 versions of the test and it is a very quick test. SunSpider is built by the WebKit project. WebKit is a fork of KHTML, the engine that Konqueror is built on. Apple bases Safari on WebKit. Dromaeo is built by Mozilla.

I like to experiment. I use Shiretoko (Firefox beta) for Mac mostly. I also have Firefox 3, Safari 4, and a suite of development or experimental browsers: WebKit, Stainless, Chrome and Chromium. All but the Firefox builds are based on WebKit, although Google uses its own V8 JavaScript engine. I have most of these running on an old PowerBook as well.

I've been interested in how their JavaScript engines are performing so I occasionally download the latest nightly build and run a quick test. It occurred to me that the results are affected by what else the laptop is doing and how the operating system has throttled the CPU. So I started to fix my CPU speed and shut down most applications before I ran the tests. But in the background all sorts of updating and backup utilities are running, and if one of them decides to start up during a test, the results will be poorer.

I have not read about anyone else fixing their CPU speed before running the test. Perhaps it is not important. Perhaps the CPU and its throttling techniques know not to adjust CPU frequencies while benchmark tests are running, but somehow I doubt it.

I think we need benchmark tests that ignore other tasks and garbage collection, and somehow ensure that CPU frequency and caching do not affect the result - ideally, every time the test is run, the result should be the same. Otherwise there are too many uncontrolled variables that prevent any useful comparison.
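One partial remedy that needs no special tooling is to run each test several times and report the spread as well as the headline number. The following is only a rough sketch of that idea in Python, not how V8 or SunSpider work internally; the `benchmark` and `workload` names are made up for illustration, and the workload is a stand-in for a real test.

```python
import statistics
import time


def benchmark(func, runs=10, warmup=2):
    """Time func() several times and summarise how much the runs vary."""
    # Warm-up runs let caches and any lazy initialisation settle first.
    for _ in range(warmup):
        func()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)

    median = statistics.median(samples)
    return {
        "min": min(samples),
        "median": median,
        "max": max(samples),
        # Relative spread: (slowest - fastest) / median.
        "spread": (max(samples) - min(samples)) / median if median > 0 else 0.0,
    }


def workload():
    # Hypothetical stand-in for a real benchmark test case.
    return sum(i * i for i in range(100_000))


if __name__ == "__main__":
    result = benchmark(workload)
    print("median %.4fs, spread %.0f%%" % (result["median"], result["spread"] * 100))
```

A large spread is a hint that something else (a backup utility, CPU throttling) interfered, and that the run should be repeated before comparing browsers.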

How about we put some scientific method back into Computer Science and Software Engineering?

My Results


For those that are interested, here are my rounded results and some graphs to visualize the data. Given the lack of accuracy in the measurements, I suggest that an error estimate of 20% is as good as any other.
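To show what a 20% error means when comparing two scores, here is a small Python sketch. It treats each score as a range of plus or minus 20% and only calls two scores different when the ranges do not overlap. The function names and the example numbers are invented for illustration; they are not my measurements.

```python
def interval(value, rel_error=0.20):
    """Return the (low, high) range for a score with a relative error."""
    return (value * (1 - rel_error), value * (1 + rel_error))


def clearly_different(a, b, rel_error=0.20):
    """True only if the two error ranges do not overlap."""
    low_a, high_a = interval(a, rel_error)
    low_b, high_b = interval(b, rel_error)
    return high_a < low_b or high_b < low_a


if __name__ == "__main__":
    # Hypothetical scores: a 10x gap survives a 20% error easily...
    print(clearly_different(100, 1000))
    # ...but a 30% gap does not.
    print(clearly_different(100, 130))
```

By this rough rule, differences of 10x or 5x are real, while differences of a few tens of percent between two browsers should not be read as meaningful.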

You can see that for me, Firefox is not performing as well on the tests as the other browsers. This doesn't mean that Firefox is not usable - I use it more than the others combined. It does mean that Mozilla can do better. All the WebKit-based browsers do well. They are more than 10x faster than Firefox 3 on the V8 tests and take 1/5th the time on the SunSpider tests.

The dual-core MacBook Pro 2.4GHz is about 10x faster than the PowerBook G4 1.67GHz. Clock speed alone cannot explain a gap that size, so it is probably due to the work being done in optimizing the JavaScript compilers for the Intel instruction set - it seems that the PowerPC is not a high priority.

All browsers (except Chrome) run well on the PowerBook, which is our main machine.


Postscript

On reflection, if we are to be more scientific then we should have some predictions as well.

Perhaps we can predict the optimum performance of an algorithm or test case running on a particular CPU. We would then have a target for our JavaScript compilers to aspire to. Of course we need to take language overheads, if any, into account.
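As a very rough sketch of what such a prediction could look like, the Python below estimates a lower bound on run time from an operation count and a clock speed, then expresses a measured time as a fraction of that ideal. Everything here is an assumption for illustration: the function names, the operation count, the measured time and the idea that one operation completes per cycle. Real CPUs and language overheads make the picture far messier.

```python
def ideal_seconds(operations, clock_hz, ops_per_cycle=1.0):
    """Lower bound on run time if every cycle did useful work (assumed model)."""
    return operations / (clock_hz * ops_per_cycle)


def efficiency(ideal, measured):
    """Fraction of the ideal achieved: 1.0 means the target was hit."""
    return ideal / measured


if __name__ == "__main__":
    # Hypothetical test case: 2.4 billion operations on a 2.4GHz CPU.
    ideal = ideal_seconds(2.4e9, 2.4e9)
    # Hypothetical measured time of 4 seconds for the same test.
    print("ideal %.2fs, efficiency %.0f%%" % (ideal, efficiency(ideal, 4.0) * 100))
```

The point is only that a target number turns "faster than last month" into "this far from what the hardware could do".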