Thursday, March 22, 2007


Twice the speed, half the size

Gathering the performance history of Nutch is now complete. I am glad to announce that the soon-to-be-released Nutch 0.9.0 will be twice as fast as 0.8.x (with the configuration used in the bench). At the same time, the crawled data will take only about half as much disk space as before - thanks to Hadoop.

The following graph shows how the size of identical crawls has changed over time.

Time spent crawling is plotted below.


Wednesday, March 14, 2007


Performance history for Nutch

Today I started a bench marathon to build a relative performance history of Nutch over the last 200 or so revisions. The measuring process is very simple. First the revision is checked out, compiled, and configured. Then a full crawl cycle is executed - inject, generate, fetch, updatedb - and each phase is timed.
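The per-phase timing can be sketched in a few lines of Python. This is a minimal illustration, not the actual bench script; the stand-in commands are placeholders for whatever the real bench invokes (e.g. bin/nutch with the right arguments):

```python
import subprocess
import time

def time_phase(cmd):
    """Run one crawl phase as an external command and return elapsed seconds."""
    start = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - start

def bench_revision(phase_cmds):
    """Time each crawl phase in order; phase_cmds maps phase name -> command."""
    timings = {}
    for phase in ("inject", "generate", "fetch", "updatedb"):
        timings[phase] = time_phase(phase_cmds[phase])
    timings["total"] = sum(timings.values())
    return timings

# Stand-in no-op command; a real run would call e.g. ["bin/nutch", "inject", ...]
dummy = ["python3", "-c", "pass"]
results = bench_revision({p: dummy for p in ("inject", "generate", "fetch", "updatedb")})
```

The same loop just gets wrapped in an outer loop over revisions (checkout, compile, configure) to produce one result row per revision.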

The crawl is run against a local HTTP server to eliminate external factors from the results. The crawl content consists of 11062 HTML pages (the javadoc for Java 6), served by a local Apache httpd. The size of each crawl is also recorded.

Why such effort? Crawling performance is a critical aspect of any search engine (ok, there are the features too), and that aspect is currently not measured regularly in Nutch. By analysing the (upcoming) results we can hopefully learn how different commits have affected overall crawling performance. It might even make sense to continue measuring relative performance after every commit in the future, just to make sure nothing seriously wrong gets checked in (we'll judge that after the experiment is over ;).

The results will be published in real time as they are gathered, in textual format as well as in the graph below. The format of the text file is as follows:

revision, total (s), inject (s), generate (s), fetch (s), updatedb (s), size of crawl dir (kb)
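For anyone who wants to follow along, a line in that format is easy to parse. A small (hypothetical, not part of the bench itself) Python helper:

```python
def parse_result_line(line):
    """Parse one result row: revision, total, inject, generate, fetch, updatedb, size (kb)."""
    fields = [f.strip() for f in line.split(",")]
    keys = ("revision", "total_s", "inject_s", "generate_s",
            "fetch_s", "updatedb_s", "size_kb")
    row = dict(zip(keys, fields))
    row["revision"] = int(row["revision"])
    for k in keys[1:]:
        row[k] = float(row[k])
    return row

# Example line in the published format (the values here are made up for illustration)
row = parse_result_line("500000, 120.5, 3.2, 10.1, 95.0, 12.2, 410000")
```

From there it is straightforward to feed the rows into a plotting tool to reproduce the graphs.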

If the speed stays as it has been for the first few rounds, the results should be complete in 3-4 days.

Disclaimer: The only purpose of this experiment is to look at how relative performance correlates with changes committed to trunk, using a very limited test. Some bench rounds also seem to fail for various reasons, which is why there is some turbulence in the data points. The trend and end result will be a surprise to me too, as I have not run similar benchmarks before with current versions.

Update (2007-03-18): I will be running the failing points again after the first run completes. I also need to rerun some of the recent runs, because a configuration error prevented the space savings from showing up. The Hadoop native libs are not currently working on RH5 because of a bug in the bin/nutch script, so expect to see more improvement when that is fixed.