Wednesday, March 14, 2007

 

Performance history for Nutch

Today I started a bench marathon to build a relative performance history of Nutch over the last 200 or so revisions. The measuring process is very simple: each revision is checked out, compiled and configured, and then a full crawl cycle is executed (inject, generate, fetch, updatedb), with each phase timed.
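
In spirit the harness is just a loop over revisions. Below is a minimal Python sketch of a single round, purely to illustrate the idea; it is a reconstruction, not the actual script behind these numbers, and the SVN URL, directory layout and crawl arguments are assumptions (the real harness may well be a shell script).

#!/usr/bin/env python
# Sketch of one benchmark round. Assumes the seed url dir "urls" and a Nutch
# config pointing at the local httpd are already in place for the checkout.
import glob
import os
import subprocess
import time

def timed(cmd, cwd):
    """Run a command and return the elapsed wall-clock time in seconds."""
    start = time.time()
    subprocess.check_call(cmd, cwd=cwd)
    return time.time() - start

def bench_revision(rev, workdir="nutch-bench"):
    # Check out and build the revision under test (URL is an assumption).
    subprocess.check_call(["svn", "checkout", "-r", str(rev),
                           "http://svn.apache.org/repos/asf/lucene/nutch/trunk",
                           workdir])
    subprocess.check_call(["ant"], cwd=workdir)

    # One full crawl cycle, timing each phase separately.
    nutch = "bin/nutch"
    t = {}
    t["inject"] = timed([nutch, "inject", "crawl/crawldb", "urls"], workdir)
    t["generate"] = timed([nutch, "generate", "crawl/crawldb", "crawl/segments"], workdir)
    segment = sorted(glob.glob(os.path.join(workdir, "crawl/segments/*")))[-1]
    segment = os.path.relpath(segment, workdir)
    t["fetch"] = timed([nutch, "fetch", segment], workdir)
    t["updatedb"] = timed([nutch, "updatedb", "crawl/crawldb", segment], workdir)

    # Total time and size of the crawl dir in kilobytes (du -sk).
    total = sum(t.values())
    du = subprocess.check_output(["du", "-sk", "crawl"], cwd=workdir)
    size_kb = int(du.decode().split()[0])
    return "%d, %.0f, %.0f, %.0f, %.0f, %.0f, %d" % (
        rev, total, t["inject"], t["generate"], t["fetch"], t["updatedb"], size_kb)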

The crawl is run against a local HTTP server to keep external factors out of the results. The content consists of 11,062 HTML pages (the Java 6 javadoc), served by a local Apache httpd. The size of each crawl is also recorded.

Why such effort? Crawling performance is a critical aspect of any search engine (ok, there are the features too), and that aspect is currently not measured regularly in Nutch. By analysing the (upcoming) results we can hopefully learn how the different commits have affected overall crawling performance. It might even make sense to continue measuring relative performance after every commit in the future, just to make sure nothing seriously wrong gets checked in (we'll judge that after the experiment is over ;).

The results will be published in real time as they are gathered, in textual form as well as in the graph below. The format of the text file is as follows:


revision, total (s), inject (s), generate (s), fetch (s), updatedb (s), size of crawl dir (kb)


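The graph is essentially just the total time plotted against the revision number. A minimal sketch for parsing the text file and plotting it is below; the file name "results.txt" is an assumption, the column layout is the one described above.

# Plot total crawl cycle time against SVN revision from the results file.
import matplotlib.pyplot as plt

revisions, totals = [], []
with open("results.txt") as f:
    for line in f:
        fields = [x.strip() for x in line.split(",")]
        if len(fields) < 7:
            continue  # skip failed or partial rounds
        revisions.append(int(fields[0]))
        totals.append(float(fields[1]))

plt.plot(revisions, totals, marker="o")
plt.xlabel("SVN revision")
plt.ylabel("total crawl cycle time (s)")
plt.title("Nutch crawl performance by revision")
plt.show()
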
If the speed stays like it has been for the first few rounds, the results should be complete in 3-4 days.



Disclaimer: The only purpose of this experiment is to look at how relative performance correlates with changes committed to trunk, using a very limited test. Some benchmark rounds also seem to fail for various reasons, which is why there is some turbulence in the data points. The trend and end result will be a surprise for me too, as I have not run similar benchmarks with the current versions before.

Update (2007-03-18): I will be re-running the failing points after the first run completes. I also need to re-run some of the recent rounds because there was a configuration error that prevented the space savings from surfacing. Hadoop native libs are currently not working on RH5 because of a bug in the bin/nutch script, so expect to see more improvement once that is fixed.




Comments



I want to do something similar. Can you please let me know how you are recording the performance numbers?
# posted by Blogger Developer : January 23, 2008 at 4:38 PM  



Can you update this graph to today's Nutch revisions? It could be interesting to see how it has evolved... or just provide the relevant scripts/test code you used to generate that graph, thanks!
# posted by Blogger brainstorm : August 29, 2008 at 1:57 PM  
