Thursday, March 22, 2007
Twice the speed, half the size
Gathering the performance history of Nutch is now complete. I am glad to announce that the soon-to-be-released Nutch 0.9.0 will be twice as fast as 0.8.x (with the configuration used in bench). At the same time, the crawled data will use only about half the disk space it did before, thanks to Hadoop.
The following graph shows how the size of equal crawls has changed over time.

Time spent crawling is plotted below.

Labels: nutch
Comments
wow, these are great stats Sami. Very impressive and helpful.
# posted by Renaud : March 27, 2007 5:51 AM
I am learning Nutch; this is helpful to me.
Thank you.
# posted by owen : April 28, 2007 6:10 PM
I am learning lucene and nutch
# posted by chen : August 9, 2007 4:54 AM
thank you !
# posted by chen : August 9, 2007 4:55 AM
Great job Sami! -JukkaT
# posted by Jukka : August 29, 2007 9:31 AM
Hi Sami,
Sorry to use this comment form to contact you, but I'm not sure you received my email about a possible collaboration on a search engine project.
Regards
# posted by thats-me : September 3, 2007 1:23 PM
If only your nifty application obeyed robots.txt instead of attempting to index the entries found within.
# posted by stutteringp0et : September 30, 2007 3:18 PM
I am learning Nutch, and I think I can learn it well.
# posted by 海丰 : October 27, 2007 10:27 AM
Thanks Sami, excellent job...
unfortunately I was too busy all that time...
I noticed slow performance of Nutch in 2006, and moved to custom modifications of Fetcher and Parser, plus custom HTTP Client config..
1. Apache HttpClient can automatically follow redirects + Cookie support (so that I can avoid HUGE overhead with URL normalization & removing of session IDs)
2. I noticed some holes in HTML Parser: for instance, still (v.0.9) it can't handle ALT attribute of {IMG} as Outlink object with anchor text...
And many (believe me!) more...
...
I am currently using MySQL + some modified code from Nutch 0.7; I needed MySQL, and with it I was able to easily understand what really happens with URLs, Outlinks, etc.; now that I don't have any unresolved stuff, I am going to move to Hadoop & Nutch. Probably with some MySQL (constrained crawls, etc.; when we need huge configurations it's better to use a database... maybe...)
Release from Trunk looks excellent.
Anyway, MySQL helped me to see (very quickly) that something is going wrong even with encoding algos of HTML Parser, especially with EURO symbol... ISO is 'superseded' by windows-1252; in some cases it is buggy (for some specific websites...)
Hope to put all findings in-sync with Nutch 0.** and to share via mailing lists.
-www.tokenizer.org
# posted by Bambarbia Kirkudu : March 2, 2008 5:51 AM

