Thursday, March 22, 2007

 

Twice the speed, half the size

Gathering the performance history of Nutch is now complete. I am glad to announce that the soon-to-be-released Nutch 0.9.0 will be twice as fast as 0.8.x (with the configuration used in the benchmark). At the same time, the crawled data will use only about half as much disk space as before, thanks to Hadoop.

The following graph shows how the size of identical crawls has changed over time.



Time spent crawling is plotted below.




Comments



Wow, these are great stats, Sami. Very impressive and helpful.
# posted by Blogger Renaud : March 27, 2007 at 5:51 AM  



I am learning Nutch; this is helpful to me.
Thank you.
# posted by Blogger owen : April 28, 2007 at 6:10 PM  



I am learning Lucene and Nutch.
# posted by Blogger chen : August 9, 2007 at 4:54 AM  



thank you !
# posted by Blogger chen : August 9, 2007 at 4:55 AM  



Great job Sami! -JukkaT
# posted by Blogger Jukka : August 29, 2007 at 9:31 AM  



Hi Sami,

Sorry to use this comment form to contact you, but I'm not sure you received my email about a possible collaboration on a search engine project.

Regards
# posted by Blogger thats-me : September 3, 2007 at 1:23 PM  



If only your nifty application obeyed robots.txt instead of attempting to index the entries found within.
# posted by Blogger stutteringp0et : September 30, 2007 at 3:18 PM  



I am learning Nutch, and I think I can learn it well.
# posted by Blogger 海丰 : October 27, 2007 at 10:27 AM  






Thanks Sami, excellent job...
unfortunately I was too busy all that time...
I noticed Nutch's slow performance in 2006 and moved to custom modifications of the Fetcher and Parser, plus a custom HTTP client configuration:

1. Apache HttpClient can automatically follow redirects and supports cookies (so I can avoid the HUGE overhead of URL normalization and session ID removal).

2. I noticed some holes in the HTML Parser: for instance, it still (v.0.9) can't handle the ALT attribute of {IMG} as an Outlink object with anchor text...

And many (believe me!) more...
...

I am currently using MySQL plus some modified code from Nutch 0.7; I needed MySQL, and with it I could easily understand what really happens with URLs, Outlinks, etc. Now that I don't have any unresolved stuff, I am going to move to Hadoop & Nutch, probably keeping some MySQL (for constrained crawls, etc.; when we need huge configurations it may be better to use a database).

Release from Trunk looks excellent.

Anyway, MySQL helped me see (very quickly) that something goes wrong even with the encoding algorithms of the HTML Parser, especially with the EURO symbol... ISO is 'superseded' by windows-1252; in some cases it is buggy (for some specific websites...)

I hope to bring all my findings in sync with Nutch 0.** and share them via the mailing lists.

-www.tokenizer.org
# posted by Blogger Bambarbia Kirkudu : March 2, 2008 at 5:51 AM  
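
The EURO symbol problem described in the comment above can be reproduced in a few lines of Java. This is only a minimal sketch, assuming that "ISO" in the comment means ISO-8859-1; the class `EuroDecoding` and the helper `effectiveCharset` are hypothetical names made up for illustration and are not part of Nutch. The byte 0x80 is the Euro sign in windows-1252, but a C1 control character in ISO-8859-1, so a page that declares ISO-8859-1 while actually being written in windows-1252 loses its Euro signs when decoded strictly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class EuroDecoding {

    // Hypothetical helper (not a Nutch API): treat a declared ISO-8859-1
    // label as windows-1252, which is a superset for the printable range.
    static Charset effectiveCharset(String declared) {
        String label = declared.trim().toLowerCase(Locale.ROOT);
        if (label.equals("iso-8859-1") || label.equals("latin1")) {
            return Charset.forName("windows-1252");
        }
        return Charset.forName(declared.trim());
    }

    // Decode raw page bytes with the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // 0x80 is the Euro sign in windows-1252.
        byte[] raw = {(byte) 0x80};

        String strict = decode(raw, StandardCharsets.ISO_8859_1);
        String lenient = decode(raw, effectiveCharset("ISO-8859-1"));

        // Print the resulting code points rather than the glyphs,
        // so the output does not depend on the console encoding.
        System.out.printf("strict:  U+%04X%n", (int) strict.charAt(0));
        System.out.printf("lenient: U+%04X%n", (int) lenient.charAt(0));
    }
}
```

Run as-is, the strict decode yields U+0080 (a control character) and the lenient one yields U+20AC (the Euro sign). Whether remapping the label is the right fix for a particular crawl is a judgment call; it mirrors what web browsers do, but it is not presented here as what Nutch's parser does.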
