Sunday, January 14, 2007

 

Sorted out

Fetcher performance in the post-0.7.x versions of Nutch has been a target of criticism for a long time, and not without cause. Even though many improvements have been made during the last year (and many more are still waiting to be done), things just aren't as fast as one would hope.

One particular thing had been bothering me for a long time, but I never really had time to look into it, until now.

The Nutch Fetcher operates by reading a sequence of urls from a list produced by the Generator. These urls are then handed to FetcherThreads. A FetcherThread fetches the content, parses it (or not, depending on the configuration) and stores it into a segment for later processing. Some info about the contents of Nutch segments
can be seen in my previous post.

The Fetcher also has a built-in mechanism to behave like a good citizen and not fetch more pages per unit of time than configured. If the fetchlist contains a lot of urls in a row for the same host, many threads get blocked by this mechanism that enforces politeness. The queuing mechanism itself is a good thing, but as a side effect a lot of threads just sit and wait in a queue because some other thread just fetched a page from the same host they were about to fetch.
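To illustrate the idea, here is a minimal sketch of per-host politeness queuing. This is not Nutch's actual implementation, and the class and method names are made up: each host gets a "next allowed fetch" timestamp, and a thread that arrives before that time simply waits. When a fetchlist has many consecutive urls for one host, most threads end up parked here.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only, not Nutch code: one shared queue that
// tracks, per host, the earliest time the next fetch is allowed.
class HostQueue {
    private final long serverDelayMs;
    private final Map<String, Long> nextFetchTime = new HashMap<String, Long>();

    HostQueue(long serverDelayMs) {
        this.serverDelayMs = serverDelayMs;
    }

    // Blocks the calling thread until this host may be fetched again.
    synchronized void waitForTurn(String host) throws InterruptedException {
        long now = System.currentTimeMillis();
        Long next = nextFetchTime.get(host);
        while (next != null && now < next) {
            wait(next - now); // releases the lock while waiting
            now = System.currentTimeMillis();
            next = nextFetchTime.get(host);
        }
        nextFetchTime.put(host, now + serverDelayMs);
    }

    public static void main(String[] args) throws InterruptedException {
        HostQueue queue = new HostQueue(200); // 200 ms between fetches per host
        long start = System.currentTimeMillis();
        queue.waitForTurn("example.com"); // first fetch: no wait
        queue.waitForTurn("example.com"); // same host again: waits ~200 ms
        queue.waitForTurn("other.org");   // different host: no extra wait
        System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));
    }
}
```

The point of the sketch is the `while` loop: every thread whose url belongs to a recently fetched host blocks there instead of doing useful work, which is exactly why the ordering of the fetchlist matters so much.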

There are a number of things one can do through configuration to ensure that a minimum number of threads are blocked during fetching; some of the relevant properties are listed below:


generate.max.per.host
generate.max.per.host.by.ip
fetcher.threads.per.host.by.ip
fetcher.server.delay
fetcher.threads.fetch
fetcher.threads.per.host
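
For illustration, these properties are set in nutch-site.xml using the usual Hadoop-style configuration format. The values below are examples only, and the defaults (and exact semantics) vary between Nutch versions:

```xml
<!-- nutch-site.xml: example values only, tune for your own crawl -->
<property>
  <name>generate.max.per.host</name>
  <value>100</value>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
</property>
```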


But when, even after you have set up a reasonable configuration (like generate.max.per.host * threads < num_of_urls_to_generate), you still end up in a situation where tons of threads are blocked on the same host, you start to wonder what the problem is this time.

This time the blame lay in the Generator, or more specifically in its HashComparator. It took me a long time to figure out what the real problem was; I even tried out other hash functions because I thought the one in use was flawed. In the end the problem is quite obvious:

public int compare(...) {
  ...
  if (hash1 != hash2) {
    return hash1 - hash2;
  }
  ...
}

Isn't it? Well, it wasn't for me. But in hindsight it's easy to say that overflow in integer math was to blame. I changed the compare methods slightly to get rid of the integer overflow:

public int compare(...) {
  ...
  return (hash1 < hash2 ? -1 : (hash1 == hash2 ? 0 : 1));
}
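
To see why the subtraction version fails, here is a small self-contained sketch (the hash values and class name are made up for illustration): when the two hashes are far enough apart, hash1 - hash2 wraps around in 32-bit arithmetic and the sign of the result no longer matches the actual ordering, so the sort produces garbage.

```java
public class OverflowDemo {
    // The original, buggy comparison: subtraction can overflow.
    static int buggyCompare(int hash1, int hash2) {
        return hash1 - hash2;
    }

    // The fixed comparison: no arithmetic, hence no overflow.
    static int fixedCompare(int hash1, int hash2) {
        return (hash1 < hash2 ? -1 : (hash1 == hash2 ? 0 : 1));
    }

    public static void main(String[] args) {
        int hash1 = -2000000000; // clearly smaller than hash2...
        int hash2 = 2000000000;
        // -2000000000 - 2000000000 = -4000000000, which does not fit
        // in an int and wraps around to +294967296, so the buggy
        // comparator reports hash1 > hash2.
        System.out.println("buggy: " + buggyCompare(hash1, hash2));
        System.out.println("fixed: " + fixedCompare(hash1, hash2));
    }
}
```

With a comparator like this, urls that hash to distant values end up in essentially arbitrary order, which is how runs of same-host urls survived into the fetchlist.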

To verify the effect of the fix I generated two segments of 10 000 urls each (exactly the same urls) - one with the original code and one with the modified code. The runtimes for these single-server fetches are listed below:

Original:
real 32m16.246s
user 2m33.726s
sys 0m9.989s

Modded:
real 19m40.026s
user 2m35.371s
sys 0m10.892s

The absolute times are more or less meaningless and are provided just for reference; below is a chart of the bandwidth used during fetching. The thing to note there is the more even bandwidth usage with a properly sorted fetchlist.



In the end I have to say that I am very pleased I got this one sorted out.




Comments



congrats for the improvements! just wondering -- how do you plot these nice graphs? do you use some kind of profiler?

thanks,
Renaud
# posted by Blogger Renaud : February 2, 2007 at 6:00 PM  



Thanks! The graphs are done with the mother of all graphing software - gnuplot.

Data is collected with common Unix utilities and pre-processed with sed and friends.
# posted by Blogger Sami Siren : February 2, 2007 at 9:40 PM  
