Saturday, January 20, 2007


Website up

I got an extra burst of energy yesterday and set up my company website. The nice, simple layout (also deployed on this blog) was designed by Andreas Viklund. Nice work, man!

Another part of the credit goes to the creators of MeshCMS, which powers the site. It was really easy to set up: no fighting with databases and such, just deploy the war and start writing content. Another excellent example of how usable open source software is getting these days.

Regarding the company: everything is still at a very early stage and I am working on it part time only (or should I say, when there's demand ;). I'll get back to this when there's something to report.


Sunday, January 14, 2007


Sorted out

The Fetcher performance in post-0.7.x versions of Nutch has been a target of critique for a long time, and not without cause. Even though many improvements have been made during the last year (and many more are waiting to be done), things just aren't as fast as one hopes.

One particular thing had been bothering me for a long time, but I never really had time to look into it thoroughly, until now.

The Nutch Fetcher operates by reading a sequence of urls from a list produced by the Generator. These urls are then handed to FetcherThreads. A FetcherThread fetches the content, parses it (or not, depending on the configuration) and stores it into a segment for later processing. Some info about the contents of Nutch segments can be found in my previous post.
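To make the flow above concrete, here is a minimal sketch of that pipeline: a shared queue of generated urls consumed by worker threads that fetch, optionally parse, and store each page. All names (FetchSketch, fetch, parse, store) are my own illustrative placeholders, not Nutch's actual API.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch only, not Nutch code: worker threads drain a
// fetchlist queue, fetch each url, parse the content and store it.
public class FetchSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> fetchlist = new LinkedBlockingQueue<>(
            List.of("http://a.example/1", "http://b.example/2"));

        Runnable fetcherThread = () -> {
            String url;
            while ((url = fetchlist.poll()) != null) {
                String content = fetch(url);    // download the page
                String parsed = parse(content); // optional, by configuration
                store(url, parsed);             // write into the segment
            }
        };

        Thread t1 = new Thread(fetcherThread);
        Thread t2 = new Thread(fetcherThread);
        t1.start(); t2.start();
        t1.join(); t2.join();
    }

    // Stand-ins for the real fetch/parse/store steps.
    static String fetch(String url) { return "<html>" + url + "</html>"; }
    static String parse(String html) { return html.replaceAll("<[^>]*>", ""); }
    static void store(String url, String data) {
        System.out.println(url + " -> " + data);
    }
}
```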

The Fetcher also has a built-in mechanism to behave like a good citizen and not fetch more pages per unit of time than configured. If the fetchlist contains a long run of urls pointing to the same host, many threads get blocked by this mechanism that enforces politeness. The queuing mechanism is a good thing, but as a side effect a lot of threads just sit and wait in a queue because some other thread has just fetched a page from the same host they were about to fetch.
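The per-host politeness gate can be sketched roughly like this (the class and its fields are my own illustration, not Nutch's actual implementation): a thread must wait until the configured delay has passed since the last fetch from the same host, which is exactly why runs of same-host urls leave threads idling.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of a per-host politeness gate, not Nutch code.
// acquire() blocks the calling thread until the configured delay has
// elapsed since the last fetch from that host.
public class PolitenessGate {
    private final long delayMs;
    private final Map<String, Long> lastFetch = new ConcurrentHashMap<>();

    public PolitenessGate(long delayMs) { this.delayMs = delayMs; }

    public void acquire(String host) throws InterruptedException {
        while (true) {
            long now = System.currentTimeMillis();
            Long last = lastFetch.get(host);
            if (last == null || now - last >= delayMs) {
                // Try to claim the slot atomically; retry if another
                // thread fetched from this host first.
                if (last == null
                        ? lastFetch.putIfAbsent(host, now) == null
                        : lastFetch.replace(host, last, now)) {
                    return;
                }
            } else {
                // Another thread fetched recently: wait out the delay.
                Thread.sleep(delayMs - (now - last));
            }
        }
    }
}
```

With many consecutive urls for one host, every worker thread ends up inside that sleep, which is the blocked-threads effect described above.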

There are a number of things one can do through configuration to ensure that only a minimal number of threads are blocked during fetching.

But even after you have set up a reasonable configuration (like * threads < num_of_urls_to_generate), you still end up in situations where tons of threads are blocked on the same host, and you start to wonder what the problem is this time.

This time the blame lay with the Generator, or more specifically with its HashComparator. It took me a long time to figure out what the real problem was; I even tried out other hash functions because I thought the one in use was flawed. In the end the problem is quite obvious:

public int compare(...) {
  if (hash1 != hash2) {
    return hash1 - hash2;
  }
  // ...
}
Isn't it? Well, it wasn't for me. But in hindsight it's easy to say that overflow in integer math was to blame. I modified the compare methods slightly to get rid of the integer overflow:

public int compare(...) {
  return (hash1 < hash2 ? -1 : (hash1 == hash2 ? 0 : 1));
}
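The difference is easy to demonstrate: with hashes near the extremes of the int range, the subtraction wraps around and reports the wrong order, while the branching comparison stays correct. A minimal standalone demo (the class and method names are mine, for illustration):

```java
public class ComparatorOverflowDemo {
    // Buggy version: overflows whenever hash1 - hash2 doesn't fit in an int.
    static int buggyCompare(int hash1, int hash2) {
        return hash1 - hash2;
    }

    // Fixed version: branch on the values instead of subtracting.
    static int fixedCompare(int hash1, int hash2) {
        return (hash1 < hash2 ? -1 : (hash1 == hash2 ? 0 : 1));
    }

    public static void main(String[] args) {
        int a = Integer.MIN_VALUE; // e.g. a hash with the sign bit set
        int b = 1;
        // a < b, so compare(a, b) should be negative, but
        // MIN_VALUE - 1 wraps around to MAX_VALUE (positive).
        System.out.println(buggyCompare(a, b)); // prints 2147483647: wrong sign
        System.out.println(fixedCompare(a, b)); // prints -1: correct
    }
}
```

A comparator with a wrong sign makes the sort produce an essentially broken ordering, which is exactly how the badly sorted fetchlists came about.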

To verify the effect of the fix I generated two segments, both containing exactly the same 10 000 urls: one with the original code and one with the modified code. Runtimes for these single-server fetches are listed below:

Original code:

real 32m16.246s
user 2m33.726s
sys 0m9.989s

Modified code:

real 19m40.026s
user 2m35.371s
sys 0m10.892s

The absolute times are more or less meaningless and are provided just for reference; below is a chart of the bandwidth used during fetching. The thing to note there is the more even bandwidth usage with a properly sorted fetchlist.

In the end I have to say that I am very pleased I got this one sorted out.
