Thursday, February 19, 2009

 

Next Apache Nutch release is nearing

The list of bugs remaining to be fixed for the next version of Nutch is getting shorter.

I am positive that we will get the first release candidate out during February 2009. We certainly should, since the last release (0.9) came out in April 2007.

There are some very nice additions in the new version of Nutch, such as Solr integration and a new scoring and indexing framework. I will try to post more about those later.

If you want to help get version 1.0 out, now is the time to act - go download the latest nightly version of Nutch, take it for a ride and report back any problems you experience. We could also really use more documentation in the form of wiki pages and tutorials for the new features!




Saturday, March 8, 2008

 

Nutch training at ApacheCon EU 2008

As some of you might have noticed, I am set to give a half-day training about Apache Nutch at ApacheCon EU 2008.

However, there are still too many seats available and I need your help to get things going. So if you are interested in Nutch internals and have Tuesday, Apr 08 open in your calendar, please go ahead and book a seat at the ApacheCon web site!

Don't forget that there are also plenty of other interesting sessions and trainings during the week; see the schedule for more info.




Thursday, March 22, 2007

 

Twice the speed, half the size

Gathering the performance history of Nutch is now complete. I am glad to announce that the soon-to-be-released Nutch 0.9.0 will be twice as fast as 0.8.x (with the configuration used in the benchmark). At the same time, the crawled data will use only about half the disk space it did before - thanks to Hadoop.

The following graph shows how the size of identical crawls has changed over time.



Time spent crawling is plotted below.




Wednesday, March 14, 2007

 

Performance history for Nutch

Today I started a benchmark marathon to build a relative performance history of Nutch over the last 200 or so revisions. The measuring process is very simple: first the revision is checked out, compiled and configured, then a full crawl cycle is executed (inject, generate, fetch, updatedb) and each of the phases is timed.

The crawl is run against a local HTTP server to eliminate external factors from the results. The content consists of 11062 HTML pages (the Java 6 javadoc), served by a local Apache httpd. The size of each crawl is also recorded.

Why such effort? Crawling performance is a critical aspect of any search engine (ok, there are the features too) and that aspect is currently not measured regularly in Nutch. By analysing the (upcoming) results we can hopefully learn how the different commits have affected overall crawling performance. It might even make sense to keep measuring relative performance after every commit in the future, just to make sure nothing seriously wrong gets checked in (we'll judge that after the experiment is over ;).

The results will be published in real time as they are gathered, both in textual format and in the graph below. The format of the text file is as follows:


revision, total (s), inject (s), generate (s), fetch (s), updatedb (s), size of crawl dir (kb)
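For illustration, one line of that file can be read back with something as simple as the following throwaway helper (not part of the actual bench scripts):

// Prints the named columns of one result line in the format described above.
public static void printResultLine(String line) {
  String[] names = { "revision", "total (s)", "inject (s)", "generate (s)",
      "fetch (s)", "updatedb (s)", "crawl dir size (kb)" };
  String[] values = line.split(",");
  for (int i = 0; i < values.length && i < names.length; i++) {
    System.out.println(names[i] + " = " + values[i].trim());
  }
}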


If the speed stays what it has been for the first few rounds, the results should be complete in 3-4 days.



Disclaimer: the only purpose of this experiment is to look at how relative performance correlates with changes committed to trunk, using a very limited test. Some bench rounds also seem to fail for various reasons, which is why there is some turbulence in the data points. The trend and end result will be a surprise for me too, as I have not run similar benchmarks on recent versions before.

Update (2007-03-18): I will re-run the failing points after the first run completes. I also need to re-run some of the recent runs because a configuration error prevented the space savings from surfacing. Hadoop native libs are currently not working on RH5 because of a bug in the bin/nutch script, so expect to see more improvement when that is covered.




Sunday, February 4, 2007

 

Online indexing - integrating Nutch with Solr

Update 2009-03-09: there is now a more up-to-date example of Solr integration available at the Lucid Imagination Blog.

There might be times when you would like to integrate Apache Nutch crawling with a single Apache Solr index server - for example when your collection size is limited to the number of documents that can be served by a single Solr instance, or when you would like to do your updates on a "live" index. Using Solr as your indexing server might even ease your maintenance burden quite a bit - you get rid of manual index life-cycle management in Nutch and let Solr handle your index.

Overview

In this short post we will set up (and customize) Nutch to use Solr as the indexing engine. If you are using Solr directly to provide a search interface, then that's all you need to do to get a fully working setup. The Nutch commands are used as normal to manage the fetching part of the process (a script is provided to ease up that part). The integration between Nutch and Solr is not yet available out of the box, but it does not require much glue code.

A patch against Nutch trunk is provided for those who wish to be brave. In addition to that you will need the solr-client.jar and xpp3-1.1.3.4.O.jar in the nutch/lib directory (they are both part of the solr-client.zip package from SOLR-20).


Setting up Solr

A nightly build of Apache Solr can be downloaded from the Apache site. It is really easy to set up, and basically the only thing requiring special attention is the custom schema to be used (see the Solr wiki for more details about the available schema configuration options). Unpack the archive and go to the example directory of the extracted package.

I edited the example schema (solr/conf/schema.xml) and added the fields required by Nutch in its stock configuration:



<fields>
<field name="url" type="string" indexed="true" stored="true"/>
<field name="content" type="text" indexed="true" stored="true"/>
<field name="segment" type="string" indexed="false" stored="true"/>
<field name="digest" type="string" indexed="false" stored="true"/>
<field name="host" type="string" indexed="true" stored="false"/>
<field name="site" type="string" indexed="true" stored="false"/>
<field name="anchor" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="tstamp" type="slong" indexed="false" stored="true"/>
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
</fields>

<uniqueKey>url</uniqueKey>

<defaultSearchField>text</defaultSearchField>

<solrQueryParser defaultOperator="AND"/>

<copyField source="anchor" dest="text"/>
<copyField source="title" dest="text"/>
<copyField source="content" dest="text"/>




After setting up the schema, just start the Solr server with the command: java -jar start.jar

Note: if you use indexing filters in Nutch that produce additional fields, you need to add those fields to the Solr schema before you start indexing.

Implementing the glue



The integration with the Solr server is done with the client posted on SOLR-20. We will implement a new indexer called SolrIndexer which extends the existing Indexer in Nutch. Basically we only need to modify the OutputFormat of the Indexer class, but some additional (duplicate) code is also needed to launch the job with our custom code.


public static class OutputFormat extends org.apache.hadoop.mapred.OutputFormatBase
    implements Configurable {

  private Configuration conf;
  SolrClientAdapter adapter;

  public RecordWriter getRecordWriter(final FileSystem fs, JobConf job,
      String name, Progressable progress) throws IOException {

    return new RecordWriter() {
      boolean closed;

      public void write(WritableComparable key, Writable value)
          throws IOException { // unwrap & index doc
        Document doc = (Document) ((ObjectWritable) value).get();
        LOG.info("Indexing [" + doc.getField("url").stringValue() + "]");
        adapter.index(doc);
      }

      public void close(final Reporter reporter) throws IOException {
        // spawn a thread to give progress heartbeats
        Thread prog = new Thread() {
          public void run() {
            while (!closed) {
              try {
                reporter.setStatus("closing");
                Thread.sleep(1000);
              } catch (InterruptedException e) {
                continue;
              } catch (Throwable e) {
                return;
              }
            }
          }
        };

        try {
          prog.start();
          LOG.info("Executing commit");
          adapter.commit();
        } finally {
          closed = true;
        }
      }
    };
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    adapter = new SolrClientAdapter(conf);
  }

}


In the future it might be a good idea to make the indexing API in Nutch more generic, so that a variety of different index back ends could be supported with the same Indexer code.
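As a rough sketch of what such a generic back end could look like - the interface name and its methods below are purely hypothetical, not an existing Nutch API:

public interface IndexingBackend {
  /** Prepare the back end, e.g. open a connection to the index server. */
  void open(Configuration conf) throws IOException;
  /** Add a single document to the index. */
  void index(Document doc) throws IOException;
  /** Make pending changes visible to searchers. */
  void commit() throws IOException;
  /** Release any resources held by the back end. */
  void close() throws IOException;
}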

The second class we will create is an adapter class for the Solr Java client. This is not strictly required, but it is a smart thing to do in order to get better immunity against changes in the client. The adapter basically just extracts the required information from the Lucene Document generated by the Indexer and uses the Solr Java client to submit it to the Solr server.



/** Adds a single Lucene document to the index. */
public void index(Document doc) {
  SimpleSolrDoc solrDoc = new SimpleSolrDoc();
  for (Enumeration e = doc.fields(); e.hasMoreElements();) {
    Field field = (Field) e.nextElement();
    if (!ignoreFields.contains(field.name())) {
      solrDoc.fields.put(field.name(), field.stringValue());
    }
  }
  try {
    client.add(solrDoc);
  } catch (Exception e) {
    LOG.warn("Could not index document, reason:" + e.getMessage(), e);
  }
}

/** Commits pending changes. */
public void commit() {
  try {
    client.commit(true, false);
  } catch (Exception e) {
    LOG.warn("Could not commit, reason:" + e.getMessage(), e);
  }
}



Setting up Nutch

Before starting the crawling process you first need to configure Nutch. If you are not familiar with the way Nutch operates, it is recommended to first follow the tutorial on the Nutch web site.

Basically the steps required are (make sure you use correct filenames - replace '_' with '-'):

1. Set up conf/regex-urlfilter.txt
2. Set up conf/nutch-site.xml
3. Generate a list of seed urls into folder urls
4. Grab this simple script that will help you along in your crawling task.



After those initial steps you can start crawling by simply executing the crawl.sh script:

crawl.sh <basedir>, where basedir is the folder where your crawl data will be stored.

The script will execute one iteration of fetching and indexing. After the first iteration you can start querying the newly generated index for the content you have crawled - for example with a URL like http://127.0.0.1:8983/solr/select?q=apache&start=0&rows=10&fl=title%2Curl%2Cscore&qt=standard&wt=standard&hl=on&hl.fl=content


If you started with the provided seed list, your index should contain exactly one document: the Apache front page. You can now fetch more rounds and watch your index grow.

Deficiencies of the demonstrated integration

There are a number of things you need to consider and implement before the integration is at a usable level.

Document boost

Document boosting was left out to keep this post short. If you are seriously planning to use a pattern like this then you must add document boosting (it is not hard at all to add). Without it you will lose a precious piece of information from the link graph.
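As a rough idea of what that could look like inside the adapter's index() method - assuming we simply pass the boost along as an extra field, which the Solr side would then have to make use of:

// Sketch only: carry the boost computed by Nutch's scoring filters over to Solr.
float boost = doc.getBoost();                        // boost set on the Lucene Document by Nutch
solrDoc.fields.put("boost", Float.toString(boost));  // "boost" is a hypothetical extra field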

Support for multivalued fields

The anchor texts in Nutch are indexed into a multivalued field. The sample code in this post does not handle that.
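A sketch of one way to fix that - collecting all values per field name instead of overwriting; this assumes the client-side document can be changed to hold a list of values per field, which is not the case for the plain SimpleSolrDoc shown above:

// Collect every value of each field, so e.g. all anchor texts survive.
Map<String, List<String>> fields = new HashMap<String, List<String>>();
for (Enumeration e = doc.fields(); e.hasMoreElements();) {
  Field field = (Field) e.nextElement();
  List<String> values = fields.get(field.name());
  if (values == null) {
    values = new ArrayList<String>();
    fields.put(field.name(), values);
  }
  values.add(field.stringValue());
}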

Deleting pages from index

Deleted pages are not removed from the index. One could implement this as part of the reduce method by checking the status of the CrawlDatum and posting a deletion request if it has status STATUS_FETCH_GONE, as sketched below.
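Something along these lines; the deleteByUrl() method on the adapter is hypothetical and would translate into a Solr delete request:

// Sketch: decide between indexing and deleting based on the fetch status.
private void indexOrDelete(String url, CrawlDatum datum, Document doc) {
  if (datum.getStatus() == CrawlDatum.STATUS_FETCH_GONE) {
    adapter.deleteByUrl(url);   // hypothetical: issue a delete request to Solr
  } else {
    adapter.index(doc);
  }
}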

Posting multiple documents at same time

The naive implementation here posts documents to the index one by one over the network. A better way would be to add multiple documents at a time.
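A minimal sketch of client-side batching inside the adapter; it assumes the client's add() call accepts a collection of documents, which you should verify against the client version you are using:

// Buffer documents and send them in batches instead of one by one.
private final List<SimpleSolrDoc> buffer = new ArrayList<SimpleSolrDoc>();
private static final int BATCH_SIZE = 100;   // arbitrary size chosen for the example

private void addToBatch(SimpleSolrDoc solrDoc) throws Exception {
  buffer.add(solrDoc);
  if (buffer.size() >= BATCH_SIZE) {
    flushBatch();
  }
}

private void flushBatch() throws Exception {
  if (!buffer.isEmpty()) {
    client.add(buffer);   // one request for the whole batch
    buffer.clear();
  }
}
// commit() should call flushBatch() first so no buffered documents are lost.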

Further improvements - extending index size

If you are unwilling to wait for the next killer component in the Lucene family, you could quite easily extend the pattern presented here to support indexes larger than a single Solr server instance can handle.

A small addition to SolrClientAdapter would be sufficient: instead of posting all documents to a single Solr instance, one would post documents to different indexes; the target server could be selected by hashing the document URL, for example. This is however not recommended unless you understand the consequences ;)
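A sketch of the idea; the array of adapters and how it gets configured are assumptions made purely for the example:

// Pick the target Solr server by hashing the document URL.
private SolrClientAdapter pickServer(String url, SolrClientAdapter[] servers) {
  int index = (url.hashCode() & Integer.MAX_VALUE) % servers.length;  // mask keeps the index non-negative
  return servers[index];
}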

UPDATE 2007/07/15: Ryan has kindly posted an updated SolrClientAdapter that works with the client version currently in Solr trunk - thanks Ryan!




Sunday, January 14, 2007

 

Sorted out

Fetcher performance in post-0.7.x versions of Nutch has been a target of critique for a long time, and not without cause. Even though many improvements have been made during the last year (and many more are waiting to be done), things just aren't as fast as one would hope.

One particular thing has been bothering me for a long time, but I never really had time to look into it - until now.

The Nutch Fetcher operates by reading a sequence of URLs from a list generated by the Generator. These URLs are then handed to FetcherThreads. A FetcherThread fetches the content, parses it (or not, depending on the configuration) and stores it into a segment for later processing. Some info about Nutch segment contents can be seen in my previous post.

The Fetcher also has a built-in mechanism to behave like a good citizen and not fetch more pages per unit of time than configured. If the fetchlist contains many URLs in a row for the same host, lots of threads get blocked because of this mechanism that enforces politeness. The queuing mechanism is a good thing, but as a side effect many threads just sit and wait in a queue because some other thread just fetched a page from the same host they were going to fetch.

There are a number of things one can tune through configuration so that a minimal number of threads are blocked during fetching; some of them are listed below:


generate.max.per.host
generate.max.per.host.by.ip
fetcher.threads.per.host.by.ip
fetcher.server.delay
fetcher.threads.fetch
fetcher.threads.per.host


But when, even after you have set up a reasonable configuration (like generate.max.per.host * threads < num_of_urls_to_generate), you still end up in a situation where tons of threads are blocked on the same host, you start to wonder what the problem is this time.

This time the blame was in the Generator, or more specifically in HashComparator. It took me a long time to figure out what the real problem was; I even tried out other hash functions because I thought the existing one was flawed. In the end the problem is quite obvious:

public int compare(...) {
  ...
  if (hash1 != hash2) {
    return hash1 - hash2;
  }
  ...
}

Isn't it? Well, it wasn't for me. But afterwards it's easy to say that overflow in integer math was to blame: if the two hashes are far enough apart, the subtraction wraps around and returns a result with the wrong sign. I changed the compare methods slightly to get rid of the integer overflow:

public int compare(...) {
  ...
  return (hash1 < hash2 ? -1 : (hash1 == hash2 ? 0 : 1));
}
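To see why the subtraction goes wrong, consider two hashes of opposite sign and large magnitude (the values below are picked only for illustration):

public class OverflowDemo {
  public static void main(String[] args) {
    int hash1 = -2000000000;
    int hash2 = 1000000000;
    // hash1 - hash2 would be -3000000000, which does not fit in an int,
    // so it wraps around and the comparator reports the wrong order.
    System.out.println(hash1 - hash2);                                  // positive, wrong sign
    System.out.println(hash1 < hash2 ? -1 : (hash1 == hash2 ? 0 : 1));  // -1, correct
  }
}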

To verify the effect of the fix I generated two segments of 10 000 URLs each (exactly the same URLs) - one with the original code and one with the modified code. Runtimes for these single-server fetches are listed below:

Original:
real 32m16.246s
user 2m33.726s
sys 0m9.989s

Modded:
real 19m40.026s
user 2m35.371s
sys 0m10.892s

The absolute times are more or less meaningless and are provided just for reference; below is a chart of the bandwidth used during fetching. One thing to note is the more even bandwidth usage with a properly sorted fetchlist.



In the end I have to say that I am very pleased I got this one sorted out.




Sunday, December 31, 2006

 

Nutch and hadoop native compression revisited

After running experimental crawls using Hadoop native compression, I decided to report back some more data for you to consume.

I have been crawling several cycles before and after enabling compression. What I am comparing here are segment sizes generated and fetched with the same settings (-topN 1000000), so strictly speaking this is not a test that tells you how compression affects individual segments, but just a log of segment sizes before and after enabling compression. Segments 0-6 are from the time before enabling compression and segments 7 onwards are processed with compression enabled.

Total space consumption


As seen from the graph, the total savings from compression in segment data are roughly 50%.

Nutch segment data consists of several independent parts. Below you can find a graph for each individual part and see the effect of enabling compression.

Content
Subfolder: content
Object: Content
Purpose: Store fetched raw content, headers and some additional metadata.


As you can see, there was no significant (if any) gain from compression for the biggest space consumer, the content. This is because it is already compressed. Actually this means that during processing it gets compressed twice, once by the object itself and a second time by Hadoop. The object-level compression should really be removed from Content; instead one should rely on Hadoop to do the compression.

Crawl_fetch
Subfolder: crawl_fetch
Object: CrawlDatum
Purpose: CrawlDatum object used when fetching.



Crawl_generate
Subfolder: crawl_generate
Object: CrawlDatum
Purpose: CrawlDatum object as selected by Generator.


Crawl_parse
Subfolder: crawl_parse
Object: CrawlDatum
Purpose: CrawlDatum data extracted while parsing content (hash, outlinks)


Parse_data
Subfolder: parse_data
Object: ParseData
Purpose: Data extracted from a page's content.


Parse_text
Subfolder: parse_text
Object: ParseText
Purpose: The text conversion of the page's content, stored using gzip compression.


The gains are not big here because of the double compression. Again, the compression should be removed from the object, letting Hadoop do its job.

By removing the double compression from the two objects mentioned, fetcher performance should once again increase. I will dig into this sometime in the future (unless someone else gets to it first :)

Oh, and happy new year to all!




Saturday, December 16, 2006

 

Record your data

Hadoop supports serializing/deserializing streams of objects that implement the Writable interface. The Writable interface has two methods, one for serializing and one for deserializing data.

void write(DataOutput out) throws IOException;

void readFields(DataInput in) throws IOException;

Hadoop contains ready-made Writable implementations for primitive data types, like IntWritable for persisting integers, BytesWritable for persisting byte arrays, Text for persisting strings and so on. By combining these primitive types it is possible to write more complex persistable objects to satisfy your needs.

There are several examples of more complex Writable implementations in Nutch for you to look at, like CrawlDatum, which holds the crawl state of a resource, or Content, which stores the content of a fetched resource. Many of these objects exist purely to persist data and don't do much more than record object state to a DataOutput or restore it from a DataInput.
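As a minimal illustration of what such a hand-written Writable looks like - the PageVisit type below is made up for this example:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

/** A hypothetical Writable combining a nested Text and a primitive long. */
public class PageVisit implements Writable {

  private Text url = new Text();
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    url.write(out);            // delegate to the nested Writable
    out.writeLong(timestamp);  // write the primitive directly
  }

  public void readFields(DataInput in) throws IOException {
    url.readFields(in);        // fields must be read in the order they were written
    timestamp = in.readLong();
  }
}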

Implementing and maintaining these sometimes complex objects by hand is both error prone and time consuming. There might even be cases where you need to implement the IO in C++, for example, in parallel with your Java software - your problem just got 100% bigger. This is where the Hadoop record package comes in handy.

The Hadoop record API contains tools to generate code for your custom Writables based on a DDL, so you can focus on more interesting tasks and let the machine do the things it is better at. The DDL syntax is easy and readable, yet it can be used to generate complex nestable data types with arrays and maps if needed.

Example

Imagine you were writing code for a simple media search and needed to implement the Writables to persist your data. The DDL in that case could look something like:


module org.apache.nutch.media.io {

  class InLink {
    ustring FromUrl;
    ustring AnchorText;
  }

  class Media {
    ustring Url;
    buffer Media;
    map <ustring, ustring> Metadata;
  }

  class MediaRef {
    ustring Url;
    ustring Alt;
    ustring AboutUrl;
    ustring Context;
    vector <InLink> InLinks;
  }

}


To generate the Java or C++ code (the currently supported languages) matching this DDL, you would execute:

bin/rcc <ddl-file>

Running the record compiler generates the needed .java files (InLink.java, Media.java and MediaRef.java) in the defined package, which in our example would be org.apache.nutch.media.io. It is very fast to prototype different objects and change things when you see that something is missing or wrong: just change the DDL and regenerate.

Evolving data

When it comes to evolving file formats - the case where you already have a lot of data stored, want to continue accessing it, and still want to, for example, add new fields to freshly generated data - there is currently not much support available.

There are plans to add support for defining records in a forward/backwards compatible manner in future versions of the record API. It might be something as simple as support for optional fields, or something as fancy as storing the DDL as metadata inside the file containing the actual data and constructing the data automatically at run time into some generic container.

One possibility for supporting data evolution is to write converters; those too should be quite easy to manage with the DDL-to-code approach: just change the module name of the "current" DDL to something else - classes generated from this DDL are used to read the old data. Then create a DDL for the next version of the data and generate new classes - these are used to write data in the new format. The last thing left is to write the converter itself, which can be a very simple mapper that reads the old format and writes the new format.
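A rough sketch of such a converter using the old mapred API; OldMedia and Media stand for the classes generated from the renamed and the new DDL, and the accessor names are only illustrative:

import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** Reads records in the old format and writes them out in the new format. */
public class MediaConverter extends MapReduceBase implements Mapper {

  public void map(WritableComparable key, Writable value,
      OutputCollector output, Reporter reporter) throws IOException {
    OldMedia old = (OldMedia) value;
    Media converted = new Media();
    converted.setUrl(old.getUrl());           // copy over the fields that still exist
    converted.setMedia(old.getMedia());
    converted.setMetadata(old.getMetadata()); // any new fields keep their default values
    output.collect(key, converted);
  }
}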




Sunday, December 10, 2006

 

Compressed and fast

My fellow Nutch developer Andrzej Bialecki of SIGRAM recently upgraded Nutch to contain the latest released version of Hadoop. This is great news for people who have been suffering from the amount of disk space Nutch requires, or from slow IO when running under Linux. Why? I hear you asking.

Because now you can use native compression libs to compress your Nutch data at the speed of light! Unfortunately for some, this only works out of the box under Linux. What you need to do is the following:

  • Get the precompiled libraries from the Hadoop distribution. Yes, they are not yet part of Nutch. The libraries can be found in the folder lib/native.

  • Make them available. I did it with the environment variable LD_LIBRARY_PATH. One can also, and most probably should, use the system property -Djava.library.path= to do the same thing.

  • Instruct Hadoop to use block-level compression with a configuration like:


<property>
  <name>io.seqfile.compression.type</name>
  <value>BLOCK</value>
</property>

That's all there is to starting to use native compression within Nutch. Everything works in a backwards-compatible way: each sequence file stores metadata about its compression, so when data is read Hadoop automatically knows how it is compressed. So the first time you run a command that handles data, it is read as it is (usually uncompressed), but when new data files start to be generated, compression kicks in.
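For the curious, the setting above is roughly equivalent to choosing block compression by hand when a sequence file is written; the path and the key/value classes below are placeholders for the example:

Configuration conf = new Configuration();
conf.set("io.seqfile.compression.type", "BLOCK");   // same effect as the property above
FileSystem fs = FileSystem.get(conf);
SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, conf, new Path("crawl/demo"),               // placeholder path
    Text.class, CrawlDatum.class,                   // placeholder key/value classes
    SequenceFile.CompressionType.BLOCK);            // block-level compression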

To verify that you got all the steps right, you can consult the log files and search for strings like:

DEBUG util.NativeCodeLoader - Trying to load the custom-built native-hadoop library...
INFO util.NativeCodeLoader - Loaded the native-hadoop library

I did a small test with a crawldb containing a little over 20M URLs. The original crawldb had a size of 2737 megabytes, and after activating compression and running it through a merge the size dropped to 359 megabytes - quite a nice drop!

There's always the other side of the coin too: how much does compression affect the speed of the code involved in the IO? This aspect was tested simply by generating a fetchlist of 1M entries. With a compressed crawldb the time needed was 47 m 04 s, and for an uncompressed crawldb (where an uncompressed fetchlist was also generated) it was 53 m 28 s. Yes, you read that right - using compression leads to faster operations. So not only does it consume less disk space, it also runs faster!

Hats off to the Hadoop team for this nice improvement!



