Sunday, February 4, 2007

 

Online indexing - integrating Nutch with Solr

Update 2009-03-09: There is now a more up-to-date example of Solr integration available at the Lucid Imagination blog.

There might be times when you would like to integrate Apache Nutch crawling with a single Apache Solr index server - for example when your collection is small enough to be served by a single Solr instance, or when you want to apply updates to a "live" index. Using Solr as your indexing server may also ease your maintenance burden quite a bit - you get rid of manual index life-cycle management in Nutch and let Solr handle the index.

Overview

In this short post we will set up (and customize) Nutch to use Solr as its indexing engine. If you are using Solr directly to provide a search interface, that is all you need to do to get a fully working setup. The Nutch commands are used as normal to manage the fetching part of the process (a script is provided to ease up that part). The integration between Nutch and Solr is not yet available out of the box, but it does not require much glue code.

A patch against Nutch trunk is provided for those who wish to be brave. In addition to that you will need solr-client.jar and xpp3-1.1.3.4.O.jar in the nutch/lib directory (both are part of the solr-client.zip package from SOLR-20).


Setting up Solr

A nightly build of Apache Solr can be downloaded from the Apache site. It is really easy to set up; basically the only thing requiring special attention is the custom schema to be used (see the Solr wiki for more details about available schema configuration options). Unpack the archive and go to the example directory of the extracted package.

I edited the example schema (solr/conf/schema.xml) and added the fields required by Nutch in its stock configuration:



<fields>
  <field name="url" type="string" indexed="true" stored="true"/>
  <field name="content" type="text" indexed="true" stored="true"/>
  <field name="segment" type="string" indexed="false" stored="true"/>
  <field name="digest" type="string" indexed="false" stored="true"/>
  <field name="host" type="string" indexed="true" stored="false"/>
  <field name="site" type="string" indexed="true" stored="false"/>
  <field name="anchor" type="string" indexed="true" stored="false" multiValued="true"/>
  <field name="title" type="text" indexed="true" stored="true"/>
  <field name="tstamp" type="slong" indexed="false" stored="true"/>
  <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
</fields>

<uniqueKey>url</uniqueKey>

<defaultSearchField>text</defaultSearchField>

<solrQueryParser defaultOperator="AND"/>

<copyField source="anchor" dest="text"/>
<copyField source="title" dest="text"/>
<copyField source="content" dest="text"/>




After setting up the schema, start the Solr server with the command: java -jar start.jar

Note: if you use indexing filters in Nutch that produce additional fields, you need to add those fields to the Solr schema before you start indexing.

Implementing the glue



The integration with the Solr server is done with the client posted on SOLR-20. We will also implement a new indexer called SolrIndexer, which extends the existing Indexer in Nutch. Basically we only need to replace the OutputFormat of the Indexer class, but some additional (duplicated) code is also needed to launch the job with our custom code.


public static class OutputFormat extends org.apache.hadoop.mapred.OutputFormatBase
    implements Configurable {

  private Configuration conf;
  SolrClientAdapter adapter;

  public RecordWriter getRecordWriter(final FileSystem fs, JobConf job,
      String name, Progressable progress) throws IOException {

    return new RecordWriter() {
      boolean closed;

      public void write(WritableComparable key, Writable value)
          throws IOException { // unwrap & index doc
        Document doc = (Document) ((ObjectWritable) value).get();
        LOG.info("Indexing [" + doc.getField("url").stringValue() + "]");
        adapter.index(doc);
      }

      public void close(final Reporter reporter) throws IOException {
        // spawn a thread to give progress heartbeats
        Thread prog = new Thread() {
          public void run() {
            while (!closed) {
              try {
                reporter.setStatus("closing");
                Thread.sleep(1000);
              } catch (InterruptedException e) {
                continue;
              } catch (Throwable e) {
                return;
              }
            }
          }
        };

        try {
          prog.start();
          LOG.info("Executing commit");
          adapter.commit();
        } finally {
          closed = true;
        }
      }
    };
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    adapter = new SolrClientAdapter(conf);
  }

}


In the future it might be a good idea to improve the indexing API in Nutch to be more generic, so that a variety of different index back ends could be supported with the same Indexer code.
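
As a rough sketch of what such an API might look like (the interface name and its methods below are purely hypothetical - nothing like this exists in Nutch at the time of writing), the Indexer could write through a back-end-neutral interface and a Lucene-based or Solr-based implementation would be plugged in behind it:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.lucene.document.Document;

// Hypothetical back-end-neutral writer interface; both a plain Lucene writer
// and a Solr-posting writer could implement it and be used by the same Indexer.
public interface IndexBackend {
  void open(Configuration conf) throws IOException;  // set up the back end
  void write(Document doc) throws IOException;       // add one document
  void delete(String key) throws IOException;        // remove by unique key
  void commit() throws IOException;                  // flush pending changes
  void close() throws IOException;                   // release resources
}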

The second class we will create is an adapter around the Solr Java client. Strictly speaking this is not required, but it is a smart thing to do to gain better immunity against changes in the client. The adapter basically just extracts the required information from the Lucene Document generated by the Indexer and uses the Solr Java client to submit it to the Solr server.



/** Adds a single Lucene document to the index. */
public void index(Document doc) {

  SimpleSolrDoc solrDoc = new SimpleSolrDoc();
  for (Enumeration e = doc.fields(); e.hasMoreElements();) {
    // doc.fields() returns a raw Enumeration, so the cast is needed
    Field field = (Field) e.nextElement();
    if (!ignoreFields.contains(field.name())) {
      solrDoc.fields.put(field.name(), field.stringValue());
    }
  }
  try {
    client.add(solrDoc);
  } catch (Exception e) {
    LOG.warn("Could not index document, reason:" + e.getMessage(), e);
  }
}

/** Commits changes. */
public void commit() {
  try {
    client.commit(true, false);
  } catch (Exception e) {
    LOG.warn("Could not commit, reason:" + e.getMessage(), e);
  }
}



Setting up Nutch

Before starting the crawling process you first need to configure Nutch. If you are not familiar with the way Nutch operates, it is recommended to first follow the tutorial on the Nutch web site.

Basically the required steps are (make sure you use the correct filenames - replace '_' with '-'):

1. Set up conf/regex-urlfilter.txt
2. Set up conf/nutch-site.xml
3. Generate a list of seed urls into folder urls
4. Grab this simple script that will help you along in your crawling task.



After those initial steps you can start crawling by simply executing the crawl.sh script:

crawl.sh <basedir>, where basedir is the folder where your crawl contents will be stored.

The script will execute one iteration of fetching and indexing. After the first iteration you can start querying the newly generated index for the content you have crawled - for example with a URL like http://127.0.0.1:8983/solr/select?q=apache&start=0&rows=10&fl=title%2Curl%2Cscore&qt=standard&wt=standard&hl=on&hl.fl=content


If you started with the provided seed list your index should contain exactly one document, the Apache front page. You can now fetch more rounds and see how your index will grow.

Deficiencies of the demonstrated integration

There are a number of things you need to consider and implement before the integration is at a usable level.

Document boost

Document boosting was left out to keep this post small. If you are seriously planning to use a pattern like this, then you must add document boosting (it is not hard at all to add). Without it you will lose a precious piece of information from the link graph.
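
A minimal sketch of what this could look like in the adapter - assuming the client document class exposes some way to carry a boost; the setBoost call below is hypothetical and the SOLR-20 SimpleSolrDoc may need a different mechanism:

/** Sketch only: copy the link-graph score, which the Indexer stores as the
 *  Lucene document boost, onto the Solr document before posting it. */
public void index(Document doc) {
  SimpleSolrDoc solrDoc = new SimpleSolrDoc();
  solrDoc.setBoost(doc.getBoost()); // hypothetical setter - adapt to your client
  // ... copy the fields as shown earlier, then client.add(solrDoc)
}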

Support for multivalued fields

The anchor texts in Nutch are indexed into a multivalued field. The sample code in this post does not handle that.
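
A sketch of how the field-copying loop could preserve all values - this assumes the client document accepts a List as a field value, which the SOLR-20 client may or may not do, so adjust to whatever the client you use expects:

// Sketch only: group values by field name so multivalued fields such as
// "anchor" keep every value instead of only the last one written.
Map<String, List<String>> values = new HashMap<String, List<String>>();
for (Enumeration e = doc.fields(); e.hasMoreElements();) {
  Field field = (Field) e.nextElement();
  if (ignoreFields.contains(field.name())) {
    continue;
  }
  List<String> list = values.get(field.name());
  if (list == null) {
    list = new ArrayList<String>();
    values.put(field.name(), list);
  }
  list.add(field.stringValue());
}
for (Map.Entry<String, List<String>> entry : values.entrySet()) {
  solrDoc.fields.put(entry.getKey(), entry.getValue()); // assumes List values are accepted
}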

Deleting pages from index

Deleted pages are not removed from the index. One could implement this as part of the reduce method by checking the status of the CrawlDatum and posting a deletion request if it has status STATUS_FETCH_GONE.
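
A rough sketch of the idea inside the reduce method - the adapter.delete helper is hypothetical and would need to post a delete-by-id request for the page URL to Solr:

// Sketch only: skip indexing and request a deletion for pages that are gone.
// fetchDatum stands for the CrawlDatum carrying the fetch status for this URL,
// obtained from the reduce values the same way the existing Indexer code does.
if (fetchDatum != null && fetchDatum.getStatus() == CrawlDatum.STATUS_FETCH_GONE) {
  adapter.delete(key.toString()); // hypothetical helper posting <delete><id>url</id></delete>
  return;                         // nothing left to index for this page
}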

Posting multiple documents at the same time

The naive implementation here posts documents to the index one by one over the network. A better way would be to add multiple documents at a time.
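
A minimal sketch of batching inside the adapter - the bulk client.add(Collection) call is an assumption (check whether the client version you use supports it), the toSolrDoc helper stands in for the field copying shown earlier, and the batch size is an arbitrary example:

// Sketch only: buffer documents and post them in batches instead of one by one.
private final List<SimpleSolrDoc> buffer = new ArrayList<SimpleSolrDoc>();
private static final int BATCH_SIZE = 100; // arbitrary example size

public void index(Document doc) {
  buffer.add(toSolrDoc(doc)); // toSolrDoc: the field copying shown earlier
  if (buffer.size() >= BATCH_SIZE) {
    flush();
  }
}

/** Posts the buffered documents; also call this before commit(). */
private void flush() {
  if (buffer.isEmpty()) {
    return;
  }
  try {
    client.add(buffer); // assumed bulk add - adapt to your client
    buffer.clear();
  } catch (Exception e) {
    LOG.warn("Could not index batch, reason:" + e.getMessage(), e);
  }
}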

Further improvements - extending index size

If you are unwilling to wait for the next killer component in the Lucene family, you could quite easily extend the pattern presented here to support indexes larger than a single Solr server instance can handle.

A small addition to SolrClientAdapter would be sufficient: instead of posting all documents to a single Solr instance, one would post them to different indexes; the target server could be selected by hashing the document URL, for example. This is, however, not recommended unless you understand the consequences ;)
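
As a rough sketch of the selection logic only (building one client per target server, and everything around it, is left out):

/** Sketch only: picks a target Solr server for a document by hashing its URL,
 *  so the same page always lands on the same index. */
static int pickServer(String url, int numServers) {
  // Mask the sign bit so the result is always a valid, non-negative index.
  return (url.hashCode() & Integer.MAX_VALUE) % numServers;
}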

UPDATE 2007/07/15
Ryan has kindly posted an updated SolrClientAdapter that works with the client version currently in Solr trunk - thanks Ryan!




Comments



Hi:

First of all thank you for an interesting blog post. After trying it out.. I am getting the following problem..

SolrIndexer: starting
SolrIndexer: linkdb: solr/linkdb
SolrIndexer: adding segment: solr/segments/20070227115629
SolrIndexer: java.lang.RuntimeException: Cannot instantiate Solr client with url: 'http://localhost:8983/solr/update'. Reason: Unable to get XPP factory instance
at org.apache.nutch.indexer.SolrClientAdapter.init(SolrClientAdapter.java:58)
at org.apache.nutch.indexer.SolrIndexer$OutputFormat.setConf(SolrIndexer.java:172)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:46)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:70)
at org.apache.hadoop.mapred.JobConf.getOutputFormat(JobConf.java:272)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:319)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:371)
at org.apache.nutch.indexer.SolrIndexer.index(SolrIndexer.java:85)
at org.apache.nutch.indexer.SolrIndexer.run(SolrIndexer.java:110)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.indexer.SolrIndexer.main(SolrIndexer.java:92)

Any help or what I am doing wrong.
# posted by Blogger Nutch : February 27, 2007 1:10 PM  



Never mind I was missing the XPP3 lib which was not part of solr-client after building.

Sorry.
# posted by Blogger Nutch : February 27, 2007 1:35 PM  



This post has been removed by the author.
# posted by Blogger laurentapo : May 23, 2007 6:42 PM  



I try to plug nutch to solr and i'm getting the following error:

Executing commit
Job failed!
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
at org.apache.nutch.indexer.Indexer.index(Indexer.java:350)
at fr.mutinfo.lucene.TraitementFichier.crawlerWeb(TraitementFichier.java:223)
at fr.mutinfo.lucene.IndexDocJoint.main(IndexDocJoint.java:18)

thanks for your help
# posted by Blogger laurentapo : May 23, 2007 7:17 PM  



laurentapo, I got the same error when using solr and nutch from svn head/trunk. I retried with solr head and nutch 0.8 and got it to work.

I'm using the latest solr client the code update posted here from ryan.

Still playing with this, but it looks very promising.
# posted by Blogger Brian B : July 28, 2007 1:13 AM  



Where can I find SolrIndexer.java?
I tried to recreate it from the source code snipplets found in the article, but I'm not sure what type "client" variable should be and how to intialize it.
Also I could not find SimpleSolrDoc. It's in the SOLR-20 patch but I thought they are now part of Solr trunk. solrj.jar doesn't have this class. Perhaps it has been dropped?
# posted by Blogger Oray : August 25, 2007 12:48 AM  



Hi
Try to check at second statement of post where it says: "A patch against Nutch ...". Check the patch ... it has 2 java classes.


Regards
# posted by Blogger The Innovator : September 10, 2007 7:16 PM  



A few comments :
I'm happy with this post. It's very interesting for people who want to use these facilities.

About the post, the instructios may be not clear, so I take the liberty to make details:
- First, in step "2. Set up conf/nutch-site.xml", the file you can download save as the name nutch_site.xml" .. so you must rename it with the "-" instead of "_"
- Second, in SolrIndexer, you must import the class import org.apache.nutch.indexer.Indexer;
- Third, in script crawl.sh you must setup the JAVA_HOME variable

In my case, all is that I need to run the post fine.
I'm glad to help and collaborate to improve this.

Cheers !
# posted by Blogger The Innovator : September 10, 2007 9:15 PM  



Thanks SO much for posting this. I had to make some changes to the code examples as nutch, solr, and Solrj have changed since this posting, but what a great primer. It showed me where to start, and filled in some problems I would have never known how to fix.
# posted by Blogger Scott : October 19, 2007 2:30 AM  



This is a really great post, I appreciate it.

I've gone though all the steps and when I'm attempting to run Nutch and interface with SOLR I get this error.

laptop:~/Programs/nutch-0.9$ ./crawl.sh crawl
JAVA_HOME
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20071115174153
Generator: filtering: true
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
processing segment crawl/segments/20071115174153
Fetcher: starting
Fetcher: segment: crawl/segments/20071115174153
Fetcher: threads: 20
fetching http://www.apache.org/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20071115174153]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20071115174153
LinkDb: done
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/indexer/SolrIndexer
Command exited with abnormal status, bailing out.

I'm sure I'm just missing something in one of the config files but I haven't been able to find it. Can anyone help me out?
# posted by Blogger Brandon P : November 15, 2007 7:44 PM  



Hi:

Thank you all for this very interesting bolg post!! I'm trying to get it work, But I found some problems...

The SolrClientAdapter available on the site doesn't contain the SolrServer.java, the CommonsHttpSolrServer.java and the SolrInputDocument.java classes. Where can I find them?

is this post work with nutch 9.0?

Thanks.
# posted by Blogger ze_dach : February 28, 2008 11:09 AM  



Has anyone successfully achieved this feat. I have been trying for the past 6 hrs and have given up.

If someone can kindly send me a step wise document to prospr99@gmail.com I will really really appreciate it.

I am running nutch,lucene and patch on windows/cygwin and haven't been successful..

Please Please help
# posted by Blogger prospr : March 2, 2008 10:14 AM  






Hi

Just wanted to let you folks know that I successfully integrated nutch and solr.

But when I try to crawl any other web site besides www.apache.org it dosent seem to work ..... I modfied URLS, crawl filter, etc but no luck

$ ./crawl.sh crawl.s
Injector: starting
Injector: crawlDb: crawl.s/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl.s/segments/20080302112010
Generator: filtering: true
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Command exited with abnormal status, bailing out.
# posted by Blogger prospr : March 2, 2008 9:24 PM  



I've followed this tutorial very well and Solr was installed successfully. However, when I tried to run './bin/crawl.sh' in the Nutch installation directory, I got this message:
Usage is "crawl.sh basedir"

I have no idea whether Nutch works or not, and I don't see any index data coming in. Where could be wrong? thanks!
# posted by Blogger Tony W : December 22, 2008 8:45 AM  



[00:38:04 root@node bin]# ./crawl.sh crawl.s
Injector: starting
Injector: crawlDb: crawl.s/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl.s/segments/20081223003839
Generator: filtering: true
Generator: topN: 1000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
processing segment drwxr-xr-x - root root 4096 2008-12-23 00:38 /opt/tomcat6/nutch/crawl.s/segments/20081223003839
Fetcher: starting
Fetcher: segment: drwxr-xr-x
Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/tomcat6/nutch/drwxr-xr-x/crawl_generate
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:61)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:782)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1127)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:531)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:566)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:538)

Command exited with abnormal status, bailing out.


What is causing this problem? thanks
# posted by Blogger Tony W : December 23, 2008 7:45 AM  



Hi Tony W,

I had the same problem. I changed the following in crawl.sh:

cut -f1
to
cut -f16 -d' '

Somehow the tabs are not recognized. This line changes tab to a space as a delimiter. The file shows up after the 16th space. In your case it could be different number.

Good luck!
# posted by Blogger JSteggink : February 17, 2009 1:33 PM  



The dates on the blog page are confusing. The post is from 2009, the comments from 2007.
# posted by Blogger Edward_jones_customer : March 12, 2009 2:02 AM  



No, the post is also from 2007.
# posted by Blogger Sami Siren : March 12, 2009 7:43 AM  



I am new to Nutch. I am able to install and run it but have been trying to get a re-crawl script to work that I can schedule every night to fetch new/updated docs from the web sites and update the crawldB. If I put -adddays 30-31 it keeps processing for more than 24 hrs. If I put 5-10-15, it doesn't pick up any new/modified files.

This is what I have so far. The script runs well but doesn't pickup the modified file

#!/bin/bash

# tomcat_dir=$1 crawl_dir=$2 depth=$3 adddays=$4 topn="-topN $5"

# Set JAVA_HOME to reflect your systems java configuration

export JAVA_HOME='/cygdrive/c/Program Files/Java/jre1.6.0_01'

# Set the paths

nutch_dir='/cygdrive/d/inet/apps/nutch-0.9/bin'

crawl_dir='/cygdrive/d/inet/apps/nutch-0.9/crawl'

tomcat_dir='/cygdrive/d/inet/apps/Tomcat/webapps/nutch-0.9'

depth=10

# Only change if your crawl subdirectories are named something different

webdb_dir=$crawl_dir/crawldb

segments_dir=$crawl_dir/segments

linkdb_dir=$crawl_dir/linkdb

index_dir=$crawl_dir/index

# To generate/fetch/update cycle

for ((i=1; i <= depth ; i++))

do

bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 20

segment=`ls -d crawl/segments/* | tail -1`

bin/nutch fetch $segment

bin/nutch updatedb crawl/crawldb $segment

done

# Merge segments and cleanup unused segments

mergesegs_dir=crawl/mergesegs_dir

bin/nutch mergesegs crawl/mergesegs_dir -dir crawl/segments

for segment in `ls -d crawl/segments/* | tail -$depth`

do

echo "Removing Temporary Segment: $segment"

rm -rf $segment

done

cp -R crawl/mergesegs_dir/* crawl/segments

rm -rf crawl/mergesegs_dir

# Update segments

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# Index segments

new_indexes=crawl/newindexes

segment=`ls -d crawl/segments/* | tail -1`

bin/nutch index crawl/newindexes crawl/crawldb crawl/linkdb $segment

# De-duplicate indexes

bin/nutch dedup crawl/newindexes

# Merge indexes

bin/nutch merge crawl/index crawl/newindexes

# Tell Tomcat to reload index

touch /cygdrive/d/inet/apps/Tomcat/webapps/nutch-0.9/WEB-INF/web.xml

# Clean up

rm -rf crawl/newindexes


Your help will be appreciated.


Thanks,
# posted by Blogger Sanjay Malaviya : May 30, 2009 12:34 AM  
