Sunday, February 4, 2007


Online indexing - integrating Nutch with Solr

Update 2009-03-09: There is now a more up-to-date example of Solr integration available at the Lucid Imagination Blog.

There might be times when you would like to integrate Apache Nutch crawling with a single Apache Solr index server - for example when your collection is limited to the number of documents that can be served by a single Solr instance, or when you want to do your updates on a "live" index. Using Solr as your indexing server may also ease your maintenance burden quite a bit - you get rid of manual index life cycle management in Nutch and let Solr handle the index.


In this short post we will set up (and customize) Nutch to use Solr as its indexing engine. If you are using Solr directly to provide the search interface, that is all you need for a fully working setup. The Nutch commands are used as usual to manage the fetching part of the process (a script is provided that will ease that part). The integration between Nutch and Solr is not yet available out of the box, but it does not require much glue code.

A patch against Nutch trunk is provided for those who wish to be brave. In addition to that you will need the solr-client.jar and xpp3- in the nutch/lib directory (they are both part of the package from SOLR-20).

Setting up Solr

A nightly build of Apache Solr can be downloaded from the Apache site. It is really easy to set up; basically the only thing requiring special attention is the custom schema to be used (see the Solr wiki for more details about the available schema configuration options). Unpack the archive and go to the example directory of the extracted package.

I edited the example schema (solr/conf/schema.xml) and added the fields required by Nutch in its stock configuration:

<field name="url" type="string" indexed="true" stored="true"/>
<field name="content" type="text" indexed="true" stored="true"/>
<field name="segment" type="string" indexed="false" stored="true"/>
<field name="digest" type="string" indexed="false" stored="true"/>
<field name="host" type="string" indexed="true" stored="false"/>
<field name="site" type="string" indexed="true" stored="false"/>
<field name="anchor" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="tstamp" type="slong" indexed="false" stored="true"/>
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>



<solrQueryParser defaultOperator="AND"/>

<copyField source="anchor" dest="text"/>
<copyField source="title" dest="text"/>
<copyField source="content" dest="text"/>

After setting up the schema, just start the Solr server with the command: java -jar start.jar

Note: if you use indexing filters in Nutch that produce more fields, you need to add them to the Solr schema before you start indexing.

Implementing the glue

The integration with the Solr server is done with the client posted on SOLR-20. We will also implement a new indexer called SolrIndexer, which extends the existing Indexer in Nutch. Basically we only need to modify the OutputFormat of the Indexer class, but some additional (duplicate) code is also needed in order to launch the job with our custom code.

public static class OutputFormat extends org.apache.hadoop.mapred.OutputFormatBase
    implements Configurable {

  private Configuration conf;
  SolrClientAdapter adapter;

  public RecordWriter getRecordWriter(final FileSystem fs, JobConf job,
      String name, Progressable progress) throws IOException {

    return new RecordWriter() {
      boolean closed;

      public void write(WritableComparable key, Writable value)
          throws IOException {
        // unwrap & index doc
        Document doc = (Document) ((ObjectWritable) value).get();
        LOG.info("Indexing [" + doc.getField("url").stringValue() + "]");
        adapter.index(doc);
      }

      public void close(final Reporter reporter) throws IOException {
        // spawn a thread to give progress heartbeats while the commit runs
        Thread prog = new Thread() {
          public void run() {
            while (!closed) {
              try {
                reporter.setStatus("closing");
                Thread.sleep(1000);
              } catch (InterruptedException e) {
                continue;
              } catch (Throwable e) {
                return;
              }
            }
          }
        };

        try {
          prog.start();
          LOG.info("Executing commit");
          adapter.commit();
        } finally {
          closed = true;
        }
      }
    };
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    adapter = new SolrClientAdapter(conf);
  }
}
In the future it might be a good idea to make the indexing API in Nutch more generic, so that a variety of different index back ends could be supported with the same Indexer code.
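To illustrate the direction, here is a minimal sketch of what a backend-neutral indexing API could look like. None of these names exist in Nutch; they are hypothetical and only show the idea of selecting a back end via configuration.

```java
import java.util.Map;

// Hypothetical backend-neutral indexing interface: a Lucene-backed and a
// Solr-backed implementation could both sit behind it.
interface IndexBackend {
  void index(Map<String, String> doc) throws Exception; // add one document
  void commit() throws Exception;                       // make changes visible
}

class Backends {
  // The Indexer job would pick its back end from configuration instead of
  // hard-wiring Lucene index files or a Solr client.
  static String choose(String configured) {
    if ("solr".equals(configured)) {
      return "solr";
    }
    return "lucene"; // current Nutch behaviour as the default
  }
}
```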

The second class we will create is an adapter towards the Solr Java client. This is strictly speaking not required, but it is a smart thing to do to get better immunity against changes in the client. The adapter basically just extracts the required information from the Lucene Document generated by the Indexer and uses the Solr Java client to submit it to the Solr server.

/** Adds a single Lucene document to the index. */
public void index(Document doc) {
  SimpleSolrDoc solrDoc = new SimpleSolrDoc();
  for (Enumeration e = doc.fields(); e.hasMoreElements();) {
    Field field = (Field) e.nextElement();
    if (!ignoreFields.contains(field.name())) {
      solrDoc.fields.put(field.name(), field.stringValue());
    }
  }
  try {
    client.add(solrDoc);
  } catch (Exception e) {
    LOG.warn("Could not index document, reason:" + e.getMessage(), e);
  }
}

/** Commits changes. */
public void commit() {
  try {
    client.commit(true, false);
  } catch (Exception e) {
    LOG.warn("Could not commit, reason:" + e.getMessage(), e);
  }
}
Setting up Nutch

Before starting the crawling process you first need to configure Nutch. If you are not familiar with the way Nutch operates, it is recommended to first follow the tutorial on the Nutch web site.

Basically the steps required are (make sure you use the correct filenames - replace '_' with '-'):

1. Set up conf/regex-urlfilter.txt
2. Set up conf/nutch-site.xml
3. Generate a list of seed URLs in the folder urls
4. Grab this simple script that will help you along in your crawling task.

After those initial steps you can start crawling by simply executing the script: <basedir>, where basedir is the folder where your crawling contents will be stored.

The script will execute one iteration of fetching and indexing. After the first iteration you can start querying the newly generated index for the content you have crawled - for example with a URL like

If you started with the provided seed list your index should contain exactly one document, the Apache front page. You can now fetch more rounds and see how your index will grow.
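As a concrete example of what such a query URL looks like, here is a small helper that builds a Solr select URL. The host, port, and parameter choices are assumptions based on Solr's example setup (localhost:8983), not taken from this post; fl restricts the returned fields.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Builds a Solr select URL you can paste straight into a browser.
class SolrQueryUrl {
  static String build(String baseUrl, String query) {
    try {
      // URL-encode the user query; fl limits response to the stored fields we want
      return baseUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8")
          + "&fl=url,title";
    } catch (UnsupportedEncodingException e) {
      throw new RuntimeException(e); // UTF-8 is always available
    }
  }

  public static void main(String[] args) {
    System.out.println(build("http://localhost:8983/solr", "apache"));
  }
}
```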

Deficiencies of the demonstrated integration

There are a number of things you need to consider and implement before the integration is at a usable level.

Document boost

Document boosting was left out to keep this post small. If you are seriously planning to use a pattern like this, you must add document boosting (it is not hard at all to add). Without it you will lose a precious piece of information from the link graph.
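Solr's XML update message accepts a boost attribute on the doc element, so carrying Nutch's computed document boost over is mostly a matter of emitting it. A tiny sketch (the helper name is made up for illustration):

```java
// Emits the opening <doc> element of a Solr XML update message with the
// document boost Nutch computed from the link graph, instead of dropping it.
class BoostedDoc {
  static String docElement(float boost) {
    return "<doc boost=\"" + boost + "\">";
  }
}
```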

Support for multivalued fields

The anchor texts in Nutch are indexed into a multivalued field. The sample code from this post does not do that.
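The fix is to group values by field name and submit a list whenever a field repeats, rather than letting later values overwrite earlier ones. A sketch with plain Java collections (this is not the SOLR-20 client API, just the grouping logic):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MultiValueCollector {
  // Group repeated fields (e.g. several "anchor" values) into lists, so the
  // client can emit one field element per value instead of keeping only
  // the last one. Input rows are {name, value} pairs.
  static Map<String, List<String>> collect(String[][] fields) {
    Map<String, List<String>> grouped = new LinkedHashMap<String, List<String>>();
    for (String[] f : fields) {
      List<String> values = grouped.get(f[0]);
      if (values == null) {
        values = new ArrayList<String>();
        grouped.put(f[0], values);
      }
      values.add(f[1]);
    }
    return grouped;
  }
}
```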

Deleting pages from index

Deleted pages are not removed from the index. One could implement this as part of the reduce method by checking the status from CrawlDatum and posting a deletion request if it has status STATUS_FETCH_GONE.
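The decision itself is simple to sketch. The status constants below are stand-ins (use CrawlDatum.STATUS_FETCH_GONE from the real class; the numeric values here are illustrative), and the delete call would be whatever delete-by-id method the Solr client offers, keyed on the URL:

```java
class DeletionCheck {
  // Stand-ins for Nutch's CrawlDatum status codes; values are illustrative.
  static final byte STATUS_FETCH_SUCCESS = 1;
  static final byte STATUS_FETCH_GONE = 3;

  // In Indexer.reduce() gone pages would be routed to a deletion request
  // (e.g. a delete-by-id on the document URL) instead of being re-indexed.
  static String action(byte status) {
    return status == STATUS_FETCH_GONE ? "delete" : "index";
  }
}
```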

Posting multiple documents at same time

The naive implementation here posts documents to the index one by one over the network. A better way would be to add multiple documents at a time.
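A batching writer is one way to do that: buffer documents in the adapter and only talk to the server once per batch. A minimal sketch, with the actual network call left as a comment since it depends on the client in use:

```java
import java.util.ArrayList;
import java.util.List;

class BatchingWriter {
  private final int batchSize;
  private final List<String> buffer = new ArrayList<String>();
  int flushes = 0; // counts server round trips, for illustration

  BatchingWriter(int batchSize) {
    this.batchSize = batchSize;
  }

  // Buffer a document; flush automatically when the batch is full.
  void add(String doc) {
    buffer.add(doc);
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  // One network round trip for the whole batch; must also be called once
  // at close() time to push any remaining documents.
  void flush() {
    if (buffer.isEmpty()) {
      return;
    }
    // client.add(buffer);  // post all buffered documents in one request
    flushes++;
    buffer.clear();
  }
}
```

With a batch size of 10, indexing 25 documents costs three round trips instead of twenty-five.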

Further improvements - extending index size

If you are unwilling to wait for the next killer component in the Lucene family, you could probably extend the pattern presented here quite easily to support even larger indexes than can be handled with a single Solr server instance.

A small addition in SolrClientAdapter would be sufficient: instead of posting all documents to a single Solr instance, one would post documents to different indexes; the target server could be selected by hashing the document URL, for example. This is however not recommended unless you understand the consequences ;)
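The shard selection itself is a one-liner. A sketch of hash-based routing, which keeps a given URL pinned to the same index so that re-crawls overwrite the old copy in place:

```java
class ShardSelector {
  // Pick a target Solr instance by hashing the document URL. Masking with
  // 0x7fffffff avoids the negative-hash pitfall (Math.abs(Integer.MIN_VALUE)
  // is still negative).
  static int shardFor(String url, int numShards) {
    return (url.hashCode() & 0x7fffffff) % numShards;
  }
}
```

Note that this scheme reshuffles nearly every document if numShards changes, which is one of the consequences alluded to above.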

UPDATE 2007/07/15
Ryan has kindly posted an updated SolrClientAdapter that works with the client version currently in Solr trunk - thanks Ryan!
