Thursday, May 14, 2009

 

Yahoo calculates digits of π (a lot) faster than before

Yahoo recently reported that they have calculated more bits of π over roughly a weekend than an earlier project called PiHex did over a span of two years. The record-setting software was run on a Hadoop cluster and was implemented in the Java programming language. Check the details in the original post.




Friday, April 10, 2009

 

First thoughts on GAE/Java

Google App Engine has been around for over a year now. It didn't have a big impact on me at first, since the only language you could implement your application in was Python, and I don't speak it natively. After the Google Java announcement a few days ago I was quick enough to grab one of the first 10k test accounts. I have since learned a great deal about some (for me) new and nice technologies, and I have to say that I am actually excited about web application development again.

The Eclipse plugin that was released at the same time is one of the biggest things that helped me get a rapid start on Google App Engine. The plugin contains all the required components for App Engine development and also GWT development, so after you install it you can immediately start developing applications for Google App Engine. There is no need to install any additional software or SDKs. Writing the usual HelloWorld type of application is easy, as Google was kind enough to provide nice documentation on their web site. Once you have the application running in the local development environment, you can launch it in the Google cloud with a single mouse click from your Eclipse IDE.

One of the nicest things (IMO) about the application deployment process in general is the possibility to deploy several versions of your application. You can access different versions simultaneously (they're all reachable behind different hostnames). Accessing the versions from special URLs before making them public makes testing and functional verification of the application an enjoyable experience. When you are satisfied with a new version you can make it the default one, which is then served to normal site visitors. Also, if you notice a glitch in the application at some later point, you can very easily "roll back" to an earlier version if you like. (I wish I had had this kind of deployment mechanism in place at some customer sites.)

Google App Engine does not support storing data in the filesystem. The way to persist data is to store it in Google's BigTable, and from Java that is easily done with JDO or JPA. I tried the JDO way (for the first time in my life) and it was actually a very pleasant experience. In the simplest case, the only thing you need to do to persist your domain objects is to annotate them with JDO annotations - one annotation per class and one annotation per persistent property. After that you just call the makePersistent method of a PersistenceManager and your data is safe. Another nice feature of Google App Engine is that you can browse and query the persisted data from the Dashboard application; you can also insert data manually through it.
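
For illustration, a minimal JDO sketch could look something like the following (the class and field names are invented for this example, and the "transactions-optional" factory name is the one used in the SDK's sample jdoconfig.xml):

import javax.jdo.JDOHelper;
import javax.jdo.PersistenceManager;
import javax.jdo.PersistenceManagerFactory;
import javax.jdo.annotations.IdGeneratorStrategy;
import javax.jdo.annotations.IdentityType;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Persistent;
import javax.jdo.annotations.PrimaryKey;

// Hypothetical domain object: one class-level annotation, one per persistent property.
@PersistenceCapable(identityType = IdentityType.APPLICATION)
public class GuestbookEntry {

  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
  private Long id;

  @Persistent
  private String message;

  public GuestbookEntry(String message) {
    this.message = message;
  }

  // The factory name matches the jdoconfig.xml shipped with the SDK samples.
  private static final PersistenceManagerFactory PMF =
      JDOHelper.getPersistenceManagerFactory("transactions-optional");

  public static void save(String message) {
    PersistenceManager pm = PMF.getPersistenceManager();
    try {
      pm.makePersistent(new GuestbookEntry(message));
    } finally {
      pm.close();
    }
  }
}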

As part of my learning-new-things process I decided to take a recent Java application framework, implement a small application with it and deploy it to Google App Engine. The framework of my choice was Apache Click. There were only two minor issues I had to work through to get it up and running on Google App Engine. The first one was to exclude the Velocity templates from being served as static resources (in appengine-web.xml):


<static-files>
<exclude path="**.htm" />
</static-files>


The other issue was related to the OGNL library (a library used by Apache Click to copy properties from Click Forms to domain objects, which can then be persisted with JDO). After setting the security manager of OgnlRuntime to null it started to work properly. Click also has one fancy feature that allows application extensions packaged as JARs to provide static content to the webapp; unfortunately it will not work, as it relies on using the filesystem.
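
For reference, the workaround can be applied once at startup, for example from a servlet context listener. This is only a sketch and assumes the static OgnlRuntime.setSecurityManager(...) method of the OGNL 2.x releases used by Click:

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

import ognl.OgnlRuntime;

// Hypothetical listener that applies the OGNL workaround once at startup.
public class OgnlWorkaroundListener implements ServletContextListener {

  public void contextInitialized(ServletContextEvent event) {
    // App Engine's sandbox rejects OGNL's security manager checks,
    // so disable them before the first form-to-object copy happens.
    OgnlRuntime.setSecurityManager(null);
  }

  public void contextDestroyed(ServletContextEvent event) {
    // nothing to clean up
  }
}

The listener is then registered in web.xml like any other ServletContextListener.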

Overall I have to say it was really a lot of fun to work with Google App Engine, and I think it is a very nice platform to build and run your applications on. Also, the starting costs are definitely not too high: you can store up to 500 MB of data and serve up to 5 million page views per month for free! That is more than enough for the majority of web sites.

Now if only you could offer your home-grown Google App Engine applications in some marketplace similar to what they have in place for Android. The applications would be installable with a single mouse click, and running them at small scale would be very affordable or free. I bet this kind of commercial environment would be very interesting for application builders too.




Wednesday, April 8, 2009

 

It's here: Java on Google App Engine

Just noticed the Google announcement about the availability of the Java programming language on Google App Engine.

So far the only implementation language has been Python; then they joked about Fortran, and now they announce Java.

I am very excited about this announcement and can't wait to see more details.




Thursday, March 19, 2009

 

Apache Tika 0.3 released

Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Chris Mattmann just announced that the release of version 0.3 is official.

Go grab yourself a copy from a mirror nearby. Tika is also available through the central maven repository.

There is also an article about Tika and Solr Cell at Lucid Imagination web site.
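
For the curious, extracting text and metadata with Tika boils down to a few lines. The sketch below assumes the AutoDetectParser and BodyContentHandler classes of the 0.x API:

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaExtract {
  public static void main(String[] args) throws Exception {
    InputStream stream = new FileInputStream(args[0]);
    Metadata metadata = new Metadata();
    BodyContentHandler handler = new BodyContentHandler();
    try {
      // Detects the document type and delegates to the matching parser library.
      new AutoDetectParser().parse(stream, handler, metadata);
    } finally {
      stream.close();
    }
    System.out.println("Title: " + metadata.get(Metadata.TITLE));
    System.out.println(handler.toString());
  }
}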




Thursday, October 9, 2008

 

GWT + maven2 + eclipse on 64 bit linux in 30 seconds

GWT is a toolkit for building world-class Web 2.0 GUI applications without the headache. Setting up an Eclipse dev environment on 64-bit Fedora Core (version 9), however, required some extra steps. I heard that Google is working on smoother 64-bit integration, but until it's here you might find the list of required actions useful:

The basic installation


1. Get GWT
GWT comes in three flavours - one for Windows, one for Mac and one for (32-bit) Linux. Manual installation of GWT is not strictly required for maven + eclipse use, but it's convenient to have it around to be able to run your apps in hosted mode without maven.

2. Get 32 bit jdk
The default JDK in my FC 9 installation identified itself as "OpenJDK Runtime Environment (build 1.6.0-b09)", and if you try to start the GWT Shell with it, all you get is the following error message:


Exception in thread "main" java.lang.UnsatisfiedLinkError: ...gwt-linux-1.5.2/libswt-pi-gtk-3235.so: .../gwt/gwt-linux-1.5.2/libswt-pi-gtk-3235.so: wrong ELF class: ELFCLASS32 (Possible cause: architecture word width mismatch)


The cure for this problem is to fetch and install a 32-bit Java from Sun.

3. Install some required 32 bit libraries:

If you see an error like:

Exception in thread "main" java.lang.UnsatisfiedLinkError: .../gwt/gwt-linux-1.5.2/libswt-pi-gtk-3235.so: libXtst.so.6: cannot open shared object file: No such file or directory
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1778)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1674)
at java.lang.Runtime.load0(Runtime.java:770)
at java.lang.System.load(System.java:1005)
at org.eclipse.swt.internal.Library.loadLibrary(Library.java:132)
at org.eclipse.swt.internal.gtk.OS.(OS.java:22)
at org.eclipse.swt.internal.Converter.wcsToMbcs(Converter.java:63)
at org.eclipse.swt.internal.Converter.wcsToMbcs(Converter.java:54)
at org.eclipse.swt.widgets.Display.(Display.java:126)
at com.google.gwt.dev.GWTShell.(GWTShell.java:301)
Could not find the main class: com.google.gwt.dev.GWTShell. Program will exit.


you need to install the 32-bit version of libXtst:
sudo yum install libXtst.i386

If you see an error like:

Exception in thread "main" java.lang.UnsatisfiedLinkError: .../libswt-pi-gtk-3235.so: libgtk-x11-2.0.so.0: cannot open shared object file: No such file or directory

you need to install the 32-bit version of gtk2:

sudo yum install gtk2.i386



If you see an error like

** Unable to load Mozilla for hosted mode **
java.lang.UnsatisfiedLinkError: .../gwt/gwt-linux-1.5.2/mozilla-1.7.12/libxpcom.so: libstdc++.so.5: cannot open shared object file: No such file or directory


you need to install the 32 bit compat-libstdc++ library:
sudo yum install compat-libstdc++-33.i386

You are now ready for command line development of GWT apps. Just remember to set the path so that the 32 bit java is used to launch the GWT Shell.

Alternative way for maven users


1. Get 32 bit jdk

2. Install the required 32 bit libraries

3. Check out sample maven app that uses gwt-maven

svn co http://gwt-maven.googlecode.com/svn/trunk/maven-googlewebtoolkit2-plugin/simplesample/ sample

Launch the sample application


cd sample
mvn gwt:gwt



Launching the GWT Shell from Eclipse


1. Import (or check out from SCM) your project into Eclipse
2. From the Run menu select Run Configurations
3. Right-click Java Application on the left side of the dialog and select New
4. Set com.google.gwt.dev.GWTShell as the main class
5. On the Arguments tab, enter the module name as a program argument (in the sample it is com.totsp.sample.Application)
6. On the JRE tab, make sure the 32-bit JRE is used for the project
7. On the Classpath tab, click Advanced..., select Add Folders and choose the src/java folder of your project (or the folder that contains the <package>/Module.gwt.xml file)
8. Click 'Apply' and 'Run'

Your GWT app is now running and you can enjoy features like the nice, fast edit->save->refresh development cycle, the debugger, and so on.




Sunday, August 31, 2008

 

Interactive query reporting with lucene

Today's post gives you a simple example of how Apache Lucene can be used as a powerful, full-text, ad-hoc-query-enabled data warehouse to build simple reports from, a little like Google Trends.



In this example we use the software for query count reporting but it could also be used to report counts of any other kind of data for example blog posts, news articles and so on.

The data set used in this example is the AOL query data set, which consists of a little under 40 million records. A single record contains a user identifier, the user query, a timestamp and click data. We are only interested in the queries and timestamps here, so we skip the rest of the data.

The used Lucene document structure is simple:


// fields[] holds the tab-separated columns of one AOL log record,
// date[] the timestamp column split on whitespace (only the date part is used)
Document doc = new Document();
doc.add(new Field("query", fields[1].trim(), Field.Store.NO,
    Field.Index.TOKENIZED, TermVector.NO));
doc.add(new Field("date", date[0].trim(), Field.Store.NO,
    Field.Index.NO_NORMS, TermVector.NO));


On the query side we rely on BitSets, unions and intersections of BitSets, and the cardinality method to do the hit counting.
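
To make the idea concrete, here is a rough sketch of the counting logic against the Lucene 2.x API used at the time (the class and method names are made up for this example):

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;

public class QueryCounter {

  // BitSet of documents whose "query" field contains the given term.
  static BitSet bitsForTerm(IndexReader reader, String term) throws IOException {
    return new QueryWrapperFilter(new TermQuery(new Term("query", term))).bits(reader);
  }

  // BitSet of documents logged on the given day (the "date" field).
  static BitSet bitsForDay(IndexReader reader, String day) throws IOException {
    return new QueryWrapperFilter(new TermQuery(new Term("date", day))).bits(reader);
  }

  // Hit count for term on day = cardinality of the intersection of the two sets.
  static int countForDay(IndexReader reader, String term, String day) throws IOException {
    BitSet bits = (BitSet) bitsForTerm(reader, term).clone();
    bits.and(bitsForDay(reader, day));
    return bits.cardinality();
  }
}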

For transferring the data between the browser and back end we use json over HTTP. The message format is as follows:


{
"counts":[
{"date": "2006-03-01", "c0":43, "c1":0, "c2":230}
,{"date": "2006-03-02", "c0":11, "c1":0, "c2":245}
,{"date": "2006-03-03", "c0":15, "c1":0, "c2":252}
,{"date": "2006-03-04", "c0":50, "c1":2, "c2":288}
,{"date": "2006-03-05", "c0":14, "c1":0, "c2":294}
,{"date": "2006-03-06", "c0":51, "c1":0, "c2":345}
,{"date": "2006-03-07", "c0":19, "c1":0, "c2":225}
,{"date": "2006-03-08", "c0":16, "c1":0, "c2":219}
,{"date": "2006-03-09", "c0":44, "c1":0, "c2":197}
,{"date": "2006-03-10", "c0":32, "c1":0, "c2":269}
,{"date": "2006-03-11", "c0":38, "c1":0, "c2":311}
,{"date": "2006-03-12", "c0":10, "c1":0, "c2":230}
,{"date": "2006-03-13", "c0":25, "c1":0, "c2":162}
,{"date": "2006-03-14", "c0":59, "c1":0, "c2":261}
]
}

I found jsonlint to be a useful tool for a JSON novice like me to verify that the JSON messages I generated were indeed valid.

In the end-user interface we use the YUI Charts and YUI DataSource components.

Steps to get the app running



Note: because the indexing and the webapp are run inside Maven in this example, you need to make sure there's enough heap available, with a command like "export MAVEN_OPTS=-Xmx1024m".


  1. Download the sources (size: 6 kB)

  2. Get the data set...

  3. Build the index into the directory index:

    mvn exec:java -Dexec.mainClass="fi.foofactory.lucene.report.Indexer" -Dexec.args="<location of AOL collection> index"

    The resulting index is about 706 megabytes in size. Building the index on the machine I used took around 20 minutes.

  4. Run the webapp:

    mvn jetty:run-war -Dlucene.index.dir=index

  5. Access the interface with your browser at http://localhost:8080/lucene-report/




Possible enhancements


Parallelism


The demo code is completely single-threaded, so it utilizes only one CPU core per request. The data set used in this demo is small enough that the workload is entirely CPU bound. You could parallelize at least some of the work (getting the BitSets for the submitted queries, calculating the intersections) without exploding the memory requirements.
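
As a sketch of the parallelization idea (names and wiring are hypothetical), the per-query BitSet lookups could be pushed onto a thread pool while the rest of the request handling stays as it is:

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelBits {

  private final ExecutorService pool =
      Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

  // Each task encapsulates one "get the BitSet for this query" computation.
  public List<BitSet> compute(List<Callable<BitSet>> tasks) throws Exception {
    List<BitSet> results = new ArrayList<BitSet>();
    for (Future<BitSet> future : pool.invokeAll(tasks)) {
      results.add(future.get());
    }
    return results;
  }
}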

Approximation


If you are interested in general trends and relative volumes, then it does not matter if you miss an observation or two. You could use sampling to reduce the number of observations and still get good results for relatively frequent terms. For rare terms you can always fall back to a statement like:


Your terms - <insert search terms here> - do not have enough search volume to show graphs.

Partitioning


The data set could be divided into smaller Lucene indexes, and those indexes could be deployed on multiple machines. You would need to build an aggregator which requests the results from the smaller indexes and simply adds the counts together before returning the final result. A nice way to build index partitions is a recently added Apache Hadoop contrib module.

Imagine


In an imaginary world, if you used multiple threads to calculate the intersections, used sampling to preserve one tenth of the original observations, and split the data across 10 machines (4 CPU cores each, with unlimited RAM), the response time of the report queries for that same data set (approximated) could be something like 1/400th of the original. Or the other way around: if you're satisfied with the response time already, you could use the same technique on a data set of 1.6*10^11 documents.

For a limited time there's a demo available online here. (the demo will be removed without further notice)




Thursday, July 3, 2008

 

Hadoop takes the lead position

Owen O'Malley, the project lead of Apache Hadoop and a member of the Yahoo grid team, announced today that they have taken the number one position in the Terabyte Sort Benchmark.

The new record of 209 seconds is nearly 30% faster than the previous one. More details here and here.

Congratulations! Nice work, once again.




Saturday, March 8, 2008

 

Nutch training at ApacheCon EU 2008

As some of you might have noticed, I am set to give a half-day training on Apache Nutch at ApacheCon EU 2008.

However, there are still too many seats available and I need your help to get things going. So if you are interested in Nutch internals and have Tuesday, April 8 open in your calendar, please go ahead and book a seat at the ApacheCon web site!

Don't forget that there is also a huge number of other interesting sessions and trainings during the week; see the schedule for more info.




Friday, January 18, 2008

 

Regenerating equally sized shards from set of Lucene indexes

If you need to split your large index, or a set of indexes, into smaller, equally sized "shards", this prototype tool might be for you. There is a tool inside the Lucene distribution for combining several indexes into one, but to my knowledge there is no tool to do the opposite.

The usual use case for splitting your index is index distribution: for example, you plan to distribute pieces of your index onto several machines to increase query throughput. Of course this operation could be done by reindexing the data, but resizing the index shards _seems_ to be faster than that (I need to do some benchmarking to confirm it).

This tool should be able to handle several different scenarios for you:

1. splitting one large index into many smaller ones

2. combining and resplitting several indexes into new set of indexes

3. combining several indexes into one

This tool does not try to interpret the physical index format but lets Lucene do the heavy lifting by simply using IndexWriter.addIndexes().
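
For the simplest of the scenarios above (combining several indexes into one), the Lucene 2.x call boils down to roughly the sketch below; the splitting scenarios additionally need to control which documents end up in which target index, which is left out here:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CombineIndexes {
  public static void main(String[] args) throws Exception {
    // Last argument is the target index, the rest are the sources.
    Directory target = FSDirectory.getDirectory(args[args.length - 1]);
    Directory[] sources = new Directory[args.length - 1];
    for (int i = 0; i < sources.length; i++) {
      sources[i] = FSDirectory.getDirectory(args[i]);
    }
    IndexWriter writer = new IndexWriter(target, new StandardAnalyzer(), true);
    writer.addIndexes(sources);  // Lucene merges the source segments
    writer.optimize();
    writer.close();
  }
}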

DISCLAIMER: I only had time to do some very limited testing with smallish indexes with this tool, but I plan to do some more testing with bigger indexes soon to get an idea how this will work in real life.

Download sources.




Tuesday, November 6, 2007

 

Javascript prototyping for Hadoop map/reduce jobs

Java 6 brought us JVM-level scripting support as specified in JSR-223 (see the Sun documentation on the subject).
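
The core of that support is the javax.script API: you obtain an engine, evaluate a script and invoke functions defined in it from Java, which is essentially what the wrapper used below does for the map, reduce and driver functions. A tiny standalone sketch:

import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class ScriptingDemo {
  public static void main(String[] args) throws Exception {
    // The JavaScript (Rhino) engine ships with Java 6.
    ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
    engine.eval("function greet(name) { return 'Hello ' + name; }");
    Invocable invocable = (Invocable) engine;
    System.out.println(invocable.invokeFunction("greet", "Hadoop"));
  }
}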

Wordcount Demo



I have chosen the wordcount example (as seen on Hadoop) to demonstrate a complete javascript map/reduce example (a mapper, a reducer and a "driver"):


importPackage(java.util);
importPackage(org.apache.hadoop.io);


function map(key, value, collector) {
  var tokenizer = new StringTokenizer(value);
  while (tokenizer.hasMoreTokens()) {
    collector.collect(tokenizer.nextToken(), 1);
  }
}

function reduce(key, values, collector) {
  var counter = 0;
  while (values.hasNext()) {
    counter += values.next().get();
  }
  collector.collect(key, counter);
}

function driver(args, jobConf, jobClient) {
  jobConf.outputKeyClass = Text;
  jobConf.outputValueClass = IntWritable;
  jobConf.setInput(args[0]);
  jobConf.setOutput(args[1]);
  jobConf.map = 'map';
  jobConf.combiner = 'reduce';
  jobConf.reduce = 'reduce';
  jobClient.runJob(jobConf);
}



System requirements




Running javascript map reduce



  • Install java as documented by Sun

  • Extract hadoop release tar ball into your favourite location

  • Run your JavaScript map/reduce job with the following command: bin/hadoop jar </path/to/scriptmr.jar> <script-file-name> [<script-arg-1>...]



Performance?


Shortly: there isn't any. The wordcount demo as presented here takes roughly 6-7 times longer than the wordcount Java example from Hadoop. It's quite obvious that this technique is not really practical beyond prototyping. For prototyping... hmm, I am not totally sure it is usable for that either. Anyway, have fun with it - I know I had some while making it ;)




Tuesday, October 30, 2007

 

Spring portlet sample with maven2 on Apache Pluto

As part of my process of learning about portals and JSR-168 portlets, I decided to do some hands-on testing based on the publicly available Spring portlet sample and Apache Pluto. While at it I converted the sample to use a Maven build, and decided to publish the results here in case someone else is thinking about the same.

So have fun with spring-portlet-sample-maven.tar.gz.

Compile the portlet application by running mvn clean install

The installation instructions are simple: once you have your pluto-tomcat-bundle up and running (available here), just copy the resulting war file under Tomcat's webapps directory and use the Pluto Page Administrator to add the portlets to your page.
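
For reference, the portlet classes themselves are plain JSR-168 portlets; the sample uses Spring's portlet MVC on top, but a minimal, hypothetical portlet built on the bare API looks roughly like this:

import java.io.IOException;

import javax.portlet.GenericPortlet;
import javax.portlet.PortletException;
import javax.portlet.RenderRequest;
import javax.portlet.RenderResponse;

public class HelloPortlet extends GenericPortlet {

  // Renders the VIEW mode fragment that the portal embeds into its page.
  protected void doView(RenderRequest request, RenderResponse response)
      throws PortletException, IOException {
    response.setContentType("text/html");
    response.getWriter().println("<p>Hello from a portlet!</p>");
  }
}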




Thursday, June 14, 2007

 

Adding support for tight Enum and EnumSet into a hadoop application

After working on many projects that use a combination of int and String constants in various ways to mimic enumerations, I have learned to like Java enums. I am also a fan of Hadoop, so I decided to see how the two fit together.

The WritableUtils class already has support for writing and reading Java enum values: in WritableUtils an enum is serialized by converting the value of enum.name() to Text. There is also another way of storing enum state which requires less space. The ordinal() method gives you the index of a constant within its enum, and this technique allows you to store an enum with up to 256 different constants in a single byte. EnumWritable perhaps does not warrant a class of its own - utility methods capable of reading and writing an enum to a DataInput/DataOutput would be sufficient.


import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class EnumWritable<E extends Enum<E>> implements Writable {

  /** Ordinal of the enum constant, stored in a single byte. */
  private byte storage;

  public EnumWritable() {
    storage = 0;
  }

  public EnumWritable(Enum<E> value) {
    set(value);
  }

  public E get(Class<E> enumType) throws IOException {
    for (E type : enumType.getEnumConstants()) {
      // Mask to an unsigned value so enums with more than 127 constants
      // are matched correctly.
      if ((storage & 0xff) == type.ordinal()) {
        return type;
      }
    }
    return null;
  }

  public void set(Enum<E> e) {
    storage = (byte) e.ordinal();
  }

  public void readFields(DataInput in) throws IOException {
    storage = in.readByte();
  }

  public void write(DataOutput out) throws IOException {
    out.writeByte(storage);
  }

  @Override
  public String toString() {
    return Integer.toString(storage & 0xff);
  }

  @Override
  public boolean equals(Object obj) {
    if (!(obj instanceof EnumWritable)) {
      return super.equals(obj);
    }
    EnumWritable<?> that = (EnumWritable<?>) obj;
    return this.storage == that.storage;
  }

  @Override
  public int hashCode() {
    return storage;
  }

}


Another useful structure for storing the equivalent of multiple boolean values, or a piece of data that consists of bits (which can be modeled as constants in an enum), is EnumSet. Storage for an EnumSet is also compact if you take advantage of the ordinal() method: you can store an EnumSet using 1 bit of storage per constant.

I wrote a prototype EnumSetWritable that takes from 1 to 4 bytes of storage, depending on how many constants the enum contains and which constants are in the set. The 2 least significant bits store the size of the storage (1-4 bytes) and the rest of the bits store the members of the EnumSet. Since the space for storing the EnumSet varies from 1 to 4 bytes, it can handle EnumSets over enums with up to 30 constants. Ordering the constants so that the ones most often present in the set come first will usually lead to smaller space consumption.
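
To illustrate the bitmask idea, here is a simplified sketch (not the actual EnumSetWritable, which additionally packs the storage length into the two low bits):

import java.util.EnumSet;

public class EnumSetBits {

  // Bit i is set when the constant with ordinal i is a member of the set.
  static <E extends Enum<E>> int toBits(EnumSet<E> set) {
    int bits = 0;
    for (E e : set) {
      bits |= 1 << e.ordinal();
    }
    return bits;
  }

  static <E extends Enum<E>> EnumSet<E> fromBits(int bits, Class<E> type) {
    EnumSet<E> set = EnumSet.noneOf(type);
    for (E e : type.getEnumConstants()) {
      if ((bits & (1 << e.ordinal())) != 0) {
        set.add(e);
      }
    }
    return set;
  }
}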

Real-world applications for the EnumSet could be, for example, something like:



enum ContainsData {
  Raw, Parsed, Processed;
}

EnumSetWritable<ContainsData> contains = new EnumSetWritable<ContainsData>();

public void readFields(DataInput in) throws IOException {
  contains.readFields(in);
  for (ContainsData type : contains.getEnumSet(ContainsData.class)) {
    switch (type) {
      case Raw:       // read raw data
        break;
      case Parsed:    // read parsed data
        break;
      case Processed: // read processed data
        break;
    }
  }
}



Here you would use the EnumSet to store the presence information of various data structures in a Writable, and thus avoid, for example, storing booleans (which take one byte of space each) for the same piece of information.




Sunday, February 4, 2007

 

Online indexing - integrating Nutch with Solr

Update 2009-03-09: There is now a more up-to-date example of Solr integration available at the Lucid Imagination blog.

There might be times when you would like to integrate Apache Nutch crawling with a single Apache Solr index server - for example when your collection size is limited to the number of documents that can be served by a single Solr instance, or when you would like to do your updates on a "live" index. Using Solr as your indexing server might even ease your maintenance burden quite a bit: you get rid of manual index life cycle management in Nutch and let Solr handle your index.

Overview

In this short post we will set up (and customize) Nutch to use Solr as the indexing engine. If you are using Solr directly to provide a search interface, then that's all you need to do to get a fully working setup. The Nutch commands are used as usual to manage the fetching part of the process (a script is provided to ease that part). The integration between Nutch and Solr is not yet available out of the box, but it does not require much glue code.

A patch against Nutch trunk is provided for those who wish to be brave. In addition to that you will need solr-client.jar and xpp3-1.1.3.4.O.jar in the nutch/lib directory (they are both part of the solr-client.zip package from SOLR-20).


Setting up Solr

A nightly build of Apache Solr can be downloaded from the Apache site. It is really easy to set up, and basically the only thing requiring special attention is the custom schema to be used (see the Solr wiki for more details about the available schema configuration options). Unpack the archive and go to the example directory of the extracted package.

I edited the example schema (solr/conf/schema.xml) and added the fields required by Nutch in its stock configuration:



<fields>
<field name="url" type="string" indexed="true" stored="true"/>
<field name="content" type="text" indexed="true" stored="true"/>
<field name="segment" type="string" indexed="false" stored="true"/>
<field name="digest" type="string" indexed="false" stored="true"/>
<field name="host" type="string" indexed="true" stored="false"/>
<field name="site" type="string" indexed="true" stored="false"/>
<field name="anchor" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="tstamp" type="slong" indexed="false" stored="true"/>
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
</fields>

<uniqueKey>url</uniqueKey>

<defaultSearchField>text</defaultSearchField>

<solrQueryParser defaultOperator="AND"/>

<copyField source="anchor" dest="text"/>
<copyField source="title" dest="text"/>
<copyField source="content" dest="text"/>




After setting up the schema, just start the Solr server with the command: java -jar start.jar

Note: if you use indexing filters in Nutch that produce more fields, you need to add those fields to the Solr schema before you start indexing.

Implementing the glue



The integration with the Solr server is done with the client posted on SOLR-20. We also implement a new indexer called SolrIndexer which extends the existing Indexer in Nutch. Basically we would only need to modify the OutputFormat of the Indexer class, but some additional (duplicate) code is also needed in order to launch the job with our custom code.


public static class OutputFormat extends org.apache.hadoop.mapred.OutputFormatBase
    implements Configurable {

  private Configuration conf;
  SolrClientAdapter adapter;

  public RecordWriter getRecordWriter(final FileSystem fs, JobConf job,
      String name, Progressable progress) throws IOException {

    return new RecordWriter() {
      boolean closed;

      public void write(WritableComparable key, Writable value)
          throws IOException { // unwrap & index doc
        Document doc = (Document) ((ObjectWritable) value).get();
        LOG.info("Indexing [" + doc.getField("url").stringValue() + "]");
        adapter.index(doc);
      }

      public void close(final Reporter reporter) throws IOException {
        // spawn a thread to give progress heartbeats
        Thread prog = new Thread() {
          public void run() {
            while (!closed) {
              try {
                reporter.setStatus("closing");
                Thread.sleep(1000);
              } catch (InterruptedException e) {
                continue;
              } catch (Throwable e) {
                return;
              }
            }
          }
        };

        try {
          prog.start();
          LOG.info("Executing commit");
          adapter.commit();
        } finally {
          closed = true;
        }
      }
    };
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    adapter = new SolrClientAdapter(conf);
  }

}


In the future it might be a good idea to improve the indexing API in Nutch to be more generic, so that we could support a variety of different index back ends with the same Indexer code.

The second class we create is an adapter towards the Solr Java client. This is not strictly required, but to get better immunity against changes in the client it is a smart thing to do. The adapter basically just extracts the required information from the Lucene Document generated by the Indexer and uses the Solr Java client to submit it to the Solr server.



/** Adds a single Lucene document to the index. */
public void index(Document doc) {

  SimpleSolrDoc solrDoc = new SimpleSolrDoc();
  for (Enumeration e = doc.fields(); e.hasMoreElements();) {
    // doc.fields() returns a raw Enumeration, so the cast is needed
    Field field = (Field) e.nextElement();
    if (!ignoreFields.contains(field.name())) {
      solrDoc.fields.put(field.name(), field.stringValue());
    }
  }
  try {
    client.add(solrDoc);
  } catch (Exception e) {
    LOG.warn("Could not index document, reason:" + e.getMessage(), e);
  }
}

/** Commits changes. */
public void commit() {
  try {
    client.commit(true, false);
  } catch (Exception e) {
    LOG.warn("Could not commit, reason:" + e.getMessage(), e);
  }
}



Setting up Nutch

Before starting the crawling process you first need to configure Nutch. If you are not familiar with the way Nutch operates, it is recommended to first follow the tutorial on the Nutch web site.

Basically the steps required are (make sure you use correct filenames - replace '_' with '-'):

1. Set up conf/regex-urlfilter.txt
2. Set up conf/nutch-site.xml
3. Generate a list of seed urls into folder urls
4. Grab this simple script that will help you along in your crawling task.



After those initial steps you can start crawling by simply executing the crawl.sh script:

crawl.sh <basedir>, where basedir is the folder where your crawling contents will be stored.

The script executes one iteration of fetching and indexing. After the first iteration you can start querying the newly generated index for the content you have crawled - for example with a URL like http://127.0.0.1:8983/solr/select?q=apache&start=0&rows=10&fl=title%2Curl%2Cscore&qt=standard&wt=standard&hl=on&hl.fl=content


If you started with the provided seed list your index should contain exactly one document, the Apache front page. You can now fetch more rounds and see how your index will grow.

Deficiencies of the demonstrated integration

There are a number of things you need to consider and implement before the integration is at a usable level.

Document boost

Document boosting was left out to keep this post short. If you are seriously planning to use a pattern like this, then you must add document boosting (not hard at all to add). Without it you lose a precious piece of information from the link graph.

Support for multivalued fields

The anchor texts in Nutch are indexed into a multivalued field. The sample code in this post does not do that.

Deleting pages from index

Deleted pages are not removed from the index. One could implement this as part of the reduce method by checking the status of the CrawlDatum and posting a deletion request if it has the status STATUS_FETCH_GONE.

Posting multiple documents at same time

The naive implementation here posts documents to the index one by one over the network. A better way would be to add multiple documents at a time.

Further improvements - extending index size

If you are unwilling to wait for the next killer component in the Lucene family, you could quite easily extend the pattern presented here to support even larger indexes than can be handled by a single Solr server instance.

A small addition to SolrClientAdapter would be sufficient: instead of posting all docs to a single Solr instance, one would post documents to different indexes; the target server could be selected by hashing the document URL, for example. This is however not recommended unless you understand the consequences ;)
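
The selection logic itself is tiny; a hypothetical sketch (the server list and class name are invented for the example):

public class ShardSelector {

  private final String[] solrUrls;

  public ShardSelector(String[] solrUrls) {
    this.solrUrls = solrUrls;
  }

  // The same document URL always hashes to the same Solr instance.
  public String shardFor(String documentUrl) {
    int bucket = (documentUrl.hashCode() & Integer.MAX_VALUE) % solrUrls.length;
    return solrUrls[bucket];
  }
}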

UPDATE 2007/07/15
Ryan has kindly posted an updated SolrClientAdapter that works with the client version currently in Solr trunk - thanks Ryan!




Saturday, January 20, 2007

 

Website up

I got an extra burst of energy yesterday and set up my company website. The nice, simple layout (also deployed to this blog) was designed by Andreas Viklund - nice work, man!

Another part of credits goes to the creators of MeshCMS which powers the site. It was really easy to set up, no fighting with databases and stuff, just deploy the war and start writing content - another excellent example of how usable open source software is getting these days.

Regarding the company - everything is still at a very early stage and I am working on it part time only (or should I say when there's demand ;). I'll get back to this when there's something to report.




Sunday, January 14, 2007

 

Sorted out

The Fetcher performance in post-0.7.x versions of Nutch has been a target of critique for a long time, and not without cause. Even though many improvements have been made during the last year (and many are still waiting to be done), things just aren't as fast as one would hope.

One particular thing had been bothering me for a long time, but I never really had time to look into it - until now.

The Nutch Fetcher operates by reading a sequence of URLs from a list generated by the Generator. These URLs are then handed to FetcherThreads. A FetcherThread fetches the content, parses it (or not, depending on the configuration) and stores it into a segment for later processing. Some info about the contents of Nutch segments can be found in my previous post.

The Fetcher also has a built-in mechanism to behave like a good citizen and not fetch more pages per unit of time than configured. If the fetchlist contains a lot of URLs in a row for the same host, many threads get blocked because of this politeness-enforcing mechanism. The queuing mechanism is a good thing, but as a side effect a lot of threads just sit and wait in a queue because some other thread just fetched a page from the same host they were about to fetch.

There are a number of configuration settings one can tune so that a minimal number of threads are blocked during fetching; some of them are listed below:


generate.max.per.host
generate.max.per.host.by.ip
fetcher.threads.per.host.by.ip
fetcher.server.delay
fetcher.threads.fetch
fetcher.threads.per.host


But when, even after you have set up a reasonable configuration (like generate.max.per.host * threads < num_of_urls_to_generate), you still end up in a situation where tons of threads are blocked on the same host, you start to wonder what the problem is this time.

This time the blame was in the Generator, or more specifically in HashComparator. It took me a long time to figure out what the real problem was; I even tried out other hash functions because I thought the existing one was flawed. In the end the problem is quite obvious:

public int compare(...) {
  ...
  if (hash1 != hash2) {
    return hash1 - hash2;
  }
  ...
}

Isn't it? Well, it wasn't for me. But afterwards it's easy to say that overflow in the integer arithmetic was to blame. I changed the compare methods slightly to get rid of the integer overflow:

public int compare(...) {
  ...
  return (hash1 < hash2 ? -1 : (hash1 == hash2 ? 0 : 1));
}
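
A concrete case shows why the subtraction trick breaks down: when the difference does not fit in an int, it wraps around and the comparator reports the wrong order.

public class OverflowDemo {
  public static void main(String[] args) {
    int hash1 = Integer.MIN_VALUE; // -2147483648, should sort first
    int hash2 = 1;
    // The subtraction overflows and claims hash1 is the larger value.
    System.out.println(hash1 - hash2);                                 // 2147483647
    // The comparison-based version gives the correct answer.
    System.out.println(hash1 < hash2 ? -1 : (hash1 == hash2 ? 0 : 1)); // -1
  }
}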

To verify the effect of the fix I generated two segments of 10,000 URLs each (exactly the same URLs) - one with the original code and one with the modified code. The runtimes for these single-server fetches are listed below:

Original:
real 32m16.246s
user 2m33.726s
sys 0m9.989s

Modded:
real 19m40.026s
user 2m35.371s
sys 0m10.892s

The absolute times are more or less meaningless and are provided just for reference. Below is a chart of the bandwidth used during fetching; the thing to note is the more even bandwidth usage with a properly sorted fetchlist.



In the end I have to say that I am very pleased I got this one sorted out.




Saturday, December 16, 2006

 

Record your data

Hadoop supports serializing and deserializing streams of objects that implement the Writable interface. The Writable interface has two methods, one for serializing and one for deserializing data:

void write(DataOutput out) throws IOException;

void readFields(DataInput in) throws IOException;

Hadoop contains ready-made Writable implementations for primitive data types, like IntWritable for persisting integers, BytesWritable for persisting arrays of bytes, Text for persisting strings, and so on. By combining these primitive types it is possible to write more complex persistable objects to satisfy your needs.
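
For example, a small hand-written composite Writable (a hypothetical PageInfo) built from the primitive types looks like this - and it is exactly this kind of boilerplate that the record package discussed below generates for you:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class PageInfo implements Writable {

  private final Text url = new Text();
  private final IntWritable fetchCount = new IntWritable();

  public void write(DataOutput out) throws IOException {
    // Serialize the fields in a fixed order...
    url.write(out);
    fetchCount.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    // ...and read them back in exactly the same order.
    url.readFields(in);
    fetchCount.readFields(in);
  }
}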

There are several examples of more complex Writable implementations in Nutch for you to look at, like CrawlDatum, which holds the crawl state of a resource, and Content, which stores the content of a fetched resource. Many of these objects are there just for persisting data and don't do much more than record object state to a DataOutput or restore object state from a DataInput.

Implementing and maintaining these sometimes complex objects by hand is both an error-prone and time-consuming task. There might even be cases where you need to implement the IO in C++, for example, in parallel with your Java software - your problem just got 100% bigger. This is where the Hadoop record package comes in handy.

The Hadoop record API contains tools to generate the code for your custom Writables from a DDL, so you can focus on more interesting tasks and let the machine do the things it is better at. The DDL syntax is easy and readable, and yet it can be used to generate complex nestable data types with vectors and maps if needed.

Example

Imagine you were writing code for a simple media search and needed to
implement the Writables to persist your data. The DDL in that case could look
something like:


module org.apache.nutch.media.io {

  class InLink {
    ustring FromUrl;
    ustring AnchorText;
  }

  class Media {
    ustring Url;
    buffer Media;
    map<ustring, ustring> Metadata;
  }

  class MediaRef {
    ustring Url;
    ustring Alt;
    ustring AboutUrl;
    ustring Context;
    vector<InLink> InLinks;
  }

}


To generate the Java or C++ code (the currently supported languages) matching this DDL, execute:

bin/rcc <ddl-file>

Running the record compiler generates the needed .java files (InLink.java, Media.java and MediaRef.java) in the defined package, which in our example is org.apache.nutch.media.io. It is very fast to prototype different objects and change things when you notice that something is missing or wrong: just change the DDL and regenerate.

Evolving data

When it comes to evolving file formats - in the case where you already have a lot of data stored, want to continue accessing it, and still want to, for example, add new fields to freshly generated data - there is currently not much support available.

There are plans to add support for defining records in a forward/backward compatible manner in future versions of the record API. It might be something as simple as support for optional fields, or something as fancy as storing the DDL as metadata inside the file that contains the actual data, with automatic runtime construction of the data into some generic container.

One possibility for supporting data evolution is to write converters; those too should be quite easy to manage with the DDL-to-code approach: just change the module name of the "current" DDL to something else. The classes generated from this DDL are used to read the old data. Then create a DDL for the next version of the data and generate new classes - these are used to write data in the new format. The last thing left is to write the converter itself, which can be a very simple mapper that reads the old format and writes the new format.




Sunday, December 10, 2006

 

Compressed and fast

My fellow Nutch developer Andrzej Bialecki of SIGRAM recently upgraded Nutch to contain the latest released version of Hadoop. This is great news for people who have been suffering from the amount of disk space Nutch requires, or from slow IO when running under Linux. Why? I hear you asking.

Because now you can use native compression libraries to compress your Nutch data at the speed of light! Unfortunately for some, this only works out of the box under Linux. What you need to do is the following:

  • Get the precompiled libraries from the Hadoop distribution (they are not yet part of Nutch). The libraries can be found in the folder lib/native.

  • Make them available. I did it with the environment variable LD_LIBRARY_PATH. One can also, and probably should, use the system property -Djava.library.path= to do the same thing.

  • Instruct Hadoop to use block-level compression with a configuration like:


    <property>
      <name>io.seqfile.compression.type</name>
      <value>BLOCK</value>
    </property>

That's all there is to it to start using native compression within Nutch. Everything works in a backwards-compatible way: each sequence file stores metadata about its compression, so when data is read Hadoop automatically knows how it is compressed. So the first time you run a command that handles existing data it is read as it is (usually uncompressed), but as soon as new data files start to be generated, compression kicks in.

To verify that you got all the steps right, you can consult the log files and search for strings like:

DEBUG util.NativeCodeLoader - Trying to load the custom-built native-hadoop library...
INFO util.NativeCodeLoader - Loaded the native-hadoop library

I did a small test with a crawldb containing a little over 20M URLs. The original crawldb had a size of 2737 megabytes, and after activating compression and running it through a merge the size dropped to 359 megabytes - quite a nice drop!

There's always the other side of the coin: how much does compression affect the speed of the code doing the IO? This aspect was tested simply by generating a fetchlist of 1M URLs. With the compressed crawldb the time needed was 47 min 04 s, and for the uncompressed crawldb (where an uncompressed fetchlist was also generated) it was 53 min 28 s. Yes, you read that right - using compression leads to faster operations. So not only does it consume less disk space, it also runs faster!

Hats off to the Hadoop team for this nice improvement!



