Sunday, December 31, 2006


Nutch and hadoop native compression revisited

After running experimental crawls using hadoop native compression I decided to report back some more data for you to consume.

I have been crawling for several cycles before and after enabling compression. What I am comparing here are the segment sizes generated and fetched with the same settings (-topN 1000000), so strictly speaking this is not a test that tells you how compression affects an individual segment, but just a log of segment sizes before and after enabling compression. Segments 0-6 are from before compression was enabled and segments from 7 onwards were processed with compression enabled.

Total space consumption

As seen from the graph, the total savings from compression in segment data are roughly 50%.

Nutch segment data consists of several independent parts. Below you can find a graph for each individual piece showing the effect of enabling compression.

Subfolder: content
Object: Content
Purpose: Store fetched raw content, headers and some additional metadata.

As you can see, there was no significant (if any) gain from compressing the biggest space consumer, the content. This is because it is already compressed: the Content object compresses its own data, so during processing it is compressed twice, once by the object itself and a second time by hadoop. The object-level compression should really be removed from Content; instead one should rely on hadoop for doing the compression.
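The effect of compressing already-compressed data can be seen with a quick java.util.zip sketch. This is just an illustration, not Nutch code; the repeated "page content" and sizes are invented:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class DoubleCompression {

    // Gzip a byte array in memory. IOException is wrapped because
    // ByteArrayOutputStream never actually fails.
    public static byte[] gzip(byte[] data) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            GZIPOutputStream gz = new GZIPOutputStream(bos);
            gz.write(data);
            gz.close();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Fake "page content": highly repetitive, so it compresses well.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) {
            sb.append("<p>some repetitive page content</p>");
        }
        byte[] raw = sb.toString().getBytes();

        byte[] once = gzip(raw);   // big win on the raw text
        byte[] twice = gzip(once); // no win: the input is already high-entropy

        System.out.println("raw: " + raw.length
                + ", gzipped once: " + once.length
                + ", gzipped twice: " + twice.length);
    }
}
```

The second pass typically makes the data slightly larger, since gzip adds its headers but finds nothing left to squeeze - which is exactly why compressing Content again at the hadoop level buys nothing.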

Subfolder: crawl_fetch
Object: CrawlDatum
Purpose: CrawlDatum object used when fetching.

Subfolder: crawl_generate
Object: CrawlDatum
Purpose: CrawlDatum object as selected by Generator.

Subfolder: crawl_parse
Object: CrawlDatum
Purpose: CrawlDatum data extracted while parsing content (hash, outlinks)

Subfolder: parse_data
Object: ParseData
Purpose: Data extracted from a page's content.

Subfolder: parse_text
Object: ParseText
Purpose: The text conversion of a page's content, stored using gzip compression.

The gains are not big here either, because of double compression. Again, the compression should be removed from the object, letting hadoop do its job.

By removing the double compression from the two objects mentioned, the performance of the fetcher should once again increase. I will dig into this sometime in the future (unless someone else gets to it first:)

Oh, and happy new year to all!


Saturday, December 16, 2006


Record your data

Hadoop supports serializing and deserializing streams of objects that implement
the Writable interface. The interface has two methods, one for serializing
and one for deserializing data:

void write(DataOutput out) throws IOException;

void readFields(DataInput in) throws IOException;

Hadoop contains ready-made Writable implementations for primitive data types:
IntWritable for persisting integers, BytesWritable for persisting
arrays of bytes, Text for persisting strings, and so on. By combining
these primitive types it is possible to write more complex persistable objects to
satisfy your needs.
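As a sketch of what such a composite object looks like - using only java.io here so the example stands alone; a real implementation would implement org.apache.hadoop.io.Writable and could reuse Text and IntWritable - consider a hypothetical record holding a url and a score:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class UrlScore {
    private String url;
    private int score;

    public UrlScore() {} // no-arg constructor needed for deserialization

    public UrlScore(String url, int score) {
        this.url = url;
        this.score = score;
    }

    // Mirrors Writable.write(DataOutput): serialize fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeInt(score);
    }

    // Mirrors Writable.readFields(DataInput): read them back in the same order.
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        score = in.readInt();
    }

    public String getUrl() { return url; }
    public int getScore() { return score; }

    // Serialize to a byte array and deserialize into a fresh object.
    public static UrlScore roundTrip(UrlScore original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            UrlScore copy = new UrlScore();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```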

There are several examples of more complex Writable implementations in nutch
for you to look at, like CrawlDatum, which holds the crawl state of a resource,
and Content, which stores the content of a fetched resource. Many of these objects
exist purely to persist data and do little more than record object
state to a DataOutput or restore object state from a DataInput.

Implementing and maintaining these sometimes complex objects by hand is both
error prone and time consuming. There might even be cases where you need to
implement the io in c++, for example, in parallel to your java software - your problem just
got 100% bigger. This is where the Hadoop record package comes in handy.

The Hadoop record api contains tools to generate the code for your custom Writables based
on a DDL, so you can focus on more interesting tasks and let the machine do what
it is better at. The DDL syntax is easy and readable, and yet it can be used to
generate complex nestable data types with arrays and maps if needed.


Imagine you were writing code for a simple media search and needed to
implement the Writables to persist your data. The DDL in that case could look
something like:

module mediasearch {

    class InLink {
        ustring FromUrl;
        ustring AnchorText;
    }

    class Media {
        ustring Url;
        buffer Media;
        map<ustring, ustring> Metadata;
    }

    class MediaRef {
        ustring Url;
        ustring Alt;
        ustring AboutUrl;
        ustring Context;
        vector<InLink> InLinks;
    }
}

To generate the java or c++ code (the currently supported languages) matching this DDL,
you would run the record compiler:

bin/rcc <ddl-file>

Running the record compiler generates the needed .java files, placed in the package defined by the module name of the DDL. It is very fast to prototype different objects and change things when you see that something is missing or wrong - just change the DDL and regenerate.

Evolving data

When it comes to evolving file formats - cases where you already have a lot of data stored,
want to continue accessing it, and still want to, for example, add new fields to freshly generated
data - there is currently not much support available.

There are plans to add support for defining records in forward/backwards compatible manner in future versions of the record api. It might be something as simple as adding support for optional fields or it might be something as fancy as storing the DDL as metadata inside file containing the actual data and automatic run time data construction to some generic container.

One possibility for supporting data evolution is to write converters, and those too should be quite easy to manage with the DDL->code approach: just change the module name of the "current" DDL to something else. Classes generated from this DDL are used to read the old data. Then create a DDL for the next version of the data and generate new classes - these are used to write data in the new format. The last thing left is to write the converter itself, which can be a very simple mapper that reads the old format and writes the new format.
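A minimal sketch of such a converter, with invented field names and only java.io (the real classes would be generated from the two DDLs and the conversion would run as a hadoop mapper):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class LinkConverter {

    // Record in the old format: url + anchor text.
    public static class OldLink {
        String url, anchor;

        public OldLink() {}
        public OldLink(String url, String anchor) {
            this.url = url;
            this.anchor = anchor;
        }
        public void readFields(DataInput in) throws IOException {
            url = in.readUTF();
            anchor = in.readUTF();
        }
        public void write(DataOutput out) throws IOException {
            out.writeUTF(url);
            out.writeUTF(anchor);
        }
    }

    // Record in the new format: a score field has been added.
    public static class NewLink {
        String url, anchor;
        float score;

        public void write(DataOutput out) throws IOException {
            out.writeUTF(url);
            out.writeUTF(anchor);
            out.writeFloat(score);
        }
        public void readFields(DataInput in) throws IOException {
            url = in.readUTF();
            anchor = in.readUTF();
            score = in.readFloat();
        }
    }

    // The "very simple mapper": one old record in, one new record out,
    // with a default value for the field the old data lacks.
    public static NewLink convert(OldLink old) {
        NewLink n = new NewLink();
        n.url = old.url;
        n.anchor = old.anchor;
        n.score = 0.0f;
        return n;
    }
}
```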


Sunday, December 10, 2006


Compressed and fast

My fellow Nutch developer Andrzej Bialecki of SIGRAM recently upgraded Nutch to contain the latest released version of Hadoop. This is great news for people who have been suffering from the amount of disk resources nutch requires, or from slow io when running under Linux. Why? I hear you asking.

Because now you can use native compression libraries to compress your nutch data at the speed of light! Unfortunately for some, this only works out-of-the-box under linux. What you need to do is the following:

  • Get the precompiled libraries from the hadoop distribution. Yes, they are not yet part of nutch. The libraries can be found in the folder lib/native.

  • Make them available. I did it with the environment variable LD_LIBRARY_PATH. One can also, and most probably should, use the system property -Djava.library.path= to do the same thing.

  • Instruct hadoop to use block level compression with a configuration like:

        <property>
          <name>io.seqfile.compression.type</name>
          <value>BLOCK</value>
        </property>
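The "make them available" step might look like this in a shell session. The paths below are assumptions for illustration - adjust them to your install and platform:

```shell
# Point the dynamic linker at the native hadoop libraries copied
# from the hadoop distribution (directory is an example, not a given).
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/opt/nutch/lib/native/Linux-i386-32"

# Alternatively, pass the same directory to the JVM as a system property:
# -Djava.library.path=/opt/nutch/lib/native/Linux-i386-32

echo "$LD_LIBRARY_PATH"
```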

That's all there is to it. Everything works in a backwards-compatible way: each sequence file stores metadata about its compression, so when data is read, hadoop automatically knows how it is compressed. The first time you run a command that handles data, it is read as it is (usually uncompressed), but as new data files start to be generated, compression kicks in.

To verify that you got all the steps right, you can consult the log files and search for strings like:

    DEBUG util.NativeCodeLoader - Trying to load the custom-built native-hadoop library...
    INFO util.NativeCodeLoader - Loaded the native-hadoop library

I did a small test with a crawldb containing a little over 20M urls. The original crawldb had a size of 2737 megabytes; after activating compression and running it through a merge, the size dropped to 359 megabytes - quite a nice drop!

There is always the other side of the coin: how much does compression affect the speed of the code doing the io? This aspect was simply tested by generating a fetchlist of 1 M urls. With the compressed crawldb the time needed was 47 m 04 s, and for the uncompressed crawldb (where an uncompressed fetchlist was also generated) it was 53 m 28 s. Yes, you read that right - using compression leads to faster operations. So not only does it consume less disk space, it also runs faster!

Hats off to the hadoop team for this nice improvement!
