Wednesday, June 10, 2009


Yahoo! announces their own Distribution of Hadoop

Yahoo! has announced the availability of the Yahoo! Distribution of Hadoop.

Each Yahoo! Distribution of Hadoop release goes through two days of exhaustive testing on Yahoo!'s 500-node test cluster. Yahoo! also promises that all improvements and patches will be released under the Apache license, either through the Apache Jira or directly into the Apache source code repository.

Yahoo! is not offering support for its releases, but it expects companies that specialize in Hadoop to take care of that part.


Thursday, May 14, 2009


Yahoo calculates digits of π (a lot) faster than before

Yahoo! recently reported that over roughly a weekend they calculated more bits of π than an earlier project, PiHex, did over a span of two years. The record-setting software ran on a Hadoop cluster and was implemented in Java. Check the details in the original post.


Thursday, April 2, 2009


Amazon offers MapReduce as a service

Amazon announced today a new service called Elastic MapReduce. It promises to ease the configuration and management of Hadoop clusters. The pricing model is simple: add $0.015 - $0.12 per instance hour (depending on the size of the instances you use) to your bill.

Converted to monthly payments, that makes $10.95 - $87.60 per machine. When you calculate the total cost of running one instance (the extra large instance), the cost is $671.60 per month, and that does not yet include network or storage costs. Just for comparison: you can buy a single machine with similar specs from Finland for about $1300. Do you also think that Amazon should check their pricing in general?
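The arithmetic above is easy to reproduce. Here is a minimal sketch; the 730-hour month and the $0.80/hour extra-large EC2 base rate are my assumptions for illustration, not figures from this announcement:

```java
// Sketch of the cost arithmetic above. The 730-hour month and the
// $0.80/hour extra-large EC2 base rate are assumptions, not Amazon quotes.
public class EmrCost {
    static final double HOURS_PER_MONTH = 730.0;

    static double monthly(double perHour) {
        return perHour * HOURS_PER_MONTH;
    }

    public static void main(String[] args) {
        System.out.printf("small surcharge:  $%.2f%n", monthly(0.015));       // $10.95
        System.out.printf("xlarge surcharge: $%.2f%n", monthly(0.12));        // $87.60
        // total for one extra large instance: EC2 base rate + MapReduce surcharge
        System.out.printf("xlarge total:     $%.2f%n", monthly(0.80 + 0.12)); // $671.60
    }
}
```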


Wednesday, March 18, 2009


Hadoop the easy edition

Cloudera has put together a nice-looking configurator for Apache Hadoop (see video).

They also offer a yum repository to install an RPM-packaged version of Hadoop, manageable as a standard Linux service, together with local documentation and man pages.

All of this is of course available under the commercially friendly Apache License.


Thursday, July 3, 2008


Hadoop takes the lead position

Owen O'Malley, the project lead of Apache Hadoop and a member of the Yahoo! grid team, announced today that they have taken the number one position in the Terabyte Sort Benchmark.

The new record of 209 seconds is nearly 30% faster than the previous one. More details here and here.

Congratulations! Nice work, once again.


Tuesday, November 6, 2007


Javascript prototyping for Hadoop map/reduce jobs

Java 6 brought us JVM-level scripting support as modelled in JSR-223. (Sun documentation about the subject)

Wordcount Demo

I have chosen the word count example (as seen in Hadoop) to demonstrate a complete JavaScript map/reduce example (a mapper, a reducer and a "driver"):


// Assumes Rhino-style scripting where Java classes are visible to the script.
importPackage(java.util);

function map(key, value, collector) {
  var tokenizer = new StringTokenizer(value.toString());
  while (tokenizer.hasMoreTokens()) {
    collector.collect(tokenizer.nextToken(), 1);
  }
}

function reduce(key, values, collector) {
  var counter = 0;
  while (values.hasNext()) {
    counter += values.next();
  }
  collector.collect(key, counter);
}

function driver(args, jobConf, jobClient) {
  // configure input/output paths from args and submit the job
}

System requirements

Running JavaScript map/reduce

  • Install Java as documented by Sun

  • Extract the Hadoop release tarball into your favourite location

  • Run your JavaScript map/reduce job with the following command: bin/hadoop jar </path/to/scriptmr.jar> <script-file-name> [<script-arg-1>...]


Performance

In short: there isn't any. The word count demo as presented here takes roughly 6-7 times longer than the word count Java example from Hadoop. It's quite obvious this technique is not really practical beyond prototyping. For prototyping... hmm, I am not totally sure it is usable for that either. Anyway, have fun with it, I know I had some while making it ;)


Thursday, June 14, 2007


Adding support for tight Enum and EnumSet into a hadoop application

After working with many projects that use a combination of int and String constants in various ways to mimic enumerations, I have learned to like Java enums. I am also a fan of Hadoop, so I decided to see how the two fit together.

The WritableUtils class already has support for writing and reading Java enum values; in WritableUtils an enum is serialized by converting its value to Text. There is also another way of storing enum state which requires less space: the ordinal() method gives you the index of a field in the enum, so an enum containing up to 255 different fields can be stored in a single byte. EnumWritable perhaps does not warrant its own class; utility methods capable of reading and writing an enum to DataInput/DataOutput would be sufficient.


public class EnumWritable<E extends Enum<E>> implements Writable {

    private byte storage;

    public EnumWritable() {
        storage = 0;
    }

    public EnumWritable(Enum<E> value) {
        set(value);
    }

    public E get(Class<E> enumType) throws IOException {
        for (E type : enumType.getEnumConstants()) {
            // compare as unsigned so ordinals up to 255 work
            if ((storage & 0xff) == type.ordinal()) {
                return type;
            }
        }
        return null;
    }

    public void set(Enum<E> e) {
        storage = (byte) e.ordinal();
    }

    public void readFields(DataInput in) throws IOException {
        storage = in.readByte();
    }

    public void write(DataOutput out) throws IOException {
        out.writeByte(storage);
    }

    public String toString() {
        return Integer.toString(storage & 0xff);
    }

    public boolean equals(Object obj) {
        if (!(obj instanceof EnumWritable)) {
            return super.equals(obj);
        }
        EnumWritable that = (EnumWritable) obj;
        return == that.storage;
    }

    public int hashCode() {
        return storage;
    }
}

Another useful structure for storing the equivalent of multiple boolean values, or a piece of data that consists of bits (which can be modeled as fields in an enum), is EnumSet. Storage for an EnumSet is also compact if you take advantage of the ordinal() method: you can store an EnumSet with 1 bit of storage space per field.

I wrote a prototype EnumSetWritable that takes from 1 to 4 bytes of storage depending on how many fields the enum contains and which enums are in the set. The 2 least significant bits store the size of the storage (1-4 bytes) and the rest of the bits store the enums in the EnumSet. As the space for storing the EnumSet varies from 1 to 4 bytes, it can store EnumSets of enums with up to 30 fields. Ordering the fields so that those most often present in the EnumSet come first will most probably lead to smaller space consumption.
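The packing scheme described above can be sketched like this. Note this is my illustration of the idea, not the actual prototype code, and the class and method names are made up:

```java
import java.util.EnumSet;

// Sketch of the packing scheme described above; names are illustrative.
public class EnumSetPacking {

    enum Flags { RAW, PARSED, PROCESSED }

    // Pack an EnumSet into an int: the 2 low bits encode the storage size
    // in bytes minus one, the remaining bits are a bitmask of ordinals.
    static <E extends Enum<E>> int pack(EnumSet<E> set) {
        int bits = 0;
        for (E e : set) {
            bits |= 1 << e.ordinal();    // supports ordinals 0..29
        }
        int sizeCode;                    // 0..3 -> 1..4 bytes of storage
        if (bits < (1 << 6)) sizeCode = 0;
        else if (bits < (1 << 14)) sizeCode = 1;
        else if (bits < (1 << 22)) sizeCode = 2;
        else sizeCode = 3;
        return (bits << 2) | sizeCode;
    }

    static <E extends Enum<E>> EnumSet<E> unpack(int packed, Class<E> type) {
        int bits = packed >>> 2;         // drop the 2-bit size code
        EnumSet<E> set = EnumSet.noneOf(type);
        for (E e : type.getEnumConstants()) {
            if ((bits & (1 << e.ordinal())) != 0) {
                set.add(e);
            }
        }
        return set;
    }
}
```

When serializing, only sizeCode + 1 bytes of the packed value would actually be written, which is where the 1-4 byte space saving comes from.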

A real-world application for EnumSet could be, for example, something like:

enum ContainsData {
    Raw, Parsed, Processed;
}

EnumSetWritable<ContainsData> contains = new EnumSetWritable<ContainsData>();

public void readFields(DataInput in) throws IOException {
    for (ContainsData type : contains.getEnumSet(ContainsData.class)) {
        switch (type) {
            case Raw:       /* read raw data */       break;
            case Parsed:    /* read parsed data */    break;
            case Processed: /* read processed data */ break;
        }
    }
}
Here you would use the EnumSet to store the presence information of various data structures in a Writable, and so avoid, for example, storing booleans (taking one byte of space each) for the same piece of information.


Sunday, December 31, 2006


Nutch and hadoop native compression revisited

After running experimental crawls using Hadoop native compression, I decided to report back some more data for you to consume.

I have been crawling several cycles before and after enabling compression. What I am comparing here are segment sizes generated and fetched with the same settings (-topN 1000000), so strictly speaking this is not a test that will tell you how compression affects individual segments, but just a log of segment sizes before and after enabling compression. Segments 0-6 are from before enabling compression and segments 7 onward were processed with compression enabled.

Total space consumption

As seen from the graph, the total savings from compression in segment data are roughly 50%.

Nutch segment data consists of several independent parts. Below you can find a graph for each individual piece showing the effect of enabling compression.

Subfolder: content
Object: Content
Purpose: Store fetched raw content, headers and some additional metadata.

As you can see, there was no significant (if any) gain from compression for the biggest space consumer, the content. This is because it is already compressed, which actually means that during processing it is compressed twice: once by the object itself and a second time by Hadoop. The object-level compression should really be removed from Content; instead one should rely on Hadoop to do the compression.

Subfolder: crawl_fetch
Object: CrawlDatum
Purpose: CrawlDatum object used when fetching.

Subfolder: crawl_generate
Object: CrawlDatum
Purpose: CrawlDatum object as selected by Generator.

Subfolder: crawl_parse
Object: CrawlDatum
Purpose: CrawlDatum data extracted while parsing content (hash, outlinks)

Subfolder: parse_data
Object: ParseData
Purpose: Data extracted from a page's content.

Subfolder: parse_text
Object: ParseText
Purpose: The text conversion of the page's content, stored using gzip compression.

The gains are not big here because of the double compression. Again, the compression should be removed from the object, letting Hadoop do its job.

By removing the double compression from the two objects mentioned, the performance of the fetcher should once again increase. I will dig into this sometime in the future (unless someone else gets to it first :)

Oh, and happy new year to all!


Saturday, December 16, 2006


Record your data

Hadoop supports serializing/deserializing a stream of objects that implement
the Writable interface. The Writable interface has two methods, one for
serializing and one for deserializing data.

void write(DataOutput out) throws IOException;

void readFields(DataInput in) throws IOException;

Hadoop contains ready-made Writable implementations for primitive data types,
like IntWritable for persisting integer data, BytesWritable for persisting
arrays of bytes, Text for persisting string data, and so on. By combining
these primitive types it is possible to write more complex persistable objects
to satisfy your needs.
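As a sketch of what such a combination looks like, here is a hand-rolled record with the same write/readFields shape as a Writable. It uses plain so it runs without Hadoop on the classpath; a real version would declare "implements Writable" and could delegate to IntWritable and Text, and the class and field names here are made up:

```java

// Illustrative only: the same write/readFields shape as Hadoop's Writable,
// implemented with plain so it runs without Hadoop on the classpath.
public class UrlScoreRecord {
    public String url = "";
    public float score;
    public int fetchCount;

    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);      // Hadoop's Text uses a VInt length + UTF-8 instead
        out.writeFloat(score);
        out.writeInt(fetchCount);
    }

    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();     // fields must be read back in write order
        score = in.readFloat();
        fetchCount = in.readInt();
    }
}
```

The one rule worth remembering is visible in the comments: readFields must consume exactly the bytes that write produced, in the same order.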

There are several examples of more complex Writable implementations in Nutch
for you to look at, like CrawlDatum, which holds the crawl state of a resource,
and Content, for storing the content of a fetched resource. Many of these
objects exist just to persist data and don't do much more than record object
state to a DataOutput or restore object state from a DataInput.

Implementing and maintaining these sometimes complex objects by hand is both an
error-prone and time-consuming task. There might even be cases where you need to
implement the io in C++, for example, in parallel with your Java software; your
problem just got 100% bigger. This is where the Hadoop record package comes in handy.

The Hadoop record API contains tools to generate the code for your custom
Writables based on your DDL, so you can focus on more interesting tasks and let
the machine do the things it is better at. The DDL syntax is easy and readable,
and yet it can be used to generate complex nestable data types with arrays and
maps if needed.


Imagine you were writing code for a simple media search and needed to
implement the Writables to persist your data. The DDL in that case could look
something like:

module {

    class InLink {
        ustring FromUrl;
        ustring AnchorText;

    class Media {
        ustring Url;
        buffer Media;
        map<ustring, ustring> Metadata;

    class MediaRef {
        ustring Url;
        ustring Alt;
        ustring AboutUrl;
        ustring Context;
        vector<InLink> InLinks;

To generate the Java or C++ code (the currently supported languages) matching
this DDL, you would execute:

bin/rcc <ddl-file>

Running the record compiler generates the needed .java files into the package defined in the DDL. It is very fast to prototype different objects and change things when you see that something is missing or wrong: just change the DDL and regenerate.

Evolving data

When it comes to evolving file formats (cases where you already have a lot of
data stored, want to continue accessing it, and still, for example, want to add
new fields to freshly generated data), there is currently not much support available.

There are plans to add support for defining records in a forward/backward compatible manner in future versions of the record API. It might be something as simple as support for optional fields, or something as fancy as storing the DDL as metadata inside the file containing the actual data, with automatic run-time construction of the data into some generic container.

One possibility for supporting data evolution is to write converters; those too should be quite easy to manage with the DDL-to-code approach. Just change the module name of the "current" DDL to something else: classes generated from this DDL are used to read the old data. Then create the DDL for the next version of the data and generate new classes; these are used to write data in the new format. The last thing left is to write the converter, which can be a very simple mapper that reads the old format and writes the new format.
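In outline, the converter idea might look like this. These are hand-written stand-ins for the classes rcc would generate, and every name here is made up for illustration:

```java

// Outline of the converter idea above, with hand-written stand-ins for the
// classes rcc would generate. All names here are made up for illustration.
class OldDoc {                       // generated from the renamed "old" DDL
    String title = "";
    void readFields(DataInput in) throws IOException { title = in.readUTF(); }
}

class NewDoc {                       // generated from the next-version DDL
    String title = "";
    String lang = "";                // the newly added field
    void write(DataOutput out) throws IOException {
        out.writeUTF(title);
        out.writeUTF(lang);
    }
}

class DocConverter {
    // The "very simple mapper": read the old format, emit the new one.
    static NewDoc convert(OldDoc old) {
        NewDoc doc = new NewDoc();
        doc.title = old.title;
        doc.lang = "unknown";        // default for data that predates the field
        return doc;
    }
}
```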


Sunday, December 10, 2006


Compressed and fast

My fellow Nutch developer Andrzej Bialecki of SIGRAM recently upgraded Nutch to contain the latest released version of Hadoop. This is great news for people who have been suffering from the amount of disk space Nutch requires, or from slow IO when running under Linux. Why? I hear you asking.

Because now you can use native compression libraries to compress your Nutch data at the speed of light! Unfortunately for some, this only works out-of-the-box under Linux. What you need to do is the following:

  • Get the precompiled libraries from the Hadoop distribution. Yes, they are not yet part of Nutch. The libraries can be found in the folder lib/native.

  • Make them available. I did it with the environment variable LD_LIBRARY_PATH. One can also, and probably should, use the system property -Djava.library.path= to do the same thing.

  • Instruct Hadoop to use block-level compression with a configuration like:

    <property>
      <name>io.seqfile.compression.type</name>
      <value>BLOCK</value>
    </property>

That's all there is to starting to use native compression within Nutch. Everything works in a backwards-compatible way: each sequence file stores metadata about its compression, so when data is read, Hadoop automatically knows how it is compressed. The first time you run a command that handles data, it is read as it is (usually uncompressed), but as new data files start to be generated, compression kicks in.

To verify that you got all the steps right, you can consult the log files and search for strings like:

DEBUG util.NativeCodeLoader - Trying to load the custom-built native-hadoop library...
INFO util.NativeCodeLoader - Loaded the native-hadoop library

I did a small test with a crawldb containing a little over 20M urls. The original crawldb had a size of 2737 megabytes; after activating compression and running it through a merge, the size dropped to 359 megabytes. Quite a nice drop!

There's always the other side of the coin, though: how much will compression affect the speed of the IO code? This aspect was simply tested by generating a fetchlist of size 1M. With the compressed crawldb the time needed was 47 m 04 s, and for the uncompressed crawldb (where an uncompressed fetchlist was also generated) it was 53 m 28 s. Yes, you read that right: using compression leads to faster operations. So not only does it consume less disk space, it also runs faster!

Hats off to the Hadoop team for this nice improvement!
