Saturday, December 16, 2006

 

Record your data

Hadoop supports serializing and deserializing streams of objects that implement
the Writable interface. The Writable interface has two methods, one for serializing
and one for deserializing data:

void write(DataOutput out) throws IOException;

void readFields(DataInput in) throws IOException;

Hadoop contains ready-made Writable implementations for primitive data types, like
IntWritable for persisting integers, BytesWritable for persisting
byte arrays, Text for persisting strings, and so on. By combining
these primitive types it is possible to write more complex persistable objects to
satisfy your needs.
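
For example, a minimal hand-written Writable combining two of these primitives
could look roughly like this (the UrlScore name and its fields are made up for
illustration, not taken from Nutch or Hadoop):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class UrlScore implements Writable {

    private Text url = new Text();
    private IntWritable score = new IntWritable();

    // serialize the fields in a fixed order
    public void write(DataOutput out) throws IOException {
        url.write(out);
        score.write(out);
    }

    // restore the fields in the same order they were written
    public void readFields(DataInput in) throws IOException {
        url.readFields(in);
        score.readFields(in);
    }
}

Even this tiny class means keeping write() and readFields() in sync by hand forever,
which is exactly the kind of bookkeeping the record compiler described below takes
off your hands.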

There are several examples of more complex Writable implementations in Nutch
for you to look at, like CrawlDatum, which holds the crawl state of a resource,
or Content, which stores the content of a fetched resource. Many of these objects
are there just for persisting data and don't do much more than record object
state to a DataOutput or restore object state from a DataInput.

Implementing and maintaining these sometimes complex objects by hand is both an
error-prone and time-consuming task. There might even be cases when you need to
implement the I/O in C++, for example, in parallel with your Java software - your problem just
got 100% bigger. This is where the Hadoop record package comes in handy.

The Hadoop record API contains tools to generate code for your custom Writables based
on your DDL, so you can focus on more interesting tasks and let the machine do what
it does better. The DDL syntax is easy and readable, and yet it can be used to
generate complex nestable data types with arrays and maps if needed.

Example

Imagine you were writing code for a simple media search and needed to
implement the Writables to persist your data. The DDL in that case could look
something like:


module org.apache.nutch.media.io {

    class InLink {
        ustring FromUrl;
        ustring AnchorText;
    }

    class Media {
        ustring Url;
        buffer Media;
        map<ustring, ustring> Metadata;
    }

    class MediaRef {
        ustring Url;
        ustring Alt;
        ustring AboutUrl;
        ustring Context;
        vector<InLink> InLinks;
    }

}


To generate the Java or C++ code (the currently supported languages) matching
this DDL, you would execute:

bin/rcc <ddl-file>

Running the record compiler generates the needed .java files (InLink.java, Media.java and MediaRef.java) into the defined package, which in the case of our example is org.apache.nutch.media.io. This makes it very fast to prototype different objects and change things when you see that something is missing or wrong: just change the DDL and regenerate.
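
A rough usage sketch of the generated classes, assuming the compiler produces
simple getters and setters named after the DDL fields (treat the exact accessor
names as an assumption and check the generated sources):

import java.io.DataOutputStream;
import java.io.FileOutputStream;

import org.apache.nutch.media.io.MediaRef;

public class WriteMediaRef {
    public static void main(String[] args) throws Exception {
        MediaRef ref = new MediaRef();
        // setters assumed to be derived from the DDL field names
        ref.setUrl("http://example.com/photo.jpg");
        ref.setAlt("a photo");

        // the generated record is a Writable, so it can serialize itself
        DataOutputStream out =
            new DataOutputStream(new FileOutputStream("mediaref.dat"));
        ref.write(out);
        out.close();
    }
}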

Evolving data

When it comes to evolving file formats - where you already have a lot of data stored,
want to continue accessing it, and still want to, for example, add new fields to freshly
generated data - there is currently not much support available.

There are plans to add support for defining records in a forward/backward compatible manner in future versions of the record API. It might be something as simple as adding support for optional fields, or something as fancy as storing the DDL as metadata inside the file containing the actual data and constructing the data automatically at run time into some generic container.

One possibility to support data evolution is to write converters; those too should be quite easy to manage with the DDL->code approach. Just change the module name of the "current" DDL to something else - classes generated from this DDL are used to read the old data. Then create a DDL for the next version of the data and generate new classes - these are used to write data in the new format. The last thing left is to write the converter itself, which can be a very simple mapper that reads the old format and writes the new format.
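
A minimal sketch of such a converter, written as a plain stand-alone program rather
than a mapper to keep it short. It assumes the old classes were regenerated under a
renamed module (here org.apache.nutch.media.io.old) and that both versions expose
matching getters and setters for the shared fields - all of these names are
assumptions for illustration:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;

public class ConvertMediaRefs {
    public static void main(String[] args) throws Exception {
        // args[0]: file of back-to-back old-format records, args[1]: output file
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        DataOutputStream out = new DataOutputStream(new FileOutputStream(args[1]));
        try {
            while (true) {
                // read a record written with the old (renamed) classes
                org.apache.nutch.media.io.old.MediaRef oldRef =
                    new org.apache.nutch.media.io.old.MediaRef();
                oldRef.readFields(in);

                // copy the shared fields into the new format;
                // any newly added fields keep their default values
                org.apache.nutch.media.io.MediaRef newRef =
                    new org.apache.nutch.media.io.MediaRef();
                newRef.setUrl(oldRef.getUrl());
                newRef.setAlt(oldRef.getAlt());

                newRef.write(out);
            }
        } catch (EOFException endOfOldData) {
            // ran out of old records
        } finally {
            in.close();
            out.close();
        }
    }
}

In practice the records would more likely live in a SequenceFile and the copy loop
would sit inside a map task, but the field-by-field copy stays the same.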




Comments



Sami, great post. Now, is scale the main reason why one would use Hadoop's record management framework over, say, storing serialized blobs in a database? I imagine Hadoop scales much more gracefully if you're dealing with millions of records, but would it be overkill for smaller datasets?
# posted by Blogger Andy : December 19, 2006 at 6:17 PM  



For me the main motivation for using Hadoop records is actually Hadoop itself (as all data flowing in Hadoop must implement the Writable interface) and ease of use (no manual writing of lengthy and error-prone I/O code).

IMO it does equally well on smaller datasets too (in Hadoop, Writables are actually used in RPC communication as well), and then you always have the possibility to scale if so needed :)

There are a lot of different persistence frameworks out there; some of them are better for one purpose and others are better for something else.
# posted by Blogger Sami Siren : February 16, 2007 at 7:21 PM  
