Sunday, December 10, 2006

 

Compressed and fast

My fellow Nutch developer Andrzej Bialecki of SIGRAM recently upgraded Nutch to the latest released version of Hadoop. This is great news for anyone who has been suffering from the amount of disk space Nutch requires, or from slow I/O under Linux. Why, I hear you asking?

Because you can now use native compression libraries to compress your Nutch data at the speed of light! Unfortunately for some, this only works out of the box under Linux. Here is what you need to do:

  • Get the precompiled libraries from the Hadoop distribution. Yes, they are not yet part of Nutch. The libraries are in the folder lib/native.

  • Make them available. I did it with the environment variable LD_LIBRARY_PATH. One can also, and most probably should, use the system property -Djava.library.path= to do the same thing.

  • Instruct Hadoop to use block-level compression with configuration like:


    <property>
      <name>io.seqfile.compression.type</name>
      <value>BLOCK</value>
    </property>
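
Before running anything, the first two steps can be sanity-checked with a few lines of Python. This is only a sketch under assumptions: the helper name find_native_libs is made up for illustration, and the file name libhadoop.so is my assumption about what the native library is called inside lib/native, so adjust it to whatever your Hadoop distribution actually ships.

```python
import os
from pathlib import Path


def find_native_libs(search_path, lib_name="libhadoop.so"):
    """Return the directories on search_path that contain lib_name.

    search_path is a string in the LD_LIBRARY_PATH format
    (directories joined by os.pathsep). lib_name is an assumed
    file name, not something stated in the post.
    """
    hits = []
    for entry in search_path.split(os.pathsep):
        # Skip empty entries such as those produced by "a::b".
        if entry and (Path(entry) / lib_name).is_file():
            hits.append(entry)
    return hits


if __name__ == "__main__":
    found = find_native_libs(os.environ.get("LD_LIBRARY_PATH", ""))
    if found:
        print("native library found in:", ", ".join(found))
    else:
        print("native library not found on LD_LIBRARY_PATH")
```

An empty result only tells you the environment variable does not point at the library; if you use -Djava.library.path= instead, pass that value to the function.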

    That's all it takes to start using native compression within Nutch. Everything works in a backwards-compatible way: each sequence file stores metadata about its compression, so when data is read Hadoop automatically knows how it is compressed. The first time you run a command that handles data, the existing data is read as it is (usually uncompressed), and compression kicks in as soon as new data files are generated.

    To verify that you got all the steps right, you can consult the log files and search for lines like

    DEBUG util.NativeCodeLoader - Trying to load the custom-built native-hadoop library...
    INFO util.NativeCodeLoader - Loaded the native-hadoop library
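
If you would rather not read the logs by eye, the check can be scripted. A minimal sketch, assuming the log is a plain text file: the function name is hypothetical, the log file location is not stated here so it is taken from the command line, and only the marker string comes from the INFO line above.

```python
import sys
from pathlib import Path

# Marker copied from the INFO log line quoted above.
LOADED_MARKER = "Loaded the native-hadoop library"


def native_library_loaded(log_text):
    """Return True if any line of log_text reports a successful load."""
    return any(LOADED_MARKER in line for line in log_text.splitlines())


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log path is supplied by the caller.
    text = Path(sys.argv[1]).read_text(errors="replace")
    print("native library loaded:", native_library_loaded(text))
```

Seeing only the DEBUG "Trying to load" line without the INFO line would count as not loaded in this sketch.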

    I did a small test with a crawldb containing a little over 20M URLs. The original crawldb was 2737 megabytes; after activating compression and running it through merge, the size dropped to 359 megabytes. Quite a nice drop!
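
To put that drop in relative terms, here is the arithmetic as a small Python sketch. The function name is just for illustration; the two sizes are the ones measured above.

```python
def size_reduction(original_mb, compressed_mb):
    """Return (compression ratio, fraction of space saved)."""
    ratio = original_mb / compressed_mb
    saved = 1 - compressed_mb / original_mb
    return ratio, saved


ratio, saved = size_reduction(2737, 359)
print(f"{ratio:.1f}x smaller, {saved:.1%} saved")
# prints "7.6x smaller, 86.9% saved"
```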

    There is always the other side of the coin: how much does compression affect the speed of I/O-bound code? I tested this simply by generating a fetchlist of size 1M. With the compressed crawldb it took 47 m 04 s, and with the uncompressed crawldb (where an uncompressed fetchlist was also generated) it took 53 m 28 s. Yes, you read that right: using compression leads to faster operations. So not only does it consume less disk space, it also runs faster!

    Hats off to the Hadoop team for this nice improvement!
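
The same kind of arithmetic works for the timings. A small sketch, with a hypothetical helper name and the two measured durations from the test above:

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to total seconds."""
    return minutes * 60 + seconds


compressed = to_seconds(47, 4)     # 2824 s
uncompressed = to_seconds(53, 28)  # 3208 s

time_saved = uncompressed - compressed
print(f"{time_saved} s saved, {time_saved / uncompressed:.1%} less time")
# prints "384 s saved, 12.0% less time"
```

This is a single run on a single machine, so treat the percentage as an indication rather than a benchmark.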




    Comments



    # Get Precompiled libraries from hadoop distribution. Yes, they are not yet part of nutch. Libraries can be found from folder lib/native.

    -- Where do I get these libs? Hadoop trunk? Am I missing something? I don't see them in Hadoop trunk.

    http://svn.apache.org/repos/asf/lucene/hadoop/trunk/lib/

    Please provide a link. Thanks again for your help.
    # posted by Blogger Nutch : December 18, 2006 8:24 AM  



    The lib/ folder is inside the Hadoop distribution. To get the distribution, just go to
    http://www.apache.org/dyn/closer.cgi/lucene/hadoop/, pick a mirror and download it (0.9.2 is currently the latest).
    # posted by Blogger Sami Siren : December 18, 2006 3:08 PM  



    Hello Sami,
    Many thanks for your effective advice.

    I have downloaded the native library and put it on my library search path using the environment variable.

    Then I instructed Hadoop to compress files using the mentioned property. I put this property in the hadoop-site.xml file.

    But when I ran Nutch and the crawl finished, I went straight to the log files and found no lines about loading the native library.

    Could you help me with that, please?

    regards
    waseem
    # posted by Blogger waseemsadeh : December 25, 2006 4:20 PM  



    FYI, the native library path wasn't correctly passed to child processes in Hadoop 0.9 releases. This should be fixed in Hadoop 0.10.1.

    https://issues.apache.org/jira/browse/HADOOP-838
    https://issues.apache.org/jira/browse/HADOOP-871
    https://issues.apache.org/jira/browse/HADOOP-873
    # posted by Blogger Doug Cutting : January 10, 2007 8:00 PM  
