Sunday, December 31, 2006


Nutch and hadoop native compression revisited

After running experimental crawls using hadoop native compression I decided to report back some more data for you to consume.

I have been crawling several cycles before and after enabling compression. What I am comparing here are the sizes of segments generated and fetched with the same settings (-topN 1000000), so strictly speaking this is not a test that tells you how compression affects individual segments, but just a log of segment sizes before and after enabling compression. Segments 0-6 are from before compression was enabled; segments 7 onward were processed with compression enabled.
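For reference, enabling SequenceFile compression was a configuration change. A minimal sketch of the relevant hadoop-site.xml entry follows; the property name is from the Hadoop 0.x era, so verify it against your version's hadoop-default.xml before relying on it:

```xml
<!-- hadoop-site.xml: enable block compression for SequenceFile data.
     Property name assumed from Hadoop 0.x; check hadoop-default.xml. -->
<property>
  <name>io.seqfile.compression.type</name>
  <value>BLOCK</value> <!-- NONE | RECORD | BLOCK -->
</property>
```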

Total space consumption

As seen from the graph, the total savings from compression in segment data are roughly 50%.

Nutch segment data consists of several different independent parts. Below you can find graph for each individual piece and see the effect of enabling compression.

Subfolder: content
Object: Content
Purpose: Store fetched raw content, headers and some additional metadata.

As you can see, there was no significant (if any) gain from compression for the biggest space consumer, the content. This is because it is already compressed. It actually means that during processing the data is compressed twice: once by the object itself and a second time by hadoop. The object-level compression should really be removed from Content; instead one should rely on hadoop to do the compression.
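Why doesn't the second pass help? Compressed output looks statistically random, so a second compressor finds nothing to squeeze and only burns CPU. A small self-contained demonstration with java.util.zip (illustrative only, not Nutch code):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;

public class DoubleCompression {
    // Gzip a byte array in memory and return the compressed bytes.
    static byte[] gzip(byte[] data) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Highly redundant "page content" compresses very well once...
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) sb.append("<p>hello nutch</p>\n");
        byte[] raw = sb.toString().getBytes("UTF-8");

        byte[] once = gzip(raw);   // big win on the raw text
        byte[] twice = gzip(once); // no further win: gzip output looks random

        System.out.println("raw=" + raw.length
            + " once=" + once.length + " twice=" + twice.length);
    }
}
```

The second pass typically makes the data slightly *larger* (gzip adds a header and trailer), which is exactly the waste paid per fetched document when both the object and hadoop compress.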

Subfolder: crawl_fetch
Object: CrawlDatum
Purpose: CrawlDatum object used when fetching.

Subfolder: crawl_generate
Object: CrawlDatum
Purpose: CrawlDatum object as selected by Generator.

Subfolder: crawl_parse
Object: CrawlDatum
Purpose: CrawlDatum data extracted while parsing content (hash, outlinks)

Subfolder: parse_data
Object: ParseData
Purpose: Data extracted from a page's content.

Subfolder: parse_text
Object: ParseText
Purpose: The text conversion of the page's content, stored using gzip compression.

The gains are not big here because of double compression. Again, the compression should be removed from the object to let hadoop do its job.

By removing the double compression from the two objects mentioned, the performance of the fetcher should once again increase. I will dig into this sometime in the future (unless someone else gets to it first :)
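The fetcher speedup would come from skipping one full deflate pass per document, and deflate work is what you trade for ratio. The tradeoff can be sketched with java.util.zip.Deflater's compression levels (illustrative only, not Nutch code; the input text is made up):

```java
import java.util.zip.Deflater;

public class LevelTradeoff {
    // Deflate data at the given level and return the compressed size.
    static int compressedSize(byte[] data, int level) {
        Deflater d = new Deflater(level);
        d.setInput(data);
        d.finish();
        byte[] buf = new byte[data.length * 2 + 64];
        int n = 0;
        while (!d.finished()) n += d.deflate(buf, n, buf.length - n);
        d.end();
        return n;
    }

    public static void main(String[] args) throws Exception {
        byte[] page = ("<html><body>" + "nutch crawl segment ".repeat(5000)
                + "</body></html>").getBytes("UTF-8");
        // BEST_SPEED does less work per byte but leaves a larger output,
        // the same kind of tradeoff a faster codec makes against zlib.
        System.out.println("fast=" + compressedSize(page, Deflater.BEST_SPEED)
                + " best=" + compressedSize(page, Deflater.BEST_COMPRESSION));
    }
}
```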

Oh, and happy new year to all!



I just committed the LZO codec to Hadoop which, while it doesn't provide quite as much compression as zlib, is much faster.

This should make it into the Hadoop 0.10.1 release this Friday. I'd love to see how this helps Nutch! Note that it's linux-only native code right now. But it will come pre-built in the Hadoop release for 32-bit linux platforms.
Posted by Doug Cutting, January 10, 2007 at 7:54 PM
