Friday, January 18, 2008


Regenerating equally sized shards from set of Lucene indexes

If you need to split a large index, or a set of indexes, into smaller, equally sized "shards", this prototype tool might be for you. The Lucene distribution includes a tool for combining several indexes into one, but to my knowledge there is no tool for doing the opposite.

The usual use case for splitting an index is index distribution: for example, you plan to distribute pieces of your index across several machines to increase query throughput. Of course, this could also be done by reindexing the data, but resizing existing index shards _seems_ to be faster than that (I still need to do some benchmarking to confirm it).

This tool should be able to handle several different scenarios for you:

1. splitting one large index into many smaller ones

2. combining and resplitting several indexes into a new set of indexes

3. combining several indexes into one

This tool does not try to interpret the physical index format; instead it lets Lucene do the heavy lifting by simply using IndexWriter.addIndexes().
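In outline, the simplest of the scenarios above (combining several indexes into one) can be sketched with the Lucene 2.x API roughly as follows; the directory paths here are illustrative, not from the actual tool:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CombineIndexes {
  public static void main(String[] args) throws Exception {
    // Destination and source directories are placeholders.
    Directory dest = FSDirectory.getDirectory("./dst");
    Directory[] sources = {
      FSDirectory.getDirectory("./src1"),
      FSDirectory.getDirectory("./src2")
    };
    // create=true starts a fresh index in ./dst
    IndexWriter writer = new IndexWriter(dest, new StandardAnalyzer(), true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    writer.addIndexes(sources);  // merge the source indexes in
    writer.optimize();           // collapse the result into a single segment
    writer.close();
  }
}
```

The splitting scenarios work the same way, except that each destination writer is fed only a slice of the source documents rather than whole directories.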

DISCLAIMER: I have only had time to do some very limited testing of this tool with smallish indexes, but I plan to do more testing with bigger indexes soon to get an idea of how it will work in real life.

Download sources.



This is cool! It might be faster to implement MaskingIndexReader to provide a TermPositions implementation whose next() implementation uses skipTo() on the underlying TermPositions at the start of each term, then returns false at the end, thus avoiding processing postings outside the specified range.
# posted by Blogger Doug Cutting : January 23, 2008 at 2:25 AM  
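Doug's suggestion could be sketched along these lines for the Lucene 2.x API. This is a rough, untested sketch, not the actual implementation: the class name, the [start, end) range convention, and the use of TermDocs (rather than TermPositions) are all my assumptions, and a complete version would also need to mask maxDoc(), numDocs(), and isDeleted():

```java
import java.io.IOException;

import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermDocs;

// Hypothetical reader exposing only the doc-id range [start, end)
// of the wrapped reader, so addIndexes() copies just that slice.
public class MaskingIndexReader extends FilterIndexReader {
  private final int start, end;

  public MaskingIndexReader(IndexReader in, int start, int end) {
    super(in);
    this.start = start;
    this.end = end;
  }

  public TermDocs termDocs() throws IOException {
    final TermDocs td = in.termDocs();
    return new FilterTermDocs(td) {
      private boolean first = true;
      public boolean next() throws IOException {
        if (first) {
          // At the start of each term, jump straight to the range start
          first = false;
          return td.skipTo(start) && td.doc() < end;
        }
        // Stop iterating once we pass the end of the range
        return td.next() && td.doc() < end;
      }
    };
  }
}
```

The point of the skipTo()/early-false trick is that postings outside the range are never visited, which is what makes this potentially faster than iterating every posting and filtering.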

Thanks for the pointer Doug! I will check that out.
# posted by Blogger Sami Siren : January 23, 2008 at 6:29 PM  

Hi Sami,

I have a ~20GB wiki dataset index which doesn't fit in my S3 bucket, since the file size limit for S3 is 5GB. So when I was browsing the internet for a Lucene index splitter utility, I found this page. I tried to use the utility posted here, but I couldn't get it to work. Maybe I am doing something wrong, but what I did was create a jar file using the command "jar cfvm iSplitter2.jar Manifest.txt index-slice/*.*" after building the source code in Eclipse with the Lucene 2.4.0 and JUnit 4.5 libraries. Then I tried to run IndexSplitTool with the command "java -jar iSplitter2.jar", at least to see the utility's usage information. However, I am getting this error:

"Exception in thread "main" java.lang.NoClassDefFoundError: IndexSplitTool/class
Caused by: java.lang.ClassNotFoundException: IndexSplitTool.class"

Probably I am doing something wrong when creating the jar file, since I don't have much experience with it. So if you can provide a working jar file on your blog, I would appreciate it. Thanks.

PS: The Manifest.txt content in the jar file is currently:

Manifest-Version: 1.0
Created-By: 1.6.0_07 (Sun Microsystems Inc.)
Main-Class: IndexSplitTool.class
# posted by Blogger Unknown : December 10, 2008 at 6:36 PM  
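The NoClassDefFoundError above is most likely caused by the Main-Class entry in the manifest: it must name the class without the ".class" suffix, and any libraries the tool uses must be listed on the jar's Class-Path. A corrected Manifest.txt might look like this (the Lucene jar file name is an assumption based on the libraries mentioned above):

```
Manifest-Version: 1.0
Main-Class: IndexSplitTool
Class-Path: lucene-core-2.4.0.jar
```

With this manifest, "java -jar iSplitter2.jar" resolves IndexSplitTool as the entry point, provided the Lucene jar sits next to iSplitter2.jar.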


You should pick up the most recent version from the stretch project.

When you have checked out and compiled the sources, go to the search directory and execute the command:

mvn exec:java -Dexec.mainClass="" -Dexec.args="./src ./dst 1000"

where ./src is the source directory (where the original index is), ./dst is the destination directory, and 1000 is the number of docs
# posted by Blogger Sami Siren : December 10, 2008 at 9:26 PM  

Hi Sami,

Thanks for your reply. I checked out the project using Eclipse, but somehow I couldn't build it. I have successfully installed Maven, but since the files won't compile it produces this error:

[INFO] An exception occured while executing the Java class.

I used javac to compile the files, but it didn't help either. Any suggestions? Thanks.
# posted by Blogger Unknown : December 13, 2008 at 9:00 AM  

You can compile the project from its root directory by executing "mvn clean install".
# posted by Blogger Sami Siren : December 13, 2008 at 11:20 AM  

Hi Sami,

I have just followed your instructions and ran into two more errors; I was able to fix one of them, but the other remains. The first occurred when initially downloading the "tika-0.2.jar" file, whose download path was shown as "". When I browsed the supplied link a little, I found out that there was a duplicate "tika" entry; when I removed that entry I was able to download the library. Then I installed it manually with the command "mvn install:install-file -DgroupId=org.apache.tika -DartifactId=tika -Dversion=0.2 -Dpackaging=jar -Dfile=tika-0.2.jar", as suggested in the initial error output. However, when I executed "mvn clean install" I got this error:

"Running fi.foofactory.stretch.parse.tika.TikaParserTest
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.09 sec <<< FAILURE!"

I retried everything in a fresh directory using the source compiling instructions you posted today, but it didn't help either. Do I need to check out some earlier, more stable revision, or is it something else? Any ideas? Meanwhile, thanks a lot for helping with this again.
# posted by Blogger Unknown : December 13, 2008 at 10:20 PM  

Now it should work; there were some problems with the Tika libraries in the Maven repo, but it's all fixed now.
# posted by Blogger Sami Siren : December 17, 2008 at 10:14 PM  

Hi Sami,

Sorry for not replying for such a long time. I don't want to take more of your time, but here is what happened: I needed to find a workaround when I couldn't get my index split, so I had to continue with that approach. However, thank you very much for your help, and best of luck with your project.
# posted by Blogger Unknown : February 1, 2009 at 1:41 AM  

Hi sami,

I used the tool, but all of my shards were created the same size as the original index. Could you help? My original index is an optimized index.
# posted by Blogger pravesh : July 14, 2009 at 8:13 AM  
