Wednesday, August 5, 2009
Payloads with Lucene/Solr
Grant Ingersoll has written a nice post about using payloads with Lucene/Solr.
Grant gives an introduction on what payloads are, shows how they are injected into Lucene index and finally how stored payloads can be used to score documents. Very nice post!
Grant gives an introduction on what payloads are, shows how they are injected into Lucene index and finally how stored payloads can be used to score documents. Very nice post!
Thursday, March 19, 2009
Apache Tika 0.3 released
Apache Tika, a subproject of Apache Lucene, is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Chris Mattmann just announced that the release of version 0.3 is official.
Go grab yourself a copy from a mirror nearby. Tika is also available through the central maven repository.
There is also an article about Tika and Solr Cell at Lucid Imagination web site.
Chris Mattmann just announced that the release of version 0.3 is official.
Go grab yourself a copy from a mirror nearby. Tika is also available through the central maven repository.
There is also an article about Tika and Solr Cell at Lucid Imagination web site.
Wednesday, January 28, 2009
Lucid Imagination Launched
Lucid Imagination has walked away from stealth mode couple of days ago. If you are friend of Apache Lucene, Apache Solr or Apache Tika you might want to check out the fresh articles.
There is also some nice audio/video infotainment available from various guests.
Lucid Imagination has also released certified versions of Lucene and Solr bundled together with a jvm, Luke and Gaze.
There is also some nice audio/video infotainment available from various guests.
Lucid Imagination has also released certified versions of Lucene and Solr bundled together with a jvm, Luke and Gaze.
Sunday, August 31, 2008
Interactive query reporting with lucene
Todays post will give you a simple example of how Apache Lucene could be used as a powerful full-text-ad-hoc-query-enabled data warehouse to build simple reports from, little like Google trends.

In this example we use the software for query count reporting but it could also be used to report counts of any other kind of data for example blog posts, news articles and so on.
The data set used in this example is the AOL query data set which consists of little under 40 million records. A single record contains user identifier, user query, timestamp and click data. We are only interested about queries and timestamps here so we skip the rest of the data.
The used Lucene document structure is simple:
On query side we rely on BitSets, unions of BitSets and the cardinality method to do the hit counting.
For transferring the data between the browser and back end we use json over HTTP. The message format is as follows:
I found jsonlint to be a useful tool for json a novice like me to verify that the json messages I generated are indeed valid.
In end user interface we use YUI charts and YUI datasource components.
Note: Because indexing and webapp are run inside maven in this example you need to make sure there's enough heap available with a command like "export MAVEN_OPTS=-Xmx1024m" )
The code on demo is totally single threaded so it utilizes only one cpu core per request. The dataset used in this demo is so small that the workload is totally cpu bound. You could parallelize at least some of work without exploding the memory requirements (getting BitSets for submitted queries, calculating the intersections).
If you are interested about general trends and relatinve volumes then it does not matter if you miss an observation or some. You could use a technique called sampling to reduce the number of observations and still get good results on relatively frequent terms. On rare terms you can always fall back to statement like
Your terms - <insert search terms here> - do not have enough search volume to show graphs.
The data set could be divided into smaller lucene indexes and each of the indexes could be deployed on more machines. You would need to build an aggregator which would ask the results from those smaller indexes and simply add counts together before returning the final results. A nice way to build index partitions is a recently added Apache Hadoop contrib module.
In an imaginary world if you would use multiple threads to calculate the intersections, use sampling to preserve one tenth of the original observations and would split the data into 10 machines (4 cpu cores each, with unlimited RAM) the response time of the report queries for that same data set (approximation) could be something like 1/400th of the original. Or the other way around: if you're satisfied with the response time already you could use the same technique on a data set of size 1,6*10^11 documents.
For a limited time there's a demo available online here. (the demo will be removed without further notice)

In this example we use the software for query count reporting but it could also be used to report counts of any other kind of data for example blog posts, news articles and so on.
The data set used in this example is the AOL query data set which consists of little under 40 million records. A single record contains user identifier, user query, timestamp and click data. We are only interested about queries and timestamps here so we skip the rest of the data.
The used Lucene document structure is simple:
Document doc = new Document();
doc.add(new Field("query", fields[1].trim(), Field.Store.NO,
Field.Index.TOKENIZED, TermVector.NO));
doc.add(new Field("date", date[0].trim(), Field.Store.NO,
Field.Index.NO_NORMS, TermVector.NO));
On query side we rely on BitSets, unions of BitSets and the cardinality method to do the hit counting.
For transferring the data between the browser and back end we use json over HTTP. The message format is as follows:
{
"counts":[
{"date": "2006-03-01", "c0":43, "c1":0, "c2":230}
,{"date": "2006-03-02", "c0":11, "c1":0, "c2":245}
,{"date": "2006-03-03", "c0":15, "c1":0, "c2":252}
,{"date": "2006-03-04", "c0":50, "c1":2, "c2":288}
,{"date": "2006-03-05", "c0":14, "c1":0, "c2":294}
,{"date": "2006-03-06", "c0":51, "c1":0, "c2":345}
,{"date": "2006-03-07", "c0":19, "c1":0, "c2":225}
,{"date": "2006-03-08", "c0":16, "c1":0, "c2":219}
,{"date": "2006-03-09", "c0":44, "c1":0, "c2":197}
,{"date": "2006-03-10", "c0":32, "c1":0, "c2":269}
,{"date": "2006-03-11", "c0":38, "c1":0, "c2":311}
,{"date": "2006-03-12", "c0":10, "c1":0, "c2":230}
,{"date": "2006-03-13", "c0":25, "c1":0, "c2":162}
,{"date": "2006-03-14", "c0":59, "c1":0, "c2":261}
]
}
I found jsonlint to be a useful tool for json a novice like me to verify that the json messages I generated are indeed valid.
In end user interface we use YUI charts and YUI datasource components.
Steps to get the app running
Note: Because indexing and webapp are run inside maven in this example you need to make sure there's enough heap available with a command like "export MAVEN_OPTS=-Xmx1024m" )
- Download sources (size:6kb)
- Get the data set...
- Build index into directory index :
- Run webapp
mvn jetty:run-war -Dlucene.index.dir=index - Access the interface with your browser at http://localhost:8080/lucene-report/
mvn exec:java -Dexec.mainClass="fi.foofactory.lucene.report.Indexer" -Dexec.args="<location of AOL collection> index"
The resulting index is about 706 mega bytes in size. The run time to build the index with the machine I used was around 20 minutes.
Possible enhancements
Parallelism
The code on demo is totally single threaded so it utilizes only one cpu core per request. The dataset used in this demo is so small that the workload is totally cpu bound. You could parallelize at least some of work without exploding the memory requirements (getting BitSets for submitted queries, calculating the intersections).
Approximation
If you are interested about general trends and relatinve volumes then it does not matter if you miss an observation or some. You could use a technique called sampling to reduce the number of observations and still get good results on relatively frequent terms. On rare terms you can always fall back to statement like
Your terms - <insert search terms here> - do not have enough search volume to show graphs.
Partitioning
The data set could be divided into smaller lucene indexes and each of the indexes could be deployed on more machines. You would need to build an aggregator which would ask the results from those smaller indexes and simply add counts together before returning the final results. A nice way to build index partitions is a recently added Apache Hadoop contrib module.
Imagine
In an imaginary world if you would use multiple threads to calculate the intersections, use sampling to preserve one tenth of the original observations and would split the data into 10 machines (4 cpu cores each, with unlimited RAM) the response time of the report queries for that same data set (approximation) could be something like 1/400th of the original. Or the other way around: if you're satisfied with the response time already you could use the same technique on a data set of size 1,6*10^11 documents.
For a limited time there's a demo available online here. (the demo will be removed without further notice)
Friday, January 18, 2008
Regenerating equally sized shards from set of Lucene indexes
If you have a need to split your large index or set of indexes into smaller equally sized "shards" this prototype of tool might be for you. There is a tool for combining several indexes into one inside Lucene distribution but to my knowledge there is no tool to do the opposite.
The usual use case for splitting your index is index distribution: for example you plan to distribute (pieces of your index) into several machines to increase query throughput. Of course this operation could be done by reindexing the data, but resizing the index shards _seems_ to be faster than that (need to do some benching to confirm that).
This tool should be able to handle several different scenarios for you:
1. splitting one large index into many smaller ones
2. combining and resplitting several indexes into new set of indexes
3. combining several indexes into one
This tool does not try to interprete the physical index format but lets Lucene do the heavy lifting by simply using IndexWriter.addIndexes().
DISCLAIMER: I only had time to do some very limited testing with smallish indexes with this tool, but I plan to do some more testing with bigger indexes soon to get an idea how this will work in real life.
Download sources.
The usual use case for splitting your index is index distribution: for example you plan to distribute (pieces of your index) into several machines to increase query throughput. Of course this operation could be done by reindexing the data, but resizing the index shards _seems_ to be faster than that (need to do some benching to confirm that).
This tool should be able to handle several different scenarios for you:
1. splitting one large index into many smaller ones
2. combining and resplitting several indexes into new set of indexes
3. combining several indexes into one
This tool does not try to interprete the physical index format but lets Lucene do the heavy lifting by simply using IndexWriter.addIndexes().
DISCLAIMER: I only had time to do some very limited testing with smallish indexes with this tool, but I plan to do some more testing with bigger indexes soon to get an idea how this will work in real life.
Download sources.
