Sunday, August 31, 2008

 

Interactive query reporting with Lucene

Today's post gives a simple example of how Apache Lucene can be used as a powerful full-text, ad-hoc-query-enabled data warehouse for building simple reports, a little like Google Trends.



In this example we use it for query count reporting, but the same approach could be used to report counts of any other kind of data: blog posts, news articles, and so on.

The data set used in this example is the AOL query data set, which consists of a little under 40 million records. A single record contains a user identifier, the query text, a timestamp, and click data. We are only interested in the queries and timestamps here, so we skip the rest of the data.
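
Each record is a tab-separated line (AnonID, Query, QueryTime, ItemRank, ClickURL). As a rough sketch (the variable names line, fields and date are my own, chosen to match the indexing snippet below), pulling out the parts we need could look like this:


// Split one tab-separated AOL record into its fields.
// fields = [AnonID, Query, QueryTime, ItemRank, ClickURL]
String[] fields = line.split("\t");
// QueryTime looks like "2006-03-01 17:35:12"; date[0] is the day part.
String[] date = fields[2].split(" ");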

The Lucene document structure used is simple:


import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
// The query text is tokenized so individual terms are searchable.
doc.add(new Field("query", fields[1].trim(), Field.Store.NO,
        Field.Index.TOKENIZED, Field.TermVector.NO));
// The date is indexed as a single untokenized term, with norms disabled.
doc.add(new Field("date", date[0].trim(), Field.Store.NO,
        Field.Index.NO_NORMS, Field.TermVector.NO));


On the query side we rely on BitSets, unions and intersections of BitSets, and the cardinality method to do the hit counting.
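
As a rough sketch of what that looks like (this assumes the Lucene 2.x Filter.bits(IndexReader) API, and the field values here are made up), counting the hits for one query term on one day:


import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

// One BitSet per condition: bit i is set if document i matches.
IndexReader reader = IndexReader.open("index");
BitSet term = new QueryFilter(
        new TermQuery(new Term("query", "lucene"))).bits(reader);
BitSet day = new QueryFilter(
        new TermQuery(new Term("date", "2006-03-01"))).bits(reader);

// For a multi-term query, or() the per-term BitSets together first.
BitSet hits = (BitSet) term.clone();
hits.and(day);                   // intersect with the date
int count = hits.cardinality();  // one count cell in the report

Repeating the intersection for each date in the range yields one time series per submitted query.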

For transferring data between the browser and the back end we use JSON over HTTP. The message format is as follows:


{
"counts":[
{"date": "2006-03-01", "c0":43, "c1":0, "c2":230}
,{"date": "2006-03-02", "c0":11, "c1":0, "c2":245}
,{"date": "2006-03-03", "c0":15, "c1":0, "c2":252}
,{"date": "2006-03-04", "c0":50, "c1":2, "c2":288}
,{"date": "2006-03-05", "c0":14, "c1":0, "c2":294}
,{"date": "2006-03-06", "c0":51, "c1":0, "c2":345}
,{"date": "2006-03-07", "c0":19, "c1":0, "c2":225}
,{"date": "2006-03-08", "c0":16, "c1":0, "c2":219}
,{"date": "2006-03-09", "c0":44, "c1":0, "c2":197}
,{"date": "2006-03-10", "c0":32, "c1":0, "c2":269}
,{"date": "2006-03-11", "c0":38, "c1":0, "c2":311}
,{"date": "2006-03-12", "c0":10, "c1":0, "c2":230}
,{"date": "2006-03-13", "c0":25, "c1":0, "c2":162}
,{"date": "2006-03-14", "c0":59, "c1":0, "c2":261}
]
}

I found jsonlint to be a useful tool for a JSON novice like me to verify that the JSON messages I generated are indeed valid.

In the end-user interface we use the YUI Charts and YUI DataSource components.

Steps to get the app running



Note: Because both indexing and the webapp are run inside Maven in this example, you need to make sure there's enough heap available with a command like "export MAVEN_OPTS=-Xmx1024m".


  1. Download the sources (size: 6 kB)


  2. Get the data set...


  3. Build the index into the directory index:

     mvn exec:java -Dexec.mainClass="fi.foofactory.lucene.report.Indexer" -Dexec.args="<location of AOL collection> index"

     The resulting index is about 706 megabytes in size. Building it took around 20 minutes on the machine I used.

  4. Run the webapp:

     mvn jetty:run-war -Dlucene.index.dir=index

  5. Access the interface with your browser at http://localhost:8080/lucene-report/




Possible enhancements


Parallelism


The demo code is completely single-threaded, so it utilizes only one CPU core per request. The data set used in this demo is small enough to stay in memory, so the workload is entirely CPU bound. You could parallelize at least some of the work without exploding the memory requirements (fetching the BitSets for the submitted queries, calculating the intersections), as sketched below.
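
A hypothetical sketch using java.util.concurrent (reader, a final IndexReader, and queries, the list of submitted queries, are assumed to come from the surrounding request handling), farming the BitSet fetching out to a thread pool:


import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;

// Fetch the BitSet for each submitted query on its own thread;
// IndexReader is safe for concurrent reads.
ExecutorService pool = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());

List<Future<BitSet>> futures = new ArrayList<Future<BitSet>>();
for (final Query query : queries) {      // one query per chart series
    futures.add(pool.submit(new Callable<BitSet>() {
        public BitSet call() throws Exception {
            return new QueryFilter(query).bits(reader);
        }
    }));
}
for (Future<BitSet> future : futures) {
    BitSet bits = future.get();          // results in submission order
    // ... intersect with the per-date BitSets and count as before
}
pool.shutdown();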

Approximation


If you are interested in general trends and relative volumes, it does not matter if you miss a few observations. You could use a technique called sampling to reduce the number of observations and still get good results for relatively frequent terms. For rare terms you can always fall back to a statement like


Your terms - <insert search terms here> - do not have enough search volume to show graphs.
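
A minimal sketch of the sampling idea (the 10% rate and the Sampler class are my own invention): keep a random tenth of the records at indexing time and scale the reported counts back up.


import java.util.Random;

// Keep roughly 10% of the records; scale counts back up when reporting.
public class Sampler {
    private static final double SAMPLE_RATE = 0.1;
    private final Random random = new Random(42); // fixed seed, repeatable

    // Called once per record at indexing time.
    public boolean keep() {
        return random.nextDouble() < SAMPLE_RATE;
    }

    // A count over the sampled index estimates the true count / SAMPLE_RATE.
    public long estimate(int sampledCount) {
        return Math.round(sampledCount / SAMPLE_RATE);
    }
}

The estimate gets noisy exactly when counts are small, which is when the fallback message above applies.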

Partitioning


The data set could be divided into smaller Lucene indexes, each deployed on a separate machine. You would need to build an aggregator that requests results from those smaller indexes and simply adds the counts together before returning the final result. A nice way to build index partitions is the recently added Apache Hadoop contrib module.
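
The aggregation step itself is simple; here is a sketch (assuming each partition returns its per-date counts as a Map keyed by date, mirroring the JSON format above):


import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Merge per-date counts from each index partition by adding them up.
public class Aggregator {
    public Map<String, Long> merge(List<Map<String, Long>> partitions) {
        Map<String, Long> totals = new LinkedHashMap<String, Long>();
        for (Map<String, Long> partition : partitions) {
            for (Map.Entry<String, Long> entry : partition.entrySet()) {
                Long current = totals.get(entry.getKey());
                totals.put(entry.getKey(),
                        (current == null ? 0L : current) + entry.getValue());
            }
        }
        return totals;
    }
}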

Imagine


In an imaginary world, if you used multiple threads to calculate the intersections, used sampling to preserve one tenth of the original observations, and split the data across 10 machines (4 CPU cores each, with unlimited RAM), the response time of the report queries for that same data set (approximated) could be something like 1/400th of the original: 10 machines times 4 cores gives 40-way parallelism, and sampling contributes another factor of 10. Or the other way around: if you're satisfied with the response time already, you could use the same techniques on a data set roughly 400 times the size, about 1.6*10^10 documents.

For a limited time there's a demo available online here. (The demo will be removed without further notice.)
