<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss'><id>tag:blogger.com,1999:blog-830893785995753823</id><updated>2010-02-14T17:32:53.609+02:00</updated><title type='text'>FooFactory</title><subtitle type='html'>Java Lucene Solr Nutch Hadoop and Open Source in general</subtitle><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default?start-index=26&amp;max-results=25'/><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://blog.foofactory.fi/atom.xml'/><author><name>Sami Siren</name><email>noreply@blogger.com</email></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>29</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-5703433805918919448</id><published>2009-11-04T15:11:00.004+02:00</published><updated>2009-11-04T17:30:16.898+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='open source'/><title type='text'>Open source electric cars?</title><content type='html'>Is it possible to apply open source principles to building electric cars? It seems that it is. The &lt;a href="http://www.sahkoautot.fi/eng"&gt;Electric Cars - Now!&lt;/a&gt; movement (based in Finland) is running a project that lets you convert your everyday fuel-powered car into a Zero Emission Vehicle. 
All the software, blueprints and assembly instructions are provided as open source, free for everybody to use.&lt;br /&gt;&lt;br /&gt;They currently have a working prototype called eCorolla, based on the Toyota Corolla, that can drive up to 150 km on a full battery (I happened to see it when a false fire alarm drove me to the yard of Innopoli earlier today). The price of the conversion kit is about 20 000 EUR. The lithium batteries are currently the most expensive part. Battery prices are expected to drop as producers ramp up their production lines to meet future demand.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-5703433805918919448?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/5703433805918919448/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=5703433805918919448' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/5703433805918919448'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/5703433805918919448'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2009/11/open-source-electric-cars.html' title='Open source electric cars?'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total 
xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-2112731390970746173</id><published>2009-08-05T18:54:00.003+03:00</published><updated>2009-08-05T19:01:54.508+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='solr'/><category scheme='http://www.blogger.com/atom/ns#' term='lucene'/><category scheme='http://www.blogger.com/atom/ns#' term='payload'/><title type='text'>Payloads with Lucene/Solr</title><content type='html'>Grant Ingersoll has written a nice &lt;a href="http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/"&gt;post about using payloads&lt;/a&gt; with Lucene/Solr.&lt;br /&gt;&lt;br /&gt;Grant gives an introduction to what payloads are, shows how they are injected into a Lucene index and finally how stored payloads can be used to score documents. Very nice post!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-2112731390970746173?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/2112731390970746173/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=2112731390970746173' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/2112731390970746173'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/2112731390970746173'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2009/08/payloads-with-lucenesolr.html' title='Payloads with Lucene/Solr'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' 
name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-2695744671653771612</id><published>2009-06-10T21:42:00.004+03:00</published><updated>2009-06-10T22:00:47.031+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Yahoo! announces their own Distribution of Hadoop</title><content type='html'>Yahoo! has &lt;a href="http://developer.yahoo.net/blogs/hadoop/2009/06/yahoo_distribution_of_hadoop.html"&gt;announced&lt;/a&gt; the availability of the Yahoo! Distribution of Hadoop.&lt;br /&gt;&lt;br /&gt;Each Yahoo! Distribution of Hadoop goes through exhaustive two-day testing on Yahoo's 500-node test cluster. Yahoo! also promises that all improvements and patches will be released under the Apache License, either through Apache Jira or directly into the Apache source code repository.&lt;br /&gt;&lt;br /&gt;Yahoo! 
is not offering any support for their releases but they expect companies that are specialized in &lt;a href="http://hadoop.apache.org/core/"&gt;Hadoop&lt;/a&gt; to take care of that part.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-2695744671653771612?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/2695744671653771612/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=2695744671653771612' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/2695744671653771612'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/2695744671653771612'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2009/06/yahoo-announces-their-own-distribution.html' title='Yahoo! announces their own Distribution of Hadoop'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-8690595073413325879</id><published>2009-06-09T17:54:00.002+03:00</published><updated>2009-06-09T18:15:55.006+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='open source'/><title type='text'>54% of organizations in Finland use OSS</title><content type='html'>According to TNS Gallup (on behalf of Sun) &lt;a href="http://www.mysql.com/news-and-events/generate-article.php?id=2009_10"&gt;54% of organizations in Finland use Open Source Software&lt;/a&gt;. 
OSS penetration is high in all of the Nordic countries (Finland, Sweden, Norway, Denmark), with an average of 46%.&lt;br /&gt;&lt;br /&gt;Sun believes that the increased demand for open source in general is driven largely by the global economic downturn. I am pretty sure OSS adoption will keep climbing even when the economic situation gets better.&lt;br /&gt;&lt;br /&gt;Marc Krellenstein has written a nice post about the &lt;a href="http://blog.lucidimagination.com/?p=10"&gt;myths of open source software&lt;/a&gt; that might help you decide on a policy for using Open Source (Search) Software in your organization.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-8690595073413325879?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/8690595073413325879/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=8690595073413325879' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/8690595073413325879'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/8690595073413325879'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2009/06/54-of-organizations-in-finland-use-oss.html' title='54% of organizations in Finland use OSS'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total 
xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-3343068859240294343</id><published>2009-05-14T08:41:00.002+03:00</published><updated>2009-05-14T08:56:30.047+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='yahoo'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Yahoo calculates digits of π (a lot) faster than before</title><content type='html'>Yahoo recently reported that they have calculated more bits of π in roughly a weekend than an earlier project called &lt;a href="http://oldweb.cecm.sfu.ca/projects/pihex/announce1q.html"&gt;PiHex&lt;/a&gt; did in a span of 2 years. The record-setting software ran on a &lt;a href="http://hadoop.apache.org/core/"&gt;Hadoop&lt;/a&gt; cluster and was implemented in the Java programming language. Check the details in the &lt;a href="http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_computes_the_10151st_bi.html"&gt;original post&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-3343068859240294343?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/3343068859240294343/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=3343068859240294343' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/3343068859240294343'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/3343068859240294343'/><link rel='alternate' type='text/html' 
href='http://blog.foofactory.fi/2009/05/yahoo-calculates-digits-of-lot-faster.html' title='Yahoo calculates digits of π (a lot) faster than before'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-8354190874407902591</id><published>2009-04-10T22:25:00.004+02:00</published><updated>2009-04-10T23:31:36.019+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='gae'/><title type='text'>First thoughts on GAE/Java</title><content type='html'>&lt;a href="http://code.google.com/appengine/"&gt;Google App Engine&lt;/a&gt; has been around for over a year now. It didn't have much of an impact on me at first, since the only language you could implement your application in was Python, and I don't speak it natively. After Google's Java announcement a few days ago I was fast enough to grab one of the first 10k test accounts. I have since learned a great deal about some nice technologies that were new to me, and I have to say that I am actually excited about web application development again.&lt;br /&gt;&lt;br /&gt;The Eclipse plugin that was released at the same time is one of the biggest things that helped me get a rapid start on Google App Engine. The plugin contains all the components required for App Engine development, and for GWT development as well, so after installing it you can immediately start developing applications for Google App Engine. There is no need to install any additional software or SDKs. Writing the usual HelloWorld type of application is easy, as Google was kind enough to provide &lt;a href="http://code.google.com/appengine/docs/java/gettingstarted/"&gt;nice documentation&lt;/a&gt; on their web site. 
After you have the application running in the local development environment you can launch it in the Google cloud with a single mouse click from your Eclipse IDE.&lt;br /&gt;&lt;br /&gt;One of the nicest things (IMO) about the application deployment process is the possibility to deploy several versions of your application. You can access different versions simultaneously (they are all accessible behind different hostnames). Accessing the versions through special URLs before making them public makes testing and functional verification of the application an enjoyable experience. When you are satisfied with a new version you can make it the default one, which is then accessible to normal site visitors. Also, if you notice a glitch in the application at some later point, you can very easily "roll back" to an earlier version. (I wish I had had this kind of deployment mechanism in place at some customer sites.)&lt;br /&gt;&lt;br /&gt;Google App Engine does not support storing data in the filesystem. The way to persist data is to store it in Google's Bigtable. From Java that is easily done with JDO or JPA. I tried the JDO way (for the first time in my life) and it was actually a very pleasant experience. The only thing you need to do (in the simplest case) to persist your domain objects is to annotate them with JDO annotations: one annotation per class and one annotation per persistent property. After that you just call the makePersistent method of PersistenceManager and your data is safe. Another nice feature of Google App Engine is that you can browse and query the persisted data from the Dashboard application. You can also insert data manually through the Dashboard.&lt;br /&gt;&lt;br /&gt;As part of my process of learning new things I decided to take a recent Java application framework, implement a small application with it and deploy it to Google App Engine. The framework of my choice was &lt;a href="http://incubator.apache.org/click/"&gt;Apache Click&lt;/a&gt;. 
There were only two minor issues I had to work through to get it up and running on Google App Engine. The first was to exclude the Velocity templates from being served as static resources (in appengine-web.xml):&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;    &amp;lt;static-files&gt;&lt;br /&gt;        &amp;lt;exclude path="**.htm" /&gt;&lt;br /&gt;    &amp;lt;/static-files&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The other issue was related to the OGNL library, which Apache Click uses to copy properties from Click forms to domain objects (which can then be persisted with JDO). After setting the security manager of OgnlRuntime to null it started to work properly. Click has one fancy feature that allows application extensions packaged as jars to provide static content to the webapp; unfortunately it will not work, as it relies on using the filesystem. &lt;br /&gt;&lt;br /&gt;Overall I have to say it was great fun to work with Google App Engine, and I think it is a very nice platform on which to build and run your applications. The starting costs are definitely not too high either. You can store up to 500 MB of data and serve up to 5 million page views per month for free! That is more than enough for the majority of web sites.&lt;br /&gt;&lt;br /&gt;Now if only you could offer your home-grown Google App Engine applications in some marketplace similar to what they have in place for Android. The applications would be installable with a single mouse click, and running them on a small scale would be very affordable or free. 
I bet this kind of commercial environment would be very interesting for application builders too.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-8354190874407902591?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/8354190874407902591/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=8354190874407902591' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/8354190874407902591'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/8354190874407902591'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2009/04/first-thoughts-on-gaejava.html' title='First thoughts on GAE/Java'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-2867364520929544493</id><published>2009-04-08T08:54:00.003+02:00</published><updated>2009-04-08T09:00:20.952+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='google'/><title type='text'>It's here: Java on Google App Engine</title><content type='html'>Just noticed the Google announcement about the availability of &lt;a href="http://googleappengine.blogspot.com/2009/04/seriously-this-time-new-language-on-app.html"&gt;Java&lt;/a&gt; programming language on Google App Engine.&lt;br /&gt;&lt;br /&gt;So far the only 
implementation language has been Python; then they joked about &lt;a href="http://googleappengine.blogspot.com/2009/04/brand-new-language-on-google-app-engine.html"&gt;Fortran&lt;/a&gt;, and now they announce Java.&lt;br /&gt;&lt;br /&gt;I am very excited about this announcement and can't wait to see more details.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-2867364520929544493?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/2867364520929544493/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=2867364520929544493' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/2867364520929544493'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/2867364520929544493'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2009/04/its-here-java-on-google-app-engine.html' title='It&apos;s here: Java on Google App Engine'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-5169365191841866158</id><published>2009-04-02T10:34:00.006+02:00</published><updated>2009-04-02T12:00:25.841+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='amazon'/><category scheme='http://www.blogger.com/atom/ns#' term='saas'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Amazon offers MapReduce as a 
service</title><content type='html'>Amazon announced today a new service called &lt;a href="http://aws.amazon.com/elasticmapreduce/"&gt;Elastic MapReduce&lt;/a&gt;. It promises to ease the configuration and management of Hadoop clusters. The pricing model is simple: add an additional $0.015 - $0.12 per instance hour (depending on the size of the instances you use) to your bill.&lt;br /&gt;&lt;br /&gt;Converted to monthly payments (about 730 hours), that makes $10.95 - $87.60 per machine. When you calculate the total cost of running one instance (the extra large instance), the cost is $671.60 per month, and that does not yet include the network or storage costs. Just for comparison: you can buy a single machine with similar specs in Finland for about $1300. Do you also think that Amazon should check their pricing in general?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-5169365191841866158?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/5169365191841866158/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=5169365191841866158' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/5169365191841866158'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/5169365191841866158'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2009/04/amazon-offers-mapreduce-as-service.html' title='Amazon offers MapReduce as a service'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total 
xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-8873218188295375151</id><published>2009-03-19T18:36:00.003+02:00</published><updated>2009-03-19T18:45:31.598+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='tika'/><category scheme='http://www.blogger.com/atom/ns#' term='lucene'/><category scheme='http://www.blogger.com/atom/ns#' term='apache'/><title type='text'>Apache Tika 0.3 released</title><content type='html'>&lt;a href="http://lucene.apache.org/tika/"&gt;Apache Tika&lt;/a&gt;, a subproject of Apache Lucene, is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://sunset.usc.edu/~mattmann/"&gt;Chris Mattmann&lt;/a&gt; just announced that the release of version 0.3 is official.&lt;br /&gt;&lt;br /&gt;Go grab yourself a copy from a &lt;a href="http://www.apache.org/dyn/closer.cgi/lucene/tika/"&gt;mirror nearby&lt;/a&gt;. 
Tika is also available through the central Maven repository.&lt;br /&gt;&lt;br /&gt;There is also an article about &lt;a href="http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika"&gt;Tika and Solr Cell&lt;/a&gt; on the Lucid Imagination web site.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-8873218188295375151?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/8873218188295375151/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=8873218188295375151' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/8873218188295375151'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/8873218188295375151'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2009/03/apache-tika-03-released.html' title='Apache Tika 0.3 released'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-7955831236086276324</id><published>2009-03-18T17:00:00.006+02:00</published><updated>2009-03-18T17:11:11.088+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Hadoop the easy edition</title><content type='html'>&lt;a href="http://www.cloudera.com/"&gt;Cloudera&lt;/a&gt; has put together a nice-looking &lt;a href="https://my.cloudera.com/"&gt;configurator&lt;/a&gt; for &lt;a 
href="http://hadoop.apache.org/"&gt;Apache Hadoop&lt;/a&gt;. (&lt;a href="http://www.cloudera.com/hadoop/"&gt;see video&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;They also offer a yum repository for installing an RPMified version of Hadoop that is manageable as a standard Linux service and comes with local documentation and man pages.&lt;br /&gt;&lt;br /&gt;All of this is of course available under the commercially friendly Apache License.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-7955831236086276324?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/7955831236086276324/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=7955831236086276324' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/7955831236086276324'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/7955831236086276324'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2009/03/hadoop-easy-edition.html' title='Hadoop the easy edition'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-6511927504942807838</id><published>2009-02-19T12:55:00.003+02:00</published><updated>2009-02-19T13:39:07.407+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='solr'/><category scheme='http://www.blogger.com/atom/ns#' term='nutch'/><title type='text'>Next Apache Nutch release is 
nearing</title><content type='html'>The &lt;a href="https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&amp;mode=hide&amp;sorter/order=DESC&amp;sorter/field=priority&amp;resolution=-1&amp;pid=10680&amp;fixfor=12312443"&gt;list of bugs&lt;/a&gt; remaining to be fixed for the next version of Nutch is getting shorter.&lt;br /&gt;&lt;br /&gt;I am positive that we will get the first release candidate out during February 2009. We certainly should, since the last release (0.9) came out in April 2007.&lt;br /&gt;&lt;br /&gt;There are some very nice additions in the new version of Nutch, such as Solr integration and a new scoring and indexing framework. I will try to post more about those later.&lt;br /&gt;&lt;br /&gt;If you want to help get version 1.0 out, now is the time to act: go download the &lt;a href="http://hudson.zones.apache.org/hudson/job/Nutch-trunk/"&gt;latest nightly version of Nutch&lt;/a&gt;, take it for a spin and report back any problems you experience. We could also really use more documentation, in the form of wiki pages and tutorials, for the new stuff!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-6511927504942807838?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/6511927504942807838/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=6511927504942807838' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/6511927504942807838'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/6511927504942807838'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2009/02/next-apache-nutch-release-is-nearing.html' 
title='Next Apache Nutch release is nearing'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-7868562462988055408</id><published>2009-01-28T17:52:00.004+02:00</published><updated>2009-01-28T18:52:32.790+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='tika'/><category scheme='http://www.blogger.com/atom/ns#' term='solr'/><category scheme='http://www.blogger.com/atom/ns#' term='lucene'/><title type='text'>Lucid Imagination Launched</title><content type='html'>&lt;a href="http://www.lucidimagination.com/"&gt;Lucid Imagination&lt;/a&gt; came out of stealth mode a couple of days ago. If you are a friend of &lt;a href="http://lucene.apache.org/java/docs/index.html"&gt;Apache Lucene&lt;/a&gt;, &lt;a href="http://lucene.apache.org/solr/"&gt;Apache Solr&lt;/a&gt; or &lt;a href="http://lucene.apache.org/tika/"&gt;Apache Tika&lt;/a&gt; you might want to check out the fresh &lt;a href="http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/"&gt;articles&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;There is also some nice audio/video &lt;a href="http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos/"&gt;infotainment&lt;/a&gt; available from various guests.&lt;br /&gt;&lt;br /&gt;Lucid Imagination has also released &lt;a href="http://www.lucidimagination.com/Downloads/Certified-Distributions/"&gt;certified versions&lt;/a&gt; of Lucene and Solr bundled together with a JVM, &lt;a href="http://www.getopt.org/luke/"&gt;Luke&lt;/a&gt; and &lt;a href="http://www.lucidimagination.com/Downloads/Certified-Distributions/#solrgaze"&gt;Gaze&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/830893785995753823-7868562462988055408?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/7868562462988055408/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=7868562462988055408' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/7868562462988055408'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/7868562462988055408'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2009/01/lucid-imagination-launched.html' title='Lucid Imagination Launched'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-4154457942633553386</id><published>2008-11-26T22:00:00.009+02:00</published><updated>2008-12-19T23:42:03.519+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='boss'/><category scheme='http://www.blogger.com/atom/ns#' term='api'/><category scheme='http://www.blogger.com/atom/ns#' term='yahoo'/><category scheme='http://www.blogger.com/atom/ns#' term='search'/><title type='text'>Yahoo expands BOSS functionality</title><content type='html'>Yahoo seems to be busy adding new features to its &lt;a href="http://developer.yahoo.com/search/boss/"&gt;BOSS platform&lt;/a&gt;. 
BOSS appears to be gaining momentum, judging by the many interesting mashups and verticals that have been built on top of it lately.&lt;br /&gt;&lt;br /&gt;The new piece of functionality, which they call &lt;a href="http://www.ysearchblog.com/archives/000659.html"&gt;Vertical Lens&lt;/a&gt;, offers faceted search over structured content. Custom results can also be blended with generic search results. Yahoo says it's possible to tweak the ranking of search results and also to feed in proprietary content in real time.&lt;br /&gt;&lt;br /&gt;At this time the new feature is only available to selected partners, but Yahoo is working on extending the availability.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-4154457942633553386?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/4154457942633553386/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=4154457942633553386' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/4154457942633553386'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/4154457942633553386'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2008/11/yahoo-expands-boss-functionality.html' title='Yahoo expands BOSS functionality'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total 
xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-8801848070257615370</id><published>2008-10-09T20:11:00.010+02:00</published><updated>2009-03-10T15:27:35.736+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='maven2'/><category scheme='http://www.blogger.com/atom/ns#' term='GWT'/><category scheme='http://www.blogger.com/atom/ns#' term='eclipse'/><title type='text'>GWT + maven2 + eclipse on 64 bit linux in 30 seconds</title><content type='html'>&lt;a href="http://code.google.com/webtoolkit/"&gt;GWT&lt;/a&gt; is a toolkit for building world class web2.0 GUI applications without the headache. Setting up an Eclipse dev environment on 64 bit Fedora Core (version 9), however, required some extra steps. I heard that Google is working on a smoother 64 bit integration, but until it's here you might find the list of required actions useful:&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;The basic installation&lt;/h2&gt;&lt;br /&gt;1. Get GWT&lt;br /&gt;GWT comes in three flavours - one for win, one for mac and one for (32 bit) linux. Manual installation of GWT is not strictly required for maven + eclipse use, but it is convenient to have it around so that you can run your apps in hosted mode without maven.&lt;br /&gt;&lt;br /&gt;2. 
Get 32 bit jdk&lt;br /&gt;The default JDK in my fc 9 installation identified itself as "OpenJDK  Runtime Environment (build 1.6.0-b09)" and if you try to start the GWT Shell with it, all you get is the following error message:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;Exception in thread "main" java.lang.UnsatisfiedLinkError: ...gwt-linux-1.5.2/libswt-pi-gtk-3235.so: .../gwt/gwt-linux-1.5.2/libswt-pi-gtk-3235.so: wrong ELF class: ELFCLASS32 (Possible cause: architecture word width mismatch)&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;The cure for this problem is to fetch and install a 32 bit Java from &lt;a href="http://java.sun.com/"&gt;Sun&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;3. Install some required 32 bit libraries:&lt;br /&gt;&lt;br /&gt;If you see an error like:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;Exception in thread "main" java.lang.UnsatisfiedLinkError: .../gwt/gwt-linux-1.5.2/libswt-pi-gtk-3235.so: libXtst.so.6: cannot open shared object file: No such file or directory&lt;br /&gt; at java.lang.ClassLoader$NativeLibrary.load(Native Method)&lt;br /&gt; at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1778)&lt;br /&gt; at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1674)&lt;br /&gt; at java.lang.Runtime.load0(Runtime.java:770)&lt;br /&gt; at java.lang.System.load(System.java:1005)&lt;br /&gt; at org.eclipse.swt.internal.Library.loadLibrary(Library.java:132)&lt;br /&gt; at org.eclipse.swt.internal.gtk.OS.&amp;lt;clinit&amp;gt;(OS.java:22)&lt;br /&gt; at org.eclipse.swt.internal.Converter.wcsToMbcs(Converter.java:63)&lt;br /&gt; at org.eclipse.swt.internal.Converter.wcsToMbcs(Converter.java:54)&lt;br /&gt; at org.eclipse.swt.widgets.Display.&amp;lt;clinit&amp;gt;(Display.java:126)&lt;br /&gt; at com.google.gwt.dev.GWTShell.&amp;lt;clinit&amp;gt;(GWTShell.java:301)&lt;br /&gt;Could not find the main class: com.google.gwt.dev.GWTShell.  
Program will exit.&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;you need to install the 32 bit version of libXtst:&lt;br /&gt;&lt;code&gt;sudo yum install libXtst.i386&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;If you see an error like&lt;br /&gt;&lt;code&gt;&lt;br /&gt;Exception in thread "main" java.lang.UnsatisfiedLinkError: .../libswt-pi-gtk-3235.so: libgtk-x11-2.0.so.0: cannot open shared object file: No such file or directory&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;you need to install the 32 bit version of gtk2:&lt;br /&gt;&lt;code&gt;sudo yum install gtk2.i386&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;If you see an error like&lt;br /&gt;&lt;code&gt;&lt;br /&gt;** Unable to load Mozilla for hosted mode **&lt;br /&gt;java.lang.UnsatisfiedLinkError: .../gwt/gwt-linux-1.5.2/mozilla-1.7.12/libxpcom.so: libstdc++.so.5: cannot open shared object file: No such file or directory&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;you need to install the 32 bit compat-libstdc++ library:&lt;br /&gt;&lt;code&gt;sudo yum install compat-libstdc++-33.i386&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;You are now ready for command line development of GWT apps. Just remember to set the path so that the 32 bit Java is used to launch the GWT Shell.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Alternative way for maven users&lt;/h2&gt;&lt;br /&gt;1. Get 32 bit jdk&lt;br /&gt;&lt;br /&gt;2. Install the required 32 bit libraries&lt;br /&gt;&lt;br /&gt;3. Check out the sample maven app that uses gwt-maven&lt;br /&gt;&lt;br /&gt;&lt;code&gt;svn co http://gwt-maven.googlecode.com/svn/trunk/maven-googlewebtoolkit2-plugin/simplesample/ sample&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Launch the sample application&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;cd sample&lt;br /&gt;mvn gwt:gwt&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Launching GWT Shell from eclipse&lt;/h2&gt;&lt;br /&gt;1. Import (or check out from scm) your project into eclipse&lt;br /&gt;2. 
From the Run menu select Run Configurations&lt;br /&gt;3. Right click Java Application on the left side of the screen and select New&lt;br /&gt;4. Set com.google.gwt.dev.GWTShell as Main class&lt;br /&gt;5. On the Arguments tab enter the module name as program argument (in the sample it is com.totsp.sample.Application)&lt;br /&gt;6. On the JRE tab make sure that the 32 bit jre is used for the project&lt;br /&gt;7. On the Classpath tab click Advanced..., select Add Folders and select the src/java folder from your project (or the folder that contains the &amp;lt;package&gt;/Module.gwt.xml file)&lt;br /&gt;8. Click 'Apply' and 'Run'&lt;br /&gt;&lt;br /&gt;Your GWT app is now running and you can enjoy features like a nice fast dev cycle (edit-&gt;save-&gt;refresh), the debugger etc.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-8801848070257615370?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/8801848070257615370/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=8801848070257615370' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/8801848070257615370'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/8801848070257615370'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2008/10/gwt-maven2-eclipse-on-64-bit-linux-in.html' title='GWT + maven2 + eclipse on 64 bit linux in 30 seconds'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total 
xmlns:thr='http://purl.org/syndication/thread/1.0'>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-4877057432974483983</id><published>2008-08-31T19:36:00.006+02:00</published><updated>2008-08-31T20:24:04.254+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='yui'/><category scheme='http://www.blogger.com/atom/ns#' term='lucene'/><category scheme='http://www.blogger.com/atom/ns#' term='json'/><category scheme='http://www.blogger.com/atom/ns#' term='reporting'/><title type='text'>Interactive query reporting with lucene</title><content type='html'>Today's post gives a simple example of how Apache Lucene could be used as a powerful data warehouse with full-text ad hoc queries from which to build simple reports, a &lt;i&gt;little&lt;/i&gt; like &lt;a href="http://www.google.com/trends"&gt;Google Trends&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;center&gt;&lt;img src="http://blog.foofactory.fi/images/interactive-report.png"&gt;&lt;/center&gt;&lt;br /&gt;&lt;br /&gt;In this example we use the software for query count reporting, but it could also be used to report counts of any other kind of data, for example blog posts, news articles and so on.&lt;br /&gt;&lt;br /&gt;The data set used in this example is the AOL query data set, which consists of a little under 40 million records. A single record contains a user identifier, the user query, a timestamp and click data. We are only interested in queries and timestamps here, so we skip the rest of the data. 
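&lt;br /&gt;&lt;br /&gt;As a rough sketch of that step, the snippet below pulls the query and the day out of one line of the data set. It assumes the tab-separated layout AnonID, Query, QueryTime, ItemRank, ClickURL with timestamps such as 2006-03-01 07:17:12. The class AolRecord and the method queryAndDate are hypothetical names used only for illustration; they are not taken from the downloadable sources.

```java
final class AolRecord {

    // Returns { query, day } for one line of the data set.
    // Assumption: fields are tab separated and the third field is "yyyy-MM-dd HH:mm:ss".
    // Header lines and malformed lines are not handled in this sketch.
    static String[] queryAndDate(String line) {
        String[] fields = line.split("\t");
        String[] date = fields[2].split(" ");
        return new String[] { fields[1].trim(), date[0].trim() };
    }

    public static void main(String[] args) {
        // Made-up sample record with empty click columns.
        String[] parsed = queryAndDate("142\tlucene tutorial\t2006-03-01 07:17:12\t\t");
        System.out.println(parsed[0] + " @ " + parsed[1]);
    }
}
```

The user identifier and the click columns are simply ignored, which matches the decision above to keep only queries and timestamps.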
&lt;br /&gt;&lt;br /&gt;The Lucene document structure used is simple:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;  Document doc = new Document();&lt;br /&gt;             doc.add(new Field("query", fields[1].trim(), Field.Store.NO,&lt;br /&gt;                    Field.Index.TOKENIZED, TermVector.NO));&lt;br /&gt;             doc.add(new Field("date", date[0].trim(), Field.Store.NO,&lt;br /&gt;                    Field.Index.NO_NORMS, TermVector.NO));&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;On the query side we rely on BitSets, unions of BitSets and the cardinality method to do the hit counting.&lt;br /&gt;&lt;br /&gt;For transferring the data between the browser and the back end we use &lt;a href="http://www.json.org/"&gt;JSON&lt;/a&gt; over HTTP. The message format is as follows:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;{&lt;br /&gt;"counts":[ &lt;br /&gt; {"date": "2006-03-01", "c0":43, "c1":0, "c2":230}&lt;br /&gt;,{"date": "2006-03-02", "c0":11, "c1":0, "c2":245}&lt;br /&gt;,{"date": "2006-03-03", "c0":15, "c1":0, "c2":252}&lt;br /&gt;,{"date": "2006-03-04", "c0":50, "c1":2, "c2":288}&lt;br /&gt;,{"date": "2006-03-05", "c0":14, "c1":0, "c2":294}&lt;br /&gt;,{"date": "2006-03-06", "c0":51, "c1":0, "c2":345}&lt;br /&gt;,{"date": "2006-03-07", "c0":19, "c1":0, "c2":225}&lt;br /&gt;,{"date": "2006-03-08", "c0":16, "c1":0, "c2":219}&lt;br /&gt;,{"date": "2006-03-09", "c0":44, "c1":0, "c2":197}&lt;br /&gt;,{"date": "2006-03-10", "c0":32, "c1":0, "c2":269}&lt;br /&gt;,{"date": "2006-03-11", "c0":38, "c1":0, "c2":311}&lt;br /&gt;,{"date": "2006-03-12", "c0":10, "c1":0, "c2":230}&lt;br /&gt;,{"date": "2006-03-13", "c0":25, "c1":0, "c2":162}&lt;br /&gt;,{"date": "2006-03-14", "c0":59, "c1":0, "c2":261}&lt;br /&gt;]&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;I found &lt;a href="http://www.jsonlint.com/"&gt;jsonlint&lt;/a&gt; to be a useful tool for a JSON novice like me to verify that the JSON messages I generated are indeed valid.&lt;br /&gt;&lt;br /&gt;In 
the end user interface we use YUI &lt;a href="http://developer.yahoo.com/yui/charts/"&gt;charts&lt;/a&gt; and YUI &lt;a href="http://developer.yahoo.com/yui/datasource/"&gt;datasource&lt;/a&gt; components.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Steps to get the app running&lt;/h2&gt;&lt;br /&gt;&lt;br /&gt;Note: Because the indexing and the webapp are run inside maven in this example, you need to make sure there's enough heap available with a command like "export MAVEN_OPTS=-Xmx1024m".&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;br /&gt;&lt;li&gt;Download &lt;a href="http://blog.foofactory.fi/images/lucene-query-report.tar.gz"&gt;sources&lt;/a&gt; (size:6kb)&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;Get the data set...&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;Build the index into the directory index:&lt;br /&gt;&lt;br /&gt;mvn exec:java -Dexec.mainClass="fi.foofactory.lucene.report.Indexer" -Dexec.args="&amp;lt;location of AOL collection&gt; index"&lt;br /&gt;&lt;br /&gt;The resulting index is about 706 megabytes in size. Building the index took around 20 minutes on the machine I used.&lt;br /&gt;&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;Run the webapp&lt;br /&gt;&lt;br /&gt;mvn jetty:run-war -Dlucene.index.dir=index&lt;br /&gt;&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;Access the interface with your browser at http://localhost:8080/lucene-report/&lt;br /&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Possible enhancements&lt;/h2&gt;&lt;br /&gt;&lt;h3&gt;Parallelism&lt;/h3&gt;&lt;br /&gt;The demo code is entirely single threaded, so it uses only one CPU core per request. The data set used in this demo is so small that the workload is totally CPU bound. 
You could parallelize at least some of the work without exploding the memory requirements (getting BitSets for submitted queries, calculating the intersections).&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Approximation&lt;/h3&gt;&lt;br /&gt;If you are interested in general trends and relative volumes, it does not matter if you miss an observation or two. You could use a technique called &lt;a href="http://en.wikipedia.org/wiki/Sampling_(statistics)"&gt;sampling&lt;/a&gt; to reduce the number of observations and still get good results on relatively frequent terms. On rare terms you can always fall back to a statement like&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Your terms - &amp;lt;insert search terms here&gt; - do not have enough search volume to show graphs&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Partitioning&lt;/h3&gt;&lt;br /&gt;The data set could be divided into smaller Lucene indexes and each of the indexes could be deployed on a separate machine. You would need to build an aggregator which would request the results from those smaller indexes and simply add the counts together before returning the final results. A nice way to build index partitions is a recently added &lt;a href="https://svn.apache.org/repos/asf/hadoop/core/trunk/src/contrib/index/"&gt;Apache Hadoop contrib module&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Imagine&lt;/h2&gt;&lt;br /&gt;In an imaginary world, if you used multiple threads to calculate the intersections, used sampling to keep one tenth of the original observations and split the data across 10 machines (4 CPU cores each, with unlimited RAM), the response time of the report queries for that same data set (approximation) could be something like 1/400th of the original. Or the other way around: if you're satisfied with the response time already, you could use the same technique on a data set of about 1.6*10^10 documents (400 times the roughly 40 million records used here). 
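&lt;br /&gt;&lt;br /&gt;The arithmetic behind that estimate can be checked with a few lines of Java. This is only a sketch: it assumes perfectly linear scaling across cores, sampling and machines (which real systems rarely deliver) and rounds the data set to 40 million records. The class ImagineEstimate and its method names are made up for illustration.

```java
final class ImagineEstimate {

    // Combined speedup under the idealized assumption that every factor scales linearly.
    static long speedup(int coresPerMachine, int samplingFactor, int machines) {
        return (long) coresPerMachine * samplingFactor * machines;
    }

    // Number of documents that could be served at the original response time.
    static long documentsAtSameLatency(long originalDocuments, long speedup) {
        return originalDocuments * speedup;
    }

    public static void main(String[] args) {
        // 4 cores per machine, keep 1 observation in 10, 10 machines.
        long factor = speedup(4, 10, 10);
        // Assumption: the data set is rounded up to 40 million records.
        long docs = documentsAtSameLatency(40_000_000L, factor);
        System.out.println("speedup factor: " + factor);
        System.out.println("documents at the same response time: " + docs);
    }
}
```

Changing any of the three arguments shows how sensitive the estimate is to each assumption; halving the number of machines, for example, halves both figures.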
&lt;br /&gt;&lt;br /&gt;For a limited time there's a demo available online &lt;a href="http://www.hakulaite.net:8080/lucene-report/"&gt;here&lt;/a&gt;. &lt;b&gt;(the demo will be removed without further notice)&lt;/b&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-4877057432974483983?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/4877057432974483983/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=4877057432974483983' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/4877057432974483983'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/4877057432974483983'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2008/08/interactive-query-reporting-with-lucene.html' title='Interactive query reporting with lucene'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-5928145229998849187</id><published>2008-07-03T16:39:00.004+03:00</published><updated>2008-07-03T17:36:51.993+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed computing'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Hadoop takes the lead position</title><content type='html'>Owen O'Malley, the project lead of Apache Hadoop, 
member of Yahoo grid team, announced today that they have taken the number uno position on &lt;a href="http://www.hpl.hp.com/hosted/sortbenchmark/"&gt;Terabyte Sort Benchmark&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;The new record of 209 seconds is nearly 30% faster than the previous one. More details &lt;a href="http://www.hpl.hp.com/hosted/sortbenchmark/YahooHadoop.pdf"&gt;here&lt;/a&gt; and &lt;a href="http://developer.yahoo.com/blogs/hadoop/2008/07/apache_hadoop_wins_terabyte_sort_benchmark.html"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Congratulations! Nice work, once again.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-5928145229998849187?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/5928145229998849187/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=5928145229998849187' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/5928145229998849187'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/5928145229998849187'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2008/07/hadoop-takes-lead-position.html' title='Hadoop takes the lead position'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total 
xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-4842618790372479709</id><published>2008-03-08T07:26:00.005+02:00</published><updated>2008-03-08T07:53:41.983+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='open source'/><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='mapreduce'/><category scheme='http://www.blogger.com/atom/ns#' term='nutch'/><title type='text'>Nutch training at ApacheCon EU 2008</title><content type='html'>As some of you might have noticed, I am prepared to give a half day &lt;a href="http://eu.apachecon.com/eu2008/program/talk/2406"&gt;training&lt;/a&gt; about &lt;a href="http://lucene.apache.org/nutch"&gt;Apache Nutch&lt;/a&gt; at ApacheCon EU 2008. &lt;br /&gt;&lt;br /&gt;However, there are still too many seats available and I need your help to get things going. So if you are interested in Nutch internals and have Tuesday, Apr 08 open in your calendar, please go ahead and book a seat at the &lt;a href="http://eu.apachecon.com/eu2008/"&gt;ApacheCon web site&lt;/a&gt;!&lt;br /&gt;&lt;br /&gt;Don't forget that there are also a huge number of other interesting sessions and trainings during the week; see the &lt;a href="http://eu.apachecon.com/eu2008/program/day/?date=2008-04-08"&gt;schedule&lt;/a&gt; for more info.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-4842618790372479709?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/4842618790372479709/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=4842618790372479709' title='1 Comments'/><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/4842618790372479709'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/4842618790372479709'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2008/03/nutch-training-at-apachecon-eu-2008.html' title='Nutch training at ApacheCon EU 2008'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-2223273317778905343</id><published>2008-01-18T23:53:00.001+02:00</published><updated>2008-01-18T23:57:56.362+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='lucene'/><title type='text'>Regenerating equally sized shards from set of Lucene indexes</title><content type='html'>If you need to split your large index or set of indexes into smaller, equally sized "shards", this prototype tool might be for you. The Lucene distribution includes a tool for combining several indexes into one, but to my knowledge there is no tool to do the opposite.&lt;br /&gt;&lt;br /&gt;The usual use case for splitting your index is index distribution: for example, you plan to distribute pieces of your index to several machines to increase query throughput. Of course this operation could be done by reindexing the data, but resizing the index shards _seems_ to be faster than that (I need to do some benchmarking to confirm that).&lt;br /&gt;&lt;br /&gt;This tool should be able to handle several different scenarios for you:&lt;br /&gt;&lt;br /&gt;1. splitting one large index into many smaller ones&lt;br /&gt;&lt;br /&gt;2. 
combining and resplitting several indexes into a new set of indexes&lt;br /&gt;&lt;br /&gt;3. combining several indexes into one&lt;br /&gt;&lt;br /&gt;This tool does not try to interpret the physical index format but lets Lucene do the heavy lifting by simply using &lt;a href="http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/IndexWriter.html#addIndexes(org.apache.lucene.index.IndexReader[])"&gt;IndexWriter.addIndexes()&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;DISCLAIMER: I only had time to do some very limited testing of this tool with smallish indexes, but I plan to do some more testing with bigger indexes soon to get an idea of how it will work in real life.&lt;br /&gt;&lt;br /&gt;Download &lt;a href="http://blog.foofactory.fi/resources/index-slice.tar.gz"&gt;sources&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-2223273317778905343?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/2223273317778905343/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=2223273317778905343' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/2223273317778905343'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/2223273317778905343'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2008/01/regenerating-equally-sized-shards-from.html' title='Regenerating equally sized shards from set of Lucene indexes'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' 
value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-4431451820357378516</id><published>2007-11-06T20:10:00.000+02:00</published><updated>2007-11-06T21:01:54.990+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='javascript'/><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='mapreduce'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Javascript prototyping for Hadoop map/reduce jobs</title><content type='html'>Java 6 brought us JVM-level scripting support as specified in &lt;a href="http://jcp.org/en/jsr/detail?id=223"&gt;JSR-223&lt;/a&gt;. (&lt;a href="http://java.sun.com/javase/6/docs/technotes/guides/scripting/"&gt;Sun&lt;/a&gt; documentation on the subject)&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Wordcount Demo&lt;/h3&gt;&lt;br /&gt;&lt;br /&gt;I have chosen the wordcount example (as seen in Hadoop) to demonstrate a complete javascript map/reduce example (a mapper, a reducer and a "driver"):&lt;br /&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;importPackage(java.util);&lt;br /&gt;importPackage(org.apache.hadoop.io);&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;// mapper: emit (word, 1) for every token of the input value&lt;br /&gt;function map(key, value, collector){&lt;br /&gt;        var tokenizer = new StringTokenizer(value);&lt;br /&gt;        while(tokenizer.hasMoreTokens()) {&lt;br /&gt;                collector.collect(tokenizer.nextToken(), 1);&lt;br /&gt;        }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;// reducer (also used as combiner): sum the counts collected for a word&lt;br /&gt;function reduce(key, values, collector){&lt;br /&gt;        var counter=0;&lt;br /&gt;        while(values.hasNext()) {&lt;br /&gt;                counter+=values.next().get();&lt;br /&gt;        }&lt;br /&gt;        collector.collect(key, counter);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;function driver(args, jobConf, jobClient) {&lt;br /&gt;        jobConf.outputKeyClass=Text;&lt;br 
/&gt;        jobConf.outputValueClass=IntWritable;&lt;br /&gt;        jobConf.setInput(args[0]);&lt;br /&gt;        jobConf.setOutput(args[1]);&lt;br /&gt;        jobConf.map='map';&lt;br /&gt;        jobConf.combiner='reduce';&lt;br /&gt;        jobConf.reduce='reduce';&lt;br /&gt;        jobClient.runJob(jobConf);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;System requirements&lt;/h3&gt;&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://java.sun.com/"&gt;Java&lt;/a&gt; 6 from Sun&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://lucene.apache.org/hadoop"&gt;Hadoop&lt;/a&gt; from Apache&lt;/li&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://blog.foofactory.fi/resources/scriptmr.jar"&gt;scriptmr.jar&lt;/a&gt; from foofactory [&lt;a href="http://blog.foofactory.fi/resources/scriptmr-src.tar.gz"&gt;source&lt;/a&gt;]&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Running javascript map reduce&lt;/h3&gt;&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Install Java as documented by Sun&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Extract the Hadoop release tarball into your favourite location&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Run your javascript map/reduce job with the following command: &lt;code&gt;bin/hadoop jar &amp;lt;/path/to/scriptmr.jar&gt; &amp;lt;script-file-name&gt; [&amp;lt;script-arg-1&gt;...]&lt;/code&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Performance?&lt;/h3&gt;&lt;br /&gt;In short: there isn't any. The wordcount demo as presented here takes roughly 6-7 times longer than the wordcount Java example from Hadoop. It's quite obvious this technique is not really practical beyond prototyping. For prototyping... hmmm I am not totally sure it is usable for that either. 
Anyway, have fun with it; I know I had some while making it ;)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-4431451820357378516?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/4431451820357378516/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=4431451820357378516' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/4431451820357378516'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/4431451820357378516'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2007/11/javascript-prototyping-for-hadoop.html' title='Javascript prototyping for Hadoop map/reduce jobs'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-4512512156979000149</id><published>2007-10-30T18:47:00.000+02:00</published><updated>2007-10-30T19:07:22.564+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='spring'/><category scheme='http://www.blogger.com/atom/ns#' term='pluto'/><category scheme='http://www.blogger.com/atom/ns#' term='jsr168'/><title type='text'>Spring portlet sample with maven2 on Apache Pluto</title><content type='html'>As part of learning about portals and JSR-168 portlets, I decided to do some hands-on testing based on the publicly available 
&lt;a href="http://opensource.atlassian.com/confluence/spring/display/JSR168/Home"&gt;Spring portlets&lt;/a&gt; and &lt;a href="http://portals.apache.org/pluto/"&gt;Apache Pluto&lt;/a&gt;. While I was at it I converted the sample to use a Maven build and decided to publish the results here in case someone else is thinking about doing the same.&lt;br /&gt;&lt;br /&gt;So have fun with &lt;a href="http://blog.foofactory.fi/resources/spring-portlet-sample-maven.tar.gz"&gt;spring-portlet-sample-maven.tar.gz&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Compile the portlet application by running &lt;code&gt;mvn clean install&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Installation is simple: once you have your pluto-tomcat-bundle up and running (&lt;a href="http://www.apache.org/dyn/closer.cgi/portals/pluto/"&gt;available here&lt;/a&gt;), just copy the resulting war file into Tomcat's webapps directory and use the Pluto Page Administrator to add the portlets to your page.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-4512512156979000149?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/4512512156979000149/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=4512512156979000149' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/4512512156979000149'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/4512512156979000149'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2007/10/spring-portlet-sample-with-maven2-on.html' title='Spring portlet sample with maven2 on Apache Pluto'/><author><name>Sami 
Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-1290583775512522094</id><published>2007-06-14T00:26:00.000+03:00</published><updated>2007-06-23T11:24:00.062+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Adding support for tight Enum and EnumSet into a hadoop application</title><content type='html'>After working with many projects that use a combination of int and String constants in various ways to mimic enumerations I learned to like Java enums. I am also a fan of Hadoop, so I decided to see how the two fit together.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/java/org/apache/hadoop/io/WritableUtils.java?view=markup"&gt;WritableUtils&lt;/a&gt; class already has support for writing and reading Java enum values. In WritableUtils an enum is serialized by converting the value of enum.name() to Text. There is also another way of storing enum state which requires less space: the ordinal() method gives you the index of the constant within the enum. This technique allows you to store a value of an enum containing up to 255 different constants in a single byte. 
EnumWritable perhaps does not warrant a class of its own; utility methods capable of reading an enum from a DataInput and writing it to a DataOutput would be sufficient.&lt;br /&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;import java.io.DataInput;&lt;br /&gt;import java.io.DataOutput;&lt;br /&gt;import java.io.IOException;&lt;br /&gt;&lt;br /&gt;import org.apache.hadoop.io.Writable;&lt;br /&gt;&lt;br /&gt;public class EnumWritable&amp;lt;E extends Enum&amp;lt;E&gt;&gt; implements Writable {&lt;br /&gt;&lt;br /&gt;  // ordinal of the stored enum constant, treated as an unsigned byte&lt;br /&gt;  private byte storage;&lt;br /&gt;&lt;br /&gt;  public EnumWritable() {&lt;br /&gt;    storage = 0;&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  public EnumWritable(Enum&amp;lt;E&gt; value) {&lt;br /&gt;    set(value);&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  public E get(Class&amp;lt;E&gt; enumType) throws IOException {&lt;br /&gt;    // mask to 0..255 so that ordinals above 127 still match&lt;br /&gt;    int ordinal = storage &amp;amp; 0xff;&lt;br /&gt;    for (E type : enumType.getEnumConstants()) {&lt;br /&gt;      if (ordinal == type.ordinal()) {&lt;br /&gt;        return type;&lt;br /&gt;      }&lt;br /&gt;    }&lt;br /&gt;    return null;&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  public void set(Enum&amp;lt;E&gt; e) {&lt;br /&gt;    storage = (byte) e.ordinal();&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  public void readFields(DataInput in) throws IOException {&lt;br /&gt;    storage = in.readByte();&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  public void write(DataOutput out) throws IOException {&lt;br /&gt;    out.writeByte(storage);&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  @Override&lt;br /&gt;  public String toString() {&lt;br /&gt;    return Integer.toString(storage &amp;amp; 0xff);&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  @Override&lt;br /&gt;  public boolean equals(Object obj) {&lt;br /&gt;    if (!(obj instanceof EnumWritable)) {&lt;br /&gt;      return super.equals(obj);&lt;br /&gt;    }&lt;br /&gt;    EnumWritable that = (EnumWritable) obj;&lt;br /&gt;    return this.storage == that.storage;&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  @Override&lt;br /&gt;  public int hashCode() {&lt;br /&gt;    return storage;&lt;br /&gt;  
}&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Another useful structure for storing the equivalent of multiple booleans, or a piece of data that consists of bits (which can be modeled as constants in an enum), is EnumSet. Storage for an EnumSet is also compact if you take advantage of the ordinal() method: you can store an EnumSet using 1 bit per constant. &lt;br /&gt;&lt;br /&gt;I wrote a prototype &lt;a href="http://blog.foofactory.fi/src/EnumSetWritable.java"&gt;EnumSetWritable&lt;/a&gt; that takes from 1 to 4 bytes of storage depending on how many constants the enum contains and which of them are in the set. The 2 least significant bits store the size of the storage (1-4 bytes) and the rest of the bits store the members of the EnumSet. Since the storage varies from 1 to 4 bytes, it can hold EnumSets of enums with up to 30 constants. Ordering the constants so that those most often present in the set come first will most probably lead to smaller space consumption.&lt;br /&gt;&lt;br /&gt;A real-world application for EnumSet could be, for example, something like:&lt;br /&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;&lt;br /&gt;  enum ContainsData {&lt;br /&gt;    Raw, Parsed, Processed;&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  EnumSetWritable&amp;lt;ContainsData&gt; contains = new EnumSetWritable&amp;lt;ContainsData&gt;();&lt;br /&gt;&lt;br /&gt;  public void readFields(DataInput in) throws IOException {&lt;br /&gt;    contains.readFields(in);&lt;br /&gt;    for (ContainsData type : contains.getEnumSet(ContainsData.class)) {&lt;br /&gt;      switch (type) {&lt;br /&gt;        case Raw: /* read raw data */ break;&lt;br /&gt;        case Parsed: /* read parsed data */ break;&lt;br /&gt;        case Processed: /* read processed data */ break;&lt;br /&gt;      }&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Here you would use EnumSet to store the presence information of various data structures in a 
writable and so would avoid for example storing booleans (taking one byte of space each) for the same piece of information.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-1290583775512522094?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/1290583775512522094/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=1290583775512522094' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/1290583775512522094'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/1290583775512522094'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2007/06/adding-support-for-tight-enum-and.html' title='Adding support for tight Enum and EnumSet into a hadoop application'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-3464310273600416312</id><published>2007-03-22T21:28:00.000+02:00</published><updated>2007-03-22T21:52:27.605+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='nutch'/><title type='text'>Twice the speed, half the size</title><content type='html'>Gathering the &lt;a href="http://blog.foofactory.fi/2007/03/perfomance-history-for-nutch.html"&gt;performance history&lt;/a&gt; of Nutch is now complete. 
I am glad to announce that the soon-to-be-released Nutch 0.9.0 will be twice as fast as 0.8.x (with the configuration used in the benchmark). At the same time the crawled data will use only about half the disk space it did before - thanks to Hadoop.&lt;br /&gt;&lt;br /&gt;The following graph shows how the size of identical crawls has changed over time.&lt;br /&gt;&lt;br /&gt;&lt;img src="http://blog.foofactory.fi/images/perf_size.png"&gt;&lt;br /&gt;&lt;br /&gt;Time spent crawling is plotted below. &lt;br /&gt;&lt;br /&gt;&lt;img src="http://blog.foofactory.fi/images/perf_time.png"&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-3464310273600416312?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/3464310273600416312/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=3464310273600416312' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/3464310273600416312'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/3464310273600416312'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2007/03/twice-speed-half-size.html' title='Twice the speed, half the size'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total 
xmlns:thr='http://purl.org/syndication/thread/1.0'>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-4081407498618028069</id><published>2007-03-14T21:51:00.000+02:00</published><updated>2007-03-18T21:15:45.635+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='nutch'/><title type='text'>Perfomance history for Nutch</title><content type='html'>Today I started a benchmark marathon to build a relative performance history of &lt;a href="http://lucene.apache.org/nutch/"&gt;Nutch&lt;/a&gt; for the last 200 or so revisions. The measuring process is very simple. First the revision is checked out, compiled and configured. Then a full crawl cycle is executed (inject, generate, fetch, updatedb) and each of the phases is timed.&lt;br /&gt;&lt;br /&gt;The crawl is run against a local HTTP server to eliminate external factors from the results. The content for the crawls consists of 11062 HTML pages (the Javadoc for Java 6), served by a local Apache httpd. The size of each crawl is also recorded.&lt;br /&gt;&lt;br /&gt;Why such effort? Crawling performance is a critical aspect of any search engine (ok there are the features too) and that aspect is currently not measured regularly in Nutch. By analysing the (upcoming) results we can hopefully learn how the different commits have affected the overall crawling performance. It might even make sense to continue measuring relative performance in the future after every commit just to make sure nothing seriously wrong gets checked in (we'll judge that after the experiment is over;).&lt;br /&gt;&lt;br /&gt;The results will be published in real time as they are gathered in &lt;a href="http://blog.foofactory.fi/images/results.txt"&gt;textual&lt;/a&gt; format as well as in the graph below. 
The format of the text file is as follows:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;revision, total (s), inject (s), generate (s), fetch (s), updatedb (s), size of crawl dir (kb)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;If the speed stays what it was for the first few rounds, the results should be complete in 3-4 days.&lt;br /&gt;&lt;br /&gt;&lt;img src="http://blog.foofactory.fi/images/perfhistory.png" &gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Disclaimer:&lt;/b&gt; The only purpose of this experiment is to look at how relative performance correlates with changes committed to trunk, using a very limited test. Some bench rounds also seem to fail for various reasons, which is why there is some turbulence in the data points. The trend or end result will be a surprise for me too as I have not run similar benchmarks before with current versions.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Update (2007-03-18)&lt;/b&gt; I will be running the failing points again after the first run completes. I also need to rerun some of the recent revisions because there was a configuration error which prevented the space savings from surfacing. Hadoop native libs are currently not working on RH5 because of a bug in the bin/nutch script. 
So expect to see more improvement once that is fixed.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-4081407498618028069?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/4081407498618028069/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=4081407498618028069' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/4081407498618028069'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/4081407498618028069'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2007/03/perfomance-history-for-nutch.html' title='Perfomance history for Nutch'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-793310135590305828</id><published>2007-02-04T12:30:00.001+02:00</published><updated>2009-03-09T12:11:25.020+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='solr'/><category scheme='http://www.blogger.com/atom/ns#' term='nutch'/><title type='text'>Online indexing - integrating Nutch with Solr</title><content type='html'>&lt;span style="font-weight:bold;"&gt;Update 2009-03-09:&lt;/span&gt; There is now a more up-to-date example of Solr integration available at the &lt;a href="http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/"&gt;Lucid Imagination 
Blog&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;There might be times when you would like to integrate &lt;a href="http://lucene.apache.org/nutch"&gt;Apache Nutch&lt;/a&gt; crawling with a single &lt;a href="http://lucene.apache.org/solr"&gt;Apache Solr&lt;/a&gt; index server - for example when your collection size is limited to the number of documents that can be served by a single Solr instance, or when you would like to do your updates on a "live" index. Using Solr as your indexing server might even ease your maintenance burden quite a bit - you would get rid of manual index life cycle management in Nutch and let Solr handle your index.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Overview&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;In this short post we will set up (and customize) Nutch to use Solr as its indexing engine. If you are using Solr directly to provide a search interface then that's all you need to do to get a fully working setup. The Nutch commands will be used as usual to manage the fetching part of the process (a script is provided that will ease that part). The integration between Nutch and Solr is not yet available "out of the box" but it does not require much glue code. &lt;br /&gt;&lt;br /&gt;A &lt;a href="http://www.foofactory.fi/files/nutch-solr/nutch_solr.patch"&gt;patch&lt;/a&gt; against Nutch trunk is provided for those who wish to be brave. In addition to that you will need the solr-client.jar and xpp3-1.1.3.4.O.jar in the nutch/lib directory (they are both part of the &lt;a href="http://issues.apache.org/jira/secure/attachment/12348445/solr-client.zip"&gt;solr-client.zip&lt;/a&gt; package from SOLR-20).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Setting up Solr&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;A &lt;a href="http://people.apache.org/builds/lucene/solr/nightly/"&gt;nightly build&lt;/a&gt; of Apache Solr can be downloaded from the Apache site. 
It is really easy to set up and basically the only thing requiring special attention is the custom schema to be used (see the Solr wiki for more details about the available schema configuration options). Unpack the archive and go to the example directory of the extracted package.&lt;br /&gt;&lt;br /&gt;I edited the example schema (solr/conf/schema.xml) and added the fields required by Nutch in its stock configuration:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;&lt;br /&gt; &amp;lt;fields&gt;&lt;br /&gt;  &amp;lt;field name="url" type="string" indexed="true" stored="true"/&gt;&lt;br /&gt;  &amp;lt;field name="content" type="text" indexed="true" stored="true"/&gt;&lt;br /&gt;  &amp;lt;field name="segment" type="string" indexed="false" stored="true"/&gt;&lt;br /&gt;  &amp;lt;field name="digest" type="string" indexed="false" stored="true"/&gt;&lt;br /&gt;  &amp;lt;field name="host" type="string" indexed="true" stored="false"/&gt;&lt;br /&gt;  &amp;lt;field name="site" type="string" indexed="true" stored="false"/&gt;&lt;br /&gt;  &amp;lt;field name="anchor" type="string" indexed="true" stored="false" multiValued="true"/&gt;&lt;br /&gt;  &amp;lt;field name="title" type="text" indexed="true" stored="true"/&gt;&lt;br /&gt;  &amp;lt;field name="tstamp" type="slong" indexed="false" stored="true"/&gt;&lt;br /&gt;  &amp;lt;field name="text" type="text" indexed="true" stored="false" multiValued="true"/&gt;&lt;br /&gt; &amp;lt;/fields&gt;&lt;br /&gt;&lt;br /&gt; &amp;lt;uniqueKey&gt;url&amp;lt;/uniqueKey&gt;&lt;br /&gt;&lt;br /&gt; &amp;lt;defaultSearchField&gt;text&amp;lt;/defaultSearchField&gt;&lt;br /&gt;&lt;br /&gt; &amp;lt;solrQueryParser defaultOperator="AND"/&gt;&lt;br /&gt;&lt;br /&gt; &amp;lt;copyField source="anchor" dest="text"/&gt;&lt;br /&gt; &amp;lt;copyField source="title" dest="text"/&gt;&lt;br /&gt; &amp;lt;copyField source="content" dest="text"/&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;After setting up 
the schema just start the Solr server with the command: &lt;b&gt;java -jar start.jar&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Note: if you use indexing filters in Nutch that produce more fields, you need to add those fields to the Solr schema before you start indexing.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Implementing glue&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;img align="center" src="http://www.foofactory.fi/files/nutch-solr/nutch_solr_integration.png"&gt;&lt;br /&gt;&lt;br /&gt;The integration with the Solr server is done with the client posted on &lt;a href="http://issues.apache.org/jira/browse/SOLR-20"&gt;SOLR-20&lt;/a&gt;. We will also implement a new indexer called SolrIndexer which will extend the existing &lt;a href="http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/Indexer.java?view=markup"&gt;Indexer&lt;/a&gt; in Nutch. Basically we would only need to modify the &lt;a href="http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/OutputFormat.java?view=markup"&gt;OutputFormat&lt;/a&gt; of the Indexer class, but some additional (duplicate) code is also needed in order to launch the job with our custom code.&lt;br /&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;  public static class OutputFormat extends org.apache.hadoop.mapred.OutputFormatBase&lt;br /&gt;      implements Configurable {&lt;br /&gt;    &lt;br /&gt;    private Configuration conf;&lt;br /&gt;    SolrClientAdapter adapter;&lt;br /&gt;    &lt;br /&gt;    public RecordWriter getRecordWriter(final FileSystem fs, JobConf job,&lt;br /&gt;        String name, Progressable progress) throws IOException {&lt;br /&gt;&lt;br /&gt;      return new RecordWriter() {&lt;br /&gt;        // volatile: written by close(), read by the heartbeat thread&lt;br /&gt;        volatile boolean closed;&lt;br /&gt;&lt;br /&gt;        public void write(WritableComparable key, Writable value)&lt;br /&gt;            throws IOException { // unwrap &amp; index doc&lt;br /&gt;          Document doc = (Document) ((ObjectWritable) value).get();&lt;br /&gt;          
LOG.info("Indexing [" + doc.getField("url").stringValue() + "]");&lt;br /&gt;          adapter.index(doc);&lt;br /&gt;        }&lt;br /&gt;&lt;br /&gt;        public void close(final Reporter reporter) throws IOException {&lt;br /&gt;          // spawn a thread to give progress heartbeats&lt;br /&gt;          Thread prog = new Thread() {&lt;br /&gt;            public void run() {&lt;br /&gt;              while (!closed) {&lt;br /&gt;                try {&lt;br /&gt;                  reporter.setStatus("closing");&lt;br /&gt;                  Thread.sleep(1000);&lt;br /&gt;                } catch (InterruptedException e) {&lt;br /&gt;                  continue;&lt;br /&gt;                } catch (Throwable e) {&lt;br /&gt;                  return;&lt;br /&gt;                }&lt;br /&gt;              }&lt;br /&gt;            }&lt;br /&gt;          };&lt;br /&gt;&lt;br /&gt;          try {&lt;br /&gt;            prog.start();&lt;br /&gt;            LOG.info("Executing commit");&lt;br /&gt;            adapter.commit();&lt;br /&gt;          } finally {&lt;br /&gt;            closed = true;&lt;br /&gt;          }&lt;br /&gt;        }&lt;br /&gt;      };&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    public Configuration getConf() {&lt;br /&gt;      return conf;&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    public void setConf(Configuration conf) {&lt;br /&gt;      this.conf = conf;&lt;br /&gt;      adapter = new SolrClientAdapter(conf);&lt;br /&gt;    }&lt;br /&gt;    &lt;br /&gt;  }&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;In the future it might be a good idea to make the indexing API in Nutch more generic so we could support a variety of different index back ends with the same Indexer code.&lt;br /&gt;&lt;br /&gt;The second class we will create is an adapter class for the Solr Java client. This is not strictly required either, but it is a smart thing to do to get better immunity against changes in the client. 
The adapter class basically just extracts the required information from the Lucene Document generated by the Indexer and uses the Solr Java client to submit it to the Solr server.&lt;br /&gt;&lt;br /&gt;&lt;code class="prettyprint"&gt;&lt;br /&gt;&lt;br /&gt;  /** Adds single Lucene document to index. */&lt;br /&gt;  public void index(Document doc) {&lt;br /&gt;&lt;br /&gt;    SimpleSolrDoc solrDoc = new SimpleSolrDoc();&lt;br /&gt;    for (Enumeration&amp;lt;Field&gt; e = doc.fields(); e.hasMoreElements();) {&lt;br /&gt;      Field field = e.nextElement();&lt;br /&gt;      if (!ignoreFields.contains(field.name())) {&lt;br /&gt;        solrDoc.fields.put(field.name(), field.stringValue());&lt;br /&gt;      }&lt;br /&gt;    }&lt;br /&gt;    try {&lt;br /&gt;      client.add(solrDoc);&lt;br /&gt;    } catch (Exception e) {&lt;br /&gt;      LOG.warn("Could not index document, reason:" + e.getMessage(), e);&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;  &lt;br /&gt;  /** Commits changes */&lt;br /&gt;  public void commit(){&lt;br /&gt;    try {&lt;br /&gt;      client.commit(true, false);&lt;br /&gt;    } catch (Exception e) {&lt;br /&gt;      LOG.warn("Could not commit, reason:" + e.getMessage(), e);&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Setting up Nutch&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Before starting the crawling process you need to first configure Nutch. If you are not familiar with the way Nutch operates, it is recommended that you first follow the tutorial on the &lt;a href="http://lucene.apache.org/nutch/tutorial8.html"&gt;Nutch web site&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Basically the steps required are (make sure you use the correct filenames - replace '_' with '-'):&lt;br /&gt;&lt;br /&gt;1. Set up &lt;a href="http://www.foofactory.fi/files/nutch-solr/regex_urlfilter.txt"&gt;conf/regex-urlfilter.txt&lt;/a&gt;&lt;br /&gt;2. 
Set up &lt;a href="http://www.foofactory.fi/files/nutch-solr/nutch_site.xml"&gt;conf/nutch-site.xml&lt;/a&gt;&lt;br /&gt;3. Generate a list of &lt;a href="http://www.foofactory.fi/files/nutch-solr/urls.txt"&gt;seed URLs&lt;/a&gt; in the folder urls&lt;br /&gt;4. Grab this &lt;a href="http://www.foofactory.fi/files/nutch-solr/crawl.sh"&gt;simple script&lt;/a&gt; that will help you along in your crawling task.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;After those initial steps you can start crawling by simply executing the &lt;b&gt;crawl.sh&lt;/b&gt; script:&lt;br /&gt;&lt;br /&gt;crawl.sh &amp;lt;basedir&gt;, where basedir is the folder where your crawl contents will be stored.&lt;br /&gt;&lt;br /&gt;The script will execute one iteration of fetching and indexing. After the first iteration you can start querying the newly generated index for the content you have crawled - for example with a URL like &lt;b&gt;http://127.0.0.1:8983/solr/select?q=apache&amp;amp;start=0&amp;amp;rows=10&amp;amp;fl=title%2Curl%2Cscore&amp;amp;qt=standard&amp;amp;wt=standard&amp;amp;hl=on&amp;amp;hl.fl=content&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;If you started with the provided seed list your index should contain exactly one document, the Apache front page. You can now fetch more rounds and see how your index will grow.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Deficiencies of the demonstrated integration&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;There are a number of things you need to consider and implement before the integration is at a usable level.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Document boost&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Document boosting was left out to keep this post small. If you are seriously planning to use a pattern like this then you must add document boosting (it is not hard at all to add). 
Without it you will lose a precious piece of information from the link graph.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Support for multivalued fields&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Anchor texts in Nutch are indexed into a multivalued field. The sample code in this post does not do that.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Deleting pages from index&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Deleted pages are not removed from the index. One could implement this as part of the reduce method by checking the status of the &lt;a href="http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup"&gt;CrawlDatum&lt;/a&gt; and posting a deletion request if the status is STATUS_FETCH_GONE.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Posting multiple documents at same time&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The naive implementation here posts documents to the index one by one over the network. A better way would be to add multiple documents at a time.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Further improvements - extending index size&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;If you are unwilling to wait for the &lt;a href="http://www.mail-archive.com/general@lucene.apache.org/msg00338.html"&gt;next killer component&lt;/a&gt; in the Lucene family, you could probably quite easily extend the pattern presented here to support larger indexes than a single Solr server instance can handle. &lt;br /&gt;&lt;br /&gt;A small addition to SolrClientAdapter would be sufficient: instead of posting all docs to a single Solr instance one would post documents to different indexes; the target server could be selected by hashing the document URL, for example. 
This is not recommended, however, unless you understand the consequences ;)&lt;br /&gt;&lt;br /&gt;&lt;b&gt;UPDATE 2007/07/15&lt;/b&gt;&lt;br /&gt;Ryan has kindly posted an updated &lt;a href="/images/SolrClientAdapter.java"&gt;SolrClientAdapter&lt;/a&gt; that works with the client version currently in Solr trunk. Thanks Ryan!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-793310135590305828?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/793310135590305828/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=793310135590305828' title='20 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/793310135590305828'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/793310135590305828'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html' title='Online indexing - integrating Nutch with Solr'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total xmlns:thr='http://purl.org/syndication/thread/1.0'>20</thr:total></entry><entry><id>tag:blogger.com,1999:blog-830893785995753823.post-4568931330768372260</id><published>2007-01-20T17:33:00.000+02:00</published><updated>2007-01-20T18:31:15.425+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='open source'/><category scheme='http://www.blogger.com/atom/ns#' term='java'/><title type='text'>Website up</title><content type='html'>I got an extra burst of energy yesterday and 
set up my &lt;a href="http://www.foofactory.fi/"&gt;company website&lt;/a&gt;. The nice simple layout (also deployed to this blog) is designed by &lt;a href="http://andreasviklund.com/"&gt;Andreas Viklund&lt;/a&gt;, nice work man!&lt;br /&gt;&lt;br /&gt;Another part of the credit goes to the creators of &lt;a href="http://www.cromoteca.com/meshcms/index.html"&gt;MeshCMS&lt;/a&gt;, which powers the site. It was really easy to set up, no fighting with databases and stuff, just deploy the war and start writing content - another excellent example of how usable open source software is getting these days.&lt;br /&gt; &lt;br /&gt;Regarding the company - everything is still at a very early stage and I am working on it part time only (or should I say when there's demand ;). I'll get back to this when there's something to report.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/830893785995753823-4568931330768372260?l=blog.foofactory.fi' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/4568931330768372260/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='https://www.blogger.com/comment.g?blogID=830893785995753823&amp;postID=4568931330768372260' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/4568931330768372260'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/830893785995753823/posts/default/4568931330768372260'/><link rel='alternate' type='text/html' href='http://blog.foofactory.fi/2007/01/website-up.html' title='Website up'/><author><name>Sami Siren</name><email>noreply@blogger.com</email><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='OpenSocialUserId' value='07360023453929377246'/></author><thr:total 
xmlns:thr='http://purl.org/syndication/thread/1.0'>0</thr:total></entry></feed>
