July 30, 2008

Lemur toolkit

The Lemur toolkit is a natural language processing and information retrieval toolkit. Trying it out is a nice way to see some IR technologies working first hand, rather than poking at a major search engine and guessing at what is going on behind the scenes.

It supports all major languages, performs stemming with the Porter and Krovetz stemmers, indexes many file formats, uses part-of-speech tagging and named entity recognition, and of course has an API (C++, C# and Java).

For retrieval:
  • Supports major language modeling approaches such as Indri and KL-divergence, as well as vector space, tf.idf, Okapi and InQuery
  • Relevance- and pseudo-relevance feedback
  • Wildcard term expansion (using Indri)
  • Passage and XML element retrieval
  • Cross-lingual retrieval
  • Smoothing via Dirichlet priors and Markov chains
  • Supports arbitrary document priors (e.g., PageRank, URL depth)

Best of all, it's free! The Lemur project also provides a set of tutorials to get you started.
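
To give a feel for one of the retrieval features listed above, here is a minimal sketch of query-likelihood scoring with Dirichlet-prior smoothing. It is plain Python and not Lemur's own code or API: the toy documents, the function names `dirichlet_score` and `rank`, and the value `mu=2000` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy collection (hypothetical documents, whitespace-tokenised).
DOCS = {
    "d1": "lemur toolkit for language modeling and information retrieval".split(),
    "d2": "indri search engine uses inference networks".split(),
    "d3": "cooking pasta with tomato sauce".split(),
}

# Collection-wide term counts, used as the background language model.
COLLECTION_TF = Counter(term for doc in DOCS.values() for term in doc)
COLLECTION_LEN = sum(COLLECTION_TF.values())


def dirichlet_score(query, doc, mu=2000.0):
    """Log-likelihood of the query under a Dirichlet-smoothed document model.

    p(w | d) = (tf(w, d) + mu * p(w | collection)) / (|d| + mu)
    """
    doc_tf = Counter(doc)
    score = 0.0
    for term in query:
        p_coll = COLLECTION_TF.get(term, 0) / COLLECTION_LEN
        if p_coll == 0.0:
            # Term never seen in the collection: skip it (a simplification).
            continue
        p = (doc_tf[term] + mu * p_coll) / (len(doc) + mu)
        score += math.log(p)
    return score


def rank(query_text, mu=2000.0):
    """Return document ids ordered from best to worst score."""
    query = query_text.lower().split()
    scores = {name: dirichlet_score(query, doc, mu) for name, doc in DOCS.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(rank("language modeling retrieval"))
    print(rank("inference networks"))
```

The smoothing parameter `mu` controls how much weight the collection model gets: a larger value pulls every document's term probabilities toward the collection average, which matters most for short documents.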

The Lemur project also has a newer search engine called Indri, which combines language modeling with inference networks.

Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at scienceforseo.blogspot.com.