Science for SEO: Lemur toolkit

July 30, 2008

Lemur toolkit

The Lemur toolkit is a natural language processing and information retrieval toolkit. Trying it out is a nice way to see some IR technologies working first hand, rather than poking at a major search engine and guessing at what is going on underneath.

It handles text in several languages, performs stemming with the Porter and Krovetz stemmers, indexes many file formats, can index part-of-speech and named-entity tags, and of course has an API (C++, C# and Java).

For retrieval:

Supports major language modeling approaches such as Indri and KL-divergence, as well as vector space, tf.idf, Okapi and InQuery
Relevance- and pseudo-relevance feedback
Wildcard term expansion (using Indri)
Passage and XML element retrieval
Cross-lingual retrieval
Smoothing via Dirichlet priors and Markov chains
Supports arbitrary document priors (e.g., PageRank, URL depth)
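To make a couple of those items a bit more concrete, here is a rough Python sketch of query-likelihood ranking with Dirichlet-prior smoothing, plus a document prior added to the score. This is not Lemur's code or API: the function names, the tiny sample collection, the prior values and the small mu are all made up for illustration, and skipping query terms that never occur in the collection is simply a choice made for this sketch.

```python
import math
from collections import Counter

# Toy collection (hypothetical): document name -> list of tokens.
DOCS = {
    "a": "lemur toolkit for language modeling and retrieval".split(),
    "b": "search engine optimisation tips".split(),
    "c": "lemur lemur indri retrieval toolkit".split(),
}


def dirichlet_score(query, doc, coll_tf, coll_len, mu=2000.0, log_prior=0.0):
    """Log query likelihood of one document, Dirichlet-smoothed, plus a log prior."""
    tf = Counter(doc)
    score = log_prior
    for term in query:
        p_coll = coll_tf.get(term, 0) / coll_len  # background (collection) model
        if p_coll == 0.0:
            continue  # assumption of this sketch: ignore terms unseen everywhere
        # Dirichlet prior smoothing: blend document counts with the background
        # model; mu controls how much weight the background gets.
        p = (tf[term] + mu * p_coll) / (len(doc) + mu)
        score += math.log(p)
    return score


def rank(query, docs, priors=None, mu=2000.0):
    """Return document names, best first. `priors` maps name -> relative weight."""
    coll_tf = Counter(t for tokens in docs.values() for t in tokens)
    coll_len = sum(coll_tf.values())
    priors = priors or {}
    scored = [
        (dirichlet_score(query, tokens, coll_tf, coll_len, mu,
                         math.log(priors.get(name, 1.0))), name)
        for name, tokens in docs.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


if __name__ == "__main__":
    print(rank(["lemur", "retrieval"], DOCS, mu=10))
    # Same query, but with a prior favouring document "a" (think PageRank-like).
    print(rank(["lemur", "retrieval"], DOCS, priors={"a": 4.0}, mu=10))
```

The prior is just a query-independent number per document that gets added to the log score, which is why something like PageRank or URL depth can be plugged in without touching the rest of the scoring.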

Best of all, it's free! There's also a set of tutorials to get you started.

The Lemur project also has a newer search engine, Indri, which combines language modeling with inference networks.
