December 11, 2008

LSA/LSI source code & tools

I'm often asked by students, researchers in other areas and sometimes SEO people where they can find LSI/LSA source code and tools. My favourite beginner's tutorial on LSI is by Genevieve Gorrell from Sheffield University. LSA is the term mostly used in computer science these days, but it doesn't matter what you call it.

There are a number of packages which let you use LSA/I and also offer many other useful things for semantic analysis, such as information extraction (IE) and information retrieval (IR).

(LSI/A is also applicable to source code and to images.)

To code your own, in short you'll need to:

- Have a stopword file
- Process each file
- Compute the weights
- Normalize
- Print your data
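
As a rough sketch of those steps, here is one way they might look in Python. Everything in it is an assumption for illustration: the tiny stopword set stands in for a real stopword file, the weights are plain tf-idf (other weighting schemes are just as valid), and the names `tokenize` and `tfidf_matrix` are made up for this example. It also stops at the weighted, normalized matrix; LSA proper then applies a truncated SVD to that matrix, which is not shown here.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; in practice this would be loaded from a file.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_matrix(docs):
    """Return (vocab, rows); each row is a unit-length tf-idf vector for one document."""
    # Process each document into term counts.
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    n = len(docs)
    # Document frequency: how many documents contain each term.
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    rows = []
    for c in counts:
        # Compute the weights (assumed scheme: raw tf * log(N / df)).
        row = [c[t] * math.log(n / df[t]) for t in vocab]
        # Normalize each document vector to unit length.
        norm = math.sqrt(sum(w * w for w in row))
        rows.append([w / norm if norm else 0.0 for w in row])
    return vocab, rows


if __name__ == "__main__":
    docs = ["the cat sat", "the dog sat", "a cat and a dog"]
    vocab, rows = tfidf_matrix(docs)
    # Print the data as a small document-by-term table.
    print(" " * 6 + " ".join(f"{t:>6}" for t in vocab))
    for i, row in enumerate(rows):
        print(f"doc{i:<3}" + " ".join(f"{w:6.3f}" for w in row))
```

To process real files rather than in-memory strings, you would read each file's contents into the `docs` list before calling `tfidf_matrix`.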

There's a MATLAB toolbox called TMG (most unis will have licences allowing you to get a free copy of MATLAB) which handles clustering, retrieval, indexing, dimensionality reduction and classification. A powerful package indeed! MATLAB can also do a whole load of other things, because there are plenty of freely available extensions such as the SVM Toolbox.

JLSI is a freely available Java implementation.

There's also the semantic-engine, which uses LSI/A and is written in C++ (hosted on Google Code).

The semantic vectors package is also available, written in Java on top of Lucene.

There's a working online tool from the University of Colorado LSA group. It also does other types of classification.
 
There's gCLUTO with a nice interface for you - it gives you a graphical representation of clusters.

There's a demo here from Telcordia.

There's also a PLSI parser here, if you want to try the other variant and compare.

I think that will do for now. I hope that you have fun with these :)
  


Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at scienceforseo.blogspot.com.