Science for SEO: Why PLSI is prefered over LSI

Here is quick explanation of why "Latent semantic indexing" LSI is unlikely to be used as is in modern search engines like Google. It's an old technique that has limitations, and solutions to these have been developed over the years since its introduction to IR research.

First, very quickly, a no-maths definition of LSI as a refresher:

"Semantic" = meaning

"Latent" = present but hidden

LSI = The analysis of the hidden meaning of words and how often they occur in a document.

Its advantage over simple keyword analysis (Boolean search = True or False) is that it can infer meaning from words which is not evident, and match words which would not normally happen with other methods. For example, "computer", "PC", "laptop" are all connected. Documents are put together even if it is not obvious that they are connected, because a "latent semantic space" is created.

It uses vectors in a a high-dimensional vector space (lots of them). It creates a term-document matrix from all the documents. Then 3 matrices are created using SVD (singular value decomposition) (also the second matrix houses the singular values of the original matrix in a diagonal matrix). This means that sets of terms or documents can be represented as d-dimensional vectors. Using the cosine of the angle between these vectors, there is now an easy-to-calculate similarity measure between any two sets of terms and/or documents. It can be used in any language because of the way that it's constructed.

But...there are big problems with LSI:

The resulting dimensions can be very difficult to interpret so there are mistakes. It's unclear what the resulting similarities between terms really means.
The input is a bag-of-words so we don't have any text structure information.
A compound term (bull-headed) is treated as 2 terms.
Ambiguous terms create noise in the vector space
There's no way to define the optimal dimensionality of the vector space
There's a time complexity for SVD in dynamic collections

So instead we use Probabilistic latent semantic indexing (PLSI).

It is better:

It has a more robust statistical foundation and provides a proper generative data model
It uses the EM algorithm (Expectation maximization to avoid over-fitting (nodes too specific to noise)) - this makes it far more flexible
It can deal with domain specific synonymy and polysemous words

LSI is different to PLSI:

It analyses co-occurrences of words and documents
It gives us what terms give us a topic (term by topic matrix - topic is also called an "aspect"), and which topics are in which documents term by document matrix)
It uses aspects to create queries
LSI uses the Gaussian model (normal distribution) that can generate negative values and you can't have a negative number of words in a document - PLSI uses a multi-nominal model (probability distribution) which works better.

In short:

The order of the words is lost (but results are still good due to word co-occurrence)
Documents can be represented by numeric vectors in a space of words
It's an unsupervised method (no human involvement necessary)
It retrieves topics
It uses aspects of LSI
Each query uses the cosine similarity metric to find the similarity between vectors.

Thomas Hoffman invented it, you can read his SIGIR paper here if you want more information.

It is currently used not only in standard IR but in collaborative filtering for example and blog search.

LSI is still a fundamental standard techniques used, but PLSI is being preferred for obvious reasons. Saying that a search engine is using LSI is like saying that a car is using petrol/gas. It's not environmentally friendly and it's expensive. LSI also has drawbacks and newer techniques are being used to counter these.

Science for SEO

November 03, 2008

Why PLSI is prefered over LSI

No comments:

About Me

Follow me on Twitter

Subcribe

CJ's shared items

My Blog List

Blog Archive

ShareThis

Content Recommendations powered by Evri