First, very quickly, a no-maths definition of LSI as a refresher:

"Semantic" = meaning

"Latent" = present but hidden

LSI = The analysis of the hidden meaning of words and how often they occur in a document.

Its advantage over simple keyword analysis (Boolean search: a term is either present or absent) is that it can infer relationships between words that are not explicit in the text, matching terms that other methods would miss. For example, "computer", "PC" and "laptop" are all related, so documents using different members of that set are still grouped together even when the connection isn't obvious, because the analysis constructs a "latent semantic space" in which they lie close to each other.

It represents text as vectors in a high-dimensional vector space (lots of dimensions). First, a term-document matrix is built from the whole collection. That matrix is then factored into three matrices using SVD (singular value decomposition); the second of the three is a diagonal matrix holding the singular values of the original matrix. Keeping only the k largest singular values yields a reduced space in which sets of terms or documents can be represented as k-dimensional vectors. The cosine of the angle between these vectors then gives an easy-to-calculate similarity measure between any two sets of terms and/or documents. Because the method relies only on co-occurrence counts, it works in any language.
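As a rough sketch of the mechanics, here is that decomposition applied to a tiny made-up term-document matrix (the vocabulary, the counts and the choice of k = 2 are all illustrative assumptions, not from the post):

```python
import numpy as np

# Tiny made-up term-document count matrix: rows are terms, columns documents.
terms = ["computer", "pc", "laptop", "tree"]
A = np.array([
    [2, 1, 0, 0],   # computer
    [1, 2, 0, 0],   # pc
    [1, 1, 1, 0],   # laptop
    [0, 0, 2, 3],   # tree
], dtype=float)

# SVD factors A into U (term space), the singular values S, and Vt (document space).
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values: this reduced space is the
# "latent semantic space".
k = 2
term_vecs = U[:, :k] * S[:k]   # each row: one term as a k-dimensional vector

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "computer" and "pc" co-occur in the same documents, so their latent
# vectors point in almost the same direction; "tree" points elsewhere.
print(cosine(term_vecs[0], term_vecs[1]))   # high
print(cosine(term_vecs[0], term_vecs[3]))   # much lower
```

Note that "computer" and "pc" never need to appear in a thesaurus together; their similarity falls out of the co-occurrence structure alone.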

But...there are big problems with LSI:

- The resulting dimensions can be very difficult to interpret, which leads to mistakes: it's unclear what the resulting similarities between terms really mean.
- The input is a bag of words, so all text structure information is lost.
- A compound term ("bull-headed") is treated as two separate terms.
- Ambiguous terms create noise in the vector space.
- There's no principled way to choose the optimal dimensionality of the vector space.
- SVD is expensive to recompute, which is a problem for dynamic collections.

So instead we use Probabilistic latent semantic indexing (PLSI).

It is better:

- It has a more robust statistical foundation and provides a proper generative data model
- It is fitted with the EM (expectation-maximization) algorithm, and a tempered variant of EM helps avoid over-fitting (parameters that become too specific to noise in the training data) - this makes it far more flexible
- It can deal with domain-specific synonymy and with polysemous words
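To make the EM step concrete, here is a minimal PLSI/PLSA fitting loop on a made-up count matrix (the counts, the number of aspects k = 2, the iteration count and the random seed are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up word-document count matrix: rows = documents, columns = words.
counts = np.array([
    [4, 3, 0, 0],
    [3, 4, 1, 0],
    [0, 1, 4, 3],
    [0, 0, 3, 4],
], dtype=float)

n_docs, n_words = counts.shape
k = 2  # number of latent aspects (topics); chosen arbitrarily here

# Random initial parameters: P(w|z) and P(z|d), normalized to probabilities.
p_w_z = rng.random((k, n_words)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
p_z_d = rng.random((n_docs, k)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)

for _ in range(100):
    # E-step: posterior P(z|d,w) proportional to P(w|z) * P(z|d).
    post = p_z_d[:, :, None] * p_w_z[None, :, :]        # (docs, topics, words)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate parameters from the expected counts n(d,w) * P(z|d,w).
    expected = counts[:, None, :] * post
    p_w_z = expected.sum(axis=0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = expected.sum(axis=2)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

# After fitting, each aspect concentrates on one word cluster, and each
# document concentrates on the aspect that generated its words.
```

Here p_w_z plays the role of the term-by-topic matrix and p_z_d the topic-by-document matrix mentioned below.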

How PLSI differs from LSI:

- It analyses co-occurrences of words and documents probabilistically
- It tells us which terms make up a topic (a term-by-topic matrix - a topic is also called an "aspect"), and which topics appear in which documents (a topic-by-document matrix)
- It uses aspects to handle queries
- LSI implicitly assumes a Gaussian model (normal distribution), which can generate negative values, and you can't have a negative number of words in a document - PLSI uses a multinomial model (a probability distribution over word counts), which works better.

In short:

- The order of the words is lost (but results are still good due to word co-occurrence)
- Documents can be represented by numeric vectors in a space of words
- It's an unsupervised method (no human involvement necessary)
- It retrieves topics
- It keeps LSI's core idea of latent aspects
- Each query uses the cosine similarity metric to find the similarity between vectors.
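As a sketch of that last point, a query can be folded into the reduced space and scored against every document with cosine similarity. The matrix, vocabulary and k below are invented for illustration:

```python
import numpy as np

# Toy term-document matrix: rows = terms ("computer", "pc", "laptop", "tree").
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 2, 3],
], dtype=float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk = U[:, :k], S[:k]
doc_vecs = (np.diag(Sk) @ Vt[:k]).T   # each row: a document as a k-dim vector

# Fold the query vector into the latent space: q_k = S_k^{-1} U_k^T q.
q = np.zeros(4); q[0] = q[1] = 1.0    # query mentions "computer" and "pc"
q_k = (Uk.T @ q) / Sk

# Cosine similarity between the query and every document.
sims = doc_vecs @ q_k / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_k))
ranking = np.argsort(-sims)           # best-matching documents first
```

The "computer"-heavy documents come out on top; the document about trees scores lowest, even though the query shares no term with it in the first place only by accident of the toy data.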

Thomas Hofmann invented it; you can read his SIGIR '99 paper if you want more information.

It is currently used not only in standard IR but also, for example, in collaborative filtering and blog search.

LSI is still a fundamental, widely used technique, but PLSI is increasingly preferred, for the reasons above. Saying that a search engine uses LSI is a bit like saying that a car runs on petrol/gas: it works, but it's expensive and showing its age. LSI has real drawbacks, and newer techniques are being adopted to address them.
