January 05, 2009

Query expansion using MT

The Google patent entitled "Machine Translation for Query Expansion" (25/12/08) is a really interesting read. It describes an exciting new method for query expansion.

"Query expansion" is when a users query is modified before the search is performed.  This is done to improve the search results.  To do this techniques such as stemming, spelling correction, and the adding of synonyms are used.The method described deals with query expansion using synonyms.  Usually this is done using thesauri or lexical ontologies but here it is proposed that machine translation be used - ingenious.  

Synonym selection is not easy. WordNet and similar resources have helped us a lot, but there's room for improvement. A word can have several different meanings, and picking a synonym for the wrong meaning would completely change the query.
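
To make the problem concrete, here is a minimal Python sketch of plain thesaurus-style expansion. The SYNONYMS table and the expand_query function are made up for illustration; a real system would pull its synonyms from a resource such as WordNet. The lookup ignores context, so "ship" gets the same synonyms whether the query is about parcels or boats.

```python
# Hypothetical, hand-written synonym table (an assumption for this sketch).
# A real system would draw on a thesaurus or a lexical ontology instead.
SYNONYMS = {
    "ship": ["send", "transport", "boat"],
    "box": ["package", "parcel"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms, each followed by its synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("How to ship a box"))
```

For "How to ship a box" this adds "boat" alongside "send" and "transport", which is exactly the kind of wrong-sense expansion that makes synonym selection hard.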

"The method includes receiving a search query and selecting a synonym of a term in the search query based on a context of occurrence of the term in the received search query, the synonym having been derived from statistical machine translation of the term. The method also includes expanding the received search query with the synonym and using the expanded search query to search a collection of documents."

Google uses statistical machine translation (as opposed to the rule-based approach). This type of system includes a language model, which estimates how likely a piece of text is to be well-formed in the target language, and a translation model, which estimates the probability that one string is the translation of another. The language model helps the system decide which of the candidate translations coming out of the translation model is likely to be right.
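
As a rough illustration of how the two models work together, here is a small Python sketch of the classic noisy-channel scoring used in statistical MT: each candidate is scored by multiplying the language model probability by the translation model probability, and the best one wins. All the numbers, the table names and the best_candidate function are invented for this example; the patent may well use a different formulation.

```python
import math

# Toy numbers, invented for illustration; a real system estimates them from data.
# Translation model: P(source term | candidate), here for the source term "ship".
TRANSLATION_PROB = {"boat": 0.5, "send": 0.3, "transport": 0.2}

# Language model: P(sentence), i.e. how plausible the resulting query reads.
LANGUAGE_PROB = {
    "how to boat a box": 0.0001,
    "how to send a box": 0.02,
    "how to transport a box": 0.01,
}


def score(candidate, template):
    """Log of P(sentence) * P(source | candidate) for one candidate."""
    sentence = template.format(candidate)
    return math.log(LANGUAGE_PROB[sentence]) + math.log(TRANSLATION_PROB[candidate])


def best_candidate(template, candidates=TRANSLATION_PROB):
    """Pick the candidate with the highest combined score."""
    return max(candidates, key=lambda c: score(c, template))


print(best_candidate("how to {} a box"))  # send
```

On its own the translation model prefers "boat", but the language model knows that "how to boat a box" is an unlikely sentence, so the combined score picks "send".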

"In general, in another aspect, a method is provided. The method includes receiving a request to search a corpus of documents, the request specifying a search query, using statistical machine translation to translate the specified search query into an expanded search query, the specified search query and the expanded search query being in the same natural language, and in response to the request, using the expanded search query to search a collection of documents."

The end result is an increased likelihood that the search results are more accurate. It also limits the expansion of the query with erroneous words (which is nice, technically speaking).
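
Here is a minimal Python sketch of what selecting a synonym "based on a context of occurrence" could look like. The CONTEXT_SYNONYMS table is hand-written and purely hypothetical (in the patent the pairs would be derived from statistical machine translation), and the function names are my own. A synonym is only added when another word in the query supports that sense.

```python
# Hypothetical table of (term, context word) -> synonym pairs. In the patent
# these would come from statistical machine translation; here they are
# hand-written for illustration.
CONTEXT_SYNONYMS = {
    ("ship", "box"): "transport",
    ("ship", "cruise"): "boat",
}


def select_synonyms(query, table=CONTEXT_SYNONYMS):
    """Map each query term to a synonym supported by another term in the query."""
    terms = query.lower().split()
    selected = {}
    for term in terms:
        for context in terms:
            synonym = table.get((term, context))
            if synonym is not None:
                selected[term] = synonym
    return selected


def expand_with_context(query, table=CONTEXT_SYNONYMS):
    """Append the selected synonyms to the original query terms."""
    terms = query.lower().split()
    extras = [s for s in select_synonyms(query, table).values() if s not in terms]
    return terms + extras


print(expand_with_context("How to ship a box"))   # ['how', 'to', 'ship', 'a', 'box', 'transport']
print(expand_with_context("cruise ship prices"))  # ['cruise', 'ship', 'prices', 'boat']
```

A query with no supporting context, such as just "ship", is left unexpanded, which is how this kind of approach avoids adding erroneous words.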

"Statistical correlations between the occurrences of words in the source language and words in the target language are expressed as alignments between particular words or phrases. When the target language and source language are the same natural language, the principal meaning of an aligned pair is the same. The aligned word or phrase pair is presumed to have similar meaning, i.e., they are presumed to be synonymous. For example, the word "ship" can be aligned under certain circumstances (e.g., in a particular context) with the word "transport". Thus, for those circumstances, "ship" is synonymous with "transport". 

The Google Translate system is far from accurate, though. Using their example:

User query: "How to ship a box"
Google translate French: "Comment une boîte de livraison"
Google translate German: "Wie Sie,ein Feld"

French:
Comment = How
une boîte = a box
livraison = delivery

German:
Wie = how
Sie = you
ein Feld = a field

The synonyms and the context are pretty hard to get from these translations. Obviously this is a very simplistic and short test; it just gives an idea of the thing. It works better with "Achilles heel running injury", by the way, but evaluation is not done like this - it's a bit more complex.

Basically, the idea here is to add more context awareness to the search system. I like it; it's very clever indeed. It also shows that the Google Translate engine can be put to use in ways other than just translation.

Why should you care?

Well, this method shows that queries are being expanded to include far more words than are actually present in the query. This means that going after particular keywords may be useful as a starting point, but it is a very limited approach. As an SEO expert, you should be seeking to create content rich with not only your top target keywords but also terms and concepts that belong to that topic. It's time to look at things in more dimensions than one.

Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at scienceforseo.blogspot.com.