
September 30, 2008


CLIR, "Cross Language Information Retrieval" (also known by many other names, including "translingual" retrieval), has been a research topic since at least 1996, when the first conference on the subject was held as part of SIGIR.  It involves retrieving documents written in a language different from that of the user's query.  For example, the user may ask a question in Dutch and need results in German.

Google Translate offers such a service, though research in this field continues.  It returns results in both English and French if you choose those languages, and I think they're of pretty good quality as well.

Researchers from UMass have published a paper entitled "Simultaneous Multilingual Search for Translingual Information Retrieval".  They describe a method that integrates document translation and query translation into the retrieval model.

Basically, each document is indexed with both its original text and its translation into the query language.  Each query term and its translation are treated as a synonym set.  So they run one search instead of two separate searches, and keep one index instead of two.
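
To make the idea concrete, here is a minimal sketch in Python. It is my own toy illustration, not the authors' implementation: the function names, the whitespace tokenizer, the "one point per matched synonym set" scoring, and the French/English sample data are all assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; a real system would do much more).
    return text.lower().split()


def build_joint_index(docs):
    """Build ONE inverted index over each document's original text
    and its translation into the query language.

    docs maps doc_id -> (original_text, translated_text).
    """
    index = defaultdict(set)
    for doc_id, (original, translation) in docs.items():
        for term in tokenize(original) + tokenize(translation):
            index[term].add(doc_id)
    return index


def search(index, synonym_sets):
    """Run ONE search: each query term and its translation form a synonym set,
    and a document scores one point per synonym set it matches."""
    scores = defaultdict(int)
    for synset in synonym_sets:
        matched = set()
        for term in synset:
            matched |= index.get(term.lower(), set())
        for doc_id in matched:
            scores[doc_id] += 1
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample data: French documents with English translations.
docs = {
    "d1": ("jeux olympiques pekin", "beijing olympic games"),
    "d2": ("economie de shanghai", "shanghai economy"),
}
index = build_joint_index(docs)
results = search(index, [{"beijing", "pekin"}, {"olympic", "olympiques"}])
print(results)
```

Because both language versions of a document sit under the same document id, a match in either language counts toward the same result, which is why a single query over a single index is enough.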

As a result, they state:

"This model leverages simultaneous search in multiple languages against jointly indexed documents to improve the accuracy of results over search using document translation or query translation alone. For query translation, we compared a statistical machine translation (SMT) approach to a dictionary based approach. We found that using a Wikipedia-derived dictionary for named entities combined with an SMT-based dictionary worked better than SMT alone."

They made use of Wikipedia for names, but because they were dealing with languages that don't overlap, such as English and Chinese, every name had to be translated.  Misspellings also occurred, so they had to build a set of name variants.  They were also constrained by the limits of machine translation technology, particularly for named entities.
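
Here is a rough Python sketch of how a query term might be expanded into a synonym set using a name dictionary, a general dictionary, and a table of name variants. Everything in it is hypothetical: the three toy dictionaries, the function names, and the "name dictionary first, general dictionary as fallback" order are my assumptions for illustration; the paper only says the two dictionaries were combined.

```python
# Toy stand-in for a Wikipedia-derived named-entity dictionary (hypothetical data).
NAME_DICT = {"beijing": {"pekin"}}

# Toy stand-in for an SMT-derived general dictionary (hypothetical data).
SMT_DICT = {"games": {"jeux"}, "olympic": {"olympiques"}}

# Known misspellings and alternate forms mapped to one canonical name (hypothetical data).
NAME_VARIANTS = {"bejing": "beijing", "peking": "beijing"}


def translate_term(term):
    """Return the synonym set for one query term: the term plus its translations."""
    term = term.lower()
    # Normalise misspellings and variants before the lookup.
    canonical = NAME_VARIANTS.get(term, term)
    if canonical in NAME_DICT:
        # Prefer the named-entity dictionary for names.
        translations = NAME_DICT[canonical]
    else:
        # Fall back to the general dictionary; unknown terms get no translation.
        translations = SMT_DICT.get(canonical, set())
    return {canonical} | set(translations)


def query_to_synonym_sets(query):
    """Turn a whitespace-separated query into one synonym set per term."""
    return [translate_term(t) for t in query.split()]


synsets = query_to_synonym_sets("Bejing olympic games")
print(synsets)
```

The misspelled "Bejing" is mapped to its canonical form first, so it still picks up the translation from the name dictionary instead of falling through untranslated.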

Their evaluation showed that:

"Our experimental results show that this approach significantly outperforms a previous hybrid approach, which merges the results of separate queries issued over separately indexed source and English documents. Our experiments evaluated results for English queries and Chinese documents, but our implementation of SMLIR currently includes three languages, English, Chinese and Arabic, demonstrating the ability to seamlessly integrate multiple languages into one framework."

They continue to work on their project to improve the performance of the system.  Please read their paper for more precise information and a better understanding of their interesting work.

CLIR is an important area of research for the future of search engines because the greatest number of Internet users are from Asian countries, meaning that their searches and the data they produce are likely to be in their own languages and not in English.  It is important for us to be able to access these documents and understand the information in them, as otherwise we'd be missing out quite considerably.  The same applies in the other direction, and for countries with fewer users, who would need to access information that is mostly in a foreign language.

How does SEO fit in?  I guess you still make your site relevant and content-rich, and maybe you can translate it to see how it reads in Chinese or another language.  The method of search is the same, though; I mean, the search results are the same anyway.  You can check in Google Translate.

New developments may change things at least a little, though.  Watch this space.


Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.
Based on a work at