
November 11, 2008

About machine translation

Machine translation (MT) is all about translating text (or even speech) from one language to another.  It's part of computational linguistics, and draws on NLP methods, statistical methods, rule-based methods, corpus techniques, and some AI too, amongst other things.  Apparently the idea goes back to the 17th century, and in the 1950's the Georgetown experiment took place, but the early systems didn't really work, so funding was heavily cut and a lot of research in this area was terminated.  In the 1980's it made a comeback.

It's important today, in the age of the Internet, because a lot of data is in different languages, and when we can't understand another language, we are deprived of what may be the most relevant content to our query.

First you have to pull apart the source text to make sense of it, and then you have to re-engineer it into the target language so it makes perfect sense to a target language reader.  Not only do you have to understand all the grammatical elements, the syntax, the idioms, the semantics, and so on, you also have to have a good grasp of the culture associated with the target language.  

Different systems use different approaches; here is a brief description of the main ones:

Rule-based systems:
These are basically made up of a large set of rules relating to translation between the two languages.  The system can use a bilingual dictionary and map words to it.  You can derive those rules from a parallel corpus, which means mapping between ready-made translations, picking out the common patterns, then feeding these into the machine.  I tried this and it wasn't very precise.  Google used SYSTRAN, a rule-based system, for many years.
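To make the idea concrete, here's a minimal sketch of how a rule-based system works. The lexicon and the adjective-reordering rule are invented toy examples for English-to-French, not anything SYSTRAN actually uses:

```python
# Toy rule-based translator: a bilingual dictionary plus one
# reordering rule (English "adjective noun" -> French "noun adjective").
LEXICON = {  # English -> French, invented toy entries
    "the": "le",
    "cat": "chat",
    "black": "noir",
    "sleeps": "dort",
}

ADJECTIVES = {"black"}  # toy adjective list for the reordering rule

def translate(sentence):
    words = sentence.lower().split()
    reordered = []
    i = 0
    while i < len(words):
        # Rule: if an adjective precedes a noun, swap them for French order
        if words[i] in ADJECTIVES and i + 1 < len(words):
            reordered.extend([words[i + 1], words[i]])
            i += 2
        else:
            reordered.append(words[i])
            i += 1
    # Dictionary lookup; unknown words pass through unchanged
    return " ".join(LEXICON.get(w, w) for w in reordered)

print(translate("the black cat sleeps"))  # -> "le chat noir dort"
```

Real systems need thousands of such rules plus morphology handling, which is why building them by hand is so labour-intensive and brittle.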

Statistical methods:
Google Translate now works with these.  The approach involves generating a large set of statistics from a large corpus, which is then used to train the system.  The problem is finding a very large corpus.  That isn't too much of a problem for Google, but not very many suitable corpora exist; even Google used the United Nations corpus to add 200 billion words to its system.
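In its simplest form, the statistical idea is to count how often source and target words co-occur in aligned sentence pairs and pick the most likely translation. Here's a toy sketch with invented data (real systems train on billions of words and use far more sophisticated alignment models):

```python
# Toy statistical translation: count co-occurrences of English and
# French words across aligned sentence pairs from a parallel corpus.
from collections import Counter, defaultdict

parallel_corpus = [  # (English, French) pairs, invented toy data
    ("the house", "la maison"),
    ("a house", "une maison"),
    ("the house is small", "la maison est petite"),
]

cooc = defaultdict(Counter)
for en, fr in parallel_corpus:
    for e in en.split():
        for f in fr.split():
            cooc[e][f] += 1  # e and f appeared in the same sentence pair

def best_translation(word):
    # The target word that co-occurs most often wins
    return cooc[word].most_common(1)[0][0]

print(best_translation("house"))  # -> "maison"
```

Because "maison" appears alongside "house" in all three pairs while other French words only appear in some, the counts alone pick out the right translation; this is the intuition that word-alignment models build on.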

The main issues:
Word disambiguation is very difficult.  This is the problem of words having more than one meaning.  Google doesn't do so well in this area.  There are two methods that are known to deal with it: the shallow approach (looking at the surrounding words and drawing statistical information from them), and the deep approach (providing the system with a comprehensive definition of each word).  The deep approach takes a lot of time and isn't as precise, so statistical methods tend to do better.
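The shallow approach can be sketched very simply: score each sense of an ambiguous word by how many of the surrounding words overlap with words typically seen near that sense. The sense profiles below are invented toy examples:

```python
# Toy shallow word-sense disambiguation: pick the sense whose typical
# context words overlap most with the words in the sentence.
SENSE_CONTEXTS = {  # invented toy profiles for the word "bank"
    "bank/finance": {"money", "loan", "deposit", "account"},
    "bank/river": {"water", "fish", "shore", "muddy"},
}

def disambiguate(sentence):
    context = set(sentence.lower().split())
    # Count overlapping context words for each sense; highest score wins
    return max(SENSE_CONTEXTS, key=lambda s: len(SENSE_CONTEXTS[s] & context))

print(disambiguate("she opened an account at the bank to deposit money"))
# -> "bank/finance"
```

Real systems learn these context profiles automatically from large corpora rather than writing them by hand, but the scoring idea is the same.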

Consider this, for example: "Cleaning fluids can be dangerous" - does it mean that the act of cleaning fluids IS dangerous, or that cleaning fluids themselves ARE dangerous?

There are so many difficult issues in handling language anyway, seeing as it requires natural language understanding, which is far from solved right now.  There is a lot of research going on though, and eventually machine translation will work, but I'm not so sure how soon that will be.

What does it mean for SEO?
Well, your keywords and your content are going to look a lot different in other languages, and the text may also be modified and re-written in places.  This means that you have a lot less control over how these pages rank in other languages.  The solution?  Maybe it would be worth having multilingual staff :)

Read more here, from the University of Essex.  
There's also good information at Microsoft Research (MT labs).
John Hutchins is a great source of information.
And check out Carnegie Mellon University's MT labs too.


Creative Commons License
Science for SEO by Marie-Claire Jenkins is licensed under a Creative Commons Attribution-Non-Commercial-No Derivative Works 2.0 UK: England & Wales License.