Imagine you are at a restaurant in a foreign country: trying to avoid the tourist traps, you found yourself at a nice local restaurant in a quiet neighborhood, with no tourists except for you. The only problem is that the menu is in a foreign language... no English menu. Actually, what's the problem? Pick your favorite machine translation system (Google Translate, Bing Translator, BabelFish, etc.) and translate the menu into a language you understand!
The Tower of Babel, by Pieter Bruegel the Elder. Oil on board, 1563.
I'm going to focus on statistical machine translation. Translation means taking a sentence in one language (the source language) and producing a sensible sentence in another language (the target language) that has the same meaning. Machine means that it's done by software rather than a human translator. What does statistical mean?
It means that rather than coding expert knowledge into software and creating a lexicon and grammatical rules for translation between two specific languages, these systems are based on statistics computed over texts in the source and target languages. This is what makes it possible to produce translations between any source and target languages without additional effort, and without having to hire someone who actually speaks these languages. The only thing you need is a large amount of text in both languages.
Statistical Machine Translation
What makes a translation good?
- It is as similar as possible in meaning to the original sentence in the source language.
- It sounds correct in the target language, e.g., grammatically.
SMT systems have a component for each of these requirements. The translation model makes sure that the translation is adequate and the language model is responsible for the fluency of the translation in the target language.
Language Model
I mentioned language models in my NLP overview post. They are used for various NLP applications. A language model (of a specific language, say English) receives as input a sentence in English and returns the probability of composing this sentence in the language. This is a score between 0 and 1, determining how fluent a sentence is in English - the higher the score, the more fluent the sentence is. Language models (LM) can capture grammatical rules (LM("she eat pizza") < LM("she eats pizza")), correct word order (LM("love I cats") < LM("I love cats")), better word choice (LM("powerful coffee") < LM("strong coffee")), and even some logic and world knowledge (LM("good British food") < LM("good Italian food")).
These models are obtained from a large corpus (structured set of texts) in the target language (e.g. English). In the next post I will elaborate on how this is done (edit 12/09/15: you can read in the next post how this is done).
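To make this concrete, here is a minimal sketch of a bigram language model in Python. Everything in it is a toy under strong assumptions: real language models are trained on huge corpora and use longer n-grams, smoothing, and log probabilities. The class name and the three-sentence corpus are made up for illustration.

```python
from collections import defaultdict

class BigramLM:
    """A toy bigram language model: P(sentence) is the product of
    P(word | previous word), estimated by counting pairs in a corpus."""

    def __init__(self, corpus):
        # corpus: a list of sentences, each a list of words
        self.bigram_counts = defaultdict(int)
        self.unigram_counts = defaultdict(int)
        for sentence in corpus:
            words = ["<s>"] + sentence + ["</s>"]  # sentence boundary markers
            for w1, w2 in zip(words, words[1:]):
                self.bigram_counts[(w1, w2)] += 1
                self.unigram_counts[w1] += 1

    def score(self, sentence):
        """Return the probability of composing this sentence in the language."""
        prob = 1.0
        words = ["<s>"] + sentence + ["</s>"]
        for w1, w2 in zip(words, words[1:]):
            if self.unigram_counts[w1] == 0:
                return 0.0  # unseen word: a real LM would smooth instead
            prob *= self.bigram_counts[(w1, w2)] / self.unigram_counts[w1]
        return prob

# Toy corpus - in practice this would be millions of sentences.
lm = BigramLM([["I", "love", "cats"], ["I", "love", "dogs"], ["she", "eats", "pizza"]])
print(lm.score(["I", "love", "cats"]))   # positive score for a fluent sentence
print(lm.score(["love", "I", "cats"]))   # 0.0 - bad word order gets a lower score
```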
Translation Model
A translation model (from the source language to the target language) receives as input a pair of sentences / words / phrases, one in each language, and returns the probability of translating the first to the second. Like the language model, it gives a score between 0 and 1, determining how adequate a translation is - the higher the score, the more adequate the translation.
These models are obtained from parallel corpora (corpora is the plural of corpus) - pairs of corpora that contain the same texts in different languages (one in the source language and one in the target language).
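Conceptually, the heart of the translation model is a phrase table. Here is a toy sketch; the entries and scores below are invented for illustration (real tables are learned from parallel corpora and contain millions of weighted entries):

```python
# A toy phrase table: each English phrase maps to candidate Hebrew translations
# with their TM scores. The numbers here are made up for illustration.
phrase_table = {
    "machine":       [("מכונה", 0.8)],
    "translation":   [("תרגום", 0.9)],
    "is":            [("הוא", 0.7)],
    "a":             [("", 0.5)],                  # articles are often dropped in Hebrew
    "piece":         [("חתיכת", 0.6), ("חלק", 0.3)],
    "of":            [("של", 0.6), ("", 0.3)],
    "cake":          [("עוגה", 0.8)],
    "piece of cake": [("קלי קלות", 0.5), ("חתיכה של עוגה", 0.2)],
}

def tm_score(source_phrase, target_phrase):
    """TM score: how adequate is target_phrase as a translation of source_phrase?"""
    for candidate, score in phrase_table.get(source_phrase, []):
        if candidate == target_phrase:
            return score
    return 0.0

print(tm_score("piece of cake", "קלי קלות"))  # 0.5 - the idiomatic translation
```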
Given these two components, the language model and the translation model, how does the translation work? The translation model provides a table with words and phrases in the source language and their possible translations to the target language, each with a score. Given a sentence in the source language, the system uses this table to translate phrases from the source sentence to phrases in the target language.
There are multiple possible translations for the source sentence. First, the source sentence can be segmented into phrases in multiple ways. For example, take the sentence Machine translation is a piece of cake. The most intuitive thing to do would be to split it into words. This yields a very literal translation (in Hebrew: תרגום מכונה הוא חתיכה של עוגה), which doesn't make much sense. But the translation table probably also has an entry for the phrase piece of cake, translating it to a word or an idiom with the same meaning in the target language (in Hebrew: קלי קלות. Ask Google).
Second, even for a certain segmentation of the source sentence, some phrases have multiple translations in the target language. This happens both because a word in the source language can be polysemous (have more than one meaning), e.g. piece, and because one word in the source language can have many synonyms in the target language, e.g. cake.
The translation system chooses how to segment the source sentence and how to translate each of its phrases to the target language, using the scores that the two models give the translation. It multiplies the translation score of each phrase with the language model score of the entire target sentence, for example:
P(תרגום מכונה הוא חתיכת עוגה | Machine translation is a piece of cake) =
TM(תרגום, translation) · TM(מכונה, machine) · TM(הוא, is) · TM(חתיכת, piece) · TM(עוגה, cake) · LM(תרגום מכונה הוא חתיכת עוגה)
This score can be understood as the conditional probability of translating Machine translation is a piece of cake to תרגום מכונה הוא חתיכת עוגה, but I'll spare you the formulas. The intuition behind multiplying the scores of the different translation components is the joint probability of independent events.
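Putting the two components together, here is a brute-force sketch of this search. It assumes a phrase_table shaped like the toy one above and some lm_score function (e.g. the toy bigram model's score method); real decoders use beam search and log probabilities, and also handle word reordering, which this naive version ignores:

```python
import itertools

def segmentations(words, max_len=3):
    """Yield every way to split a word list into consecutive phrases."""
    if not words:
        yield []
        return
    for i in range(1, min(max_len, len(words)) + 1):
        head = " ".join(words[:i])
        for rest in segmentations(words[i:], max_len):
            yield [head] + rest

def decode(sentence, phrase_table, lm_score):
    """Naive decoder: score every segmentation and every phrase choice,
    keep the best target sentence. Exponential - for illustration only."""
    best, best_score = None, 0.0
    for seg in segmentations(sentence.split()):
        options = [phrase_table.get(phrase, []) for phrase in seg]
        if any(not opts for opts in options):
            continue  # some phrase has no entry in the table
        for choice in itertools.product(*options):
            targets = [t for t, _ in choice if t]  # drop empty translations
            score = lm_score(targets)              # fluency of the whole sentence
            for _, tm in choice:
                score *= tm                        # adequacy of each phrase
            if score > best_score:
                best, best_score = " ".join(targets), score
    return best, best_score
```

With the toy phrase table and a Hebrew language model, decode("machine translation is a piece of cake", phrase_table, lm.score) would compare the literal word-by-word translation against the one using the piece of cake idiom entry, and return whichever gets the higher combined score.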
Some things to note: the word of disappeared from the translation, and the words machine and translation switched places in the target sentence. These things happen and are allowed. Machine translation is a bit more complex than what I've told you. Just a bit :)
So each possible translation receives a final score, indicating both how adequate the translation is and how fluent it is in the target language, and the system chooses the translation with the highest score. Ironically, Google gets this particular sentence wrong.
Google ironically translates "Machine translation is a piece of cake" incorrectly.
Why is it a really bad idea to rely on machine translation when you wish to speak or write in a language that you don't speak?
Because you may say things that you don't mean.
I'll give some examples of problems in translation.
Ambiguity - as you probably remember, this problem keeps coming back in every NLP task. In translation, the problem is that a polysemous word in the source language may be translated to different words in the target language, depending on the sense. For example, wood can be translated to Hebrew as עץ (the material from a tree) or as יער (a geographical area with many trees). While a human translator can pick the correct translation according to context, machines find it more difficult.
It gets even worse when you use a polysemous word in its less common meaning. A few months ago I needed to send an email to the PC (program committee) chairs of the conference in which I published my paper. I noticed something funny about my email, and had to check how Google Translate would handle it. My email started with "Dear PC chairs". I translated it to Hebrew (and back to English, for the non-Hebrew speakers in the audience):
Dear PC chairs => כסאות מחשב יקרים => expensive computer chairs
Don't expect SMT systems to always understand what you mean
So what happened here? The word chair has two meanings; I meant the less common one, chairman, while Google translated it in its more common sense (furniture). Acronyms are even worse when it comes to polysemy, and PC refers, almost 100% of the time, to personal computer. On top of that, the adjective dear is translated to Hebrew as יקר, which means both dear and expensive. Google chose the wrong sense, creating a funny translation. However, knowing how SMT systems work, it's understandable: selecting the more common senses of words yields better scores from both the language model and the translation model. I can't blame Google for this translation.
This is just one example of a problem in machine translation. There are many others: different languages have different word orders (e.g. the adjective comes before the noun in English, but after the noun in Hebrew, French and many other languages); in some languages nouns have gender while in others they don't; and idioms are really tough for SMT systems - sometimes they are translated literally, like the piece of cake example (when it was part of a sentence).
A good translation for an idiom.
These problems are handled by more complex machine translation systems that enable word reordering and translation at the syntactic level. Nevertheless, as you probably notice from time to time, the task is still not performed perfectly.
Since machine translation systems are not very accurate, it is very funny to translate a sentence to a random foreign language and back to English several times, and see how you often end up with a totally different (sometimes meaningless) sentence at the end of this process. This is what Bad translator does. I tried it several times, and it was very amusing. Their example from the Ten Commandments inspired me to try other commandments, resulting in some very funny bad translations:
Thou shalt not make unto thee any graven image => You can move the portrait
Thou shalt not kill => You must remove.
Thou shalt not commit adultery => Because you're here, try three
Thou shalt not steal => woman
And some good ones:
Remember the sabbath day, to keep it holy => Don't forget to consider Saturday.
Honour thy father and thy mother => honor your father and mother
You are welcome to try it and post your funny bad translations in the comments!
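If you want to script this game yourself, here is a rough sketch. The translate(text, source, target) function is hypothetical - plug in whatever MT API you have access to:

```python
def round_trip(text, languages, translate):
    """Translate text through a chain of languages and back to English.
    translate(text, source, target) is a placeholder for a real MT API call."""
    current, source = text, "en"
    for target in languages + ["en"]:
        current = translate(current, source, target)
        source = target
    return current

# Hypothetical usage:
# round_trip("Thou shalt not kill", ["fr", "ja", "fi"], translate)
# Meaning drifts a little with every hop, since each step adds its own errors.
```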