This is the last part of the machine translation overview, in which I will discuss translation models. To recall, a statistical machine translation system produces a translation that is required to be both adequate, that is, as close as possible in meaning to the source sentence, and fluent in the target language. Fluency is the responsibility of the target language model, which scores every candidate translation according to its likelihood in the target language. The translation model, which will be presented in this post, takes care of adequacy: it scores candidate translations with respect to the original sentence in the source language, giving higher scores to sentences that better preserve the meaning of the original sentence.
As with language models, you don't need an expert to build the translation model. You don't even need to speak either the source or the target language. Using statistical methods, you can (theoretically) build a translation model from Swahili to Yiddish. The only requirement is to have a parallel corpus - a large amount of the same text, written in both languages. For example, movie subtitles or book translations in both languages. The texts are usually aligned at the sentence level, so the corpus can be regarded as a large collection of sentences in the source language and their translations into the target language. For example, the first sentence from George Orwell's novel 1984, in the original edition and in the Hebrew translation:
en: It was a bright cold day in April, and the clocks were striking thirteen.
he: יום אפריל צח וצונן, השעונים מצלצלים שלוש-עשרה.
can be considered mutual translations. So can the rest of the sentence pairs, as long as the translator is not too creative.
History Lesson
Here's a nice anecdote about using a parallel corpus for translation - it's actually not a modern technique at all. It has been around since the 19th century. The Rosetta Stone is an ancient Egyptian stone inscribed with a decree issued in Egypt in 196 BC. The text on the stone is written in three scripts: ancient Egyptian hieroglyphs, Demotic script, and ancient Greek. Ancient Egyptian hieroglyphs were used until the end of the fourth century AD, after which the knowledge of how to read them was lost. For hundreds of years, scholars tried to decode them. In 1799, the Rosetta Stone was rediscovered near the town of Rosetta in the Nile Delta, and brought with it a major advancement in the decoding of the hieroglyphs. It was the recognition that the stone offered three versions of the same text that enabled the advancement, making it arguably the first parallel corpus used for statistical translation (at that time, without machines). The hieroglyphs were finally deciphered in 1822 by the French scholar Jean-François Champollion. The stone is on public display at the British Museum (and is the most interesting exhibit there, in my opinion).
(Image: The Rosetta Stone)
Using sentence pairs from a parallel corpus as a translation table is nice, but not enough. You can always generate a sentence in the source language that didn't occur in the corpus, so it wouldn't be in the table. However, a sentence is composed of phrases (words and multi-word expressions), so instead of constructing a sentence translation table, a phrase translation table could be built, enabling a phrase-by-phrase translation. If the corpus is large enough, you can assume that it covers at least most of the common words and phrases in these languages.
This is what an excerpt from a phrase table from English to Hebrew might look like:
source | target | score |
day | יום | 1.0 |
April | אפריל | 1.0 |
bright | צח | 0.58 |
bright | בהיר | 0.42 |
cold | קר | 0.7 |
cold | צונן | 0.3 |
thirteen | שלוש עשרה | 0.41 |
thirteen | שלושה עשר | 0.21 |
thirteen | 13 | 0.38 |
Each entry contains a source language phrase, a target language phrase and the score (probability) of translating the source phrase to the target phrase. These scores are not trivial to compute, since the corpus is only aligned at the sentence level. All we know is that "יום אפריל צח וצונן, השעונים מצלצלים שלוש-עשרה" is a (possible) translation of "It was a bright cold day in April, and the clocks were striking thirteen", but we don't know which words in English are translated to which words in Hebrew. The assumption is that each word in the source sentence is translated to zero, one, or more words in the target language. In the simple case, it is translated to one word. In other cases, a word may disappear in translation (for example, the determiner "a" in English doesn't exist in Hebrew) or be translated to a multi-word phrase (e.g. the word "thirteen" is translated to "שלוש עשרה").
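To make the table structure concrete, here is a minimal Python sketch of how such a phrase table could be stored and queried. The rows are copied from the excerpt above; the function names `build_phrase_table` and `best_translation` are my own, hypothetical, and not taken from any translation toolkit.

```python
from collections import defaultdict

# Rows copied from the excerpt above: (source phrase, target phrase, score).
ROWS = [
    ("day", "יום", 1.0),
    ("April", "אפריל", 1.0),
    ("bright", "צח", 0.58),
    ("bright", "בהיר", 0.42),
    ("cold", "קר", 0.7),
    ("cold", "צונן", 0.3),
    ("thirteen", "שלוש עשרה", 0.41),
    ("thirteen", "שלושה עשר", 0.21),
    ("thirteen", "13", 0.38),
]

def build_phrase_table(rows):
    """Group the rows by source phrase: {source: {target: score}}."""
    table = defaultdict(dict)
    for source, target, score in rows:
        table[source][target] = score
    return dict(table)

def best_translation(table, source):
    """Return the highest-scoring target phrase, or None if the phrase is unknown."""
    options = table.get(source)
    if not options:
        return None
    return max(options, key=options.get)

phrase_table = build_phrase_table(ROWS)
print(best_translation(phrase_table, "cold"))
```

Note that the scores for each source phrase sum to 1, which is what you would expect if they are probabilities of the target phrase given the source phrase.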
The solution is, again, to use statistical methods: the sentence pairs are aligned at the word level using corpus statistics. The most basic alignment model is IBM Model 1. It goes over all the sentence pairs in the corpus and counts, for each source word, its occurrences in the same sentence pair with each target word, since every target word could be its translation. In the example sentence pair, the Hebrew word יום is counted once with every one of the English words It, was, a, bright, cold, day, in, April, and, the, clocks, were, striking, thirteen. If it appears in another sentence pair, for example, "איזה יום יפה" and "what a beautiful day", the word day will have two occurrences with יום. Since this is the true translation, the word day will occur in most of the sentence pairs in which the word יום occurs, while unrelated words will co-occur with it only by chance. These counts are used to estimate the probability of translating the source word to a target word. In some cases, an English word may have several possible translations, such as cold, which could be translated both to צונן and to קר. In this case, the English word cold will appear in some cases with צונן and in others with קר. The probability will be computed accordingly (and will be higher for the more common translation).
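The description above is simplified. As far as I know, IBM Model 1 is usually trained with expectation-maximization (EM): the co-occurrence counts are weighted by the current translation probabilities and re-estimated over several rounds. Here is a minimal Python sketch of that idea. The function names and the three-sentence toy corpus are hypothetical, and the sketch leaves out the special NULL word that the full model uses for target words with no source counterpart.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(target_word, source_word)] ~ P(target_word | source_word).

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Simplification: no NULL source word, so every target word must align
    to some real source word.
    """
    target_vocab = {tw for _, tgt in pairs for tw in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # start with no preference at all

    for _ in range(iterations):
        count = defaultdict(float)  # expected (target, source) co-occurrences
        total = defaultdict(float)  # expected occurrences of each source word
        for src, tgt in pairs:
            for tw in tgt:
                # Split one unit of "credit" for tw among the source words,
                # in proportion to the current translation probabilities.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    share = t[(tw, sw)] / norm
                    count[(tw, sw)] += share
                    total[sw] += share
        # Re-estimate the probabilities from the expected counts.
        t = defaultdict(float, {k: c / total[k[1]] for k, c in count.items()})
    return t

def best_target(t, source_word):
    """Return the most probable target word for source_word."""
    options = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(options, key=options.get)

# Hypothetical toy corpus (English -> Hebrew), not taken from real data.
corpus = [
    (["cold", "day"], ["יום", "קר"]),
    (["beautiful", "day"], ["יום", "יפה"]),
    (["beautiful", "house"], ["בית", "יפה"]),
]
t = train_ibm_model1(corpus)
print(best_target(t, "day"))
```

Even on such a tiny corpus, the word day appears with יום in two sentence pairs but with קר and יפה in only one each, so the probability mass for day drifts toward יום with every round.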
This is the basic model, and there are other IBM Models (2-5) that handle some of the problems that the basic model doesn't solve (e.g. considering the distance between aligned words). This phase's output is a word-to-word table; another algorithm is then applied to create a phrase table, merging multi-word expressions into one phrase (e.g. "hot dog", which is translated differently from "hot" and "dog").
(Image: The word-level alignment of a sentence pair.)
Putting it all together
The decoder is responsible for performing the actual translation: given the source sentence, it constructs a new sentence in the target language, using the translation model to offer phrase translations and their scores, and the language model to rank the fluency of the translation.
There are multiple ways to segment the source sentence into phrases (e.g., should "hot dog" be regarded as a phrase, or segmented into "hot" and "dog"?), and in most cases there are also multiple ways to translate each phrase in the source language to a phrase in the target language (e.g., should "cold" be translated to "צונן" or to "קר"?). In addition, the phrases in the target language may be re-ordered to follow the grammar rules of the target language (e.g. the adjective comes before the noun in English, but after the noun in many languages such as Hebrew, Romanian and French). The decoder tries many of these segmentations, translations and orders and produces candidate translations.
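To get a feeling for how many options the decoder faces, here is a small Python sketch that enumerates every way to split a sentence into contiguous phrases. The function name `segmentations` and the `max_len` limit are my own assumptions for illustration; this only covers the segmentation step, not translation or re-ordering.

```python
def segmentations(words, max_len=3):
    """Yield every split of `words` into contiguous phrases of at most max_len words."""
    if not words:
        yield []  # one way to segment nothing: the empty segmentation
        return
    for i in range(1, min(max_len, len(words)) + 1):
        head = " ".join(words[:i])  # take the first i words as one phrase
        for rest in segmentations(words[i:], max_len):
            yield [head] + rest

for option in segmentations(["I", "ate", "a", "hot", "dog"]):
    print(option)
```

Without a length limit, a sentence of n words has 2^(n-1) segmentations, and each phrase may have several translations and positions. This is why the decoder tries many of the options rather than all of them.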
Each candidate translation is scored by three components, which are multiplied together. The language model scores the translation according to its fluency in the target language. The re-ordering model (which we haven't discussed in detail) gives a score based on how the order of the words changes between the two languages. The last score is the one given by the translation model. Each phrase-to-phrase translation score is the probability of translating one phrase to the other, so the translation model's score for the entire sentence is the product of all the phrase translation scores. For example, if the source sentence is "It's not cold in April":
score(לא קר באפריל) = TM(לא,not) × TM(קר,cold) × TM(ב,in) × TM(אפריל,April) × LM(לא קר באפריל) × RM(לא קר באפריל, It's not cold in April)
And eventually the decoder would select the candidate translation with the highest score it could find.
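The scoring formula can be sketched in a few lines of Python. The translation scores for קר, צונן and אפריל come from the phrase table excerpt earlier in the post; every other number (the scores for "not" and "in", and the language model and re-ordering scores) is made up for illustration, and the function name `candidate_score` is hypothetical.

```python
import math

# TM scores keyed as (target phrase, source phrase), like TM(target, source) above.
tm = {
    ("לא", "not"): 1.0,        # hypothetical value
    ("קר", "cold"): 0.7,       # from the phrase table excerpt
    ("צונן", "cold"): 0.3,     # from the phrase table excerpt
    ("ב", "in"): 1.0,          # hypothetical value
    ("אפריל", "April"): 1.0,   # from the phrase table excerpt
}

def candidate_score(phrase_pairs, lm_score, rm_score):
    """Multiply the phrase translation scores by the LM and RM scores."""
    tm_score = math.prod(tm[pair] for pair in phrase_pairs)
    return tm_score * lm_score * rm_score

# Two candidates for "It's not cold in April"; LM and RM values are made up.
candidates = {
    "לא קר באפריל": candidate_score(
        [("לא", "not"), ("קר", "cold"), ("ב", "in"), ("אפריל", "April")],
        lm_score=0.02, rm_score=1.0),
    "לא צונן באפריל": candidate_score(
        [("לא", "not"), ("צונן", "cold"), ("ב", "in"), ("אפריל", "April")],
        lm_score=0.005, rm_score=1.0),
}
best = max(candidates, key=candidates.get)
print(best)
```

Multiplying many probabilities quickly produces tiny numbers, so in practice it is common to add log-probabilities instead of multiplying; the ranking of the candidates stays the same.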
As always, I'll end the post by hedging: I really haven't presented the entire world of translation, I just gave you a taste of it. I tried to simplify the basic models that I told you about, but they are a bit less simple than I described. Also, there are newer and more accurate models that involve machine learning techniques, or consider the syntax of the source and target sentences. I hope I conveyed the basics clearly and interestingly enough :)