Tuesday, September 29, 2015

Translation Models

This is the last part of the machine translation overview, in which I will discuss translation models. To recall, a statistical machine translation system produces a translation that is required to be both adequate, that is, as close as possible in its meaning to the source sentence, and fluent in the target language. Fluency is the responsibility of the target language model, that scores a every candidate translation according to its likelihood in the target language. The translation model, which will be presented in this post, takes care of adequacy: it scores candidate translations with respect to the original sentence in the source language - higher scores for sentences that better preserve the meaning of the original sentence.

Toilet sign at a restaurant in Mestre, Italy. Some kind of machine translation was used, translating toilet in Hebrew to the makeup. If you recognize funny translations in other languages, please comment!
As in language models, you don't need an expert to build the translation model. You don't even need to speak either the source or the target language. Using statistical methods, you can (theoretically), build a translation model from Swahili to Yiddish. The only requirement is to have a parallel corpus - a large amount of the same text, written in both languages. For example, movie subtitles or book translations in both languages. The texts are usually aligned at the sentence-level, so it can be regarded as a large collection of sentences in the source language and their translations to the target language. For example, the first sentence from George Orwell's novel 1984, in the original edition and in the Hebrew translation:

en: It was a bright cold day in April, and the clocks were striking thirteen.
he: יום אפריל צח וצונן, השעונים מצלצלים שלוש-עשרה.

can be considered as mutual translations. So do the rest of the sentence-pairs, as long as the translator is not too creative.

History Lesson
Here's a nice anecdote about using a parallel corpus for translation it's actually not a modern technique at all. It has been here since the 19th century. The Rosetta Stone is an ancient Egyptian stone inscribed with a decree issued at Egypt, in 196 BC. The text on the stone is written in three scripts: ancient Egyptian hieroglyphs, Demotic script, and ancient Greek. Ancient Egyptian hieroglyphs were used until the end of the fourth century, after which the knowledge of how to read them was lost. For hundreds of years, scholars have tried to decode the ancient Egyptian hieroglyphs. In 1799, the Rosetta stone was rediscovered near the town of Rosetta in the Nile, and brought with it a major advancement in the decoding of the ancient Egyptian hieroglyphs. It was the recognition that the stone offered three versions of the same text that enabled the advancement, making it the first parallel corpus used for statistical translation (at this time, without machines). It was finally decoded in 1822 by the French scholar Jean-François Champollion. The stone is on public display at the British Museum (and is the most interesting exhibit there, in my opinion).

The Rosetta Stone

Learning the translation model
Using sentence pairs from a parallel corpus as a translation table is nice, but not enough. You can always generate a sentence in the source language that didn't occur in the corpus, so it wouldn't be in the table. However, a sentence is composed of phrases (words and multi-word expressions), so instead of constructing a sentence translation table, a phrase translation table could be built, enabling a phrase-by-phrase translation. If the corpus is large enough, you can assume that it covers at least most of the common words and phrases in these languages.

This is what an excerpt from a phrase table from English to Hebrew might look like:

thirteenשלוש עשרה0.41
thirteenשלושה עשר0.21
Each entry contains a source language phrase, a target language phrase and the score (probability) of translating the source phrase to the target phrase. These are not trivial to compute, since the corpus is aligned at the sentence level. All we know is that "יום אפריל צח וצונן, השעונים מצלצלים שלוש-עשרה" is a (possible) translation of "It was a bright cold day in April, and the clocks were striking thirteen", but we don't know which words in English are translated to which words in Hebrew. The assumption is that each word in the source sentence is translated to 0, 1 or more words in the target language. In the simple case, it is translated to one word. In other cases, a word may disappear in translation (for example, the determiner "a" in English doesn't exist in Hebrew) or be translated to a multi-word phrase (e.g. the word "thirteen" is translated to "שלוש עשרה").

The word-level alignment of a sentence-pair.
The solution is, again, to use statistical methods. In particular, aligning these sentence pairs at the word level using the corpus statistics. The most basic alignment model is IBM model 1. It goes over all the sentence pairs in the corpus, and counts for each source word its occurrences in the same sentence pair with target words - since every target word could be its translation. In the example sentences-pair, the Hebrew word יום is counted once with every one of the English words It, was, a, bright, cold, day, in, April, and, the, clocks, were, striking, thirteen. If it appears in another sentence pair, for example, "איזה יום יפה" and "what a beautiful day", the word day will have two occurrences with יום. Since this is the true translation, the word day will occur in every sentence pair in which the word יום occurs. These counts are used to estimate the probability of translating the source word to a target word. In some cases, an English word may have several possible translations, such as cold that could be translated both to צונן and קר. In this case, the English word cold will appear in some cases with צונן and in others with קר. The probability will be computed accordingly (and will be higher for the more common translation).

This is the basic model, and there are other IBM models (2-5) that handle some of the problems that the basic model doesn't solve (e.g. considering the distance between aligned words). This phase's output is a word-to-word table, and then another algorithm is applied to create a phrase table, merging multi-word expressions to one phrase (e.g. "hot dog" which is translated differently from "hot" and "dog"). 

Putting it all together
The decoder is responsible for performing the actual translation: given the source sentence, it constructs a new sentence in the target language, using the translation model to offer phrase translations and their scores, and the language model to rank the fluency of the translation.

There are multiple ways to segment the source sentence to phrases (e.g., should "hot dog" be regarded as a phrase, or segmented to "hot" and "dog"?), and in most cases there are also multiple ways to translate each phrase in the source language to a phrase in the target language (e.g., should "cold" be translated to "צונן" or to "קר"?). In addition, the phrases in the target language may be re-ordered to follow grammar rules in the target language (e.g. adjective before noun in English, but after noun in many languages such as Hebrew, Romanian and French). The decoder tries many of these segmentations, translations and orders and produces candidate translations.

Each candidate translation is scored by three components: the language model scores the translation according to its fluency in the target language. The re-ordering model (which we haven't discussed in details) gives a score based on the changes in the order of words in both languages. The last score is the one given by the translation model. Each phrase-to-phrase translation score is the probability to translate one phrase to the other. So the translation model's score for the entire sentence is the product of all phrase translation scores, for example, if the source sentence is "It's not cold in April":

score(לא קר באפריל) = TM(לא,not) TM(קר,cold) TM(ב,in) TM(אפריל,April) LM(לא קר באפריל) RM(לא קר באפריל, It's not cold in April)

And eventually the decoder would select the candidate translation with the highest score it could find.

As always, I'll end the post with hedging myself by saying that I really haven't presented the entire world of translation, just gave you a taste of it. I tried to simplify the basic models that I told you about, but they are a bit less simple than I described. Also, there are newer and more accurate models that involve machine learning techniques, or consider the syntax of the source and target sentences. I hope I could convey the basics clearly and interestingly enough :)

Saturday, September 12, 2015

Language Models

In my previous post about Machine TranslationI mentioned how language models are used in statistical machine translation. Language models are used in many NLP applications. In this post, I will explain about language models in general, and how to learn a certain kind of language models: n-gram language models.

A language model is for a specific language, for example, an English language model. It receives as input a sequence of words in English (sentence / phrase / word). For simplicity, let's say it receives a sentence. The language model score for a sentence s, P(s), is a score between 0 and 1, that can be interpreted as the probability of composing this sentence in English. This score determines how fluent s is in English; the higher the score, the more fluent the sentence is. Language models can capture some interesting language phenomena: 
  • Which sentence is grammatically correct? - P("he eat pizza") < P("he eats pizza") 
  • Which word order is correct? - P("love I cats") < P("I love cats")
and even some logic and world knowledge:
  • What is more likely? - P("good British food") < P("good Italian food")

It can also tell you that pdf is the fourth largest religion:

Google suggests words that are likely to complete the query. From here.

Learning a language model
What does it take to build such a language model? Just a large English text corpus (a large and structured set of texts). We are interested in the probability of sentences, words and phrases in the language, but we don't know the real distribution of words and sentences in the language. We can use a large-enough corpus to estimate this probability. The basic method is to use relative frequency (Maximum Likelihood). The probability of a certain word w to occur in English, p(w) is approximated by the ratio of the occurrences of w in the corpus (the number of occurrences of w / the number of any word occurrence). For example: 
  • The word cat occurred 3853 times, out of total of 100,000,000 words, so its estimated probability is 0.00003853.
  • The word no, on the other hand, occurs more frequently: 226,985 times. So its probability is 0.00226985, and therefore when you compose a sentence in English, you are much more likely to say the word no than cat.
But we are also interested in computing the probability of multi-word expressions, phrases and sentences. Since any of them is simply a sequence of words, we can use the chain rule to compute the probability:1  

(1) P(A1,A2,...,Am) = P(A1) P(A2|A1) ... P(An|A1A2,...,Am-1)

where P(Ai|A1A2,...,Ai-1) denotes the probability that the word Ai is the next word in the sequence A1A2,...,Ai-1. For example, P(I love my cat) = P(I) P(love|I) P(my |I love) P(cat |I love my). We can assume that the words are independent of each other, and get a much simpler formula: 

(2) P(A1,A2,...,Am) = P(A1) P(A2) ... P(Am)

So whenever you pick an extra word to continue your sentence, you choose it by its distribution in the language and regardless of the previous words. This doesn't make much sense though. The probability of the word cat is lower than that of the word no. However, in the context of the incomplete sentence "I love my", the word cat is much more likely to complete the sentence than the word no

To estimate the conditional probability of a word Ai (cat) given any number of preceding words A1,A2,...,Ai-1 (I love my), we need to count the number of occurrences of Ai after A1,A2,...,Ai-1 (how many times the sentence "I love my cat" appears in the corpus) and divide it in the number of times that A1,A2,...,Ai-1 occurred with any following word (how many times the sentence "I love my *" appears in the corpus for any word *). You would expect that P(cat|I love my) would be higher than P(no|I love my).

You would also see that the conditional probability P(cat|I love my) is different from the prior probability P(cat). I'm not sure if it would be higher though; but I'm sure that P(cat|Persian) > P(cat): you are more likely to say "cat" if you already said "Persian", than just like that out of the blue.

However, assuming that every word in the sentence depends on all the previous words is not necessary, and it causes a problem of sparsity. In simple words, there is not enough data to estimate the probabilities. In order to compute the probability of the word cat to complete the sentence My friend John once had a black, you would need the sequences "My friend John once had a black" and "My friend John once had a black cat" to actually appear in the corpus. The corpus is big, but it doesn't contain any sentence that anyone has ever said.  

What's the solution? Markov assumption. We can assume every word only depends on k preceding words. For example, if k=1, we get:

(3) P(I love my cat) = P(I) P(love|I) P(my|love) P(cat|my)

This kind of language model is called n-gram language model, where an n-gram is a contiguous sequence of n words.2  The model works with n-grams, so the assumption is that every word depends on the preceding (n-1) words. For example, a unigram (n=1) language model considers the words independent of each other (P(I love my cat) = P(I) P(love) P(my) P(cat)). A bigram language model assumes that every word depends on the previous word (P(I love my cat) = P(I) P(love|I) P(my|love) P(cat|my)). There are also trigram (n=3) and 4-gram language models; larger ns are less commonly used, to the best of my knowledge.

While choosing a small n reduces the sparsity, it doesn't solve the problem completely. Some rare words (e.g. absorbefacient, yes, it's an actual word) or n-grams (e.g. blue wine) may never occur in the corpus, but still be valid in the language. If we use a word's relative frequency as its probability, a word that never occurs in the corpus receives zero probability. If a sentence contains such a word (e.g. I went to the pub and ordered a glass of blue wine), its probability will be zero. While we would probably like this sentence to have a very low probability, we wouldn't want it to be zero; we are aware of the fact that our corpus may be missing some valid English words.

Smoothing solves this problem. The simplest smoothing technique "hallucinates" additional k occurrences of every word in the sentence. For example, add-1 smoothing would consider that the word absorbefacient occurred once (if it hasn't occurred at all in the corpus), and that the word cat occurred 3854 times (when it actually occurred 3853 times). The new probability is:

(4) P(cat) = (3853 + 1) / (100,000,000 + V)

where V is the size of the vocabulary (the number of words we "added" to the corpus).
The same thing applies for n-grams. With this new formula, the probability of unseen words (and n-grams) is small, but never zero.

And as always, there are more complex smoothing techniques (Back-off, Kneser-Ney, etc.), that I will not discuss in this post.

Do you want to try it yourself? I implemented a simple language model for this post.3 Type a sentence, hit the button and you'll get the probability of the sentence (after a while...). Try it!

What can you do with language models?
As the demo shows, you can compute the probability of a sentence in a certain language.
As I explained in the previous post, statistical machine translation systems use a language model of the target language to prefer translations that are more fluent in the target language.

In the other direction, language models can be used to
generate the next word in a sequence of words, by sampling from the distribution of words (given the previous word). It can complete your search query or suggest corrections to your text messages. One of the funnest things is to generate a whole new sentence. To illustrate, I used my language model and generated the not-very-sensible sentences4:

Who indulge yourself.
The american opinion have attachments of miners from hard lump sum of his search far as to ? and considerable number of the cavity.
As in Massachusetts.
To start a pretext for which the orbit and rostov smiled at mrs. rucastle seemed to give the society it must be the European settlement?'' said he was all men were to himself to your monstrosity such things he had already passed in the contrary to which alone , beyond the tarsal bones is she 's eyes were drawn up because he wanted.
It was about 87,000 soldiers.
sentences generated with a bigram language model

Better sentences could be generated with larger n or with smarter (not n-gram) language models. Anyway, generating sentences can fill up hours of fun.

Choosing the corpus from which you learn the language model greatly affects the final outcome. Needless to say, choosing the language of the corpus is crucial. If you want a French language model, you need a French corpus, etc. Furthermore, if you base your language model on Shakespeare's writings, and then try to use it to estimate the probabilities of recent posts by your Facebook friends, they will probably be very unlikely. If your corpus is from the medical domain, sentences with medical terms will have a higher probability than those discussing rock bands. So you must choose your corpus carefully according to your needs. For purposes such as machine translation, the corpus should be general and contain texts from diverse domains. However, if you develop a machine translation system for a specific application, e.g. a medical application, you may want your corpus to contain relevant documents, for instance medical documents.

Having trained your language model on a very specific corpus, e.g. a corpus of recipes or a corpus of all the songs by The Smiths, you can go along and generate a whole new sequence of words. If your language model is good enough, you might get a brand new recipe (dare to try it?) or a song by the Smiths that Morrissey has never heard about.5   In fact, language models don't have to be trained on a text corpus at all. You can train them on musical notes and compose a new melody. Here are some examples for melodies that were generated from musical notes n-gram language model.6 

And just so you won't think that n-gram language models are the state-of-the-art: there are other language models, some perform much better. Maybe I'll mention some of them in other posts.

1 To be more accurate, P(A1) represents the prior probability of A1 (the probability of the word A1 to occur in English), while we are interested in the conditional probability of A1 given the beginning of a sentence. Therefore, the beginning of each sentence in the corpus is marked with a special sign <S>, and P(A1) is replaced by P(A1|<S>). This was omitted from the rest of the formulas for simplicity.

2 A good source for n-grams count is Google Ngrams, extracted from Google Books. 

3 The language model in the demo is a bigram language model with add-1 smoothing. I trained it using the corpus big.txt from Peter Norvig's website.

4 I started with the special sign <S> and sampled the next word from the distribution given the previous word, until period was sampled.  

5 In fact, I tried it, but it didn't work well because the corpus was too small. The Smiths were only active for 5 years and they don't have enough songs.

6 These examples are taken from Implementing A Music Improviser Using N-Gram Models by Kristen Felch and Yale Song. They were not the first to implement a musical n-gram model (I found a previous work, and I'm sure there are others), but they published some sample songs that are pretty good.