Monday, August 31, 2015

Machine Translation Overview

Imagine you are at a restaurant in a foreign country: while trying to avoid tourist traps, you found yourself at a nice local restaurant in a quiet neighborhood, with no tourists except for you. The only problem is that the menu is in a foreign language... no English menu. What's the problem, actually? Pick your favorite machine translation system (Google Translate, Bing Translator, BabelFish, etc.) and translate the menu into a language you understand!

So, there's no need to elaborate on the motivation for translation. What I would like to do is give you an overview of how this magic works, and some idea of why it doesn't always work as well as you would expect.

The Tower of Babel, by Pieter Bruegel the Elder. Oil on board, 1563.
I'm going to focus on statistical machine translation. Translation means taking a sentence in one language (the source language) and producing a sensible sentence in another language (the target language) that has the same meaning. Machine means that it's done by software rather than a human translator. What does statistical mean?

It means that rather than coding expert knowledge into software and creating a lexicon and grammatical rules for translation between two specific languages, these systems are based on statistics gathered from texts in the source and target languages. This is what makes it possible to produce translations between any source and target languages without additional effort, and without having to hire someone who actually speaks these languages. The only thing you need is a large amount of text in both languages.

Statistical Machine Translation
What makes a translation good?
  • It is as similar as possible in meaning to the original sentence in the source language.
  • It sounds correct in the target language, e.g., grammatically.
The first demands that the translation is adequate and the second that it is fluent.

SMT systems have a component for each of these requirements. The translation model makes sure that the translation is adequate and the language model is responsible for the fluency of the translation in the target language.

Language Model
I mentioned language models in my NLP overview post. They are used for various NLP applications. A language model (of a specific language, say English) receives as input a sentence in English and returns the probability of composing this sentence in the language. This is a score between 0 and 1, determining how fluent a sentence is in English - the higher the score, the more fluent the sentence is. Language models (LM) can capture grammatical rules (LM("she eat pizza") < LM("she eats pizza")), correct word order (LM("love I cats") < LM("I love cats")), better word choice (LM("powerful coffee") < LM("strong coffee")), and even some logic and world knowledge (LM("good British food") < LM("good Italian food")). 
These models are obtained from a large corpus (structured set of texts) in the target language (e.g. English). In the next post I will elaborate on how this is done (edit 12/09/15: you can read in the next post how this is done).
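To make this concrete before the next post, here is a minimal sketch of one classic way to build such a model: a bigram language model estimated by counting word pairs in a corpus. The tiny corpus, the smoothing constant, and the vocabulary size below are all invented for illustration; real systems train on millions of sentences.

```python
from collections import Counter

# A tiny invented training corpus; a real LM uses millions of sentences.
corpus = [
    "she eats pizza", "she eats pasta", "i love cats",
    "i love pizza", "she loves cats",
]

# Count unigrams and bigrams, padding each sentence with a start symbol.
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def lm_score(sentence, alpha=0.1, vocab_size=1000):
    """Bigram probability of a sentence, with add-alpha smoothing
    so unseen bigrams get a small but non-zero probability."""
    words = ["<s>"] + sentence.split()
    score = 1.0
    for prev, cur in zip(words, words[1:]):
        score *= (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return score
```

With these counts, a grammatical word order like "i love cats" scores higher than the scrambled "love i cats", exactly the kind of preference described above.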

Translation Model
A translation model (from the source language to the target language) receives as input a pair of sentences / words / phrases, one in each language, and returns the probability of translating the first to the second. Like the language model, it gives a score between 0 and 1, determining how adequate a translation is - the higher the score, the more adequate the translation is.
These models are obtained from parallel corpora (corpora is the plural of corpus) - each corpus contains the same texts in a different language (one in the source language and one in the target language). I will elaborate on how this is done in another post (edit 29/09/15: you can read in this post how it is done).

Given these two components, the language model and the translation model, how does the translation work? The translation model provides a table with words and phrases in the source language and their possible translations to the target language, each with a score. Given a sentence in the source language, the system uses this table to translate phrases from the source sentence to phrases in the target language. 

There are multiple possible translations for the source sentence. First, because the source sentence can be segmented into phrases in multiple ways. For example, take the sentence Machine translation is a piece of cake. The most intuitive thing to do would be to split it into words. This will yield a very literal translation (in Hebrew: תרגום מכונה הוא חתיכה של עוגה), which doesn't make much sense. The translation table probably also has an entry for the phrase piece of cake, translating it to a word or an idiom with the same meaning in the target language (in Hebrew: קלי קלות. Ask Google).

Second, even for a certain segmentation of the source sentence, some phrases have multiple translations in the target language. This happens both because a word in the source language may be polysemous (have more than one meaning), e.g. piece, and because one word in the source language can have many synonyms in the target language, e.g. cake.

The translation system chooses how to segment the source sentence and how to translate each of its phrases to the target language, using the scores that the two models give the translation. It multiplies the translation model score of each phrase with the language model score of the entire target sentence, for example:

P(תרגום מכונה הוא חתיכת עוגה|Machine translation is a piece of cake) = TM(תרגום,translation) TM(מכונה,machine) TM(הוא,is) TM(חתיכת,piece) TM(עוגה,cake) LM(תרגום מכונה הוא חתיכת עוגה)

This score could be understood as the conditional probability of translating Machine translation is a piece of cake to תרגום מכונה הוא חתיכת עוגה, but I'll spare you the formulas. The intuition behind multiplying the scores of the different translation components is the joint probability of independent events.
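A minimal sketch of this scoring, with all the numbers invented for illustration: each candidate translation multiplies its phrase-level translation model scores by the language model score of the whole target sentence, and the candidate with the highest product wins.

```python
import math

def translation_score(tm_scores, lm):
    """Combined score of a candidate translation: the product of the
    translation model score of every phrase pair, times the language
    model score of the full target sentence."""
    return math.prod(tm_scores) * lm

# Two hypothetical candidates for translating "a piece of cake"
# (all scores invented for illustration).
# Word-by-word: each word translates well on its own, but the
# resulting target sentence is disfluent, so its LM score is low.
literal = translation_score([0.9, 0.8, 0.9, 0.7], lm=0.001)
# Idiomatic: a single phrase pair with a lower TM score, but the
# target sentence is fluent, so its LM score is higher.
idiomatic = translation_score([0.4], lm=0.05)

best = "idiomatic" if idiomatic > literal else "literal"
```

Note how the language model can overrule the translation model here: the literal candidate has higher phrase scores, but its disfluent target sentence drags the product down.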

Some things to note: the word of disappeared from the translation, and the words machine and translation switched places in the target sentence. These things happen and are allowed. Machine translation is a bit more complex than what I've told you. Just a bit :)

So each possible translation receives a final score, indicating both how adequate the translation is and how fluent it is in the target language, and the system chooses the translation with the highest score. In this case, Google ironically gets this one wrong.

Google ironically translates "Machine translation is a piece of cake" incorrectly.

Why is it a really bad idea to rely on machine translation when you wish to speak or write in a language that you don't speak?

Because you may say things that you don't mean. 

I'll give some examples of problems in translation.

Ambiguity - as you probably remember, this problem keeps coming back in every NLP task. In translation, the problem is that a polysemous word in the source language may be translated to different words in the target language for its different senses. For example, wood can be translated into Hebrew as עץ (a piece of a tree) or as יער (a geographical area with many trees). While a human translator can pick the correct translation according to context, machines find it more difficult.

It gets even worse when you use a polysemous word in its less common meaning. A few months ago I needed to send an email to the PC (program committee) chairs of the conference in which I published my paper. I noticed something funny about my email, and I had to check how Google Translate would handle it. My email started with "Dear PC chairs". I translated it to Hebrew (and back to English, for the non-Hebrew speakers in the audience):

Dear PC chairs => כסאות מחשב יקרים => expensive computer chairs

Don't expect SMT systems to always understand what you mean

So what happened here? The word chair has two meanings; I meant the less common one (chairman), while Google translated it in the more common sense (furniture). Acronyms are much worse when it comes to polysemy, and PC refers, almost 100% of the time, to Personal Computer. On top of that, the adjective dear is translated in Hebrew to יקר, which means both dear and expensive. Google chose the wrong sense, creating a funny translation. However, given the knowledge of how SMT systems work, it's understandable that selecting the more common senses of words yields better scores for both the language model and the translation model. I can't blame Google for this translation.

This is just one example of a problem in machine translation. There are many other problems: different languages have different word order (e.g. the adjective comes before the noun in English, but after the noun in Hebrew, French and many other languages); in some languages nouns have gender while in others they don't; idioms are really tough for SMT systems - sometimes they are translated literally, like in the piece of cake example (when it was part of a sentence).

A good translation for an idiom.
These problems are handled by more complex machine translation systems that enable word re-ordering and translation at the syntactic level. Nevertheless, as you probably notice from time to time, this task is not yet performed perfectly.

Since machine translation systems are not very accurate, it is very funny to translate a sentence into a random foreign language and back to English several times and see how often you get a totally different sentence (sometimes a meaningless one) at the end of this process. This is what Bad translator does. I tried it several times, and it was very amusing. Their example from the Ten Commandments inspired me to try other commandments, resulting in very funny bad translations:

Thou shalt not make unto thee any graven image => You can move the portrait
Thou shalt not kill => You must remove.
Thou shalt not commit adultery => Because you're here, try three
Thou shalt not steal => woman

And some good ones:

Remember the sabbath day, to keep it holy => Don't forget to consider Saturday.
Honour thy father and thy mother => honor your father and mother

You are welcome to try it and post your funny bad translations in the comments!

Friday, August 21, 2015


How likely are you to read this through? If your answer is a numerical value between 0 and 1, you may skip this post. You already know the material.

Why am I writing about probability? First of all, because I really LOVE probability. If you don't, I hope that by the end of this post you will like it a little bit more. Second, we use probability in everyday life: when we plan an outdoor activity, we estimate the probability of rain; when we make life decisions, we think of the probable consequences, since we can't tell the future... Unfortunately, most people use it incorrectly, without a basic understanding of probability.
And last, I was about to write a post about Machine Translation, and realized that I can't explain anything without first introducing probability. Probability is widely used in NLP, as in many other computer science fields.

Probability is a numerical value between 0 and 1 measuring the likeliness that an event will occur. 0 means that the event will not occur, and 1 means that the event will certainly occur. An intuitive example is tossing a coin; there are two possible outcomes: "heads" and "tails". If the coin is fair, the probability (chance) of each outcome is ½ (50%): P(heads) = P(tails) = ½.

A fair coin and the two outcomes of its tossing.

In general, when conducting an experiment, such as tossing a coin, there are several possible outcomes (e.g. { heads, tails }). The event of each outcome (e.g. "the coin landed on heads") has a probability between 0 and 1. Since every experiment must have an outcome, the probability that any of the possible outcomes occurred is 1. In this example, P("heads or tails") = 1. This event represents the entire "probability space".
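The coin-tossing example can also be checked empirically: simulate many tosses and the fraction of heads approaches ½. A quick sketch (the number of tosses and the seed are arbitrary choices for the illustration):

```python
import random

random.seed(0)  # fixed seed so the simulated experiment is reproducible

# Toss a fair coin 100,000 times and count the fraction of heads.
tosses = [random.choice(["heads", "tails"]) for _ in range(100_000)]
p_heads = tosses.count("heads") / len(tosses)
# With a fair coin, the estimate lands close to 0.5.
```

The more tosses you simulate, the closer the estimate gets to the true probability of ½.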

As you can see from the example above, an event can be composed of several outcomes. Think about another experiment: rolling a die. The possible outcomes are: {1, 2, 3, 4, 5, 6}. The event "the outcome is an odd number" is composed of the outcomes 1, 3, and 5. We can write it as A={1,3,5}.

A die with 3 out of 6 possible outcomes showing.

If two events have no common outcomes, they are called disjoint, and the probability that any one of them will occur is the sum of probabilities that each of them will occur. For example, A={1} and B={2}. P(A or B), denoted P(A ∪ B), is the probability that either A or B occurred. We know that a die can only show one number, so A and B can't both occur, and:
P(A ∪ B) = P(A) + P(B).

We already know that the probability of the entire probability space (all possible outcomes) is 1, so: P({1, 2, 3, 4, 5, 6}) = 1. We also know that the single-outcome events are disjoint, therefore P({1, 2, 3, 4, 5, 6}) equals the sum of the probabilities of all the outcomes. If the die is fair (it is not biased towards a certain outcome), then all outcomes have equal probability. Therefore, P(1) = ... = P(6) = ⅙. This is called a uniform distribution. In most real-world examples this is not the case; otherwise, probability would be boring (and probability is fascinating! Really!).

Every event has a complement. For example, the event A="the die shows an odd number"={1, 3, 5} has the complement Ā="the die shows an even number"={2, 4, 6}. The event B={1} has the complement B̄={2, 3, 4, 5, 6}. The complement of an event consists of all the other possible outcomes. Notice that, by definition, "A or Ā" is the entire probability space, and that A and Ā are disjoint. Using the two properties we've just discussed:
1) the probability of the entire probability space is 1.
2) the probability that any of disjoint events occurred is the sum of their probabilities.
We can tell that P(A) + P(Ā) = 1. So if you know the probability that it would rain tomorrow P(R), you also know the probability that it won't rain tomorrow: 1 - P(R).

Joint & Conditional Probabilities
We can also discuss the joint probability of events A and B: this is the probability that both events will occur. For example, what is the probability of rolling an even number which is bigger than 2? Let's define two events. A is the event of even outcomes: A = {2, 4, 6}, and B is the event of outcomes larger than 2: B = {3, 4, 5, 6}. Then C is the intersection of A and B, denoted A ∩ B: it contains all the outcomes that are both even (in A) and larger than 2 (in B): C = {4, 6}. Since {4} and {6} are disjoint, and the probability of each outcome is ⅙, then P(C), which is also denoted as the joint probability of A and B, P(A, B) = P({4}) + P({6}) = ⅙ + ⅙ = ⅓.
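The die example above can be computed directly by enumerating outcomes. A sketch for a fair die, where an event is simply a set of outcomes:

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event under a fair die: the number of
    outcomes in the event divided by the total number of outcomes."""
    return Fraction(len(event & outcomes), len(outcomes))

A = {2, 4, 6}     # even outcome
B = {3, 4, 5, 6}  # outcome larger than 2

# The joint probability P(A, B) is the probability of the
# intersection A ∩ B = {4, 6}: 2/6 = 1/3.
joint = prob(A & B)
```

Using `Fraction` keeps the arithmetic exact, so the result is literally ⅓ rather than a rounded decimal.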

Say that you know that event A occurred; for example, you know that it rains today. Does it change the probability that another event B will occur, for example, that you will be late for work today? The probability that event B will occur, given that event A occurred, is the conditional probability P(B|A) ("B given A"). If A and B are dependent, this probability is different from the prior probability of B, P(B) (the probability of B without any knowledge about A). The conditional probability of B given A is the probability that A and B occur together, divided by the probability that A occurs:

(1) P(B|A) = P(A,B) / P(A)

For example, when rolling a die, let A={1,3,5} (odd outcome) and B={4,5,6} (outcome greater than 3). Then P(A,B) = P(A ∩ B) = P({5}) = ⅙, and P(A) = ⅙ + ⅙ + ⅙ = ½.
Therefore, P(B|A) = P(A,B) / P(A) = ⅙ / ½ = ⅓ < P(B) = P({4, 5, 6}) = ⅙ + ⅙ + ⅙ = ½. So if you know that the outcome was odd, the probability that it was greater than 3 is reduced.

On the other hand, some events may be independent. For example, if B is the event of outcomes larger than 2, B = {3, 4, 5, 6}, and A remains the same, then P(A,B) = P(A ∩ B) = P({3,5}) = ⅙ + ⅙ = ⅓. P(A) remains ½, and P(B|A) = P(A,B) / P(A) = ⅓ / ½ = ⅔ = P(B) = P({3, 4, 5, 6}) = ⅙ + ⅙ + ⅙ + ⅙ = ⅔. So knowing that the outcome was odd doesn't affect the chances that the outcome is greater than 2, and A and B are independent.

If two events A and B are independent, then P(B|A) = P(B), P(A|B) = P(A), and using equation (1) we get that P(A,B) = P(A)P(B). So if you know that two events are independent, and you want to know the probability that both of them will occur, you need to multiply the probabilities that each of them will occur. For example, what is the probability that a die will have an odd outcome (A) and a coin will show heads (B)? Intuitively, these two experiments are independent, so P(A) = ½, P(B) = ½, and P(A,B) = ½ * ½ = ¼. But don't trust your intuition, and always make sure that these events are really independent. Sometimes two events seem independent, while they are actually not (as in the butterfly effect).
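Both die examples can be checked with the same set-based approach, computing P(B|A) directly from equation (1):

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event under a fair die."""
    return Fraction(len(event), len(outcomes))

def cond(b, a):
    """Conditional probability P(B|A) = P(A, B) / P(A), equation (1)."""
    return prob(a & b) / prob(a)

A = {1, 3, 5}        # odd outcome
B1 = {4, 5, 6}       # greater than 3: dependent on A
B2 = {3, 4, 5, 6}    # greater than 2: independent of A

# Knowing the outcome is odd lowers the chance it is greater than 3...
dependent = cond(B1, A) != prob(B1)    # 1/3 vs 1/2
# ...but tells us nothing about whether it is greater than 2.
independent = cond(B2, A) == prob(B2)  # 2/3 vs 2/3
```
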

If this butterfly flapped its wings yesterday, what are the chances I will be late to work next week?

Bayes Rule
Using equation (1) we get that P(A,B) = P(A) * P(B|A). Since P(A,B) = P(B,A), we can write this in both directions, P(A) * P(B|A) = P(B) * P(A|B), and dividing by P(B) gives us Bayes rule:

(2) P(A|B) = P(A) * P(B|A) / P(B)

This can be useful in some cases when you know the conditional probability in one direction and would like to compute the other. For example, let's say that there is a clinical test that should diagnose a specific illness. This test is not 100% accurate: if a person is ill, it will come out positive 98% of the time. If the person is healthy, it will come out positive 1% of the time. The ratio of ill people in the population is 2%. Say that someone took this test, and it came out positive. Does it necessarily mean that he is sick? No. It has some probability, which we can compute.

A is the event that a person has this illness. B is the event that the test came out positive. We know that P(B|A)=0.98 (the probability that the test came out positive for an ill person). We also know that P(A)=0.02 (the probability to suffer from this illness). We would like to compute the probability P(A|B). We can use equation (2), but we need to know P(B) - the probability that the test came out positive.

We can use the law of total probability according to which:

(3) P(B) = P(A,B) + P(Ā,B) = P(B|A) P(A) + P(B|Ā) P(Ā)

In this example, what is the probability that the test came out positive? There are two cases, one in which the person is ill, and another in which he is healthy. These events are disjoint (because a person is either ill or healthy but never both). 

So we get that P(B) = P(B|A) P(A) + P(B|Ā) P(Ā) = 0.98*0.02 + 0.01*(1-0.02) = 0.0294, and using Bayes rule, P(A|B) = P(A) * P(B|A) / P(B) = 0.02*0.98 / 0.0294 ≈ ⅔. Since the test is not very accurate, and the illness is so rare, if someone tests positive for this illness, there is a probability of ⅓ that he is actually healthy!
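The whole computation fits in a few lines (the numbers come from the example above):

```python
p_ill = 0.02          # P(A): prior probability of having the illness
p_pos_ill = 0.98      # P(B|A): positive test given the person is ill
p_pos_healthy = 0.01  # P(B|Ā): positive test given the person is healthy

# Law of total probability, equation (3):
# P(B) = P(B|A) P(A) + P(B|Ā) P(Ā)
p_pos = p_pos_ill * p_ill + p_pos_healthy * (1 - p_ill)

# Bayes rule, equation (2): P(A|B) = P(A) * P(B|A) / P(B)
p_ill_given_pos = p_ill * p_pos_ill / p_pos
# ≈ 0.667: a positive result means only about a 2/3 chance of illness.
```

Changing `p_ill` shows how strongly the prior matters: the rarer the illness, the less a positive test actually tells you.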

The Chain Rule
We've seen that P(A,B) = P(A) * P(B|A). Sometimes it is useful in this direction as well. This is called the chain rule, and it can be extended to more than two events. In some cases, we would like to compute the probability of multiple events, rather than just one or two. For example, say that you know the probabilities of first names in a certain country. When a child is born, there is a certain probability that he will be named John (P(John)), and other probabilities for other names. If you know the names of his older siblings, this may affect the probability of his name: if his older sibling is called John, it reduces the probability that he will also be named John. And if his sister's name is Ablah, then he is more likely to be named Mohammad than David. If you want to compute the probability of the names of all children in the family, for example P(John, Jane, David), you can use the chain rule. You will need to know the prior probability of the name John (the probability that a kid is called John if you don't have any knowledge about his siblings, or if he is the first child). Then, you will need to know the probability of a girl being called Jane, given that her brother's name is John. Last, you will need to know the probability of a boy being called David, given that he has two siblings named John and Jane. In general, the probability that events A1, A2, ..., An occurred is the product of the probabilities of each event, given that the previous events occurred:

(4) P(A1,A2,...,An) = P(A1) P(A2|A1) ... P(An|A1,A2,...,An-1)

This, again, is called the chain rule.

In some cases, we can make a Markov assumption that the probability of an event depends only on the preceding k events (for some fixed number k). For example, if a family has 5 children, the probability of their names will be: 
P(A1,A2,A3,A4,A5) = P(A1) P(A2|A1) P(A3|A1,A2) P(A4|A1,A2,A3) P(A5|A1,A2,A3,A4)
But if we assume that a child's name only depends on his two immediate older siblings' names, then we get:
P(A1,A2,A3,A4,A5) = P(A1) P(A2|A1) P(A3|A1,A2) P(A4|A2,A3) P(A5|A3,A4)
which is easier to compute.
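A sketch of the names example under a Markov assumption with k = 1 (a name depends only on the immediately older sibling's name); all the probabilities below are invented for illustration:

```python
# Hypothetical prior probabilities of first names (invented numbers).
p_first = {"John": 0.05, "Jane": 0.04, "David": 0.03}

# Hypothetical conditional probabilities under a k = 1 Markov
# assumption: P(name | immediately older sibling's name).
p_given = {
    ("John", "Jane"): 0.06,   # P(Jane | older sibling John)
    ("Jane", "David"): 0.05,  # P(David | older sibling Jane)
}

def chain_prob(names):
    """Chain rule with a k = 1 Markov assumption:
    P(A1, ..., An) = P(A1) * P(A2|A1) * ... * P(An|An-1)."""
    prob = p_first[names[0]]
    for prev, cur in zip(names, names[1:]):
        prob *= p_given[(prev, cur)]
    return prob

# P(John, Jane, David) = 0.05 * 0.06 * 0.05 = 0.00015
family = chain_prob(["John", "Jane", "David"])
```

The Markov assumption is what keeps the tables small: instead of one entry per full sibling history, we only need entries conditioned on a single previous name.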

Let's return to the names example. What if you don't know the probability function of names, but you do have access to the list of all given names in a certain time and country? You can estimate (approximate) the probability. One simple way of doing this is by counting; this method is called Maximum Likelihood Estimation (MLE). If you want to know the probability of a child being named John, check the ratio of people called John in the entire population: P*(John) = #John/N, where N is the number of people in the list you have. Since this is not the real probability but an approximation, it is denoted by P* rather than P.

If you want to know the probability of someone being called Jane given that her immediate older sibling's name is John, count all the pairs of John followed by Jane and divide by all pairs of John followed by any name:  P*(Jane|John) = #(John,Jane)/#(John,*).
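A sketch of this counting estimate, on a hypothetical list of (older sibling, younger sibling) name pairs; in practice the pairs would come from real name records:

```python
# Hypothetical (older sibling, younger sibling) name pairs.
pairs = [
    ("John", "Jane"), ("John", "David"), ("John", "Jane"),
    ("John", "Mary"), ("Mary", "John"), ("David", "Jane"),
]

def mle_cond(younger, older):
    """MLE estimate P*(younger | older) = #(older, younger) / #(older, *):
    count the pairs (older, younger), divide by all pairs starting
    with that older sibling's name."""
    joint = sum(1 for o, y in pairs if o == older and y == younger)
    total = sum(1 for o, _ in pairs if o == older)
    return joint / total

# 2 of the 4 pairs starting with John are followed by Jane -> 0.5
p_jane_given_john = mle_cond("Jane", "John")
```
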

There are more complex methods to approximate a probability function, but I think that's enough for one post. 

So, given that you are reading this sentence now, what is the probability that you read the entire post?