Wednesday, November 11, 2015

Recommender Systems

Recommender systems suggest users specific content according to their preferences, by predicting the users' rating or preference of items.

If this doesn't ring a bell, let me tell you how common recommender systems are in your world.
  • Shopping sites, such as AmazoneBay, and AliExpress, recommend items to purchase based on your previous purchases.*
Amazon recommended me to buy a recycling bin sticker because I had recycle bins in my shopping basket
  • IMDB recommends movies you might like, based on your other movie ratings.
  • IMDB recommended my husband to watch "Pirates of Silicon Valley", based on his interest in similar movies.

  • Music sites, such as Rdio, Spotify and the late GrooveShark, recommend artists similar to the artists you listen to.

Rdio suggested artists that it considered similar to those I've been listening to
  • And of course, YouTube recommends videos based on your previous watches.
YouTube identified that I like budgies and Muse and that I'm a beginner electric guitar player
You can't overestimate the business value of these recommendations. Studies have shown that they increase sales. So how does Amazon know which products are likely to interest me? How does Rdio know to recommend I listen to Kings of Leon and would never recommend listening to Taylor Swift? In general, these systems know that I like certain products/artists/movies/videos and would like to predict other products/artists/movies/videos I might like. Many of these systems use a quite simple algorithm, called Collaborative Filtering.

Collaborative Filtering

Let's take music recommendation as an example. The system has many registered users, and many artists. For simplicity, assume that rating applies to artists rather than albums or songs. Users can rate artists in a scale of 1-5. An average user would rate a few artists; there would still be many other artists that he doesn't rate, either because he doesn't know them or because he didn't listen to them through the website.

In many cases, the user doesn't actively rate the artist, but the system infers a rating based on the user's behavior - for example, if the user listened to a certain artist many times, it counts as high rating. On the other hand, if the system offered the user this artist a couple of times and he always clicked "skip", he must really not like this artist. The rating technique doesn't really matter.

User preferences are stored in a large matrix (table), in which each row represents a user, and each column represents an artist:

A sample user artist ranking matrix. with 5 users and 7 artists. The ranking is between 1 (hate) to 5 (love). If a user hasn't ranked a certain artist, the rank is unknown and is marked with a question mark. This is a toy example; much more data is needed for accurate inference.
The idea behind the algorithm is basically to get rid of the question marks in the matrix, and replace them with predicted ratings. The basic assumptions are:

  1. If I like a certain artist (e.g. Muse), I would like similar artists (e.g. Royal Blood).
  2. If I'm user A, and I agree with user B's ratings on many artists, I might agree with user B's ratings of other artists, which I haven't rated (and maybe don't know yet). For example, if both user B and I really love Taylor Swift and Miley Cyrus, and user B also really likes Bruno Mars, which I haven't rated, then I might also like Bruno Mars.
From these two reasonable assumptions, the algorithm diverges to two possible implementations:
  1. Based on user similarity
    This implementation looks at a certain user and tries to complete missing ratings for this user. According to the second assumption, all we need to do is measure to what extent this user's preferences are similar to those of any other user, and then use the similar users' ratings to complete missing ratings.

    It is easy to see that each user is represented by a row vector. The dimension of this vector is the number of artists, and every cell in the vector represents the user's rating of a certain artist. For example, this is user 2's vector:

    We can measure the correlation of user 2's ratings with another user's rating, by looking at the subset of artists that both users rated. We treat each rating with respect to the user's average rating of artists. For example, user 2's average rating is (1 + 5 + 4 + 4) / 4 = 3.5, so his rating of Arctic Monkeys is -2.5 below the average, while his rating for Taylor Swift is +1.5 above the average. High correlation between users occurs when they have many mutual ratings with similar distance from the average. I'll spare you the formula, but let's look at an example. It seems that users 1 and 3 share a fairly similar taste in music:
    Rating vectors of user 1 (top) and 3 (bottom). The mutual ratings are similar.
    While they both differ from user 5, that seems to hate both Arctic Monkeys and Muse:

    If we would like to predict user 1's missing rating of Bruno Mars, we first need to find the k most similar users to user 1 that rated Bruno Mars (k being a parameter of the system). For each of these similar users, we compute the distance of their rating of Bruno Mars from their average rating. For example, +1, -2.5, etc. Then, we compute a weighted average of these distances, with the user-similarity as the weight. The weighted average means that the more similar a certain user is to the target user (e.g. user 1), the more weight we put on his preferences.

    This weighted average is the predicted distance from the average rating of user 1. For example, setting k=1 will take into account user 3 as the most similar user. User 3's average rating is (5 + 1 + 5 + 2 + 5 + 1) / 6 = 3.167. The distance of his rating for Bruno Mars from the average is 1 - 3.617 = -2.617. Since we chose k=1, this would be exactly the distance for user 1. His average rating is (4 + 5 + 3) / 3 = 4, so his predicted rating for Bruno Mars is 4 - 2.617 = 1.83. Pretty reasonable considering that both users like rock, but user 1 shows more tolerance to pop music by liking Adele a bit more than user 3.
  2. Based on item similarity
    This variant is pretty straightforward after understanding the user-based similarity. We now look at each artist as a column vector, trying to predict what users would rate this artist. 
    The column vector for Arctic Monkeys (left) and Muse (right). They are fairly similar in the given toy example (while not so similar in real life, given other rock artists).

    The k most similar artists of each artist are computed in a similar fashion. Again, with respect to the distance of each user's rating from his average (that's because different users have different rating scales - some are more generous than others). The predicted rating of a certain artist by a certain user is the weighted average of the ratings of this user for similar artists. For example, to predict user 4's rating of Arctic Monkeys with k=1, we simply take his rating of Muse.
Next time that some website miraculously predicts which artist/video/movie/product you would like, you should know it wasn't a wild guess but a rather simple heuristic.

* No items were purchased during the writing of this post.

Friday, October 9, 2015

Fun with Google Ngrams

Google N-Grams is a dataset released by Google a few years back, which is based on Google Books. The dataset is available for multiple languages. All Google books of a certain language were scanned, and the frequency of each n-gram, for n = 1 to 5, was counted by the publish year. For example, how many times did the word "no" occur in books from 1800 to 1900? How many times did the trigram "freedom of speech" occur in the year 2000? These counts were normalized and the result is approximated probability of n-grams throughout time. The data is available for download, and can also be viewed via the viewer.

While this data is very helpful for training NLP models, it can also provide some cultural, historical and sociological insights... and hours of fun!

Let's warm up with a simple example, exploring the change in the English language throughout time. I took a few synonyms of the word happy, and compared their usage between 1800 and 2008 (the last year available in the viewer). While the curves of merry, cheerful and delighted look pretty similar, the gay curve departs from the others in the 1980s. There's a reason for that, and that could be explained with the following graph: The word gay in the meaning of homosexual, has been in use since the 1950s, boosting its usage since.

The frequency of a certain term sometimes correlates with historical events. For example, while the word war is constantly in use, it was mostly prominent in books during and after World War I and World War II. See the peaks in the graph:

Another interesting thing to notice is that people (at least authors) are actually peaceful. Whenever they talk about wars, they also talk about peace. Look how similar the war and peace curves are:

The curve similarity may suggest that the same books that discuss war also mention peace, but since the war curve dominates the peace curve, I can only assume that war is the books' main topic and peace is only mentioned a couple of times. I hope that they say good things about peace.

Searching for World Trade Center shows that it was first mentioned around its construction in 1973, then there were a few years that it was hardly discussed, and then came along 09/11 and made it a very common topic. In some cases, the correlation with historical events is through new words that describe concepts or products. The time they start appearing in books is by the time of their invention / foundation. For example:

Facebook was founded in 2004.
Google was founded in 1998.
Twitter was founded in 2006. However, twitter is an English word that was already in use before 2006 (and as it seems, sometimes appeared capitalized, probably in the beginning of a sentence).
What about some older inventions?
The invention of the telephone, which is attributed to Alexander Graham Bell, in fact involved other inventors such as Antonio Meucci and Thomas Watson. They started in 1844, but Bell granted patent for the telephone in 1876. The television was invented in 1926. Which of them had greater influence on the world? If there is any correlation between being mentioned in books and having influence on the world, it seems like the television did. Having said that, telephone is commonly referred to as phone, and in recent years also includes cellphone and smartphone. So putting all these together changes the picture:
Some words were mentioned for periods of times and then just disappeared. Take for example this list of diseases, each relevant in different eras: Except for historical events, you can try to use the data to search for correlations between events or phenomena. Judge for yourself: Bear in mind that correlation doesn't imply cause-effect relation, and not even that a third factor impacts these two phenomena. Sometimes they just happen at the same time.

Just for the fun of it, can you guess which is the most important day of the week? It's Sunday! I expected that from English books, but I thought that Saturday will be more prominent in Hebrew books. That wasn't the case - the Hebrew graph was similar with Sunday way ahead of the other days. Maybe this happens because of translated books. Happy weekend everyone!

Tuesday, September 29, 2015

Translation Models

This is the last part of the machine translation overview, in which I will discuss translation models. To recall, a statistical machine translation system produces a translation that is required to be both adequate, that is, as close as possible in its meaning to the source sentence, and fluent in the target language. Fluency is the responsibility of the target language model, that scores a every candidate translation according to its likelihood in the target language. The translation model, which will be presented in this post, takes care of adequacy: it scores candidate translations with respect to the original sentence in the source language - higher scores for sentences that better preserve the meaning of the original sentence.

Toilet sign at a restaurant in Mestre, Italy. Some kind of machine translation was used, translating toilet in Hebrew to the makeup. If you recognize funny translations in other languages, please comment!
As in language models, you don't need an expert to build the translation model. You don't even need to speak either the source or the target language. Using statistical methods, you can (theoretically), build a translation model from Swahili to Yiddish. The only requirement is to have a parallel corpus - a large amount of the same text, written in both languages. For example, movie subtitles or book translations in both languages. The texts are usually aligned at the sentence-level, so it can be regarded as a large collection of sentences in the source language and their translations to the target language. For example, the first sentence from George Orwell's novel 1984, in the original edition and in the Hebrew translation:

en: It was a bright cold day in April, and the clocks were striking thirteen.
he: יום אפריל צח וצונן, השעונים מצלצלים שלוש-עשרה.

can be considered as mutual translations. So do the rest of the sentence-pairs, as long as the translator is not too creative.

History Lesson
Here's a nice anecdote about using a parallel corpus for translation it's actually not a modern technique at all. It has been here since the 19th century. The Rosetta Stone is an ancient Egyptian stone inscribed with a decree issued at Egypt, in 196 BC. The text on the stone is written in three scripts: ancient Egyptian hieroglyphs, Demotic script, and ancient Greek. Ancient Egyptian hieroglyphs were used until the end of the fourth century, after which the knowledge of how to read them was lost. For hundreds of years, scholars have tried to decode the ancient Egyptian hieroglyphs. In 1799, the Rosetta stone was rediscovered near the town of Rosetta in the Nile, and brought with it a major advancement in the decoding of the ancient Egyptian hieroglyphs. It was the recognition that the stone offered three versions of the same text that enabled the advancement, making it the first parallel corpus used for statistical translation (at this time, without machines). It was finally decoded in 1822 by the French scholar Jean-François Champollion. The stone is on public display at the British Museum (and is the most interesting exhibit there, in my opinion).

The Rosetta Stone

Learning the translation model
Using sentence pairs from a parallel corpus as a translation table is nice, but not enough. You can always generate a sentence in the source language that didn't occur in the corpus, so it wouldn't be in the table. However, a sentence is composed of phrases (words and multi-word expressions), so instead of constructing a sentence translation table, a phrase translation table could be built, enabling a phrase-by-phrase translation. If the corpus is large enough, you can assume that it covers at least most of the common words and phrases in these languages.

This is what an excerpt from a phrase table from English to Hebrew might look like:

thirteenשלוש עשרה0.41
thirteenשלושה עשר0.21
Each entry contains a source language phrase, a target language phrase and the score (probability) of translating the source phrase to the target phrase. These are not trivial to compute, since the corpus is aligned at the sentence level. All we know is that "יום אפריל צח וצונן, השעונים מצלצלים שלוש-עשרה" is a (possible) translation of "It was a bright cold day in April, and the clocks were striking thirteen", but we don't know which words in English are translated to which words in Hebrew. The assumption is that each word in the source sentence is translated to 0, 1 or more words in the target language. In the simple case, it is translated to one word. In other cases, a word may disappear in translation (for example, the determiner "a" in English doesn't exist in Hebrew) or be translated to a multi-word phrase (e.g. the word "thirteen" is translated to "שלוש עשרה").

The word-level alignment of a sentence-pair.
The solution is, again, to use statistical methods. In particular, aligning these sentence pairs at the word level using the corpus statistics. The most basic alignment model is IBM model 1. It goes over all the sentence pairs in the corpus, and counts for each source word its occurrences in the same sentence pair with target words - since every target word could be its translation. In the example sentences-pair, the Hebrew word יום is counted once with every one of the English words It, was, a, bright, cold, day, in, April, and, the, clocks, were, striking, thirteen. If it appears in another sentence pair, for example, "איזה יום יפה" and "what a beautiful day", the word day will have two occurrences with יום. Since this is the true translation, the word day will occur in every sentence pair in which the word יום occurs. These counts are used to estimate the probability of translating the source word to a target word. In some cases, an English word may have several possible translations, such as cold that could be translated both to צונן and קר. In this case, the English word cold will appear in some cases with צונן and in others with קר. The probability will be computed accordingly (and will be higher for the more common translation).

This is the basic model, and there are other IBM models (2-5) that handle some of the problems that the basic model doesn't solve (e.g. considering the distance between aligned words). This phase's output is a word-to-word table, and then another algorithm is applied to create a phrase table, merging multi-word expressions to one phrase (e.g. "hot dog" which is translated differently from "hot" and "dog"). 

Putting it all together
The decoder is responsible for performing the actual translation: given the source sentence, it constructs a new sentence in the target language, using the translation model to offer phrase translations and their scores, and the language model to rank the fluency of the translation.

There are multiple ways to segment the source sentence to phrases (e.g., should "hot dog" be regarded as a phrase, or segmented to "hot" and "dog"?), and in most cases there are also multiple ways to translate each phrase in the source language to a phrase in the target language (e.g., should "cold" be translated to "צונן" or to "קר"?). In addition, the phrases in the target language may be re-ordered to follow grammar rules in the target language (e.g. adjective before noun in English, but after noun in many languages such as Hebrew, Romanian and French). The decoder tries many of these segmentations, translations and orders and produces candidate translations.

Each candidate translation is scored by three components: the language model scores the translation according to its fluency in the target language. The re-ordering model (which we haven't discussed in details) gives a score based on the changes in the order of words in both languages. The last score is the one given by the translation model. Each phrase-to-phrase translation score is the probability to translate one phrase to the other. So the translation model's score for the entire sentence is the product of all phrase translation scores, for example, if the source sentence is "It's not cold in April":

score(לא קר באפריל) = TM(לא,not) TM(קר,cold) TM(ב,in) TM(אפריל,April) LM(לא קר באפריל) RM(לא קר באפריל, It's not cold in April)

And eventually the decoder would select the candidate translation with the highest score it could find.

As always, I'll end the post with hedging myself by saying that I really haven't presented the entire world of translation, just gave you a taste of it. I tried to simplify the basic models that I told you about, but they are a bit less simple than I described. Also, there are newer and more accurate models that involve machine learning techniques, or consider the syntax of the source and target sentences. I hope I could convey the basics clearly and interestingly enough :)

Saturday, September 12, 2015

Language Models

In my previous post about Machine TranslationI mentioned how language models are used in statistical machine translation. Language models are used in many NLP applications. In this post, I will explain about language models in general, and how to learn a certain kind of language models: n-gram language models.

A language model is for a specific language, for example, an English language model. It receives as input a sequence of words in English (sentence / phrase / word). For simplicity, let's say it receives a sentence. The language model score for a sentence s, P(s), is a score between 0 and 1, that can be interpreted as the probability of composing this sentence in English. This score determines how fluent s is in English; the higher the score, the more fluent the sentence is. Language models can capture some interesting language phenomena: 
  • Which sentence is grammatically correct? - P("he eat pizza") < P("he eats pizza") 
  • Which word order is correct? - P("love I cats") < P("I love cats")
and even some logic and world knowledge:
  • What is more likely? - P("good British food") < P("good Italian food")

It can also tell you that pdf is the fourth largest religion:

Google suggests words that are likely to complete the query. From here.

Learning a language model
What does it take to build such a language model? Just a large English text corpus (a large and structured set of texts). We are interested in the probability of sentences, words and phrases in the language, but we don't know the real distribution of words and sentences in the language. We can use a large-enough corpus to estimate this probability. The basic method is to use relative frequency (Maximum Likelihood). The probability of a certain word w to occur in English, p(w) is approximated by the ratio of the occurrences of w in the corpus (the number of occurrences of w / the number of any word occurrence). For example: 
  • The word cat occurred 3853 times, out of total of 100,000,000 words, so its estimated probability is 0.00003853.
  • The word no, on the other hand, occurs more frequently: 226,985 times. So its probability is 0.00226985, and therefore when you compose a sentence in English, you are much more likely to say the word no than cat.
But we are also interested in computing the probability of multi-word expressions, phrases and sentences. Since any of them is simply a sequence of words, we can use the chain rule to compute the probability:1  

(1) P(A1,A2,...,Am) = P(A1) P(A2|A1) ... P(An|A1A2,...,Am-1)

where P(Ai|A1A2,...,Ai-1) denotes the probability that the word Ai is the next word in the sequence A1A2,...,Ai-1. For example, P(I love my cat) = P(I) P(love|I) P(my |I love) P(cat |I love my). We can assume that the words are independent of each other, and get a much simpler formula: 

(2) P(A1,A2,...,Am) = P(A1) P(A2) ... P(Am)

So whenever you pick an extra word to continue your sentence, you choose it by its distribution in the language and regardless of the previous words. This doesn't make much sense though. The probability of the word cat is lower than that of the word no. However, in the context of the incomplete sentence "I love my", the word cat is much more likely to complete the sentence than the word no

To estimate the conditional probability of a word Ai (cat) given any number of preceding words A1,A2,...,Ai-1 (I love my), we need to count the number of occurrences of Ai after A1,A2,...,Ai-1 (how many times the sentence "I love my cat" appears in the corpus) and divide it in the number of times that A1,A2,...,Ai-1 occurred with any following word (how many times the sentence "I love my *" appears in the corpus for any word *). You would expect that P(cat|I love my) would be higher than P(no|I love my).

You would also see that the conditional probability P(cat|I love my) is different from the prior probability P(cat). I'm not sure if it would be higher though; but I'm sure that P(cat|Persian) > P(cat): you are more likely to say "cat" if you already said "Persian", than just like that out of the blue.

However, assuming that every word in the sentence depends on all the previous words is not necessary, and it causes a problem of sparsity. In simple words, there is not enough data to estimate the probabilities. In order to compute the probability of the word cat to complete the sentence My friend John once had a black, you would need the sequences "My friend John once had a black" and "My friend John once had a black cat" to actually appear in the corpus. The corpus is big, but it doesn't contain any sentence that anyone has ever said.  

What's the solution? Markov assumption. We can assume every word only depends on k preceding words. For example, if k=1, we get:

(3) P(I love my cat) = P(I) P(love|I) P(my|love) P(cat|my)

This kind of language model is called n-gram language model, where an n-gram is a contiguous sequence of n words.2  The model works with n-grams, so the assumption is that every word depends on the preceding (n-1) words. For example, a unigram (n=1) language model considers the words independent of each other (P(I love my cat) = P(I) P(love) P(my) P(cat)). A bigram language model assumes that every word depends on the previous word (P(I love my cat) = P(I) P(love|I) P(my|love) P(cat|my)). There are also trigram (n=3) and 4-gram language models; larger ns are less commonly used, to the best of my knowledge.

While choosing a small n reduces the sparsity, it doesn't solve the problem completely. Some rare words (e.g. absorbefacient, yes, it's an actual word) or n-grams (e.g. blue wine) may never occur in the corpus, but still be valid in the language. If we use a word's relative frequency as its probability, a word that never occurs in the corpus receives zero probability. If a sentence contains such a word (e.g. I went to the pub and ordered a glass of blue wine), its probability will be zero. While we would probably like this sentence to have a very low probability, we wouldn't want it to be zero; we are aware of the fact that our corpus may be missing some valid English words.

Smoothing solves this problem. The simplest smoothing technique "hallucinates" additional k occurrences of every word in the sentence. For example, add-1 smoothing would consider that the word absorbefacient occurred once (if it hasn't occurred at all in the corpus), and that the word cat occurred 3854 times (when it actually occurred 3853 times). The new probability is:

(4) P(cat) = (3853 + 1) / (100,000,000 + V)

where V is the size of the vocabulary (the number of words we "added" to the corpus).
The same thing applies for n-grams. With this new formula, the probability of unseen words (and n-grams) is small, but never zero.

And as always, there are more complex smoothing techniques (Back-off, Kneser-Ney, etc.), that I will not discuss in this post.

Do you want to try it yourself? I implemented a simple language model for this post.3 Type a sentence, hit the button and you'll get the probability of the sentence (after a while...). Try it!

What can you do with language models?
As the demo shows, you can compute the probability of a sentence in a certain language.
As I explained in the previous post, statistical machine translation systems use a language model of the target language to prefer translations that are more fluent in the target language.

In the other direction, language models can be used to
generate the next word in a sequence of words, by sampling from the distribution of words (given the previous word). It can complete your search query or suggest corrections to your text messages. One of the funnest things is to generate a whole new sentence. To illustrate, I used my language model and generated the not-very-sensible sentences4:

Who indulge yourself.
The american opinion have attachments of miners from hard lump sum of his search far as to ? and considerable number of the cavity.
As in Massachusetts.
To start a pretext for which the orbit and rostov smiled at mrs. rucastle seemed to give the society it must be the European settlement?'' said he was all men were to himself to your monstrosity such things he had already passed in the contrary to which alone , beyond the tarsal bones is she 's eyes were drawn up because he wanted.
It was about 87,000 soldiers.
sentences generated with a bigram language model

Better sentences could be generated with larger n or with smarter (not n-gram) language models. Anyway, generating sentences can fill up hours of fun.

Choosing the corpus from which you learn the language model greatly affects the final outcome. Needless to say, choosing the language of the corpus is crucial. If you want a French language model, you need a French corpus, etc. Furthermore, if you base your language model on Shakespeare's writings, and then try to use it to estimate the probabilities of recent posts by your Facebook friends, they will probably be very unlikely. If your corpus is from the medical domain, sentences with medical terms will have a higher probability than those discussing rock bands. So you must choose your corpus carefully according to your needs. For purposes such as machine translation, the corpus should be general and contain texts from diverse domains. However, if you develop a machine translation system for a specific application, e.g. a medical application, you may want your corpus to contain relevant documents, for instance medical documents.

Having trained your language model on a very specific corpus, e.g. a corpus of recipes or a corpus of all the songs by The Smiths, you can go along and generate a whole new sequence of words. If your language model is good enough, you might get a brand new recipe (dare to try it?) or a song by the Smiths that Morrissey has never heard about.5   In fact, language models don't have to be trained on a text corpus at all. You can train them on musical notes and compose a new melody. Here are some examples for melodies that were generated from musical notes n-gram language model.6 

And just so you won't think that n-gram language models are the state-of-the-art: there are other language models, some perform much better. Maybe I'll mention some of them in other posts.

1 To be more accurate, P(A1) represents the prior probability of A1 (the probability of the word A1 to occur in English), while we are interested in the conditional probability of A1 given the beginning of a sentence. Therefore, the beginning of each sentence in the corpus is marked with a special sign <S>, and P(A1) is replaced by P(A1|<S>). This was omitted from the rest of the formulas for simplicity.

2 A good source for n-grams count is Google Ngrams, extracted from Google Books. 

3 The language model in the demo is a bigram language model with add-1 smoothing. I trained it using the corpus big.txt from Peter Norvig's website.

4 I started with the special sign <S> and sampled the next word from the distribution given the previous word, until period was sampled.  

5 In fact, I tried it, but it didn't work well because the corpus was too small. The Smiths were only active for 5 years and they don't have enough songs.

6 These examples are taken from Implementing A Music Improviser Using N-Gram Models by Kristen Felch and Yale Song. They were not the first to implement a musical n-gram model (I found a previous work, and I'm sure there are others), but they published some sample songs that are pretty good. 

Monday, August 31, 2015

Machine Translation Overview

Imagine you are at a restaurant in a foreign country, and by trying to avoid tourist traps, you found yourself at a nice local restaurant in a quiet neighborhood, no tourists except for you. The only problem is that the menu is in a foreign language... no English menu. What's the problem, actually? Pick your favorite machine translation system (Google Translate, Bing Translator, BabelFish, etc.) and translate the menu to a language you understand!

So, there's no need to elaborate on the motivation for translation. What I would like to do is give you an overview of how this magic works, and some idea of why it doesn't always work as well as you would expect.

The Tower of Babel, by Pieter Bruegel the Elder. Oil on board, 1563.
I'm going to focus on statistical machine translation. Translation means taking a sentence in one language (the source language) and producing a sensible sentence in another language (the target language) that has the same meaning. Machine means that it's done by software rather than a human translator. What does statistical mean?

It means that rather than coding expert knowledge into software and creating a lexicon and grammatical rules for translation between two specific languages, these systems are based on statistics on texts in the source and target languages. This is what makes it possible to produce translation between any source and target languages without additional effort, and without having to hire someone that actually speaks these languages. The only thing you need is a large amount of text in both languages.

Statistical Machine Translation
What makes a translation good?
  • It is as similar as possible in meaning to the original sentence in the source language.
  • It sounds correct in the target language, e.g., grammatically.
The first demands that the translation is adequate and the second that it is fluent.

SMT systems have a component for each of these requirements. The translation model makes sure that the translation is adequate and the language model is responsible for the fluency of the translation in the target language.

Language Model
I mentioned language models in my NLP overview post. They are used for various NLP applications. A language model (of a specific language, say English) receives as input a sentence in English and returns the probability of composing this sentence in the language. This is a score between 0 and 1, determining how fluent a sentence is in English - the higher the score, the more fluent the sentence is. Language models (LM) can capture grammatical rules (LM("she eat pizza") < LM("she eats pizza")), correct word order (LM("love I cats") < LM("I love cats")), better word choice (LM("powerful coffee") < LM("strong coffee")), and even some logic and world knowledge (LM("good British food") < LM("good Italian food")). 
These models are obtained from a large corpus (structured set of texts) in the target language (e.g. English). In the next post I will elaborate on how this is done (edit 12/09/15: you can read in the next post how this is done).

Translation Model
A translation model (from the source language to the target language) receives as input a pair of sentences / words / phrases, one in each language, and returns the probability of translating the first sentence to the second. As the language model, it also gives a score between 0 and 1, determining how adequate a translation is  - the higher the score, the more adequate the translation is. 
These models are obtained from a parallel corpora (plural of corpus) - each corpus contains the same texts in a different language (one in the source language and one in the target language). I will elaborate on how this is done in another post (edit 29/09/15: you can read in this post how it is done).

Given these two components, the language model and the translation model, how does the translation work? The translation model provides a table with words and phrases in the source language and their possible translations to the target language, each with a score. Given a sentence in the source language, the system uses this table to translate phrases from the source sentence to phrases in the target language. 

There are multiple possible translations for the source sentence; first, because the source sentence could be segmented to phrases in multiple ways. For example, take the sentence Machine translation is a piece of cake. The most intuitive thing to do would be to split it to words. This will yield a very literal translation (in Hebrew: תרגום מכונה הוא חתיכה של עוגה), which doesn't make much sense. The translation table probably also has an entry for the phrase piece of cake, translating it to a word or an idiom with the same meaning in the target language (in Hebrew: קלי קלות. Ask Google).

Second, even for a certain segmentation of the source sentence, some phrases have multiple translations in the target language. It happens both because the word in the source language is polysemous (has more than one meaning) (e.g. piece), and because one word in the source language can have many synonyms in the target language (e.g. cake).

The translation system chooses how to segment the source sentence and how to translate each of its phrases to the target language, using the scores that the two models give the translation. It multiplies the translation score for each phrase, and the language model score for the entire target sentence, for example:

P(תרגום מכונה הוא חתיכת עוגה|Machine translation is a piece of cake) = TM(תרגום,translation) TM(מכונה,machine) TM(הוא,is) TM(חתיכת,piece) TM(עוגה,cake) LM(תרגום מכונה הוא חתיכת עוגה)

This score could be understood as the conditional probability of translating Machine translation is a piece of cake to תרגום מכונה הוא חתיכת עוגה, but I'll spare you the formulas. The intuition behind multiplying the scores for the different translation components is joint probability of independent events.

Some things to note: the word of disappeared from the translation, and the words machine and translation switched places in the target sentence. These things happen and are allowed. Machine translation is a bit more complex than what I've told you. Just a bit :)

So each possible translation receives a final score, indicating both how adequate the translation is and how fluent it is in the target language, and the system chooses the translation with the highest score. In this case, Google ironically gets this one wrong.

Google ironically translates "Machine translation is a piece of cake" incorrectly.

Why is it a real bad idea to rely on machine translation when you wish to speak / write in a language that you don't speak?

Because you may say things that you don't mean. 

I'll give some examples of problems in translation.

Ambiguity - as you probably remember, this problem keeps coming back in every NLP task. In translation, the problem is that a polysemous word in the source language may be translated to different words in the target language for different senses. For example, wood can be translated in Hebrew to עץ (a piece of a tree) or to יער (a geographical area with many trees). While a human translator can pick the correct translation according to context, machines find it more difficult.

It gets even worse when you use a polysemous word in its less common meaning; A few months ago I needed to send an email to the PC (program committee) chairs of the conference in which I published my paper. I've noticed something funny about my mail, and I had to check how Google Translate can handle it. My mail started with "Dear PC chairs". I translated it to Hebrew (and back to English, for the non-Hebrew speakers in the audience):

Dear PC chairs => כסאות מחשב יקרים => expensive computer chairs

Don't expect SMT systems to always understand what you mean

So what happened here? The word chair has two meanings; I meant the less common one, chairman, while Google translated it to the more common sense (furniture). Acronyms are much worse when it comes to polysemy, and PC refers, almost 100% of the times, to Personal Computer. On top of that, the adjective dear is translated in Hebrew to יקר, which means both dear and expensive. Google chose the wrong sense, creating a funny translation. However, given the knowledge about how SMT systems work, it's understandable that selecting the more common senses of words yields better scores for both the language model and the translation model. I can't blame Google for this translation.

This is just one example of a problem in machine translation. There are so many other problems: different languages have different word order (e.g. adjective before the noun in English, but after the noun in Hebrew, French and many other languages); in some languages nouns have gender while in others they don't; idioms are really tough for SMT systems - sometimes they are translated literally, like the piece of cake example (when it was a part of a sentence).

A good translation for an idiom.
These problems are handled by more complex machine translation systems, that enable word re-ordering and translation at the syntactic level. Nevertheless, as you probably encounter from time to time, this task is not yet performed perfectly.

Since machine translation systems are not very accurate, it is very funny to translate a sentence to a random foreign language and back to English several times and see how you often get a totally different sentence (sometimes meaningless) in the end of this process. This is what Bad translator does. I tried it several times, and it was very amusing. Their example from the Ten Commandments inspired me to try other commandments, resulting in very funny bad translations:

Thou shalt not make unto thee any graven image => You can move the portrait
Thou shalt not kill => You must remove.
Thou shalt not commit adultery => Because you're here, try three
Thou shalt not steal => woman

And some good ones:

Remember the sabbath day, to keep it holy => Don't forget to consider Saturday.
Honour thy father and thy mother => honor your father and mother

You are welcome to try it and post your funny bad translations in the comments!