Saturday, September 12, 2015

Language Models

In my previous post about Machine Translation, I mentioned how language models are used in statistical machine translation. Language models are used in many NLP applications. In this post, I will explain what language models are in general, and how to learn a certain kind of language model: the n-gram language model.

A language model is for a specific language, for example, an English language model. It receives as input a sequence of words in English (sentence / phrase / word). For simplicity, let's say it receives a sentence. The language model score for a sentence s, P(s), is a score between 0 and 1, that can be interpreted as the probability of composing this sentence in English. This score determines how fluent s is in English; the higher the score, the more fluent the sentence is. Language models can capture some interesting language phenomena: 
  • Which sentence is grammatically correct? - P("he eat pizza") < P("he eats pizza") 
  • Which word order is correct? - P("love I cats") < P("I love cats")
and even some logic and world knowledge:
  • What is more likely? - P("good British food") < P("good Italian food")

It can also tell you that pdf is the fourth largest religion:


Google suggests words that are likely to complete the query.

Learning a language model
What does it take to build such a language model? Just a large English text corpus (a large and structured set of texts). We are interested in the probability of sentences, words and phrases in the language, but we don't know the real distribution of words and sentences in the language. We can use a large-enough corpus to estimate this probability. The basic method is to use relative frequency (Maximum Likelihood). The probability of a certain word w occurring in English, p(w), is approximated by the relative frequency of w in the corpus (the number of occurrences of w divided by the total number of word occurrences). For example: 
  • The word cat occurred 3853 times, out of total of 100,000,000 words, so its estimated probability is 0.00003853.
  • The word no, on the other hand, occurs more frequently: 226,985 times. So its probability is 0.00226985, and therefore when you compose a sentence in English, you are much more likely to say the word no than cat.
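This relative-frequency estimate can be sketched in a few lines of Python. The tiny corpus and the whitespace tokenization are my own simplifications for illustration, not part of the post's setup:

```python
from collections import Counter

# Toy corpus standing in for a large English corpus
# (assumption: whitespace tokenization is good enough here).
corpus = "the cat sat on the mat and the cat slept".split()

counts = Counter(corpus)
total = len(corpus)

def p(word):
    """Maximum-likelihood (relative-frequency) estimate of P(word)."""
    return counts[word] / total

print(p("the"))  # 3 occurrences out of 10 tokens -> 0.3
print(p("cat"))  # 2 / 10 -> 0.2
```

On the real numbers from the example above, the same computation gives p("cat") = 3853 / 100,000,000.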
But we are also interested in computing the probability of multi-word expressions, phrases and sentences. Since any of them is simply a sequence of words, we can use the chain rule to compute the probability:1  

(1) P(A1,A2,...,Am) = P(A1) P(A2|A1) ... P(Am|A1,A2,...,Am-1)


where P(Ai|A1,A2,...,Ai-1) denotes the probability that the word Ai is the next word after the sequence A1,A2,...,Ai-1. For example, P(I love my cat) = P(I) P(love|I) P(my|I love) P(cat|I love my). We can assume that the words are independent of each other, and get a much simpler formula: 

(2) P(A1,A2,...,Am) = P(A1) P(A2) ... P(Am)

So whenever you pick an extra word to continue your sentence, you choose it according to its distribution in the language, regardless of the previous words. This doesn't make much sense, though. The probability of the word cat is lower than that of the word no; however, in the context of the incomplete sentence "I love my", the word cat is much more likely to complete the sentence than the word no.

To estimate the conditional probability of a word Ai (cat) given any number of preceding words A1,A2,...,Ai-1 (I love my), we need to count the number of occurrences of Ai after A1,A2,...,Ai-1 (how many times the sentence "I love my cat" appears in the corpus) and divide it by the number of times that A1,A2,...,Ai-1 occurred with any following word (how many times "I love my *" appears in the corpus, for any word *). You would expect P(cat|I love my) to be higher than P(no|I love my).
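This counting procedure can be sketched in Python. The three toy sentences below are my own stand-in for a real corpus, and the helper names are illustrative:

```python
from collections import Counter

sentences = [
    "i love my cat".split(),
    "i love my dog".split(),
    "i love my cat".split(),
]

# Count how often each history (tuple of preceding words) is
# followed by each word, and how often it occurs with any follower.
continuations = Counter()
histories = Counter()
for sent in sentences:
    for i in range(1, len(sent)):
        history = tuple(sent[:i])
        continuations[(history, sent[i])] += 1
        histories[history] += 1

def p_cond(word, history):
    """P(word | history) by relative frequency of full sequences."""
    history = tuple(history)
    if histories[history] == 0:
        return 0.0
    return continuations[(history, word)] / histories[history]

print(p_cond("cat", ["i", "love", "my"]))  # 2 of 3 continuations -> ~0.667
print(p_cond("no", ["i", "love", "my"]))   # never observed -> 0.0
```

As expected, P(cat|I love my) comes out much higher than P(no|I love my) on this toy corpus.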

You would also see that the conditional probability P(cat|I love my) is different from the prior probability P(cat). I'm not sure if it would be higher though; but I'm sure that P(cat|Persian) > P(cat): you are more likely to say "cat" if you already said "Persian", than just like that out of the blue.


However, assuming that every word in the sentence depends on all the previous words is unnecessary, and it causes a sparsity problem. Simply put, there is not enough data to estimate the probabilities. In order to compute the probability of the word cat completing the sentence My friend John once had a black, you would need the sequences "My friend John once had a black" and "My friend John once had a black cat" to actually appear in the corpus. The corpus is big, but it doesn't contain every sentence that anyone has ever said.  

What's the solution? Markov assumption. We can assume every word only depends on k preceding words. For example, if k=1, we get:

(3) P(I love my cat) = P(I) P(love|I) P(my|love) P(cat|my)

This kind of language model is called an n-gram language model, where an n-gram is a contiguous sequence of n words.2  The model works with n-grams, so the assumption is that every word depends on the preceding (n-1) words. For example, a unigram (n=1) language model considers the words independent of each other (P(I love my cat) = P(I) P(love) P(my) P(cat)). A bigram language model assumes that every word depends on the previous word (P(I love my cat) = P(I) P(love|I) P(my|love) P(cat|my)). There are also trigram (n=3) and 4-gram language models; larger ns are less commonly used, to the best of my knowledge.
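A minimal bigram language model along these lines might look as follows. The tiny corpus, the <S> start marker (see footnote 1) and the helper names are mine:

```python
from collections import Counter

corpus_sentences = [
    "i love my cat",
    "i love my dog",
    "you love my cat",
]

bigrams = Counter()
unigrams = Counter()
for line in corpus_sentences:
    tokens = ["<S>"] + line.split()  # <S> marks the sentence start
    for prev, word in zip(tokens, tokens[1:]):
        bigrams[(prev, word)] += 1
        unigrams[prev] += 1

def p_bigram(word, prev):
    """P(word | prev) by relative frequency of bigrams."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def p_sentence(sentence):
    """Multiply the bigram probabilities along the sentence, as in formula (3)."""
    tokens = ["<S>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= p_bigram(word, prev)
    return prob

print(p_sentence("i love my cat"))  # 2/3 * 1 * 1 * 2/3 ~ 0.444
```

Each factor here corresponds to one term of formula (3), with P(I) replaced by P(I|<S>) as footnote 1 describes.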

Smoothing
While choosing a small n reduces the sparsity, it doesn't solve the problem completely. Some rare words (e.g. absorbefacient, yes, it's an actual word) or n-grams (e.g. blue wine) may never occur in the corpus, but still be valid in the language. If we use a word's relative frequency as its probability, a word that never occurs in the corpus receives zero probability. If a sentence contains such a word (e.g. I went to the pub and ordered a glass of blue wine), its probability will be zero. While we would probably like this sentence to have a very low probability, we wouldn't want it to be zero; we are aware of the fact that our corpus may be missing some valid English words.


Smoothing solves this problem. The simplest smoothing technique "hallucinates" k additional occurrences of every word in the vocabulary. For example, add-1 smoothing (k=1) would consider that the word absorbefacient occurred once (when it hasn't occurred at all in the corpus), and that the word cat occurred 3854 times (when it actually occurred 3853 times). The new probability is:


(4) P(cat) = (3853 + 1) / (100,000,000 + V)

where V is the size of the vocabulary (the number of word occurrences we "added" to the corpus, one per vocabulary word).
The same thing applies for n-grams. With this new formula, the probability of unseen words (and n-grams) is small, but never zero.
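Here is a sketch of add-1 smoothing, under the simplifying assumption that the vocabulary is just the set of words seen in the corpus; in practice, V would cover every word you want to assign a probability to:

```python
from collections import Counter

corpus = "the cat sat on the mat".split()
counts = Counter(corpus)
total = len(corpus)
vocab_size = len(counts)  # assumption: vocabulary = words seen in the corpus

def p_smoothed(word):
    """Add-1 (Laplace) smoothed unigram probability, as in formula (4)."""
    return (counts[word] + 1) / (total + vocab_size)

print(p_smoothed("the"))            # seen twice: (2+1)/(6+5)
print(p_smoothed("absorbefacient")) # unseen: (0+1)/(6+5) -- small but nonzero
```

With the numbers from the running example, the same formula gives P(cat) = (3853 + 1) / (100,000,000 + V), exactly formula (4).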


And as always, there are more complex smoothing techniques (Back-off, Kneser-Ney, etc.), that I will not discuss in this post.


Do you want to try it yourself? I implemented a simple language model for this post.3 Type a sentence, hit the button and you'll get the probability of the sentence (after a while...). Try it!






What can you do with language models?
As the demo shows, you can compute the probability of a sentence in a certain language.
As I explained in the previous post, statistical machine translation systems use a language model of the target language to prefer translations that are more fluent in the target language.

In the other direction, language models can be used to generate the next word in a sequence of words, by sampling from the distribution of words (given the previous words). This can complete your search query or suggest corrections to your text messages. One of the funnest things is to generate a whole new sentence. To illustrate, I used my language model to generate these not-very-sensible sentences4:


Who indulge yourself.
The american opinion have attachments of miners from hard lump sum of his search far as to ? and considerable number of the cavity.
As in Massachusetts.
To start a pretext for which the orbit and rostov smiled at mrs. rucastle seemed to give the society it must be the European settlement?'' said he was all men were to himself to your monstrosity such things he had already passed in the contrary to which alone , beyond the tarsal bones is she 's eyes were drawn up because he wanted.
It was about 87,000 soldiers.
sentences generated with a bigram language model

Better sentences could be generated with larger n or with smarter (not n-gram) language models. Anyway, generating sentences can fill up hours of fun.
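The sampling procedure described above (and in footnote 4) can be sketched like this; the toy corpus and helper names are illustrative, and I use a period as the end-of-sentence signal:

```python
import random
from collections import Counter, defaultdict

corpus_sentences = ["i love my cat .", "i love my dog .", "my cat slept ."]

# Build bigram continuation counts, starting each sentence with <S>.
followers = defaultdict(Counter)
for line in corpus_sentences:
    tokens = ["<S>"] + line.split()
    for prev, word in zip(tokens, tokens[1:]):
        followers[prev][word] += 1

def generate(max_len=20):
    """Sample a sentence word by word from the bigram distribution,
    stopping when a period is sampled (or at max_len as a safety net)."""
    word, out = "<S>", []
    while len(out) < max_len:
        choices = followers[word]
        word = random.choices(list(choices), weights=choices.values())[0]
        if word == ".":
            break
        out.append(word)
    return " ".join(out)

print(generate())  # e.g. "i love my cat" -- output varies per run
```

Since each word is sampled only from words that followed the previous one in the corpus, every generated sentence is locally fluent, even when the whole is nonsense, much like the bigram samples above.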

Choosing the corpus from which you learn the language model greatly affects the final outcome. Needless to say, choosing the language of the corpus is crucial: if you want a French language model, you need a French corpus, and so on. Furthermore, if you base your language model on Shakespeare's writings, and then use it to estimate the probabilities of recent posts by your Facebook friends, those posts will probably receive very low probabilities. If your corpus is from the medical domain, sentences with medical terms will have a higher probability than those discussing rock bands. So you must choose your corpus carefully according to your needs. For purposes such as machine translation, the corpus should be general and contain texts from diverse domains. However, if you develop a machine translation system for a specific application, e.g. a medical application, you may want your corpus to contain relevant documents, for instance medical documents.

Having trained your language model on a very specific corpus, e.g. a corpus of recipes or a corpus of all the songs by The Smiths, you can go along and generate a whole new sequence of words. If your language model is good enough, you might get a brand new recipe (dare to try it?) or a song by the Smiths that Morrissey has never heard about.5   In fact, language models don't have to be trained on a text corpus at all. You can train them on musical notes and compose a new melody. Here are some examples of melodies that were generated by an n-gram language model trained on musical notes.6 



And just so you won't think that n-gram language models are the state of the art: there are other language models, some of which perform much better. Maybe I'll mention some of them in other posts.


1 To be more accurate, P(A1) represents the prior probability of A1 (the probability that the word A1 occurs in English), while we are actually interested in the conditional probability of A1 given the beginning of a sentence. Therefore, the beginning of each sentence in the corpus is marked with a special sign <S>, and P(A1) is replaced by P(A1|<S>). This was omitted from the rest of the formulas for simplicity.

2 A good source for n-grams count is Google Ngrams, extracted from Google Books. 

3 The language model in the demo is a bigram language model with add-1 smoothing. I trained it using the corpus big.txt from Peter Norvig's website.

4 I started with the special sign <S> and sampled the next word from the distribution given the previous word, until a period was sampled.  

5 In fact, I tried it, but it didn't work well because the corpus was too small. The Smiths were only active for 5 years and they don't have enough songs.

6 These examples are taken from Implementing A Music Improviser Using N-Gram Models by Kristen Felch and Yale Song. They were not the first to implement a musical n-gram model (I found a previous work, and I'm sure there are others), but they published some sample songs that are pretty good. 
