We're already six posts into the topic of natural language processing, and I can't believe I haven't discussed this basic topic yet. So today I'm going to discuss words; more accurately, I will discuss how words are represented in natural language processing.
A word is the most basic unit in semantic tasks. While other, low-level tasks, such as part-of-speech tagging (detecting the part of speech of each word in a sentence) and lemmatization (finding the lemma - a word's basic form) might be interested in the characters (mostly affixes) of the word, semantic tasks are concerned with meaning. A word is the most basic unit that conveys meaning.
While a word is basically stored in the computer as a string - a sequence of characters - this representation says nothing about the word's meaning. It only allows detecting similarity in very specific cases. For example, morphological derivations (e.g. sing-singer) and inflections (e.g. cat-cats, listen-listened) modify words, creating related words (with a different part-of-speech, plurality form, etc.). Such words are therefore similar both in meaning and in their string representations, since they share common characters. This similarity can be detected using Levenshtein distance.
Needless to say, most words that are similar in meaning do not share many characters. Synonyms such as elevator and lift, and related words such as food and eat, are not similar at the character level.
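To make this concrete, here is a minimal sketch of the Levenshtein (edit) distance mentioned above, in plain Python; the word pairs are just illustrative examples.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insertions, deletions,
    substitutions) needed to turn string a into string b, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Inflections of the same lemma are close at the character level...
print(levenshtein("cat", "cats"))         # 1
print(levenshtein("listen", "listened"))  # 2
# ...but synonyms are not.
print(levenshtein("elevator", "lift"))    # 6
```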
In addition, performing operations on strings is highly inefficient; computers handle numbers much better than strings. Better representations are therefore also needed for faster computation. I assume I've convinced you that words need better representations. Now I can move on to telling you how words can be represented as vectors (arrays).
One-hot vectors
As I mentioned, working with strings is computationally inefficient. A simple solution to this inefficiency is to convert every string to a number. First, we need to define the vocabulary: the set of all distinct words in the language (or at least those observed in a very large corpus). Unlike a dictionary, in which the entries are basic word forms, a vocabulary contains separate entries for inflections and derivations of the same basic form (e.g., cat and cats).
When a semantic application processes a text, it can now replace every word in the text with its index in the vocabulary, e.g. cat is 12424, cats is 12431, dog is 15879, etc.
This is the most compact representation of a word - all it takes to store a word is one number. So what do vectors have to do with this? Another way to look at this representation is as a vector of dimension |V| (with V denoting the vocabulary), with zeros in all entries and a single 1 at the word's index (see figure 1).
Figure 1: an illustration of the one-hot vectors of cat, cats and dog. The only non-zero entry in each vector is the index of the word in the vocabulary.
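As a toy illustration, here is a minimal sketch of mapping words to indices and one-hot vectors; the tiny vocabulary and its order are made up for the example.

```python
import numpy as np

# A toy vocabulary; a real one would contain hundreds of thousands of words.
vocab = ["cat", "cats", "dog", "elevator", "lift"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))  # [1. 0. 0. 0. 0.]
print(one_hot("dog"))  # [0. 0. 1. 0. 0.]

# Any two distinct words are equally dissimilar under this representation:
print(one_hot("cat") @ one_hot("cats"))  # 0.0
print(one_hot("cat") @ one_hot("dog"))   # 0.0
```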
While one-hot vectors can be stored efficiently, their main problem is that they don't capture any information about the similarity between words. They even lose the string-similarity information, which sometimes indicates semantic similarity (when words share the same lemma / basic form). Since we're interested in representing words for semantic tasks, a better solution is needed.
Distributional vectors
One important characteristic of a word is the company it keeps. According to the distributional hypothesis [1], words that occur in similar contexts (with the same neighboring words) tend to have similar meanings (e.g. elevator and lift will both appear next to down, up, building, floor, and stairs); simply put, "tell me who your friends are and I will tell you who you are" - the words version.
This idea was leveraged to represent words by their contexts. Suppose that each word is again represented by a |V|-dimensional vector. Each entry in the vector corresponds to an index in the vocabulary. Now, instead of marking a word with its own index, we mark the occurrences of other words next to it. Given a large corpus, we search for all the occurrences of a certain word (e.g. elevator). We take a window of a pre-defined size k around elevator, and every word in this window is considered as a neighbor of elevator. For instance, if the corpus contains the sentence "The left elevator goes down to the second floor", with a window of size 5, we get the following neighbors: the, left, goes, down. A larger k will also include characteristic words such as floor.
We then update elevator's vector by increasing the counts at the indices of the, left, goes, and down. At the end of this process, each word vector contains the frequencies of all its neighbors. We can normalize the vector to get a probability distribution (how likely each word is to appear as a neighbor of the target word). There are more sophisticated weighting schemes (pointwise mutual information, for example), but this is the main idea. These methods are also referred to as "counting methods".
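Here is a minimal sketch of such a counting method over a toy corpus; the corpus, window size, and normalization are only illustrative.

```python
from collections import Counter, defaultdict

# Toy corpus; a real one would contain millions of sentences.
corpus = [
    "the left elevator goes down to the second floor",
    "take the lift down to the ground floor",
    "the cat sat on the mat",
]

window = 2  # number of neighbors counted on each side of the target word
counts = defaultdict(Counter)

for sentence in corpus:
    tokens = sentence.split()
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1

# Raw co-occurrence counts for "elevator"...
print(counts["elevator"])
# ...and the same counts normalized into a probability distribution over neighbors.
total = sum(counts["elevator"].values())
print({neighbor: c / total for neighbor, c in counts["elevator"].items()})
```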
The main advantage of distributional vectors is that they capture similarity between words: similar words => similar neighbors => similar vectors. Measuring similarity between vectors is straightforward, using measures such as cosine similarity. We get a simple method of comparing similarities between words - we can expect elevator and lift to yield a higher similarity score than elevator and, say, cat (and I wouldn't miss the chance to watch a cat-in-an-elevator video). These simple vector similarities are highly effective in recognizing word similarity in semantic tasks (e.g., to overcome lexical variability).
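For completeness, a minimal sketch of cosine similarity between such count vectors; the counts below are made up to illustrate the expected behavior.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors: close to 1 for vectors pointing
    in similar directions, 0 for orthogonal vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up neighbor counts over a shared vocabulary order,
# e.g. [down, up, building, floor, purr]:
elevator = np.array([10.0, 8.0, 6.0, 7.0, 0.0])
lift     = np.array([ 9.0, 7.0, 5.0, 8.0, 0.0])
cat      = np.array([ 1.0, 1.0, 0.0, 2.0, 9.0])

print(cosine(elevator, lift))  # close to 1
print(cosine(elevator, cat))   # much lower
```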
Yet, distributional vectors pose computational obstacles - the vocabulary size is typically very high (at least a few hundred thousand words). Storing each of these |V| words in a |V|-dimensional vector results in a matrix with |V|² cells - this is quite large, and performing operations on all words is computationally heavy.
Word embeddings
Word embeddings to the rescue. The basic idea is to store the same contextual information in a low-dimensional vector; each word is now represented by a D-dimensional vector, where D is a relatively small number (typically between 50 and 1000). This approach was first presented in 2003 by Bengio et al. [2], but gained extreme popularity with word2vec [3] in 2013. There are also other variants of word embeddings, such as GloVe [4].
Instead of counting occurrences of neighboring words, the vectors are now predicted (learned). Without getting too technical, this is what these algorithms basically do: they start from a random vector for each word in the vocabulary. Then they go over a large corpus and, at every step, observe a target word and its context (neighbors within a window). The target word's vector and the context words' vectors are updated to bring them closer together in the vector space (and therefore increase their similarity score). Other word vectors (all of them, or a sampled subset, as in negative sampling) are updated to move away from the target word. After observing many such windows, the vectors become meaningful, yielding similar vectors for similar words.
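A toy sketch of this "pull the true context closer, push random words away" update, in the spirit of skip-gram with negative sampling - not the actual word2vec implementation; the corpus, dimensions, and learning rate are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

corpus = "the left elevator goes down to the second floor".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

dim, lr, window, negatives = 25, 0.05, 2, 3
target_vecs = rng.normal(scale=0.1, size=(len(vocab), dim))   # word vectors
context_vecs = rng.normal(scale=0.1, size=(len(vocab), dim))  # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(50):
    for i, word in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j == i:
                continue
            t, c = idx[word], idx[corpus[j]]
            # Pull the true (target, context) pair closer together...
            g = 1.0 - sigmoid(target_vecs[t] @ context_vecs[c])
            target_vecs[t] += lr * g * context_vecs[c]
            context_vecs[c] += lr * g * target_vecs[t]
            # ...and push a few randomly sampled "negative" words away.
            for n in rng.integers(len(vocab), size=negatives):
                g = -sigmoid(target_vecs[t] @ context_vecs[n])
                target_vecs[t] += lr * g * context_vecs[n]
                context_vecs[n] += lr * g * target_vecs[t]

# After training, target_vecs[idx[w]] is the D-dimensional embedding of word w.
```

On a single sentence this won't learn anything meaningful, of course; the point is only to show the shape of the update rule.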
What are the advantages of word embeddings? First of all, despite the low dimensions, the information regarding word similarity is kept. Similar words still have similar vectors - this is what the algorithms aim to do while learning these vectors. Second, they are compact, so any operation on these vectors (e.g. computing similarities) is efficient and fast.
Moreover, people have done some amazing things with these vectors. Some of these things (if not all) are possible with high-dimensional vectors as well, but are computationally difficult.
- Most similar words - we can easily find the most similar words to a certain word, by finding the most similar vectors. For instance, the 5 most similar words to July are June, January, October, November, and February. We can also find a word which is similar to a set of other words: for example, the most similar word to Israel + Lebanon + Syria + Jordan is Iraq, while the most similar word to France + Germany + Italy + Switzerland is Austria! Isn't this cool? If that's not impressive enough, using techniques for projecting high-dimensional vectors to 2-dimensional space (t-SNE, PCA), one can visualize word embeddings, and get some really nice insights about what the vectors capture:
- Analogies - Following the above figure, the authors of this paper presented some nice results regarding the vectors' ability to solve analogy questions: a is to b as c is to d; given a, b, and c, find the missing word d. It turns out that this can be solved pretty accurately by selecting the vector which is most similar to b and c, but dissimilar to a. This is done with simple vector addition and subtraction, and the most famous example is "king + woman - man = queen", demonstrated in the following figure (and in the code sketch after the figures):
Figure 4: a two-dimensional projection of word2vec vectors of countries and their capitals, from the paper. The lines between countries and their capitals are approximately parallel, indicating that the vectors capture this relation.
Figure 5: an illustration of the analogical male-female relation between word pairs such as (king, queen), (man, woman) and (uncle, aunt), from the paper.
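As mentioned above, here is a minimal sketch of how these nearest-neighbor and analogy queries could be run over pre-trained vectors; the gensim library and the file name vectors.bin are assumptions - any set of word2vec-format vectors would do.

```python
from gensim.models import KeyedVectors

# Load pre-trained vectors (the file name is an assumption; any
# word2vec-format vector file works).
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Most similar words to a single word.
print(vectors.most_similar("July", topn=5))

# Most similar word to a set of words.
print(vectors.most_similar(positive=["Israel", "Lebanon", "Syria", "Jordan"], topn=1))

# Analogy: king + woman - man ~ queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```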
While these results are very impressive (I'm still impressed and I've known about them for a long time now...), it still isn't clear how useful these vectors are. Turns out they are amazingly useful. Almost any semantic task has been re-implemented recently with the help of word embeddings. Many of them benefit from a performance boost, and it's mostly just very easy to use them. In future posts, I'll give some examples of tasks that rely on word embeddings.
Suggested Reading:
Deep Learning, NLP, and Representations, from Christopher Olah's blog (more technical and advanced).
References:
[2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin, "A neural probabilistic language model", The Journal of Machine Learning Research, 2003. ↩
[3] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient estimation of word representations in vector space." CoRR, 2013. ↩
[4] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." EMNLP 2014. ↩