Sunday, January 3, 2016

Representing Words

We're already six posts into the topic of natural language processing, and I can't believe I haven't discussed this basic topic yet. So today I'm going to discuss words; more accurately, I will discuss how words are represented in natural language processing.

A word is the most basic unit that conveys meaning, which makes it the most basic unit in semantic tasks. While other, lower-level tasks, such as part-of-speech tagging (detecting the part of speech of each word in a sentence) and lemmatization (finding the lemma - a word's basic form) might be interested in the characters (mostly affixes) of the word, semantic tasks are concerned with meaning.

While a word is basically stored in the computer as a string - a sequence of characters, this representation says nothing about the word's meaning. It allows detecting a certain similarity, in very specific cases. For example, morphological derivations (e.g. sing-singer) and inflections (e.g. cat-cats, listen-listened) modify words, creating related words (with a different part-of-speech, plurality form, etc). Such words are therefore similar both in meaning and in their string representations, since they share common characters. This similarity could be detected using Levenshtein distance (the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other).
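To make the string-similarity idea concrete, here is a minimal Python sketch of Levenshtein distance using the standard dynamic-programming formulation. The function name `levenshtein` is my own choice for illustration; the example words are the ones used above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


print(levenshtein("cat", "cats"))         # 1
print(levenshtein("sing", "singer"))      # 2
print(levenshtein("listen", "listened"))  # 2
print(levenshtein("elevator", "lift"))    # 6
```

The inflected and derived pairs are only one or two edits apart, while the synonyms elevator and lift are far apart at the character level.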

Needless to say, most of the words that are similar in meaning do not share many common characters. Synonyms such as elevator and lift, and related words such as food and eat, are not similar at the character level.

In addition, performing operations on strings is highly inefficient. Computers can handle numbers much better than strings. Better representations are needed also for faster computation. I assume that I've convinced you why words need better representations. Now I can move on to telling you how words could be better represented as vectors (arrays).

One-hot vectors
As I mentioned, working with strings is computationally inefficient. A simple solution to this inefficiency is to convert every string to a number. First, we need to define the vocabulary: this is the set of all distinct words in the language (or at least those that we observed in a very large corpus). Unlike a dictionary, in which the entries are word basic forms, a vocabulary contains different entries for inflections and derivations of the same basic form (e.g., cat and cats).
When a semantic application processes a text, it can now replace every word in the text with its index in the vocabulary, e.g. cat is 12424, cats is 12431, dog is 15879, etc.

This is the most compact representation for a word - all it takes to store a word is one number. So what do vectors have to do with this? Another way to look at this representation is as a vector of dimension |V| (with V denoting the vocabulary), with zeros in all entries, and one set bit at the word's index (see figure 1).

Figure 1: an illustration of one-hot vectors of cat, cats and dog. The only non-zero entry in each vector is the index of the word in the vocabulary.
While one-hot vectors could be stored efficiently, their main problem is that they don't capture any information about the similarity between words. They even lose the information about string-similarity, that sometimes indicates semantic similarity (when words share the same lemma / basic form). Since we're interested in representing words for semantic tasks, a better solution is needed.
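Here is a small Python sketch of the one-hot representation. The toy corpus and the helper names (`word_to_index`, `one_hot`, `dot`) are assumptions made up for illustration; a real vocabulary would be built from a very large corpus.

```python
# A toy corpus standing in for "a very large corpus" (invented for illustration).
corpus = "the cat chased the dog while the cats slept".split()

# The vocabulary holds every distinct word form, so cat and cats get separate entries.
vocabulary = sorted(set(corpus))
word_to_index = {word: index for index, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


print(word_to_index["cat"], one_hot("cat"))
print(word_to_index["cats"], one_hot("cats"))

# Any two different words are orthogonal, no matter how related they are.
print(dot(one_hot("cat"), one_hot("cats")))  # 0
print(dot(one_hot("cat"), one_hot("dog")))   # 0
```

The dot product of any two distinct one-hot vectors is zero, so cat looks exactly as different from cats as it does from dog.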

Distributional vectors
One important characteristic of a word is the company it keeps. According to the distributional hypothesis [1], words that occur in similar contexts (with the same neighboring words), tend to have similar meanings (e.g. elevator and lift will both appear next to down, up, building, floor, and stairs); simply put, "tell me who your friends are and I will tell you who you are" - the words version.

This idea was leveraged to represent words by their contexts. Suppose that each word is again represented by a |V|-dimensional vector. Each entry in the vector corresponds to an index in the vocabulary. Now, instead of marking a word with its own index, we mark the occurrences of other words next to it. Given a large corpus, we search for all the occurrences of a certain word (e.g. elevator). We take a window of a pre-defined size k around elevator, and every word in this window is considered a neighbor of elevator. For instance, if the corpus contains the sentence "The left elevator goes down to the second floor", with a window of size 5 (elevator itself plus two words on each side), we get the following neighbors: the, left, goes, down. A larger k will also include characteristic words such as floor.
We then update elevator's vector by increasing the number of occurrences in the indices of the, left, goes, and down. At the end of this process, each word vector contains the frequencies of all its neighbors. We can normalize the vector to get a probability distribution (how likely each word is to appear as a neighbor of the target word). There are some more complex metrics, but this is the main idea. These methods are also referred to as "counting methods".
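The counting procedure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are mine, tokenization is a plain lowercase split, and I assume a window of size k means the target word plus k // 2 words on each side, which matches the example above. Counts are kept in a sparse `Counter` per word rather than a full |V|-dimensional array.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window_size=5):
    """Count, for every word, how often each other word appears in its window."""
    half = window_size // 2  # assumed: words taken on each side of the target
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            left = tokens[max(0, i - half):i]
            right = tokens[i + 1:i + 1 + half]
            counts[target].update(left + right)
    return counts


def normalize(counter):
    """Turn raw neighbor counts into a probability distribution."""
    total = sum(counter.values())
    return {word: count / total for word, count in counter.items()}


sentences = ["The left elevator goes down to the second floor"]

counts = cooccurrence_counts(sentences, window_size=5)
print(dict(counts["elevator"]))       # {'the': 1, 'left': 1, 'goes': 1, 'down': 1}
print(normalize(counts["elevator"]))  # each neighbor gets probability 0.25

# A larger window also picks up characteristic words such as floor.
wide = cooccurrence_counts(sentences, window_size=13)
print("floor" in wide["elevator"])    # True
```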

The main advantage of distributional vectors is that they capture similarity between words: similar words => similar neighbors => similar vectors. Measuring similarity between vectors is possible, using measures such as cosine similarity, for example. We get a simple method of comparing similarities between words - we can expect elevator and lift to yield a higher similarity score than elevator and, say, cat (and I wouldn't miss the chance to watch a cat in an elevator video). These simple vector similarities are highly effective in recognizing word similarity in semantic tasks (e.g., to overcome lexical variability).
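As a rough illustration, cosine similarity is the dot product of two vectors divided by the product of their lengths. The count vectors below are invented for this example (they are not taken from any corpus), and the five context words are my own choice.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Made-up neighbor counts; the columns are the context words
# down, up, floor, purr, fur.
elevator = [8, 7, 5, 0, 0]
lift = [6, 9, 4, 0, 0]
cat = [1, 0, 1, 9, 7]

print(round(cosine_similarity(elevator, lift), 3))  # 0.967
print(round(cosine_similarity(elevator, cat), 3))   # 0.096
```

With these toy counts, elevator and lift share most of their neighbors and score close to 1, while elevator and cat score close to 0.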

Figure 2: an illustration of distributional vectors of food, eat and laptop. Notice that the vectors of the similar words food and eat are similar, while different from the vector of the dissimilar word laptop.

Yet, distributional vectors pose computational obstacles - the vocabulary size is typically very large (at least a few hundred thousand words). Storing each of these |V| words in a |V|-dimensional vector results in a matrix with |V|² cells - this is quite large, and performing operations on all words is computationally heavy.

Word embeddings
Word embeddings to the rescue. The basic idea is to store the same contextual information in a low-dimensional vector; each word is now represented by a D-dimensional vector, where D is a relatively small number (typically between 50 and 1000). This approach was first presented in 2003 by Bengio et al. [2], but gained extreme popularity with word2vec [3] in 2013. There are also some other variants of word embeddings, like GloVe [4].

Instead of counting occurrences of neighboring words, the vectors are now predicted (learned). Without getting too technical, this is what these algorithms basically do: they start from a random vector for each word in the vocabulary. Then they go over a large corpus and at every step, observe a target word and its context (neighbors within a window). The target word's vector and the context words' vectors are then updated to bring them close together in the vector space (and therefore increase their similarity score). Other vectors (all of them, or a sample of them) are updated to become less close to the target word. After observing many such windows, the vectors become meaningful, yielding similar vectors to similar words.
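The following is a toy Python sketch of this kind of training loop, in the spirit of skip-gram with negative sampling. It is an illustration under several assumptions of mine, not the actual word2vec code: the names `sgd_step` and `train`, all hyperparameter values, and the uniform sampling of "other" words are made up for the example. As in word2vec, each word gets one vector as a target and a separate one as a context.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    x = max(-30.0, min(30.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


def sgd_step(target_vec, context_vec, label, lr):
    """Pull the two vectors together (label 1.0) or push them apart (label 0.0)."""
    step = lr * (label - sigmoid(dot(target_vec, context_vec)))
    for k in range(len(target_vec)):
        old_target = target_vec[k]
        target_vec[k] += step * context_vec[k]
        context_vec[k] += step * old_target


def train(tokens, dim=10, half_window=2, negatives=3, epochs=50, lr=0.05, seed=0):
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    # Start from a random vector for each word in the vocabulary.
    targets = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
    contexts = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
    for _epoch in range(epochs):
        for i, target in enumerate(tokens):
            window = tokens[max(0, i - half_window):i] + tokens[i + 1:i + 1 + half_window]
            for neighbor in window:
                # Observed neighbor: bring the vectors closer together.
                sgd_step(targets[target], contexts[neighbor], 1.0, lr)
                # A random sample of other words: make them less close.
                # (A sampled word may occasionally be a real neighbor; ignored here.)
                for _sample in range(negatives):
                    sgd_step(targets[target], contexts[rng.choice(vocab)], 0.0, lr)
    return targets


tokens = "the left elevator goes down to the second floor".split()
vectors = train(tokens)
print(len(vectors), "words,", len(vectors["elevator"]), "dimensions each")
```

A corpus this small will not produce meaningful vectors; the point is only to show the shape of the update described above.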

What are the advantages of word embeddings? First of all, despite the low dimensions, the information regarding word similarity is kept. Similar words still have similar vectors - this is what the algorithms aim to do while learning these vectors. Second, they are compact, so any operation on these vectors (e.g. computing similarities) is efficient and fast.
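To get a feel for the difference in size, here is a back-of-the-envelope calculation in Python. The numbers are assumptions chosen for illustration: a vocabulary of 300,000 words, 300-dimensional embeddings, and 4 bytes per cell, with every cell stored explicitly.

```python
vocabulary_size = 300_000  # assumed |V|
embedding_dim = 300        # assumed D
bytes_per_cell = 4         # assumed 32-bit numbers

# Dense |V| x |V| matrix of distributional vectors.
distributional_bytes = vocabulary_size * vocabulary_size * bytes_per_cell

# |V| x D matrix of word embeddings.
embedding_bytes = vocabulary_size * embedding_dim * bytes_per_cell

print(distributional_bytes / 1e9, "GB")         # 360.0 GB
print(embedding_bytes / 1e6, "MB")              # 360.0 MB
print(distributional_bytes // embedding_bytes)  # 1000
```

In practice most co-occurrence counts are zero, so the large matrix can be stored sparsely, but the dense embedding matrix is still far smaller and simpler to work with.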

Moreover, people have done some amazing things with these vectors. Some of these things (if not all) are possible with high-dimensional vectors as well, but are computationally difficult.

  • Most similar words - we can easily find the most similar words to a certain word, by finding the most similar vectors. For instance, the 5 most similar words to July are June, January, October, November, and February. We can also find a word which is similar to a set of other words: for example, the most similar word to Israel, Lebanon, Syria, Jordan is Iraq, while the most similar word to France, Germany, Italy, Switzerland is Austria! Isn't this cool? If that's not impressive enough, using techniques for projecting high-dimensional vectors to 2-dimensional space (t-SNE, PCA), one can visualize word embeddings, and get some really nice insights about what the vectors capture:

Figure 3: a t-SNE visualization of a tiny sample of GloVe word embeddings. It seems to have noticed that bird and fly are related, as well as bird and cat, but love, marriage and wedding are not, for some reason.
    Figure 4: two-dimensional projection of word2vec vectors of countries and their capitals, from the paper. The lines between countries and their capitals are approximately parallel, indicating that the vectors capture this relation. 
  • Analogies - Following the above figure, the authors of this paper presented some nice results regarding the vectors' ability to solve analogy questions: a is to b as c is to d; given a, b, and c, find the missing word d. Turns out that this can be solved pretty accurately by selecting a vector which is similar to b and c, but not to a. This is done by addition and subtraction, and the most famous example is the "king + woman - man = queen", demonstrated in the following figure:
Figure 5: an illustration of the analogical male-female relation between word pairs such as (king, queen), (man, woman) and (uncle, aunt), from the paper.
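Both tricks in the list above can be sketched in Python. The 2-dimensional vectors here are hand-made for illustration (one coordinate loosely standing for "royalty" and one for "femaleness"); real embeddings are learned and have many more dimensions. The function names `most_similar` and `analogy` are my own.

```python
import math


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


# Hand-made toy vectors, invented for this example.
vectors = {
    "king": [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man": [0.1, 0.1],
    "woman": [0.1, 0.9],
    "uncle": [0.5, 0.1],
    "aunt": [0.5, 0.9],
}


def most_similar(query, exclude=(), top_n=1):
    """Words whose vectors have the highest cosine similarity to the query vector."""
    candidates = [w for w in vectors if w not in exclude]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], query), reverse=True)
    return ranked[:top_n]


def analogy(a, b, c):
    """a is to b as c is to ?  Solved as the word closest to b - a + c."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    return most_similar(query, exclude={a, b, c})[0]


print(most_similar(vectors["woman"], exclude={"woman"}))  # ['aunt']
print(analogy("man", "king", "woman"))                    # queen
print(analogy("man", "uncle", "woman"))                   # aunt
```

Excluding the three input words from the candidates matters: without it, the closest vector to b - a + c is often b or c itself.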

While these results are very impressive (I'm still impressed and I've known about them for a long time now...), it still isn't clear how useful these vectors are. Turns out they are amazingly useful. Almost any semantic task has been re-implemented recently with the help of word embeddings. Many of them benefit from a performance boost, and it's mostly just very easy to use them. In future posts, I'll give some examples of tasks that rely on word embeddings.

Suggested Reading:
Deep Learning, NLP, and Representations, from Christopher Olah's blog (more technical and advanced).

[1] Harris, Zellig S. "Distributional structure." Word. 1954. 
[2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin, "A neural probabilistic language model", The Journal of Machine Learning Research, 2003. 
[3] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient estimation of word representations in vector space." CoRR, 2013.
[4] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." EMNLP 2014. 


  1. Following a mail comment: how is the low dimensionality of word embeddings achieved?

    The main idea is that these vectors are predefined at a low dimension, e.g. 300, and they are initialized randomly. Then, the algorithm reads the corpus, and for each word, it updates its vector so that its similarity scores with the neighbours are high and its similarity scores with the rest of the word vectors (or with a sample of non-neighbours) are low. After reading a very large corpus, words that occur with similar neighbours will be similar due to the training objective.

    Another point that I didn't mention in the post is that prior to word embeddings, it was common to apply dimensionality reduction to the word co-occurrence matrix (count-based vectors) to get low-dimensional word vectors. They perform similarly or slightly worse than word embeddings (it depends who you ask... [1], [2]).

    [1] Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. Marco Baroni, Georgiana Dinu, and Germán Kruszewski. ACL 2014.
    [2] Improving distributional similarity with lessons learned from word embeddings. Omer Levy, Yoav Goldberg, and Ido Dagan. TACL 2015.

  2. Thank you for this excellent excellent post on the topic. Very easy to understanddd!! :)

    Thank You!! :)

  3. Thank you for this well articulated post!

  4. Thank you. Very well written. Can you please consider writing on why word embeddings generalize. Maybe another excellent blog on how word embeddings and neural networks work together, with some technical details as well. Thanks for considering the request.

    1. You're welcome, thanks for the feedback! I will consider that, but can't promise anything... I have yet to learn how to explain neural networks to laymen :)

  5. So helpful. Thanks a lot.

  6. Thanks a lot for your post. Well written and easy to understand for newbies (like me) in the NLP field.


  7. I read a dozen articles trying to understand this concept, i came across your quora answer on the same which directed me here. I am glad I read this, crystal clear explanation. Kudos to you!

  8. Hey, Vered!

    As one of the many people who are setting their foot into the world of NLP, I have always had second thoughts about keeping myself up-to-date about recent advances that happen in its area of research without knowing my fundamentals about the concepts involved. After having searched through a lot of online resources to understand about word embeddings better, I reached your blog post through Quora. I loved reading this article of yours as it helped me gain an immensely interesting glimpse on word embeddings in the first shot as a commoner! Thank you so much for sharing this article. And, I would love to read more such articles written by you.

    I shall be eagerly waiting for more to read and catch up on! Thanks again :)

  9. Thank you so much, it's great to get such feedback! :)

  10. Thanks for awesome explanation, I loved it.

  11. Thanks for great explanations, it was awesome. I still have a question. like we have several measurement formula to get the similarity between strings, is it correct to say that all them works only if we find the similarity of them based on the context?
    I mean for example the index of elevator is 1424, and the index of lift is 1550, so finding the similarity without having a context seems stupid?

    sorry for my naive question, I am kind of newbie

    1. So I dive more and got that the similarity measure works based on the frequency of the words in a DOC, but if you can provide me with a solid information that will be great

    2. Thanks for the feedback, I'm glad you find the post helpful!

      I'm not sure whether this answers your question, but yes, using cosine similarity for one-hot vectors indeed does not make sense. The other two types -- high-dimensional distributional vectors and word embeddings -- both produce similar vectors for semantically similar words, so vector similarity measures like cosine indicate whether two words are similar.

      Word vectors are usually learned based on the distributional hypothesis, so based on their contexts, but there has been work on learning word embeddings based on relations in WordNet, for example:,

      I hope this answers your questions (I'm not sure if I understand the second comment, about the words in the doc).

  12. Great contribution!

    Could you tell me what the source for figure 2 is?
    Thank you!

    1. Thank you! I drew it, as well as figure 1. Figures taken from other sources normally link to the sources.

      (If you're asking because you've recently seen the figure in a paper... I know. They took it from here).

    2. That's great!

      Did you use it in any paper of yours? I would like to use the figure in my thesis, but if possible with a reference to an academic paper

    3. I'm glad you find it helpful! I didn't use it in my papers, just here. Feel free to use it with a link to the blog :)

  13. Thank you for the wonderful explanation of words in the vector space.

    Few doubts:
    1. On last method (Word embeddings), transforming higher dimensions to lower dimension will cause the loss of some features?
    2. How to decide the number of neurons in the hidden layer as this will be act as transformed lower dimension. (above we say 300/500).

    Waiting !!! :)

  14. I really like this post, thank you for your help