Tuesday, August 14, 2018

Deep Learning in NLP

This post is an old debt. Since I started this blog 3 years ago, I’ve been refraining from writing about deep learning (DL), with the exception of occasionally discussing a method that uses it, without going into details. It’s a challenge to explain deep learning using simple concepts and without the caveat of remaining at a very high level. But perhaps worse than that, DL somewhat contradicts the description of my blog: “human-interpretable computer science”. We don’t really know how or why it works, and attempts to interpret it still only scratch the surface. On the other hand, it has been so prominent in NLP in the last few years (Figure 1), that it’s no longer reasonable to ignore it in a blog about NLP. So here’s my attempt to talk about it.
Figure 1: word cloud of the words in the titles of the accepted papers of ACL 2018, from https://acl2018.org/2018/07/31/conference-stats/. Note the prevalence of deep learning-related words such as “embeddings”, “network”, “rnns”.

This writing is based on many resources and most of the material is a summary of the main points I have taken from them. Credit is given in the form of linking to the original source. I would be happy to get corrections and comments!

Who is this post meant for?

As always, this post will probably be too basic for most NLP researchers, but you’re welcome to distribute it to people who are new to the field! And if you’re like me and you enjoy reading people’s views about things you already know, you’re welcome to read too!

What this post is not

This post is NOT an extensive list of all the most up-to-date recent innovations in deep learning for NLP; I don’t know many of them myself. I tried to remain at a relatively high level, so there are no formulas or formal definitions. If you want to read an extensive, detailed overview of how deep learning methods are used in NLP, I strongly recommend Yoav Goldberg’s “Neural Network Methods for Natural Language Processing” book. My post will also not teach you anything practical. If you’re looking to learn DL for NLP in 3 days and strike a fortune, there are tons of useful guides on the web. Not this post!

Let’s talk about deep learning already!

OK! Deep learning is a subfield of machine learning. As with machine learning, the goal is to give computers the ability to “learn” to solve some task without being explicitly programmed to solve it, but rather by using data and applying statistical methods to it. In the common case of supervised learning, the idea is to make the computer estimate some function, that takes some input and returns a prediction with respect to some question. I know this is very vague, so let’s take the task of named entity recognition (NER) as an example. The task is to determine for each word in a sequence of words (say, a sentence), whether it is a part of a named entity, and if so, of which type. In the text, “Prince Harry attended a wedding with a hole in his shoe” (I didn’t make it up), the words “Prince” and “Harry” should be classified as PERSON, and other words as NONE. A machine learning algorithm expects as input a set of properties pertaining to each word (called “feature vector”), and predicts a label for this word, out of a set of possible labels (e.g. PERSON, ORGANIZATION, LOCATION, NONE).

In the context of NLP, machine learning has been used for a long time, so what is new? Before we go through the differences, let’s think of the characteristics of a traditional machine learning algorithm, specifically a supervised classifier. In general, a traditional supervised classification pipeline was as follows.

Figure 2: pipeline of a traditional supervised classifier  - extracting features, then approximating the function.

The input is raw (e.g. text), along with the gold labels. Phase (1) is the extraction of human-designed features from raw input, and representing it as vectors of features. Going back to the NER example, the person designing the learning algorithm would ask themselves: “what can indicate the type of entity of a given word?”. For example, a capitalized word is a good indication that a word is a part of a named entity in general. When the previous word is “The”, it can hint that the entity is an organization rather than a person. Someone had to come up with all these ideas. Here is an example of features that were previously used for NER:

Figure 3: features for NER, from this paper. The example is taken from Stanford CS224d course: Deep Learning for Natural Language Processing.

Phase (2) is the learning/training phase, in which the computer tries to approximate a function that takes as input the feature vectors and predicts the correct labels. The function is now a function of the features, i.e., in the simple case of a linear classifier, it assigns a weight for each feature with respect to each label. Features which are highly indicative of a label (e.g. previous word = The for ORGANIZATION) would be assigned a high weight for that label, and those which are highly indicative of not belonging to that label would be assigned a very low weight. The final prediction is done by summing up all the weighted feature values, and choosing the label that got the highest score.
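To make the scoring step concrete, here is a minimal sketch of a linear classifier for the NER example. The feature names and all the weight values are made up for illustration (a real model would learn the weights from the training data), so treat it as a toy under those assumptions rather than an actual NER system.

```python
# Toy linear classifier: one weight per (label, feature) pair.
# All feature names and weight values are invented for illustration;
# a real model learns the weights from labeled training data.
WEIGHTS = {
    "PERSON": {"is_capitalized": 1.0, "prev_word=Prince": 2.0, "prev_word=The": -1.5},
    "ORGANIZATION": {"is_capitalized": 1.0, "prev_word=Prince": -1.0, "prev_word=The": 1.5},
    "NONE": {"bias": 0.5, "is_capitalized": -1.0},
}


def extract_features(words, i):
    """Phase 1: hand-designed binary features for the word at position i."""
    features = {"bias": 1.0}
    if words[i][0].isupper():
        features["is_capitalized"] = 1.0
    if i > 0:
        features["prev_word=" + words[i - 1]] = 1.0
    return features


def predict(features):
    """Sum the weighted feature values per label and pick the highest score."""
    scores = {
        label: sum(weights.get(name, 0.0) * value for name, value in features.items())
        for label, weights in WEIGHTS.items()
    }
    return max(scores, key=scores.get)


words = "Prince Harry attended a wedding".split()
print(predict(extract_features(words, 1)))  # "Harry" -> PERSON
print(predict(extract_features(words, 2)))  # "attended" -> NONE
```

Note how everything the classifier knows comes from the features someone decided to extract: a feature that is missing from the weight table (e.g. prev_word=Harry) simply contributes nothing.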

Note that I’m not discussing how these weights are learned from the training examples and the gold labels. This is not super important for the current discussion, and you can read about it elsewhere (e.g. here).

Now let’s talk about the differences between traditional machine learning and deep learning.

What has improved?

First of all, it works, empirically. The state-of-the-art performance in nearly every NLP task today is achieved by a neural model. Literally every task listed on this website “Tracking Progress in Natural Language Processing” has a neural model as the best performing model. Why is it working so much better than previous approaches?

Going Deep

“Deep learning” is a buzzword referring to deep neural networks - powerful learning algorithms inspired by the brain’s computation mechanism. A simple single-layered feed-forward neural network is basically the same as the linear classifier we discussed. A deep neural network is a network that contains one or more hidden layers, which are also learned. This is the visual difference between a linear classifier (left, neural network with no hidden layers) and a neural network with a single hidden layer (right):

Figure 4: shallow neural network (left) vs. deep (1 hidden layer) neural network (right).

In general, the deeper the network is, the more complex functions it can estimate, and the better it can (theoretically) approximate them.

In mathematical notation, each layer is a multiplication by another learned matrix, but more importantly, it also goes into a non-linear activation function (e.g. hyperbolic tangent or the simple rectifier that sets values under a certain threshold to zero). Some functions can’t be approximated accurately enough using linear models. I’m deliberately not going into details, but you can read about linear separability here. With the help of the multiple layers and the non-linear activations, neural networks can approximate better functions for many tasks, resulting in improved performance.
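A small sketch may help here. XOR is a classic example of a function that no linear model can compute, but a network with one hidden layer and a rectifier can. The weights below are set by hand (an assumption made for the sake of the example; a real network would learn them), and the threshold of the rectifier is taken to be zero.

```python
def relu(vector):
    """Rectifier: values below zero are set to zero."""
    return [max(0.0, x) for x in vector]


def linear(matrix, bias, vector):
    """Multiply a matrix (list of rows) by a vector and add a bias."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(matrix, bias)
    ]


# Hand-set toy parameters (not learned): one hidden layer with 2 units.
W1, B1 = [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]
W2, B2 = [[1.0, -2.0]], [0.0]


def forward(x):
    """Each layer is a matrix multiplication followed by a non-linearity."""
    hidden = relu(linear(W1, B1, x))
    return linear(W2, B2, hidden)[0]


for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, forward(x))  # outputs 0.0, 1.0, 1.0, 0.0: the XOR function
```

If you remove the call to relu, the two layers collapse into a single linear function, and no choice of weights gives the XOR outputs.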

Representation Learning

Another key aspect that changed is how the input is represented. Traditional machine learning methods worked because someone designed a well-thought feature vector to represent the inputs, as we can see in figure 3. Deep learning obviates the need to come up with a meaningful representation, and learns a representation from raw input (e.g. words). The new pipeline looks like this:

Figure 5: pipeline of a deep supervised classifier - learning both the input representation and the other model parameters.

Now, rather than feeding the learning algorithm hand-engineered feature vectors, we only give it the raw text. In the case of NER, we can now just feed as features a window of words around each target word. For example, in the sentence “John worked at the Post Office in the city until last year”, the feature vector of the target word “Office” with a window of size 3 would be [Post, Office, in]. The learning algorithm, in addition to learning the network parameters (the function from representation to output), as it did before, also learns the word representations suitable for the task at hand. In other words, one of the additional parameters that the network learns is the word embeddings. We can think about it as a lookup table whose index is a word (string) and its output is a vector.
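As a rough sketch of the window features and the lookup table, using the sentence from the example: the 2-dimensional vectors, the padding token, and the all-zeros fallback for unknown words are all assumptions made up for this illustration.

```python
# Toy lookup table from word (string) to vector. The values are invented;
# in a real network these vectors are parameters updated during training.
EMBEDDINGS = {
    "Post": [0.1, 0.2],
    "Office": [0.3, 0.4],
    "in": [0.5, 0.6],
}
UNKNOWN = [0.0, 0.0]  # fallback for padding and out-of-vocabulary words


def window(words, i, size=3):
    """Return the `size` words centered on position i, padding at the edges."""
    half = size // 2
    padded = ["<PAD>"] * half + words + ["<PAD>"] * half
    return padded[i:i + size]


def window_vector(words, i, size=3):
    """Concatenate the embeddings of the words in the window."""
    vector = []
    for word in window(words, i, size):
        vector.extend(EMBEDDINGS.get(word, UNKNOWN))
    return vector


sentence = "John worked at the Post Office in the city until last year".split()
target = sentence.index("Office")
print(window(sentence, target))         # ['Post', 'Office', 'in']
print(window_vector(sentence, target))  # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
```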

It is common to initialize this lookup table with pre-trained word embeddings. We discussed word embeddings in this blog post. They are trained using a large text collection, based on a linguistic hypothesis which states that words with similar meanings appear in the same contexts (next to the same “neighbour” words). Pre-trained embeddings are useful because they are often trained on a lot more data than is available for the end task itself, and more data generally yields higher-quality vectors.

On the other hand, pre-trained embeddings provide a general notion of “similarity” between words, which is not necessarily the same similarity our task needs. Think for example about sentiment analysis. Let’s say we are trying to predict the sentiment of restaurant reviews. Generic pre-trained embeddings will tell us that good and bad are highly similar, because they appear near the same words. But in the context of sentiment analysis we’d like their vectors to be further apart. The developer can choose between initializing the lookup table with the pre-trained embeddings or randomly, and also between updating the embeddings as additional network parameters, to fit the task better, or keeping them fixed. We’ll touch upon that when we discuss overfitting. 

Back to the NER example, this is what a network with a window size of 3 would look like:

Figure 6: a neural network for NER, which uses a window of 3 words to classify a word.

In a traditional model, the feature “next word” is a discrete variable that can accept as value any word in the vocabulary. Let’s say that during training, the model encountered a word like “ltd” many times as the next word of an organization, and figured that it is a good indication of the ORGANIZATION class. If during test time, the model needs to classify a word followed by “Inc”, it may have no information about this word, and can’t generalize the knowledge about the similar word “ltd”. When the feature vector is composed of word embeddings, the embeddings of “ltd” and “Inc” are similar, so the inputs are similar too, and the model can use knowledge about similar words to output the correct prediction.
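The difference can be sketched by comparing the two kinds of input vectors. The tiny vocabulary and the embedding values below are invented for the example; real embeddings have hundreds of dimensions.

```python
import math

VOCAB = ["ltd", "Inc", "city"]

# Invented 2-dimensional embeddings in which "ltd" and "Inc" are close.
EMBEDDINGS = {"ltd": [0.9, 0.1], "Inc": [0.8, 0.2], "city": [0.1, 0.9]}


def one_hot(word):
    """Discrete feature: a separate dimension for every vocabulary word."""
    return [1.0 if entry == word else 0.0 for entry in VOCAB]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# As discrete features, "ltd" and "Inc" have nothing in common.
print(cosine(one_hot("ltd"), one_hot("Inc")))  # 0.0
# As embeddings, the two inputs are nearly the same.
print(cosine(EMBEDDINGS["ltd"], EMBEDDINGS["Inc"]))
print(cosine(EMBEDDINGS["ltd"], EMBEDDINGS["city"]))
```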

Recurrent Neural Networks

The network we discussed so far is called a feed-forward neural network. In the NER example we used a fixed-size window of words around each target word. But what if we want to use the entire sentence? For example, in the sentence “John worked at the Post Office in the city until last year, and hated this organization” it is beneficial for the model to be aware of the last word “organization” while predicting the labels of “Post” and “Office”. One problem is that we don’t know in advance how many words each input sentence will contain. Another problem is that the representation of each word in the window model is independent of the other words - and we would like the representation to be more contextualized. For instance, the representation of “post” in the context of “post office” should be different from its representations in “blog post” or “post doc”.

Recurrent neural networks (RNNs) solve both problems. An RNN is a network that takes as input a sequence (e.g. a sentence as a sequence of words, or a sequence of characters, etc.) and outputs vectors representing each subsequence (i.e. the first item, the first two items, ..., the entire sequence). At each time step, the RNN considers both the previous memory (of the sequence until the previous input) and the current input. The last output vector can be considered as representing the entire sequence, like a sentence embedding. Intermediate vectors can represent a word in its context.
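Here is a minimal sketch of that loop, for the simplest kind of RNN (not an LSTM). The two weight matrices are hand-set toy values and the inputs are 2-dimensional stand-ins for word vectors; both are assumptions made for the example.

```python
import math

# Hand-set toy parameters (hypothetical values); a real RNN learns them.
W_INPUT = [[0.5, -0.3], [0.2, 0.8]]    # applied to the current input
W_MEMORY = [[0.1, 0.4], [-0.6, 0.3]]   # applied to the previous memory


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn(inputs):
    """Return one output vector per prefix of the input sequence."""
    memory = [0.0, 0.0]
    outputs = []
    for x in inputs:
        from_input = matvec(W_INPUT, x)
        from_memory = matvec(W_MEMORY, memory)
        # The new memory combines the current input with the previous memory.
        memory = [math.tanh(a + b) for a, b in zip(from_input, from_memory)]
        outputs.append(memory)
    return outputs


sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy "word vectors"
outputs = rnn(sentence)
print(len(outputs))   # 3: one vector per word, each aware of what preceded it
print(outputs[-1])    # can serve as a representation of the whole sequence
```

The same function works for a sequence of any length, and changing the first word changes the last output, which is what makes the representations contextualized.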

The output vectors can then be used for many things, among which: representation - a fixed-size vector representation for arbitrarily long texts; as feature vectors for classification (e.g. representing a word in context, and predicting its NER tag); or for generating new sequences (e.g. in translation, a sequence-to-sequence or seq2seq model encodes the source sentence and then decodes = generates the translation). In the case of the NER example, we can now use the output vector corresponding to each word as the word’s feature vector, and predict the label based on the entire preceding context.

Figure 7: a NER model that uses an RNN to represent each word in its context.

Technical notes: an LSTM (long short-term memory) is a specific type of RNN that works particularly well and is commonly used. The differences between various RNN architectures are in the implementation of the internal memory. A bidirectional RNN/LSTM or a biLSTM processes the sequences from both sides - right to left and left to right, such that each output vector contains information pertaining to both the previous and the subsequent items in the sequence. For a much more complete overview of RNNs, I refer you to Andrej Karpathy's blog post "The Unreasonable Effectiveness of Recurrent Neural Networks".

So these are the main differences. There are many other new techniques on top of them, such as attention, Generative Adversarial Networks (GAN), deep reinforcement learning, and multi-task learning, but we won’t discuss them.

Interestingly, neural networks are not a new idea. They have been around since the 1950s, but have become increasingly popular in recent years thanks to the advances in computing power and the amount of available text on the web.

What is not yet working perfectly?

Although the popular media likes to paint an optimistic picture of “AI is solved” (or rather, a very pessimistic, false picture of “AI is taking over humanity”), in practice there are still many limitations to current deep methods. Here are several of them, mostly from the point of view of someone who works on semantics (feel free to add more limitations in the comments!).

The Need for Unreasonable Amount of Data

It’s difficult to convey the challenges in our work to people outside the NLP field (and related fields), when there are already many products out there that perform so well. Indeed, we have come a long way to be able to generally claim that some tasks are “solved”. This is especially true of low-level text analysis tasks such as part of speech tagging. But the performance on more complex semantic tasks like machine translation is surprisingly good as well. So what is the secret sauce of success, and why am I still unsatisfied?

First, let’s look at a few examples of tasks “solved” by DL, not necessarily in NLP:
  • Automatic Speech Recognition (ASR) - also known as speech-to-text. Deep learning-based methods reported human-level performance last year, but this interesting blog post tells us otherwise. According to this post, while the recent improvements are impressive, the claims about human-level performance are too broad. ASR works very well on American-accented English with high signal-to-noise ratios. It has been trained on conversations by mostly American native English speakers with little background noise, data which is available at large scale. It doesn’t work well, definitely not at human-level performance, for other languages, accents, non-native speakers, etc.

  • Facial Recognition - another task claimed to be solved, but this article from the New York Times (based on a study from MIT) says that it’s only solved for white men. An algorithm trained to identify gender from images was 99% accurate on images of white men, but far less accurate -- only 65% -- for dark-skinned women. Why is that so? A widely used dataset for facial recognition was estimated to be more than 75% male and more than 80% white.

  • Machine Translation - not claimed to be solved yet, but the release of Google Translate neural models in 2016 reported large performance improvements. The paper reported “60% reduction in translation errors on several popular language pairs”. The language pairs were English→Spanish, English→French, English→Chinese, Spanish→English, French→English, and Chinese→English. All these languages are considered “high-resource” languages, or in simple words, languages for which there is a massive amount of training data. As we discussed in the blog post about translation models, the training data for machine translation systems is a large collection of the same texts written in both the source language (e.g. English) and the target language (e.g. French). Think of book translations as an example of a source of training data.

    In other news, popular media was recently worried that Google Translate spits out some religious nonsense, completely unrelated to the source text. Here is an example. The top example is with religious content, the bottom example is not religious, only unrelated to the source text.

    Figure 8: “who has been using these technologies for a long time”? Hopefully not the Igbo speakers, translating to English. Google Translate makes up things when translating gibberish from the low-resource language “Igbo”.

    This excellent blog post offers a simple explanation for this phenomenon: “low-resource” language pairs (e.g. Igbo and English), for which there is not a lot of available data, have worse neural translation models. Not surprising so far. The reason the translator generates nonsense is that when it is given unknown inputs (the input nonsense in Figure 8), the system tries to provide a fluent translation and ends up “hallucinating” sentences. Why religious texts? Because religious texts like the Bible and the Koran exist in many languages, and they are probably the major part of the available training data for translations between low-resource languages.

Can you spot a pattern among the successful tasks? Having a tremendous amount of training data increases the chances of training high-quality neural models for the task. Of course, it is a necessary-but-not-sufficient condition. In her RepL4NLP keynote, Yejin Choi talked about “solved” tasks and said that they all have in common a lot of training data and enough layers. But with respect to NLP tasks, there is another factor. Performing well on machine translation is possible without having a model with deep text understanding abilities, but rather by relying on the strong alignment between the input and the output. Other tasks, which require deeper understanding, such as recognizing fake news, summarizing a document, or holding a conversation, have not been solved yet (and may not be solvable just by adding more training data or more layers?).

The models of these “solved” tasks are only applicable for inputs which are similar to the training data. ASR for Scottish accent, facial recognition for black women, and translation from Igbo to English are not solved yet. Translation is not the only NLP example; whenever someone tells you some NLP task is solved, it probably only applies to very specific domains in English.

The Risk of Overfitting

Immediately following the previous point, when the training data is limited, we have a risk of overfitting. By definition, overfitting happens when a model performs extremely well on the training set, while performing poorly on the test set. It happens because the model memorizes specific patterns in the training data instead of looking at the big picture. If these patterns are not indicative of the actual task, and are not present in the test data, the model would perform badly on the test data. For example, let’s say you’re working on a very simplistic variant of text classification, in which you need to distinguish between news and sports articles. Your training data contains articles about sports and tweets from news agencies. Your model may learn that an article with fewer than 280 characters is related to news. The performance would be great on the training set, but what your model actually learned is not to distinguish between news and sports articles but rather between tweets and other texts. This definitely won’t be helpful in production when your news examples can be full-length articles.
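The news-vs-sports example can be played out in a few lines. The “model” below is just the length shortcut written as a rule, and all the texts are made up; it only illustrates how a shortcut that is perfect on the training set falls apart on different data.

```python
def predict(text):
    """The shortcut the model picked up: short texts are news."""
    return "news" if len(text) < 280 else "sports"


def accuracy(examples):
    """Fraction of (text, label) pairs that the shortcut gets right."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


# Made-up training data: news examples are tweets, sports are full articles.
train = [
    ("Breaking: parliament votes on the new budget tonight.", "news"),
    ("Markets close lower after the central bank announcement.", "news"),
    ("The home team dominated the first half. " * 20, "sports"),
    ("A late goal decided the championship final. " * 20, "sports"),
]
# In production, news arrives as full-length articles too.
production = [
    ("The parliament debated the new budget for hours. " * 20, "news"),
    ("The home team dominated the first half. " * 20, "sports"),
]

print(accuracy(train))       # 1.0
print(accuracy(production))  # 0.5
```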

Overfitting is not new to DL, but what’s changed from traditional machine learning are two main aspects: (1) it’s more difficult to “debug” DL models and detect overfitting, because we no longer have nice manually-designed features, but automatically learned representations; and (2) the models have many many more parameters than traditional machine learning models used to have - the more layers, the more parameters. This means that a model can now learn more complex functions, but it’s not guaranteed to learn the best function for the task, but rather the best function for the given data. Unfortunately, it’s not always the same, and this is something to keep in mind!

With respect to the first aspect, updating the pre-trained word embeddings during training can lead to overfitting. Given that the training set is limited and doesn’t cover the entire vocabulary, we’re only moving some words in the embedding space but keeping others in their place. The words we move (e.g. kangaroo) now have better vectors for the specific task, but they’re further away from their (distributional) neighbours which are not found in the training set. This hurts the model’s generalization abilities: when it encounters an unobserved word (e.g. wallaby), its vector is no longer located next to similar words (kangaroo) which have been moved, so the model doesn’t know much about it. Updating the embeddings during training is a good idea only if your task has very different needs from its embeddings than just plain similarity (and then you may want to start with a random initialization rather than pre-trained embeddings), and only if you have enough training data that covers a broad vocabulary.
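A toy illustration of the kangaroo/wallaby drift: the vectors and the size of the update are invented, and the “training” is simulated by adding a fixed offset to the one word that appears in the training set.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Invented pre-trained vectors: the two words start out as close neighbours.
embeddings = {"kangaroo": [1.0, 0.1], "wallaby": [0.9, 0.2]}
before = cosine(embeddings["kangaroo"], embeddings["wallaby"])

# Simulated effect of training: only words seen in the training set move.
# "wallaby" never appears in the training data, so it stays where it was.
task_update = {"kangaroo": [-0.8, 0.9]}
for word, delta in task_update.items():
    embeddings[word] = [x + d for x, d in zip(embeddings[word], delta)]

after = cosine(embeddings["kangaroo"], embeddings["wallaby"])
print(before, after)  # the similarity drops sharply after the update
```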

For the next, related point, I’m going to broaden the definition of “overfitting” to the phenomenon of a model that memorizes the peculiarities in the data rather than what’s actually important for the task, regardless of its performance on the test set.

The Artificial Data Problem

Machine learning enthusiasts like to say that given enough training data, DL can learn to estimate any function. I personally take the more pessimistic stand that some language tasks are too complex and nuanced to solve using DL with any reasonable amount of training data. Nevertheless, like everyone else in the field, I’m constantly busy thinking of ways to get more data quickly and without spending too much money. 

One representative example (of many) of a task in which the release of a large amount of training data raised researchers’ interest in the task is recognizing textual entailment (RTE, sometimes called “natural language inference” or NLI). This is an artificial task that was invented because many language understanding abilities we develop can be reduced to this task. In this task, two sentences -- a premise and a hypothesis -- are given. The task is for a model to automatically determine what a human reading the premise can say with respect to the hypothesis. Let’s look at the definitions along with a simple example from the SNLI dataset. For the premise “A street performer is doing his act for kids” and the hypothesis:

  • Entailment: the hypothesis must also be true.
    Hypothesis = “A person performing for children on the street”.
    The premise tells us there is a street performer, so we can infer that he is a person and he is on the street. The premise also tells us that he is doing his act - therefore performing, and for kids = for children. The hypothesis only repeats the information conveyed in the premise and information which can be inferred from it.

  • Neutral: the hypothesis may or may not be true.
    Hypothesis = “A juggler entertaining a group of children on the street”.
    In addition to repeating information from the premise, the hypothesis now tells us that it’s a juggler. The premise is more general and can refer to other types of street performers (e.g. a guitar player), so this may or may not be a juggler.

  • Contradiction: the hypothesis must be false.
    Hypothesis = “A magician performing for an audience in a nightclub”.
    The hypothesis describes a completely different event happening at a nightclub.

This task is very difficult for humans and machines, and requires background knowledge, commonsense, knowledge about the relationship between words (whether they mean the same thing, one of them is more specific than the other, or they contradict each other), recognizing that different mentions refer to the same entity (coreference), dealing with syntactic variations, etc. For an extensive list, I recommend reading this really good summary.

In the early days, methods’ performance was mediocre. The available data for this task contained a few hundred annotated examples. They required many different types of knowledge to answer correctly, and were diverse in their topics. Unfortunately, there were too few of them to throw a neural network at... A few years ago, the huge SNLI dataset was released, containing half a million examples. What enabled scaling up to such a large dataset was (1) taking the premises from an already available collection of image captions; and (2) asking people to generate entailed, neutral, and contradicting hypotheses. This made the data collection simple enough to not require experts but rather be able to use crowdsourcing workers.

Following the release of this dataset, the interest of the NLP community in RTE peaked. Specifically, many neural methods have been developed. What they basically do is encode each sentence (premise, hypothesis) into a vector, normally by running each sentence through an RNN. The premise and hypothesis vectors are then combined using arithmetic operations and sent to a classifier that outputs the label (entailment, neutral, or contradiction). This approach (left side of figure 9) can get you as far as ~87% accuracy on the test set, and the difference between the specific methods is mostly in the technicalities. More sophisticated methods encode the hypothesis conditioned on the premise, using the neural attention mechanism (right side of figure 9). In simple words, the hypothesis encoder is allowed to “look at” the premise words, and it roughly aligns each hypothesis word to a related premise word. For example, given the premise “A street performer is doing his act for kids” and the hypothesis “A juggler entertaining a group of children on the street”, the alignments would be juggler-performer, act-entertaining, street-street, etc. (In practice, it’s not a 1-on-1 alignment, but a weighted 1-to-many attention). This approach gets you to over 90% accuracy today, which is beyond human performance on this dataset. Yes, you read correctly. A statistical DL method is better than humans on this dataset. 

Figure 9: two common architectures of neural RTE systems. Left: sentence encoding models that encode each sentence separately. Right: attention model that encodes the hypothesis conditioned on the premise. Both models extract features from the premise and hypothesis vectors and use them for classification.
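The two ingredients described above can be sketched as follows. The specific choices here are assumptions: combining the sentence vectors by concatenating them with their element-wise difference and product is one common option among several, and the attention is plain dot-product attention with a softmax. The word vectors are invented.

```python
import math


def combine(premise_vec, hypothesis_vec):
    """Sentence-encoding model: features built with arithmetic operations."""
    difference = [abs(p - h) for p, h in zip(premise_vec, hypothesis_vec)]
    product = [p * h for p, h in zip(premise_vec, hypothesis_vec)]
    # The concatenated vector would then be sent to the label classifier.
    return premise_vec + hypothesis_vec + difference + product


def attention_weights(hypothesis_word_vec, premise_word_vecs):
    """Attention model: a weighted 1-to-many alignment to the premise words."""
    scores = [
        sum(h * p for h, p in zip(hypothesis_word_vec, vec))
        for vec in premise_word_vecs
    ]
    exps = [math.exp(score) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(combine([1.0, 2.0], [3.0, 4.0]))
# [1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 3.0, 8.0]

premise_words = ["performer", "kids"]
premise_vecs = [[1.0, 0.0], [0.0, 1.0]]  # invented word vectors
juggler = [0.9, 0.1]
for word, weight in zip(premise_words, attention_weights(juggler, premise_vecs)):
    print(word, round(weight, 2))  # "performer" gets most of the weight
```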

Anyone who’s ever worked on textual entailment and who is even a tiny bit skeptical of DL as the solution to everything, had to be suspicious. Indeed, many of us were. Fast forward a few months, and a flood of papers confirmed that the task is indeed not solved. Instead of solving the general, very difficult textual entailment task, our models are memorizing very specific peculiarities of the training data (which are also common in the test data). 1, 2 and 3 concurrently showed that a model which has access only to the hypothesis can solve the task with performance which is much better than a random guess (which is what we’d expect from a model that has no access to the premise). They all pointed out some peculiarities in the data that enable that. For example, hypotheses of contradicting examples tend to contain more negative words. This happens because the premises are image captions (“a dog is running in the park”). Image captions rarely describe something that doesn’t happen. The easiest thing for a person asked to write a contradicting sentence is to add negation: “a dog is not running in the park”.

Funnily, 1 also showed that the appearance of the word “cat” in the hypothesis can indicate contradiction, as there were many dog images, and what contradicts a dog better than a cat? In reality, cats are lovely creatures, and a sentence with a cat doesn’t immediately contradict any other sentence.
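To see why a hypothesis-only model can beat a random guess, consider this deliberately silly sketch. It is a hand-written rule, not the trained classifiers used in those papers, and the list of cue words is my own assumption.

```python
NEGATION_CUES = {"not", "no", "nobody", "never"}  # hypothetical cue list


def hypothesis_only_predict(premise, hypothesis):
    """Predict a label while ignoring the premise entirely."""
    tokens = hypothesis.lower().split()
    if any(token in NEGATION_CUES for token in tokens):
        return "contradiction"
    return "entailment"


premise = "a dog is running in the park"
print(hypothesis_only_predict(premise, "a dog is not running in the park"))
# contradiction

# The prediction stays the same for an unrelated premise, which shows
# that nothing about the relation between the sentences was learned.
print(hypothesis_only_predict("two men are cooking", "a dog is not running in the park"))
# contradiction
```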

We also showed that state-of-the-art models fail on examples that require knowledge about relations between words, even when the example is super simple. For instance, the models would think that “a man starts his day in India” and “a man starts his day in Malaysia” are entailing, just because India and Malaysia are similar (although mutually exclusive) words. We showed that the models only learn to distinguish between such words if the specific words appear enough times in the training data. For example, many contradiction examples in the training data have a man in the premise doing something and a woman in the hypothesis doing the same thing. Having observed enough of these examples, the models learn that man and woman are mutually exclusive. But they fail in the India/Malaysia example, because they didn’t observe this exact pair of words in enough training examples. Since it’s unreasonable to rely on the training set to provide enough examples of each possible pair of words, a different solution, probably involving incorporating external knowledge from dictionaries and taxonomies, is needed.

The main lesson from this story should not be that DL methods are unsophisticated parrots that can only repeat exactly what they saw in the training data. Instead, there are several things to consider:

  1. Good performance on the test set doesn’t necessarily indicate solving the task. Whenever our training and test data are not “natural” but rather generated in a somewhat artificial way, we run the risk that they will both contain the same peculiarities which are not properties of the actual task. A model learning these peculiarities is wasting energy on remembering things that are unhelpful in real-world usage, and creating an illusion of solving an unsolved task. I’m not saying we shouldn’t in any case process our data - just that we should be aware of this. If your model is getting really good performance on a really difficult task, there’s reason to be suspicious.

  2. DL methods can only be as good as the input they get. When we make inferences, we employ a lot of common sense and world knowledge. This knowledge is simply not available in the training data, and we can never expect the training data to be extensive enough to contain it. Domain knowledge is not redundant, and in the near future someone will come up with smart ways to incorporate it into a neural model, and achieve good performance on newer, less simple datasets.

The Shaky Ground of Distributional Word Embeddings

At the core of many deep learning methods lie pre-trained word embeddings. They are a great tool for capturing semantic similarity between words. They mostly capture topical similarity, or relatedness (e.g. elevator-floor), but they can also capture functional similarity (e.g. elevator-escalator). Relying on pre-trained word embeddings is a great way to make your model generalize to new inputs, and I recommend it for any task that doesn’t have loads of available data.

With that said, it’s not perfect. In many tasks we need to know the exact relationship between two words. It’s not enough to know that elevator and escalator (or India and Malaysia) are similar - we need to know that they are mutually exclusive. And word embeddings don’t tell us that. In fact, they conflate lots of semantic relations together.

I think I have a good way to simulate that, and I’ve been using it in my talks for the last few months. The idea is to take some text, say lyrics of a song, a script of a TV series, a famous speech, anything you like. Then, go over the text and replace each noun with its most similar word in word2vec. (It doesn’t have to be word2vec and doesn’t have to be only for nouns - this is what I did. The code and some other examples are available here.) Here is a part of my favorite example: Martin Luther King’s “I Have a Dream” speech. This is what you get when you replace words by their word2vec neighbours:

Figure 10: a part of Martin Luther King’s “I Have a Dream” speech after replacing nouns with most similar word2vec words.

Apart from being funny, this is a good illustration of the phenomenon: words have been replaced with other words that stand in different relationships with them. For example, country instead of nation is not quite the same, but synonymous enough. Kids instead of children is perfectly fine. But a daydream is a type of dream, week and day are mutually exclusive (they share the category of time unit), Classical.com is completely unrelated to content (yes, statistical methods have errors…), and protagonist is synonymous with the original word character, but in the sense of a character in a book - and not of an individual’s qualities.
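To make the substitution procedure concrete, here is a minimal sketch of it. It is not the code linked above: the three-dimensional vectors in `TOY_VECTORS` are hypothetical, hand-picked numbers (real word2vec vectors are learned from text and have hundreds of dimensions), and instead of running a part-of-speech tagger to find nouns, it simply replaces every token that has a vector.

```python
import math
from typing import Dict, List

# Hypothetical toy vectors, hand-picked for illustration only.
TOY_VECTORS: Dict[str, List[float]] = {
    "dream":    [0.9, 0.1, 0.0],
    "daydream": [0.8, 0.2, 0.1],
    "day":      [0.1, 0.9, 0.2],
    "week":     [0.2, 0.8, 0.3],
    "nation":   [0.0, 0.2, 0.9],
    "country":  [0.1, 0.3, 0.8],
}


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word: str, vectors: Dict[str, List[float]]) -> str:
    """Return the nearest *other* word in the vocabulary by cosine similarity."""
    target = vectors[word]
    candidates = [w for w in vectors if w != word]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


def replace_words(tokens: List[str], vectors: Dict[str, List[float]]) -> List[str]:
    """Replace each token that has a vector; leave all other tokens untouched."""
    return [most_similar(t, vectors) if t in vectors else t for t in tokens]


sentence = "i have a dream that one day this nation will rise up"
print(" ".join(replace_words(sentence.split(), TOY_VECTORS)))
# prints: i have a daydream that one week this country will rise up
```

Note that nothing in the similarity score says whether the neighbour is a synonym, a hyponym, or a mutually exclusive word: the nearest vector wins regardless of the relation.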

In the last 3 years, multiple methods have also been proposed for learning a different type of word embeddings, which capture -- in addition to or instead of this fuzzy similarity -- semantic relations from taxonomies like WordNet. For example, the Retrofitting method started with distributional (regular) word embeddings and then moved vectors in the space such that two words that appear together in WordNet as synonyms would be close to each other in the vector space. The Attract-Repel method did the same but also made sure that antonyms would be further apart (e.g. think again of the good/bad vectors in sentiment analysis). Other methods include Order Embeddings, Poincaré Embeddings, LEAR, and many more. While these methods are elegant, and have been shown to capture the semantic relations they get as input, they have yet to improve the performance of NLP applications over a version of the system that uses regular embeddings.
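The intuition behind Retrofitting can be sketched in a few lines. This is a simplified version, not the published implementation: it assumes that a word's original vector and the average of its synonyms get equal weight, the `synonyms` dictionary is a hypothetical stand-in for WordNet, and the two-dimensional vectors are made up.

```python
from typing import Dict, List

Vector = List[float]


def retrofit(vectors: Dict[str, Vector],
             synonyms: Dict[str, List[str]],
             iterations: int = 10) -> Dict[str, Vector]:
    """Pull each word towards its synonyms while keeping it anchored
    to its original distributional vector."""
    new_vectors = {word: list(vec) for word, vec in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in synonyms.items():
            # Only use synonyms that actually have a vector.
            neighbours = [n for n in neighbours if n in new_vectors]
            if word not in vectors or not neighbours:
                continue
            k = len(neighbours)
            updated = []
            for d in range(len(vectors[word])):
                neighbour_sum = sum(new_vectors[n][d] for n in neighbours)
                # Average of (a) the synonyms' current vectors and
                # (b) the word's ORIGINAL vector, weighted equally.
                updated.append((neighbour_sum + k * vectors[word][d]) / (2 * k))
            new_vectors[word] = updated
    return new_vectors


def distance(u: Vector, v: Vector) -> float:
    """Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


# Made-up example: "happy" and "glad" are listed as synonyms, "sad" is not.
vectors = {"happy": [1.0, 0.0], "glad": [0.0, 1.0], "sad": [1.0, 0.2]}
synonyms = {"happy": ["glad"], "glad": ["happy"]}

retrofitted = retrofit(vectors, synonyms)
before = distance(vectors["happy"], vectors["glad"])
after = distance(retrofitted["happy"], retrofitted["glad"])
print(f"happy-glad distance before: {before:.3f}, after: {after:.3f}")
```

The synonyms move closer together, while words with no entry in the lexicon (here "sad") keep their original vectors.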

The Unsatisfactory Representations Beyond the Word Level

Recurrent neural networks allow us to process texts of arbitrary length: from phrases consisting of several words to sentences, paragraphs, and even full documents. Does this also mean that we can have phrase, sentence, and document embeddings that will capture the meaning of these texts?

Now is a good time to remember the famous quote from Ray Mooney:

To be honest, I completely agree with this opinion when talking about general-purpose sentence embeddings. While we have good methods to represent sentences for specific tasks and objectives, it’s not clear to me what a “generic” sentence embedding needs to capture and how to learn such a thing.

Many researchers think differently, and sentence embeddings have been pretty common in the last few years. To name a few, Skip-Thought vectors build upon the assumption that a meaningful sentence representation can help predict the next sentence. Given that even I as a human can rarely predict the next sentence in a text I’m reading, I think this is a very naive assumption (could you predict that I’ll say that?...). But it probably predicts the next sentence with more accuracy than it would predict a completely unrelated sentence, creating a kind of topical representation rather than a meaning representation. In the example in Figure 11, the model considered lexically-similar sentences (i.e. sentences that share the same words or contain similar words) as more similar to the target sentence than a sentence with a similar meaning but a very different phrasing. I’m not surprised.

Figure 11: The most similar sentences to the target sentence “A man starts his day in India” out of the 3 other sentences that were encoded, using the sent2vec demo.

Another approach is the Autoencoder, which creates a vector representing the input text, and is trained to reproduce the input from that vector. The core assumption is that to be able to predict the original sentence, the representation must capture important aspects of the sentence’s meaning. You can think about it as a type of compression.

Finally, sentence embeddings are also a byproduct of models for tasks that represent sentences as vectors, like textual entailment models (for classification) or machine translation models (for text generation). Yes, they are trained to capture a specific aspect relating to their end task (entailment / translation), but assuming that these tasks require deep understanding of the meaning of a sentence, the embeddings can be used as general representations.

So what aspects of the sentence do these representations capture? This paper did a pretty extensive analysis of various types of sentence embeddings. They defined some interesting properties which may be conveyed in a sentence, starting from shallow things like the sentence length (number of words) and whether a specific word is in the sentence or not; moving on to syntactic properties such as the order of words in the sentence; and finally semantic properties like whether the sentence is topically coherent (in other words, is it possible to distinguish between a “real” sentence and a sentence in which one word was replaced with a completely random word). To find out which of these properties are encoded by which sentence embedding approach, they used the sentence embeddings as inputs to very simple classifiers, each trained to recognize a single property (e.g. using the vector of “this is a vector” in the sentence length classifier to predict 4). The performance of the various sentence embeddings on all tasks was somewhere between the performance of a very simple baseline method (e.g. using random vectors) and the human performance on the task. Not surprisingly, there is more room for improvement on the complex and more semantic tasks.

I’d like at this point to repeat Ray Mooney’s quote and say that you still can’t cram the meaning of a whole sentence into a single vector. It’s impressive that we have come so far as to have representations that capture all these properties, but is this all there is to a sentence’s meaning? Here are some things that I don’t know whether the embeddings capture or not, but mostly assume they don’t:

  1. Do they capture things which are not said explicitly, but can be implied? (I didn’t eat anything since the morning implies I’m hungry).

  2. Do they capture the meaning of the sentence in the context it is said? (No, thanks can mean I don’t want to eat if you’ve just offered me food).

  3. Do they always assume that the meaning of a sentence is literal and compositional (composed of the meanings of the individual words), or do they have good representations for idioms? (I will clean the house when pigs fly means I will never clean the house).

  4. Do they capture pragmatics? (Can you tell me about yourself actually means tell me about yourself. You don’t want your sentence vectors to be like the interviewee in this joke).

  5. Do they capture things which are not said explicitly because the speakers have a common background? (If I tell a local friend that the prime minister must go home, we both know the specific prime minister I’m talking about).

I can go on and on, but I think these are enough examples to show that we have yet to come up with a meaning representation that mimics whatever representation we have in our heads, which is derived by making inferences and drawing on common sense and world knowledge. If you need more examples, I recommend taking a look at the slides from Emily Bender’s “100 Things You Always Wanted to Know about Semantics & Pragmatics But Were Afraid to Ask” tutorial.

The Non Existing Robustness

Anyone who’s ever tried to reimplement neural models and reproduce the published results knows we have a problem. More often than not, you won’t get the exact same published results. Sometimes not even close. This often happens because of differences in the values of “hyper-parameters”: technical settings that have to do with training such as the number of epochs, the regularization values and methods, and others. While they may not seem super important, in practice hyper-parameter values can make a large difference in performance.

The problem starts when training new models. You come up with an elegant model, you implement and train it, but the results are not as expected. This doesn't mean that the architecture or the data is not good; it often means that you need to tweak the hyper-parameter values and re-train the model, yielding completely different and hopefully better performance. Unfortunately, it’s almost impossible to tell in advance which values would yield better performance. There is no best practice, just a lot of trial and error. We train many different models with various settings, then test their performance on the validation set (a set of examples separate from the training and the test sets) and choose the best performing model (which is then tested on the test set).
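The trial-and-error loop described above can be sketched as a small grid search. Everything here is a toy assumption for illustration: the “model” is a single weight `w` fitted by gradient descent to made-up data generated from y = 2x, and the grid of learning rates and epoch counts is arbitrary. A real neural model would have many more hyper-parameters and each training run would take hours rather than microseconds.

```python
from itertools import product
from typing import List, Tuple

Data = List[Tuple[float, float]]


def train(data: Data, learning_rate: float, epochs: int) -> float:
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    return w


def mse(w: float, data: Data) -> float:
    """Mean squared error of the model y = w * x on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def grid_search(train_set: Data, validation_set: Data,
                learning_rates: List[float],
                epoch_options: List[int]) -> Tuple[float, float, int, float]:
    """Train one model per setting; keep the one with the lowest validation error."""
    best = None
    for lr, epochs in product(learning_rates, epoch_options):
        w = train(train_set, lr, epochs)
        score = mse(w, validation_set)
        if best is None or score < best[0]:
            best = (score, lr, epochs, w)
    return best


# Made-up data, split into training, validation and test sets.
train_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
validation_set = [(5.0, 10.0), (6.0, 12.0)]
test_set = [(7.0, 14.0)]

val_score, best_lr, best_epochs, best_w = grid_search(
    train_set, validation_set,
    learning_rates=[0.001, 0.01, 0.1, 0.2],
    epoch_options=[5, 50],
)

# The test set is touched only once, after the setting has been chosen.
test_score = mse(best_w, test_set)
print(f"best setting: lr={best_lr}, epochs={best_epochs}, test MSE={test_score:.2e}")
```

Even in this tiny example the same model ranges from nearly perfect to completely diverged depending on the learning rate alone (with 0.2, each update overshoots and the error grows), which is the sensitivity the paragraph above complains about.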

Hyper-parameter tuning is an exhausting and often frustrating process. I’m sure that many good models get lost on the way because the researcher lost patience or ran out of computational resources (yes, neural models also take longer to train… and strong machines cost a lot of money).

It’s pretty discouraging to think that achieving good performance on some test set is sometimes due to arbitrary settings rather than thanks to sound scientific ideas and model design.

The Lack of Interpretability

Last but not least, the interpretability issue is probably the worst caveat of DL. 

Machines don’t give an explanation for their predictions. While this was also true for traditional ML, it was easier to analyze the predictions and come up with an explanation in retrospect. Having algorithms which also learn the representations, and networks with a lot more parameters, makes this much more difficult. To put it simply, we generally have no idea what’s happening inside the networks we train: they are “black box” models.

Why is this a problem? First, being able to interpret our models would help us debug them and easily understand why they are not working and what exactly they are learning. It would make the development cycle much shorter and our models more robust and trustworthy. While it’s nearly impossible today, there are people working to change it.

Second, and more importantly, in some tasks, transparency and accountability are crucial, specifically in tasks concerned with safety, or in which a model can discriminate against particular groups. Sometimes there is even a legal requirement to provide an explanation. Think of the following examples (not necessarily NLP):

  • Self-driving cars 
  • Predicting the probability that a prisoner will commit another crime (who should be released?) 
  • Predicting the probability of death for patients (who is a better healthcare “financial investment”?) 
  • Deciding who should be approved for a loan 
  • more... 

In the next post I will elaborate on some of these examples and how ML sometimes discriminates against particular groups. This post is long enough already!


  1. A much needed primer for DL for NLP. Enjoyed it as usual. Would you be writing a followup with more concrete use cases?

    1. About the hype vs gloom of "AI"in general I wrote this recently - https://goo.gl/JQ5YRg

    2. On the Interpretability - Cathy O'Neil's TED talk, Weapons of Math Destruction might interest you. https://goo.gl/zWSRrV

    1. Thanks Prithiviraj! I will probably write followups (specifically, I plan one about fairness, which is related to the interpretability issue). I would be happy to get additional topic ideas :)

      Thanks for the links, I'll look into them!
