Friday, September 14, 2018

Ethical Machine Learning

With machine learning increasingly automating many previously manual decision-making processes, it’s time to reflect not only on the algorithms’ performance but also on the ethical issues involved. Here are some of the questions these concerns raise:

  • Fairness: are the outputs of our algorithms fair towards everyone? Is it possible that they discriminate against people based on characteristics such as gender, race, and sexual orientation?
  • Are developers responsible for potential bad usages of their algorithms?
  • Accountability: who is responsible for the output of the algorithm?
  • Interpretability and transparency: in sensitive applications, can we get an explanation for the algorithm’s decision?
  • Are we aware of human biases found in our training data? Can we reduce them?
  • What should we do to use user data cautiously and respect user privacy?

As evidence of the importance of ethics in machine learning: it is now taught in classes, has a growing community of researchers working on it, dedicated workshops and tutorials, and a Google team entirely devoted to it.

We are going to look into several examples.

To train or not to train? That is the question

Machine learning has evolved dramatically over the past few years, and together with the availability of data, it’s possible to do many things more accurately than before. But prior to considering implementation details, we need to pause for a second and ask ourselves: so, we can develop this model - but should we do it?

The models we develop can have bad implications. Assuming that none of my readers is a villain, let’s think in terms of “the road to hell is paved with good intentions”. How can (sometimes seemingly innocent) ML models be used for bad purposes?

In some cases, the answer is obvious (do we really want to determine that someone is a potential criminal based on their looks?). In other cases, it’s not straightforward to weigh all the potential malicious usages of our algorithm against the good purposes it can serve. In any case, it’s worth asking ourselves this question before we start coding.


Would you develop a model that can recognize smurfs if you knew it could be used by Gargamel? (Image source)


Underrepresented groups in the data

So our model passed the “should we train it” phase and now it’s time to gather some data! What can go wrong in this phase?

In the previous post we saw some examples of seemingly solved tasks whose models work well only for certain populations. Speech recognition works well for white males with an American accent but less so for other populations. Text analysis tools don’t recognize African-American English as English. Face recognition works well for white men but far less accurately for dark-skinned women. In 2015, Google Photos automatically labelled pictures of black people as “gorillas”.

The common root of the problem in all these examples is insufficient representation of certain groups in the training data: not enough speech by women, black people, and speakers with non-American accents; text analysis tools are often trained on news text, which is mostly written by adult white males; and not enough facial images of black people. If you think about it, it’s not surprising. This goes all the way back to photographic film, which had problems rendering dark skin. I don’t actually think there were bad intentions behind any of these, just maybe - ignorance? We are all guilty of being self-centered, so we develop models that work well for people like us. In the case of the software industry, this mostly means “work well for white males”.

Biased supervision

When we train a model using supervised learning, we train it to perform similarly to humans. Unfortunately, it comes with all the disadvantages of humans, and we often train models to mimic human biases.

Let’s start with the classic example. Say that we would like to automate the process of mortgage applications, i.e. training a classifier to decide whether or not someone is eligible for a mortgage. The classifier is trained using the previous mortgage applications with their human-made decisions (accepted/rejected) as gold labels. It’s important to note that we don’t exactly train a classifier to accurately predict an individual’s ability to pay back the loan; instead we train a classifier to predict what a human would decide when being presented with the application.

We already know that humans have implicit biases and that sensitive attributes such as race and gender may affect these decisions negatively. For example, in the US, black people are less likely to get a mortgage. Since we don’t want our classifier to learn this bad practice (i.e. rejecting a mortgage merely because the applicant is black), we leave out those sensitive attributes from our feature vectors. The model has no access to these attributes.

Except that analyzing the classifier’s predictions with respect to the sensitive attributes may yield surprising results; for example, that black people are less likely than white people to be eligible for a mortgage. The model is biased against black people. How could this happen?

Apparently, the classifier gets access to the excluded sensitive attributes through included attributes which are correlated with them. For example, if we provided the applicant’s address, it may indicate their race (in the US, zip code is highly correlated with race). Things can get even more complicated when using deep learning algorithms on texts. We no longer have control over the features the classifier learns. Let’s say that the classifier now gets as input a textual mortgage application. Now it may be able to detect race through writing style and word choice. And this time we can’t even remove certain suspicious features from the classifier.
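
To make the proxy problem concrete, here is a minimal Python sketch on synthetic data (nothing here is real mortgage data; the feature names and numbers are invented). Even though race is deliberately left out of the features, the classifier’s decisions end up correlated with it, because zip code acts as a proxy:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    n = 10000
    race = rng.binomial(1, 0.3, n)              # hidden sensitive attribute
    zip_code = race + rng.normal(0, 0.3, n)     # proxy feature, strongly correlated with race
    income = rng.normal(5, 1, n)                # legitimate feature

    # the gold labels mimic historical human decisions that were biased against race=1
    approved = (0.8 * income - 2.0 * race + rng.normal(0, 0.5, n) > 3.5).astype(int)

    X = np.column_stack([zip_code, income])     # race itself is excluded from the features
    clf = LogisticRegression().fit(X, approved)
    pred = clf.predict(X)

    print("approval rate, race=0:", pred[race == 0].mean())
    print("approval rate, race=1:", pred[race == 1].mean())  # noticeably lower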

Adversarial Removal

What can we do? We can try to actively remove anything that indicates race.

We have a model that gets as input a mortgage application (X), learns to represent it as f(X) (f encodes the application text, or extracts discrete features), and predicts a decision (Y) - accept or reject. We would like to remove information about some sensitive feature Z, in our case race, from the intermediate representation f(X).

This can be done by jointly training a second classifier, an “adversarial” classifier, which tries to predict race (Z) from the representation f(X). The adversary’s goal is to predict race successfully, while at the same time, the main classifier aims both to predict the decision (Y) with high accuracy, and to fail the adversary. To fail the adversary, the main classifier has to learn a representation function f which does not include any signal pertaining to Z.
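
Here is a minimal PyTorch sketch of this setup, assuming the common gradient-reversal trick (the sizes, names, and the choice of gradient reversal are illustrative, not a specific paper’s implementation):

    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            return x.view_as(x)
        @staticmethod
        def backward(ctx, grad_output):
            return -grad_output                  # flip the gradient on the way back

    encoder = nn.Sequential(nn.Linear(300, 100), nn.Tanh())   # f(X)
    main_clf = nn.Linear(100, 2)                               # predicts Y (accept/reject)
    adversary = nn.Linear(100, 2)                              # predicts Z (e.g. race)
    params = (list(encoder.parameters()) + list(main_clf.parameters()) +
              list(adversary.parameters()))
    opt = torch.optim.Adam(params, lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def train_step(x, y, z):
        h = encoder(x)
        loss_y = loss_fn(main_clf(h), y)                      # do well on the main task
        loss_z = loss_fn(adversary(GradReverse.apply(h)), z)  # the adversary learns to predict Z,
        loss = loss_y + loss_z                                # but its reversed gradients push the
        opt.zero_grad(); loss.backward(); opt.step()          # encoder to hide Z from f(X)
        return loss.item()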

The idea of removing features from the representation using adversarial training was presented in this paper. Later, this paper used the same technique to remove sensitive features. Finally, this paper experimented with textual input, and found that demographic information of the authors is indeed encoded in the latent representation. Although they managed to “fail” the adversary (as the architecture requires), they found that training a post-hoc classifier on the encoded texts still managed to detect race somewhat successfully. They concluded that adversarial training isn’t reliable for completely removing sensitive features from the representation.


Biased input representations

We’re living in an amazing time with positive societal changes. I’ll focus on one example that I personally relate to: gender equality. Every once in a while, my father emails me an article about some successful woman (CEO/professor/entrepreneur/etc.). He is genuinely happy to see more women in these jobs because he remembers a time when there were almost none. As for me, I wish for a time when this would be a non-issue - when my knowledge that women can do these jobs and the number of women actually doing these jobs finally make sense together.

In the Ethics in NLP workshop at EACL 2017, Joanna Bryson distinguished between three related terms: bias is knowing what "doctor" means, including that more doctors are male than female (if someone tells me they’re going to the doctor, I normally imagine they’re going to see a male doctor). Stereotype is thinking that doctors should be male (and consequently, that women are unfit to be doctors). Finally, prejudice is only using (going to / hiring) male doctors. The thing is, while we as humans--or at least some of us--can distinguish between the three, algorithms can’t tell the difference.

One of the points of failure in this lack of algorithmic ability to tell bias from stereotype is word embeddings. We discussed in a previous post this paper, which showed that word embeddings capture gender stereotypes. They showed, for instance, that when using embeddings to solve analogy problems (a toy problem which is often used to evaluate the quality of word embeddings), they may suggest that father is to doctor as mother is to nurse, and that man is to computer programmer as woman is to homemaker. This obviously happens because statistically there are more female nurses and homemakers and more male doctors and computer programmers than vice versa, which is reflected in the training data.
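
For the curious, this is roughly what the analogy setup looks like with gensim and pre-trained word2vec vectors (a sketch; the exact output depends on which embeddings you load, and the paper reports stereotyped answers such as “homemaker” for queries like this one):

    import gensim.downloader as api

    vectors = api.load("word2vec-google-news-300")    # pre-trained word2vec

    # "man is to computer programmer as woman is to ?"
    print(vectors.most_similar(positive=["computer_programmer", "woman"],
                               negative=["man"], topn=3))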


Google image search for “doctor” (left) and “nurse” (right): there are many more female than male nurse images. 

However, we treat word embeddings as representing meaning. By doing so, we engrave “male” into the meaning of “doctor” and “female” into the meaning of “nurse”. These embeddings are then commonly used in applications, which might inadvertently amplify these unwanted stereotypes.

The suggested solution in that paper was to “debias” the embeddings, i.e. to try to remove the bias from the embeddings. The problem with this approach is, first, that you can only remove biases that you are aware of. The second problem, which I find worse, is that it removes some of the characteristics of a concept. As opposed to the removal of sensitive features from classification models, in which the features we try to remove (e.g. race) have nothing to contribute to the classification, here we are removing an important part of a word’s meaning. We still want to know that most doctors are men, we just don’t want a meaning representation in which woman and doctor are incompatible concepts.
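
At its core, the “neutralize” step of that debiasing approach is a projection. Here is a rough numpy sketch, under the simplifying assumption that a single direction (e.g. he - she) captures the gender subspace:

    import numpy as np

    def neutralize(word_vec, gender_direction):
        g = gender_direction / np.linalg.norm(gender_direction)
        return word_vec - np.dot(word_vec, g) * g    # remove the component along the gender direction

    # gender_direction = vectors["he"] - vectors["she"]
    # vectors["doctor"] = neutralize(vectors["doctor"], gender_direction)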



The sad and trivial take-home message is that algorithms only do what we tell them to, so “racist algorithms” (e.g. the Microsoft chatbot) are only racist because they learned it from people. If we want machine learning to help build a better reality, we need to research not just techniques for improved learning, but also ways to teach algorithms what not to learn from us.

Tuesday, August 14, 2018

Deep Learning in NLP

This post is an old debt. Since I started this blog 3 years ago, I’ve refrained from writing about deep learning (DL), with the exception of occasionally discussing a method that uses it, without going into details. It’s a challenge to explain deep learning using simple concepts without remaining at a very high level. But perhaps worse than that, DL somewhat contradicts the description of my blog: “human-interpretable computer science”. We don’t really know how or why it works, and attempts to interpret it still only scratch the surface. On the other hand, it has been so prominent in NLP in the last few years (Figure 1) that it’s no longer reasonable to ignore it in a blog about NLP. So here’s my attempt to talk about it.
Figure 1: word cloud of the words in the titles of the accepted papers of ACL 2018, from https://acl2018.org/2018/07/31/conference-stats/. Note the prevalence of deep learning-related words such as “embeddings”, “network”, “rnns”.

This writing is based on many resources and most of the material is a summary of the main points I have taken from them. Credit is given in the form of linking to the original source. I would be happy to get corrections and comments!

Who is this post meant for?

As always, this post will probably be too basic for most NLP researchers, but you’re welcome to distribute it to people who are new to the field! And if you’re like me and you enjoy reading people’s views about things you already know, you’re welcome to read too!

What this post is not

This post is NOT an extensive list of all the most up-to-date recent innovations in deep learning for NLP; I don’t know many of them myself. I tried to remain at a relatively high level, so there are no formulas or formal definitions. If you want to read an extensive, detailed overview of how deep learning methods are used in NLP, I strongly recommend Yoav Goldberg’s “Neural Network Methods for Natural Language Processing” book. My post will also not teach you anything practical. If you’re looking to learn DL for NLP in 3 days and make a fortune, there are tons of useful guides on the web. Not this post!

Let’s talk about deep learning already!

OK! Deep learning is a subfield of machine learning. As with machine learning, the goal is to give computers the ability to “learn” to solve some task without being explicitly programmed to solve it, but rather by using data and applying statistical methods to it. In the common case of supervised learning, the idea is to make the computer estimate some function that takes some input and returns a prediction with respect to some question. I know this is very vague, so let’s take the task of named entity recognition (NER) as an example. The task is to determine for each word in a sequence of words (say, a sentence) whether it is part of a named entity, and if so, of which type. In the text “Prince Harry attended a wedding with a hole in his shoe” (I didn’t make it up), the words “Prince” and “Harry” should be classified as PERSON, and the other words as NONE. A machine learning algorithm expects as input a set of properties pertaining to each word (called a “feature vector”), and predicts a label for this word out of a set of possible labels (e.g. PERSON, ORGANIZATION, LOCATION, NONE).

In the context of NLP, machine learning has been used for a long time, so what is new? Before we go through the differences, let’s think of the characteristics of a traditional machine learning algorithm, specifically a supervised classifier. In general, a traditional supervised classification pipeline was as follows.



Figure 2: pipeline of a traditional supervised classifier  - extracting features, then approximating the function.

The input is raw (e.g. text), along with the gold labels. Phase (1) is the extraction of human-designed features from raw input, and representing it as vectors of features. Going back to the NER example, the person designing the learning algorithm would ask themselves: “what can indicate the type of entity of a given word?”. For example, a capitalized word is a good indication that a word is a part of a named entity in general. When the previous word is “The”, it can hint that the entity is an organization rather than a person. Someone had to come up with all these ideas. Here is an example of features that were previously used for NER:



Figure 3: features for NER, from this paper. The example is taken from Stanford CS224d course: Deep Learning for Natural Language Processing.
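
To make this concrete, here is a toy sketch of hand-designed NER features for a single word (the feature names are invented for illustration and are not taken from the paper in figure 3):

    def ner_features(tokens, i):
        word = tokens[i]
        prev_word = tokens[i - 1] if i > 0 else "<START>"
        return {
            "word.lower": word.lower(),
            "word.is_capitalized": word[0].isupper(),
            "word.is_all_caps": word.isupper(),
            "prev_word.lower": prev_word.lower(),
            "prev_word_is_The": prev_word == "The",
        }

    tokens = "Prince Harry attended a wedding".split()
    print(ner_features(tokens, 1))    # features for "Harry"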

Phase (2) is the learning/training phase, in which the computer tries to approximate a function that takes as input the feature vectors and predicts the correct labels. The function is now a function of the features, i.e., in the simple case of a linear classifier, it assigns a weight for each feature with respect to each label. Features which are highly indicative of a label (e.g. previous word = The for ORGANIZATION) would be assigned a high weight for that label, and those which are highly indicative of not belonging to that label would be assigned a very low weight. The final prediction is done by summing up all the weighted feature values, and choosing the label that got the highest score.
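
In numpy terms, that prediction step is just a weighted sum per label followed by an argmax (a tiny sketch with made-up numbers):

    import numpy as np

    labels = ["PERSON", "ORGANIZATION", "LOCATION", "NONE"]
    W = np.random.randn(len(labels), 5)      # one weight per (label, feature), learned during training
    x = np.array([1.0, 1.0, 0.0, 0.0, 0.0])  # feature vector of one word

    scores = W @ x                           # one score per label
    print(labels[int(np.argmax(scores))])    # predicted label = highest-scoring label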

Note that I’m not discussing how these weights are learned from the training examples and the gold labels. This is not super important for the current discussion, and you can read about it elsewhere (e.g. here).

Now let’s talk about the differences between traditional machine learning and deep learning.

What has improved?

First of all, it works, empirically. The state-of-the-art performance in nearly every NLP task today is achieved by a neural model. Literally every task listed on this website, “Tracking Progress in Natural Language Processing”, has a neural model as its best performing model. Why is it working so much better than previous approaches?

Going Deep

“Deep learning” is a buzzword referring to deep neural networks - powerful learning algorithms inspired by the brain’s computation mechanism. A simple single-layered feed-forward neural network is basically the same as the linear classifier we discussed. A deep neural network is a network that contains one or more hidden layers, which are also learned. This is the visual difference between a linear classifier (left, neural network with no hidden layers) and a neural network with a single hidden layer (right):

Figure 4: shallow neural network (left) vs. deep (1 hidden layer) neural network (right).

In general, the deeper the network is, the more complex functions it can estimate, and the better it can (theoretically) approximate them.

In mathematical notation, each layer is a multiplication by another learned matrix, but more importantly, its output also goes through a non-linear activation function (e.g. the hyperbolic tangent, or the simple rectifier that sets negative values to zero). Some functions can’t be approximated accurately enough using linear models. I’m deliberately not going into details, but you can read about linear separability here. With the help of the multiple layers and the non-linear activations, neural networks can better approximate the functions needed for many tasks, resulting in improved performance.
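
In code the difference is small. Here is a minimal PyTorch sketch (with arbitrary sizes) contrasting a linear classifier with a network that adds one hidden layer and a non-linear activation:

    import torch.nn as nn

    linear_classifier = nn.Linear(100, 4)     # feature vector -> label scores

    one_hidden_layer = nn.Sequential(
        nn.Linear(100, 50),   # learned matrix for the hidden layer
        nn.Tanh(),            # non-linear activation (nn.ReLU() is another common choice)
        nn.Linear(50, 4),     # hidden layer -> label scores
    )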

Representation Learning

Another key aspect that changed is how the input is represented. Traditional machine learning methods worked because someone designed a well-thought feature vector to represent the inputs, as we can see in figure 3. Deep learning obviates the need to come up with a meaningful representation, and learns a representation from raw input (e.g. words). The new pipeline looks like this:

Figure 5: pipeline of a deep supervised classifier - learning both the input representation and the other model parameters.
Now, rather than feeding the learning algorithm hand-engineered feature vectors, we only give it the raw text. In the case of NER, we can now just feed as features a window of words around each target word. For example, in the sentence “John worked at the Post Office in the city until last year”, the feature vector of the target word “Office” with a window of size 3 would be [Post, Office, in]. The learning algorithm, in addition to learning the network parameters (the function from representation to output) as it did before, also learns the word representations suitable for the task at hand. In other words, one of the additional parameters that the network learns is the word embeddings. We can think of this as a lookup table whose index is a word (string) and whose output is a vector.
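
Here is a sketch of the lookup-table view in PyTorch (the vocabulary and dimensions are made up): the window of words is mapped to ids, the ids to vectors, and the concatenated vectors form the input to the rest of the network.

    import torch
    import torch.nn as nn

    vocab = {"<UNK>": 0, "john": 1, "worked": 2, "at": 3, "the": 4,
             "post": 5, "office": 6, "in": 7}
    embeddings = nn.Embedding(len(vocab), 50)    # the learned lookup table

    window = ["post", "office", "in"]            # window of size 3 around "Office"
    ids = torch.tensor([vocab.get(w, 0) for w in window])
    features = embeddings(ids).view(-1)          # concatenate into one feature vector
    print(features.shape)                        # torch.Size([150])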

It is common to initialize this lookup table with pre-trained word embeddings. We discussed word embeddings in this blog post. They are trained using a large text collection, based on a linguistic hypothesis which states that words with similar meanings appear in the same contexts (next to the same “neighbour” words). Pre-trained embeddings are useful because they are often trained on a lot more data than is available for the end task itself, and the more data, the higher the quality of the vectors.

On the other hand, pre-trained embeddings provide a general notion of “similarity” between words, which is not necessarily the same similarity our task needs. Think for example about sentiment analysis. Let’s say we are trying to predict the sentiment of restaurant reviews. Generic pre-trained embeddings will tell us that good and bad are highly similar, because they appear near the same words. But in the context of sentiment analysis we’d like their vectors to be further apart. The developer can choose between initializing the lookup table with the pre-trained embeddings or randomly, and also between updating the embeddings as additional network parameters, to fit the task better, or keeping them fixed. We’ll touch upon that when we discuss overfitting. 
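
In PyTorch, that choice boils down to a couple of lines (a sketch; the pre-trained matrix below is a random stand-in for actual word2vec or GloVe vectors):

    import torch
    import torch.nn as nn

    pretrained = torch.randn(10000, 300)    # stand-in for loaded pre-trained vectors

    # initialize from pre-trained vectors and keep them fixed
    frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)

    # initialize from pre-trained vectors and fine-tune them during training
    finetuned = nn.Embedding.from_pretrained(pretrained, freeze=False)

    # random initialization, updated during training
    random_init = nn.Embedding(10000, 300)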

Back to the NER example, this is what a network with a window size of 3 would look like:


Figure 6: a neural network for NER, which uses a window of 3 words to classify a word.

In a traditional model, the feature “next word” is a discrete variable that can take as its value any word in the vocabulary. Let’s say that during training, the model encountered a word like “ltd” many times as the next word of an organization, and figured out that it is a good indication of the ORGANIZATION class. If at test time the model needs to classify a word followed by “Inc”, it may have no information about this word, and can’t generalize from the knowledge about the similar word “ltd”. When the feature vector is composed of word embeddings, the embeddings of “ltd” and “Inc” are similar, so the inputs are now similar, and the model can use knowledge about similar words to output the correct prediction.

Recurrent Neural Networks

The network we discussed so far is called a feed-forward neural network. In the NER example we used a fixed-size window of words around each target word. But what if we want to use the entire sentence? For example, in the sentence “John worked at the Post Office in the city until last year, and hated this organization” it is beneficial for the model to be aware of the last word “organization” while predicting the labels of “Post” and “Office”. One problem is that we don’t know in advance how many words each input sentence will contain. Another problem is that the representation of each word in the window model is independent of the other words - and we would like the representation to be more contextualized. For instance, the representation of “post” in the context of “post office” should be different from its representations in “blog post” or “post doc”.

Recurrent neural networks (RNNs) solve both problems. An RNN is a network that takes as input a sequence (e.g. a sentence as a sequence of words, or a sequence of characters, etc.) and outputs vectors representing each subsequence (i.e. the first item, the first two items, ..., the entire sequence). At each time step, the RNN considers both the previous memory (of the sequence up to the previous input) and the current input. The last output vector can be considered a representation of the entire sequence, like a sentence embedding. Intermediate vectors can represent a word in its context.

The output vectors can then be used for many things, among which: representation - a fixed-size vector representation for arbitrarily long texts; as feature vectors for classification (e.g. representing a word in context, and predicting its NER tag); or for generating new sequences (e.g. in translation, a sequence-to-sequence or seq2seq model encodes the source sentence and then decodes = generates the translation). In the case of the NER example, we can now use the output vector corresponding to each word as the word’s feature vector, and predict the label based on the entire preceding context.

Figure 7: a NER model that uses an RNN to represent each word in its context.

Technical notes: an LSTM (long short-term memory) is a specific type of RNN that works particularly well and is commonly used. The differences between various RNN architectures are in the implementation of the internal memory. A bidirectional RNN/LSTM or a biLSTM processes the sequences from both sides - right to left and left to right, such that each output vector contains information pertaining to both the previous and the subsequent items in the sequence. For a much more complete overview of RNNs, I refer you to Andrej Karpathy's blog post "The Unreasonable Effectiveness of Recurrent Neural Networks".
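
Putting figure 7 and these notes together, a minimal PyTorch sketch of such a tagger could look like this (the sizes and names are illustrative, not taken from any specific paper):

    import torch
    import torch.nn as nn

    class BiLSTMTagger(nn.Module):
        def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden_dim, num_tags)

        def forward(self, word_ids):                  # (batch, sentence_length)
            states, _ = self.lstm(self.embed(word_ids))
            return self.out(states)                   # one tag score vector per word

    tagger = BiLSTMTagger(vocab_size=10000, num_tags=4)
    scores = tagger(torch.randint(0, 10000, (1, 12)))    # one 12-word sentence
    print(scores.shape)                                   # torch.Size([1, 12, 4])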


So these are the main differences. There are many other new techniques on top of them, such as attention, Generative Adversarial Networks (GAN), deep reinforcement learning, and multi-task learning, but we won’t discuss them.

Interestingly, neural networks are not a new idea. They have been around since the 1950s, but have become increasingly popular in recent years thanks to the advances in computing power and the amount of available text on the web.

What is not yet working perfectly?

Although the popular media likes to paint an optimistic picture of “AI is solved” (or rather, a very pessimistic, false picture of “AI is taking over humanity”), in practice there are still many limitations to current deep methods. Here are several of them, mostly from the point of view of someone who works on semantics (feel free to add more limitations in the comments!).

The Need for Unreasonable Amount of Data

It’s difficult to convey the challenges in our work to people outside the NLP field (and related fields), when there are already many products out there that perform so well. Indeed, we have come a long way to be able to generally claim that some tasks are “solved”. This is especially true of low-level text analysis tasks such as part of speech tagging. But the performance on more complex semantic tasks like machine translation is surprisingly good as well. So what is the secret sauce of success, and why am I still unsatisfied?

First, let’s look at a few examples of tasks “solved” by DL, not necessarily in NLP:
  • Automatic Speech Recognition (ASR) - also known as speech-to-text. Deep learning-based methods reported human-level performance last year, but this interesting blog post tells us differently. According to this post, while the recent improvements are impressive, the claims about human-level performance are too broad. ASR works very well on American-accented English with a high signal-to-noise ratio, because it has been trained on conversations by mostly American native English speakers with little background noise, which are available at large scale. It doesn’t work well, and definitely not at human-level performance, for other languages, accents, non-native speakers, etc.


  • Facial Recognition - another task claimed to be solved, but this article from the New York Times (based on a study from MIT) says that it’s only solved for white men. An algorithm trained to identify gender from images was 99% accurate on images of white men, but far less accurate -- only 65% -- for dark-skinned women. Why is that so? A widely used dataset for facial recognition was estimated to be more than 75% male and more than 80% white.

  • Machine Translation - not claimed to be solved yet, but the release of Google Translate’s neural models in 2016 reported large performance improvements. The paper reported a “60% reduction in translation errors on several popular language pairs”. The language pairs were English→Spanish, English→French, English→Chinese, Spanish→English, French→English, and Chinese→English. All these languages are considered “high-resource” languages, or in simple words, languages for which there is a massive amount of training data. As we discussed in the blog post about translation models, the training data for machine translation systems is a large collection of the same texts written in both the source language (e.g. English) and the target language (e.g. French). Think of book translations as an example of a source of training data.

    In other news, popular media was recently worried that Google Translate spits out religious nonsense, completely unrelated to the source text. Here is an example. The top example has religious content; the bottom example is not religious, just unrelated to the source text.

    Figure 8: “who has been using these technologies for a long time”? Hopefully not the Igbo speakers, translating to English. Google Translate makes up things when translating Gibberish from the low-resource language “Igbo”.
    This excellent blog post offers a simple explanation for this phenomenon: “low-resource” language pairs (e.g. Igbo and English), for which there is not a lot of available data, have worse neural translation models. Not surprising so far. The reason the translator generates nonsense is that when it is given unknown inputs (the input nonsense in figure 8), the system tries to provide a fluent translation and ends up “hallucinating” sentences. Why religious texts? Because religious texts like the Bible and the Koran exist in many languages, and they are probably the major part of the available training data for translations between low-resource languages.


Can you spot a pattern among the successful tasks? Having a tremendous amount of training data increases the chances of training high-quality neural models for the task. Of course, it is a necessary-but-not-sufficient condition. In her RepL4NLP keynote, Yejin Choi talked about “solved” tasks and said that they all have in common a lot of training data and enough layers. But with respect to NLP tasks, there is another factor. Performing well on machine translation is possible without having a model with deep text understanding abilities, but rather by relying on the strong alignment between the input and the output. Other tasks, which require deeper understanding, such as recognizing fake news, summarizing a document, or making a conversation, have not been solved yet. (And may not be solvable just by adding more training data or more layers?).

The models of these “solved” tasks are only applicable for inputs which are similar to the training data. ASR for Scottish accent, facial recognition for black women, and translation from Igbo to English are not solved yet. Translation is not the only NLP example; whenever someone tells you some NLP task is solved, it probably only applies to very specific domains in English.

The Risk of Overfitting

Immediately following the previous point, when the training data is limited, we run the risk of overfitting. By definition, overfitting happens when a model performs extremely well on the training set while performing poorly on the test set. It happens because the model memorizes specific patterns in the training data instead of looking at the big picture. If these patterns are not indicative of the actual task, and are not present in the test data, the model will perform badly on the test data. For example, let’s say you’re working on a very simplistic variant of text classification, in which you need to distinguish between news and sports articles. Your training data contains articles about sports and tweets from news agencies. Your model may learn that an article with fewer than 280 characters is related to news. The performance would be great on the training set, but what your model actually learned is not to distinguish between news and sports articles but rather between tweets and other texts. This definitely won’t be helpful in production, where your news examples can be full-length articles.

Overfitting is not new to DL, but two main aspects have changed from traditional machine learning: (1) it’s more difficult to “debug” DL models and detect overfitting, because we no longer have nice manually-designed features, but automatically learned representations; and (2) the models have many, many more parameters than traditional machine learning models used to have - the more layers, the more parameters. This means that a model can now learn more complex functions, but it’s not guaranteed to learn the best function for the task - only the best function for the given data. Unfortunately, these are not always the same, and this is something to keep in mind!

With respect to the first aspect, updating the pre-trained word embeddings during training can lead to overfitting. Given that the training set is limited and doesn’t cover the entire vocabulary, we’re only moving some words in the embedding space while keeping others in place. The words we move (e.g. kangaroo) now have better vectors for the specific task, but they’re further away from their (distributional) neighbours which are not found in the training set. This hurts the model’s generalization abilities: when it encounters an unobserved word (e.g. wallaby), its vector is no longer located next to similar words (kangaroo) which have been moved, so the model doesn’t know much about it. Updating the embeddings during training is a good idea only if your task has very different needs from its embeddings than just plain similarity (and then you may want to start with a random initialization rather than pre-trained embeddings), and only if you have enough training data covering a broad vocabulary.

For the next, related point, I’m going to broaden the definition of “overfitting” to the phenomenon of a model that memorizes the peculiarities in the data rather than what’s actually important for the task, regardless of its performance on the test set.

The Artificial Data Problem

Machine learning enthusiasts like to say that given enough training data, DL can learn to estimate any function. I personally take the more pessimistic stand that some language tasks are too complex and nuanced to solve using DL with any reasonable amount of training data. Nevertheless, like everyone else in the field, I’m constantly busy thinking of ways to get more data quickly and without spending too much money. 

One representative example (of many) of a task in which the release of a large amount of training data raised researchers’ interest is recognizing textual entailment (RTE, sometimes called “natural language inference” or NLI). This is an artificial task that was invented because many language understanding abilities we develop can be reduced to it. In this task, two sentences -- a premise and a hypothesis -- are given. The task is for a model to automatically determine what a human reading the premise could say with respect to the hypothesis. Let’s look at the definitions along with a simple example from the SNLI dataset. For the premise “A street performer is doing his act for kids” and the hypothesis:

  • Entailment: the hypothesis must also be true.
    Hypothesis = “A person performing for children on the street”.
    The premise tells us there is a street performer, so we can infer that he is a person and he is on the street. The premise also tells us that he is doing his act - therefore performing, and for kids = for children. The hypothesis only repeats the information conveyed in the premise and information which can be inferred from it.

  • Neutral: the hypothesis may or may not be true.
    Hypothesis = “A juggler entertaining a group of children on the street”.
    In addition to repeating information from the premise, the hypothesis now tells us that it’s a juggler. The premise is more general and can refer to other types of street performers (e.g. a guitar player), so this may or may not be a juggler.

  • Contradiction: the hypothesis must be false.
    Hypothesis = “A magician performing for an audience in a nightclub”.
    The hypothesis describes a completely different event happening at a nightclub.

This task is very difficult for humans and machines, and requires background knowledge, common sense, knowledge about the relationships between words (whether they mean the same thing, whether one is more specific than the other, or whether they contradict each other), recognizing that different mentions refer to the same entity (coreference), dealing with syntactic variations, etc. For an extensive list, I recommend reading this really good summary.

In the early days, methods’ performance was mediocre. The available data for this task contained a few hundred annotated examples. They required many different types of knowledge to answer correctly, and were diverse in their topics. Unfortunately, there were too few of them to throw a neural network at... A few years ago, the huge SNLI dataset was released, containing half a million examples. What enabled scaling up to such a large dataset was (1) taking the premises from an already available collection of image captions; and (2) asking people to generate entailed, neutral, and contradicting hypotheses. This made the data collection simple enough not to require experts, relying instead on crowdsourcing workers.

Following the release of this dataset, the interest of the NLP community in RTE surged. Specifically, many neural methods have been developed. What they basically do is encode each sentence (premise, hypothesis) into a vector, normally by running each sentence through an RNN. The premise and hypothesis vectors are then combined using arithmetic operations and sent to a classifier that outputs the label (entailment, neutral, or contradiction). This approach (left side of figure 9) can get you as far as ~87% accuracy on the test set, and the difference between the specific methods is mostly in the technicalities. More sophisticated methods encode the hypothesis conditioned on the premise, using the neural attention mechanism (right side of figure 9). In simple words, the hypothesis encoder is allowed to “look at” the premise words, and it roughly aligns each hypothesis word to a related premise word. For example, given the premise “A street performer is doing his act for kids” and the hypothesis “A juggler entertaining a group of children on the street”, the alignments would be juggler-performer, act-entertaining, street-street, etc. (In practice, it’s not a one-to-one alignment, but a weighted one-to-many attention.) This approach gets you to over 90% accuracy today, which is beyond human performance on this dataset. Yes, you read correctly. A statistical DL method is better than humans on this dataset.


Figure 9: two common architectures of neural RTE systems. Left: sentence encoding models that encode each sentence separately. Right: attention model that encodes the hypothesis conditioned on the premise. Both models extract features from the premise and hypothesis vectors and use them for classification.
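
For concreteness, here is a rough PyTorch sketch of the sentence-encoding architecture on the left of figure 9 (the feature combination below follows a common recipe; the details vary between papers):

    import torch
    import torch.nn as nn

    class SentenceEncodingRTE(nn.Module):
        def __init__(self, vocab_size, emb_dim=300, hidden_dim=300):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
            self.classifier = nn.Sequential(
                nn.Linear(4 * hidden_dim, 300), nn.ReLU(),
                nn.Linear(300, 3))                        # entailment / neutral / contradiction

        def encode(self, word_ids):
            _, (h, _) = self.encoder(self.embed(word_ids))
            return h[-1]                                  # last hidden state as the sentence vector

        def forward(self, premise_ids, hypothesis_ids):
            p, h = self.encode(premise_ids), self.encode(hypothesis_ids)
            features = torch.cat([p, h, p - h, p * h], dim=-1)   # a common way to combine the vectors
            return self.classifier(features)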

Anyone who’s ever worked on textual entailment, and who is even a tiny bit skeptical of DL as the solution to everything, had to be suspicious. Indeed, many of us were. Fast forward a few months, and a flood of papers confirmed that the task is indeed not solved. Instead of solving the general, very difficult textual entailment task, our models are memorizing very specific peculiarities of the training data (which are also common in the test data). 1, 2 and 3 concurrently showed that a model which has access only to the hypothesis can solve the task with performance much better than a random guess (which is what we’d expect from a model that has no access to the premise). They all pointed out some peculiarities in the data that enable this. For example, hypotheses of contradicting examples tend to contain more negative words. This happens because the premises are image captions (“a dog is running in the park”). Image captions rarely describe something that doesn’t happen. The easiest thing for a person asked to write a contradicting sentence is to add negation: “a dog is not running in the park”.

Funnily, 1 also showed that the appearance of the word “cat” in the hypothesis can indicate contradiction, as there were many dog images, and what contradicts a dog better than a cat? In reality, cats are lovely creatures, and a sentence with a cat doesn’t immediately contradict any other sentence.

We also showed that state-of-the-art models fail on examples that require knowledge about relations between words, even when the example is super simple. For instance, the models would think that “a man starts his day in India” and “a man starts his day in Malaysia” are entailing, just because India and Malaysia are similar (although mutually exclusive) words. We showed that the models only learn to distinguish between such words if the specific words appear enough times in the training data. For example, many contradiction examples in the training data have a man in the premise doing something and a woman in the hypothesis doing the same thing. Having observed enough of these examples, the models learn that man and woman are mutually exclusive. But they fail on the India/Malaysia example, because they didn’t observe this exact pair of words in enough training examples. Since it’s unreasonable to rely on the training set to provide enough examples of each possible pair of words, a different solution, probably involving incorporating external knowledge from dictionaries and taxonomies, is needed.

The main lesson from this story should not be that DL methods are unsophisticated parrots that can only repeat exactly what they saw in the training data. Instead, there are several things to consider:

  1. Good performance on the test set doesn’t necessarily indicate solving the task. Whenever our training and test data are not “natural” but rather generated in a somewhat artificial way, we run the risk that both will contain the same peculiarities which are not properties of the actual task. A model learning these peculiarities is wasting energy on remembering things that are unhelpful in real-world usage, and creating an illusion of solving an unsolved task. I’m not saying we shouldn’t ever process our data in such ways - just that we should be aware of this. If your model is getting really good performance on a really difficult task, there’s reason to be suspicious.

  2. DL methods can only be as good as the input they get. When we make inferences, we employ a lot of common sense and world knowledge. This knowledge is simply not available in the training data, and we can never expect the training data to be extensive enough to contain it. Domain knowledge is not redundant, and in the near future someone will come up with smart ways to incorporate it into a neural model, and achieve good performance on newer, less simple datasets.


The Shaky Ground of Distributional Word Embeddings

At the core of many deep learning methods lie pre-trained word embeddings. They are a great tool for capturing semantic similarity between words. They mostly capture topical similarity, or relatedness (e.g. elevator-floor), but they can also capture functional similarity (e.g. elevator-escalator). Relying on pre-trained word embeddings is a great way to make your model generalize to new inputs, and I recommend it for any task that doesn’t have loads of available data.

With that said, it’s not perfect. In many tasks we need to know the exact relationship between two words. It’s not enough to know that elevator and escalator (or India and Malaysia) are similar - we need to know that they are mutually exclusive. And word embeddings don’t tell us that. In fact, they conflate lots of semantic relations together.

I think I have a good way to simulate that, and I’ve been using it in my talks for the last few months. The idea is to take some text, say lyrics of a song, a script of a TV series, a famous speech, anything you like. Then, go over the text and replace each noun with its most similar word in word2vec. (It doesn’t have to be word2vec and doesn’t have to be only for nouns - this is what I did. The code and some other examples are available here.) Here is a part of my favorite example: Martin Luther King’s “I Have a Dream” speech. This is what you get when you replace words by their word2vec neighbours:

Figure 10: a part of Martin Luther King’s “I Have a Dream” speech after replacing nouns with most similar word2vec words.


Apart from being funny, this is a good illustration of the phenomenon: words have been replaced with other words that stand in different relationships to them. For example, country instead of nation is not quite the same, but synonymous enough. Kids instead of children is perfectly fine. But a daydream is a type of dream, week and day are mutually exclusive (they share the common category of time unit), Classical.com is completely unrelated to content (yes, statistical methods have errors…), and protagonist is synonymous with the original word character, but in the sense of a character in a book - not of an individual’s qualities.
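
If you want to try the replacement game yourself, here is a hedged sketch with gensim and spaCy (the code linked above likely differs in its details):

    import gensim.downloader as api
    import spacy

    nlp = spacy.load("en_core_web_sm")
    vectors = api.load("word2vec-google-news-300")

    def replace_nouns(text):
        out = []
        for token in nlp(text):
            if token.pos_ == "NOUN" and token.text in vectors:
                out.append(vectors.most_similar(token.text, topn=1)[0][0])
            else:
                out.append(token.text)
        return " ".join(out)

    print(replace_nouns("I have a dream that my four little children will one day live "
                        "in a nation where they will not be judged by the color of their "
                        "skin but by the content of their character."))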

In the last 3 years, multiple methods have also been developed for learning a different type of word embeddings, which capture -- in addition to or instead of this fuzzy similarity -- semantic relations from taxonomies like WordNet. For example, the Retrofitting method starts with distributional (regular) word embeddings and then moves vectors in the space such that two words that appear together in WordNet as synonyms end up close to each other in the vector space. The Attract-Repel method does the same but also makes sure that antonyms are further apart (e.g. think again of the good/bad vectors in sentiment analysis). Other methods include Order Embeddings, Poincaré Embeddings, LEAR, and many more. While these methods are elegant, and have been shown to capture the semantic relations they get as input, they have yet to improve the performance of NLP applications over versions of the same systems that use regular embeddings.
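
As a flavour of how simple some of these methods are, here is a compact numpy sketch of the retrofitting update rule, assuming uniform weights: each vector is pulled towards the average of its lexicon neighbours while staying close to its original embedding.

    import numpy as np

    def retrofit(vectors, lexicon, alpha=1.0, beta=1.0, iterations=10):
        new_vecs = {w: v.copy() for w, v in vectors.items()}
        for _ in range(iterations):
            for word, neighbours in lexicon.items():
                neighbours = [n for n in neighbours if n in new_vecs]
                if word not in new_vecs or not neighbours:
                    continue
                neighbour_sum = sum(new_vecs[n] for n in neighbours)
                new_vecs[word] = (alpha * vectors[word] + beta * neighbour_sum) / \
                                 (alpha + beta * len(neighbours))
        return new_vecs

    # toy usage: pull the WordNet synonyms "good" and "great" closer together
    vecs = {"good": np.random.randn(50), "great": np.random.randn(50)}
    retrofitted = retrofit(vecs, {"good": ["great"], "great": ["good"]})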


The Unsatisfactory Representations Beyond the Word Level

Recurrent neural networks allow us to process texts in arbitrary length: from phrases consisting of several words to sentences, paragraphs, and even full documents. Does this also mean that we can have phrase, sentence, and document embeddings, that will capture the meaning of these texts?

Now is a good time to remember the famous quote from Ray Mooney:

“You can’t cram the meaning of a whole %&!$# sentence into a single $&!#* vector!”

To be honest, I completely agree with this opinion when talking about general-purpose sentence embeddings. While we have good methods to represent sentences for specific tasks and objectives, it’s not clear to me what a “generic” sentence embedding needs to capture and how to learn such a thing.

Many researchers think differently, and sentence embeddings have been pretty common in the last few years. To name a few, the Skip-Thought vectors build upon the assumption that a meaningful sentence representation can help predict the next sentence. Given that even I, as a human, can rarely predict the next sentence in a text I’m reading, I think this is a very naive assumption (could you predict that I’d say that?...). But it probably predicts the next sentence more accurately than it would a completely unrelated sentence, creating a kind of topical representation rather than a meaning representation. In the example in figure 11, the model considered lexically-similar sentences (i.e. sentences that share the same words or contain similar words) as more similar to the target sentence than a sentence with a similar meaning but a very different phrasing. I’m not surprised.

Figure 11: The most similar sentences to the target sentence “A man starts his day in India” out of the 3 other sentences that were encoded, using the sent2vec demo.

Another approach is the Autoencoder, which creates a vector representing the input text, and is trained to reproduce the input from that vector. The core assumption is that to be able to predict the original sentence, the representation must capture important aspects of the sentence’s meaning. You can think about it as a type of compression.

Finally, sentence embeddings come as a byproduct of tasks that represent sentences as vectors, like textual entailment models (for classification) or machine translation models (for text generation). Yes, they are trained to capture a specific aspect relating to their end task (entailment / translation), but assuming that these tasks require a deep understanding of the meaning of a sentence, the embeddings can be used as general representations.

So what aspects of the sentence do these representations capture? This paper did a pretty extensive analysis of various types of sentence embeddings. They defined some interesting properties which may be conveyed in a sentence, starting from shallow things like the sentence length (number of words) and whether a specific word is in the sentence or not; moving on to syntactic properties such as the order of words in the sentence; and finally semantic properties like whether the sentence is topically coherent (in other words, is it possible to distinguish between a “real” sentence and a sentence in which one word was replaced with a completely random word). To find out which of these properties are encoded by which sentence embedding approach, they used the sentence embeddings as inputs to very simple classifiers, each trained to recognize a single property (e.g. using the vector of “this is a vector” in the sentence length classifier to predict 4). The performance of the various sentence embeddings on all these tasks was somewhere between the performance of a very simple baseline (e.g. using random vectors) and human performance. Not surprisingly, there is more room for improvement on the complex and more semantic tasks.
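
Here is a sketch of that probing setup: take fixed sentence embeddings and train a very simple classifier to predict a single property, here whether the sentence is “long”. The encode() function below is a random placeholder for whichever sentence encoder you want to probe; with a real encoder, high probe accuracy suggests the property is encoded in the vectors.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.RandomState(0)
    words = "the a cat dog runs sleeps quickly slowly very red blue".split()
    sentences = [" ".join(rng.choice(words, size=rng.randint(3, 12))) for _ in range(1000)]

    def encode(sentence):
        # placeholder: replace with the sentence embedding method you want to probe
        e = np.random.RandomState(abs(hash(sentence)) % (2 ** 32))
        return e.randn(300)

    X = np.array([encode(s) for s in sentences])
    y = np.array([len(s.split()) > 6 for s in sentences])    # the probed property: is the sentence "long"?

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("probe accuracy:", probe.score(X_te, y_te))    # near chance for the random placeholder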

I’d like at this point to repeat Ray Mooney’s quote and say that you still can’t cram the meaning of a whole sentence into a single vector. It’s impressive that we have gone so far to have representations that capture all these properties, but is this all there is to a sentence’s meaning? Here are some things that I don’t know whether the embeddings capture or not, but mostly assume they don’t:

  1. Do they capture things which are not said explicitly, but can be implied? (I didn’t eat anything since the morning implies I’m hungry).

  2. Do they capture the meaning of the sentence in the context it is said? (No, thanks can mean I don’t want to eat if you’ve just offered me food).

  3. Do they always assume that the meaning of a sentence is literal and compositional (composed of the meanings of the individual words), or do they have good representations for idioms? (I will clean the house when pigs fly means I will never clean the house).

  4. Do they capture pragmatics? (Can you tell me about yourself actually means tell me about yourself. You don’t want your sentence vectors to be like the interviewee in this joke).

  5. Do they capture things which are not said explicitly because the speakers have a common background? (If I tell a local friend that the prime minister must go home, we both know the specific prime minister I’m talking about).

I can go on and on, but I think these are enough examples to show that we have yet to come up with a meaning representation that mimics whatever representation we have in our heads, which is derived by making inferences and relying on common sense and world knowledge. If you need more examples, I recommend taking a look at the slides from Emily Bender’s “100 Things You Always Wanted to Know about Semantics & Pragmatics But Were Afraid to Ask” tutorial.

The Nonexistent Robustness

Anyone who’s ever tried to reimplement neural models and reproduce the published results knows we have a problem. More often than not, you won’t get the exact same published results. Sometimes not even close. This often happens because of differences in the values of “hyper-parameters”: technical settings that have to do with training, such as the number of epochs, the regularization values and methods, and others. While they are seemingly not super important, in practice hyper-parameter values can make large performance differences.

The problem starts when training new models. You come up with an elegant model, you implement and train it, but the results are not as expected. This doesn't mean that the architecture or the data is not good; it often means that you need to tweak the hyper-parameter values and re-train the model, yielding completely different and hopefully better performance. Unfortunately, it’s almost impossible to tell in advance which values would yield better performance. There is no best-practice, just a lot of trial and error. We train many different models with various settings, then test their performance on the validation set (a set of examples separate from the training and the test sets) and choose the best performing model (which is then tested on the test set). 
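
In code, that trial-and-error loop is nothing more than this (a sketch; train_and_evaluate() is a placeholder for your own training code, and the search space values are arbitrary):

    import itertools
    import random

    search_space = {
        "learning_rate": [1e-2, 1e-3, 1e-4],
        "dropout": [0.0, 0.2, 0.5],
        "hidden_dim": [100, 200, 300],
    }

    def train_and_evaluate(config):
        # placeholder: train a model with this config and return its validation accuracy
        return random.random()

    best_config, best_score = None, -1.0
    for values in itertools.product(*search_space.values()):
        config = dict(zip(search_space.keys(), values))
        score = train_and_evaluate(config)
        if score > best_score:
            best_config, best_score = config, score

    print("best validation score:", best_score, "with", best_config)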

Hyper-parameter tuning is an exhausting and often frustrating process. I’m sure that many good models get lost on the way because the researcher lost patience or ran out of computational resources (yes, neural models also take longer to train… and strong machines cost a lot of money).

It’s pretty discouraging to think that achieving good performance on some test set is sometimes due to arbitrary settings rather than thanks to sound scientific ideas and model design.

The Lack of Interpretability

Last but not least, the interpretability issue is probably the worst caveat of DL. 

Machines don’t give an explanation for their predictions. While this was also true for traditional ML, it was easier to analyze the predictions and come up with an explanation in retrospect. Having algorithms which also learn the representations, and networks with a lot more parameters, makes this much more difficult. To put it in simple words, we generally have no idea what’s happening inside the networks we train; they are “black box” models.

Why is this a problem? First, being able to interpret our models would help us debug them and easily understand why they are not working and what exactly they are learning. It would make the development cycle much shorter and our models more robust and trustworthy. While it’s nearly impossible today, there are people working to change that.

Second, and more importantly, in some tasks transparency and accountability are crucial - specifically, tasks concerned with safety, or which can discriminate against particular groups. Sometimes there is even a legal requirement to provide an explanation. Think of the following examples (not necessarily NLP):

  • Self-driving cars 
  • Predicting probability of a prisoner to commit another crime (who needs to be released?) 
  • Predicting probability of death for patients (who is a better healthcare “financial investment”) 
  • Deciding who should be approved for a loan
  • more... 

In the next post I will elaborate on some of these examples and how ML sometimes discriminates against particular groups. This post is long enough already!

Friday, April 13, 2018

Targeted Content

You must have heard of, or suspected first-hand, the famous conspiracy theory that the Facebook app listens to your phone's microphone in order to better target ads that match your current interests. I've had the funniest experience with that myself: a friend in the cosmetics business told me about this conspiracy, and in the same conversation she mentioned that an advertising agent had called her to offer to advertise her business. Later that day, I got a Facebook ad: "advertise your cosmetics business". What the heck? What are the odds of that? And I don't even have the Facebook app installed, just Facebook Messenger.

Although Mark Zuckerberg denied this conspiracy theory in his senate hearing, I doubt that people will stop believing it as long as the ads algorithm keeps surprising them. Choosing to believe Zuckerberg that they don't listen to our microphones (yet, I suspect), I'm pretty confident that they, as well as other companies, are using our written content (emails, social media posts, search queries).

Most people are alarmed by these suspicions from the privacy aspect: what data does this company hold about me? how do they use it? who do they share it with? This post will not be about that. Instead, this post will be about the technical aspect, which is what interests me most as an NLP researcher. If we assume that our apps constantly listen to us and that our written content is monitored and analyzed, what does it say about the text understanding capabilities of these companies?

Oh, and expect no answers. This post is all about questions and conspiracy theories!

What is personalized content?
Personalized content doesn't have to come in the form of an ad. It can take the form of recommendations (products to buy based on previous purchases, songs to listen to, as in this post). It can be relevant professional content from LinkedIn, discounts on services you've previously consumed, cheap flights to your planned destinations, and so on. Some of this will be a direct result of the preferences and settings you defined in the website. For example, I've registered in several websites to get updates on concerts of my favorite bands, and I get healthy vegetarian recipes from Yummly. Some of this content will be based on inferences that the system makes, assuming that certain content is relevant for you. Here is one example:

Lately I've been getting @quora digest emails on topics related to conversations I had with people (in Hebrew!). 1/5
— Vered Shwartz (@VeredShwartz) October 17, 2017
In that case I was amazed by the accuracy of the Quora digest emails I was getting. Specifically, I had a conversation with my husband about the confidence it takes to admit you don't know something, and he mentioned he likes to say something more helpful than "I don't know" to someone who needs help. The next day, I got a personally-tailored Quora digest email that contained an answer to the question "Could you say something nice instead of 'I don't know'?". It wasn't under any of the topics that I follow (computer science related topics and parakeets).

In what follows I will exemplify most of my points using ads.

What we think these algorithms do
OK, so in my case, I have to try to put my knowledge about the limitations of this technology and my skepticism aside for a second and think like the average person. In that case, I think that:
  • If the ad is about a topic that I discussed in a spoken conversation, then there must be a recorder, and a speech-to-text component that converts the speech into written text.
  • Which language was I speaking or writing when this happened? If this happened for more than one language, it's possible that the company has different algorithms (or at least different trained models of the same algorithm) for each language.
  • Written content and transcribed speech are processed to match with the available content/ads.
  • In some cases, it seems that even simple keyword matching leads to nice results. E.g., if you mentioned a vacation in Thailand, you will be matched with ads containing the words vacation and Thailand (I will let you know if I get any such ads after writing this post...). It takes no text understanding capabilities to do so; it only requires recognizing that a bunch of words said in the same sentence (or in a short period of time) also appear in some ad. If you insist, it may work with information retrieval (IR) algorithms to recognize the most important words (see the sketch after this list).
  • In other cases, it seems that a deeper understanding of the meaning of my queries and conversations is required in order to match them to the relevant content. A good example is the Quora digest example from above. Based on IR algorithms, searching for common words like I, don't, know, helpful, nice, say, something will not get you as far as searching for rarer content words like vacation and Thailand. So it must be that the algorithm has built some meaning representation of our conversation, and compared it with that of the Quora answer, which was phrased in slightly different words. On top of everything, our conversation was in Hebrew, so it must have a universal multi-lingual meaning representation mechanism.
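
For the keyword-matching scenario, here is a toy sketch of how IR-style matching could work with TF-IDF and cosine similarity (the ads and the conversation snippet are invented, and nothing here reflects how any real ad platform works):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    ads = ["Cheap vacation packages to Thailand and Vietnam",
           "Advertise your cosmetics business with us",
           "Last-minute hotel deals in Seattle"]
    conversation = "we were thinking about a vacation, maybe Thailand in December"

    vectorizer = TfidfVectorizer()
    ad_vectors = vectorizer.fit_transform(ads)
    conversation_vector = vectorizer.transform([conversation])

    scores = cosine_similarity(conversation_vector, ad_vectors)[0]
    print(ads[scores.argmax()])    # the Thailand vacation ad wins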

Alternative explanations
Skepticism returns; I can believe that my speech is recorded and transcribed fairly accurately to text when I speak English. It's a bit harder to believe when it happens in other languages (e.g. Hebrew in my case), but I can still find it somewhat reasonable; automatic speech recognition (ASR), although it isn't perfect, still works reasonably well. It's the text understanding component I'm much, much more skeptical about. Despite the constant progress, and although popular media makes it seem like AI is solved and computers completely understand human language, I know it definitely isn't the case yet. So what other explanations can there be for the targeted content we see?

By Chance. None of this actually happens and we're just imagining. Well, OK, not none of this, but in some cases, it's really just chance.

One of the reasons that we're not easily convinced by this "by chance" argument is that we generally tend to pay attention only to the true-positive cases ("hits") in which we talked about something and immediately got an ad about it. It's much harder to notice the "misses": an ad that seems off (false positive) or all the things that we discussed and got no ads about (false negative).

At the end of the day, we're all just common people who share many common interests. Advertisers may reach us because they try to reach a large audience and we happen to fall under the very broad categories they target (e.g. age group). It could be that, by chance, we see ads exactly for the product or service we need right now.

Other Means. Technically speaking, rather than understanding text, it's much easier to consider other parameters such as your location, your declared interests (i.e. pages you've liked on Facebook, search results you clicked on in Google), your age, gender, marital status, and more. If you didn't provide one or more of these details, no worries! Your friends have, and it's likely you share some of these details with them!

Here is one good example:
 
I keep getting baby and pregnancy ads on Facebook. I'm a married woman in my 30s; both pieces of information are available in my Facebook profile, and that alone is enough to assume this topic is relevant for me (personally, it is not, but the percentage of women like me is too small to care about the error rate, and I totally accept that). Add to this that many of my Facebook friends are other people my age who are members of parenting groups, have liked pages of baby-related stuff, etc. I can't ever make this stop, but I guess it will stop naturally when I'm in my late forties.


I'd like to finish with an anecdote about how unsophisticated targeted content can sometimes be, to the point where you rub your eyes in disbelief and say "how stupid can these algorithms be?". A few days ago I wrote to someone in an email "I'll be in Seattle on May 30". Minutes later, I got an email from Booking.com with the title "Vered, Seattle has some last-minute deals!". That would have been smart, except that I had already used Booking.com to book a hotel room in Seattle for exactly those dates.

I may be way off, and it may be that these companies have killer AI abilities which are kept very well secret. In that case, some of my readers who work for these companies must be giggling now. To paraphrase Joseph Heller (or whoever said it first), just because you're paranoid doesn't mean they're not after you, but hey, there's no way their technology is good enough to do what you think it does, so some of it is just pure chance. Not as catchy as the original quote, I know.