Monday, October 30, 2017

Ambiguity

One of the problems with teaching computers to understand natural language is that much of the meaning in what people say is actually hidden in what they don't say. As humans, we trivially interpret the meaning of ambiguous words, written or spoken, according to their context. For example, this blog post is published in a blog that largely discusses natural language processing, so if I write "NLP", you'd know I refer to natural language processing rather than to neuro-linguistic programming. If I told you that the blog post doesn't fit into a tweet because it's too long, you'd know that the blog post is too long and not that the tweet is too long. You would infer that even without any knowledge of Twitter's character limit, because it just doesn't make sense otherwise. Unfortunately, the common sense and world knowledge that come so easily to us are not trivial to teach to machines. In this post, I will present a few cases in which ambiguity poses a challenge for NLP, along with common ways in which we try to overcome it.


Polysemous words, providing material for dad jokes since... ever.
Lexical Ambiguity
Lexical ambiguity occurs when a word is polysemous, i.e. has more than one meaning, and the sentence in which it appears can be interpreted differently depending on which sense is intended.

For example, the word bank has two meanings - either a financial institution or the land alongside a river. When we read a sentence with the word bank, we understand which sense of bank the text refers to according to the context:

(1) Police seek person who robbed bank in downtown Reading.
(2) The faster-moving surface water travels along the concave bank.

In these example sentences, "robbed" indicates the first sense while "water" and "concave" indicate the second.



Existing Solutions for Lexical Ambiguity
Word embeddings are great, but they conflate all the different senses of a word into one vector. Since word embeddings are learned from the occurrences of a word in a text corpus, the word embedding for bank is learned from its occurrences in both senses, and is affected both by neighbors related to the first sense (money, ATM, union) and by neighbors related to the second (river, west, water, etc.). The resulting vector is very likely to lean towards the more common sense of bank, as can be seen in this demo: all the nearest words to bank are related to its financial sense.
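If you want to reproduce this observation yourself, here is a minimal sketch; it assumes gensim and one of its downloadable pre-trained GloVe models, which are my own choices for illustration and not necessarily what the demo above uses:

import gensim.downloader as api

# Load a pre-trained static embedding model (downloads the vectors on first use).
model = api.load("glove-wiki-gigaword-100")

# The single vector for "bank" conflates both senses, but its nearest neighbors
# typically lean towards the more frequent, financial sense.
for word, score in model.most_similar("bank", topn=10):
    print(word, round(score, 3))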

Word Sense Disambiguation (WSD) is an NLP task aimed at disambiguating a word in context. Given a list of potential senses for each word, the correct sense of the word in the given context is determined. Similar to the way humans disambiguate words, WSD systems also rely on the surrounding context. A simple way to do so, in a machine-learning based solution (i.e. learning from examples), is to represent a word-in-context as the average of its context word vectors ("bag-of-words"). In the examples above, for the first occurrence of bank we get:

feature_vector(bank) = 1/8 (vector(police) + vector(seek) + vector(person) + vector(who) + vector(robbed) + vector(in) + vector(downtown) + vector(reading))

and for the second:

feature_vector(bank) = 1/9 (vector(the) + vector(faster) + vector(moving) + vector(surface) + vector(water) + vector(travels) + vector(along) + vector(the) + vector(concave))
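As a rough illustration of this bag-of-words feature, here is a toy sketch with made-up 3-dimensional vectors (a real system would use pre-trained embeddings and would probably drop stop words):

import numpy as np

def context_feature_vector(tokens, target_index, embeddings):
    # Average the vectors of all context words, skipping the target word
    # itself and any word we don't have a vector for.
    context = [t.lower() for i, t in enumerate(tokens) if i != target_index]
    vectors = [embeddings[t] for t in context if t in embeddings]
    return np.mean(vectors, axis=0)

# Made-up 3-dimensional vectors, just to show the computation:
toy_embeddings = {
    "police": np.array([0.9, 0.1, 0.0]),
    "robbed": np.array([0.8, 0.2, 0.1]),
    "downtown": np.array([0.7, 0.3, 0.2]),
}
tokens = "Police seek person who robbed bank in downtown Reading".split()
print(context_feature_vector(tokens, tokens.index("bank"), toy_embeddings))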



Can Google expand the acronym "ACL" correctly for me?
Acronyms
While many words in English are polysemous, things turn absolutely chaotic with acronyms. Acronyms are highly polysemous, some having dozens of different expansions. To make things even more complicated, as opposed to regular words, whose various senses are recorded in dictionaries and taxonomies like WordNet, acronyms are often domain-specific and not commonly known.

Take for example a Google search for "ACL 2017". I get results both for the Annual Meeting of the Association for Computational Linguistics (which is what I was searching for) and for the Austin City Limits festival. I have no idea whether this happens because (a) these are the two most relevant/popular expansions of "ACL" lately, or the only ones that go with "2017"; or (b) Google successfully disambiguated my query, showing the NLP conference first while also leaving the music festival lower in the search results, since it knows I also like music festivals. Probably (a) :)

Existing Solutions for Acronym Expansion
Expanding acronyms is considered a different task from WSD, since there is no inventory of potential expansions for each acronym. Given enough context (e.g. "2017" is a context word for the acronym ACL), it is possible to find texts that contain the expansion. This can be done either by searching for a pattern (e.g. "Association for Computational Linguistics (ACL)") or by considering all the word sequences that start with these initials and deciding on the correct one using rules or a machine-learning based solution.
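Here is a toy sketch of the pattern-based option; the pattern and the initials check below are simplifications I made up for illustration, not a production acronym expander:

import re

def is_subsequence(small, big):
    # True if the characters of small appear in big in the same order.
    it = iter(big)
    return all(ch in it for ch in small)

def find_expansion(acronym, text):
    # Look for the common "long form (ACRONYM)" pattern: grab up to 8 words
    # right before "(ACRONYM)" and keep the shortest span whose initials
    # contain the acronym letters in order (so fillers like "for" are allowed).
    match = re.search(r"((?:\S+\s+){1,8})\(%s\)" % re.escape(acronym), text)
    if not match:
        return None
    words = match.group(1).split()
    for start in range(len(words) - 1, -1, -1):
        span = words[start:]
        initials = "".join(w[0].upper() for w in span)
        if span[0][0].upper() == acronym[0] and is_subsequence(acronym.upper(), initials):
            return " ".join(span)
    return None

text = ("The 55th Annual Meeting of the Association for Computational "
        "Linguistics (ACL) was held in Vancouver.")
print(find_expansion("ACL", text))  # -> Association for Computational Linguistics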


Syntactic Ambiguity
No beginner NLP class is complete without at least one of the following example sentences:
  1. They ate pizza with anchovies
  2. I shot an elephant wearing my pajamas
  3. Time flies like an arrow
Common to all these examples is that each can be interpreted in multiple different ways, and the different interpretations correspond to different underlying syntactic structures of the sentence. Let's go over the examples.

The first sentence "They ate pizza with anchovies", can be interpreted as (i) "they ate pizza and the pizza had anchovies on it", which is the more likely interpretation, illustrated on the left side of the image below. This sentence has at least two more crazy interpretations: (ii) they ate pizza using anchovies (instead of using utensils, or eating with their hands), as in the right side of the image below, and (iii) they ate pizza and their anchovy friends ate pizza with them.

Visual illustration of the interpretations of the sentence "They ate pizza with anchovies".
Image taken from https://explosion.ai/blog/syntaxnet-in-context.
The first interpretation considers "with anchovies" as describing the pizza, while the other two consider it as describing the eating action. In the output of a syntactic parser, these interpretations differ in the tree structure, as illustrated below.

Possible syntactic trees for the sentence "They ate pizza with anchovies", using displacy.

Although this is a classic example, both the spaCy and the Stanford CoreNLP demos got it wrong. The difficulty is that, syntactically speaking, both trees are plausible. Humans know to prefer the first one based on the semantics of the words, using their knowledge that an anchovy is something that you eat rather than something you eat with. Machines don't come with this knowledge.
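If you want to inspect what a parser decides on your own machine, here is a minimal sketch; it assumes spaCy with its small English model (en_core_web_sm) installed, and the attachment you get will depend on the model version:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("They ate pizza with anchovies")

for token in doc:
    # The head of "with" tells you whether the prepositional phrase was
    # attached to "pizza" (the likely reading) or to "ate".
    print(token.text, token.dep_, "<-", token.head.text)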

A similar parser decision is crucial in the second sentence, and just in case you haven't managed to find the funny interpretations yet: "I shot an elephant wearing my pajamas" has two ambiguities. First, does shoot mean taking a photo of, or pointing a gun at? (a lexical ambiguity). But more importantly, who's wearing the pajamas? That depends on whether wearing is attached to shot (meaning that I wore the pajamas while shooting) or to elephant (meaning that the elephant miraculously managed to squeeze into my pajamas). This entire scene, regardless of the interpretation, is very unlikely, and please don't kill elephants, even if they're stretching your pajamas.

The third sentence is just plain weird, but it also has multiple interpretations, which you can read about here.

Existing Solutions for Syntactic Ambiguity
In the past, parsers were based on deterministic grammar rules (e.g. a noun and a modifier create a noun-phrase) rather than on machine learning. A possible solution to the ambiguity issue was to add different rules for different words. For more details, you can read my answer to Natural Language Processing: What does it mean to lexicalize PCFGs? on Quora.

Today, similarly to other NLP tasks, parsers are mostly based on neural networks. In addition to other information, the word embeddings of the words in the sentence are used to decide on the correct output. So potentially, such a parser may learn that "eat * with [y]" yields the structure on the left of the image if y is edible (i.e. its word embedding is similar to those of other edible things), and the one on the right otherwise.
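To make that intuition concrete, here is a toy sketch with made-up 2-dimensional vectors; it is nothing like a real neural parser, just the idea that the embedding of y can signal edibility:

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors: the first dimension loosely stands for "food-ness".
toy_vectors = {
    "anchovies": np.array([0.9, 0.1]),
    "friends":   np.array([0.1, 0.9]),
    "food":      np.array([1.0, 0.0]),  # a prototype standing in for "edible things"
}

def attach(y):
    # Attach "with y" to the noun if y looks edible, otherwise to the verb.
    if cosine(toy_vectors[y], toy_vectors["food"]) > 0.5:
        return "pizza (noun attachment)"
    return "ate (verb attachment)"

print(attach("anchovies"))  # -> pizza (noun attachment)
print(attach("friends"))    # -> ate (verb attachment)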


Coreference Ambiguity
Very often a text mentions an entity (someone/something), and then refers to it again, possibly in a different sentence, using another word. Take these two paragraphs from a news article as an example:

From https://www.theguardian.com/sport/2017/sep/22/donald-trump-nfl-national-anthem-protests.
The various entities participating in the article were marked in different colors.

I marked various entities that participate in the article in different colors. I grouped together different mentions of the same entities, including pronouns ("he" as referring to "that son of a bitch"; excuse my language, I'm just quoting Trump) and different descriptions ("Donald Trump", "the president"). To do that, I had to use my common sense (the he must refer to that son of a bitch who disrespected the flag, definitely not to the president or the NFL owners, right?) and my world knowledge (Trump is the president). Again, any task that requires world knowledge and reasoning is difficult for machines.

Existing Solutions for Coreference Resolution
Coreference resolution systems group mentions that refer to the same entity in the text. They go over each mention (e.g. the president), and either link it to an existing group containing previous mentions of the same entity ([Donald Trump, the president]), or start a new entity cluster ([the president]). Systems differ from each other, but in general, given a pair of mentions (e.g. Donald Trump, the president), they extract features referring either to each single mention (e.g. part-of-speech, word vector) or to the pair (e.g. gender/number agreement, etc.), and decide whether these mentions refer to the same entity.
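Here is a highly simplified sketch of that clustering loop; pair_score is a hypothetical stand-in for a trained classifier over such features, hard-coded here just so the example runs:

def pair_score(mention, antecedent):
    # Hypothetical stand-in for a learned mention-pair classifier.
    if mention.lower() in {"he", "the president"} and antecedent == "Donald Trump":
        return 1.0
    return 0.0

def resolve(mentions, threshold=0.5):
    clusters = []  # each cluster is a list of mentions of one entity
    for mention in mentions:
        best, best_score = None, threshold
        for cluster in clusters:
            score = max(pair_score(mention, m) for m in cluster)
            if score > best_score:
                best, best_score = cluster, score
        if best is not None:
            best.append(mention)        # link to an existing entity
        else:
            clusters.append([mention])  # start a new entity cluster
    return clusters

print(resolve(["Donald Trump", "the NFL owners", "the president", "he"]))
# -> [['Donald Trump', 'the president', 'he'], ['the NFL owners']]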

Note that mentions can be proper names (Donald Trump), common nouns (the president), and pronouns (he); identifying coreference between pairs of mentions of each type requires different abilities and knowledge. For example, a proper-name + common-noun pair may require world knowledge (Donald Trump is the president), while pairs of common nouns can sometimes be resolved with semantic similarity (e.g. synonyms like owner and holder). Pronouns can sometimes be matched to their antecedent (the original mention) based on proximity and linguistic cues such as gender and number agreement, but very often there is more than one possible match.

A nice example of tackling coreference ambiguity is the Winograd Schema Challenge, which I first heard about from this post in the Artificial Detective blog. In this contest, computer programs are given a sentence with two nouns and an ambiguous pronoun, and they need to answer which noun the pronoun refers to, as in the following example:

The trophy would not fit in the brown suitcase because it was too big. What was too big?
Answer 0: the trophy
Answer 1: the suitcase

Answering such questions requires - yes, you guessed correctly - commonsense and world knowledge. In the given example, the computer must reason that for the first object to fit into the second, the first object must be smaller than the second, so if the trophy could not fit into the suitcase, the trophy must be too big. Conversely, if the sentence had said small instead of big, the answer would have been "the suitcase".


Noun Compounds
Words are usually considered the basic unit of language, and many NLP applications use word embeddings to represent the words in a text. Word embeddings do a pretty decent job at capturing the semantics of a single word, and sometimes also its syntactic and morphological properties. The problem starts when we want to capture the semantics of a multi-word expression (or a sentence, or a document). The embedding of a word, for example dog, is learned from its occurrences in a large text corpus; the more common a word is, the more occurrences there are, and the higher the quality of the learned word embedding will be (it will be located "correctly" in the vector space, near things that are similar to dog). A bigram like hot dog is already much less frequent, hot dog bun even less so, and so on. The conclusion is clear - we can't learn embeddings for multi-word expressions the same way we do for single words.

The alternative is to try to somehow combine the word embeddings of the individual words in the expression into a meaningful representation. Although there are many approaches to this task, there is no one-size-fits-all solution; a multi-word expression is not simply the sum of its single-word meanings (hot dog is an extreme counter-example!).
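The simplest such combination is just averaging the single-word vectors. Here is a toy sketch with made-up 3-dimensional vectors: it gives a plausible guess for compositional phrases like watermelon soup, but by construction it cannot get hot dog right.

import numpy as np

# Made-up vectors; imagine the first dimension means "fruit", the second "dish",
# and the third "temperature".
toy = {
    "watermelon": np.array([0.9, 0.1, 0.0]),
    "soup":       np.array([0.1, 0.9, 0.2]),
    "hot":        np.array([0.0, 0.0, 0.9]),
    "dog":        np.array([0.1, 0.0, 0.1]),
}

def compose(phrase):
    # Average the vectors of the individual words.
    return np.mean([toy[w] for w in phrase.split()], axis=0)

print(compose("watermelon soup"))  # lands sensibly between watermelon and soup
print(compose("hot dog"))          # nowhere near where a sausage-in-a-bun should be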

One example out of many is noun-compounds. A noun-compound is a noun phrase made up of two or more words, usually consisting of a head (main) noun and its modifiers, e.g. video conference, pumpkin spice latte, and paper clip. Noun-compounds are very common in English, but most of them don't appear frequently in text corpora. As humans, we can usually interpret the meaning of a new noun-compound if we know the words it is composed of; for example, even though I've never heard of watermelon soup, I can easily infer that it is a soup made of watermelon.

Similarly, if I want my software to have a nice vector representation of watermelon soup, there is no way I can base it on the corpus occurrences of watermelon soup -- it would be too rare. However, I used my commonsense to build a representation of watermelon soup in my head -- how would my software know that there is a made of relation between watermelon and soup? This relation is one out of many possible ones, for example: video conference (means), paper clip (purpose), etc. Note that the relation is implicit, so there is no immediate way for the machine to know what the correct relation between the head and the modifier is.[1] To complicate things a bit further, many noun-compounds are non-compositional, i.e. the meaning of the compound is not a straightforward combination of the meanings of its words, as in hot dog, baby sitting, and banana hammock.

Existing Solutions for Noun-compound Interpretation
Automatic methods for interpreting the relation between the head and the modifier of noun-compounds have largely been divided into two approaches:

(1) machine-learning methods, i.e. hand-labeling a bunch of noun-compounds with a set of pre-defined relations (e.g. part of, made of, means, purpose...), and learning to predict the relation for unseen noun-compounds. The features relate either to each single word (head/modifier), such as their word vectors or lexical properties from WordNet, or to the noun-compound itself and its corpus occurrences. Some methods also try to learn a vector representation for a noun-compound by applying a function to the word embeddings of its single words (e.g. vector(olive oil) = function(vector(olive), vector(oil))).

(2) pattern-based methods, i.e. finding joint occurrences of the two nouns in a text corpus, some of which explicitly describe the relation between the head and the modifier, for example "oil made of olives".
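Here is a toy sketch of the second approach; the two patterns and the tiny "corpus" below are made up for illustration, and a real system would use many more patterns (or learn them):

import re

# Each template hints at one possible implicit relation.
PATTERNS = {
    "made of": r"{head}s? (?:is |are )?made (?:of|from) {modifier}s?",
    "purpose": r"{head}s? (?:used |designed )?for {modifier}s?",
}

def guess_relation(modifier, head, corpus_sentences):
    for relation, template in PATTERNS.items():
        pattern = re.compile(template.format(head=head, modifier=modifier),
                             re.IGNORECASE)
        for sentence in corpus_sentences:
            if pattern.search(sentence):
                return relation, sentence
    return None, None

corpus = ["This oil is made from olives grown in Greece.",
          "A clip designed for papers sat on the desk."]
print(guess_relation("olive", "oil", corpus))   # -> ('made of', ...)
print(guess_relation("paper", "clip", corpus))  # -> ('purpose', ...)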

While there has been a lot of work in this area, success on this task is still mediocre. A recent paper suggested that current methods succeed mostly by predicting the relation based solely on the head or on the modifier - for example, most noun-compounds with the head oil hold the made of relation (olive oil, coconut oil, avocado oil, ...). While this guess can be pretty accurate most of the time, it may cause funny mistakes, as in the meme below.

From http://www.quickmeme.com/meme/3r9thy.

For the sake of simplicity, I focused on two-word noun-compounds, but noun-compounds with more than two words have an additional, syntactic ambiguity: what are the head-modifier relations within the compound? This is often referred to as bracketing. Without getting into too many details, consider the example of hot dog bun from before: it should be interpreted as [[hot dog][bun]] rather than [hot [dog bun]].
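A classic heuristic for bracketing, shown below as a toy sketch with made-up counts, compares how strongly the adjacent pairs are associated in a corpus: if (hot, dog) co-occurs much more often than (dog, bun), prefer the left-branching [[hot dog] bun].

# Made-up bigram counts standing in for real corpus statistics.
toy_bigram_counts = {
    ("hot", "dog"): 5000,
    ("dog", "bun"): 20,
}

def bracket(w1, w2, w3, counts):
    left = counts.get((w1, w2), 0)    # evidence for [[w1 w2] w3]
    right = counts.get((w2, w3), 0)   # evidence for [w1 [w2 w3]]
    if left >= right:
        return "[[%s %s] %s]" % (w1, w2, w3)
    return "[%s [%s %s]]" % (w1, w2, w3)

print(bracket("hot", "dog", "bun", toy_bigram_counts))  # -> [[hot dog] bun]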





More to read?
Yeah, I know it was a long post, but there is so much more ambiguity in language that I haven't discussed. Here is another selected topic, in case you're looking for more to read. We all speak a second language called emoji, which is full of ambiguity. Here are some interesting articles about it: Emoji could cause confusion, trouble in the workplace, The real meaning of all those emoji in Twitter handles, Learning the language of emoji, and Why emojis may be the best thing to happen to language in the digital age. For the older people among us (and in the context of emoji, I consider myself old too, so no offence anyone), if you're not sure about the meaning of an emoji, why don't you check emojipedia first, just to make sure you're not accidentally using phallic symbols in your grocery list?


[1] In this very interesting paper by Preslav Nakov there is a nice observation: a noun-compound is a "compression device" that allows saying more with fewer words.