Monday, July 13, 2015

Lexical Inference

After I dedicated the previous post to the awesome field of natural language processing, in this post I will drill down and tell you about the specific task that I'm working on: recognizing lexical inference. Most of the work that I will describe was done by other talented people. You can see references to their papers at the bottom of the post, in case you would like to read more about a certain work.

I'll start by defining what lexical inference is. We are given two terms, x and y (a term is a word such as cat or a multi-word expression such as United States of America). We would like to know whether we can infer the meaning of y from x (denoted by x → y throughout this post).

For example, we can infer animal from cat, because when we talk about a cat we refer to an animal. In general, y can be inferred from x if they hold a certain lexical or semantic relation; for example, if x is a y (cat → animal, Lady Gaga → singer), if x causes y (flu → fever), if x is a part of y (London → England), etc.


Now would be a good time to ask - why is this task important? We know that a cat is an animal. How would it help us if the computer can automatically infer that? I'll give a usage example. Let's say you use a search engine and type the query "actor Scientology" (or "actors engaged in Scientology", if you don't search by keywords). You expect the search engine to retrieve the following documents:

Figure 1: search results for the query "actor Scientology" that don't directly involve the word "actor"

since they are talking about a certain actor (Tom Cruise or John Travolta) and Scientology. However, what if these documents don't contain the word actor? The search engine needs to know that Tom Cruise → actor to retrieve the first document, and that John Travolta → actor to retrieve the second.
There are many other applications, and in general, knowing that one term infers another helps in dealing with language variability (there is more than one way of saying the same thing).

People have been working on this task for many years. Like many other NLP tasks, this one is difficult. There are two main approaches to recognizing lexical inference:
  • Resource-based: in this approach, the inference is based on knowledge from hand-crafted resources, that specify the semantic or lexical relations between words or entities in the world. In particular, the resource which is usually used for this task is WordNet, a lexical database of the English language. WordNet contains words which are connected to each other via different relations, such as (tail, part of, cat) and (cat, subclass of, feline).1  See figure 2 for an illustration of WordNet.

    This approach is usually very precise (it is correct most of the times that it says that x → y), because it relies on knowledge which is quite precise. However, its coverage (the percentage of times in which it recognizes that x → y, out of all the times that x → y is true) is limited, because some of the knowledge needed for the inference may be absent from the resource.
    Figure 2: an excerpt of WordNet - a lexical database of the English language

  • Corpus-based: this approach uses a huge body of text called a "corpus" (e.g. all the English articles in Wikipedia) which is supposed to be representative of the language. The inference is based on the statistics of occurrences of x and y in the corpus. There are several ways to use a corpus to recognize lexical inference:

    • pattern-based approach - there are some patterns such as "x and other y" or "y such as x" that indicate that x → y; if you find it difficult to understand, think about "animals such as cats" and "cat and other animals" and ignore the plural/singular. If x and y frequently occur in the corpus in such patterns, this approach will recognize that x → y. It is not enough to observe one or two occurrences; think about the sentence "my brother and other students". It may occur in the corpus, but this is not a general phenomenon: student is not a common attribute of brother. Positive examples such as cat and animal will probably occur more frequently in these patterns in the corpus.

      The first method defined these patterns manually [1]. A later work found such patterns automatically [2]. This work was highly referenced and used. It is quite precise and also has good coverage. However, it requires that x and y occur together in the corpus, and some words tend not to occur together, even though they are highly related; for instance, synonyms (e.g. elevator and lift).
    • distributional approach - the second approach solves exactly this. It is based on a linguistic hypothesis [3] that says that if words occur with similar neighboring words, then they tend to have similar meanings (e.g. elevator and lift will both appear next to down, up, building, floor, and stairs). There has been plenty of work in this approach: earlier methods defined some similarity measure between words which was based on the neighbors (the more common neighbors they share, the more similar they are) [4],[5]. In recent years, some automatic methods (that don't require defining a similarity measure) were developed (I might elaborate on these in another post, but it requires knowledge in topics that I haven't covered yet).
    Corpus-based methods, and in particular distributional ones, have a much higher coverage than resource-based methods, because they utilize huge texts. The amount of text available on the web is incredible, as opposed to structured knowledge. However, they are much less precise. The distributional hypothesis says something about the similarity of x and y, and similarity is a vague notion. Just because x and y are similar (what does that even mean?) it doesn't mean that we can infer x from y or vice versa; for instance, the words football and basketball are similar, and will probably share some common neighbors such as ball, player, team, match, and win. However, you can't infer one from the other. Moreover, distributional methods may say that hot and cold are similar, because both occur with weather, temperature, drink, water, etc. Now this is too much. Not only do we have hot ↛ cold and cold ↛ hot, but the two words mean exactly the opposite!
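To make the two corpus-based ideas a bit more concrete, here is a minimal Python sketch. It is only an illustration under loud assumptions: the five-sentence corpus, the crude `singular` helper, the two hard-coded patterns and the window size are all made up for this post, and none of it is the actual method of [1]-[5]. It counts pattern occurrences for a pair of terms, and separately compares two words by the cosine similarity of their neighboring-word counts.

```python
import math
import re
from collections import Counter

# Toy corpus, invented for illustration only.
CORPUS = [
    "animals such as cats are popular pets",
    "the cat and other animals were fed",
    "my brother and other students left early",
    "it is hot outside so drink water",
    "it is cold outside so drink water",
]

def singular(word):
    """Crude plural stripping; good enough for this toy corpus only."""
    return word[:-1] if word.endswith("s") else word

def pattern_count(x, y, corpus):
    """Count occurrences of 'y such as x' and 'x and other y'."""
    count = 0
    for sentence in corpus:
        for m in re.finditer(r"(\w+) such as (\w+)", sentence):
            if singular(m.group(1)) == y and singular(m.group(2)) == x:
                count += 1
        for m in re.finditer(r"(\w+) and other (\w+)", sentence):
            if singular(m.group(1)) == x and singular(m.group(2)) == y:
                count += 1
    return count

def context_vector(word, corpus, window=2):
    """Count the words that appear within `window` tokens of `word`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], lo) if j != i
                )
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0

cat_animal = pattern_count("cat", "animal", CORPUS)
brother_student = pattern_count("brother", "student", CORPUS)
hot_cold = cosine(context_vector("hot", CORPUS), context_vector("cold", CORPUS))
```

In this toy corpus, cat and animal match the patterns more often than brother and student do, while hot and cold end up with identical neighbors and therefore maximal similarity, even though neither can be inferred from the other. A real system would need a far larger corpus and a frequency threshold rather than raw counts.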

So what have we been doing?
We developed a new resource-based method for recognizing lexical inference [6]. We weren't going to compromise on precision, but we still wanted to improve upon the coverage of prior methods. In particular, we found that prior methods are incapable of recognizing inferences that contain recent terminology (e.g. social networks) and named-entities (called proper-names, e.g. Lady Gaga). This simply happens because prior methods are based on WordNet, and these terms are absent from WordNet; WordNet is an "ontology of the English language", so by definition it's not supposed to contain world-knowledge about named entities. Also, it hasn't been updated in years, so it doesn't cover recent terminology.

We used other structured knowledge resources that contain exactly this kind of information, are much larger than WordNet and are frequently updated. These resources contain information such as (Lady Gaga, occupation, singer) and (singer, subclass, person), that can indicate that Lady Gaga → singer and Lady Gaga → person. However, they may also contain information such as (Lady Gaga, producer, Giorgio Moroder), which does not indicate that Lady Gaga → Giorgio Moroder. As in WordNet, we needed to define which relations in the resource are relevant for lexical inference. For instance, the occupation relation is relevant for lexical inference, because a person infers their occupation (Lady Gaga → singer, Barack Obama → president).

As opposed to WordNet-based methods, which only need to select relevant relations out of the few relations WordNet defines, it would be excruciating to do the same for the resources we used. They contain thousands of relations. So we developed a method that automatically recognizes which resource relations are indicative of lexical inference. Then, if it finds that x and y are connected to each other via a path containing only relevant relations, it predicts that x → y. So in our previous example, since occupation and subclass were found indicative of lexical inference, then Lady Gaga → person.
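The path idea can be sketched in a few lines of Python. This is a simplified illustration, not the implementation from the paper [6]: the triples are the ones from the example above, and the set of relevant relations is simply hard-coded here (an assumption), whereas the method described above finds it automatically. The function name `infers` is hypothetical.

```python
from collections import deque

# The example triples from the text: (head, relation, tail).
TRIPLES = [
    ("Lady Gaga", "occupation", "singer"),
    ("singer", "subclass", "person"),
    ("Lady Gaga", "producer", "Giorgio Moroder"),
]

# Assumption: hard-coded for this sketch; the real method learns this set.
RELEVANT = {"occupation", "subclass"}

def infers(x, y, triples, relevant):
    """Return True if a path from x to y uses only relevant relations."""
    # Keep only edges whose relation is considered indicative of inference.
    graph = {}
    for head, relation, tail in triples:
        if relation in relevant:
            graph.setdefault(head, []).append(tail)

    # Breadth-first search from x over the filtered, directed graph.
    queue, seen = deque([x]), {x}
    while queue:
        node = queue.popleft()
        if node == y:
            return True
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False

gaga_person = infers("Lady Gaga", "person", TRIPLES, RELEVANT)
gaga_moroder = infers("Lady Gaga", "Giorgio Moroder", TRIPLES, RELEVANT)
```

Here Lady Gaga reaches person through occupation and then subclass, while the producer edge is ignored, so Giorgio Moroder is not reachable. Note that the edges are directed, so singer does not infer Lady Gaga.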

Similarly to the example, we've made successful inferences, and in particular inferences containing proper-names that were not captured by previous methods. We also maintained a very high precision. This is basically the simplified version of our paper.

So, is it perfect now?
Well... not exactly. First of all, our coverage is still lower than that of the corpus-based methods (but with higher precision, usually). Second, there are still some open issues left. I'll give one of them as an example, as this post is already very long (and I challenge you to tl;dr it).

Answer the following question:
apple __ fruit?
(a) →
(b) ↛

Well, I know this seems like a trivial question, but the answer is - it depends!
Are we talking about apple the fruit or about Apple the company?
The problem in determining whether apple → fruit is that the word apple has two senses (meanings). In one of its senses, apple → fruit, and in the other, apple ↛ fruit. In order to decide correctly, we need to know which of the senses of apple is the one we are being asked about.

Figure 3: I've just seen this on my Facebook feed after publishing the post and I had to add it :)

As I mentioned before, recognizing lexical inference is usually a component in some NLP application. In such an application, x, y, or both x and y are part of a text, and the application asks "does x infer y?", "what can we infer from x?" or "what infers y?". If x=apple, and we would like to know whether it infers y=fruit, the solution (for humans) would be to look at the texts.

Say we have the sentence I ate a green apple for breakfast. We can easily understand that the correct sense of apple in this sentence is fruit. How did we know that? We noticed words like ate, breakfast and green that are related to apple in the sense of fruit (and unrelated to Apple the company). There are already automatic methods that do that (with some success, of course). So one of the next challenges is to incorporate them and apply context-sensitive lexical inference. In this case, infer that I ate a fruit and not that I ate a company. I promise to update in case I have any progress with that.
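As a rough illustration of what "looking at the context" could mean, here is a tiny Python sketch that picks a sense by counting overlapping cue words, and then answers the inference question for that sense. Everything in it is an assumption made for this example: the `SENSES` table, its cue words and the function names are hypothetical, and real disambiguation methods are considerably more sophisticated.

```python
# Hypothetical, hand-written sense inventory for a single term.
SENSES = {
    "apple": {
        "fruit": {
            "cues": {"ate", "eat", "green", "breakfast", "juice", "tree"},
            "infers": {"fruit"},
        },
        "company": {
            "cues": {"iphone", "mac", "stock", "ceo", "bought"},
            "infers": {"company"},
        },
    }
}

def pick_sense(term, sentence):
    """Pick the sense whose cue words overlap most with the sentence.

    Ties go to the sense listed first in SENSES.
    """
    tokens = set(sentence.lower().replace(".", " ").split())
    senses = SENSES[term]
    return max(senses, key=lambda s: len(senses[s]["cues"] & tokens))

def infers_in_context(term, y, sentence):
    """Does `term`, in the sense chosen for this sentence, infer y?"""
    sense = pick_sense(term, sentence)
    return y in SENSES[term][sense]["infers"]

breakfast = "I ate a green apple for breakfast."
launch = "Apple stock rose after the new iPhone launch"
fruit_in_breakfast = infers_in_context("apple", "fruit", breakfast)
fruit_in_launch = infers_in_context("apple", "fruit", launch)
```

For the breakfast sentence the words ate, green and breakfast select the fruit sense, so apple → fruit holds; for the second sentence the company sense wins and it does not.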

[1] Hearst, Marti A. "Automatic acquisition of hyponyms from large text corpora." Proceedings of the 14th conference on Computational linguistics-Volume 2. Association for Computational Linguistics, 1992.  
[2] Snow, Rion, Daniel Jurafsky, and Andrew Y. Ng. "Learning syntactic patterns for automatic hypernym discovery." Advances in Neural Information Processing Systems 17. 2004.
[3] Harris, Zellig S. "Distributional structure." Word. 1954. 
[4] Weeds, Julie, and David Weir. "A general framework for distributional similarity." Proceedings of the 2003 conference on Empirical methods in natural language processing. Association for Computational Linguistics, 2003. 
[5] Kotlerman, Lili, et al. "Directional distributional similarity for lexical inference." Natural Language Engineering 16.04: 359-389. 2010. 
[6] Shwartz, Vered, Omer Levy, Ido Dagan, and Jacob Goldberger. "Learning to Exploit Structured Resources for Lexical Inference." Proceedings of the Nineteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2015.

1 These relations actually have less friendly names: holonym/meronym and hyponym/hypernym.


  1. Do you factor letter case in? If someone says "Apple", I'd be pretty confident that they are talking about the computer company (unless it's the beginning of a sentence). Though I appreciate not everyone capitalizes proper nouns.

    1. It's always a good idea to check whether the word is capitalized, and it can definitely help in distinguishing between named entities and other nouns.


      1. some people don't bother to capitalize, especially on social media (e.g. the example from the post).

      2. as you mentioned, a word in the beginning of the sentence may be a capitalized noun which is not a named entity.

      3. in the common case, you need to distinguish between several senses of a word/term which may consist of more than one named entity or more than one regular noun. E.g. deciding whether Dell refers to the company or to its founder Michael Dell.