From time to time I actually make some progress with my own research that I think might be interesting or beneficial to others. Now is such a time,* so let me share it with you.
If you've read my blog post about lexical inference, you should already be familiar with my research goals. Briefly: I'm working on automated methods to recognize that the meaning of one term (a word or multi-word expression) can be inferred from that of another.
There are several interesting lexical-semantic relations, such as synonymy/equivalence (intelligent/smart, elevator/lift), hypernymy/subclass (parrot/bird, stare/look), meronymy/part-of (spoon/cutlery, London/England), antonymy/opposite (short/tall, boy/girl), and causality (flu/fever, rain/flood).
These relations are interesting because whenever we encounter one of the terms in a sentence, we can use our knowledge to infer new facts. For instance, the sentence "I live in London" (wishful thinking...) could be used to infer "I live in England", knowing that London is a part of England. Of course, we also need to know something about the sentence itself: saying "I left London" doesn't necessarily entail "I left England". I might have just taken the train to Reading for the Festival :) But this is another line of research that I haven't explored deeply yet, so we'll leave it for another post.
In this particular work, we've focused on the common hypernymy relation between nouns (and noun phrases). We developed a method that, given a pair of nouns (x, y) (e.g. (cat, animal), (abbey road, record), (apple, animal)), predicts whether y is a hypernym of x - in other words, whether x is a subclass of y (e.g. cats are a subclass of animals) or an instance of y (e.g. Abbey Road is an instance of a record). I'll try to keep things simple; if you're interested in more details or in references to other papers, please refer to the paper.
There are two main approaches in the literature on hypernymy detection: path-based and distributional. Path-based methods are very elegant (a matter of taste, I guess). They assume that if y is a hypernym of x, then this relation must be expressed frequently enough in a large text corpus. A pair of words that tends to be connected through patterns such as "X and other Y", "Y such as X", or "X is a Y" is likely to hold the hypernymy relation (e.g. cats and other animals, fruit such as apples). To get past the adjectives and relative clauses that stand in the way of the important information (as in Abbey Road is [the 11th studio] album), a dependency parser is used, outputting the syntactic relations between the words in the sentence (e.g. Abbey Road is connected to is, and is is connected to album). See the figure below for an example of such a path.
[Figure: the dependency path connecting Abbey Road and album in the sentence "Abbey Road is the 11th studio album"]
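If you'd like to see the surface-level version of this idea in action, here is a minimal sketch (mine, not the paper's) that matches a few Hearst-style patterns with plain regular expressions. The real method works on dependency paths rather than raw strings, and both the patterns and the helper function are illustrative assumptions:

```python
import re

# A few Hearst-style surface patterns. A dependency parser is used in
# practice; these regexes are a toy stand-in for illustration only.
PATTERNS = [
    r"(?P<y>\w+) such as (?P<x>\w+)",    # "fruit such as apples"
    r"(?P<x>\w+) and other (?P<y>\w+)",  # "cats and other animals"
    r"(?P<x>\w+) is an? (?P<y>\w+)",     # "a parrot is a bird"
]

def extract_hypernym_candidates(sentence):
    """Return candidate (x, y) pairs where y may be a hypernym of x."""
    pairs = []
    for pattern in PATTERNS:
        for match in re.finditer(pattern, sentence.lower()):
            pairs.append((match.group("x"), match.group("y")))
    return pairs

print(extract_hypernym_candidates("I like fruit such as apples."))
# [('apples', 'fruit')]
```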
Though this method works nicely, it suffers from one major limitation: it cannot generalize. If x1 and y1 are mainly connected by the pattern "X is considered as Y", and x2 and y2 via "X is regarded as Y", the two pairs practically share no information: these are considered two different paths. Attempts to generalize such paths by replacing words along them with wildcards (e.g. "X is * as Y") or part-of-speech tags (e.g. "X is VERB as Y") may end up with paths that are too general; for instance, "X is denied as Y", a negative path, also generalizes to "X is VERB as Y".
In contrast, the distributional approach considers the separate occurrences of x and y in the corpus. It relies on the distributional hypothesis, which I've already mentioned here and here. The main outcome of this hypothesis is that words can be represented by vectors, with words that are similar in meaning sharing similar vectors. In recent years, people have been using these vectors for supervised hypernymy detection. To represent each (x, y) pair as a vector, they combined x's and y's vectors in some way (e.g. by concatenating them), trained a classifier on top of these vectors, and predicted for new pairs (e.g. (apple, fruit)) whether y is a hypernym of x. These methods have shown good performance, but it was later found that they tend to overfit to the training data and are pretty bad at generalizing; for example, if you try to predict hypernymy for a new pair (x, y) of rare words that weren't observed in the training data, the prediction will be (only slightly better than) a guess.
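To make the supervised distributional setup concrete, here is a toy sketch: each pair is represented by the concatenation of two pre-trained word vectors, and a classifier is trained on top. The random vectors and the tiny training set below are made up for illustration; in reality you would load real embeddings and use many labeled pairs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
vocab = ["cat", "animal", "apple", "fruit", "paris", "france"]
embeddings = {w: rng.normal(size=50) for w in vocab}  # fake 50-d vectors

def pair_vector(x, y):
    # represent the pair by concatenating the two word vectors
    return np.concatenate([embeddings[x], embeddings[y]])

train_pairs = [("cat", "animal"), ("apple", "fruit"), ("paris", "france")]
labels = [1, 1, 0]  # 1 = y is a hypernym of x

X = np.array([pair_vector(x, y) for x, y in train_pairs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([pair_vector("apple", "animal")]))
```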
To sum up recent work - path-based methods can leverage information about the relation between a pair of words, but they do not generalize well. On the other hand, distributional methods might not recognize the relation between a pair of words, but they contain useful information about each of the words. Since these two approaches are complementary, we combined them!
We started by improving the path representation, using a recurrent neural network. I can't possibly explain the technical details without first writing a long background post about neural networks, so I'll skip most of them. I'll just say that this is a machine learning model that processes sequences (e.g. sequences of words, letters, edges, etc.), and that can, among other things, output a vector representing the entire sequence. In our case, we split the dependency path into edges and let the network learn a vector representation of the entire path by going over the edges sequentially. Then, we replace the traditional path features that represent a term-pair, e.g. "X is defined as Y", with the vector representing the path - the path embedding.
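For the curious, here is a minimal PyTorch sketch of the general idea: a recurrent network (an LSTM here) reads the path one edge vector at a time, and its final hidden state serves as the path embedding. The dimensions, names, and random edge vectors are my own illustrative choices, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

EDGE_DIM, PATH_DIM = 20, 60  # illustrative sizes

class PathEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(EDGE_DIM, PATH_DIM, batch_first=True)

    def forward(self, edge_vectors):
        # edge_vectors: (batch, path_length, EDGE_DIM)
        _, (last_hidden, _) = self.lstm(edge_vectors)
        return last_hidden.squeeze(0)  # (batch, PATH_DIM) path embedding

encoder = PathEncoder()
# e.g. a 4-edge path such as X <- is -> defined -> as -> Y
path = torch.randn(1, 4, EDGE_DIM)  # fake edge vectors
print(encoder(path).shape)  # torch.Size([1, 60])
```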
The nice thing about these path embeddings is that - can you guess? - similar paths have similar path embeddings. This happens thanks to two things. First, the network can learn that certain edges are important for detecting hypernymy while others are not, which may lead it to consolidate paths that differ only in certain unimportant edges.
Second, since neural networks can only work on vectors, all the information we use is encoded as vectors. For instance, the words along the path (e.g. is, defined, as) are all encoded as vectors. We use word embeddings, so similar words have similar vectors. This results in similar vectors for paths that differ by a pair of similar words, e.g. "X is defined as Y" and "X is described as Y".
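Here is a tiny numeric illustration of that point, with completely made-up three-dimensional vectors: when the vectors of defined and described are close, any path representation built from them will be close too, while denied stays far away:

```python
import numpy as np

# made-up word vectors for illustration only
defined   = np.array([0.90, 0.10, 0.30])
described = np.array([0.85, 0.15, 0.28])   # close to "defined"
denied    = np.array([-0.70, 0.80, 0.10])  # far from both

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(defined, described))  # high: the two paths will look alike
print(cosine(defined, denied))     # low: the negative path stays distinct
```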
The fact that similar paths have similar vectors is helpful for the classifier; in the paper, we show that our method performed better than prior methods. To give an intuition, let's say that the classifier learned that the path "X company is a Y", which was pretty common in the corpus, indicates hypernymy, and that "X ltd is a Y" occurred only once, for a positive (x, y) pair. Previous methods would probably decide that such a path is not indicative of hypernymy, since they don't have enough evidence about it. Our method, however, recognizes that ltd and company are similar words, yielding similar path vectors for these two paths. If "X company is a Y" is considered indicative, then so is "X ltd is a Y".
Finally, we combined the complementary path-based and distributional approaches. To add distributional information to our model (the information on the separate occurrences of each term x and y), we simply added the word embedding vectors of x and y to the model, allowing it to rely on this information as well. With this simple change we achieved a significant improvement in performance compared to prior methods from each approach. For so many years people have been saying these two approaches are complementary, and it turns out it was actually not too difficult to combine them :)
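Schematically, the combined representation looks something like the sketch below: the word vectors of x and y are concatenated with a summary of the paths observed for the pair, and the result is fed to the classifier. The dimensions and the averaging over paths are simplifying assumptions on my part, an illustration rather than the paper's exact model:

```python
import numpy as np

WORD_DIM, PATH_DIM = 50, 60  # illustrative sizes

def combined_features(x_vec, y_vec, path_vecs):
    # summarize all paths observed for the (x, y) pair by averaging,
    # then concatenate with the distributional vectors of x and y
    avg_path = np.mean(path_vecs, axis=0) if path_vecs else np.zeros(PATH_DIM)
    return np.concatenate([x_vec, avg_path, y_vec])

x_vec, y_vec = np.random.randn(WORD_DIM), np.random.randn(WORD_DIM)
paths = [np.random.randn(PATH_DIM) for _ in range(3)]  # fake path embeddings
print(combined_features(x_vec, y_vec, paths).shape)  # (160,)
```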
Paper details
Improving Hypernymy Detection with an Integrated Path-based and Distributional Method. Vered Shwartz, Yoav Goldberg and Ido Dagan. ACL 2016. link
* Now = a few months ago, but the paper was under review and I couldn't publish anything about it.