One of the things that make natural language processing so difficult is language variability: there are multiple ways to express the same idea or meaning. I've mentioned it several times on this blog, since it is a true challenge for any application that aims to interact with humans. You may program the application to understand common things or questions a human may have, but if the human deviates from the script and phrases the question slightly differently, the program is helpless. If you want a good example, take your favorite personal assistant (Google Assistant, Siri, Alexa, etc.) and ask it a question you know it can answer, but this time phrase it differently. Here is mine:
Both questions I asked have roughly the same meaning, yet Google answers the first perfectly but fails to answer the second, backing off to showing search results. In fact, I just gave you a "free" example of another difficult problem in NLP: ambiguity. It seems that Google interpreted showers as "meteor showers" rather than as light rain.
One way to deal with the language variability difficulty is to construct a huge dictionary that contains groups or pairs of texts with roughly the same meaning: paraphrases. Then, applications like the assistant can, given a new question, look up in the dictionary any question they were programmed to answer that has the same meaning. Of course, this is a naive idea, given that language is infinite and one can always form a new sentence that has never been said before. But it's a good start, and it may help in developing algorithms that can associate a new, unseen text with an existing dictionary entry (i.e. generalize).
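To make this concrete, here is a minimal sketch of such a dictionary lookup, with hypothetical entries and a deliberately naive exact-match policy (a real application would need the generalization just discussed):

```python
# A toy sketch of the paraphrase-dictionary idea. All entries are
# hypothetical; a real resource would contain millions of them.
paraphrase_dict = {
    "will it rain today?": {
        "is it going to rain today?",
        "should i take an umbrella today?",
    },
    "what time is it?": {
        "what's the time?",
        "do you have the time?",
    },
}

def find_known_question(user_question):
    """Return the question the app was programmed to answer, if the
    user's phrasing matches one of its known paraphrases."""
    q = user_question.strip().lower()
    for known, paraphrases in paraphrase_dict.items():
        if q == known or q in paraphrases:
            return known
    return None  # off-script phrasing: the naive lookup is helpless

print(find_known_question("Should I take an umbrella today?"))
# -> will it rain today?
```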
Several approaches have been used to construct such dictionaries, and in this post I will present some of the simple-but-smart approaches.
Translation-based paraphrasing
The idea behind this approach is super clever and simple: suppose we are interested in collecting paraphrases in English. If two English texts are translated to the same text in a foreign language, then they are likely paraphrases of each other. Here is an example:
The English texts on the left are translated into the same Italian text on the right, implying that they have the same meaning.
This approach goes back as far as 2001. The most prominent resource constructed with it is the paraphrase database (PPDB), containing hundreds of millions of text pairs with roughly the same meaning. Using the online demo, I looked up paraphrases of "nice to meet you", yielding a bunch of friendly variants that may be of use for conference small talk:
it was nice meeting you
it was nice talking to you
nice to see you
hey, you guys
it's nice to meet you
very nice to meet you
i'm pleased to meet you
how are you
i'm delighted
it's been a pleasure

Paraphrases of "nice to meet you", from PPDB.
In practice, all these texts appear as paraphrases of "nice to meet you" in the resource, each with a score indicating to what extent it is a paraphrase. These texts were found to be translated to the same text in one or more foreign languages, and their scores are derived from the translation scores (as explained here), along with other heuristics.2
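To make the pivot idea concrete, here is a minimal sketch, assuming a toy phrase table with made-up probabilities (real tables are extracted from bilingual corpora, and PPDB's scoring also folds in other heuristics). A candidate paraphrase e2 of a phrase e1 is scored by summing p(e2|f) * p(f|e1) over every shared foreign translation f:

```python
from collections import defaultdict

# Toy phrase-table probabilities (hypothetical numbers).
# p(italian | english):
en_to_it = {
    "thanks a lot": {"mille grazie": 0.7, "grazie": 0.3},
    "thank you very much": {"mille grazie": 0.8, "grazie mille": 0.2},
}
# p(english | italian):
it_to_en = {
    "mille grazie": {"thanks a lot": 0.4, "thank you very much": 0.6},
    "grazie": {"thanks": 0.9, "thanks a lot": 0.1},
    "grazie mille": {"thank you very much": 1.0},
}

def paraphrase_scores(phrase):
    """Score candidate paraphrases of `phrase` by pivoting through
    every shared foreign translation:
        p(e2 | e1) = sum over f of p(e2 | f) * p(f | e1)
    """
    scores = defaultdict(float)
    for foreign, p_f_given_e1 in en_to_it.get(phrase, {}).items():
        for e2, p_e2_given_f in it_to_en.get(foreign, {}).items():
            if e2 != phrase:
                scores[e2] += p_e2_given_f * p_f_given_e1
    return dict(scores)

print(paraphrase_scores("thanks a lot"))
# roughly {'thank you very much': 0.42, 'thanks': 0.27}
```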
While this approach provides a ton of very useful paraphrases, as you can guess, it also introduces errors, as every automatic method does. One type of error occurs when the foreign word has more than one sense, each translating into a different, unrelated English word. For example, the Spanish word estación has two meanings: station and season. When given a Spanish sentence that contains this word, it is (hopefully) translated to the correct English word according to the context. This paraphrase approach, however, does not look at the original sentences in which these words occur, but only at the phrase table -- a huge table of English phrases and their Spanish translations, without their original contexts. It therefore has no way to tell that stop and station refer to the same sense of estación, and are therefore paraphrases, while season and station are translations of two different senses of estación.
Even without making such a horrible mistake of considering two unrelated texts as paraphrases, paraphrasing is not well-defined, and the paraphrase relation encompasses many different relations. For example, looking up paraphrases of the word tired in PPDB, you will get equivalent phrases like fatigued, more specific phrases like overtired/exhausted, and related but not-quite-the-same phrases like bored. This may occur when the translator gets creative and does not remain completely faithful to the original sentence, but also when the target language does not contain an exact translation for a word, defaulting to a slightly more specific or more general word. While this phenomenon is not specific to this approach but common to all paraphrasing approaches (for different reasons), it has been studied by the PPDB people, who did an interesting analysis of the different semantic relations the resource captures.
The following approaches focus on paraphrasing predicates. A predicate is a text describing an action or a relation between one or more entities/arguments, very often containing a verb, for example: John ate an apple or Amazon acquired Whole Foods. Predicate paraphrases are pairs of predicate templates -- i.e. predicates whose arguments were replaced by placeholders -- that have roughly the same meaning given an assignment to their arguments. For example, [a]0 acquired [a]1 and [a]0 bought [a]1 are paraphrases given the assignment [a]0 = Amazon and [a]1 = Whole Foods.1 Most approaches focus on binary predicates (predicates with two arguments).
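As a tiny illustration (the representation here is a simplification I made up for this post), a predicate template can be stored as a string with placeholders and instantiated with an argument assignment:

```python
# Hypothetical, minimal representation of predicate templates.
template_1 = "[a]0 acquired [a]1"
template_2 = "[a]0 bought [a]1"
assignment = {"[a]0": "Amazon", "[a]1": "Whole Foods"}

def instantiate(template, assignment):
    """Replace each placeholder with its assigned argument."""
    for placeholder, argument in assignment.items():
        template = template.replace(placeholder, argument)
    return template

print(instantiate(template_1, assignment))  # Amazon acquired Whole Foods
print(instantiate(template_2, assignment))  # Amazon bought Whole Foods
```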
Argument-distribution paraphrasing
This approach relies on a simple assumption: if two predicates have the same meaning, they should normally appear with the same arguments. Here is an example:
In this example, the [a]0 slots in both predicates are expected to contain names of companies that acquired other companies, while the [a]1 slots are expected to contain the acquired companies.
The DIRT method represents each predicate as two vectors: (1) the distribution of words that appeared in its [a]0 argument slot, and (2) the distribution of words that appeared in its [a]1 argument slot. For example, the [a]0 vectors of the predicates in the example will have positive/high values for names of people and names of companies that acquired other companies, and low values for other (small) companies and unrelated words (cat, cookie, ...). To measure the similarity between two predicates, the corresponding vector pairs (the two [a]0 vectors, and the two [a]1 vectors) are compared using a vector similarity measure (e.g. cosine similarity), and a final score averages the per-slot similarities.
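Here is a simplified sketch of this scheme. The counts are made up, and note that the actual DIRT method weights slot fillers by mutual information rather than using raw counts; this version just averages per-slot cosine similarities, as described above:

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Hypothetical slot-filler counts collected from a corpus.
predicates = {
    "[a]0 acquired [a]1": {
        "a0": Counter({"Amazon": 12, "Google": 9, "Disney": 4}),
        "a1": Counter({"Whole Foods": 11, "YouTube": 8, "Pixar": 4}),
    },
    "[a]0 bought [a]1": {
        "a0": Counter({"Amazon": 7, "Google": 6, "Microsoft": 3}),
        "a1": Counter({"Whole Foods": 6, "YouTube": 5, "LinkedIn": 3}),
    },
}

def predicate_similarity(p1, p2):
    """Average the per-slot cosine similarities of two templates."""
    sims = [cosine(predicates[p1][slot], predicates[p2][slot])
            for slot in ("a0", "a1")]
    return sum(sims) / len(sims)

print(predicate_similarity("[a]0 acquired [a]1", "[a]0 bought [a]1"))
```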
Now, while it is true that predicates with the same meaning often share arguments, it is definitely not true that predicates that share a fair amount of their argument instantiations are always paraphrases. A simple counterexample is predicates with opposite meanings, which often tend to appear with similar arguments: for instance, "[stock] rise to [30]" and "[stock] fall to [30]", or "[a]0 acquired [a]1" and "[a]0 sold [a]1" for any [a]0 that once bought an [a]1 and later sold it.
Following this approach, other methods were suggested, such as a method that captures a directional inference relation between predicates (e.g. [a]0 shot [a]1 => [a]0 killed [a]1, but not vice versa) and released a huge resource of such predicate pairs (see the paper), and a method that predicts whether one predicate entails the other given a specific context (see the paper).
Event-based paraphrases
Another good source of paraphrases is multiple descriptions of the same news event, as different news reporters are likely to choose different words to describe the same event. To automatically group news headlines discussing the same story, it is common to cluster them according to publication date and word overlap. Here is an example of some headlines describing the acquisition of Whole Foods by Amazon:
We can stop here and say that all these headlines are sentential paraphrases. However, going a step further: if we've already observed Google to acquire YouTube / Google is buying YouTube as sentential paraphrases in the past (along with many other similar pairs), we can generalize and say that [a]0 to acquire [a]1 and [a]0 is buying [a]1 are predicate paraphrases.
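A minimal sketch of the grouping step, assuming hypothetical headlines and a made-up Jaccard word-overlap threshold:

```python
from collections import defaultdict

def jaccard(s1, s2):
    """Word-overlap similarity between two headlines."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

def group_by_event(headlines, threshold=0.3):
    """headlines: list of (date, text) pairs. Greedily attach each
    headline to the first same-day group it overlaps with, else
    start a new group."""
    groups = defaultdict(list)  # date -> list of groups of headlines
    for date, text in headlines:
        for group in groups[date]:
            if any(jaccard(text, other) >= threshold for other in group):
                group.append(text)
                break
        else:
            groups[date].append([text])
    return groups

headlines = [
    ("2017-06-16", "Amazon to acquire Whole Foods for $13.7 billion"),
    ("2017-06-16", "Amazon is buying Whole Foods"),
    ("2017-06-16", "Heavy rain expected across the city today"),
]
print(dict(group_by_event(headlines)))
# The two Amazon headlines land in one group; the weather one alone.
```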
Early works relying on this approach are 1, 2, followed by some more complex methods like 3. We recently harvested such paraphrases from Twitter, assuming that tweets linking to news websites and published on the same day are likely to describe the same news events. If you're interested in more details, here are the paper, the poster, and the resource.
This approach is potentially more accurate than the argument-distribution approach: the latter assumes that predicates that often occur with the same arguments are paraphrases, while the event-based approach considers predicates with shared arguments as paraphrases only if it believes they discuss the same event.
What does the future hold? Neural paraphrasing methods, of course. I won't go into technical details (I feel that there are enough "neural networks for dummies" blog posts out there, and I'm by no means an expert on the topic). The idea is to build a model that reads a sequence of words and then generates a different sequence of words with the same meaning. If it sounds like inexplicable magic, it is mostly because even the researchers working on this task can at most make educated guesses about why something works well or not. In any case, if this ever ends up working well, it will be much better than the resources we have today, since it will be capable of providing paraphrases, or judging the correctness of paraphrases, for new texts that were never observed before.
1 Of course, given a different choice of arguments, these predicates will not be considered paraphrases. For example, Mary acquired a skill is not a paraphrase of Mary bought a skill. The discussed approaches consider predicate pairs as paraphrases if there exists an argument assignment (/context) under which the predicates are paraphrases. ↩
2 See also more recent work on translation-based paraphrasing. ↩