Monday, June 20, 2016

Linguistic Analysis of Texts

Not long ago, Google released their new parser, oddly named Parsey McParseface. For a couple of days, popular media was swamped with announcements about Google solving all AI problems with their new magical software that understands language [e.g. 1, 2].

Well, that's not quite what it does. In this post, I will explain about the different steps applied for analyzing sentence structure. These are usually used as a preprocessing step for higher-level tasks that try understanding the meaning of sentences, e.g. intelligent personal assistants like Siri or Google Now.

The following tools are traditionally used one after the other (also known as the "linguistic annotation/processing pipeline"). Generally speaking, the accuracy of available tools for the tasks in this list is in decreasing order. Some low-level tasks are considered practically solved, while others still have room for improvement.
  1. Sentence splitting - as simple as it sounds: receives a text document/paragraph and returns its partition to sentences. While it sounds like a trivial task -- cut the text on every occurrence of a period -- it is a bit trickier than that; sentences can end with an exclamation / question mark, and periods are also used in acronyms and abbreviations in the middle of the sentence. The simple period rule will fail on this text, for example. Still, sentence splitting is practically considered a solved task, using predefined rules and some learning algorithms. See this for more details.

  2. Tokenization - a tokenizer receives a sentence and splits it to tokens. Tokens are mostly words, but words that are short forms of negation or auxiliaries are split to two tokens, e.g. I'm => I 'maren't => are n't.

  3. Stemming / Lemmatization - Words appear in natural language in many forms, for instance, verbs have different tense suffixes (-ing, -ed, -s), nouns have plurality suffixes (s), and adding suffixes to words can sometimes change their grammatical categories, as in nation (noun) => national (adjective) => nationalize (verb).
    The goal of both stemmers and lemmatizers is to "normalize" words to their common base form, such as "cats" => "cat", "eating" => "eat". This is useful for many text-processing applications, e.g. if you want to count how many times the word cat appears in the text, you may also want to count the occurrences of cats.
    The difference between these two tools is that stemming removes the affixes of a word, to get its stem (root), which is not necessarily a word on its own, as in driving => drivLemmatization, on the other hand, analyzes the word morphologically and returns its lemma. A lemma is the form in which a word appears in the dictionary (e.g. singular for nouns as in cats => cat, infinitive for verbs as in driving => drive).
    Using a lemmatizer is always preferred, unless there is no accurate lemmatizer for that language, in that case a stemmer is better than nothing.

  4. Part of speech tagging - receives a sentence, and tags each word (token) with its part of speech (POS): noun, verb, adjective, adverb, preposition, etc. For instance, the following sentence: I'm using a part of speech tagger is tagged in Stanford Parser as:
    I/PRP 'm/VBP using/VBG a/DT part/NN of/IN speech/NN tagger/NN ./. Which means that I is a personal pronoun, 'm (am) is a verb, non-3rd person singular present, and if you're interested, here's the list to interpret the rest of the tags.
    (POS taggers achieve around 97% accuracy).

  5. Syntactic parsing - analyzes the syntactic structure of a sentence, outputting one of two types of parse trees: constituency-based or dependency-based.

    Constituency - segments the sentence into syntactic phrases: for instance, in the sentence the brown dog ate dog food, [the brown dog] is a noun phrase, [ate dog food] is a verb phrase, and [dog food] is also a noun phrase.
    An example of constituency parse tree, parsed manually by me and visualized using syntax tree generator.
    Dependency - connects words in the sentence according to their relationship (subject, modifier, object, etc.). For example, in the sentence the brown dog ate dog food, the word brown is a modifier of the word dog, which is the subject of the sentence. I've mentioned dependency trees in the previous post: I used them to represent the relation that holds between two words, which is a common use.
    (Parsey McParseface is a dependency parser. Best dependency parsers achieve around 94% accuracy).

An example of dependency parser output, using Stanford Core NLP.

Other tools, which are less basic, but are often used, include:
  • Named entity recognition (NER) - receives a text and marks certain words or multi-word expressions in the text with named entity tags, such as PERSON, LOCATION, ORGANIZATION, etc. 
    An example of NER from Stanford Core NLP.

  • Coreference resolution - receives a text and connects words that refer to the same entity (called "mentions"). This includes, but not limited to:
    pronouns (he, she, I, they, etc.) - I just read Lullaby. It is a great book.
    different names / abbreviations - e.g., the beginning of the text mentions Barack Obama, which is later referred to as Obama.
    semantic relatedness - e.g. the beginning of the text mentions Apple which is later referred to as the company.

    This is actually a tougher task than the previous ones, and accordingly, it achieves less accurate results. In particular, sometimes it is difficult to determine which entity a certain mention refers to (while it's easy for a human to tell): e.g. I told John I don't want Bob to join dinner, because I don't like him. Who does him refer to?
    Another thing is that it is very sensitive to context, e.g. in one context apple can be co-referent with the company, while in another, that discusses the fruit, it is not true.

  • Word sense disambiguation (WSD) - receives a text and decides on the correct sense of each word in the given context. For instance, if we return to the apple example, in the sentence Apple released the new iPhone, the correct sense of apple is the company, while in I ate an apple after lunch the correct sense is the fruit. Most WSD systems use WordNet for the sense inventory.

  • Entity linking - and in particular, Wikification: receives a text and links entities in the text to the corresponding Wikipedia articles. For instance, in the sentence 1984 is the best book I've ever read, the word 1984 should be linked to (rather than to the articles discussing the films / TV shows).
    Entity linking can complement word sense disambiguation, since most proper names (as Apple or 1984) are not present in WordNet.

  • Semantic role labeling (SRL) - receives a sentence and detects the predicates and arguments in the sentence. A predicate is usually a verb, and each verb may have several arguments, such as agent / subject (the person who does the action), theme (the person or thing that undergoes the action), instrument (what was used for doing the action), etc. For instance, in the sentence John baked a cake for Mary, the predicate is bake, and the arguments are agent:John, theme:cake, and goal:Mary. This is not just the final task in my list: it is the task which is the closest to understanding the semantics of a sentence.

Here is an example for (a partial) analysis of the sentence: The brown dog ate dog food, and now he is going to sleepusing Stanford Core NLP:

Analysis of The brown dog ate dog food, and now he is going to sleep, using Stanford Core NLP.
All this effort, and we are not even yet talking about deep understanding of the sentence meaning, but rather analyzing the sentence structure, perhaps as a step toward understanding its meaning. As my previous posts show, it's hard as it is to understand a single word's meaning. In one of my next posts I will describe methods that deal with the semantics of a sentence.

By the way, if you are potential users of these tools, and you are looking for a parser, Google's parser is not the only one available. BIST is more accurate and faster than Parsey McParseface, and spaCy is slightly less accurate, but much faster than both.