Probably Approximately a Scientific Blog: 2016

Wednesday, November 23, 2016

Antonymy

In the Seinfeld episode, "the opposite", George says that his life is the opposite of everything he wanted it to be, and that every instinct he has is wrong. He decides to go against his instincts and do the opposite of everything. When the waitress asks him whether to bring him his usual order, "tuna on toast, coleslaw, and a cup of coffee", he decides to have the opposite: "Chicken salad, on rye, untoasted. With a side of potato salad. And a cup of tea!". Jerry argues with him on what's the opposite of tuna, which is according to him, salmon. So which one of them is right? If you ask me, nor salmon nor chicken salad is the opposite of tuna. There is no opposite of tuna. But this funny video demonstrates one of the biggest problems in the task of automatically detecting antonyms: even us humans are terrible at that!

It's a Bird, It's a Plane, It's Superman (not antonyms)

Many people would categorize a pair of words as opposites if they represent two mutually exclusive options/entities in the world, like male and female. black and white, and tuna and salmon. The intuition is clear when these two words x and y represent the only two options in the world. In set theory, it means that y is the negation/complement of x. In other words, everything in the world which is not x, must be y (figure 1).

Figure 1: x and y are the only options in the world U

In this sense, tuna and salmon are not antonyms - they are actually more accurately defined as co-hyponyms: two words that share a common hypernym (fish). They are indeed mutually exclusive, as one cannot be both a tuna and a salmon. However, if you are not a tuna, you are not necessarily a salmon. You can be another type of fish (mackerel, cod...) or something else which is not a fish at all (e.g. person). See figure 2 for a set theory illustration.

Figure 2: salmon and tuna are mutually exclusive, but not the only options in the world

Similarly, George probably had in mind that tuna and chicken salad are mutually exclusive options for sandwich fillings. He was probably right; a tuna-chicken salad sandwich sounds awful. But since there are other options for sandwich fillings (peanut butter, jelly, peanut butter and jelly...), these two can hardly be considered as antonyms, even if we define antonyms as complements within a restricted set of entities in the world (e.g. fish, sandwich fillings). I suggest the "it's a bird, it's a plane, it's superman" binary test for antonymy: if you have more than two options, it's not antonymy!

Wanted Dead or Alive (complementary antonyms)

What about black and white? These are two colors out of a wide range of colors in the world, failing the bird-plane-Superman test. However, if we narrow our world down to people's skin colors, these two may be considered as antonyms.

Other examples for complementary antonyms are day and night, republicans and democrats, dead and alive, true and false, stay and go. As you may have noticed, they can be of different parts of speech (noun, adjective, verb), but the two words within each pair both share the same part of speech (comment if you can think of a negative example!).

Figure 3: Should I stay or should I go now?

So are we cool with complementary antonyms? Well, not quite. If you say that female and male are complementary antonyms, people might tell you that gender is not binary, but a spectrum. Some of these antonyms actually have other, uncommon or hidden options. Like in coma for the dead and alive pair, libertarians in addition to republicans and democrats, etc. Still, these pairs are commonly considered as antonyms, since there are two main options.

So what have we learned about complementary antonyms? That they are borderline, they depend on the context in which they occur, and they might be offensive to minorities. Use them with caution.

The Good, the Bad [and the Ugly?] (graded antonyms)
Even the strictest definition of antonymy includes pairs of gradable adjectives representing the two ends of a scale. Some examples are hot and cold, fat and skinny. young and old, tall and short, happy and sad. Set theory and my binary test aren't suitable for these types of antonyms.

Set theory isn't adequate because a gradable adjective can't be represented as a set, e.g. "the set of all tall people in the world". The definition of a graded adjective changes depending on the context and is very subjective. For example, I'm relatively short, so everyone looks tall to me, while my husband is much taller than me, so he is more likely to say someone is short. The set of tall people in the world changes according to the person who defines it.

In addition, by definition, testing for binarism fails. A cup of coffee can be more than just hot or cold. It can be boiling, very hot, hot, warm, cool, cold or freezing. And we can add more and more discrete options to the scale of coffee temperature.

What makes specific pairs of gradable adjectives into antonyms? While the definition requires that they would be in the ends of the scale, intuitively I would say that they should only be symmetric in the scale, e.g. hot and cold, boiling and freezing, warm and cool, but not hot and freezing.

Antonymy in NLP
While there is a vast linguistics literature about antonyms, I'm less familiar with it, and I'm going to focus on some observations and interesting points about antonymy that appear in NLP papers that I read.

The natural logic formulation ([1]) makes a distinction between "alternation" - words that are mutually exclusive, and "negation" - words that are both mutually exclusive and cover all the options in the world. While I basically claimed in this post that the former is not antonymy, we've seen that in some cases, if the two words represent the two main options, they may be considered as antonyms.

However, people tend to disagree on these borderline word pairs, so sometimes it's easier to conflate them under a more loose definition. For example, [2] had an annotation task in which they asked crowdsourcing workers to choose the semantic relation that holds for a pair of terms. They followed the natural logic relations, but decided to merge "alternation" and "negation" into a weaker notion of "antonyms".

More interesting observations about antonyms, and references to linguistic papers, can be found in [3], [4], and [5].

After we established that humans find it difficult to decide whether two words are antonyms, you must be wondering whether automatic methods can do reasonably well on this task. There has been a lot of work on antonymy identification (see the papers in the references, and their related work sections). I will focus on my little experience with antonyms. We've just published a new paper ([6]) in which we analyze the roles of two main information sources used for automatic identification of semantic relations. The task is defined as follows: given a pair of words (x, y), determine what is the semantic relation that holds between them, if any (e.g. synonymy, hypernymy, antonymy, etc). As in this post, we've used information from x and y's joint occurrences in a large text corpus, as well as information about the separate occurrences of each word x and y. We found that among all the semantic relations we tested, antonymy was almost the hardest to identify (only synonymy was worse).

The use of information about separate occurrences of x and y is based on the distributional hypothesis, which I've mentioned several times in this blog. Basically, if we look at the distribution of neighboring words of a word x, it may tell us something about the meaning of x. If we'd like to know what's the relation between x and y, we can compute something on top of the neighbor distributions of each word. For example, we can expect the distributions of x and y to be similar if x and y are antonyms, since one of the properties of antonyms is that they are interchangeable (a word can be replaced with its antonym and the sentence will remain grammatical and meaningful). Think about replacing tall with short, day with night, etc. The problem is that this is similarly true for synonyms - you can expect high and tall to also appear with similar neighboring words. So basing the classification on distributional information may lead to confusing antonyms with synonyms.

The joint occurrences may help identifying the relation that holds between the words in a pair, as some patterns indicate a certain semantic relation - for instance, "x is a type of y" may indicate that y is a hypernym of x. The problem is that patterns that are indicative of antonymy, such as "either x or y" (either cold or hot) and "x and y" (day and night), may also be indicative of co-hyponymy (either tuna or chicken salad). In any case, this seems far less bad than confusing antonyms with synonyms; in some applications it may suffice to know that x and y are mutually exclusive, with no importance to whether they are antonyms or co-hyponyms. For instance, when you query a search engine, you'd like it to retrieve results including synonyms of your search query (e.g. returning New York City subway map when you search for NYC subway map), but you wouldn't want it to include mutually exclusive words (e.g. Tokyo subway map).

One last thing to remember is that these automatic methods are trained and tested on data collected from humans. If we can't agree on what's considered antonymy, we can't expect these automatic methods to succeed in this any better than we do.

References

[1] Natural Logic for Textual Inference. Bill MacCartney and Christopher D. Manning. RTE 2007.
[2] Adding Semantics to Data-Driven Paraphrasing. Ellie Pavlick, Johan Bos, Malvina Nissim, Charley Beller, Benjamin Van Durme, and Chris Callison-Burch. ACL 2015.

[3] Computing Word-Pair Antonymy. Saif Mohammad, Bonnie Dorr and Graeme Hirst. EMNLP 2008.
[4] Computing Lexical Contrast. Saif Mohammad, Bonnie Dorr, Graeme Hirst, and Peter Turney. CL 2013.
[5] Taking Antonymy Mask off in Vector Space. Enrico Santus, Qin Lu, Alessandro Lenci, Chu-Ren Huang. PACLIC 2014.
[6] Path-based vs. Distributional Information in Recognizing Lexical Semantic Relations. Vered Shwartz and Ido Dagan. CogALex 2016.

Saturday, November 12, 2016

Question Answering

In the my introductory post about NLP I introduced the following survey question: when you search something in Google (or any other search engine of your preference), is your query:
(1) a full question, such as "What is the height of Mount Everest?"
(2) composed of keywords, such as "height Everest"

I never published the results since, as I suspected, there were too few answers to the survey, and they were probably not representative of the entire population. However, my intuition back then was that only older people are likely to search with a grammatical question, while people with some knowledge in technology would use keywords. Since then, my intuition was somewhat supported by (a) this lovely grandma that added "please" and "thank you" to her search queries, and (b) this paper from Yahoo Research that showed that search queries with question intent do not form fully syntactic sentences, but are made of segments (e.g. [height] [Mount Everest]).

Having said that, searching the web to get an answer to a question is not quite the same as actually asking the question and getting a precise answer:

Here's the weird thing about search engines. It was like striking oil in a world that hadn't invented internal combustion. Too much raw material. Nobody knew what to do with it.
— Ex Machina

It's not enough to formulate your question in a way that the search engine will have any chance of retrieving relevant results. Now you need to process the returned documents and search for the answer.

Getting an answer to a question by querying a search engine is not trivial; I guess this is the reason so many people ask questions in social networks, and some other people insult them with Let me Google that for you.

The good news is that there are question answering systems, designed to do exactly that: automatically answer a question given as input; the bad news is that like most semantic applications in NLP, it is an extremely difficult task, with limited success.

Question answering systems have been around since the 1960s. Originally, they were developed to support natural language queries to databases, before web search was available. Later, question answering systems were able to find and extract answers from free text.

A successful example of a question answering system is IBM Watson. Today Watson is described by IBM as "a cognitive technology that can think like a human", and is used in many of IBM's projects, not just for question answering. Originally, it was trained to answer natural logic questions -- or more precisely, to form the correct question to a given answer, as in the television game show Jeopardy. On February 2011, Watson competed in Jeopardy against former winners of the show, and won! It had access to millions of web pages, including Wikipedia, which were processed and saved before the game. During the game, it wasn't connected to the internet (so it couldn't use a search engine, for example). The Jeopardy video is pretty cool, but if you have no patience watching it all (I understand you...), here's a highlight:

HOST: This trusted friend was the first non-dairy powdered creamer. Watson?

WATSON: What is milk?

HOST: No! That wasn’t wrong, that was really wrong, Watson.

Another example is the personal assistants: Apple's Siri, Amazon's Alexa, Microsoft's Cortana, and Google Assistant. They are capable of answering an impressively wide range of questions, but it seems they are often manually designed to answer specific questions.

So how does question answering work? I assume that each question answering system employs a somewhat different architecture, and some of the successful ones are proprietary. I'd like to present two approaches. The first is a general architecture for question answering from the web, and the second is question answering from knowledge bases.

Question answering from the web

I'm following a project report I submitted to a course 3 years ago, in which I exemplified this process on the question "When was Mozart born?". This example was originally taken from some other paper, which is hard to trace now. Apparently, it is a popular example in this field.

The system preforms the following steps:

A possible architecture for a question answering system.

Question analysis - parse the natural language question, and extract some properties:

Question type - mostly, QA systems support factoid questions (a question whose answer is a fact, as in the given example). Other types of questions, e.g. opinion questions, will be discarded at this point.

Answer type - what is the type of the expected answer, e.g. person, location, date (as in the given example), etc. This can be inferred with simple heuristics using the WH-question word, for example who => person, where => location, when => date.

Question subject and object - can be extracted easily by using a dependency parser. These can be used in the next step of building the query. In this example, the subject is Mozart.

Search - prepare the search query, and retrieve documents from the search engine. The query can be an expected answer template (which is obtained by applying some transformation to the question), e.g. "Mozart was born in *". Alternatively, or in case the answer template retrieves no results, the query can consist of keywords (e.g. Mozart, born).

Upon retrieving documents (web pages) that answer the query, the system focuses on certain passages that are more likely to contain the answer ("candidate passages"). These are usually ranked according to the number of query words they contain, their word similarity to the query/question, etc.

Answer extraction - try to extract candidate answers from the candidate passages. This can be done by using named entity recognition (NER) that identifies in the text mentions of people, locations, organizations, dates, etc. Every mention whose entity type corresponds to the expected answer type is a candidate answer. In the given example, any entity recognized as DATE in each candidate passage will be marked as a candidate answer, including "27 January 1756" (the correct answer) and "5 December 1791" (Mozart's death date).

The system may also keep some lists that can be used to answer closed-domain questions, such as "which city [...]" or "which color [...]" that can be answered using a list of cities and a list of colors, respectively. If the system identified that the answer type is color, for example, it will search the candidate passage for items contained in the list of colors. In addition, for "how much" and "how many" questions, regular expressions identifying numbers and measures can be used.

Ranking - assign some score for each candidate answer, rank the candidate answers in descending order according to their scores, and return a list of ranked answers. This phase differs between systems. The simple approach would be to represent an answer by some characteristics (e.g. surrounding words) and learn a supervised classifier to rank the answers.

An alternative approach is to try to "prove" the answer logically. In the first phase, the system creates an expected answer template. In our example it would be "Mozart was born in *". By assigning the candidate answer "27 January 1756" to the expected answer template, we get the hypothesis "Mozart was born in 27 January 1756", which we would like to prove from the candidate passage. Suppose that the candidate passage was "[...] Wolfgang Amadeus Mozart was born in Salzburg, Austria, in January 27, 1756. [...]", a person would know that given the candidate passage, the hypothesis is true, therefore this candidate answer should be ranked high.

To do this automatically, Harabagiu and Hick ([1]) used a textual entailment system: the system receives two texts and determines whether if the first text (text) is true, it means that the second one (hypothesis) is also true. Some of these systems return a number, indicating to what extent this is true. This number can be used for ranking answers.

While this is a pretty cool idea, the unfortunate truth is that textual entailment systems do not perform better than question answering systems, or very good in general. So reducing the question answering problem to that of recognizing textual entailment doesn't really solve question answering.

Question answering from knowledge bases

A knowledge base, such as Freebase/Wikidata and DBPedia, is a large-scale set of facts about the world in a machine-readable format. Entities are related to each other via relations, creating triplets like (Donald Trump, spouse, Melania Trump) and (idiocracy, instance of, film) (no association between the two facts whatsoever ;)). Entities can be people, books and movies, countries, etc. Example relations are birth place, spouse, occupation, instance of, etc. While these facts are saved in a format which is easy for a machine to read, I never heard of a human who searches information in knowledge bases. Which is too bad, since it contains an abundance of information.

So some researchers (e.g. [2], following [3]) came up with the great idea of letting people ask a question in natural language (e.g. "When was Mozart born?"), parsing the question automatically to relate it to a fact in the knowledge base, and answer accordingly.
This reduces the question answering task to understanding the natural language question, whereas querying for the answer from a knowledge base requires no text processing. The task is called executable semantic parsing. The natural language question is mapped into some logic representation, e.g. Lambda calculus. For example, the example question would be parsed to something like λx.DateOfBirth(Mozart, x). The logical form is then executed against a knowledge base; for instance, it would search for a fact such as (Mozart, DateOfBirth, x) and return x.

Despite having the answer appear in a structured format rather than in free text, this task is still considered hard, because parsing a natural language utterance into a logical form is difficult.*

By the way, simply asking Google "When was Mozart born?" seems to take away my argument that "searching the web to get an answer to a question is not quite the same as actually asking the question and getting a precise answer":

Google understands the question and answers precisely.

Only that it doesn't. Google added this feature to its search engine in 2012, in which it presents information boxes above the regular search results, for some queries and questions. They parse the natural language query and try to retrieve results from their huge knowledge base, known as Google knowledge graph. Well, I don't know exactly how they do it, but I guess that similarly to the previous paragraph, their main effort is in parsing and understanding the query, which can then be matched against facts in the graph.

References:

[1] Methods for Using Textual Entailment in Open-Domain Question Answering. Sanda Harabagiu and Andrew Hick. In ACL and COLING 2006.

[2] Semantic Parsing on Freebase from Question-Answer Pairs. Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. In EMNLP 2013.
[3] Learning to parse database queries using inductive logic programming. John M. Zelle and Raymond J. Mooney. In AAAI 1996.

* If you're interested in more details, I recommend going over the materials from the very interesting ESSLLI 2016 course on executable semantic parsing, which was given by Jonathan Berant.

Sunday, August 28, 2016

Crowdsourcing (for NLP)

Developing new methods to solve scientific tasks is cool, but they usually require data. We researchers often find ourselves collecting data rather than trying to solve new problems. I've collected data for most of my papers, but never thought of it as an interesting blog post topic. Recently, I attended Chris Biemann's excellent crowdsourcing course at ESSLLI 2016 (the 28th European Summer School in Logic, Language and Information), and was inspired to write about the topic. This blog post will be much less technical and much more high-level than the course, as my posts usually are. Nevertheless, credit for many interesting insights on the topic goes to Chris Biemann.¹

Who needs data anyway?
So let's start from the beginning: what is this data and why do we need it? Suppose that I'm working on automatic methods to recognize the semantic relation between words, e.g. I want my model to know that cat is a type of animal, and that wheel is a part of a car.

At the very basic level, if I already developed such a method, I will want to check how well it does compared to humans. Evaluation of my method requires annotated data, i.e. a set of word pairs and their corresponding true semantic relations, annotated by humans. This will be the "test set"; the human annotations are considered as "gold/true labels". My model will try to predict the semantic relation between each word-pair (without accessing the true labels). Then, I will use some evaluation metric (e.g. precision, recall, F1 or accuracy) to see how well my model predicted the human annotations. For instance, my model would have 80% accuracy if for 80% of the word-pairs it predicted the same relation as the human annotators.

Figure 1: an example of dataset entries for recognizing the semantic relation between words.

If that was the only data I needed, I would have been lucky. You don't need that many examples to test your method. Therefore, I could select some word-pairs (randomly or using some heuristics), and annotate them myself, or bribe my colleagues with cookies (as I successfully did twice). The problem starts when you need training data, i.e., when you want your model to learn to predict something based on labelled examples. That usually requires many more examples, and annotating data is a very tiring and Sisyphean work.

What should we do, then? Outsource the annotation process -- i.e., pay with real money, not cookies!

What is crowdsourcing?

The word crowdsourcing is a blend word composed of crowd (intelligence) + (out-)sourcing [1]. The idea is to take a task that can be performed by experts (e.g. translating a document from English to Spanish), and outsource it to a large crowd of non-experts (workers) that can perform it.

The requester defines the task, and the workers work on it. The requester than decides whether to accept/reject the work and pays the workers (in case of acceptance).

The benefits of using "regular" people rather than experts are:

You pay them much less than experts - typically a few cents per question (/task). (e.g., [2] found that in translation tasks, the crowd reached the same quality as the professionals, with less than 12% of the costs).
They are more easily available via crowdsourcing platforms (see below).
By letting multiple people work on the task rather than a single/few experts, the task could be completed in a shorter time.

The obvious observation is that the quality of a worker is not as good as the expert; in crowdsourcing, it is not a single worker that replaces the expert, but the crowd. Rather than trusting a single worker, you assign each task to a certain number of workers, and combine their results. A common practice is to use the majority voting. For instance, let's say that I ask 5 workers what is the semantic relation between cat and dog, giving them several options. 3 of them say that cat and dog are mutually exclusive words (e.g. one cannot be both a cat and a dog), one of them says that they are opposites, and one says that cat is a type of dog. The majority has voted in the favor of mutually exclusive, and this is what I will consider as the correct answer.²

The main crowdsourcing platforms (out of many others) are Amazon Mechanical Turk and CrowdFlower. In this blog post I will not discuss the technical details of these platforms. If you are interested in a comparison between the two, refer to these slides from the NAACL 2015 crowdsourcing tutorial.

Figure 2: An example of a question in Amazon Mechanical Turk, from my project.

What can be crowdsourced?

Not every data we need to collect can be collected via crowdsourcing; some data may require expert annotation, e.g. if we need to annotate the syntactic trees of sentences in natural language, that's probably a bad idea to ask non-experts to do so.

The rules of thumb for crowdsourcability are:

The task is easy to explain, and you as a requester indeed explain it simply. They key idea is to keep it simple. The instructions should be short, i.e. do not expect workers to read a 50 page manual. They don't get paid enough for that. The instructions should include examples.
People can easily agree on the "correct" answer, e.g. "is there a cat in this image?" is good, "what is the meaning of life?" is really bad. Everything else is borderline :) One thing to consider is the possible number of correct answers. For instance, if the worker should reply with a sentence (e.g. "describe the following image"), they can do so in so many ways. Always aim one possible answer for a question.
Each question is relatively small.
Bonus: the task is fun. Workers will do better if they enjoy the task. If you can think of a way to gamify your task, do so!

Figure 3: Is there a cat in this image?

Some tasks are borderline and may become suitable for crowdsourcing if presented in the right way to the workers. If the task at hand seems too complicated to be crowdsourced, ask yourself: can I break it into smaller tasks that can each be crowdsourced? For example, let workers write a sentence that describes an image, and accept all answers; then let other workers validate the sentences (ask them: does this sentence really describe this image?).

Some examples for (mostly language-related) data collected with crowdsourcing
(references omitted, but are available in the course slides in the link above).

Checking whether a sentence is grammatical or not.
Alignment of dictionary definitions - for instance, if a word has multiple meanings, and hence has multiple definitions in each dictionary - the task was to align the definitions corresponding to the same meaning in different dictionaries.
Translation.
Paraphrase collection - get multiple sentences with the same meaning. These were obtained by asking multiple workers to describe the same short video.
Duolingo started as a crowdsourcing project!
And so did reCAPTCHA!

How to control for the quality of data?

OK, so we collected a lot of data. How do we even know if it's good? Can I trust my workers to do well on the task? Could they be as good as experts? And what if they just want my money and are cheating on the task just to get easy money?

There are many ways to control for the quality of workers:

The crowdsourcing platforms provide some information about the workers, such as the number of tasks they completed in the past, their approval rate (% of their tasks that were approved), location, etc. You can define your requirements from the workers based on this information.
Don't trust a single worker -- define that your task should be answered by a certain number of workers (typically 5) and aggregate their answers (e.g. by majority voting).
Create control questions - a few questions for which you know the correct answer. These questions are displayed to the worker just like any other questions. If a worker fails to answer too many control questions, the worker is either not good or trying to cheat you. Don't use this worker's answers (and don't let the worker participate in the task anymore; either by rejecting their work or by blocking them).³
Create a qualification test - a few questions for which you know the correct answer. You can require that any worker who wants to work on your task must take the test and pass it. As opposed to the control questions, the test questions don't have to be identical in format to the task itself, but should predict the worker's ability to perform the task well.
Second-pass reviewing - create another task in which workers validate previous workers' answers.
Bonus the good workers - they will want to keep working for you.
Watch out for spammers! Some workers are only after your money, and they don't take your task seriously, e.g. they will click on the same answer for all questions. There is no correlation between the number of questions workers answer and their quality, however, it is worth looking at the most productive workers: some of them may be very good (and you might want to give them bonuses), while some of them may be spammers.

Ethical issues in crowdsourcing

As a requester, you need to make sure you treat your workers properly. Always remember that workers are first of all people. When you consider how much to pay or whether to reject a worker's work, think of the following:

Many workers rely on crowdsourcing as their main income.
They have no job security.
Rejection in some cases is unfair - even if the worker was bad in the task, they still spent time working (unless you are sure that they are cheating).
New workers do lower-paid work to build up their reputation, but underpaying is not fair and not ethical.
Are you sure you explained the task well? Maybe it is your fault if all the workers performed badly?

The good news is that, from my little experience, paying well pays off for the requester too. If you pay enough (but not too much!), you get good workers that want to do the task well. When you underpay, the good workers don't want to work on your task - they can get better paying tasks. The time to complete the task will be longer. And if you are like me, the thought of underpaying your workers will keep you awake at night. So pay well :)⁴

Important take-outs for successful crowdsourcing:

Work in small batches. If you have 10,000 questions, don't publish all at once. Try some, learn from your mistakes, correct them and publish another batch. Mistakes are bound to happen, and they might cost you good money!
Use worker errors to improve instructions (remember: it might be your fault).
KEEP. IT. SIMPLE.
Use quality control mechanisms.
Don't underpay!
Always expect workers to be sloppy. Repeat guidelines and questions and don't expect workers to remember them.
If your questions are automatically generated, use random order and try to balance the number of questions with each expected answer, otherwise workers will exploit this bias (e.g. if most word-pairs are unrelated, they will mark all of them as unrelated without looking twice).
Make workers' lives easier, and they will perform better. For instance, if you have multiple questions regarding the same word, group them together.
If you find a way to make your task more fun, do so!

References
[1] Howe, Jeff. The rise of crowdsourcing. Wired magazine 14.6 (2006).
[2] Omar F. Zaidan and Chris Callison-Burch Crowdsourcing translation: professional quality from non-professionals. In ACL 2011.

1 And I would also like to mention another wonderful crowdsourcing tutorial that I attended last year at NAACL 2015, which was given by Chris Callison-Burch, Lyle Ungar, and Ellie Pavlick. Unfortunately, at that time I had no personal experience with crowdsourcing, nor believed that my university will ever have budget for that, therefore made no effort to remember the technical details; I was completely wrong. A year later I published a paper about a dataset collected with crowdsourcing, on which I even got a best paper award :) ^↩
2 For more sophisticated aggregation methods that assign weights to workers based on their quality, see MACE. ^↩

3 Blocking a worker means that they can't work on your tasks anymore. Rejecting a worker means that they are not paid for the work they have already done. As far as I know, it is not recommended to reject a worker, because then they write bad things about you in Turker Nation and nobody wants to work for you anymore. In addition, you should always give workers the benefit of the doubt; maybe you didn't explain the task well enough.^↩

4 So how much should you pay? First of all, not less than 2 cents. Second, try to estimate how long a single question takes and aim an hourly pay of around 6 USDs. For example, in this paper I paid 5 cents per question, which I've been told is the higher bound for such tasks.^↩

Monday, June 20, 2016

Linguistic Analysis of Texts

Not long ago, Google released their new parser, oddly named Parsey McParseface. For a couple of days, popular media was swamped with announcements about Google solving all AI problems with their new magical software that understands language [e.g. 1, 2].

Well, that's not quite what it does. In this post, I will explain about the different steps applied for analyzing sentence structure. These are usually used as a preprocessing step for higher-level tasks that try understanding the meaning of sentences, e.g. intelligent personal assistants like Siri or Google Now.

The following tools are traditionally used one after the other (also known as the "linguistic annotation/processing pipeline"). Generally speaking, the accuracy of available tools for the tasks in this list is in decreasing order. Some low-level tasks are considered practically solved, while others still have room for improvement.

Sentence splitting - as simple as it sounds: receives a text document/paragraph and returns its partition to sentences. While it sounds like a trivial task -- cut the text on every occurrence of a period -- it is a bit trickier than that; sentences can end with an exclamation / question mark, and periods are also used in acronyms and abbreviations in the middle of the sentence. The simple period rule will fail on this text, for example. Still, sentence splitting is practically considered a solved task, using predefined rules and some learning algorithms. See this for more details.
Tokenization - a tokenizer receives a sentence and splits it to tokens. Tokens are mostly words, but words that are short forms of negation or auxiliaries are split to two tokens, e.g. I'm => I 'm, aren't => are n't.
Stemming / Lemmatization - Words appear in natural language in many forms, for instance, verbs have different tense suffixes (-ing, -ed, -s), nouns have plurality suffixes (s), and adding suffixes to words can sometimes change their grammatical categories, as in nation (noun) => national (adjective) => nationalize (verb).
The goal of both stemmers and lemmatizers is to "normalize" words to their common base form, such as "cats" => "cat", "eating" => "eat". This is useful for many text-processing applications, e.g. if you want to count how many times the word cat appears in the text, you may also want to count the occurrences of cats.
The difference between these two tools is that stemming removes the affixes of a word, to get its stem (root), which is not necessarily a word on its own, as in driving => driv. Lemmatization, on the other hand, analyzes the word morphologically and returns its lemma. A lemma is the form in which a word appears in the dictionary (e.g. singular for nouns as in cats => cat, infinitive for verbs as in driving => drive).
Using a lemmatizer is always preferred, unless there is no accurate lemmatizer for that language, in that case a stemmer is better than nothing.

Part of speech tagging - receives a sentence, and tags each word (token) with its part of speech (POS): noun, verb, adjective, adverb, preposition, etc. For instance, the following sentence: I'm using a part of speech tagger is tagged in Stanford Parser as:
I/PRP 'm/VBP using/VBG a/DT part/NN of/IN speech/NN tagger/NN ./. Which means that I is a personal pronoun, 'm (am) is a verb, non-3rd person singular present, and if you're interested, here's the list to interpret the rest of the tags.
(POS taggers achieve around 97% accuracy).

Syntactic parsing - analyzes the syntactic structure of a sentence, outputting one of two types of parse trees: constituency-based or dependency-based.

Constituency - segments the sentence into syntactic phrases: for instance, in the sentence the brown dog ate dog food, [the brown dog] is a noun phrase, [ate dog food] is a verb phrase, and [dog food] is also a noun phrase.

An example of constituency parse tree, parsed manually by me and visualized using syntax tree generator.

Dependency - connects words in the sentence according to their relationship (subject, modifier, object, etc.). For example, in the sentence the brown dog ate dog food, the word brown is a modifier of the word dog, which is the subject of the sentence. I've mentioned dependency trees in the previous post: I used them to represent the relation that holds between two words, which is a common use.
(Parsey McParseface is a dependency parser. Best dependency parsers achieve around 94% accuracy).

An example of dependency parser output, using Stanford Core NLP.

Other tools, which are less basic, but are often used, include:

Named entity recognition (NER) - receives a text and marks certain words or multi-word expressions in the text with named entity tags, such as PERSON, LOCATION, ORGANIZATION, etc.

An example of NER from Stanford Core NLP.

Coreference resolution - receives a text and connects words that refer to the same entity (called "mentions"). This includes, but not limited to:
pronouns (he, she, I, they, etc.) - I just read Lullaby. It is a great book.
different names / abbreviations - e.g., the beginning of the text mentions Barack Obama, which is later referred to as Obama.
semantic relatedness - e.g. the beginning of the text mentions Apple which is later referred to as the company.

This is actually a tougher task than the previous ones, and accordingly, it achieves less accurate results. In particular, sometimes it is difficult to determine which entity a certain mention refers to (while it's easy for a human to tell): e.g. I told John I don't want Bob to join dinner, because I don't like him. Who does him refer to?
Another thing is that it is very sensitive to context, e.g. in one context apple can be co-referent with the company, while in another, that discusses the fruit, it is not true.
Word sense disambiguation (WSD) - receives a text and decides on the correct sense of each word in the given context. For instance, if we return to the apple example, in the sentence Apple released the new iPhone, the correct sense of apple is the company, while in I ate an apple after lunch the correct sense is the fruit. Most WSD systems use WordNet for the sense inventory.
Entity linking - and in particular, Wikification: receives a text and links entities in the text to the corresponding Wikipedia articles. For instance, in the sentence 1984 is the best book I've ever read, the word 1984 should be linked to https://en.wikipedia.org/wiki/Nineteen_Eighty-Four (rather than to the articles discussing the films / TV shows).
Entity linking can complement word sense disambiguation, since most proper names (as Apple or 1984) are not present in WordNet.
Semantic role labeling (SRL) - receives a sentence and detects the predicates and arguments in the sentence. A predicate is usually a verb, and each verb may have several arguments, such as agent / subject (the person who does the action), theme (the person or thing that undergoes the action), instrument (what was used for doing the action), etc. For instance, in the sentence John baked a cake for Mary, the predicate is bake, and the arguments are agent:John, theme:cake, and goal:Mary. This is not just the final task in my list: it is the task which is the closest to understanding the semantics of a sentence.

Here is an example for (a partial) analysis of the sentence: The brown dog ate dog food, and now he is going to sleep, using Stanford Core NLP:

Analysis of The brown dog ate dog food, and now he is going to sleep, using Stanford Core NLP.

All this effort, and we are not even yet talking about deep understanding of the sentence meaning, but rather analyzing the sentence structure, perhaps as a step toward understanding its meaning. As my previous posts show, it's hard as it is to understand a single word's meaning. In one of my next posts I will describe methods that deal with the semantics of a sentence.

By the way, if you are potential users of these tools, and you are looking for a parser, Google's parser is not the only one available. BIST is more accurate and faster than Parsey McParseface, and spaCy is slightly less accurate, but much faster than both.

Wednesday, May 25, 2016

Improving Hypernymy Detection

From time to time I actually make some progress with my own research, that I think might be interesting or beneficial for others. Now is such a time.* so let me share that with you.

If you've read my blog post about lexical inference, then you should already be familiar with my research goals. I'll explain it shortly: I'm working on automated methods to recognize that a certain term's meaning (word or multi-word expression) can be inferred from another's.

There are several interesting lexical-semantic relations, such as synonymy/equivalence (intelligent/smart, elevator/lift), hypernymy/subclass (parrot/bird, stare/look), meronymy/part-of (spoon/cutlery, London/England), antonymy/opposite (short/tall, boy/girl), and causality (flu/fever, rain/flood).

These relations are interesting, because whenever we encounter one of the terms in a given sentence, we can use our knowledge to infer new facts. For instance, the sentence "I live in London" (wishful thinking...), could be used to infer "I live in England", knowing that London is a part of England. Of course we also need to know something about the sentence itself, because saying that "I left London" doesn't necessarily entail that "I left England". I might have just taken the train to Reading for the Festival :) But this is another line of research which I haven't explored deeply yet, so we'll leave that for another post.

In this particular work, we've focused on the common hypernymy relation between nouns (and noun phrases). We developed a method that given a pair of nouns (x, y) (e.g. (cat, animal), (abbey road, record), (apple, animal)) predicts whether y is a hypernym of x - or in other words, whether x is a subclass of y (e.g. cats are a subclass of animals) or an instance of y (e.g. abbey road is an instance of record). I'll try to keep things simple. If you're interested in more details or the references to other papers, please refer to the paper.

There are two main approaches in the literature of hypernymy detection: path-based and distributional. Path-based methods are very elegant (a matter of taste, I guess). They assume that if y is a hypernym of x, then this relation must be expressed frequently enough when looking at a large text corpus. A pair of words that tends to be connected through patterns such as "X and other Y", "Y such as X", "X is a Y" is likely to hold the hypernymy relation (e.g. cats and other animals, fruit such as apples). To overcome some adjectives and relative clauses that stand in the way of the important information (as in Abbey Road is [the 11th studio] album), a dependency parser is used, outputting the syntactic relation between words in the sentence (e.g. Abbey Road is connected to is and is is connected to album). See the figure below for an example of such a path.

An example of a dependency path between parrot and bird

These paths serve as features for classification. These methods use training data - a list of (x, y) pairs with their corresponding label (e.g. (cat, animal, True), (apple, animal, False)). Each (x, y) pair is represented by all the dependency paths that connected them in the corpus: the feature vector holds an entry for each dependency path in the corpus, and the value is the frequency in which this path connected x and y (e.g. for (cat, animal) how many times "cat is an animal" occur in the corpus, how many times "animals such as cats" occur, etc.). A classifier is trained over these vectors to predict whether y is a hypernym of x.

Though this method works nicely, it suffers from one major limitation: it cannot generalize. If x1 and y1 are mainly connected by the "X is considered as Y" pattern, and x2 and y2 are connected via "X is regarded as Y", they practically share no information. These are considered two different paths. Attempts to generalize such paths by replacing words along the paths with wild-cards (e.g. "X is * as Y") or part-of-speech tags (e.g. "X is VERB as Y") may end up in paths too-general (e.g. "X is denied as Y", which also generalizes to "X is VERB as Y", is a negative path).

In contrast, the distributional approach considers the separate occurrences of x and y in the corpus. It relies on the distributional hypothesis, that I've already mentioned here and here. The main outcome of this hypothesis is that words can be represented using vectors, with similar words (in meaning) sharing similar vectors. In recent years, people have been using these vectors in supervised hypernymy detection. To represent each (x, y) pair as a vector, they somehow combined x and y's vectors (e.g. by concatenating them). They trained a classifier on top of these vectors and predicted for new pairs (e.g. (apple, fruit)) whether y is a hypernym of x. These methods have shown good performance, but it was later found that they tend to overfit to the training data, and are pretty bad in generalizing; for example, if you are trying to predict hypernymy for a new pair (x, y) with rare words x and y that weren't observed in the training data, the prediction will be (only slightly better than) a guess.

To sum up recent work - path-based methods can leverage information about the relation between a pair of words, but they do not generalize well. On the other hand, distributional methods might not recognize the relation between a pair of words, but they contain useful information about each of the words. Since these two approaches are complementary, we combined them!

We started by improving path representation, using a recurrent neural network. I can't possibly explain the technical details without writing a long background post about neural networks first, so I'll skip most of the technical details. I'll just say that this is a machine learning model that processes sequences (e.g. sequence of words, letters, edges, etc), that can, among other things, output a vector representing the entire sequence. In our case, we split the dependency path to edges, and let the network learn a vector representation of the entire path, by going over the edges sequentially. Then, we replace the traditional path features that represent a term-pair, e.g. "X is defined as Y", with the vector representing the path - the path embbeding.

The nice thing about these path embbedings is that -- can you guess? similar paths have similar path embeddings. This happens thanks to two things. First, the network can learn that certain edges are important for detecting hypernymy, while others are not, which may lead to consolidating paths that differ by certain unimportant edges.

Moreover, since neural networks can only work on vectors, the entire information we use is encoded as vectors. For instance, the words along the path (e.g. is, defined, as) are all encoded as vectors. We use word embeddings, so similar words have similar vectors. This results in similar vectors for paths that differ by a a pair of similar words, e.g. "X is defined as Y" and "X is described as Y".

Similar paths having similar vectors is helpful for the classifier. In the paper, we show that our method performed better than the prior methods. Just to give an intuition, let's say for instance that the classifier learned that the path "X company is a Y", which was pretty common in the corpus, indicates hypernymy. And let's say that "X ltd is a Y" only occurred once for a positive (x, y) pair. The previous methods would probably decide that such a path is not indicative of hypernymy, since they don't have enough evidence about it. However, our method recognizes that ltd and company are similar words, yielding similar path vectors for these two paths. If "X company is a Y" is considered indicative, then so does "X ltd is a Y".

At last, we combined the complementary path-based and distributional approaches. To add distributional information to our model (the information on the separate occurrences of each term x and y), we simply added the word embedding vectors of x and y to the model, allowing it to rely on this information as well. With this simple change we achieve significant improvement in performance compared to prior methods in each approach. For so many years people have been saying these two approaches are complementary, and turns out it was actually not too difficult to combine them :)

Paper details
Improving Hypernymy Detection with an Integrated Path-based and Distributional Method. Vered Shwartz, Yoav Goldberg and Ido Dagan. ACL 2016. link

* Now = a few months ago, but the paper was under review and I couldn't publish anything about it.

Tuesday, March 8, 2016

Text Classification

Given a piece of text (document), can software recognize the topic(s) it discusses? If you're not convinced that such a thing could be helpful, let's just start with two facts:

90% of the data in the world today has been created in the last two years [1].
Our attention span is now less than that of a goldfish [2], and we almost never read through an article [3].

These two together lead to sooo much data that might be of interest to you, that you'll never read. If only there were someone who could read articles for you and decide whether they intersect with your topics of interest. Well, automatic text classification can assist you with that.

Representation

As in the last post about word representation, we must first decide how to represent the documents. Intuitively, we want the algorithm to classify a document to a certain topic based on the document's content, as in figure 1. We need the document representation to reflect that.

Figure 1: Two example documents, one (Doc1) about computer science and the other (Doc2) about news.

The simplest and most common approach is the "bag-of-words" approach, in which each document is represented by all its words. Some words may be indicative of one topic or another, e.g. soccer, player, and match might indicate that a document is about sports, while government, prime minister, and war suggest that this is a news article. If you remember the spam filtering example from the supervised learning post, this is exactly the same: some words in the mail message may indicate that it is spam. Spam filtering can be regarded as a text classification task in which each document is a mail message, classified to one of the two topics spam / not spam.

The bag-of-words approach is simple, yet may yield nice results. However, it ignores multi-word expressions (document classification) some of which are non-compositional, i.e. the phrase's meaning is different from the separate word meanings (rock and roll). It also ignores word order, and syntax. Some other methods can consider these features as well. In this post we will stick to the simple bag-of-words approach, which is often good enough for this task.

Methods
Choosing the method in which documents will be classified to topics depends, first of all, on the available data. If we have a sample of labeled documents (documents with known topics), we will prefer supervised classification. In supervised learning, the algorithm is given a training set of labeled instances (e.g. document and its topic), and it tries to learn a model that predicts those labels correctly. This model is later used to predict the label (topic) of unseen instances (documents).

Otherwise, if we only have a bunch of documents without their topics (which is the more common case), we will apply unsupervised classification (clustering). Instead of trying to attach a label (topic) to each instance (document), the algorithm groups together similar documents, which seem to be about the same topic. The output of the clustering process is clusters of documents, each cluster represents a topic.

Supervised Document Classification
Different methods to classify documents differ in the instance representation and the learning algorithms. The general scheme is to represent each document as a feature vector, use a multi-class classifier and feed it the training instances and labels (documents and their topics) to learn a model. Then, given a new unlabeled document, the classifier can predict the document's topic, with some level of success.

Following the bag-of-words approach, we may represent each document as a |V|-dimensional vector, where V is our vocabulary. Each cell represents a word which may or may not appear in the document. There are a few variants of the cells content, here are some common ones:

Binary - 1 if the word occurred in the document, 0 otherwise. It is a set representation of the document's words.
Frequency - the number of times that a word occurred in the document. We can expect that topic prominent words would appear frequently in the topic documents, while other words would occur occasionally.
TF-IDF - I might be skipping a few simpler metrics, but this one is very useful. When we count the frequency of each word, we might end up with some words that are frequent in all documents, regardless of the topic. For instance, stop words and function words (the, it, and, what...) are never indicative of a certain topic. The TF-IDF metric handles this problem by measuring the importance of a term in a certain document, given all the other documents. The TF (Term-Frequency) measure is proportional to the word frequency in the document. The IDF (Inverse-Document-Frequency) decreases if the word is generally frequent in all documents. This way, a word gets a high score if it is relatively non-common in general, but common in the specific document.

Figure 2: The corresponding (partial) bag-of-words vectors for the example documents, with the binary (top) and frequency (bottom) variants.

It is no coincidence this post is following the one about word representations: speaking in word representation terminology, we can now see that the frequency document vector is the sum of one-hot vectors of all the words in the document. Can you see it?
Correspondingly, when working with word embeddings (continuous dense vectors), there is a variant of bag-of-words called CBOW (continuous bag-of-words): it is the sum/average of word vectors for each word in the document. This results in a D-dimensional vector that represents the entire document, where D is the word vectors dimension.

Once we represented each document as a feature vector, we let the classifier train over the feature vectors and corresponding labels. It may notice that feature vectors with high values in the cells of soccer, player, and match tend to occur with the label sports, for example.

There is a broad choice of algorithms, some may perform better than others. Roughly, they all try to do the same thing: the multi-class classifier learns a weight for each feature and each label. In our text classification task, it learns the salience of each word in the vocabulary for each topic. For example, soccer, player, and match will be assigned high weights and government and prime minister will be assigned low (maybe negative) weights for the sports label, and the opposite will occur for the news label. When the classifier is trained, the objective is to learn the weights that maximize the accuracy of the training set, i.e., classifying as many documents as possible to the correct topic.

For you coders, here is the simplest proof of concept. Other people may skip the code. This is a Python script, that works with scikit-learn, a machine learning package for Python. I used a subset of the topics in the 20 newsgroup dataset. I trained a simple logistic regression classifier, removing stop words and punctuation, and representing each document using CBOW with GloVe word embeddings (my personal favorite). This yields 83% F1 score, which has much potential of improvement, but it's still nice with such little effort.

document_classification.py

document_classification.py


1    import sys 
2    import nltk 
3    import string 
4    import codecs 
5     
6    from sklearn.datasets import fetch_20newsgroups 
7    from sklearn.metrics import precision_recall_fscore_support 
8    from sklearn.linear_model import LogisticRegression 
9     
10   import numpy as np 
11    
12    
13   def main(): 
14    
15       # Load the word vectors 
16       words, wv = load_embeddings('glove.6B.50d.txt') 
17       word_to_num = { word : i for i, word in enumerate(words) } 
18    
19       # Load the stop words 
20       with codecs.open('English_stop_words.txt', 'r', 'utf-8') as f_in: 
21           stop_words = set([line.strip() for line in f_in]) 
22    
23       # Load the datasets 
24       topics = ['talk.politics.guns', 'soc.religion.christian', 
25                 'comp.windows.x', 'rec.sport.baseball', 'sci.med'] 
26       newsgroups_train = fetch_20newsgroups(subset='train', 
27                                             remove=('headers', 
28                                                     'footers', 
29                                                     'quotes'), 
30                                             categories=topics) 
31       y_train = list(newsgroups_train.target) 
32       newsgroups_test = fetch_20newsgroups(subset='test', 
33                                            remove=('headers', 
34                                                    'footers', 
35                                                    'quotes'), 
36                                            categories=topics) 
37       y_test = list(newsgroups_test.target) 
38    
39       # Create the feature vectors 
40       X_train = create_doc_vectors(newsgroups_train.data, 
41                                    word_to_num, wv, stop_words) 
42       X_test = create_doc_vectors(newsgroups_test.data, 
43                                   word_to_num, wv, stop_words) 
44    
45       # Create the classifier 
46       classifier = LogisticRegression() 
47    
48       # Train the classifier 
49       classifier.fit(X_train, y_train) 
50    
51       # Predict the topics of the test set and compute 
52       # the evaluation metrics 
53       y_pred = classifier.predict(X_test) 
54       precision, recall, f1, support = \ 
55           precision_recall_fscore_support(y_test, y_pred, 
56                                           average='weighted') 
57    
58       print 'Precision: %.02f%%, Recall: %.02f%%, F1: %.02f%%' \ 
59             % (precision * 100, recall * 100, f1 * 100) 
60    
61    
62   def create_doc_vectors(data, word_to_num, wv, stop_words): 
63       """ 
64       Create a matrix in which each row is a document, 
65       and each document is represented as CBOW 
66       """ 
67       doc_vecs = [] 
68    
69       for doc in data: 
70           tokens = nltk.word_tokenize(doc.lower()) 
71           tokens = [w for w in tokens 
72                     if w not in string.punctuation 
73                     and w not in stop_words] 
74           tokens = [word_to_num.get(w, -1) for w in tokens] 
75           doc_vector = [wv[w] for w in tokens if w > -1] 
76           doc_vector = np.mean(np.vstack(doc_vector), axis=0) \ 
77               if len(doc_vector) > 0 else np.zeros((1, 50)) 
78           doc_vecs.append(doc_vector) 
79    
80       instances = np.vstack(doc_vecs) 
81       return instances 
82    
83    
84   def load_embeddings(embedding_file): 
85       """ 
86       Load the pre-trained embeddings from a file 
87       """ 
88       with codecs.open(embedding_file, 'r', 'utf-8') as f_in: 
89           words, vectors = \ 
90               zip(*[line.strip().split(' ', 1) for line in f_in]) 
91       vectors = np.loadtxt(vectors) 
92    
93       # Normalize each row (word vector) in the matrix to sum-up to 1 
94       row_norm = np.sum(np.abs(vectors)**2, axis=-1)**(1./2) 
95       vectors /= row_norm[:, np.newaxis] 
96    
97       return words, vectors 
98    
99    
100  if __name__ == '__main__': 
101      main() 
102

As an example, I printed one of the documents and the topic that the classifier assigned to it:

 I really think that this is the key point. When I saw the incident on Baseball Tonight Sunday, I couldn't believe how far away from the plate Gant went. Then he casually leaned against his bat.  I don't blame the umpire at all for telling the pitcher to pitch. The worst part of the whole incident was the Braves coming out onto the field.  What were they going to do, attack the umpire? The only people who should've been out there were Cox and maybe the coaches, but NO players. I agree with the person who posted before that Cox should be suspended for having no control over his team.

rec.sport.baseball

I'd say it's an easy one, since it specifically mentions "baseball". However, nothing should be taken for granted in NLP! Anything that works is a miracle :)

Unsupervised Document Classification
In some cases, we don't have labeled data, so we can't learn characteristics of instances with specific labels, e.g., words that tend to occur in documents about politics. Instead, we can find common characteristics between documents about the same (unknown) topic, and group documents from the (seemingly) same topic together in one cluster.

Then, when a new document is presented, we can assign it to the most suitable cluster, based, for example, on how similar it is to the documents in each cluster. This has several purposes; first of all, we can let someone look at a few documents from each cluster and infer the topic represented by the cluster. We can also take the most common words in this cluster, automatically, and use them as "tags" that describe the topic. Another usage can be recommending a document to someone who read other documents in this cluster (assuming that they are interested in the cluster's topic).

While we don't have the true labels (topics) of the training instances (documents), we assume that each document has a certain topic, which is a hidden variable in our model (as opposed to the documents themselves, which are observed). One clustering approach is through generative models: we assume that a model generated our existing documents, and we use an algorithm that tries to reconstruct the parameters of the model, in a way the best explains our observed data.

The assumption of the generative model is that our training data was generated through probabilities, that we would like to reconstruct. The simpler model, called mixture of histograms, assumes that each document has one topic. A document was generated as follows:

The document's topic c was drawn from the topics' distribution (probability function) P(C).
For example, if we have 3 topics, news, sports and music, with the probabilities 0.5, 0.3, 0.2 respectively, then in half of the cases we are expected to "generate" a document about news.
Given the topic c, the words in the document were sampled from a distribution of words given the topic, P(w|c). For example, if the topic is news, the probability of each word w in the vocabulary in the news topic is P(w|news). Since there are many words in the vocabulary, the probability for each of them is quite small (because the probability of all words sums up to 1). Anyhow, words that are likely to appear in news documents, such as war and report will have higher probabilities, e.g. P(war|news) > P(football|news). When we sample words for the generated news document, we will get mostly words discussing news.

Figure 3: An illustration of the probability distribution of word given a topic.

The goal of the algorithm is to learn the probability functions P(c), and P(w|c) for each topic, given solely the documents themselves. These probabilities are estimated using an iterative algorithm called EM (expectation maximization). Since the topic of each document is unknown (hidden), you should first decide on the number of topics (how fine-grained would you like the clustering to be? should you distinguish between football and basketball or is sports enough?). Then, start with a random guess of the probabilities. The algorithm works in iterations, improving the probabilities at each iteration. Each iteration is made of two steps:

Assign a topic to each document based on the current probabilities.
Given the documents' topic computed at the previous step, re-estimate the probabilities using relative frequency (e.g., the probability of news is the ratio between number of the documents assigned this topic, and the total number of documents).

This algorithm works nicely and it can be used to solve other problems with hidden variables. In the end, each document is assigned to a certain (meaningless) topic. As I mentioned before, the meaning of this cluster of documents can be understood by looking at the common features of several documents in the cluster (e.g. do they all discuss music?).

That's it about text classification for now. Now, can you code something that automatically infers the topic of this blog post? :)

There is so much more that I didn't cover in this post, because I don't want to exhaust anyone. If you're interested in reading more, I recommend:

The difference between generative and discriminative models - it's a general machine learning topic, not specific to document classification. In this post I gave an example of a supervised discriminative model and an unsupervised generative model, but:

The k-means algorithm is an unsupervised discriminative model that can be used for text classification.
Naïve Bayes is a supervised generative classifier that can be used for text classification.

Document classification can also be used for language identification.
And you can't write a post about text classification without mentioning LDA. It was a bit too complex for this post, but here, I mentioned it.