tag:blogger.com,1999:blog-91451206782901951312024-03-16T16:51:42.528+02:00Probably Approximately a Scientific BlogHuman-interpretable computer science and other ramblingsVered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.comBlogger27125tag:blogger.com,1999:blog-9145120678290195131.post-23365820176342609202021-10-05T03:52:00.003+03:002021-10-06T19:45:35.986+03:00Interpretation of Time Expressions in Different Cultures <p style="text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;"><i>This is a section from a nonfiction book I'm writing (...I guess this public announcement will now pressure me to finish it and find a publisher </i>😬<i>). Thanks </i></span><span style="font-family: Arial; font-style: italic;"><span style="white-space: pre-wrap;">Anna Pryslopska for initiating the <a href="https://twitter.com/anna_pryslopska/status/1444923131345448961?s=20">interesting Twitter discussion</a>!</span></span></p><p><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">It's common knowledge that Americans use the unique date format mm-dd-yyyy, in which the month appears before the day, unlike the rest of the world. 
There is no clear answer to why that is, but <a href="https://iso.mit.edu/americanisms/date-format-in-the-united-states/">some hypothesize</a> that this format was used in the UK, brought to the US by the Brits, and later changed in the UK to the European format dd-mm-yyyy.</span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> This is certainly a cause for confusion, which is almost inevitable when a date signifies July 4th for one person and April 7th for another. With that said, it may also be useful in some very specific scenarios. A young colleague, who travelled to a conference in the US a few weeks before his 21st birthday on July 4th, successfully purchased alcohol by misleading the bartender into thinking that his birthday, which was printed on a non-US passport, was April 7th. </span></p><span id="docs-internal-guid-202ee0de-7fff-321d-5fb1-9e3b32413f77"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">I was more surprised to learn that Americans use a 12-hour clock (with AM and PM distinctions as needed) rather than the 24-hour clock. I think part of the reason this was less noticeable is that both clocks are acceptable outside the US. I didn't notice until, during a conference, I texted an American friend that I would meet him at 18:00 next to the escalator, which made him chuckle and inform me that I was using "military time". 
</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">It was the same friend with whom, only a couple of years earlier, I was supposed to go sightseeing on the day after the conference ended; in the morning he texted me that we could meet in the afternoon. I was surprised, because my interpretation of "afternoon", based on the norm associated with its literal translation to Hebrew, was around 4 or 5 pm, which seemed quite a late hour to begin sightseeing. He, of course, literally meant any time after 12 pm, which is a reasonable time to leave your hotel room after a week of exhausting conferencing. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Indeed, people vary in how they interpret time expressions. <a href="https://direct.mit.edu/coli/article/28/4/545/1785/Human-Variation-and-Lexical-Choice">A 2002 study by the University of Aberdeen</a> analyzed human-written weather forecasts along with the weather data they described.</span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> The study found significant individual differences between forecasters in the interpretation of some time phrases, such as "by evening", but full agreement on other expressions, such as "midday". 
</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">I couldn't find any study on cultural differences in the interpretation of time expressions, so I conducted my own. I built a very simple survey with the following questions:</span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /><br /></span></p><ol style="margin-bottom: 0px; margin-top: 0px; padding-inline-start: 48px;"><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: decimal; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Where are you from, or where have you lived most of your life?</span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: decimal; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">What is the range of time you consider as </span><span 
style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">morning</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> (or the equivalent of </span><span style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">morning</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> in your native language)?</span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: decimal; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">What is the range of time you consider as </span><span style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">noon</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> (or the equivalent of </span><span style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">noon</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> in your native language)?</span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; 
font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: decimal; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">What is the range of time you consider as </span><span style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">afternoon</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> (or the equivalent of </span><span style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">afternoon</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> in your native language)?</span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: decimal; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">What is the range of time you consider as </span><span style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">evening</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> (or the equivalent of </span><span 
style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">evening</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> in your native language)?</span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: decimal; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">What is the range of time you consider as </span><span style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">night</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> (or the equivalent of </span><span style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">night</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> in your native language)?</span></p></li></ol><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">I published the survey on Amazon Mechanical Turk, a crowdsourcing platform that enables recruiting workers to perform discrete tasks. 
To get answers from a range of countries, I published several batches of questionnaires, each time limiting them to workers from specific regions of the world.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Before I dive into the results, I would like to point out that this study was not conducted with my usual level of scientific rigour, mostly for budgetary reasons. To put it less subtly: I did not conduct this experiment for my work, so I couldn't pay for it with my research budget; I paid with my own money, and I went cheap. Because my budget was limited, I collected only 349 answers, which means that for most countries I collected only a handful of answers or no answers at all, so conclusions about those countries have weaker statistical support. Moreover, some countries have many more Mechanical Turk workers than others, so I ended up collecting a very uneven number of responses from each country. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In addition, I live in the Pacific time zone (PST), and the time I published the batches affected the country distribution of the workers who responded to the survey. For example, I thoughtlessly published the North American survey at 4 pm PST on a Friday, which likely meant most of the answers came from people living on the West Coast. People living in countries where it was nighttime when the survey was made available were either underrepresented or, worse, distorted the data. 
Think of a person answering a survey at 2 am their time: do you really trust them as a representative of their culture with respect to time? Go to sleep, dude. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">So, if you would still like to discover the results of my very unscientific study, here they are. This was the country distribution:</span></p><br /><table border="1" cellpadding="0" cellspacing="0" dir="ltr" style="border-collapse: collapse; border: none; font-family: Arial; font-size: 10pt; table-layout: fixed; width: 0px;" xmlns="http://www.w3.org/1999/xhtml"><colgroup><col width="100"></col><col width="100"></col></colgroup><tbody><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"United States of America"}" style="border: 1px solid rgb(0, 0, 0); overflow: hidden; padding: 2px 3px; vertical-align: bottom;">United States of America</td><td data-sheets-value="{"1":3,"3":103}" style="border-color: rgb(0, 0, 0) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">103</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"India"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">India</td><td data-sheets-value="{"1":3,"3":83}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; 
vertical-align: bottom;">83</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Brazil"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Brazil</td><td data-sheets-value="{"1":3,"3":48}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">48</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Italy"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Italy</td><td data-sheets-value="{"1":3,"3":37}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">37</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"United Kingdom"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">United Kingdom</td><td data-sheets-value="{"1":3,"3":19}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">19</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Spain"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Spain</td><td 
data-sheets-value="{"1":3,"3":8}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">8</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"France"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">France</td><td data-sheets-value="{"1":3,"3":7}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">7</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Philippines"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Philippines</td><td data-sheets-value="{"1":3,"3":4}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">4</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Canada"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Canada</td><td data-sheets-value="{"1":3,"3":3}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">3</td></tr><tr style="height: 21px;"><td 
data-sheets-value="{"1":2,"2":"Australia"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Australia</td><td data-sheets-value="{"1":3,"3":3}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">3</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"American Samoa"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">American Samoa</td><td data-sheets-value="{"1":3,"3":2}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">2</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Israel"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Israel</td><td data-sheets-value="{"1":3,"3":2}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">2</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Macedonia"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Macedonia</td><td data-sheets-value="{"1":3,"3":2}" style="border-color: rgb(204, 204, 
204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">2</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Ireland"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Ireland</td><td data-sheets-value="{"1":3,"3":2}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">2</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Germany"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Germany</td><td data-sheets-value="{"1":3,"3":2}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">2</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Greece"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Greece</td><td data-sheets-value="{"1":3,"3":2}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">2</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Romania"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 
0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Romania</td><td data-sheets-value="{"1":3,"3":2}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">2</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Taiwan"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Taiwan</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Saudi Arabia"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Saudi Arabia</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Thailand"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Thailand</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; 
overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Hong Kong"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Hong Kong</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Austria"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Austria</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Andorra"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Andorra</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Barbados"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; 
vertical-align: bottom;">Barbados</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Azerbaijan"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Azerbaijan</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Equatorial Guinea"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Equatorial Guinea</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Anguilla"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Anguilla</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: 
bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Ethiopia"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Ethiopia</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Netherlands"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Netherlands</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Malta"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Malta</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Poland"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Poland</td><td data-sheets-value="{"1":3,"3":1}" 
style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Sri Lanka"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Sri Lanka</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Belgium"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Belgium</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Lithuania"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Lithuania</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Sweden"}" style="border-color: 
rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Sweden</td><td data-sheets-value="{&quot;1&quot;:3,&quot;3&quot;:1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;Pakistan&quot;}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Pakistan</td><td data-sheets-value="{&quot;1&quot;:3,&quot;3&quot;:1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;Singapore&quot;}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Singapore</td><td data-sheets-value="{&quot;1&quot;:3,&quot;3&quot;:1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr></tbody></table><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Since the US is dominant in the survey, let's first analyze the results from participants in the US. 
The following figure presents the average start and end times for each time expression, along with error bars marking the standard deviation, which measures the dispersion of the data relative to the average. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 231px; overflow: hidden; width: 602px;"><img height="231" src="https://lh6.googleusercontent.com/9XRXM8faauL4GNoeLKkUQiyIVNGGjbTsa28JEOE-18l8LRj_GdUtTZfBVG1pyDhSrw8AQPAhw0kfcSak0kLZYQ6Lsho3TYWfzhv0wcVwNVPjX3dxhBeMWru-RkqUEKavQ8CqkqNa=s0" style="margin-left: 0px; margin-top: 0px;" width="602" /></span></span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Americans considered morning on average to span from 4:45 to 11:27 am, noon from 11:47 am to 12:41 pm, afternoon from 1:06 pm to 4:27 pm, evening from 4:27 to 7:04 pm, and night from 7:19 pm to 9:30 am. If you're wondering about the apparent contradiction between the early morning start time (4:45 am) and the late night end time (9:30 am), the error bars can explain this discrepancy. The night end time data had the largest standard deviation, with many outliers, such as people who considered the night to end at 11:59 pm. A more informative statistic that is less sensitive to outliers is the median: the time such that half of the respondents chose it or an earlier time as the end of the night, and half chose it or a later time. The median was much earlier, at 5:45 am. 
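The robustness of the median to outliers is easy to demonstrate in a few lines of Python. This is a minimal sketch with made-up "end of night" answers (not the actual survey data), using only the standard library:

```python
from statistics import mean, median

def to_minutes(hhmm: str) -> int:
    """Convert 'HH:MM' (24-hour clock) to minutes past midnight."""
    h, m = map(int, hhmm.split(":"))
    return h * 60 + m

def to_hhmm(minutes: float) -> str:
    """Convert minutes past midnight back to 'HH:MM'."""
    total = round(minutes) % (24 * 60)
    return f"{total // 60:02d}:{total % 60:02d}"

# Hypothetical "end of night" answers; 23:59 plays the role of the
# outliers observed in the survey.
answers = ["05:30", "05:45", "06:00", "06:30", "07:00", "23:59"]
mins = [to_minutes(a) for a in answers]

print("mean:  ", to_hhmm(mean(mins)))    # 09:07 -- dragged late by the outlier
print("median:", to_hhmm(median(mins)))  # 06:15 -- barely affected
```

Remove the single 23:59 answer and the mean drops to 6:09, right next to the median; one extreme respondent is enough to shift the average by hours.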
</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">At this point, Iâve already empirically shown that Americans indeed consider ânoonâ as a very narrow time slot around 12 pm, although a small number of them were extremely early risers for whom 10 am already feels like noon, and some considered noon to end as late as 2 pm. Another observation that stands out for me here is the early evening beginning. It explains the early US dinner. If the evening starts at 4:30 pm, âThe Cadillacâ episode in season 7 of Seinfeld seems slightly less crazy. In this episode, Jerry visits his retired parents in Florida. They are getting ready to go to dinner at 4:30, to make it to the early-bird rate. Jerry says he canât âforce feed himself a steak at 4:30â and convinces them to wait for the regular priced dinner at 6. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Even if you treat the retiree population in Florida as an outlier, Dinner in the US is eaten rather early, around 6pm. Iâve had work dinners at 5:30pm as well. </span><span style="font-family: Arial; font-size: 11pt; white-space: pre-wrap;">Iâve heard about restaurants that are so busy that you must book a table for dinner⊠unless you are willing to eat as late as 8 pm. Needless to say, 8 pm seems like a perfectly good time for dinner to me. Iâve often used âdinner timeâ as an example for temporal commonsense, e.g. âdinner is typically eaten at around 8 pmâ. But giving it a second thought, I realize this is rather culture-specific. 
On trips to some countries in Europe, we wandered around hungry at 9 pm, unable to find anywhere to eat because all the restaurants were already closed. In other countries, such as Spain, it's customary to eat very late. </span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">What makes the dinner time convention more confusing is that the meaning of the word dinner is not exactly "the evening meal". Today, people typically use "dinner" and "supper" interchangeably to refer to the last meal of the day. However, Merriam-Webster classifies supper as a lighter meal, or "the evening meal especially when dinner is taken at midday", while dinner is "the principal meal of the day" regardless of its time. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In 2019, my birthday happened to fall on Thanksgiving. We tried to book a table in a restaurant for dinner. The options were limited because many restaurants were closed for the holiday and others only served Thanksgiving dinner. I don't eat chicken, nor am I a fan of holiday food (blatantly generalizing from my experience with Jewish holidays). By the time we found a restaurant that served its usual menu, they had no available tables for dinner. Right after hanging up the phone, I had second thoughts about the way I had phrased the question. 
I called again and asked whether they had available tables at 8 pm. They did. We had a great meal. It was only intuition that made me recheck, but when I dug deeper into this, I learned the difference between dinner and supper, and found out that Thanksgiving dinner is often eaten at around 2 to 4 pm, hours that I would consider lunch time. </span><a href="https://aclanthology.org/P19-1388/" style="font-family: Arial; font-size: 14.6667px; white-space: pre-wrap;">This ACL 2019 paper</a>, <span style="font-family: Arial; font-size: 14.6667px; white-space: pre-wrap;">in which textual mentions and their corresponding grounded values were automatically extracted from a large English text corpus</span><span style="font-family: Arial; font-size: 14.6667px; white-space: pre-wrap;">,</span><span style="font-family: Arial; font-size: 14.6667px; white-space: pre-wrap;"> also supports this observation. In a figure showing the </span><span style="font-family: Arial; font-size: 14.6667px; white-space: pre-wrap;">time of day at which meals are typically eaten, dinner seemed to start, according to some people, as early as 1 pm.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Before all this talk about dinner makes me hungry, I will get back to the survey results. So how is the US different from the rest of the world? We don't have enough data for a fine-grained country-by-country analysis, but we can group countries by continent, for example looking at all the answers from Europe. 
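Grouping answers by continent amounts to a simple group-then-aggregate step. Here is a minimal sketch; the data layout and the numbers are hypothetical, not the real survey responses:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (country, continent, start of "evening" in minutes past midnight).
responses = [
    ("US",      "North America",    16 * 60 + 30),
    ("US",      "North America",    17 * 60),
    ("Germany", "Europe",           18 * 60),
    ("Spain",   "Europe",           19 * 60),
    ("India",   "Asia and Pacific", 14 * 60 + 30),
]

# Group the answers by continent, ignoring the country.
by_continent = defaultdict(list)
for _country, continent, start in responses:
    by_continent[continent].append(start)

# Aggregate each group into an average start time.
for continent, starts in sorted(by_continent.items()):
    avg = int(mean(starts))
    print(f"{continent}: evening starts on average at {avg // 60:02d}:{avg % 60:02d}")
```

The same grouping works for any region granularity (country, continent, or US state), as long as each response carries that field.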
</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 231px; overflow: hidden; width: 602px;"><img height="231" src="https://lh3.googleusercontent.com/DDZVWV2LtsnFEQfav9yfUQ14BreYB_eejgmGkFSOo3zUO_nLMXhWv4PfvVgvLFBuPsIICHAdAGYpkE33cPfI2tayaPVU8AGgfAM8bwEoY_AmIlAcJpcgSnIvCOHznOZuaAgXIMQ7=s0" style="margin-left: 0px; margin-top: 0px;" width="602" /></span></span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In Europe, the average morning was between 5:19 and 11:08 am, noon between 11:47 am and 1:30 pm, afternoon between 1:31 and 5:28 pm, evening between 5:51 and 6:50 pm, and night from 5:32 pm to 6:26 am. I was quite surprised by how early people considered the night to start, and in particular the intersection between the evening and night. Iâve heard people saying âgood nightâ at 5 pm in the US, but expected Europe to party harder. Luckily, I allowed the survey respondents to add a free-text comment, and thankfully, many of them did. Two Spanish workers commented that in Spanish, there is no distinction between afternoon and evening, and that Spanish doesn't really have a word for evening. The word âtardeâ (afternoon) is used to describe the range of hours from 1 pm to 8 pm, after which it is ânocheâ (night). 
</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">There are two other countries with enough responses for a meaningful statistical analysis: India and Brazil. Here is the same figure, for India: </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 231px; overflow: hidden; width: 602px;"><img height="231" src="https://lh3.googleusercontent.com/Na9Je6d8-ywbA78VZgybeAVkjiWaiRefYiqFzuHrxwkTfQLtTKYBVEWd7tf8QmmNhL17Cp30cjz8Zn_6ECwU0spq7hewfkN-dnrZmtt8oCwVSVBnP4X5HbyaREZT-7CWTUndm8rF=s0" style="margin-left: 0px; margin-top: 0px;" width="602" /></span></span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In India, morning starts at 4:53 and ends at 10:05 am, noon starts at 11:14 am and ends at 11:55 pm, afternoon is between 12 and 1:37 pm, evening between 2:21 and 4:56 pm, and night from 5:45 pm to 8:54 am. Largely, all time expressions referred to earlier times than in the US, with the night spanning over 15 of the 24 hours. 
</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 231px; overflow: hidden; width: 602px;"><img height="231" src="https://lh3.googleusercontent.com/ymEGyoM15pu1iZOMj-NNM9YW48DLB3H3wLuiYbVm-svO1TeV5CWygC49iwTtBrrogRGCI19dgpMke9NKYFX5o6aOuDmsobRz83glN3fOF0CSqwa88mChHpK4tDmLcuGeTnJ6dlFp=s0" style="margin-left: 0px; margin-top: 0px;" width="602" /></span></span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In Brazil, morning is between 5:21 and 11:29 am, noon is between 11:20 am and 12:16, afternoon from 12:39 to 5:20 pm, evening from 5:28 to 5:49 pm, and night from 2:50 pm to 6:20 am. Again, the evening was swallowed by the night, and again, the comments explain it. First, many commented that there is no concept of evening in Brazil. One person elaborated and said that it gets dark early, and once itâs dark, it is already considered night. In addition, some people mentioned that there is no concept of ânoonâ either.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">My own interpretation of these time expressions was as follows: morning at 6 am to 12 pm, noon from 12 to 3 pm, afternoon from 3 to 6 pm, evening from 6 to 10 pm, and night from 10 pm to 6 am. I had almost perfect agreement with my husband, except that he considered morning to start at 4 am. 
Interestingly, in Hebrew I would use "morning" to describe 4 am, i.e. "4 in the morning", but because I don't consider it a reasonable waking time, I made it part of the night. Indeed, this is "early morning", a time expression I didn't think of when I designed the survey. Many workers commented that they divide the time from dark to dawn into two or more different segments. Two workers from the Philippines indicated that the length of day and the length of night are equal, and that midnight marks the beginning of the new day, hence the morning. A worker from India commented that in their native language, there is a word for "early morning" used for the time range between 4 am and 6 am, though another Indian worker, possibly speaking a different native language, referred to this time as 12 am to 5 am. A third worker from India referred to 12 am to 4 am as "midnight". That was surprising to me because I consider midnight to be the exact time 12 am, although I realize I'm inconsistent with my interpretation of noon. Maybe it would be clearer if it were more common to call it "midday" instead of "noon". </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Apart from the answers from Europe, which were diverse in terms of countries, the other regions were mostly dominated by a single country. The answers from North America were dominated by the US (93.6%), Asia and Pacific was dominated by India (85.6%), South America by Brazil (100%), and Africa and the Middle East only had 5 responses. It would also be interesting to study how the interpretation of time differs between states in the US, and at different times of the day, days of the week, and seasons. 
Do people tend to say "good night" earlier in the day during the winter, when it gets dark early in the northern hemisphere, or is it always equivalent to "goodbye" after a certain hour, when "have a good day" doesn't make much sense anymore? To solve the confusion, Americans often use the generic "have a good one" greeting, allowing the recipient to decide what "one" means in their own schedule. </span></p><div><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div></span>Vered Shwartz, 2021-01-12<br /><br /><b>Commonsense Reasoning for Natural Language Processing</b><br /><br />This long-overdue blog post is based on the <a href="https://homes.cs.washington.edu/~msap/acl2020-commonsense/">Commonsense Tutorial</a> taught by Maarten Sap, Antoine Bosselut, Yejin Choi, Dan Roth, and myself at <a href="https://acl2020.org/">ACL 2020</a>. Credit for much of the content goes to the co-instructors, but any errors are mine. <div><br /></div><div>In the last 5 years, popular media has made it seem that AI is nearly---if not already---solved by deep learning, with reports on
super-human performance on speech recognition, image captioning, and object recognition. The release of Google Translate's neural models in 2016 <a href="https://arxiv.org/pdf/1609.08144.pdf">reported</a> large performance improvements: "60% reduction in translation errors on several popular language pairs". But looking under the hood, these numbers seem to be misleading. Neural models find shortcuts to the correct answers through <a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#artificial_data">dataset-specific input-output correlations</a>, essentially solving the dataset but not the underlying task. When models are challenged with adversarial out-of-domain examples, they perform poorly. Small, unnoticeable noise added to images <a href="https://ieeexplore.ieee.org/document/7298594">confuses object recognition models and changes their predictions</a>. Visual question answering models <a href="https://www.aclweb.org/anthology/D16-1203">guess the answer</a> based on the frequency of answers for the same type of question in the training set, e.g. replying "2" to any "how many" question. Image captioning models often <a href="https://dl.acm.org/doi/abs/10.1145/3025453.3025814?casa_token=nZpp3uz8qrgAAAAA:y8osF83HhDrG4mhKoq4SEeAWXJFGmjBjFUzHtrbsR3go3m4orm6lH70MUlpQoSUrUSjUhDtCPKfDXg">learn to recognize objects based solely on their typical environment</a>, and fail to recognize them outside their typical environment. In NLP, dialogue systems generate <a href="https://www.aclweb.org/anthology/D16-1127/">highly generic responses</a> such as "I don't know" even for simple questions. Open-ended generation is <a href="https://openreview.net/forum?id=rygGQyrFvH">prone to repetition</a>. Question answering systems are <a href="https://www.aclweb.org/anthology/D17-1215/">easily distracted by the addition of an unrelated sentence</a> to the passage. And more. 
</div><div><br /></div><div><br /></div><div><div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHzZzuRpxTHlVNtU5plQl9VpmKj3Awg4lQEPZXlc4xAJ4QoD_0M0PRWI7wZiUkq_MLmc7PMkH4t55YvRi79mDpe5JGDe7aX023sTk9H1p8w_Toyu5_fI6rQev_CWXMrXpa9aWwhEwZu9I/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1788" data-original-width="2048" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHzZzuRpxTHlVNtU5plQl9VpmKj3Awg4lQEPZXlc4xAJ4QoD_0M0PRWI7wZiUkq_MLmc7PMkH4t55YvRi79mDpe5JGDe7aX023sTk9H1p8w_Toyu5_fI6rQev_CWXMrXpa9aWwhEwZu9I/w344-h300/image.png" width="344" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 1: adversarial examples in computer vision (left) and natural language processing tasks (right).</td></tr></tbody></table><br /></div><div><br /></div><div>Machine learning models today perform reasonably well on perception tasks (image and speech recognition). However, they mostly lack the ability to perform simple intuitive commonsense inferences that humans do in every minute of their waking hours, regarding pre- and post-conditions of events, understanding other people's motivations and intents, mental and emotional states, etc. </div><div><br /></div><h4 style="text-align: left;"><b>Table of contents:</b> </h4><div><ol style="text-align: left;"><li><a href="#definition">What is commonsense?</a> </li><li><a href="#lms">Is commonsense knowledge already captured by pre-trained language models?</a> </li><li><a href="#benchmarks">How to create benchmarks to measure commonsense reasoning capabilities?</a> </li><li><a href="#kbs">How to gather and represent machine readable commonsense knowledge?</a> </li><li>How to enhance neural models for commonsense reasoning tasks with symbolic knowledge? 
</li><ul><li><a href="#static">Static integration</a></li><li><a href="#dynamic">Dynamic integration</a> </li></ul><li><a href="#summary">Summary</a></li></ol><div><br /></div><h4 id="definition" style="text-align: left;"><b>What is commonsense? </b></h4></div><div><div>The boundaries of commonsense are quite challenging to define, but we will go with this working definition:</div><div><blockquote><i>Commonsense is the basic level of <b>practical knowledge</b> and <b>reasoning</b> concerning everyday <b>situations</b> and <b>events</b> that are <b>commonly</b> shared among <b>most</b> people. </i></blockquote></div></div></div><div>For example, it's commonsense that it's OK to keep the closet door open, but not the fridge door, as the food inside might go bad. </div><div><br /></div><div><b>Types of commonsense: </b></div><div><br /></div><div>Commonsense knowledge can be categorized according to types, including but not limited to:</div><div><ul style="text-align: left;"><li><b>Social commonsense: </b>people are capable of making inferences about other people's mental states, e.g. what motivates them, what they are likely to do next, etc. This kind of inferences is captured by the <a href="https://homes.cs.washington.edu/~msap/atomic/">ATOMIC</a> knowledge base, discussed later. In addition, we each have a set of social norms of accepted behavior, e.g. knowing that âit's impolite to comment on someone's weightâ. While these are often implicit in our actions and decisions, machines need to be taught them <a href="https://www.aclweb.org/anthology/2020.emnlp-main.48/">explicitly</a>. <br /><br /></li><li><b>Temporal commonsense: </b>natural language rarely communicates explicit temporal information. Instead it's vague and relies on the commonsense knowledge of the listener. For example, when told that "<i>Dr. Porter is taking a vacation</i>" we can predict that Dr. Porter <b>will not</b> <b>be</b> able to see us soon, as opposed to when "<i>Dr. 
Porter is taking a walk</i>". This requires knowing the typical duration of "taking a walk" (minutes) and that of "taking a vacation" (days). Other temporal knowledge includes the typical times, order, and frequency of events, which are addressed by the <a href="https://leaderboard.allenai.org/mctaco/submissions/get-started">MC-TACO</a> dataset and the <a href="https://www.aclweb.org/anthology/2020.acl-main.678.pdf">TACO-LM</a> time-aware contextual language model. <br /><br /></li><li><b>Physical commonsense: </b>that a glass will likely shatter if it falls to the floor is a fact most people (and <a href="https://youtu.be/ccK3usCWmTo">arguably cats</a>) know. Physical commonsense includes knowledge about the physical properties and affordances of everyday objects, as tested in the <a href="https://yonatanbisk.com/piqa/">PIQA</a> dataset.</li></ul><div><br /></div></div><div>Commonsense is essential for humans to navigate everyday situations seamlessly and interact with each other in a reasonable and safe way, and for AI to understand human needs and actions better. Yet, endowing machines with such human-like commonsense reasoning capabilities has remained an elusive goal of AI research for decades. Past attempts, in the 1960s and 1970s, resulted in an AI winter, i.e. reduced interest and funding for AI research due to failed, over-hyped research directions. In recent years, new interest in machine commonsense has emerged, with the availability of stronger computing power and huge amounts of data. With that said, the path to machine commonsense is unlikely to be brute-force training of larger neural networks with deeper layers. </div><div><br /></div><h4 id="lms" style="text-align: left;">Is commonsense knowledge already captured by pre-trained language models?</h4><div>In the last 3 years, language models have been ubiquitous in NLP. 
<a href="http://veredshwartz.blogspot.com/2019/08/text-generation.html#lms">Language models</a> are pre-trained once, in a self-supervised manner that requires only a large text corpus. Traditionally, language models are trained to predict the next word in a sentence (top part of Figure 2, in blue), but they can also predict hidden (masked) words in the middle of the sentence, as in Google's <a href="https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html">BERT</a> model (top part of Figure 2, in orange). This pre-training phase yields a function that gets a sequence of words (sentence, short paragraph) and returns a vector for each word in the sequence. </div><div> </div><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk85ia_RAfE9ajGDWwl00WVISVHovCnEjeHtpyaXg8u2GmPONaBlRpJyyhcZKtBMbLVa4kp86aWGg6aOooPQz73IaMWuMRxN_OwrCOD-b_MbaDszil6I2OIAzDbwCYhleCls9YK0PXMZM/s1315/Screen+Shot+2020-09-14+at+5.00.37+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="605" data-original-width="1315" height="184" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk85ia_RAfE9ajGDWwl00WVISVHovCnEjeHtpyaXg8u2GmPONaBlRpJyyhcZKtBMbLVa4kp86aWGg6aOooPQz73IaMWuMRxN_OwrCOD-b_MbaDszil6I2OIAzDbwCYhleCls9YK0PXMZM/w400-h184/Screen+Shot+2020-09-14+at+5.00.37+PM.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 2: Language models pre-training and fine-tuning.</td></tr></tbody></table><div><br /><br /></div><div>As opposed to word embeddings which are static, language model-based word vectors are dynamic and re-computed for each context. At the very basic level, they assign different vectors to words when they are used in different senses, as in Figure 3. 
</div><div><br /></div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUErc9XJRN201Ke0hY_p6gAmjODEDzq315U0r8ziFz3_o0G1Gfcq1kHNyjv7pfMnk4G7pqH7OxOm52RsboU27b1tbkzqwH7PwCxsFy5GcbH3Iqlsb-XF2XORFiaz8AKEq4S84nbPGOYcg/s1675/static_vs_dynamic.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="606" data-original-width="1675" height="145" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUErc9XJRN201Ke0hY_p6gAmjODEDzq315U0r8ziFz3_o0G1Gfcq1kHNyjv7pfMnk4G7pqH7OxOm52RsboU27b1tbkzqwH7PwCxsFy5GcbH3Iqlsb-XF2XORFiaz8AKEq4S84nbPGOYcg/w400-h145/static_vs_dynamic.jpg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 3: Static vs. dynamic word representations.</td></tr></tbody></table><div style="text-align: right;"><br /></div><div><br /></div><div><br /></div><div>Do off-the-shelf pre-trained language models <i>already</i> capture commonsense knowledge? </div><div><br /></div><div>✅
They are capable, to some extent, of <a href="https://www.aclweb.org/anthology/D19-1250/">filling incomplete commonsense facts</a> or <a href="https://www.aclweb.org/anthology/D19-1109/">ranking candidate facts</a>. For example, the language model score (≈ statement plausibility) of a fact like "a <i>musician</i> plays a musical instrument" is higher than "a <i>dancer</i> plays a musical instrument". This is evidence that, in addition to lexical and syntactic knowledge, language models capture general knowledge about the world. </div><div><br /></div><div>✅
They can, to some extent, <a href="https://cognitivesciencesociety.org/cogsci20/papers/0070/0070.pdf">associate concepts with their properties</a>. They distinguish concepts associated with a given set of properties, i.e. complete a statement such as "<i>A <u> </u> has fur, is big, has claws, has teeth, is an animal, ...</i>" with <i>bear</i> (just like playing the "20 questions" game). They perform better when they are shown encyclopedic properties (e.g. <i>is an animal</i>) as opposed to perceptual properties (e.g. <i>smooth</i>). They can also, pretty successfully, list the properties associated with given concepts, e.g. complete the sentence "Everyone knows that a bear has <u> </u>" with fur, claws, teeth, etc. </div><div><br /></div><div>However, knowledge generated from language models is noisy! </div><div style="text-align: left;"><br /></div><div style="text-align: left;">🚫 <a href="https://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00298">Several</a> <a href="https://www.aclweb.org/anthology/2020.acl-main.698/">papers</a> have shown that language models are not sensitive to negation, i.e. they consider the negated version of facts ("<i>birds can't fly</i>") as similarly plausible. 
</div><div style="text-align: left;"><br /></div><div style="text-align: left;">🚫 They are sensitive to phrasing:</div><div style="text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9ghq5tppQ_OxZUinGFpmYODuLIyWKoJ3DbY808Gy6c3XZpVtYFECHlTvLUhPjGqNaqoE48CwMAlkyFiaVN5ZAL6YMpjmbzu0TH4Mhu-gBOhs27Z3ZAVx1NMA38ogxHqhEDb0gnZJKGIk/s458/Screen+Shot+2020-10-23+at+2.17.48+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="217" data-original-width="458" height="95" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9ghq5tppQ_OxZUinGFpmYODuLIyWKoJ3DbY808Gy6c3XZpVtYFECHlTvLUhPjGqNaqoE48CwMAlkyFiaVN5ZAL6YMpjmbzu0TH4Mhu-gBOhs27Z3ZAVx1NMA38ogxHqhEDb0gnZJKGIk/w200-h95/Screen+Shot+2020-10-23+at+2.17.48+PM.png" width="200" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVyAQnNCjvAEqYBMUIxsKgzqhbVUHcOOYcgkDYvFC8ldjRpqDHJ-ocC9UQiXpRTIHEauk9RfX4eKrXkpkPBaMJJ2qvWlioBgHElA63tj3GLVHthvCA0mhNkPHQI5YEbwxSjC38D_Ug8zo/s453/Screen+Shot+2020-10-23+at+2.18.04+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="219" data-original-width="453" height="97" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVyAQnNCjvAEqYBMUIxsKgzqhbVUHcOOYcgkDYvFC8ldjRpqDHJ-ocC9UQiXpRTIHEauk9RfX4eKrXkpkPBaMJJ2qvWlioBgHElA63tj3GLVHthvCA0mhNkPHQI5YEbwxSjC38D_Ug8zo/w200-h97/Screen+Shot+2020-10-23+at+2.18.04+PM.png" width="200" /></a></div><div style="text-align: left;"><br /></div><div style="text-align: left;">🚫 In <a href="http://veredshwartz.blogspot.com/2016/01/representing-words.html">distributional word vectors</a>, the vector representing a (sub-)word is learned from the contexts in which it appeared, leading to similar representations for semantically-similar words. 
In language models, the representation of similar contexts is similar, so the model learns which type of word should appear next (or instead of a masked token). This is generally a positive thing, but it sometimes <a href="https://www.aclweb.org/anthology/2020.coling-main.605/">over-generalizes</a>, leading to examples such as this: </div><div style="text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjc7CZMBIjNA1OLQ9K3pFeDU7YOaobFq1JRlu3B-Kx0J9xEfyW5wHJ2JZCWB5ZuzFUlnnSruUq-ilJjWPFghYLvxpGera8jV7SUJBr5Sjx8VVTLi2oj-hM8quatL6nFM4pEsqRZ_bq9dD8/s2/Screen+Shot+2021-01-08+at+4.16.25+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1" data-original-width="2" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjc7CZMBIjNA1OLQ9K3pFeDU7YOaobFq1JRlu3B-Kx0J9xEfyW5wHJ2JZCWB5ZuzFUlnnSruUq-ilJjWPFghYLvxpGera8jV7SUJBr5Sjx8VVTLi2oj-hM8quatL6nFM4pEsqRZ_bq9dD8/s0/Screen+Shot+2021-01-08+at+4.16.25+PM.png" /></a></div><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrnjrV-1sef8BD4mok3B9NwYHXUQuPh6mpV56iVW3UtMzMapgQcoQGsXgEEo9FvjcPBjout4O3_DrOk7oA_0qB_cHhHo8wLp1Fbj4zHUvQL2B1FmvMtoKgkA-a6h8urqQ6lM77Ngh3l_U/s610/Screen+Shot+2021-01-08+at+4.28.27+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="215" data-original-width="610" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrnjrV-1sef8BD4mok3B9NwYHXUQuPh6mpV56iVW3UtMzMapgQcoQGsXgEEo9FvjcPBjout4O3_DrOk7oA_0qB_cHhHo8wLp1Fbj4zHUvQL2B1FmvMtoKgkA-a6h8urqQ6lM77Ngh3l_U/s320/Screen+Shot+2021-01-08+at+4.28.27+PM.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: justify;"><span style="text-align: 
left;">Figure 4: BERT guesses that the masked token should be a color, but fails to predict the correct color. Using the </span><a href="https://demo.allennlp.org/masked-lm" style="text-align: left;">AllenNLP demo</a><span style="text-align: left;">. </span></td></tr></tbody></table><div style="text-align: left;"><br /></div><div style="text-align: left;"><br />Here, BERT has seen in its training corpus enough sentences of the type <i>"The color of something is [color]" </i>to know to suggest different colors as substitutes for the masked word. Unfortunately, not every color is suitable in every context that calls for a color. BERT likely didn't see enough sentences discussing the color of a dove, thus it defaults to just predicting <i>any </i>color. <br /><br /></div><div style="text-align: left;">So knowledge in language models is not the most accurate or reliable. Is it still useful?</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Yes, to some extent. One way to show this is through evaluation on tasks requiring commonsense knowledge. We will discuss several such tasks, but for now let's focus on <a href="https://leaderboard.allenai.org/winogrande/submissions/public">WinoGrande</a> as an example. It is the large-scale version of the <a href="https://en.wikipedia.org/wiki/Winograd_Schema_Challenge">Winograd Schema Challenge</a>. Given a sentence with a cloze, the goal is to fill in the blank with a previously mentioned entity or concept, out of two answer choices. For example: </div><div style="text-align: left;"><br /></div><div style="text-align: left;"><div><span style="font-family: courier;">Because Brett found an internship while in college but Ian was unable to, _____ found a job less quickly after graduation. 
</span></div><div>Choices: Brett, <u>Ian</u></div></div><div style="text-align: left;"><br /></div><div style="text-align: left;">What makes this task especially difficult is that every instance has a twin sentence which is minimally changed such that the correct answer is the other one (for instance, replacing "less quickly" with "more quickly" will change the correct answer from Ian to Brett). </div><div style="text-align: left;"><br /></div><div style="text-align: left;">Language model-based models top the leaderboards of WinoGrande and other commonsense tasks, but since they are trained on task specific training data, which often contains tens or hundreds of thousands of training examples, it's hard to attribute the success to the knowledge captured in language models from the pre-training step. A better way to estimate it is with zero-shot (unsupervised) models. Typically, the way zero-shot models address multiple-choice tasks is by phrasing a statement from the instance and each answer choice, and computing the language model score as a proxy for plausibility:</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="font-family: courier;">P<span style="font-size: xx-small;">LM</span>(The answer is <span style="color: #ffa400;">answer<span style="font-size: xx-small;">1</span></span>) </span></div><div style="text-align: left;"><span style="font-family: courier;">P</span><span style="font-family: courier; font-size: xx-small;">LM</span><span style="font-family: courier;">(The answer is <span style="color: #93c47d;">answer<span style="font-size: xx-small;">2</span></span>) </span></div><div style="text-align: left;"><span style="font-family: courier;">...</span></div><div style="text-align: left;"><span style="font-family: courier;">P</span><span style="font-family: courier; font-size: xx-small;">LM</span><span style="font-family: courier;">(The answer is <span style="color: #b4a7d6;">answer<span style="font-size: 
xx-small;">k</span></span>)</span></div><div style="text-align: left;"><br /></div><div style="text-align: left;">And then predicting the answer choice with the best language model score (highest probability, which is usually computed as the lowest <a href="http://veredshwartz.blogspot.com/2019/08/text-generation.html">perplexity</a>). </div><div style="text-align: left;"><br /></div><div style="text-align: left;">In our <a href="https://arxiv.org/abs/2004.05483">recent EMNLP paper</a>, we took it one step further and asked whether we can use language models to generate what would otherwise be missing or implicit knowledge needed for solving a multiple-choice commonsense question answering instance. We proposed the unsupervised "self-talk" framework, which uses language models to generate information-seeking questions such as "<i>what is the definition of..."</i> and their corresponding answers (clarifications) to discover additional background knowledge. In the example in Figure 5, knowing that internship experience may help a person get a job is crucial for answering the question (which of Brett and Ian found a job less quickly?). On most benchmarks, the self-talk model performed better than unsupervised models with no additional knowledge, while competing with models that have access to knowledge bases. This is despite the inaccurate and noisy knowledge language models generate. However, when we showed people some of the clarifications that helped the model choose the correct answer choice, they judged only 40% of them as actually providing helpful information. This discrepancy means that our model doesn't imitate the human reasoning process - it works differently. Check out our <a href="https://self-talk.apps.allenai.org/">demo</a>! 
It's not always accurate but it's often funny :) </div><div style="text-align: left;"><div><br /></div></div><div style="text-align: left;"><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7vBqUcPviqJcMUzhtT-9FF-7h3e7_WLT0hLXVV-isxqSUbwjAWxOoFwqAOhGaik_ztTkujfkx_9F-hNZ_gq7qrEg6T_idoGcYRXLWnQQu9Ae3GjYfNIPuY5wytDAo1QRXKTxFSdMrhlY/s673/Screen+Shot+2020-10-23+at+2.29.04+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="561" data-original-width="673" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7vBqUcPviqJcMUzhtT-9FF-7h3e7_WLT0hLXVV-isxqSUbwjAWxOoFwqAOhGaik_ztTkujfkx_9F-hNZ_gq7qrEg6T_idoGcYRXLWnQQu9Ae3GjYfNIPuY5wytDAo1QRXKTxFSdMrhlY/s320/Screen+Shot+2020-10-23+at+2.29.04+PM.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: left;">Figure 5: An example of clarification generation for an instance from WinoGrande.<br /><br /></td></tr></tbody></table></div><div style="text-align: left;"><br /></div><div style="text-align: left;">The best performance on commonsense tasks is achieved by fine-tuning language models, i.e. training them on task-specific data. Let's look at some of the benchmarks and the issues we face with supervised learning. </div><div style="text-align: left;"><br /></div><h4 id="benchmarks" style="text-align: left;">How to measure commonsense reasoning capabilities? </h4><div>Multiple commonsense benchmarks have been released over the last few years. 
Some of them will be discussed here (see examples in Figure 6), along with the main differences and design choices when creating a benchmark.</div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-xfeZ28BuZnbnjjv2eDZmAFVXdSdVXAgBpZ1bW7gZmybYKmkvIK_2TLKqqK0CymRCeD2N2wXy57KLji772TAYBbuGlZ4HuRNFJyVfavP2dDh0BipaxYKYWBfqvtKWicXPAsJb8iCCr44/s2394/Screen+Shot+2021-01-01+at+1.34.35+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="896" data-original-width="2394" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-xfeZ28BuZnbnjjv2eDZmAFVXdSdVXAgBpZ1bW7gZmybYKmkvIK_2TLKqqK0CymRCeD2N2wXy57KLji772TAYBbuGlZ4HuRNFJyVfavP2dDh0BipaxYKYWBfqvtKWicXPAsJb8iCCr44/w640-h240/Screen+Shot+2021-01-01+at+1.34.35+PM.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 6: Some commonsense benchmarks along with an example instance. </td></tr></tbody></table><div><br /></div><div><br /></div><div><b>Type of knowledge:</b> some benchmarks focus on a specific type of commonsense knowledge, such as social commonsense (e.g. <a href="https://leaderboard.allenai.org/socialiqa/submissions/get-started">Social IQa</a>), physical commonsense (e.g. <a href="https://yonatanbisk.com/piqa/">PIQA</a>), temporal commonsense (e.g. <a href="https://leaderboard.allenai.org/mctaco/submissions/get-started">MC-TACO</a>), or causes and effects (e.g. <a href="https://people.ict.usc.edu/~gordon/copa.html">COPA</a>), while others target a broader domain of general commonsense knowledge and reasoning (e.g. 
<a href="https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html">WSC</a>, <a href="https://winogrande.allenai.org/">WinoGrande</a>, <a href="https://www.tau-nlp.org/commonsenseqa">CommonsenseQA</a>, <a href="https://cs.rochester.edu/nlp/rocstories/">ROCStories</a>). </div><div><br /></div><div><b>Size: </b>most recent datasets include a large training set in order to facilitate training large neural models. One way to create a benchmark is to hire experts to curate a high-quality dataset, as was done for WSC and COPA. These datasets are rather expensive to collect and are therefore typically small. The common alternative is to collect data through <a href="http://veredshwartz.blogspot.com/2016/08/crowdsourcing-for-nlp.html">crowdsourcing</a> or semi-automatically, and split it randomly into train, validation, and test sets. Models that learned data-specific shortcuts in the training set instead of generalized phenomena are likely to perform well on a test set <a href="https://www.aclweb.org/anthology/2020.acl-main.465.pdf">drawn from the same distribution</a>, but this performance is misleading and is likely a lot better than on real-world instances of the task. Despite this understanding, this is still the dominant approach. </div><div><br /></div><div><b>Format:</b> the vast majority of datasets are in the format of multiple choice questions, as exemplified in Figure 6. This format is the easiest to evaluate automatically: models are judged by their accuracy, i.e. what percent of the questions they answered correctly. Unfortunately, this type of task also makes it possible for a model to guess the correct answer. We're not talking about a random guess, which would still leave enough room for improvement. A random guess is expected to result in an accuracy of 100/k %, where k is the number of answer choices, e.g. 50% accuracy for binary tests, 33.3% for tests with 3 choices, 25% for 4 choices, etc. 
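To make the 100/k arithmetic concrete, here is a tiny sketch (illustrative only, not from any benchmark codebase) that also confirms the expectation empirically:

```python
import random

def random_guess_accuracy(k: int) -> float:
    """Expected accuracy (%) of guessing uniformly among k answer choices."""
    return 100.0 / k

def simulate(k: int, trials: int = 100_000) -> float:
    """Empirical check: guess randomly on many questions whose gold answer
    is (without loss of generality) choice 0."""
    rng = random.Random(0)
    correct = sum(rng.randrange(k) == 0 for _ in range(trials))
    return 100.0 * correct / trials

print(round(random_guess_accuracy(3), 1))  # 33.3
print(abs(simulate(4) - random_guess_accuracy(4)) < 1.0)  # True
```

The worry discussed next is precisely that a model can beat this baseline without any commonsense, by exploiting artifacts in the data.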
The risk is that the model makes an "educated guess" based on - yes, you guessed it correctly - spurious correlations between the questions and the correct/incorrect answers. </div><div><br /></div><div>How do you make sure a model is right for the right reasons?</div><div><br /></div><div>That's the million-dollar question. We don't have a perfect solution for this problem yet. For a start, when collecting a new benchmark, the process of collecting incorrect answers (=distractors) should be well-designed such that distractors are <i>plausible but unlikely</i>. Using random answers as distractors (e.g. naturally-occurring sentences or correct answers of different questions) would create topically-different distractors, which are easy to detect (remember, <a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#we">relatedness is one of the strengths of distributional text representations</a>). Asking people to come up with the distractors may introduce other annotation artifacts, such as exaggerations, going off-topic, or producing overly emotional texts, which are <a href="https://www.aclweb.org/anthology/W17-0907.pdf">easy for models to detect</a>. Some solutions have been proposed: for example, the distractors in Social IQa are answers to different questions asked about the same context. In Figure 7, the context "<i>Alex spilt food all over the floor and it made a huge mess.</i>" appears in the dataset with two questions: "<i>what happens next?</i>" and "<i>what happened before?</i>". The distractors of "<i>what happens next?</i>" are the correct answers of "<i>what happened before?</i>", e.g. that Alex has slippery hands. A similar approach is taken in CommonsenseQA. 
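The pairing trick can be sketched in a few lines (the answer strings below are invented for illustration, and the data format is simplified relative to the real dataset):

```python
# Sketch of the Social IQa distractor-pairing idea: the correct answer to a
# *different* question about the same context becomes a distractor, so it
# stays on-topic (hard to reject from topicality alone) but is wrong for
# this question.
context = "Alex spilt food all over the floor and it made a huge mess."
gold_answers = {
    "what happens next?": "Alex cleans up the mess",      # invented example answer
    "what happened before?": "Alex had slippery hands",   # from the post's example
}

def build_instance(question: str, other_question: str) -> dict:
    """Pair a question with a distractor taken from the other question's gold answer."""
    return {
        "context": context,
        "question": question,
        "correct": gold_answers[question],
        "distractors": [gold_answers[other_question]],
    }

instance = build_instance("what happens next?", "what happened before?")
print(instance["distractors"])  # ['Alex had slippery hands']
```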
</div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGAZ0EXzLMVJSoy93TJ3MXXSZ0sK2fZXAHw__R8kBMCMU4I2150JsCFswFsr7aICkBqquIPRgooeYFHSJqAs1RtfhS2oSnmY2ni4Pf7JaHAvZ6phcC9vD2cqRrCn3_p4g4acmbneGR1cA/s1458/Screen+Shot+2021-01-01+at+1.58.54+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="690" data-original-width="1458" height="189" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGAZ0EXzLMVJSoy93TJ3MXXSZ0sK2fZXAHw__R8kBMCMU4I2150JsCFswFsr7aICkBqquIPRgooeYFHSJqAs1RtfhS2oSnmY2ni4Pf7JaHAvZ6phcC9vD2cqRrCn3_p4g4acmbneGR1cA/w400-h189/Screen+Shot+2021-01-01+at+1.58.54+PM.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 7: Creating distractors for a Social IQa instance. Image credit: Maarten Sap.</td></tr></tbody></table><br /><div>An alternative solution is to filter out easy questions through "<a href="https://rowanzellers.com/swag/">adversarial filtering</a>", i.e. training a weaker model and iteratively removing instances that it succeeds in answering. Variants of adversarial filtering were applied to WinoGrande and PIQA. </div><div><br /></div><div>Finally, I believe the future is in generative tasks, in which the model needs to produce a free text answer without being provided with the candidate answers. Several recent benchmarks are generative, such as <a href="https://www.aclweb.org/anthology/D19-1509/">TimeTravel</a> (counterfactual reasoning), <a href="https://openreview.net/pdf?id=Byg1v1HKDB">ART</a> (abductive reasoning), <a href="https://inklab.usc.edu/CommonGen/">CommonGen</a>, and <a href="https://www.aclweb.org/anthology/2020.emnlp-main.85.pdf">ProtoQA</a>. The challenge in generative tasks is the lack of reliable automatic evaluation metrics. 
Given the gold standard reference answer(s), we would like a metric to (1) reward correct generated answers that are different from the reference answer, while (2) penalizing incorrect answers that are similar (e.g. lexically) to the reference. Human evaluation is reliable, but it is costly and is typically done once on the test set. In order to be able to improve models during development, we need automatic metrics. We currently settle for metrics based on lexical overlap such as <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a> and <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE</a> which are pretty terrible at (1) and have <a href="https://www.aclweb.org/anthology/D17-1238/">little correlation with human judgements</a>, or model-based metrics such as <a href="https://openreview.net/forum?id=SkeHuCVFDr">BERT score</a> that are not great at (2). </div><div style="text-align: left;"><div><br /></div></div><h4 id="kbs" style="text-align: left;">How to gather and represent machine readable commonsense knowledge?</h4><div style="text-align: left;"><div>Commonsense resources provide machine-readable knowledge about the world. Resources are expected to be large-scale and accurate, consist of diverse knowledge types, and be usable in downstream tasks. <a href="https://conceptnet.io/">ConceptNet</a> is a large (21 million assertions), commonly-used resource consisting of general commonsense knowledge, in over 85 languages. <a href="https://homes.cs.washington.edu/~msap/atomic/">ATOMIC</a> consists of 880,000 triplets reasoning about causes and effects of everyday situations. 
Other resources are listed in Figure 8.</div><div><br /></div></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkD2fzDHpTTydRC9aS3YX0hnoXK6FJJNS__-iaGFtfpHfnDFhCI_AhknrsVeFBZuY2hOp77wWnF5kmwVnupNDLW-7XYlqMj1GVrPBljgrwoabIYWB77-bVCBnABkRPnPq1VuqSnHp6ecE/s1614/Screen+Shot+2021-01-01+at+3.06.10+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="734" data-original-width="1614" height="293" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkD2fzDHpTTydRC9aS3YX0hnoXK6FJJNS__-iaGFtfpHfnDFhCI_AhknrsVeFBZuY2hOp77wWnF5kmwVnupNDLW-7XYlqMj1GVrPBljgrwoabIYWB77-bVCBnABkRPnPq1VuqSnHp6ecE/w640-h293/Screen+Shot+2021-01-01+at+3.06.10+PM.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 8: Overview of existing commonsense resources. Image credit: Maarten Sap. </td></tr></tbody></table><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">Existing resources differ in several aspects:</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><b>Representation</b>: how is knowledge represented in the resource? ConceptNet and ATOMIC represent knowledge in natural language (Figure 9), while NELL and Cyc represent knowledge in symbolic logic:</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="font-family: courier;">(#$implies
(#$and
(#$isa ?OBJ ?SUBSET)
(#$genls ?SUBSET ?SUPERSET))
(#$isa ?OBJ ?SUPERSET)) </span></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZWsHPKeDk47tb49OXpgmJzTalNaiB2kCq8YfM6FsBeNFFcUFwwPoJMH08o16L7dymp_o0DhDbGQgxXUWsyVgJtNK85kdEsuTsiAweIpAgy-8e6xS2IOueS__bH0qfHiPZ3Hqo941VCAU/s1642/Screen+Shot+2021-01-01+at+3.33.17+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="898" data-original-width="1642" height="350" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZWsHPKeDk47tb49OXpgmJzTalNaiB2kCq8YfM6FsBeNFFcUFwwPoJMH08o16L7dymp_o0DhDbGQgxXUWsyVgJtNK85kdEsuTsiAweIpAgy-8e6xS2IOueS__bH0qfHiPZ3Hqo941VCAU/w640-h350/Screen+Shot+2021-01-01+at+3.33.17+PM.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 9: example knowledge extracted from ConceptNet and ATOMIC. Image credit: Maarten Sap. </td></tr></tbody></table><br /><div style="text-align: left;"><div><b><br /></b></div><div><b>Knowledge type: </b>ConceptNet consists of semantic knowledge, i.e. properties of concepts (e.g. <i>reading is a type of activity</i>). ATOMIC, on the other hand, is inferential: given a templated event with "PersonX" representing the subject and "PersonY" an optional object(s) (e.g. <i>PersonX yells at PersonY</i>), and one of 9 pre-defined relation dimensions (e.g. PersonX's motivation) it provides a second event (e.g. <i>PersonX wanted to express anger</i>). </div><div><br /></div><div><b>Collection method:</b> knowledge can be collected from humans, either experts or crowdsourcing workers. 
Expert-curated resources are more uniform and accurate, and may use complex representations, but this collection method is expensive and very <a href="https://emerj.com/ai-future-outlook/a-30-year-old-ai-project-hits-the-market/">time-consuming</a>. Alternatively, non-experts can write knowledge in natural language, making the collection faster and more scalable.</div><div><br /></div><div>The alternative approach is to extract knowledge automatically from texts, as in <a href="http://rtw.ml.cmu.edu/rtw/">NELL</a>. This approach works, but it produces less accurate knowledge. In addition, the approach suffers from <a href="https://openreview.net/pdf?id=AzxEzvpdE3Wcy">reporting bias</a>: over-representing the rare at the expense of the trivial. For example, people are reported to murder more often than they are reported to breathe. Default properties of concepts (<i>yellow banana</i>) are mentioned less often than their alternatives (<i>green banana</i>), etc. </div><div><br /></div></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><b>How to enhance neural models for commonsense reasoning tasks with symbolic knowledge?</b></div><h4 id="static" style="text-align: left;"><span style="font-weight: 400;">Most models developed for solving commonsense benchmarks today are based on language models. Typically, each answer choice, along with the context, forms a statement. The </span><span style="font-weight: normal;">language model computes a vector representing each statement. 
These vectors are then fed into a classifier that assigns a plausibility score for each candidate answer:</span></h4><div><span><b><br /></b></span></div><h4 id="static" style="text-align: left;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvy3JZi-V5JR5Cz7s_W3d-SVUZlVWyL0pw2yCaqTVbkOdOdKeFBNR68g7mK4kxH39bJFj1qFxQqYiwlXg4_xTQqiZsdywaZ6rX9VOYRNp89ixv0fIepuqZtttAuUtd2O_fmew4t0oR0Rw/s933/Screen+Shot+2020-10-29+at+4.08.14+PM.png" style="margin-left: auto; margin-right: auto;"><b><img border="0" data-original-height="163" data-original-width="933" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvy3JZi-V5JR5Cz7s_W3d-SVUZlVWyL0pw2yCaqTVbkOdOdKeFBNR68g7mK4kxH39bJFj1qFxQqYiwlXg4_xTQqiZsdywaZ6rX9VOYRNp89ixv0fIepuqZtttAuUtd2O_fmew4t0oR0Rw/w640-h112/Screen+Shot+2020-10-29+at+4.08.14+PM.png" width="640" /></b></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 10: An illustration of using BERT to score the answer choices of a WinoGrande instance.</td></tr></tbody></table></h4><div><span style="font-weight: normal;"><br /></span></div><h4 id="static" style="text-align: left;">Static neuro-symbolic integration</h4><div>The knowledge in commonsense resources may enhance models built for solving commonsense benchmarks. For example, we can extract from ConceptNet the assertions that job is used for making money, that spending money requires making money, that buying requires spending money, and that car is something you can buy. Ideally we would also need the knowledge that a high-paying job is a type of job, specifically one used for making a lot of money, which is required for spending a lot of money, which is required for buying something that costs a lot of money, car being one of them. 
Finally, we may want to remove the edge from "buy" to "car" so we can only get to "car" from the node "buy something that costs a lot of money". </div><div><br /></div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk0pAcEtYfpeR9l8Vl3StifAmcr6l_1mkVgsytveBbVrIh99zrQPgXx0wxP2wOtWahgn2DBCi2ZgoLp71B9_VhZKs1Naq7m_ZKY7Rxm-ubqwyhKPINa2vP87F6eAnXIi2dyOxQirR-kyk/s746/Screen+Shot+2020-10-29+at+4.10.56+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="516" data-original-width="746" height="277" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk0pAcEtYfpeR9l8Vl3StifAmcr6l_1mkVgsytveBbVrIh99zrQPgXx0wxP2wOtWahgn2DBCi2ZgoLp71B9_VhZKs1Naq7m_ZKY7Rxm-ubqwyhKPINa2vP87F6eAnXIi2dyOxQirR-kyk/w400-h277/Screen+Shot+2020-10-29+at+4.10.56+PM.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 12: Knowledge extracted from ConceptNet for the WinoGrande instance discussed above.<br /></td></tr></tbody></table><div><br /></div><div><br /></div><div>How do we incorporate knowledge from knowledge resources into a neural model?</div><div><br /></div><div>The simple recipe (success not guaranteed) calls for 4 ingredients: the task addressed, the knowledge resource used, the neural component, and the combination method. We have already discussed tasks and knowledge resources, so I would only add here that ConceptNet is the main resource utilized for downstream models, although some models incorporate other knowledge sources, such as other knowledge bases (WordNet, ATOMIC), knowledge mined from text, and tools (knowledge base embeddings, sentiment analysis models, COMET - see below). 
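The chain of assertions above, from job all the way to car, is essentially a path in the knowledge graph. A minimal sketch of retrieving it (toy edges based on the assertions just mentioned; relation labels such as UsedFor and HasPrerequisite are omitted for brevity):

```python
from collections import deque

# Toy directed graph over the ConceptNet assertions discussed above.
graph = {
    "job": ["make money"],         # a job is used for making money
    "make money": ["spend money"], # spending money requires making money
    "spend money": ["buy"],        # buying requires spending money
    "buy": ["car"],                # a car is something you can buy
}

def find_path(graph, start, goal):
    """Breadth-first search for a reasoning chain between two concepts."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain connects the two concepts

print(find_path(graph, "job", "car"))
# ['job', 'make money', 'spend money', 'buy', 'car']
```

This is of course a best-case sketch: as noted above, the chain a model actually needs (through "a lot of money") is often missing from the resource.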
</div><div><br /></div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinv-vE5mqVPn937_dJ0ae0519YxRZKbD_1V_FIunt4t15gVDN9g0ngglQ5sSzeloBcNNOrcLJlsAzuJaFSydRNMlSHx3X8NemJzPxDzjOwBM6cMusmBmCyZXBX93cfI0uBatwBtjoDgUA/s542/Screen+Shot+2020-10-29+at+4.13.10+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="361" data-original-width="542" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinv-vE5mqVPn937_dJ0ae0519YxRZKbD_1V_FIunt4t15gVDN9g0ngglQ5sSzeloBcNNOrcLJlsAzuJaFSydRNMlSHx3X8NemJzPxDzjOwBM6cMusmBmCyZXBX93cfI0uBatwBtjoDgUA/s320/Screen+Shot+2020-10-29+at+4.13.10+PM.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 13: Resources used by most knowledge-informed commonsense models.</td></tr></tbody></table><span><br /><span style="font-weight: normal;">The neural component is the </span></span><span style="font-weight: normal;">shiny new neural architecture - language models in the last 3 years, biLSTMs in the years prior, etc. The more interesting component is the combination method. We will look at 3 examples:</span><div><span style="font-weight: normal;"><br /></span></div><div><b>Incorporating into the scoring function:</b> <a href="https://www.aclweb.org/anthology/D17-1216.pdf">Lin et al. (2017)</a> extracted probabilistic "rules" connecting pairs of terms from multiple sources such as WordNet (restaurantâeatery: 1.0), Wikipedia categories (restaurantâbusiness: 1.0), script knowledge mined from text (X went to a restaurantâX ate: 0.32), word embedding-based relatedness scores (restaurantâfood: 0.71), and more. The model scores each candidate answer according to the scores of the inference rules used to get from the context (e.g. 
"<i>Mary walked to a restaurant</i>" in Figure 14) to the candidate answer (e.g. "<i>She ordered foods.</i>"). </div><div><br /></div><div><br /></div><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7SoCI_h-kAQIKSo88zWbvz9f8rE8mDMLuX11Zz674Gb8jioPpTwsiO1Vin5wVUi0FqUHQSRlHnGqfNT0-P_IjEf7Kec9u1U1yAB3jM3jYmDVW9wJAi4Ca5TDvOE_cFzHIMCpI0d9WbVE/s1228/Screen+Shot+2021-01-01+at+4.59.24+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="694" data-original-width="1228" height="226" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7SoCI_h-kAQIKSo88zWbvz9f8rE8mDMLuX11Zz674Gb8jioPpTwsiO1Vin5wVUi0FqUHQSRlHnGqfNT0-P_IjEf7Kec9u1U1yAB3jM3jYmDVW9wJAi4Ca5TDvOE_cFzHIMCpI0d9WbVE/w400-h226/Screen+Shot+2021-01-01+at+4.59.24+PM.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: left;">Figure 14: "covering" each candidate answer by the original context and the rules extracted from various sources. Image credit: <a href="https://www.aclweb.org/anthology/D17-1216.pdf">Lin et al. (2017)</a>.</td></tr></tbody></table><div><br /></div><div><br /></div><div><b>Representing symbolic knowledge as vectors: </b><a href="https://www.aclweb.org/anthology/D19-1282.pdf">Lin et al. (2019)</a> used BERT as the neural component to represent the instance (statement vector). For their symbolic component, they extracted subgraphs from ConceptNet pertaining to concepts mentioned in the instance and learned to represent them as a vector (graph vector). These two vectors were provided as input to the answer scorer which was trained to predict the correct answer choice. 
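A toy sketch of this combination method (the dimensions, vectors, and weights are made up; in the real model the statement vector comes from BERT, the graph vector from a learned graph encoder, and the scorer is trained end-to-end):

```python
# Illustrative dimensions only; e.g. BERT-base statement vectors are 768-dim.
STATEMENT_DIM, GRAPH_DIM = 4, 3

def score_answer(statement_vec, graph_vec, weights):
    """Linear scorer over the concatenated [statement; graph] features,
    standing in for the trained answer classifier."""
    features = statement_vec + graph_vec  # list concatenation
    return sum(w * f for w, f in zip(weights, features))

weights = [0.5, -0.2, 0.8, 0.1, 0.3, 0.9, -0.4]  # hand-set for illustration
candidates = {
    "answer_a": ([0.2, 0.1, 0.9, 0.4], [0.3, 0.8, 0.1]),
    "answer_b": ([0.7, 0.5, 0.2, 0.1], [0.6, 0.2, 0.4]),
}
best = max(candidates, key=lambda a: score_answer(*candidates[a], weights))
print(best)  # answer_a scores higher under these toy weights
```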
</div><div><br /></div><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeFi4Q05VbYhv9zj_fve6lsFrh2Zfjt6xH_dOyGLsYTc63AdfvdEyv2KQBds3W5ZBKKDDs2yg74Cz6M5bEcz82ZdLedDcfagGlGhdUl5Sp9s0wa0xoZLmxWE7ss5G-y6qSdfbIQGjowCI/s790/Screen+Shot+2021-01-03+at+4.12.09+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="630" data-original-width="790" height="255" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeFi4Q05VbYhv9zj_fve6lsFrh2Zfjt6xH_dOyGLsYTc63AdfvdEyv2KQBds3W5ZBKKDDs2yg74Cz6M5bEcz82ZdLedDcfagGlGhdUl5Sp9s0wa0xoZLmxWE7ss5G-y6qSdfbIQGjowCI/w320-h255/Screen+Shot+2021-01-03+at+4.12.09+PM.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: left;">Figure 15: extracting subgraphs from ConceptNet pertaining to concepts mentioned in the instance. Image credit: <a href="https://www.aclweb.org/anthology/D19-1282.pdf">Lin et al. (2019)</a>.</td></tr></tbody></table><div><br /></div><h4 id="dynamic" style="text-align: left;">Multi-task learning: <span style="font-weight: 400;"><a href="https://dl.acm.org/doi/10.1145/3357384.3358165">Xia et al. (2019)</a> fine-tuned a BERT model to solve the multiple choice questions. They also trained two auxiliary tasks supervised by </span><span style="font-weight: normal;">ConceptNet, in which two concepts were given as input and the classifier had to predict whether they are related or not, and the specific ConceptNet property that connects them. 
The BERT model was shared between the main and the auxiliary tasks, so that commonsense knowledge from ConceptNet was instilled into BERT, improving its performance on the main task.</span></h4><div><span style="font-weight: normal;"><br /></span></div><div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBRKpSrahy_2YiwhvDX1NsBBbSD6EjUjk_kSIa3FUYqIpSE_g90DIdTqXTS-H0dVWr0RA9fGphmH6XlFMQzO470SuoumNTTDk6RO9KgaS3tfuh_b4kfHHovO78gClRlVV9XeFR2tIZPcw/s1960/Screen+Shot+2021-01-01+at+5.10.43+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="936" data-original-width="1960" height="306" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBRKpSrahy_2YiwhvDX1NsBBbSD6EjUjk_kSIa3FUYqIpSE_g90DIdTqXTS-H0dVWr0RA9fGphmH6XlFMQzO470SuoumNTTDk6RO9KgaS3tfuh_b4kfHHovO78gClRlVV9XeFR2tIZPcw/w640-h306/Screen+Shot+2021-01-01+at+5.10.43+PM.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 16: multi-task learning aimed at instilling knowledge from ConceptNet into BERT.</td></tr></tbody></table><br /><span style="font-weight: normal;"><br /></span></div><h4 id="dynamic" style="text-align: left;">Dynamic neuro-symbolic integration</h4><div><div><div>There are two main limitations to the neuro-symbolic integration discussed above:</div><div><ol style="text-align: left;"><li><b>Coverage:</b> relevant knowledge is often not found as-is in commonsense knowledge resources. As we've seen earlier, commonsense knowledge is immeasurably vast, so much of it is not documented. <br /><br /></li><li><b>Precision and context:</b> knowledge found in the knowledge base about concept X doesn't necessarily apply to all contexts in which X appears. 
For example, when provided with "<i>PersonX adopts a cat</i>", ATOMIC says that PersonX had to go to the shelter first (Figure 17), but that's not always the case. It may as well be that PersonX adopted a cat they found on the street or got the cat from a friend who was no longer able to care for it. </li></ol><div><br /></div></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_qJpp93jIgQnHd0S8Dve6RHGPRN5IwJq_51ieuCMOYiaeqOezMQ7H2CpQqogTExuVXYWuMK6U9yTIqJbxyqH1hqK44EG28MWqWDxdvfnBfItttJ3fM1Re8qydrPtTQ67U-n88DiBnXq8/s1634/Screen+Shot+2021-01-03+at+4.32.30+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1634" data-original-width="1620" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_qJpp93jIgQnHd0S8Dve6RHGPRN5IwJq_51ieuCMOYiaeqOezMQ7H2CpQqogTExuVXYWuMK6U9yTIqJbxyqH1hqK44EG28MWqWDxdvfnBfItttJ3fM1Re8qydrPtTQ67U-n88DiBnXq8/w634-h640/Screen+Shot+2021-01-03+at+4.32.30+PM.png" width="634" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 17: ATOMIC inferences for the event "PersonX adopted a cat".</td></tr></tbody></table><br /><div><br /></div><div><div>How do we provide machines with large-scale, contextualized commonsense knowledge?</div></div><div><br /></div><div>The solution is to leverage manually curated commonsense knowledge resources, such as ConceptNet and ATOMIC, to train a model that can dynamically produce such knowledge for a given context. Commonsense knowledge resources are typically sparse, making training a knowledge base completion model to extend the resource less efficient. Pre-trained language models and their inherent knowledge come in handy here. 
Language models (such as GPT) implicitly represent knowledge, so you can re-train them on completing knowledge base assertions (e.g. from ATOMIC) to teach them the structure of knowledge. This is what <a href="https://www.aclweb.org/anthology/P19-1470/">COMET</a> (COMmonsEnse Transformers) does, as illustrated in Figure 18. </div><div><br /></div><div><br /></div><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNgWqcB0gK8mLpe3rzsGnIiQ2b1O6s6lkuk35l_6MJO4y5OCm3VyXCt74qxL9VKmIIw-kcORmsdA5YKGX8c1go4aPqUqmf7emM1LO91b4I7KnDhoGKPBQp80fAjwcdjLrozhl0bVseV9w/s2561/Screen+Shot+2021-01-03+at+4.48.11+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1228" data-original-width="2561" height="191" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNgWqcB0gK8mLpe3rzsGnIiQ2b1O6s6lkuk35l_6MJO4y5OCm3VyXCt74qxL9VKmIIw-kcORmsdA5YKGX8c1go4aPqUqmf7emM1LO91b4I7KnDhoGKPBQp80fAjwcdjLrozhl0bVseV9w/w400-h191/Screen+Shot+2021-01-03+at+4.48.11+PM.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: left;">Figure 18: Illustration of the training process of COMET: The language model is fine-tuned to predict the "tail entity" (e.g. inference in ATOMIC) given the "head entity" and the relation. Image credit: Antoine Bosselut.</td></tr></tbody></table><br /><div><br /></div><div>COMET is capable of <i>dynamically</i> generating inferences for any context. 
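To make the fine-tuning setup more concrete, here is a minimal sketch (not the actual COMET implementation; the special tokens and the exact input format are only illustrative) of how knowledge base triples from a resource like ATOMIC can be linearized into text, so that a left-to-right language model can be trained to generate the "tail" given the "head" and the relation:

```python
# Illustrative sketch only - not the official COMET code. The [GEN] and
# [EOS] markers and the relation names are assumptions for demonstration.

def linearize(head, relation, tail):
    """Format an ATOMIC-style (head, relation, tail) triple as a single
    training string; the LM is trained to generate everything after [GEN]."""
    return f"{head} {relation} [GEN] {tail} [EOS]"

# A few assertions in the style of ATOMIC.
triples = [
    ("PersonX adopts a cat", "xNeed", "to go to the shelter"),
    ("PersonX adopts a cat", "xEffect", "gets scratched"),
]

training_examples = [linearize(*t) for t in triples]

# At inference time, the fine-tuned LM is given only a prefix such as
# "PersonX adopts a cat xNeed [GEN]" and decodes the tail, which is what
# lets it produce inferences for events that never appear in the resource.
inference_prefix = "PersonX adopts a cat xNeed [GEN]"
```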
For example, if we modify the context from ATOMIC to "<i>David adopted his sister's cat because they found out her husband was allergic.</i>", which for obvious reasons does not appear in ATOMIC, COMET no longer predicts that PersonX (David) had to go to the shelter, but instead that he, for example, needed to find out about it.</div></div><div><br /></div><div>COMET has been used successfully in various downstream tasks requiring commonsense knowledge. Models trained on ATOMIC or on ConceptNet are available, and the demo for both ATOMIC and COMET can be found <a href="https://mosaickg.apps.allenai.org/">here</a>. There is also a <a href="https://visualcomet.xyz/">Visual COMET</a> that can generate inferences from images. </div></div><div><br /></div><div><h4 id="summary" style="text-align: left;">Summary</h4><div>We talked about ways to acquire and represent commonsense knowledge in machine readable format, ways to measure commonsense reasoning abilities, and ways to integrate this kind of knowledge into models. None of these is solved yet. Manually collecting all the commonsense knowledge is infeasible, while extracting it from texts or from language models suffers from inaccuracies, reporting bias, and societal biases. Looking forward, a promising research direction is multi-modal commonsense knowledge acquisition, e.g. learning from texts along images and videos. For example, looking through enough class photos, you might learn that the kids in the front row typically sit (especially if the kids in the last row are also seated). 
</div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTcBOuG6Dxx5XM2JwRVv1LxNBz4ws0RsoVmdZRmaLAmLPCHLRW3VFmlPT1RU3bKJl1p9IebdX-oVO7dN02gRJSHrBLt7zHWW9ViGBLoL9TpCg6Vr1SxjekAQkYzhFYDXSFoTgAdmwO_EY/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="846" data-original-width="1364" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTcBOuG6Dxx5XM2JwRVv1LxNBz4ws0RsoVmdZRmaLAmLPCHLRW3VFmlPT1RU3bKJl1p9IebdX-oVO7dN02gRJSHrBLt7zHWW9ViGBLoL9TpCg6Vr1SxjekAQkYzhFYDXSFoTgAdmwO_EY/" width="320" /></a></div><br /></div><div>Machines may reach human performance on commonsense benchmarks but it's often due to being right for the wrong reasons rather than actually possessing and successfully applying commonsense knowledge and reasoning abilities. Generative tasks are somewhat less prone to this issue, but we would have to develop reliable automatic evaluation metrics to make them the standard. </div></div><div><br /></div><div>Machine commonsense reasoning is becoming more and more popular within NLP so I am optimistic about future breakthroughs! <br /></div>Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com4tag:blogger.com,1999:blog-9145120678290195131.post-60753689796882451212019-08-30T20:51:00.001+03:002021-01-11T21:43:02.842+02:00Text GenerationEarly this year, OpenAI <a href="https://openai.com/blog/better-language-models/">announced</a> a very powerful language model they developed that can generate human-like text. While such announcements are usually followed by the release of a model to the public, this one suggested that such a powerful tool will pose a danger, and therefore only a smaller and less powerful version of it was released. 
Soon enough, in addition to the usual buzz on academic Twitter, the news made it to popular media, where it was described in a rather simplistic and exaggerated way. This caused some fear among the general public, some criticism from other NLP researchers, and many questions from their relatives ("hey, look at this article I've found - did they just solve NLP? Are you going to be unemployed?"). Six months later, OpenAI finally <a href="https://openai.com/blog/gpt-2-6-month-follow-up/">decided to release the full model</a>.<br />
<br />
While I might be late to the dangerous language models party, I thought this blog lacks a basic post about text generation. How are these models trained? How are they used? Are they really that good? And dangerous?<br />
<br />
<a href="#scope">Scope</a><br />
<a href="#lms">Language Models</a><br />
<a href="#generation">Generating Text</a><br />
<a href="#training">Training a language model</a><br />
<a href="#eval">Evaluating text generation</a><br />
<a href="#dangerous">Are language models dangerous?</a><br />
<br />
<h4 dir="ltr" id="scope" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<a href="https://www.blogger.com/null" name="scope">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Scope</span></a></h4>
<br />
The reason that everyone is talking about language models (LMs) lately is not so much that they're all working on text generation, but because pre-trained LMs (like the OpenAI <a href="https://openai.com/blog/better-language-models/">GPT-2</a> or Google's <a href="https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html">BERT</a>) are used to produce text representations across various NLP applications, greatly improving their performance. The effect is similar to the one that pre-trained word embeddings had on NLP in 2013. I recommend reading Sebastian Ruder's article <a href="https://thegradient.pub/nlp-imagenet/">NLP's ImageNet moment has arrived</a> that summarizes it very nicely. This blog post will focus on text generation.<br />
<br />
There is an important distinction between two main types of applications of text generation:<br />
<b><br /></b>
<b>1.</b> <b>Open-ended generation:</b> the purpose is to generate <i>any</i> text. It could be on some specific topic or continuing a previous paragraph, but the model is given the artistic freedom to generate <i>any </i>text.<br />
<div>
<br /></div>
<b>2. Non open-ended generation:</b> the model is expected to generate <i>a specific text</i>. More formally, given some input, the model should generate text that is strictly derived from it. A good example is translation: given a sentence in French, for instance, the model must generate a sentence in English - but not just any sentence - it should have the same meaning as the French sentence. Other examples include summarization (given a long document, generate a short text that consists of the important details in the document); image captioning (given an image, generate a text describing it); speech to text (transcribing); and converting text to code or SQL queries.<br />
<ol>
</ol>
<div>
This post focuses on open-ended text generation.<br />
<br /></div>
<h4 dir="ltr" id="lms" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<a href="https://www.blogger.com/null" name="lms">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Language Models</span></a></h4>
<div>
<br />
I've discussed LMs in one of the <a href="https://veredshwartz.blogspot.com/2015/09/language-models.html">earlier posts in this blog</a>, in the context of machine translation. Simply put, a language model is a probability distribution over the next word in the text, given the previous words. The distribution is over all the words in the vocabulary, which is typically very large (a few hundred thousand words or more).</div>
<div>
<br />
For example, what can be the next word in the sentence "<i>I'm tired, I want to</i>"? A good language model would assign a high score to <span style="font-family: "georgia" , "times new roman" , serif;">p(<i>sleep</i>|<i>I'm tired, I want to</i>)</span>. The probability of a word like "<i>bed</i>" should be low - although it is a related term, it doesn't form a grammatical sentence - as should that of "<i>party</i>", which is syntactically correct but contradicts common sense. The probability of an entire sentence is the product of the conditional probability of each word given the previous words, using the chain rule:<br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif;">p(<i>I'm tired, I want to sleep</i>) = p(<i>I'm</i>|&lt;s&gt;) * p(<i>tired</i>|&lt;s&gt; <i>I'm</i>) * p(,|&lt;s&gt; <i>I'm tired</i>) * p(<i>I</i>|&lt;s&gt; <i>I'm tired,</i>) * p(<i>want</i>|&lt;s&gt; <i>I'm tired, I</i>) * p(<i>to</i>|&lt;s&gt; <i>I'm tired, I want</i>) * p(<i>sleep</i>|&lt;s&gt; <i>I'm tired, I want to</i>) * p(&lt;/s&gt;|&lt;s&gt;<i> I'm tired, I want to sleep</i>)</span><br />
<br />
where &lt;s&gt; and &lt;/s&gt; mark the beginning and the end of the sentence, respectively. Note that I used a word-based LM for demonstration purposes in this post; however, it's possible to define the basic token as a character or a "<a href="https://www.aclweb.org/anthology/P16-1162">word piece</a>" / "subword unit".<br />
<br /></div>
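The chain rule above is easy to compute directly. Here is a tiny sketch in which a toy LM is just a table of conditional probabilities (the numbers are made up for illustration):

```python
# Toy illustration of the chain rule: the probability of a sentence is
# the product of each word's conditional probability given the preceding
# words. All probabilities below are invented for demonstration.

cond_probs = {
    ("<s>", "I'm"): 0.2,
    ("<s> I'm", "tired"): 0.1,
    ("<s> I'm tired", ","): 0.5,
    ("<s> I'm tired ,", "I"): 0.4,
    ("<s> I'm tired , I", "want"): 0.3,
    ("<s> I'm tired , I want", "to"): 0.9,
    ("<s> I'm tired , I want to", "sleep"): 0.6,
    ("<s> I'm tired , I want to sleep", "</s>"): 0.8,
}

def sentence_probability(tokens):
    prob, history = 1.0, "<s>"
    for token in tokens:
        prob *= cond_probs[(history, token)]  # p(token | previous words)
        history += " " + token
    return prob

p = sentence_probability(["I'm", "tired", ",", "I", "want", "to", "sleep", "</s>"])
```

Note how the probability shrinks with every multiplication - this is why longer sentences get lower probabilities, and why perplexity (discussed below) normalizes by the number of words.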
<br />
<h4 dir="ltr" id="generation" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<a href="https://www.blogger.com/null" name="generation">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Generating Text</span></a></h4>
<br />
While LMs can be used to score a certain text on its likelihood in the language, in this post we will discuss another common usage of them which is to generate new text. Assuming we've already trained a language model, how do we generate text? We will demonstrate it with this very simple toy LM, which has a tiny vocabulary and very few probable utterances:<br />
<br />
<script src="https://gist.github.com/vered1986/e40f15d69f74715fba62e51ab9b76133.js"></script>
To generate text using a language model, one must generate token by token, each time deciding on the next token using the distribution defined by the previous tokens. The most basic way is to simply take the most probable word at each step. The code will look like this:<br />
<script src="https://gist.github.com/vered1986/3f390192bf16852a028f9bdd3e4b7d26.js"></script>
Our toy LM only generates the sentence <i>This LM is cool</i>. In general, this generation method is pretty limited because it has very little diversity; in particular, it favors very frequent words, some of which are function words like determiners (<i>the</i>, <i>a</i>, ...), prepositions (<i>on</i>, <i>in</i>, <i>of</i>, ...) and so on. Moreover, this <a href="https://arxiv.org/pdf/1904.09751.pdf">interesting study</a> showed that text generated by maximizing the probability is very different from human-generated text. People tend not to produce the most likely and obvious utterances, but rather to convey just the amount of helpful information that is not already known to the listener (according to <a href="https://psychology.wikia.org/wiki/Gricean_maxims">Grice's Cooperative Principle</a>).<br />
<br />
An alternative is to sample from the distribution, i.e., randomly select a word from the vocabulary, proportional to its probability given the previous words, according to the language model. The code will look something like this:<br />
<br />
<script src="https://gist.github.com/vered1986/ee16a0333b761f8313b5490c03012ce4.js"></script>You may notice that running this code multiple times sometimes generates <i>This LM is stupid </i>and sometimes <i>This LM is cool</i>. While this sampling method tends to generate more diverse texts, it's not perfect either, because now there is a chance to sample a rare or unrelated word at each time step - and once the model does, the generation of the next word is conditioned on that rare word and it quickly goes downhill.<br />
<br />
A simple solution is to combine the two approaches and sample only from the top k most probable words in the distribution, for some pre-defined k (as done in <a href="https://arxiv.org/pdf/1904.09751.pdf">this work</a>). This is what it would look like:<br />
<br />
<br />
<script src="https://gist.github.com/vered1986/1f6ae30a87dfd080cd28bbee1cbc1464.js"></script>
<br />
Notice that after keeping only k words in the distribution, we need to make sure again that they form a valid probability distribution, i.e. each entry is between 0 and 1, and the sum is 1.<br />
<br />
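For illustration, here is a minimal standalone sketch of a single top k decoding step, including that renormalization, over a made-up toy distribution:

```python
import random

# Minimal sketch of one top-k sampling step over a toy distribution:
# keep the k most probable words, renormalize so the kept probabilities
# form a valid distribution (sum to 1), and sample from them.
def top_k_sample(distribution, k, rng=random):
    top = sorted(distribution.items(), key=lambda kv: -kv[1])[:k]
    total = sum(p for _, p in top)
    words = [w for w, _ in top]
    probs = [p / total for _, p in top]  # renormalize the kept mass
    return rng.choices(words, weights=probs, k=1)[0]

# Made-up next-word distribution for some prefix.
toy = {"sleep": 0.5, "rest": 0.3, "party": 0.15, "banana": 0.05}
word = top_k_sample(toy, k=2)  # only "sleep" or "rest" can be drawn
```

With k=2, the rare word "banana" can never be sampled, which is exactly the failure mode of pure sampling that top k avoids.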
An alternative way to sample from the top of the distribution is <a href="https://arxiv.org/pdf/1904.09751.pdf">top p</a>: sort the tokens by their probability from highest to lowest, and take tokens until the sum of probabilities (which is exactly the probability of generating <i>any</i> of these tokens) reaches some pre-defined value p between 0 and 1. A small number close to 0 is similar to always taking the most probable token, while a large number close to 1 is similar to sampling from the entire distribution. This method is more flexible than top k because the number of candidate tokens may change according to the generated prefix. For example, a general text like <i>I want to </i>may have many valid continuations (with a relatively small probability for each), while a more specific text like <i>The bride and the groom got </i>will have far fewer, with the obvious next token <i>married </i>taking most of the probability mass.<br /><br />
<b>Update 01/11/21:</b> a top p snippet is now available - thanks to Saptarshi Sengupta for the contribution!
<br />
<script src="https://gist.github.com/saptarshi059/0408483297fdecfc08e26e0312eb1a37.js"></script>
<br />
<h4 dir="ltr" id="training" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<a href="https://www.blogger.com/null" name="training">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Training a language model</span></a></h4>
<br />
I've already discussed <a href="https://veredshwartz.blogspot.com/2015/09/language-models.html">N-gram language models</a>, but by the time I wrote that post (4 years ago), they were already obsolete and replaced by neural language models. The basic algorithm for training a neural LM is as follows:<br />
<div>
<br /></div>
<table border="1">
<tbody>
<tr>
<td>A large amount of text is dedicated for training (<i>training corpus</i>).<br />
<div>
The model goes over the corpus, sentence by sentence. </div>
<div>
For a given sentence w<sub>1</sub>... w<sub>n</sub>, for each word w<sub>i</sub>:</div>
<div>
<ol>
<ol>
<li>A representation for the context of the word (for example, the previous words in the sentence) is computed: </li>
<ul>
<li>Each word in the sequence w<sub>1</sub>... w<sub>i-1</sub> is represented with a vector, i.e. word embedding. </li>
<li>These word embeddings are fed into the <i>encoder</i>, which returns a single vector representing this sequence.</li>
</ul>
<li>This vector is fed into a classifier whose goal is to predict the next word (each word is a class).</li>
<li>During training, the predicted word w' is compared with the gold label (the actual next word w<sub>i</sub>) and the model parameters are updated accordingly.</li>
</ol>
</ol>
</div>
</td>
</tr>
</tbody></table>
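The training loop in the box above can be sketched in a few lines of numpy. This is a deliberately simplified toy, not a real LM implementation: the "encoder" here is just the average of the context word embeddings (a real model would use an RNN or similar), the classifier is a single softmax layer, and the corpus is one made-up sentence.

```python
import numpy as np

# Toy sketch of neural LM training: encode the prefix, predict the next
# word with a softmax classifier, update parameters by gradient descent.
rng = np.random.default_rng(0)
vocab = ["<s>", "this", "lm", "is", "cool", "</s>"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 8

E = rng.normal(0, 0.1, (V, d))   # word embeddings
W = rng.normal(0, 0.1, (d, V))   # next-word classifier (one class per word)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def examples():
    # Each prefix of the training sentence predicts the next word.
    ids = [word_to_id[w] for w in "<s> this lm is cool </s>".split()]
    for i in range(1, len(ids)):
        yield ids[:i], ids[i]

def total_loss():
    # Cross-entropy: negative log-probability of each actual next word.
    return -sum(np.log(softmax(E[c].mean(axis=0) @ W)[t]) for c, t in examples())

loss_before = total_loss()
for epoch in range(200):
    for context, target in examples():
        h = E[context].mean(axis=0)       # "encoder": average of embeddings
        p = softmax(h @ W)                # predicted distribution over vocab
        grad = p.copy()
        grad[target] -= 1.0               # cross-entropy gradient (p - one-hot)
        E[context] -= 0.1 * (W @ grad) / len(context)  # update embeddings
        W -= 0.1 * np.outer(h, grad)                   # update classifier
loss_after = total_loss()                 # should be much lower than before
```

The point of the sketch is the loop structure, not the encoder: swapping the averaging step for an RNN or Transformer gives the real architectures, while the predict-compare-update cycle stays the same.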
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
The various neural LMs differ in their choice of basic token (e.g. word, character, or word piece) and encoder. The encoder takes a sequence of word embeddings and returns a single vector representing the corresponding sequence of words (e.g. <i>... tired, I want to</i>). I may have a separate post in the future that focuses on ways to encode text into a vector. For the purpose of this post, let's treat it as a black box function. The following figure illustrates the training (specifically for an encoder based on an <a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#rnns">RNN</a>):<br />
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimd2bTUDeCsElVYpB9QwpkMekUh216Z5AK88ubvOJ_ZBuZq_KP6UNS8cLJBp9TQuIswzG4U_g9Web7tejc1uZaKCpXnLoXcslhghQPhKNCSnQZGc2WfTcJwl6-AuzHc5bnRMReye48XCI/s1600/neural_LM_predict.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="556" data-original-width="698" height="317" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimd2bTUDeCsElVYpB9QwpkMekUh216Z5AK88ubvOJ_ZBuZq_KP6UNS8cLJBp9TQuIswzG4U_g9Web7tejc1uZaKCpXnLoXcslhghQPhKNCSnQZGc2WfTcJwl6-AuzHc5bnRMReye48XCI/s400/neural_LM_predict.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px; text-align: left;"><b>Figure 1: </b>an excerpt of a neural language model in action. The word embeddings are fed in-order to a recurrent neural network (RNN) that represents each prefix of the sentence. The representation of the previous words (the output of the RNN) is fed to a classifier (MLP) that predicts the next word: each word in the vocabulary is a class. During training, the loss function updates the parameters (of the MLP, RNN, and word embeddings) so that it would guess the next word correctly the next time. </td></tr>
</tbody></table>
<br />
The two main advantages of neural LMs over N-gram LMs are:<br />
<br />
(1) N-gram LMs predict the next word based on a history of N-1 words, e.g. given <i>I'm tired, I want to,</i> a 4-gram LM will predict the next word based only on the last 3 words "I want to", completely ignoring the crucial word "tired". N-gram LMs were usually based on small Ns (2-4) (see the post about <a href="https://veredshwartz.blogspot.com/2015/09/language-models.html">N-gram language models</a> for explanation).<br />
<br />
(2) N-gram LMs are based on the statistics of how many times each text appeared in the data, and the match has to be verbatim, i.e. the occurrences of <i>I'm tired </i>are disjoint from those of <i>I'm exhausted</i>. Neural LMs, on the other hand, learn to represent a fragment of text as a vector and to predict the next word based on it. They may generalize across semantically-similar texts by assigning them similar vector representations (resulting in the same prediction).<br />
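Point (2) is easy to see in code: a count-based bigram LM estimates probabilities from verbatim counts, so related phrases share no statistics. A toy sketch with a made-up corpus:

```python
from collections import Counter

# Toy count-based bigram LM: probabilities are maximum-likelihood
# estimates from verbatim counts, so "I'm tired" and "I'm exhausted"
# are counted separately and share no statistics.
corpus = "I'm tired today . I'm tired now . I'm exhausted today .".split()

bigrams = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs
contexts = Counter(corpus[:-1])             # counts of each preceding word

def p_bigram(word, prev):
    # MLE: count(prev, word) / count(prev)
    return bigrams[(prev, word)] / contexts[prev]

p_tired = p_bigram("tired", "I'm")          # 2/3: "I'm tired" appears twice
p_exhausted = p_bigram("exhausted", "I'm")  # 1/3: "I'm exhausted" appears once
```

The model has no way of knowing that "tired" and "exhausted" are near-synonyms; a neural LM with similar embeddings for the two words would treat the contexts similarly.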
<br />
Important note: some LMs today are trained with a different training objective, i.e. not predicting the next word in the sentence. Specifically, BERT has a "masked LM objective", i.e. hiding random words in the sentence and guessing them from their surrounding context - tokens before and after these hidden words. Text GANs (Generative Adversarial Networks) consist of two competing components: a generator that generates human-like text and a discriminator trained to distinguish between human-generated and generator-generated texts. In practice, current GAN-based text generation doesn't perform as well as generation from language models (see <a href="https://www.aclweb.org/anthology/N19-1233">here</a> and <a href="https://arxiv.org/abs/1811.02549">here</a>).<br />
<br />
<h4 dir="ltr" id="eval" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<a href="https://www.blogger.com/null" name="eval">
<span style="color: #666666; font-family: "arial"; font-size: 12pt;">Evaluating text generation</span></a></h4>
<br />
Comparing the performance of two classifiers that were trained to solve the same task is easy - we have a test set with the true label of each data point; we predict the test labels using each model, and compute the accuracy of each model compared to the true labels. We then have exactly two numeric values - the higher the accuracy, the better the model. This is not the case for text generation.<br />
<br />
Since we are talking about open-ended generation, there is no gold standard text the model is expected to produce (we have a test set, but we really just want to make sure the generated text looks like it), so how can we judge the model's quality? The best we can do is to manually examine some of the model outputs and decide whether we think it's a good (human-like?) text or not. To do so more systematically we can perform a more proper human evaluation by showing people texts generated by our model vs. texts generated by some baseline model (or by humans...), asking them to rate which is better, and aggregating across multiple judgements on multiple texts. While this is probably the best evaluation method, it is costly and takes a long time to obtain. As a result, it is usually applied to a relatively small number of texts at the final stages of the model development, and isn't used to validate texts in the intermediate steps (which can potentially help improving the model).<br />
<br />
The alternative and commonly used metric is <a href="https://en.wikipedia.org/wiki/Perplexity">perplexity</a>: by definition, it is the inverse probability of the test set, normalized by the number of words. So we want as low a perplexity score as possible, which means the probability of the test set is maximized - i.e., the LM learned a probability distribution which is similar to the "truth". The test set is just a bunch of texts which the LM has not seen before, and its probability is computed by going over it word by word and computing the LM probability of predicting each word given its past. A good LM will assign high probability to the "correct" (actual) next word and a low probability to other words.<br />
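Equivalently, perplexity is the exponent of the negative average log-probability per word. A small sketch with made-up word probabilities:

```python
import math

# Perplexity as described above: the inverse probability of the test set,
# normalized by the number of words; equivalently, exp of the negative
# average log-probability per word. Lower is better.
def perplexity(word_probs):
    avg_log_prob = sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(-avg_log_prob)

# Probabilities a LM assigned to each actual word of a toy test set
# (invented numbers for illustration).
good_lm = [0.6, 0.4, 0.7, 0.5]    # confident about the actual next words
bad_lm = [0.1, 0.05, 0.2, 0.1]    # spreads its mass over other words

assert perplexity(good_lm) < perplexity(bad_lm)
```

Intuitively, a perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 equally likely next words.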
<br />
Although perplexity is the most common evaluation metric for text generation, it is criticized for various reasons. Mainly because it has been shown that improvement in perplexity doesn't always mean an improvement in applications using the language model (it's basically not a good indicator of quality). And also because perplexity can't be used to evaluate text generation models <a href="https://nlpers.blogspot.com/2014/05/perplexity-versus-error-rate-for.html">that don't produce a distribution of words</a> under the hood, like <a href="https://www.aclweb.org/anthology/N19-1233">GANs</a>. And if you thought that evaluation metrics for non open-ended generation are better, think twice!<sup><a href="#1" name="top1">1</a></sup><br />
<br />
<h4 dir="ltr" id="dangerous" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<a href="https://www.blogger.com/null" name="dangerous">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Are language models dangerous?</span></a></h4>
<br />
In the previous <a href="https://veredshwartz.blogspot.com/2018/09/ethical-machine-learning.html">post</a> I discussed the potential misuses of machine learning models, so the starting point should be that yes, if used by people with malicious intentions, LMs may pose a danger. More specifically, the announcement from OpenAI expressed the concern that such a model, if released, may be used to generate fake news at scale. While this is not completely unreasonable, there are currently two limitations of text generation that may help reduce the fear of LMs enhancing disinformation, at least temporarily.<br />
<div>
<br /></div>
When humans generate fake news, they have certain goals - typically either to promote some propaganda or to maximize ad revenue through clicks. Unlike humans, language models don't have an agenda. The language models mentioned here were designed to generate text that looks realistic, coherent, and topically related given some human-written opening passage. There is no easy way to use them to generate controllable fake news at scale.<br />
<div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjESq4dKSxoFPZukW-L4qtKZkDw-K0yDEAVF86Q7fvHiI1st41W4boOi6z4ZrcUMf4PjOW-_DUBOCb4m078YNxuxw93jQnroLNV2D9T3DUWc0rUHr5wDjjRuwARY2prPYvrMVfEggkd1ic/s1600/fake-news-cartoon.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="386" data-original-width="1165" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjESq4dKSxoFPZukW-L4qtKZkDw-K0yDEAVF86Q7fvHiI1st41W4boOi6z4ZrcUMf4PjOW-_DUBOCb4m078YNxuxw93jQnroLNV2D9T3DUWc0rUHr5wDjjRuwARY2prPYvrMVfEggkd1ic/s1600/fake-news-cartoon.png" width="900" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div>
<br /></div>
The exception is <a href="https://medium.com/ai2-blog/counteracting-neural-disinformation-with-grover-6cf6690d463b">Grover</a>, which was designed to generate controlled text. Specifically, it was designed to generate fake news, controlled by several parameters: domain (e.g. New York Times), date, authors, and headline. Nevertheless, most importantly, the model can be used to discriminate between fake and real news very accurately. It learns to recognize the small differences between text generated by machines and by humans, and it accurately distinguishes them even when the text was generated by another language model. Which leads to the second point: machine-generated text is just not good enough yet (if good equals "human-like"). </div>
<div>
<br /></div>
<div>
Yes, generated text today is quite impressive. It is grammatical and in most cases it doesn't deviate from the topic. But it is not <a href="https://www.aclweb.org/anthology/P19-1598">fact-aware</a> (see how it <a href="https://demo.allennlp.org/gpt2?text=GPT-2%20is%20a%20language%20model%20">continues</a> the following sentence: <i>GPT-2 is a language model</i> ___), it has little common sense (and <a href="https://demo.allennlp.org/gpt2?text=She%20fell%20and%20broke%20her%20leg%20because%20someone%20left%20a%20banana%20peel%20">this one</a>:<i> she fell and broke her leg because someone left a banana peel </i>____), and as previously mentioned, often just doesn't read "human-like". Even when it does and humans can't tell that it's machine-generated, there are models that are good at detecting that. The robots may fail us humans, but not each other 🤖</div>
<div>
<br /></div>
<div>
Fear of disinformation is justified, but at least in its current state, I'm more concerned about the humans involved in it. Those who initiate and generate it, those who spread it with evil intent, and especially the many others who spread it ignorantly and naively. Perhaps, in parallel to the arms race between technology developed for and against disinformation, we can also train humans to think more critically?<br />
<div>
<br /></div>
<br />
<hr style="text-align: right;" />
<div style="direction: ltr;">
<span style="font-size: 90%;">I learned a lot of what I know about text generation pretty recently, thanks to my awesome collaborators on the text GAN evaluation <a href="https://www.aclweb.org/anthology/N19-1233">paper</a> and my teammates at AI2/UW (especially Ari Holtzman and Rowan Zellers). Thanks!</span><br />
<span style="font-size: 90%;"><br /></span>
<span class="Apple-style-span" style="font-size: 90%;"><a href="https://www.blogger.com/null" name="1" style="font-weight: bold;">1</a> The evaluation of non open-ended generation depends on the task, yet suffers from a major issue: the gold standard is a given text, but it may not be the <i>only </i>correct text due to variability in language. In machine translation, for example, the standard evaluation metric is <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a>, which basically compares chunks of text in the reference (gold standard) translation to the system's predicted translation. Various correct translations may differ in their syntactic structure or in the choice of words. Penalizing a model for not predicting the exact sentence that the human translators suggested (and which is found in the test set) is unfair, yet this is the standard way to evaluate machine translation models today. The same issue exists for summarization with the <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE</a> metric. For a much more elaborate discussion on this topic, see Rachael Tatman's <a href="https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213">blog post</a>. <sup><a href="#top1">↩</a></sup></span><br />
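To make this concrete, here is a toy Python sketch of clipped unigram precision, a crude stand-in for BLEU (real BLEU combines higher-order n-grams and a brevity penalty). An equally correct paraphrase scores far below the output that happens to match the single reference:

```python
from collections import Counter

def unigram_precision(reference, candidate):
    """Clipped unigram precision: the fraction of candidate tokens that
    also appear in the reference (each reference token usable once)."""
    ref_counts = Counter(reference.split())
    cand = candidate.split()
    matches = Counter()
    for tok in cand:
        if matches[tok] < ref_counts[tok]:
            matches[tok] += 1
    return sum(matches.values()) / len(cand)

reference  = "the cat sat on the mat"
exact      = "the cat sat on the mat"
paraphrase = "a cat was sitting on a rug"  # an equally valid rendering

print(unigram_precision(reference, exact))       # 1.0
print(unigram_precision(reference, paraphrase))  # ~0.29, unfairly low
```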
<br /></div>
<br /></div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com4tag:blogger.com,1999:blog-9145120678290195131.post-55610565361151181422018-09-14T19:33:00.000+03:002018-09-15T01:10:23.491+03:00Ethical Machine LearningWith machine learning increasingly automating many previously manual decision-making processes, it’s time to reflect not only on the algorithms’ performance but also on the ethical issues involved. Here are some of the questions these concerns raise:<br />
<div>
<br /></div>
<div>
<ul>
<li><b>Fairness</b>: are the outputs of our algorithms fair towards everyone? Is it possible that they discriminate against people based on characteristics such as gender, race, and sexual orientation?<br /></li>
<li>Are developers <b>responsible</b> for potential bad usages of their algorithms?<br /></li>
<li><b>Accountability</b>: who is responsible for the output of the algorithm?<br /></li>
<li><b>Interpretability</b> and <b>transparency</b>: in sensitive applications, can we get an explanation for the algorithm’s decision?<br /></li>
<li>Are we aware of <b>human biases</b> found in our training data? Can we reduce them?<br /></li>
<li>What should we do to use <b>user data</b> cautiously and respect user privacy?</li>
</ul>
<b style="font-weight: normal;"><br /></b>
Attesting to the importance of ethics in machine learning, it is now <a href="https://fairmlclass.github.io/">taught in classes</a>; it has a <a href="https://www.fatml.org/">growing community of researchers</a> working on it, dedicated <a href="http://ethicsinnlp.org/">workshops</a> and <a href="https://sites.google.com/view/srnlp">tutorials</a>, and a Google <a href="https://developers.google.com/machine-learning/fairness-overview/">team</a> entirely devoted to it.</div>
<div>
<br />
We are going to look into several examples.</div>
<div>
<br /></div>
<div>
<h3>
To train or not to train? That is the question</h3>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">Machine learning has evolved dramatically over the past few years, and together with the availability of data, it’s possible to do many things more accurately than before. But prior to considering implementation details, we need to pause for a second and ask ourselves: so, we <i>can</i> develop this model - but <i>should</i> we do it?<br /><br />The models we develop can have bad implications. Assuming that none of my readers is a villain, let’s think in terms of “the road to hell is paved with good intentions”. How can (sometimes seemingly innocent) ML models be used for bad purposes?</span></div>
<ul>
<li>A model that <a href="https://medium.com/@blaisea/do-algorithms-reveal-sexual-orientation-or-just-expose-our-stereotypes-d998fafdf477">detects sexual orientation</a> can be used to out people against their will</li>
<li>A model to <a href="https://arxiv.org/pdf/1611.04135v1.pdf">detect criminality using face images</a> can put innocent people behind bars</li>
<li>A model that <a href="https://en.wikipedia.org/wiki/Deepfake">creates fake videos</a> can be used as âevidenceâ for fake news</li>
<li>A text generation model can be used to generate fake (positive or negative) reviews</li>
</ul>
<br />
In some cases, the answer is obvious (do we <i>really</i> want to determine that someone is a potential criminal based on their looks?). In other cases, it’s not straightforward to weigh all the potential malicious usages of our algorithm against the good purposes it can serve. In any case, it’s worth asking ourselves this question before we start coding.</div>
<div>
<br /></div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://vignette.wikia.nocookie.net/smurfs/images/2/2e/Blue_Plague_Song_Dance.jpg/revision/latest?cb=20120508002530" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="490" data-original-width="640" height="305" src="https://vignette.wikia.nocookie.net/smurfs/images/2/2e/Blue_Plague_Song_Dance.jpg/revision/latest?cb=20120508002530" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Would you develop a model that can recognize smurfs if you knew it could be used by Gargamel? (<a href="https://vignette.wikia.nocookie.net/smurfs/images/2/2e/Blue_Plague_Song_Dance.jpg/revision/latest?cb=20120508002530">Image source</a>)</td></tr>
</tbody></table>
<div>
<br />
<br />
<h3>
Underrepresented groups in the data</h3>
So our model passed the “should we train it” phase and now it’s time to gather some data! What can go wrong in this phase?<br />
<br />
In the <a href="http://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html">previous post</a> we saw some examples of seemingly solved tasks whose models work well only for certain populations. <a href="https://www.newscientist.com/article/2141940-donate-your-voice-so-siri-doesnt-just-work-for-white-men/">Speech recognition</a> works well for white males with an American accent but less so for other populations. <a href="https://arxiv.org/pdf/1707.00061.pdf">Text analysis tools</a> don’t recognize African-American English as English. <a href="https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html">Face recognition</a> works well for white men but far less accurately for dark-skinned women. In 2015, <a href="https://www.theverge.com/2018/1/12/16882408/google-racist-gorillas-photo-recognition-algorithm-ai">Google Photos automatically labelled pictures of black people as “gorillas”</a>. <br />
<br />
The common root of the problem in all these examples is insufficient representation of certain groups in the training data: not enough speech by women, black people, and people with non-American English accents. Text analysis tools are often trained on news text, which is mostly written by adult white males. Finally, not enough facial images of black people. If you think about it, it’s not surprising. This goes all the way back to photographic film, which <a href="https://youtu.be/d16LNHIEJzs">had problems rendering dark skin</a>. I don’t actually think there were bad intentions behind any of these, just maybe - ignorance? We are all guilty of being self-centered, so we develop models that work well for people like us. In the case of the software industry, this mostly means “work well for white males”. </div>
<div>
<br /></div>
<div>
<h3>
Biased supervision</h3>
When we train a model using supervised learning, we train it to perform similarly to humans. Unfortunately, it comes with all the disadvantages of humans, and we often train models to mimic human biases. <br />
<br />
Let’s start with the classic example. Say that we would like to automate the process of mortgage applications, i.e. training a classifier to decide whether or not someone is eligible for a mortgage. The classifier is trained using previous mortgage applications with their human-made decisions (accepted/rejected) as gold labels. It’s important to note that we don’t exactly train a classifier to accurately predict an individual’s ability to pay back the loan; instead, we train a classifier to predict what a human would decide when presented with the application.<br />
<br />
We already know that humans have implicit biases and that sensitive attributes such as race and gender may affect these decisions negatively. For example, in the US, <a href="https://www.nytimes.com/2015/10/31/nyregion/hudson-city-bank-settlement.html">black people are less likely to get a mortgage</a>. Since we don’t want our classifier to learn this bad practice (i.e. rejecting a mortgage merely because the applicant is black), we leave out those sensitive attributes from our feature vectors. The model has no access to these attributes.<br />
<br />
However, analyzing the classifier’s predictions with respect to the sensitive attributes may yield surprising results: for example, that black applicants are less likely than white applicants to be found eligible for a mortgage. The model is biased against black people. How could this happen?<br />
<br />
Apparently, the classifier gets access to the excluded sensitive attributes through included attributes that are correlated with them. For example, the applicant’s address may indicate their race (<a href="https://books.google.co.il/books?id=ZvR4rB3mKB8C&pg=PA259&lpg=PA259&dq=In+the+US,+zip+code+it+is+highly+correlated+with+race&source=bl&ots=bRbSeKIJUa&sig=11GhvewUzXIqHqj014nPuRw4iPA&hl=en&sa=X&ved=2ahUKEwiEtZremuzcAhXLDuwKHVi7DucQ6AEwA3oECAcQAQ#v=onepage&q=In%20the%20US%2C%20zip%20code%20it%20is%20highly%20correlated%20with%20race&f=false">in the US, zip code is highly correlated with race</a>). Things can get even more complicated when using deep learning algorithms on text. We no longer have control over the features the classifier learns. Let’s say that the classifier now gets a textual mortgage application as input. It may be able to detect race through <a href="https://en.wikipedia.org/wiki/African-American_Vernacular_English">writing style and word choice</a>, and this time we can’t even remove specific suspicious features from the classifier. </div>
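A minimal sketch of this leakage (all numbers are hypothetical and purely illustrative): race never appears in the classifier's features, yet a rule learned over a correlated zip code reproduces a racial disparity:

```python
import random

random.seed(0)

def make_applicant():
    """Hypothetical applicant: race is recorded here only for auditing;
    the classifier never sees it. Residential segregation makes zip code
    a strong proxy: most applicants in zip 10001 are black."""
    race = random.choice(["black", "white"])
    if race == "black":
        zip_code = "10001" if random.random() < 0.9 else "10002"
    else:
        zip_code = "10002" if random.random() < 0.9 else "10001"
    return race, zip_code

applicants = [make_applicant() for _ in range(10000)]

def biased_classifier(zip_code):
    """A rule a classifier might learn from biased past decisions.
    Race is not among its inputs -- only the correlated zip code."""
    return "reject" if zip_code == "10001" else "accept"

# Auditing the decisions by race shows they are far from race-neutral:
reject_rate = {}
for race in ("black", "white"):
    group = [z for r, z in applicants if r == race]
    reject_rate[race] = sum(biased_classifier(z) == "reject" for z in group) / len(group)

print(reject_rate)  # black applicants rejected ~90% of the time, white ~10%
```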
<div>
<br /></div>
<div>
<h4>
Adversarial Removal</h4>
What can we do? We can try to actively remove anything that indicates race. <br />
<br />
We have a model that gets as input a mortgage application (X), learns to represent it as f(X) (f encodes the application text, or extracts discrete features), and predicts a decision (Y) - accept or reject. We would like to remove information about some sensitive feature Z, in our case race, from the intermediate representation f(X). <br />
<br />
This can be done by jointly training a second classifier, an “adversarial” classifier, which tries to predict race (Z) from the representation f(X). The adversary’s goal is to predict race successfully, while at the same time, the main classifier aims both to predict the decision (Y) with high accuracy and to fail the adversary. To fail the adversary, the main classifier has to learn a representation function f which does not include any signal pertaining to Z.<br />
<br />
The idea of removing features from the representation using adversarial training was presented in <a href="https://arxiv.org/abs/1505.07818">this paper</a>. Later, <a href="https://arxiv.org/abs/1707.00075">this paper</a> used the same technique to remove <i>sensitive</i> features. Finally, <a href="https://arxiv.org/pdf/1808.06640.pdf">this paper</a> experimented with textual input, and found that demographic information about the authors is indeed encoded in the latent representation. Although they managed to “fail” the adversary (as the architecture requires), a post-hoc classifier trained on the encoded texts could still detect race fairly well. They concluded that adversarial training isn’t reliable for completely removing sensitive features from the representation.</div>
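Below is a toy numpy sketch of this adversarial setup (hypothetical data, a linear encoder, and logistic heads, far simpler than the architectures in the cited papers). The encoder update follows the main head's gradient but reverses the adversary's, and the sketch compares the adversary's accuracy with and without the reversal:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Hypothetical data: y (the decision) depends on features 0-2 only;
# feature 3 is a near-perfect proxy for the sensitive attribute z.
n, d, h = 1000, 4, 8
z = rng.integers(0, 2, n).astype(float)
x = rng.normal(size=(n, d))
x[:, 3] = z + 0.1 * rng.normal(size=n)
y = (x[:, :3].sum(axis=1) > 0).astype(float)

def train(lam, steps=1500, lr=0.3):
    """Jointly train encoder f(X) = xW, main head wy, adversary head wz.
    lam > 0 reverses the adversary's gradient into the encoder."""
    rng_ = np.random.default_rng(1)
    W = rng_.normal(scale=0.1, size=(d, h))
    wy = rng_.normal(scale=0.1, size=h)
    wz = rng_.normal(scale=0.1, size=h)
    for _ in range(steps):
        f = x @ W
        gy = (sigmoid(f @ wy) - y) / n  # d(loss_y) / d(logit_y)
        gz = (sigmoid(f @ wz) - z) / n  # d(loss_z) / d(logit_z)
        grad_W_y = x.T @ (gy[:, None] * wy[None, :])
        grad_W_z = x.T @ (gz[:, None] * wz[None, :])
        wy -= lr * (f.T @ gy)           # main head minimizes its loss
        wz -= lr * (f.T @ gz)           # adversary minimizes its loss
        # Encoder follows the y-gradient but REVERSES the z-gradient,
        # pushing f(X) to stay useful for y yet uninformative about z.
        W -= lr * (grad_W_y - lam * grad_W_z)
    f = x @ W
    main_acc = ((sigmoid(f @ wy) > 0.5) == y).mean()
    adv_acc = ((sigmoid(f @ wz) > 0.5) == z).mean()
    return main_acc, adv_acc

main_plain, adv_plain = train(lam=0.0)  # no removal
main_rev, adv_rev = train(lam=1.0)      # adversarial removal
print(f"no removal:   main={main_plain:.2f} adversary={adv_plain:.2f}")
print(f"with removal: main={main_rev:.2f} adversary={adv_rev:.2f}")
```

Consistent with the finding of the last paper above, in a sketch like this the reversal typically weakens the adversary rather than reducing it all the way to chance.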
<div>
<br /></div>
<div>
<br /></div>
<div>
<h3>
Biased input representations</h3>
We’re living in an amazing time with positive societal changes. I’ll focus on one example that I personally relate to: gender equality. Every once in a while, my father emails me an article about some successful woman (CEO/professor/entrepreneur/etc.). He is genuinely happy to see more women in these jobs because he remembers a time when there were almost none. As for me, I wish for a time when this will be a non-issue - when my knowledge that women <i>can</i> do these jobs and the number of women <i>actually doing</i> these jobs finally make sense together.</div>
<div>
<br /></div>
<div>
In the <a href="http://ethicsinnlp.org/ethnlp-2017">Ethics in NLP workshop at EACL 2017</a>, Joanna Bryson distinguished between <a href="https://twitter.com/j2bryson/status/849549737766289408">3 related terms</a>: <i>bias</i> is knowing what “doctor” means, including that more doctors are male than female (if someone tells me they’re going to the doctor, I normally imagine they’re going to see a male doctor). <i>Stereotype</i> is thinking that doctors <b>should</b> be male (and consequently, that women are unfit to be doctors). Finally, <i>prejudice</i> is only using (going to / hiring) male doctors. The thing is, while we as humans--or at least some of us--can distinguish between the three, algorithms can’t tell the difference.</div>
<div>
<br /></div>
<div>
One of the places where this algorithmic inability to tell bias from stereotype shows up is word embeddings. We discussed in <a href="http://veredshwartz.blogspot.com/2017/03/women-in-stem.html">a previous post</a> <a href="https://arxiv.org/abs/1606.06121">this paper</a>, which showed that word embeddings capture gender stereotypes. For instance, when using embeddings to solve analogy problems (a toy problem often used to evaluate the quality of word embeddings), they may suggest that <i>father</i> is to <i>doctor</i> as <i>mother</i> is to <i>nurse</i>, and that <i>man</i> is to <i>computer programmer</i> as <i>woman</i> is to <i>homemaker</i>. This happens because, statistically, there are more female nurses and homemakers and more male doctors and computer programmers than vice versa, which is reflected in the training data.<br />
<br /></div>
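A toy sketch of how such analogies are computed (hand-crafted 2-d vectors that exaggerate the corpus statistics; real embeddings are learned from corpora and have hundreds of dimensions):

```python
import math

# Hand-crafted toy embeddings (hypothetical, for illustration only):
# dimension 0 ~ "gender" (female = +1, male = -1), dimension 1 ~ "medical".
emb = {
    "father":  [-1.0, 0.0],
    "mother":  [ 1.0, 0.0],
    "doctor":  [-0.7, 1.0],  # skewed male in the (imagined) corpus
    "surgeon": [-0.5, 1.0],  # skewed male
    "nurse":   [ 0.8, 1.0],  # skewed female
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(a, b, c):
    """a is to b as c is to ? -- nearest neighbor of v(b) - v(a) + v(c)."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = set(emb) - {a, b, c}
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("father", "doctor", "mother"))  # 'nurse'
```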
<div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibwUKFWR0AXfwxr0pkoAc34XoJqdbsEWrmGKs1_iE0IwnyP9OZcXu77CCDjEGxgvGlzONdKh-_mhKKLbm-GDKMn-Dy_gwgjj5yqYm1YvFhhC7TAwblHvJOuQKGDZVKgJg1iJsiAA4VSX8/s1600/doctor-nurse.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="416" data-original-width="1600" height="166" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibwUKFWR0AXfwxr0pkoAc34XoJqdbsEWrmGKs1_iE0IwnyP9OZcXu77CCDjEGxgvGlzONdKh-_mhKKLbm-GDKMn-Dy_gwgjj5yqYm1YvFhhC7TAwblHvJOuQKGDZVKgJg1iJsiAA4VSX8/s640/doctor-nurse.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 12.8px;">Google image search for “doctor” (left) and “nurse” (right): there are many more female than male nurse images. </td><td class="tr-caption" style="font-size: 12.8px;"></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<br />
However, we treat word embeddings as representing meaning. By doing so, we engrave “male” into the meaning of “doctor” and “female” into the meaning of “nurse”. These embeddings are then commonly used in applications, which might inadvertently amplify these unwanted stereotypes.</div>
<div>
<br />
The suggested solution in that paper was to “debias” the embeddings, i.e., to try to remove the bias from them. There are two problems with this approach: first, you can only remove biases that you are aware of. Second, and I find this worse, it removes some of the characteristics of a concept. As opposed to removing sensitive features from classification models, where the removed features (e.g. race) have nothing to contribute to the classification, here we are removing an important part of a word’s meaning. We still want to know that most doctors are men; we just don’t want a meaning representation in which <i>woman</i> and <i>doctor</i> are incompatible concepts.<br />
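A minimal sketch of the kind of projection such debiasing performs (toy 2-d vectors, hypothetical numbers; the actual paper estimates a gender direction from word pairs and includes further steps). It also illustrates the information loss described above: after removing the gender component, doctor and nurse become indistinguishable:

```python
def project_out(v, g):
    """Remove from vector v its component along the unit direction g."""
    dot = sum(a * b for a, b in zip(v, g))
    return [a - dot * b for a, b in zip(v, g)]

gender = [1.0, 0.0]   # hypothetical unit "she - he" direction (dim 0)
doctor = [-0.7, 1.0]  # leans "male" along dim 0
nurse  = [ 0.8, 1.0]  # leans "female" along dim 0

print(project_out(doctor, gender))  # [0.0, 1.0]
print(project_out(nurse, gender))   # [0.0, 1.0] -- now identical to doctor
```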
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;"><br class="kix-line-break" /></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;"><br /></span></div>
<b style="font-weight: normal;"><br /></b>
The sad and trivial take-home message is that algorithms only do what we teach them to do, so “racist algorithms” (e.g. the <a href="https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist">Microsoft chatbot</a>) are only racist because they learned it from people. If we want machine learning to help build a better reality, we need to research not just techniques for improved learning, but also ways to teach algorithms what <b>not</b> to learn from us.</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com10tag:blogger.com,1999:blog-9145120678290195131.post-63285960973420044382018-08-14T21:21:00.001+03:002018-08-15T13:35:33.527+03:00Deep Learning in NLP<div style="line-height: 1.38; margin-bottom: 6pt; margin-top: 20pt;">
This post is an old debt. Since I started this blog 3 years ago, I’ve been refraining from writing about deep learning (DL), except for occasionally discussing a method that uses it, without going into details. It’s a challenge to explain deep learning using simple concepts without remaining at a very high level. But perhaps worse, DL somewhat contradicts the description of my blog: “human-<i>interpretable</i> computer science”. We don’t really know how or why it works, and attempts to interpret it still only scratch the surface. On the other hand, it has been so prominent in NLP in the last few years (Figure 1) that it’s no longer reasonable to ignore it in a blog about NLP. So here’s my attempt to talk about it. </div>
<div style="line-height: 1.38; margin-bottom: 6pt; margin-top: 20pt;">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0MTCaIo4Tf1OsYC9jFkxHHviKrTTtHG3lCSRdT8knU43CaT-mR11-cIp9C7-ILhFOShwgN9uqAqtIbU70vFUJsZXoRQ_Y2Vazkh4diSiaUUsL7xOv3-0CTESGn_hDggLsuy7zKzmxFlc/s1600/word_cloud_ACL.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="682" data-original-width="648" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0MTCaIo4Tf1OsYC9jFkxHHviKrTTtHG3lCSRdT8knU43CaT-mR11-cIp9C7-ILhFOShwgN9uqAqtIbU70vFUJsZXoRQ_Y2Vazkh4diSiaUUsL7xOv3-0CTESGn_hDggLsuy7zKzmxFlc/s400/word_cloud_ACL.png" width="380" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: left;">Figure 1: word cloud of the words in the titles of the accepted papers of ACL 2018, from <a href="https://acl2018.org/2018/07/31/conference-stats/">https://acl2018.org/2018/07/31/conference-stats/</a>. Note the prevalence of deep learning-related words such as “embeddings”, “network”, “rnns”.</td></tr>
</tbody></table>
</div>
<div>
<br /></div>
This writing is based on many resources and most of the material is a summary of the main points I have taken from them. Credit is given in the form of linking to the original source. I would be happy to get corrections and comments!<br />
<div>
<br />
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#meant_for">Who is this post meant for?</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#what_is_not">What this post is not</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#lets_talk">Let’s talk about deep learning already!</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#improved">What has improved?</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#going_deep">Going Deep</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#rep_learning">Representation Learning</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#rnns">Recurrent Neural Networks</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#bad">What is not yet working perfectly?</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#amount">The Need for Unreasonable Amount of Data</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#overfitting">The Risk of Overfitting</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#artificial_data">The Artificial Data Problem</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#we">The Shaky Ground of Distributional Word Embeddings</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#cram">The Unsatisfactory Representations Beyond the Word Level</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#robustness">The Non Existing Robustness</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#interpretability">The Lack of Interpretability</a>
<br />
<h3 dir="ltr" id="meant_for" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 13.999999999999998pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Who is this post meant for?</span></h3>
As always, this post will probably be too basic for most NLP researchers, but you’re welcome to distribute it to people who are new to the field! And if you’re like me and you enjoy reading people’s views about things you already know, you’re welcome to read too!<br />
<h3 dir="ltr" id="what_is_not" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 13.999999999999998pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">What this post is not</span></h3>
This post is NOT an extensive list of all the most up-to-date recent innovations in deep learning for NLP; I don’t know many of them myself. I tried to remain at a relatively high level, so there are no formulas or formal definitions. If you want an extensive, detailed overview of how deep learning methods are used in NLP, I strongly recommend Yoav Goldberg’s “<a href="https://www.morganclaypool.com/doi/abs/10.2200/S00762ED1V01Y201703HLT037">Neural Network Methods for Natural Language Processing</a>” book. This post will also not teach you anything practical. If you’re looking to learn DL for NLP in 3 days and strike a fortune, there are tons of useful guides on the web. Not this post!<br />
<h3 dir="ltr" id="lets_talk" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: “arial”; font-size: 13.999999999999998pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Let’s talk about deep learning already!</span></h3>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
OK! Deep learning is a subfield of machine learning. As with machine learning, the goal is to give computers the ability to “learn” to solve some task without being explicitly programmed to solve it, but rather by using data and applying statistical methods to it. In the common case of <a href="http://veredshwartz.blogspot.com/2015/08/supervised-learning.html">supervised learning</a>, the idea is to make the computer estimate some function that takes some input and returns a prediction with respect to some question. I know this is very vague, so let’s take the task of <a href="https://en.wikipedia.org/wiki/Named-entity_recognition">named entity recognition</a> (NER) as an example. The task is to determine, for each word in a sequence of words (say, a sentence), whether it is part of a named entity, and if so, of which type. In the text “<i>Prince Harry attended a wedding with a hole in his shoe</i>” (I didn’t make <a href="https://www.harpersbazaar.com/celebrity/latest/a22665613/prince-harry-hole-in-shoe-wedding/">it</a> up), the words “Prince” and “Harry” should be classified as PERSON, and the other words as NONE. A machine learning algorithm expects as input a set of properties pertaining to each word (called a “feature vector”), and predicts a label for this word, out of a set of possible labels (e.g. PERSON, ORGANIZATION, LOCATION, NONE).<br />
<br />
In the context of NLP, machine learning has been used for a long time, so what is new? Before we go through the differences, let’s think of the characteristics of a traditional machine learning algorithm, specifically a supervised classifier. In general, a traditional supervised classification pipeline was as follows.</div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<br />
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"><br /></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnG0obRAlKDd7clZgextOaEAvnOEIqd_JFhSsg_eO0qliEHAot5Y9u8IhtGzCa0v_nlckJsosOjkcWau33ivHYe20SrIvgkzPo8KBeOQwlD7ezrl1NoXSt-CmKbtlVkv3Ao9juy24ONrU/s1600/traditional_ML.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="335" data-original-width="977" height="136" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnG0obRAlKDd7clZgextOaEAvnOEIqd_JFhSsg_eO0qliEHAot5Y9u8IhtGzCa0v_nlckJsosOjkcWau33ivHYe20SrIvgkzPo8KBeOQwlD7ezrl1NoXSt-CmKbtlVkv3Ao9juy24ONrU/s400/traditional_ML.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 12.8px;"><span style="font-size: 12.8px;">Figure 2: pipeline of a traditional supervised classifier - extracting features, then approximating the function.</span></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
</div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
The input is raw (e.g. text), along with the gold labels. Phase (1) is the extraction of human-designed features from the raw input, representing it as vectors of features. Going back to the NER example, the person designing the learning algorithm would ask themselves: “what can indicate the type of entity of a given word?”. For example, a capitalized word is a good indication that a word is part of a named entity in general. When the previous word is “The”, it can hint that the entity is an organization rather than a person. Someone had to come up with all these ideas. Here is an example of features that were previously used for NER:</div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<br />
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"><br /></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpUxdZ_bQ2BDmKwW398sNAnjKLi8ST9NqKbCEhtJ31gYZ57wYzqEMAFQZLv4v7h2w1G6ZQoPq89QvLnJcBONBZOP5rXtyH51tHv5FcavLl4i06w9SlQJ4rp3iB_BWomDvphrnJ2Bu_05o/s1600/ner_features.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="228" data-original-width="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpUxdZ_bQ2BDmKwW398sNAnjKLi8ST9NqKbCEhtJ31gYZ57wYzqEMAFQZLv4v7h2w1G6ZQoPq89QvLnJcBONBZOP5rXtyH51tHv5FcavLl4i06w9SlQJ4rp3iB_BWomDvphrnJ2Bu_05o/s1600/ner_features.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: left;">Figure 3: features for NER, from <a href="https://nlp.stanford.edu/manning/papers/gibbscrf3.pdf">this paper</a>. The example is taken from Stanford <a href="http://cs224d.stanford.edu/">CS224d</a> course: Deep Learning for Natural Language Processing. </td></tr>
</tbody></table>
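To make phase (1) concrete, here is a minimal sketch (in Python, with made-up feature names) of what hand-designed feature extraction might look like; real feature sets, like the one in figure 3, were much richer:

```python
# A toy sketch of hand-designed NER feature extraction. The feature names
# and the helper itself are hypothetical, for illustration only.
def extract_features(words, i):
    """Return a feature dict for the word at position i in the sentence."""
    word = words[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[0].isupper(),
        "prev_word": words[i - 1].lower() if i > 0 else "<START>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "<END>",
        "suffix3": word[-3:].lower(),   # e.g. "-ion" hints at common nouns
    }

sentence = "The Post Office opened".split()
feats = extract_features(sentence, 1)   # features for "Post"
```

Every such feature had to be invented, implemented, and validated by a person before any learning happened.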
<span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline;"><br /></span>
Phase (2) is the learning/training phase, in which the computer tries to approximate a function that takes as input the feature vectors and predicts the correct labels. The function is now a function of the features, i.e., in the simple case of a <a href="https://en.wikipedia.org/wiki/Linear_classifier">linear classifier</a>, it assigns a weight for each feature with respect to each label. Features which are highly indicative of a label (e.g. previous word = The for ORGANIZATION) would be assigned a high weight for that label, and those which are highly indicative of not belonging to that label would be assigned a very low weight. The final prediction is done by summing up all the weighted feature values, and choosing the label that got the highest score.</div>
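The prediction step of such a linear classifier can be sketched as follows (the weights here are hand-picked toy values, not learned ones):

```python
# A toy sketch of a linear classifier's prediction: sum the weighted
# feature values per label, then choose the highest-scoring label.
def predict(feature_vector, weights, labels):
    scores = {}
    for label in labels:
        scores[label] = sum(weights[label].get(f, 0.0) * value
                            for f, value in feature_vector.items())
    return max(scores, key=scores.get)

labels = ["PERSON", "ORGANIZATION"]
weights = {
    # "prev_word=The" is highly indicative of ORGANIZATION,
    # and slightly indicative of *not* being a PERSON.
    "PERSON": {"is_capitalized": 1.0, "prev_word=The": -0.5},
    "ORGANIZATION": {"is_capitalized": 1.0, "prev_word=The": 2.0},
}
x = {"is_capitalized": 1.0, "prev_word=The": 1.0}
predict(x, weights, labels)  # → "ORGANIZATION"
```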
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
Note that I’m not discussing <i>how</i> these weights are learned from the training examples and the gold labels. This is not super important for the current discussion, and you can read about it elsewhere (e.g. <a href="https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/">here</a>).</div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
Now let’s talk about the differences between traditional machine learning and deep learning.</div>
<h3 dir="ltr" id="improved" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 13.999999999999998pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">What has improved?</span></h3>
First of all, it works, empirically. The state-of-the-art performance in nearly every NLP task today is achieved by a neural model. Literally every task listed on this website “<a href="https://nlpprogress.com/">Tracking Progress in Natural Language Processing</a>” has a neural model as the best performing model. Why is it working so much better than previous approaches?<br />
<h4 dir="ltr" id="going_deep" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Going Deep</span></h4>
“Deep learning” is a buzzword referring to deep neural networks - powerful learning algorithms inspired by the brain’s computation mechanism. A simple single-layered feed-forward neural network is basically the same as the linear classifier we discussed. A deep neural network is a network that contains one or more hidden layers, which are also learned. This is the visual difference between a linear classifier (left, a neural network with no hidden layers) and a neural network with a single hidden layer (right):</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3R18AXB7EP2nJmd0ruOB0S8pyPdA6ATUZ7rowfyOPBTdPZjYrr5gAUmqpV3uPCZQwEJO7fAZ8WWL8BjZm3gmYSFjFV3WHAkuqgWYvSKP_5othx3uq0as5h9cAgyoCMKDFSJMQMMCPiR8/s1600/shallow_vs_deep.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="314" data-original-width="900" height="138" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3R18AXB7EP2nJmd0ruOB0S8pyPdA6ATUZ7rowfyOPBTdPZjYrr5gAUmqpV3uPCZQwEJO7fAZ8WWL8BjZm3gmYSFjFV3WHAkuqgWYvSKP_5othx3uq0as5h9cAgyoCMKDFSJMQMMCPiR8/s400/shallow_vs_deep.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 4: shallow neural network (left) vs. deep (1 hidden layer) neural network (right).</td></tr>
</tbody></table>
<div>
<br />
In general, the deeper the network is, the more complex functions it can estimate, and the better it can (theoretically) approximate them.</div>
<div>
<br /></div>
<div>
In mathematical notation, each layer is a multiplication by another learned matrix, but more importantly, it also goes through a non-linear activation function (e.g. the <a href="https://en.wikipedia.org/wiki/Hyperbolic_function">hyperbolic tangent</a> or the simple <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">rectifier</a>, which sets negative values to zero). Some functions can’t be approximated accurately enough using linear models. I’m deliberately not going into details, but you can read about linear separability <a href="http://www.ece.utep.edu/research/webfuzzy/docs/kk-thesis/kk-thesis-html/node19.html">here</a>. With the help of the multiple layers and the non-linear activations, neural networks can better approximate the functions needed for many tasks, resulting in improved performance.</div>
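As a minimal illustration (with random toy matrices standing in for learned parameters, not a trained model), the forward pass of a one-hidden-layer network from figure 4 looks like this:

```python
import numpy as np

# A sketch of a one-hidden-layer forward pass: two learned matrices
# with a non-linear activation (the rectifier, ReLU) in between.
def relu(z):
    return np.maximum(z, 0)       # negative values become zero

def forward(x, W1, b1, W2, b2):
    h = relu(W1 @ x + b1)         # hidden layer: matrix multiply + non-linearity
    return W2 @ h + b2            # output layer: one score per label

rng = np.random.default_rng(0)
x = rng.normal(size=4)                              # a 4-dimensional input
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)       # input -> 8 hidden units
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)       # hidden -> 3 label scores
scores = forward(x, W1, b1, W2, b2)
```

Removing the hidden layer (keeping only `W2 @ x + b2`-style computation) collapses this back into the linear classifier on the left of figure 4.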
<div>
<br /></div>
<div>
<h4 dir="ltr" id="rep_learning" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Representation Learning</span></h4>
Another key aspect that changed is how the input is represented. Traditional machine learning methods worked because someone designed a well-thought feature vector to represent the inputs, as we can see in figure 3. Deep learning obviates the need to come up with a meaningful representation, and learns a representation from raw input (e.g. words). The new pipeline looks like this:</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBwXycVyn9dLCldCpZb-qo041-L-5_1qeG6EDUL5vjekYm5Pt1xp2udkCCAVmvJQgxbN7RWcwCq3kuHTmyHIcRZGFC76Mx6dtXOpvmKTyK6LOsW2r2_U1XTQIYd_z1WuI7r2BICbjG3Ps/s1600/deep_learning.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="327" data-original-width="748" height="173" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBwXycVyn9dLCldCpZb-qo041-L-5_1qeG6EDUL5vjekYm5Pt1xp2udkCCAVmvJQgxbN7RWcwCq3kuHTmyHIcRZGFC76Mx6dtXOpvmKTyK6LOsW2r2_U1XTQIYd_z1WuI7r2BICbjG3Ps/s400/deep_learning.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 12.8px;"></td><td class="tr-caption">Figure 5: pipeline of a deep supervised classifier - learning both the input representation and the other model parameters.</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
Now, rather than feeding the learning algorithm hand-engineered feature vectors, we only give it the raw text. In the case of NER, we can now just feed as features a window of words around each target word. For example, in the sentence “<i>John worked at the Post Office in the city until last year</i>”, the feature vector of the target word “Office” with a window of size 3 would be [Post, Office, in]. The learning algorithm, in addition to learning the network parameters (the function from representation to output), as it did before, also learns the word representations suitable for the task at hand. In other words, one of the additional parameters that the network learns is the word embeddings. We can think of it as a lookup table whose index is a word (string) and its output is a vector.</div>
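A minimal sketch of the lookup table idea, with toy random vectors standing in for learned embeddings:

```python
import numpy as np

# The embedding "lookup table": each vocabulary word maps to a vector.
# The vectors here are random placeholders; in a real network they are
# parameters, updated during training like any other weights.
dim = 5
vocab = ["john", "worked", "at", "the", "post", "office", "in"]
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=dim) for w in vocab}

# The window [Post, Office, in] becomes one concatenated input vector.
window = ["post", "office", "in"]
x = np.concatenate([embeddings[w] for w in window])   # 3 * 5 = 15 dimensions
```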
<div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
It is common to initialize this lookup table with pre-trained word embeddings. We discussed word embeddings in <a href="http://veredshwartz.blogspot.com/2016/01/representing-words.html">this blog post</a>. They are trained using a large text collection, based on a linguistic hypothesis which states that words with similar meanings appear in the same contexts (next to the same “neighbour” words). Pre-trained embeddings are useful because they are often trained on a lot more data than is available for the end task itself, and the more data, the higher-quality the vectors.</div>
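Similarity between embedding vectors is typically measured with cosine similarity. Here is a toy illustration (the vectors are invented for the example, not real pre-trained embeddings):

```python
import numpy as np

# Cosine similarity: 1 for vectors pointing the same way, -1 for opposite.
def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors: "good" and "bad" share contexts ("the food was ___"),
# so distributional training places them close together.
good = np.array([0.9, 0.8, 0.1])
bad = np.array([0.85, 0.75, 0.2])
piano = np.array([-0.3, 0.1, 0.9])

cosine(good, bad) > cosine(good, piano)  # → True
```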
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
On the other hand, pre-trained embeddings provide a general notion of “similarity” between words, which is not necessarily the same similarity our task needs. Think for example about <a href="https://en.wikipedia.org/wiki/Sentiment_analysis">sentiment analysis</a>. Let’s say we are trying to predict the sentiment of restaurant reviews. Generic pre-trained embeddings will tell us that <i>good</i> and <i>bad</i> are highly similar, because they appear near the same words. But in the context of sentiment analysis we’d like their vectors to be further apart. The developer can choose between initializing the lookup table with the pre-trained embeddings or randomly, and also between updating the embeddings as additional network parameters, to fit the task better, or keeping them fixed. We’ll touch upon that when we discuss overfitting. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Back to the NER example, this is what a network with a window size of 3 would look like:</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaohGoIfG6ZwK_pm5piWkM1LwJgmP2dKhk8nJRkB7_KXkfMTQvh9DExqbLOuIVuzU-z3bP0jsp8HobIn0BP4ZKS9OAcPCITgF9zK62U9yVjAmYCh6DjebnEGvm1_wKbD5MTjFdAwUSvpQ/s1600/ner_nn.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="596" data-original-width="766" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaohGoIfG6ZwK_pm5piWkM1LwJgmP2dKhk8nJRkB7_KXkfMTQvh9DExqbLOuIVuzU-z3bP0jsp8HobIn0BP4ZKS9OAcPCITgF9zK62U9yVjAmYCh6DjebnEGvm1_wKbD5MTjFdAwUSvpQ/s400/ner_nn.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 6: a neural network for NER, which uses a window of 3 words to classify a word.</td></tr>
</tbody></table>
</div>
<div>
<br /></div>
<div>
In a traditional model, the feature “next word” is a discrete variable that can accept as its value any word in the vocabulary. Let’s say that during training, the model encountered a word like “ltd” many times as the next word of an organization, and figured that it is a good indication of the ORGANIZATION class. If during test time the model needs to classify a word followed by “Inc”, it may have no information about this word, and can’t generalize the knowledge about the similar word “ltd”. When the feature vector is composed of word embeddings, since the word embeddings of “ltd” and “Inc” are similar, the inputs are now similar, and the model can use knowledge about similar words to output the correct prediction.<br />
<br />
<div>
<h4 dir="ltr" id="rnns" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Recurrent Neural Networks</span></h4>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
The network we discussed so far is called a feed-forward neural network. In the NER example we used a fixed-size window of words around each target word. But what if we want to use the entire sentence? For example, in the sentence “<i>John worked at the Post Office in the city until last year, and hated this organization</i>” it is beneficial for the model to be aware of the last word “organization” while predicting the labels of “Post” and “Office”. One problem is that we don’t know in advance how many words each input sentence will contain. Another problem is that the representation of each word in the window model is independent of the other words - and we would like the representation to be more contextualized. For instance, the representation of “post” in the context of “post office” should be different from its representations in “blog post” or “post doc”.<br />
<br />
Recurrent neural networks (RNNs) solve both problems. An RNN is a network that takes as input a sequence (e.g. a sentence as a sequence of words, or a sequence of characters, etc.) and outputs vectors representing each subsequence (i.e. the first item, the first two items, ..., the entire sequence). At each time step, the RNN considers both the previous memory (of the sequence up to the previous input) and the current input. The last output vector can be considered as representing the entire sequence, like a sentence embedding. Intermediate vectors can represent a word in its context.<br />
<br />
The output vectors can then be used for many purposes, among them: representation - a fixed-size vector representation for arbitrarily long texts; feature vectors for classification (e.g. representing a word in context, and predicting its NER tag); or generating new sequences (e.g. in translation, a sequence-to-sequence or seq2seq model encodes the source sentence and then decodes, i.e. generates, the translation). In the case of the NER example, we can now use the output vector corresponding to each word as the word’s feature vector, and predict the label based on the entire preceding context.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBNHM7ZyO51YnKS0BO1QE7sbF3hoQJsiaGPlI9tuzPlmSHMo6DVq1n6JmwZXnoZh-iw_SIYK1vX93tnFtXwHmu3wBPYvu6ub5WNjTLTfMkkArgHqEiAJ_QzbAO6ITQZVJh_XQ2tJCiSU0/s1600/ner_rnn.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="338" data-original-width="958" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBNHM7ZyO51YnKS0BO1QE7sbF3hoQJsiaGPlI9tuzPlmSHMo6DVq1n6JmwZXnoZh-iw_SIYK1vX93tnFtXwHmu3wBPYvu6ub5WNjTLTfMkkArgHqEiAJ_QzbAO6ITQZVJh_XQ2tJCiSU0/s640/ner_rnn.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 7: a NER model that uses an RNN to represent each word in its context.<span style="white-space: pre;"> </span></td></tr>
</tbody></table>
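The recurrence at the heart of an RNN can be sketched in a few lines (a simple Elman-style RNN with toy random weights; real implementations like LSTMs have a more elaborate internal memory):

```python
import numpy as np

# At each time step: combine the previous hidden state (the "memory")
# with the current input embedding, through a non-linearity.
def rnn(inputs, W_h, W_x, b):
    h = np.zeros(W_h.shape[0])              # initial (empty) memory
    outputs = []
    for x in inputs:                        # one word embedding per step
        h = np.tanh(W_h @ h + W_x @ x + b)  # new memory = f(old memory, input)
        outputs.append(h)                   # h represents the prefix so far
    return outputs                          # outputs[-1] ~ sentence embedding

rng = np.random.default_rng(0)
dim, hidden = 4, 6
sentence = [rng.normal(size=dim) for _ in range(5)]   # 5 toy "word" vectors
W_h = rng.normal(size=(hidden, hidden))
W_x = rng.normal(size=(hidden, dim))
outs = rnn(sentence, W_h, W_x, np.zeros(hidden))
```

Note that the same three weight matrices are reused at every time step, which is what lets the network handle sentences of any length.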
</div>
<div>
<br /></div>
Technical notes: an LSTM (long short-term memory) is a specific type of RNN that works particularly well and is commonly used. The differences between various RNN architectures are in the implementation of the internal memory. A bidirectional RNN/LSTM or a biLSTM processes the sequences from both sides - right to left and left to right, such that each output vector contains information pertaining to both the previous and the subsequent items in the sequence. For a much more complete overview of RNNs, I refer you to Andrej Karpathy's blog post <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">"The Unreasonable Effectiveness of Recurrent Neural Networks"</a>.</div>
<div>
<br /></div>
<div>
<br />
So these are the main differences. There are many other new techniques on top of them, such as <a href="https://talbaumel.github.io/blog/attention/">attention</a>, <a href="https://medium.freecodecamp.org/an-intuitive-introduction-to-generative-adversarial-networks-gans-7a2264a81394">Generative Adversarial Networks</a> (GAN), <a href="https://www.cs.ucsb.edu/~william/papers/ACL2018DRL4NLP.pdf">deep reinforcement learning</a>, and <a href="http://ruder.io/multi-task/">multi-task learning</a>, but we won’t discuss them.<br />
<br />
Interestingly, neural networks are not a new idea. They have been around since the 1950s, but have become increasingly popular in recent years thanks to the advances in computing power and the amount of available text on the web.<br />
<br />
<h3 dir="ltr" id="bad" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 13.999999999999998pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">What is not yet working perfectly?</span></h3>
Although the popular media likes to paint an optimistic picture of “<a href="http://www.itpro.co.uk/business-intelligence/artificial-intelligence-google/26540/google-to-change-ai-forever-with-open">AI is solved</a>” (or rather, a very pessimistic, false picture of “<a href="https://www.forbes.com/sites/tonybradley/2017/07/31/facebook-ai-creates-its-own-language-in-creepy-preview-of-our-potential-future/#1d1be7b2292c">AI is taking over humanity</a>”), in practice there are still many limitations to current deep methods. Here are several of them, mostly from the point of view of someone who works on semantics (feel free to add more limitations in the comments!).<b style="font-weight: normal;"><br /></b>
<br />
<h4 dir="ltr" id="amount" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">The Need for an Unreasonable Amount of Data</span></h4>
It’s difficult to convey the challenges in our work to people outside the NLP field (and related fields), when there are already many products out there that perform so well. Indeed, we have come a long way to be able to generally claim that some tasks are “solved”. This is especially true of low-level <a href="http://veredshwartz.blogspot.com/2016/06/linguistic-analysis-of-texts.html">text analysis</a> tasks such as part of speech tagging. But the performance on more complex semantic tasks like machine translation is surprisingly good as well. So what is the secret sauce of success, and why am I still unsatisfied?</div>
<div>
<br />
First, let’s look at a few examples of tasks “solved” by DL, not necessarily in NLP:</div>
<div>
<ul>
<li><u>Automatic Speech Recognition (ASR)</u> - also known as speech-to-text. Deep learning-based methods reported human-level performance last year, but this interesting <a href="https://awni.github.io/speech-recognition/">blog post</a> tells us differently. According to this post, while the recent improvements are impressive, the claims about human-level performance are too broad. ASR works very well on <b>American-accented English with high signal-to-noise ratios</b>. It has been trained on conversations by mostly American native English speakers with little background noise, which are available at large scale. It doesn’t work well, and definitely not at human-level performance, for other languages, accents, non-native speakers, etc.<br /></li>
</ul>
<div dir="ltr" style="font-family: arial; font-size: 11pt; line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<div style="text-align: center;">
<span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/sAz_UvnUeuU/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/sAz_UvnUeuU?feature=player_embedded" width="320"></iframe><br /><br /></span></div>
</div>
<ul style="margin-bottom: 0pt; margin-top: 0pt;">
<li><u>Facial Recognition</u> - another task claimed to be solved, but this <a href="https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html">article</a> from the New York Times (based on a study from MIT) says that it’s only solved for <b>white men</b>. An algorithm trained to identify gender from images was 99% accurate on images of white men, but far less accurate -- only 65% -- for dark-skinned women. Why is that so? A widely used dataset for facial recognition was estimated to be more than 75% male and more than 80% white.<br /><br /></li>
<li dir="ltr" style="font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<div class="separator" style="clear: both; text-align: left;">
<u>Machine Translation</u> - not claimed to be solved yet, but the release of <a href="https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html">Google Translate</a> neural models in 2016 reported large performance improvements. The <a href="https://arxiv.org/pdf/1609.08144.pdf">paper</a> reported a “60% reduction in translation errors on several popular language pairs”. The language pairs were <b>English→Spanish, English→French, English→Chinese, Spanish→English, French→English, and Chinese→English.</b> All these languages are considered “high-resource” languages, or in simple words, languages for which there is a massive amount of training data. As we discussed in the <a href="http://veredshwartz.blogspot.com/2015/09/translation-models.html">blog post about translation models</a>, the training data for machine translation systems is a large collection of the same texts written in both the source language (e.g. English) and the target language (e.g. French). Think of book translations as an example of a source of training data.</div>
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"><br class="kix-line-break" /></span>In other news, popular media was recently worried that <a href="https://mashable.com/2018/07/23/google-translate-glitch-ominous-religious-prophecies/#uRhHltM46kqy">Google Translate spits out some religious nonsense</a>, completely unrelated to the source text. Here are two examples: the top one contains religious content, while the bottom one is not religious, just unrelated to the source text.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxRqnzXgFSp2xjQI7GKhZrgLwKEyJVGDYb8spmd9mmiJHYFgEWzRvXgAC-cYD9e-01mGAIKmO-KHfN1xJhQWPtqTeFCH0A4vBU7R8yM_ByhwwzZgNVxcVQycz1_eMBTGy3QBTKBxZSXaw/s1600/MT_bullshit.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="360" data-original-width="947" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxRqnzXgFSp2xjQI7GKhZrgLwKEyJVGDYb8spmd9mmiJHYFgEWzRvXgAC-cYD9e-01mGAIKmO-KHfN1xJhQWPtqTeFCH0A4vBU7R8yM_ByhwwzZgNVxcVQycz1_eMBTGy3QBTKBxZSXaw/s400/MT_bullshit.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">Figure 8: “who has been using these technologies for a long time”? Hopefully not the Igbo speakers, translating to English. Google Translate makes up things when translating gibberish from the low-resource language Igbo.</td></tr>
</tbody></table>
This excellent <a href="http://deliprao.com/archives/301">blog post</a> offers a simple explanation for this phenomenon: “low-resource” language pairs (e.g. Igbo and English), for which there is not a lot of available data, have worse neural translation models. Not surprising so far. The reason the translator generates nonsense is that when it is given unknown inputs (the input nonsense in figure 8), the system tries to provide a fluent translation and ends up “hallucinating” sentences. Why religious texts? Because religious texts like the Bible and the Koran exist in many languages, and they are probably the major part of the available training data for translations between low-resource languages.</div>
</li>
</ul>
<br />
<br />
Can you spot a pattern among the successful tasks? Having a tremendous amount of training data increases the chances of training high-quality neural models for the task. Of course, it is a necessary-but-not-sufficient condition. In her <a href="https://www.dropbox.com/s/it1e4ndrcuevl04/Repl4NLP.pdf?dl=0">RepL4NLP</a> keynote, Yejin Choi talked about “solved” tasks and said that they all have in common a lot of training data and enough layers. But with respect to NLP tasks, there is another factor. Performing well on machine translation is possible without a model with deep text understanding abilities, but rather by relying on the strong alignment between the input and the output. Other tasks, which require deeper understanding, such as recognizing fake news, summarizing a document, or making conversation, have not been solved yet. (And may not be solvable just by adding more training data or more layers?)<br />
<br />
The models of these “solved” tasks are only applicable to inputs which are similar to the training data. ASR for a Scottish accent, facial recognition for black women, and translation from Igbo to English are not solved yet. Translation is not the only NLP example; whenever someone tells you some NLP task is solved, it probably only applies to very specific domains in English.<br />
<br />
<h4 dir="ltr" id="overfitting" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">The Risk of Overfitting</span></h4>
Immediately following the previous point, when the training data is limited, we have a risk of overfitting. By definition, overfitting happens when a model performs extremely well on the training set while performing poorly on the test set. It happens because the model memorizes specific patterns in the training data instead of looking at the big picture. If these patterns are not indicative of the actual task, and are not present in the test data, the model will perform badly on the test data. For example, let’s say you’re working on a very simplistic variant of text classification, in which you need to distinguish between news and sports articles. Your training data contains articles about sports and tweets from news agencies. Your model may learn that an article with fewer than 280 characters is news. The performance would be great on the training set, but what your model actually learned is not to distinguish between news and sports articles but rather between tweets and other texts. This definitely won’t be helpful in production, when your news examples can be full-length articles. <br />
<br />
Overfitting is not new to DL, but what’s changed from traditional machine learning are two main aspects: (1) it’s more difficult to “debug” DL models and detect overfitting, because we no longer have nice manually-designed features, but automatically learned representations; and (2) the models have many more parameters than traditional machine learning models used to have - the more layers, the more parameters. This means that a model can now learn more complex functions, but it’s not guaranteed to learn the best function for the task - rather, the best function for the given data. Unfortunately, these are not always the same, and this is something to keep in mind!</div>
<div>
<br /></div>
<div>
With respect to the first aspect, updating the pre-trained word embeddings during training can lead to overfitting. Given that the training set is limited and doesn’t cover the entire vocabulary, we’re only moving some words in the embedding space while keeping others in their place. The words we move (e.g. <i>kangaroo</i>) now have better vectors for the specific task, but they’re further away from their (distributional) neighbours which are not found in the training set. This hurts the model’s generalization abilities: when it encounters an unobserved word (e.g. <i>wallaby</i>), its vector is no longer located next to similar words (kangaroo) which have been moved, so the model doesn’t know much about it. Updating the embeddings during training is a good idea only if your task has very different needs from its embeddings than plain similarity (and then you may want to start with a random initialization rather than pre-trained embeddings), and only if you have enough training data to cover a broad vocabulary.</div>
<div>
<br /></div>
<div>
For the next, related point, I’m going to broaden the definition of “overfitting” to the phenomenon of a model memorizing the peculiarities of the data rather than what’s actually important for the task, <i>regardless of its performance on the test set</i>.<br />
<br />
<h4 dir="ltr" id="artificial_data" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">The Artificial Data Problem </span></h4>
Machine learning enthusiasts like to say that given enough training data, DL can learn to estimate any function. I personally take the more pessimistic stand that some language tasks are too complex and nuanced to solve using DL with any reasonable amount of training data. Nevertheless, like everyone else in the field, I’m constantly busy thinking of ways to get more data quickly and without spending too much money. </div>
<div>
<br /></div>
<div>
One representative example (of many) of a task in which the release of a large amount of training data raised researchers’ interest is <a href="https://en.wikipedia.org/wiki/Textual_entailment">recognizing textual entailment</a> (RTE, sometimes called “natural language inference” or NLI). This is an artificial task that was invented because many language understanding abilities we develop can be reduced to it. In this task, two sentences -- a premise and a hypothesis -- are given. The task is for a model to automatically determine what a human reading the premise could say with respect to the hypothesis. Let’s look at the definitions along with a simple example from the <a href="https://nlp.stanford.edu/projects/snli/">SNLI dataset</a>. For the premise “<i>A street performer is doing his act for kids</i>” and the hypothesis:</div>
<div>
<br />
<ul>
<li><b>Entailment</b>: the hypothesis must also be true. <br />Hypothesis = “<i>A person performing for children on the street</i>”. <br />The premise tells us there is a street performer, so we can infer that he is a person and he is on the street. The premise also tells us that he is doing his act - therefore performing, and for kids = for children. The hypothesis only repeats the information conveyed in the premise and information which can be inferred from it.<br /><br /></li>
<li><b>Neutral</b>: the hypothesis may or may not be true.<br />Hypothesis = â<i>A juggler entertaining a group of children on the street</i>â. <br />In addition to repeating information from the premise, the hypothesis now tells us that itâs a juggler. The premise is more general and can refer to other types of street performers (e.g. a guitar player), so this may or may not be a juggler.<br /><br /></li>
<li><b>Contradiction</b>: the hypothesis must be false.<br />Hypothesis = â<i>A magician performing for an audience in a nightclub</i>â.<br />The hypothesis describes a completely different event happening at a nightclub.<br /><br /></li>
</ul>
This task is very difficult for both humans and machines. It requires background knowledge, common sense, knowledge about the relationships between words (whether they mean the same thing, whether one is more specific than the other, or whether they contradict each other), recognizing that different mentions refer to the same entity (coreference), dealing with syntactic variations, etc. For an extensive list, I recommend reading this really good <a href="https://gluebenchmark.com/diagnostics">summary</a>. </div>
<div>
<br /></div>
<div>
In the early days, methods' performance was mediocre. The available data for this task contained a few hundred annotated examples. They required many <a href="https://arxiv.org/abs/1806.03561">different types of knowledge</a> to answer correctly, and were diverse in their topics. Unfortunately, there were too few of them to throw a neural network at... A few years ago, the huge <a href="https://nlp.stanford.edu/projects/snli/">SNLI dataset</a> was released, containing half a million examples. What enabled scaling up to such a large dataset was (1) taking the premises from an already available collection of image captions; and (2) asking people to generate entailed, neutral, and contradicting hypotheses. This made the data collection simple enough to rely on crowdsourcing workers rather than experts. </div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div>
Following the release of this dataset, the NLP community's interest in RTE surged. In particular, many neural methods have been developed. These basically encode each sentence (premise and hypothesis) into a vector, normally by running it through an RNN. The premise and hypothesis vectors are then combined using arithmetic operations and fed into a classifier that outputs the label (entailment, neutral, or contradiction). This approach (left side of figure 9) can get you as far as ~87% accuracy on the test set, and the differences between the specific methods are mostly technicalities. More sophisticated methods encode the hypothesis conditioned on the premise, using the neural attention mechanism (right side of figure 9). In simple words, the hypothesis encoder is allowed to "look at" the premise words, and it roughly aligns each hypothesis word to a related premise word. For example, given the premise "<i>A street performer is doing his act for kids</i>" and the hypothesis "<i>A juggler entertaining a group of children on the street</i>", the alignments would be <i>juggler-performer, act-entertaining, street-street</i>, etc. (In practice, it's not a one-to-one alignment, but a weighted one-to-many attention.) This approach gets you to over 90% accuracy today, which is beyond human performance on this dataset. Yes, you read correctly: a statistical DL method is better than humans on this dataset. </div>
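To make the sentence-encoding architecture (left side of figure 9) concrete, here is a minimal PyTorch sketch. This is not any specific published model: the dimensions, the LSTM encoder, and the combination features (concatenating the two vectors with their absolute difference and elementwise product, a common choice in this line of work) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SentenceEncodingRTE(nn.Module):
    """Sketch of the 'encode each sentence separately' architecture.
    All sizes are arbitrary choices, not tuned values."""
    def __init__(self, vocab_size=1000, embed_dim=50, hidden_dim=64, num_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # features: [p; h; |p - h|; p * h] -> 3 labels
        self.classifier = nn.Linear(4 * hidden_dim, num_labels)

    def encode(self, token_ids):
        _, (h_n, _) = self.encoder(self.embed(token_ids))
        return h_n[-1]  # final hidden state as the sentence vector

    def forward(self, premise_ids, hypothesis_ids):
        p, h = self.encode(premise_ids), self.encode(hypothesis_ids)
        features = torch.cat([p, h, (p - h).abs(), p * h], dim=-1)
        return self.classifier(features)  # logits: entail / neutral / contradict

model = SentenceEncodingRTE()
premise = torch.randint(0, 1000, (2, 8))     # batch of 2 "sentences", 8 tokens each
hypothesis = torch.randint(0, 1000, (2, 6))
logits = model(premise, hypothesis)
print(logits.shape)  # torch.Size([2, 3])
```

The attention variant (right side of figure 9) would replace `encode` for the hypothesis with an encoder that attends over the premise states, but the overall encode-combine-classify shape stays the same.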
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGoTGM1Hj5_WZSderBEwN_vrJlSqvA4O0aRo4mIP56goFA68VsAZcA0pNfhBQvEc1zZ7KCnZMyE29VIMLdaQkFmCD5CPXvlqCaJDzpZeyQG73I4DbxXbycpkcf7orcxfK_5aZiD7jw-d8/s1600/rte.png" imageanchor="1" style="margin-left: auto; margin-right: auto; text-align: center;"><img border="0" data-original-height="245" data-original-width="526" height="185" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGoTGM1Hj5_WZSderBEwN_vrJlSqvA4O0aRo4mIP56goFA68VsAZcA0pNfhBQvEc1zZ7KCnZMyE29VIMLdaQkFmCD5CPXvlqCaJDzpZeyQG73I4DbxXbycpkcf7orcxfK_5aZiD7jw-d8/s400/rte.png" width="400" /></a></td></tr>
<tr><td class="tr-caption"><div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<span style="font-size: 12.8px;">Figure 9: two common architectures of neural RTE systems. Left: sentence encoding models that encode each sentence separately. Right: attention model that encodes the hypothesis conditioned on the premise. Both models extract features from the premise and hypothesis vectors and use them for classification.</span></div>
<div style="text-align: left;">
<br /></div>
</td></tr>
</tbody></table>
<div>
Anyone who's ever worked on textual entailment and is even a tiny bit skeptical of DL as the solution to everything had to be suspicious. Indeed, <a href="https://twitter.com/emhusband/status/907202918616530944">many of us were</a>. Fast forward a few months, and a flood of papers confirmed that the task is indeed not solved. Instead of solving the general, very difficult textual entailment task, our models are memorizing very specific peculiarities of the training data (which are also common in the test data). <a href="http://aclweb.org/anthology/N18-2017">1</a>, <a href="http://aclweb.org/anthology/S18-2023">2</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2018/pdf/786.pdf">3</a> concurrently showed that a model which has access <b>only to the hypothesis</b> can solve the task much better than a random guess (which is what we'd expect from a model that has no access to the premise). They all pointed out peculiarities in the data that enable this. For example, hypotheses of contradiction examples tend to contain more negative words. This happens because the premises are image captions ("<i>a dog is running in the park</i>"), and image captions rarely describe something that doesn't happen. The easiest thing for a person asked to write a contradicting sentence is to add negation: "<i>a dog is <b>not</b> running in the park</i>". </div>
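To see how little it takes to exploit such artifacts, here is a toy sketch of a hypothesis-only baseline: a bag-of-words classifier that never sees a premise. The sentences and labels below are invented to mimic the negation artifact; the actual papers trained on the real SNLI hypotheses.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented mini-dataset mimicking the artifact: contradiction hypotheses
# contain negation words, the others do not.
hypotheses = [
    "a dog is not running in the park",
    "nobody is performing on the street",
    "a man is not sleeping",
    "a person is performing for children on the street",
    "a juggler is entertaining a group of children",
    "a dog is running",
]
labels = ["contradiction", "contradiction", "contradiction",
          "entailment", "neutral", "entailment"]

# Note: no premise appears anywhere below.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(hypotheses)
clf = LogisticRegression().fit(X, labels)

# The classifier latches onto "not" / "nobody" as contradiction cues,
# so it confidently labels an unseen hypothesis without any premise.
new = vectorizer.transform(["a cat is not sleeping"])
print(clf.predict(new))
```

A model that scores well this way has learned the annotators' habits, not entailment.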
<div>
<br /></div>
<div>
Funnily, <a href="http://aclweb.org/anthology/N18-2017">1</a> also showed that the appearance of the word "cat" in the hypothesis can indicate contradiction, as there were many dog images, and what contradicts a dog better than a cat? In reality, cats are lovely creatures, and a sentence with a cat doesn't immediately contradict any other sentence.</div>
<div>
<br /></div>
<div>
<a href="http://aclweb.org/anthology/P18-2103">We</a> also showed that state-of-the-art models fail on examples that require knowledge about the relations between words, even when the example is super simple. For instance, the models would think that "<i>a man starts his day in India</i>" and "<i>a man starts his day in Malaysia</i>" are entailing, just because India and Malaysia are similar (although mutually exclusive) words. We showed that the models only learn to distinguish between such words if the specific words appear enough times in the training data. For example, in many contradiction examples in the training data, a <i>man</i> in the premise is doing something and a <i>woman</i> in the hypothesis is doing the same thing. Having observed enough of these examples, the models learn that man and woman are mutually exclusive. But they fail on the India/Malaysia example, because they didn't observe this exact pair of words in enough training examples. Since it's unreasonable to rely on the training set to provide enough examples of every possible pair of words, a different solution is needed, probably involving incorporating external knowledge from dictionaries and taxonomies. </div>
<div>
<br /></div>
<div>
The main lesson from this story should not be that DL methods are unsophisticated parrots that can only repeat exactly what they saw in the training data. Instead, there are several things to consider:</div>
<div>
<br /></div>
<div>
<ol>
<li>Good performance on the test set doesn't necessarily indicate solving the task. Whenever our training and test data are not "natural" but rather generated in a somewhat artificial way, we run the risk that they will both contain the same peculiarities, which are not properties of the actual task. A model learning these peculiarities is wasting energy on remembering things that are unhelpful in real-world usage, creating an illusion of solving an unsolved task. I'm not saying we should never generate or preprocess our data - just that we should be aware of this risk. If your model is getting really good performance on a really difficult task, there's reason to be suspicious. <br /><br /></li>
<li>DL methods can only be as good as the input they get. When we make inferences, we employ a lot of common sense and world knowledge. This knowledge is simply not available in the training data, and we can never expect the training data to be extensive enough to contain it. Domain knowledge is not redundant, and in the near future someone will come up with smart ways to incorporate it into a neural model, and achieve good performance on newer, less simple datasets.</li>
</ol>
<b style="font-weight: normal;"><br /></b>
<br />
<h4 dir="ltr" id="we" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: arial; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">The Shaky Ground of Distributional Word Embeddings</span></h4>
At the core of many deep learning methods lie pre-trained word embeddings. They are a great tool for capturing semantic similarity between words. They mostly capture topical similarity, or relatedness (e.g. <i>elevator-floor</i>), but they can also capture functional similarity (e.g. <i>elevator-escalator</i>). Relying on pre-trained word embeddings is a great way to make your model generalize to new inputs, and I recommend it for any task that doesn't have loads of available data.</div>
<div>
<br /></div>
<div>
With that said, it's not perfect. In many tasks we need to know the exact relationship between two words. It's not enough to know that <i>elevator</i> and <i>escalator</i> (or <i>India</i> and <i>Malaysia</i>) are similar - we need to know that they are mutually exclusive. And word embeddings don't tell us that. In fact, they conflate many semantic relations together.</div>
<div>
<br /></div>
<div>
I think I have a good way to simulate this, and I've been using it in my talks for the last few months. The idea is to take some text, say the lyrics of a song, a script of a TV series, a famous speech, anything you like. Then, go over the text and replace each noun with its most similar word in <a href="https://code.google.com/archive/p/word2vec/">word2vec</a>. (It doesn't have to be word2vec and doesn't have to be only for nouns - this is just what I did. The code and some other examples are available <a href="https://github.com/vered1986/PythonUtils/blob/master/Fun/Fun%20with%20word2vec.ipynb">here</a>.) Here is a part of my favorite example: Martin Luther King's "<a href="http://www.americanrhetoric.com/speeches/mlkihaveadream.htm">I Have a Dream</a>" speech. This is what you get when you replace words with their word2vec neighbors:</div>
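Here is a self-contained toy version of this experiment. A few hand-made vectors stand in for the real word2vec embeddings (the numbers are invented for illustration; the linked notebook uses the actual pretrained model):

```python
import numpy as np

# Toy stand-ins for word2vec vectors; the values are made up so that
# "daydream" lands near "dream" and "country" near "nation".
vectors = {
    "dream":    np.array([0.9, 0.1, 0.0]),
    "daydream": np.array([0.85, 0.15, 0.05]),
    "nation":   np.array([0.1, 0.9, 0.2]),
    "country":  np.array([0.15, 0.85, 0.25]),
    "table":    np.array([0.0, 0.1, 0.9]),
}

def most_similar(word):
    """Return the nearest neighbor of `word` by cosine similarity."""
    v = vectors[word]
    best, best_sim = None, -1.0
    for other, u in vectors.items():
        if other == word:
            continue
        sim = np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u))
        if sim > best_sim:
            best, best_sim = other, sim
    return best

sentence = "I have a dream that this nation will rise up"
replaced = " ".join(most_similar(w) if w in vectors else w
                    for w in sentence.split())
print(replaced)  # I have a daydream that this country will rise up
```

As in the real experiment, the replacement preserves *similarity* but scrambles the *relation*: a daydream is a kind of dream, not a synonym of it.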
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcxrcCMVQN2HBt3SqPFwu4JQuexuKyM4XMNHOaDGKrvjzkjwvj-SjInIRJKEgj4Z6mF073-cfhOuPkBzMx1cDxU9wvelmUwbTTY_ViBxeY6NHI7wgcY8vsaeXBLpmqRLkbtik7Xvj3SuY/s1600/mlk.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="775" data-original-width="806" height="307" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcxrcCMVQN2HBt3SqPFwu4JQuexuKyM4XMNHOaDGKrvjzkjwvj-SjInIRJKEgj4Z6mF073-cfhOuPkBzMx1cDxU9wvelmUwbTTY_ViBxeY6NHI7wgcY8vsaeXBLpmqRLkbtik7Xvj3SuY/s320/mlk.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption">Figure 10: a part of Martin Luther King's "I Have a Dream" speech after replacing nouns with the most similar word2vec words.</td></tr>
</tbody></table>
<div>
<br />
<br />
Apart from being funny, this is a good illustration of the phenomenon: words have been replaced by other words standing in various relations to them. For example, <i>country</i> instead of <i>nation</i> is not quite the same, but synonymous enough. <i>Kids</i> instead of <i>children</i> is perfectly fine. But a <i>daydream</i> is a type of <i>dream</i>, <i>week</i> and <i>day</i> are mutually exclusive (they share the common category of time units), <i>Classical.com</i> is completely unrelated to <i>content</i> (yes, statistical methods have errors...), and <i>protagonist</i> is synonymous with the original word <i>character</i> - but in the sense of a character in a book, not of an individual's qualities.</div>
<div>
<br /></div>
<div>
In the last three years, multiple methods have also emerged for learning a different type of word embedding that captures -- in addition to, or instead of, this fuzzy similarity -- semantic relations from taxonomies like WordNet. For example, the <a href="https://www.cs.cmu.edu/~hovy/papers/15HLT-retrofitting-word-vectors.pdf">Retrofitting</a> method starts with distributional (regular) word embeddings and then moves vectors in the space such that two words that appear together in WordNet as synonyms end up close to each other in the vector space. The <a href="https://arxiv.org/abs/1706.00374">Attract-Repel</a> method does the same but also makes sure that antonyms end up further apart (e.g. think again of the good/bad vectors in sentiment analysis). Other methods include <a href="https://arxiv.org/pdf/1511.06361.pdf">Order Embeddings</a>, <a href="https://arxiv.org/pdf/1705.08039.pdf">Poincaré Embeddings</a>, <a href="https://arxiv.org/pdf/1710.06371.pdf">LEAR</a>, and many more. While these methods are elegant, and have been shown to capture the semantic relations they get as input, they have yet to improve the performance of NLP applications over a version of the same system that uses regular embeddings. </div>
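To give a flavor of how retrofitting works, here is a minimal numpy sketch of its iterative update with a toy lexicon, using the simplest setting where the "stay close to the original vector" and "move toward the neighbors" weights are both 1. The vectors and the tiny synonym list are made up.

```python
import numpy as np

# Toy distributional vectors (invented numbers) and a toy "WordNet".
original = {
    "elevator":  np.array([1.0, 0.0]),
    "escalator": np.array([0.0, 1.0]),
    "lift":      np.array([0.9, 0.3]),
}
synonyms = {"elevator": ["lift"], "lift": ["elevator"], "escalator": []}

retrofitted = {w: v.copy() for w, v in original.items()}
for _ in range(10):  # a few iterations are enough to converge here
    for word, neighbors in synonyms.items():
        if not neighbors:
            continue
        neighbor_sum = sum(retrofitted[n] for n in neighbors)
        # With both weights set to 1, each update averages the original
        # vector with the current neighbor vectors.
        retrofitted[word] = (original[word] + neighbor_sum) / (1 + len(neighbors))

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synonyms end up closer than they were in the original space.
print(cos(original["elevator"], original["lift"]),
      cos(retrofitted["elevator"], retrofitted["lift"]))
```

Attract-Repel adds a repelling term of the same flavor for antonym pairs, pushing them apart instead of together.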
<div>
<br />
<br />
<h4 dir="ltr" id="cram" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: arial; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">The Unsatisfactory Representations Beyond the Word Level</span></h4>
Recurrent neural networks allow us to process texts of arbitrary length: from phrases consisting of several words to sentences, paragraphs, and even full documents. Does this also mean that we can have phrase, sentence, and document embeddings that capture the meaning of these texts?</div>
<div>
<br /></div>
<div>
Now is a good time to remember the famous quote from <a href="http://yoavartzi.com/sp14/slides/mooney.sp14.pdf">Ray Mooney</a>:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtxfLOZR7Sj_OT2Aq6g5GrTIqb_rdLpYBUBFNKKcZIwS-Y4SdXzo4YqaU1mICQrSsykcVN8YVB_rXrA04ln9KzIqQt40eEUztRmBewxi8d9lL-Y6Uz98_dKbsLsEvBdk9RUF80_ZoH9H4/s1600/cram.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="78" data-original-width="612" height="50" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtxfLOZR7Sj_OT2Aq6g5GrTIqb_rdLpYBUBFNKKcZIwS-Y4SdXzo4YqaU1mICQrSsykcVN8YVB_rXrA04ln9KzIqQt40eEUztRmBewxi8d9lL-Y6Uz98_dKbsLsEvBdk9RUF80_ZoH9H4/s400/cram.png" width="400" /></a></div>
<br />
To be honest, I completely agree with this opinion when talking about general-purpose sentence embeddings. While we have good methods to represent sentences for <i>specific tasks and objectives</i>, it's not clear to me what a "generic" sentence embedding needs to capture, nor how to learn such a thing.<br />
<br />
Many researchers think differently, and sentence embeddings have been pretty common in the last few years. To name one, <a href="https://arxiv.org/abs/1506.06726">Skip-Thought vectors</a> build upon the assumption that a meaningful sentence representation can help predict the next sentence. Given that even I, as a human, can rarely predict the next sentence in a text I'm reading, I think this is a very naive assumption (could you predict that I'd say that?...). But such a model probably predicts the next sentence more accurately than it would a completely unrelated sentence, creating a kind of <b>topical</b>, rather than <b>meaning</b>, representation. In the example in figure 11, the model considered lexically similar sentences (i.e. sentences that share the same words or contain similar words) to be more similar to the target sentence than a sentence with a similar meaning but a very different phrasing. I'm not surprised.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSNMgHrpf_KKI4RHV0oGdCnmmKUQJqWo2sU2z4fmqQaZtLQUaJCUEWvR3GXwanL9LxKNyBf1EUdxgFGq1MAilqWrW85crJXrECXKscf5Ah3m1lQpuXfaSzq32m1Jml7EsHuKhs-a1KB1Y/s1600/sent2vec.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="717" data-original-width="848" height="337" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSNMgHrpf_KKI4RHV0oGdCnmmKUQJqWo2sU2z4fmqQaZtLQUaJCUEWvR3GXwanL9LxKNyBf1EUdxgFGq1MAilqWrW85crJXrECXKscf5Ah3m1lQpuXfaSzq32m1Jml7EsHuKhs-a1KB1Y/s400/sent2vec.png" width="400" /></a></td></tr>
<tr><td class="tr-caption">Figure 11: the most similar sentences to the target sentence "A man starts his day in India", out of the 3 other sentences that were encoded, using the <a href="http://sent2vec.quantumsense.ai/">sent2vec demo</a>. </td></tr>
</tbody></table>
<div>
<br /></div>
<div>
Another approach is the <a href="http://www.aclweb.org/anthology/D11-1014">Autoencoder</a>, which creates a vector representing the input text and is trained to reproduce the input from that vector. The core assumption is that in order to reproduce the original sentence, the representation must capture important aspects of the sentence's meaning. You can think of it as a type of compression.</div>
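The idea can be sketched with a tiny (non-recurrent, for brevity) autoencoder in PyTorch. A real sentence autoencoder would use an RNN encoder and decoder over word sequences, but the compress-then-reconstruct principle is the same; all sizes here are arbitrary.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Compress the input into a small vector, then reconstruct the input."""
    def __init__(self, input_dim=100, bottleneck_dim=16):
        super().__init__()
        self.encoder = nn.Linear(input_dim, bottleneck_dim)  # the "sentence vector"
        self.decoder = nn.Linear(bottleneck_dim, input_dim)

    def forward(self, x):
        z = torch.tanh(self.encoder(x))   # compressed representation
        return self.decoder(z), z

model = TinyAutoencoder()
x = torch.randn(4, 100)                   # stand-in for 4 encoded input "sentences"
reconstruction, z = model(x)
# The training signal: how well can the input be rebuilt from the bottleneck?
loss = nn.functional.mse_loss(reconstruction, x)
print(z.shape, loss.item())
```

After training, the bottleneck `z` is taken as the text's representation; whatever the decoder needs to rebuild the input has to survive the compression.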
<div>
<br /></div>
<div>
Finally, sentence embeddings come as a byproduct of tasks that represent sentences as vectors, like textual entailment models (for classification) or machine translation models (for text generation). Yes, these are trained to capture a specific aspect relating to their end task (entailment / translation), but assuming that these tasks require a deep understanding of the meaning of a sentence, the embeddings can be used as general representations. </div>
<div>
<br /></div>
<div>
So what aspects of the sentence do these representations capture? <a href="https://arxiv.org/pdf/1805.01070.pdf">This paper</a> did a pretty extensive analysis of various types of sentence embeddings. The authors defined some interesting properties which may be conveyed in a sentence, starting from shallow things like the sentence length (number of words) and whether a specific word is in the sentence or not; moving on to syntactic properties such as the order of words in the sentence; and finally semantic properties like whether the sentence is topically coherent (in other words, whether it is possible to distinguish between a "real" sentence and a sentence in which one word was replaced with a completely random word). To find out which of these properties are encoded in which sentence embedding approach, they used the sentence embeddings as inputs to very simple classifiers, each trained to recognize a single property (e.g. using the vector of "<i>this is a vector</i>" in the sentence length classifier to predict 4). The performance of the various sentence embeddings on all tasks was somewhere between that of a very simple baseline (e.g. using random vectors) and human performance. Not surprisingly, there is more room for improvement on the more complex, semantic tasks.</div>
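Here is a toy sketch of the probing methodology with one of the shallow properties: whether a specific word appears in the sentence. The "sentence embedding" below is just a sum of made-up word vectors, not any of the models from the paper; the point is the setup, not the numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
dim, vocab_size = 50, 200
word_vectors = rng.normal(size=(vocab_size, dim))  # made-up "word embeddings"

def embed(sentence):
    # a deliberately naive sentence embedding: the sum of the word vectors
    return word_vectors[sentence].sum(axis=0)

# Probing property: does the sentence contain a specific word (id 0)?
sentences, labels = [], []
for _ in range(1000):
    s = rng.choice(np.arange(1, vocab_size), size=7).tolist()
    if rng.random() < 0.5:
        s[0] = 0  # plant the probed word
    sentences.append(s)
    labels.append(int(0 in s))

X = np.stack([embed(s) for s in sentences])
X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)

# The probe is intentionally simple: if a linear classifier can recover
# the property from the embedding alone, the embedding encodes it.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(probe.score(X_test, y_test))  # well above the 50% chance level
```

The paper repeats exactly this recipe, property by property and embedding model by embedding model, and compares each probe's accuracy to a trivial baseline and to humans.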
<div>
<br /></div>
<div>
At this point I'd like to repeat Ray Mooney's quote and say that you still can't cram the meaning of a whole sentence into a single vector. It's impressive that we have gone far enough to have representations that capture all these properties, but is this all there is to a sentence's meaning? Here are some things that I don't know whether the embeddings capture, but mostly assume they don't:</div>
<div>
<br /></div>
<div>
<ol>
<li>Do they capture things which are not said explicitly, but can be implied? (<i>I didn't eat anything since the morning</i> implies <i>I'm hungry</i>).<br /><br /></li>
<li>Do they capture the meaning of the sentence in the context in which it is said? (<i>No, thanks</i> can mean <i>I don't want to eat</i> if you've just offered me food).<br /><br /></li>
<li>Do they always assume that the meaning of a sentence is literal and compositional (composed of the meanings of the individual words), or do they have good representations for idioms? (<i>I will clean the house when pigs fly</i> means <i>I will never clean the house</i>).<br /><br /></li>
<li>Do they capture pragmatics? (<i>Can you tell me about yourself</i> actually means <i>tell me about yourself</i>. You don't want your sentence vectors to be like the interviewee in <a href="https://www.reddit.com/r/Jokes/comments/5044l4/interviewer_whats_your_greatest_weakness/">this joke</a>). <br /><br /></li>
<li>Do they capture things which are not said explicitly because the speakers share a common background? (If I tell a local friend that <i>the prime minister must go home</i>, we both know the <a href="https://en.wikipedia.org/wiki/Benjamin_Netanyahu">specific</a> prime minister I'm talking about).<br /></li>
</ol>
<br />
I could go on and on, but I think these are enough examples to show that we have yet to come up with a meaning representation that mimics whatever representation we have in our heads, which is derived by making inferences and drawing on common sense and world knowledge. If you need more examples, I recommend taking a look at the slides from Emily Bender's "<a href="http://faculty.washington.edu/ebender/papers/Bender-ACL2018-tutorial.pdf">100 Things You Always Wanted to Know about Semantics & Pragmatics But Were Afraid to Ask</a>" tutorial. <br />
<br />
<h4 dir="ltr" id="robustness" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: arial; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">The Non-Existent Robustness</span></h4>
Anyone who's ever tried to reimplement neural models and reproduce published results knows we have a problem. More often than not, you won't get the exact same published results. Sometimes not even close. This often happens because of differences in the values of "hyper-parameters": technical settings that have to do with training, such as the <a href="http://www.fon.hum.uva.nl/praat/manual/epoch.html">number of epochs</a>, the <a href="https://www.quora.com/What-is-regularization-in-machine-learning">regularization</a> values and methods, and others. While they seem like mere details, in practice hyper-parameter values can make large performance differences.</div>
<div>
<br /></div>
<div>
The problem starts when training new models. You come up with an elegant model, you implement and train it, but the results are not as expected. This doesn't necessarily mean that the architecture or the data is bad; it often means that you need to tweak the hyper-parameter values and re-train the model, yielding completely different and hopefully better performance. Unfortunately, it's almost impossible to tell in advance which values will yield better performance. There is no best practice, just a lot of trial and error. We train many different models with various settings, test their performance on the validation set (a set of examples separate from the training and test sets), and choose the best-performing model (which is then evaluated on the test set). </div>
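The tuning loop described above can be sketched as a simple grid search. `train_and_evaluate` is a placeholder for your actual training code, and the hyper-parameter names and values are arbitrary examples:

```python
import itertools
import random

def train_and_evaluate(learning_rate, dropout, num_epochs):
    """Placeholder: in reality this trains a model with these settings
    and returns its accuracy on the validation set."""
    random.seed(hash((learning_rate, dropout, num_epochs)) % 10000)
    return random.uniform(0.6, 0.9)  # pretend validation accuracy

search_space = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "dropout": [0.2, 0.5],
    "num_epochs": [5, 20],
}

best_score, best_config = -1.0, None
for values in itertools.product(*search_space.values()):
    config = dict(zip(search_space.keys(), values))
    score = train_and_evaluate(**config)  # accuracy on the validation set
    if score > best_score:
        best_score, best_config = score, config

# Only this one winning model gets evaluated once on the test set.
print(best_config, best_score)
```

Even this tiny space already means 12 training runs; real searches are far larger, which is exactly where patience and compute budgets run out.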
<div>
<br /></div>
<div>
Hyper-parameter tuning is an exhausting and often frustrating process. I'm sure that many good models get lost along the way because the researcher lost patience or ran out of computational resources (yes, neural models also take longer to train... and strong machines cost a lot of money).</div>
<div>
<br /></div>
<div>
It's pretty discouraging to think that achieving good performance on some test set is sometimes due to arbitrary settings rather than to sound scientific ideas and model design.</div>
<div>
<br />
<h4 dir="ltr" id="interpretability" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: arial; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">The Lack of Interpretability</span></h4>
Last but not least, the interpretability issue is probably the worst caveat of DL. </div>
<div>
<br /></div>
<div>
Machines don't give explanations for their predictions. While this was also true for traditional ML, it used to be easier to analyze the predictions and come up with an explanation in retrospect. Algorithms that also learn the representations, and networks with far more parameters, make this much more difficult. To put it simply, we generally have no idea what's happening inside the networks we train; they are "black box" models. </div>
<div>
<br /></div>
<div>
Why is this a problem? First, being able to interpret our models would help us debug them and understand why they are not working and what exactly it is they are learning. It would make the development cycle much shorter and our models more robust and trustworthy. While it's nearly impossible today, there are people <a href="https://blackboxnlp.github.io/">working to change this</a>. </div>
<div>
<br /></div>
<div>
Second, and more importantly, in some tasks transparency and accountability are crucial - specifically, tasks concerned with safety, or which can discriminate against particular groups. Sometimes there is even a <a href="https://syncedreview.com/2018/01/31/will-new-eu-regulations-starve-data-hungry-deep-learning-models/">legal requirement to provide an explanation</a>. Think of the following examples (not necessarily NLP): </div>
<div>
<br /></div>
<div>
<ul>
<li>Self-driving cars </li>
<li>Predicting the probability that a prisoner will commit another crime (who should be released?) </li>
<li>Predicting the probability of death for patients (who is a better healthcare "financial investment"?) </li>
<li>Deciding who should be approved for a loan </li>
<li>more... </li>
</ul>
<br />
In the next post I will elaborate on some of these examples and how ML sometimes discriminates against particular groups. This post is long enough already!<br />
<div>
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"><br /></span></div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com9tag:blogger.com,1999:blog-9145120678290195131.post-41127181424493588232018-04-13T13:23:00.000+03:002018-08-12T23:36:33.944+03:00Targeted Content<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
You must have heard of, or have suspected firsthand, the famous conspiracy theory that the Facebook app listens to your phone's microphone in order to better target ads that match your current interests. I've had the funniest experience with this myself: a friend in the cosmetics business told me about this conspiracy, and in the same conversation she mentioned that an advertising agent had called her to offer to advertise her business. Later that day, I got a Facebook ad saying "advertise your cosmetics business". What the heck? What are the odds of that? And I don't even have the Facebook app installed, just Facebook Messenger.<br />
<br />
Although <a href="https://www.theverge.com/2018/4/10/17221478/zuckerberg-facebook-senate-listening-tapping-microphone">Mark Zuckerberg denied this conspiracy theory in his senate hearing</a>, I doubt that people will stop believing it as long as the ads algorithm keeps surprising them. Choosing to believe Zuckerberg that they don't listen to our microphones (yet, <a href="https://research.fb.com/publications/towards-end-to-end-spoken-language-understanding/">I suspect</a>), I'm pretty confident that they, as well as other companies, are using our written content (emails, social media posts, search queries).</div>
<div dir="ltr" style="text-align: left;">
<br />
Most people are alarmed by these suspicions from the privacy aspect: <i>what data does this company hold about me? how do they use it? who do they share it with?</i> This post will <b>not</b> be about that. Instead, this post will be about the technical aspect, which is what interests me most as an NLP researcher. If we assume that our apps constantly listen to us and that our written content is monitored and analyzed, what does it say about the text understanding capabilities of these companies?<br />
<br />
Oh, and expect no answers. This post is all about questions and conspiracy theories!<br />
<br />
<b>What is personalized content?</b><br />
Personalized content doesn't have to come in the form of an ad. It can take the form of recommendations (products to buy based on previous purchases, songs to listen to, as in this <a href="https://veredshwartz.blogspot.co.il/2015/11/recommender-systems.html">post</a>). It can be relevant professional content from <a href="https://www.linkedin.com/">LinkedIn</a>, discounts on services you've previously consumed, cheap flights to your planned destinations, and so on. Some of this will be a direct result of the preferences and settings you defined in the website. For example, I've registered in several websites to get updates on concerts of my favorite bands, and I get healthy vegetarian recipes from <a href="https://www.yummly.com/">Yummly</a>. Some of this content will be based on inferences that the system makes, assuming that certain content is relevant for you. Here is one <a href="https://twitter.com/VeredShwartz/status/920196182898610176">example</a>:<br />
<!--<iframe allow="encrypted-media" allowtransparency="true" frameborder="0" height="231" scrolling="no" src="https://vered1986.github.io/misc/htmls/tweet.html" style="border: none; overflow: hidden;" width="500"></iframe>-->
<br />
<blockquote> <!-- class="twitter-tweet"> -->
<div dir="ltr" lang="en">
Lately I've been getting <a href="https://twitter.com/Quora?ref_src=twsrc%5Etfw">@quora</a> digest emails on topics related to conversations I had with people (in Hebrew!). 1/5</div>
â Vered Shwartz (@VeredShwartz) <a href="https://twitter.com/VeredShwartz/status/920196182898610176?ref_src=twsrc%5Etfw">October 17, 2017</a></blockquote>
<script async charset="utf-8" src="https://platform.twitter.com/widgets.js"></script>
In that case I was amazed by the accuracy of the <a href="https://www.quora.com/">Quora</a> digest emails I was getting. Specifically, I had a conversation with my husband about the confidence it takes to admit you don't know something, and he mentioned he likes to say something more helpful than "I don't know" to someone who needs help. The next day, I got a personally-tailored Quora digest email that contained an answer to the question "Could you say something nice instead of 'I don't know'?". It wasn't under any of the topics that I follow (computer science related topics and <a href="https://www.quora.com/topic/Parakeets">parakeets</a>).<br />
<br />
In what follows I will exemplify most of my points using ads.<br />
<br />
<b>What we think these algorithms do</b><br />
OK, so I have to put my knowledge about the limitations of this technology, and my skepticism, aside for a second and think like the average person. In that case, I would assume that:<br />
<ul style="text-align: left;">
<li>If the ad is about a topic that I discussed in a spoken conversation, then there must be a recorder, and a speech-to-text component that converts the speech into written text.</li>
<li>It works in whichever language I spoke or wrote at the time. If this happened in more than one language, the company presumably has different algorithms (or at least differently-trained models of the same algorithm) for each language.</li>
<li>Written content and transcribed speech are processed to match with the available content/ads.</li>
<li>In some cases, it seems that even simple keyword matching leads to nice results. E.g., if you mentioned a vacation in Thailand, you will be matched with ads containing the words <i>vacation</i> and <i>Thailand</i> (I will let you know if I get any such ads after writing this post...). This takes no text understanding capabilities; it only requires recognizing that a bunch of words said in the same sentence (or within a short period of time) also appear in some ad, perhaps with the help of <a href="https://en.wikipedia.org/wiki/Information_retrieval">information retrieval</a> (IR) algorithms to recognize the most important words.</li>
<li>In other cases, it seems that a deeper understanding of the meaning of my queries and conversations is required in order to match them to the relevant content. A good example is the Quora digest example from above. Based on IR algorithms, searching for common words like <i>I, don't, know, helpful, nice, say, something</i> will not get you as far as searching for rarer content words like <i>vacation</i> and <i>Thailand</i>. So it must be that the algorithm has built some meaning representation of our conversation, and compared it with that of the Quora answer, which was phrased in slightly different words. On top of everything, our conversation was in Hebrew, so it must have a universal, multi-lingual meaning representation mechanism. </li>
</ul>
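If the keyword-matching hypothesis above is right, its core could be as simple as the following toy sketch. To be clear, everything in it -- the ads, their keywords, and the stop word list -- is made up for illustration; no ad platform publishes its actual matching code.

```python
# Toy keyword matching between a (transcribed) utterance and candidate ads.
# The ads, keywords, and stop word list are all made up for illustration.
STOP_WORDS = {"i", "a", "an", "the", "to", "in", "we", "should",
              "next", "month", "need", "new"}

ADS = {
    "thai_beach_resort": {"vacation", "thailand", "beach", "resort"},
    "office_chairs": {"office", "chair", "ergonomic", "desk"},
}

def content_words(utterance):
    """Keep only the words likely to carry topical content."""
    return {w.strip(".,!?").lower() for w in utterance.split()} - STOP_WORDS

def best_ad(utterance):
    """Return the ad whose keyword set overlaps most with the utterance."""
    words = content_words(utterance)
    return max(ADS, key=lambda ad: len(ADS[ad] & words))

print(best_ad("We should book a vacation in Thailand next month"))
# -> thai_beach_resort ("vacation" and "thailand" both match)
```

A real system would of course weight words by importance (IR-style, e.g. tf-idf) rather than counting raw overlap, but the basic mechanism needs no language understanding at all.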
<br />
<b>Alternative explanations</b><br />
Skepticism returns; I can believe that my speech is recorded and transcribed fairly accurately to text when I speak English. It's a bit harder to believe when it happens in other languages (e.g. Hebrew in my case), but I can still find it somewhat reasonable; automatic speech recognition (ASR), <a href="https://awni.github.io/speech-recognition/">although it isn't perfect</a>, still works reasonably well. It's the text understanding component I'm much, much more skeptical about. Despite the constant progress, and although popular media makes it seem like AI is solved and computers completely understand human language, I know that is definitely not the case yet. So what other explanations can there be for the targeted content we see?<br />
<br />
<b>By Chance.</b> None of this actually happens and we're just imagining. Well, OK, not <i>none</i> of this, but in some cases, it's really just chance.<br />
<br />
One of the reasons that we're not easily convinced by this "by chance" argument is that we generally tend to pay attention only to the true-positive cases ("hits") in which we talked about something and immediately got an ad about it. It's much harder to notice the "misses": an ad that seems off (false positive) or all the things that we discussed and got no ads about (false negative).<br />
<br />
At the end of the day, we're all just ordinary people who share many common interests. Advertisers may reach us because they try to reach a large audience and we happen to fall under the very broad categories they target (e.g. age group). It could be that, purely by chance, we see ads for exactly the product or service we currently need.</div>
<div dir="ltr" style="text-align: left;">
<br />
<b>Other Means. </b>Technically speaking, rather than understanding text, it's much easier to consider other parameters such as your location, your declared interests (i.e. pages you've liked on Facebook, search results you clicked on in Google), your age, gender, marital status, and more. If you didn't provide one or more of these details, no worries! Your friends have, and <a href="https://research.fb.com/publications/find-me-if-you-can-improving-geographical-prediction-with-social-and-spatial-proximity/">it's likely you share some of these details with them</a>!<br />
<br /></div>
<div dir="ltr" style="text-align: left;">
Here is one good example:<br />
<iframe allow="encrypted-media" allowtransparency="true" frameborder="0" height="231" scrolling="no" src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2Fvered.shwartz%2Fposts%2F1582371045206680&width=500" style="border: none; overflow: hidden;" width="500"></iframe><br />
I keep getting baby and pregnancy ads on Facebook. I'm a married woman in my 30s; both facts are available in my Facebook profile, and that alone is enough to assume this topic is relevant for me (personally, it is not, but women like me are a small enough minority that the error rate doesn't matter, and I totally accept that). Add to this that many of my Facebook friends are other people my age who are members of parenting groups, have liked pages of baby-related stuff, etc. I can't make this stop, but I guess it will stop naturally when I'm in my late forties.<br />
<br />
<br />
I'd like to finish with an anecdote about how unsophisticated targeted content can sometimes be, to the point where you rub your eyes in disbelief and ask "how stupid can these algorithms be?". A few days ago I wrote to someone in an email, "I'll be in Seattle on May 30". Minutes later, I got an email from Booking.com with the title "Vered, Seattle has some last-minute deals!". That would have been smart, had I not already used Booking.com to book a hotel room in Seattle for <i>exactly</i> those dates.<br />
<br />
I may be way off, and it may be that these companies have killer AI abilities which are kept very well secret. In that case, some of my readers who work for these companies must be giggling now. To paraphrase <a href="https://www.goodreads.com/quotes/665107-just-because-you-re-paranoid-doesn-t-mean-they-aren-t-after-you">Joseph Heller</a> (or whoever said it first), just because you're paranoid doesn't mean they're not after you, but hey, there's no way their technology is good enough to do what you think it does, so some of it is just pure chance. Not as catchy as the original quote, I know.</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com0tag:blogger.com,1999:blog-9145120678290195131.post-80107250282174458882018-01-02T16:03:00.000+02:002018-01-02T16:03:22.901+02:00Fun with lyrics<div dir="rtl" style="text-align: right;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
</div>
<div dir="ltr" style="text-align: left;">
</div>
<div dir="ltr" style="text-align: left;">
This post stems from a (very boring) casual thought I had about a year ago: "Hmm... I wonder whether there is more rain in British songs?", which later generalized into "Is there any correlation between song lyrics and the weather in the artists' countries of origin?". I spent an entire weekend writing code to scrape lyrics from the web, and then life got in the way and I never finished this (uninteresting) project.<br />
<br />
Since I already have a very large corpus of lyrics,<sup><a href="https://veredshwartz.blogspot.co.il/2018/01/fun-with-lyrics.html#1">1</a></sup> I figured, why not combine two of my loves -- text analysis and music -- into one blog post? So in this post I will show you some fun analyses that people commonly do with lyrics.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<b>Word Clouds</b></div>
<div dir="ltr" style="text-align: left;">
Word clouds provide a nice illustration of word frequencies. Given a text, the word cloud contains the <i>k</i> most common words in the text, where more frequent words appear larger and closer to the center of the cloud. In this case, I chose an artist and created a word cloud from the lyrics of all of that artist's songs. I lowercased all the words and removed punctuation, stop words (very common function words like "and" and "the"), and the word "chorus". I used <a href="https://worditout.com/word-cloud">worditout</a> to draw the word clouds. Here are a few examples (click on the links to enlarge):<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJEJvebiydozg5C0TP8wp1zbzPV4v6NEw8KtSh1nF6jWpNkdWSfEemhHFfdmi0mDCjAP3aYO_TtXTA6HfJegm4w2xIuqaVxvWsOGT3XFwjob3hKKnXF4yQofhhnYiuGmr_rv_81t5eBnY/s1600/three.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="478" data-original-width="1600" height="117" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJEJvebiydozg5C0TP8wp1zbzPV4v6NEw8KtSh1nF6jWpNkdWSfEemhHFfdmi0mDCjAP3aYO_TtXTA6HfJegm4w2xIuqaVxvWsOGT3XFwjob3hKKnXF4yQofhhnYiuGmr_rv_81t5eBnY/s400/three.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 12.8px;">Left to right: word clouds for the lyrics of <a href="https://worditout.com/word-cloud/2688029/private/7db63e5c8800dfc4319c85a33e4afffe" style="font-size: 12.8px;">Red Hot Chili Peppers</a><span style="font-size: 12.8px;">, </span><a href="https://worditout.com/word-cloud/2687682/private/10e0d8909ce911950d0961f8ee8f5d93" style="font-size: 12.8px;">Morrissey</a><span style="font-size: 12.8px;">, and </span><a href="https://worditout.com/word-cloud/2687679/private/0ca9ff8de25fc051785ca00b1b2293f3" style="font-size: 12.8px;">Eminem</a><span style="font-size: 12.8px;">.</span></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
A few interesting, though expected, observations: Red Hot Chili Peppers often sing about love, and Morrissey mostly moans. When he doesn't moan ("<i>Oh</i>"), he sings about serious topics such as <i>war</i>, the <i>world </i>and <i>life</i>. Eminem curses a lot. Funnily, since I kept the words in their inflected forms, we get multiple variations of the F word in his word cloud. </div>
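For the technically curious, the counting behind these clouds boils down to a few lines; here is a minimal sketch of the preprocessing described above (the stop word list is a tiny stand-in for a real one, and the lyrics line is made up):

```python
import re
from collections import Counter

# A tiny stand-in stop word list; a real one (e.g. NLTK's) is much longer.
STOP_WORDS = {"the", "and", "a", "to", "i", "you", "me", "it", "of", "chorus"}

def top_words(lyrics, k=3):
    """Lowercase, strip punctuation, drop stop words, count what's left."""
    words = re.findall(r"[a-z']+", lyrics.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(k)

print(top_words("Love, love me do. You know I love you! [Chorus] Love..."))
# -> [('love', 4), ('do', 1), ('know', 1)]
```

The word cloud tool then just maps these counts to font sizes.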
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<b>Topics</b><br />
Now that we see which words are common in each artist's lyrics, we can take it a step further and try to visualize the <i>topics </i>that they sing about. There are many ways to do that; we'll do it simply by visualizing their word embeddings using <a href="https://lvdmaaten.github.io/tsne/">t-SNE</a>, a technique for projecting high-dimensional vectors into 2-dimensional space. The underlying assumption of word embeddings is that words with similar meanings, or those that belong to the same topics, have similar vectors. This should also be reflected in their 2-dimensional visualization.<br />
<br />
To give the lyrics some context and demonstrate how they relate to all the possible topics in the world, I took the words from the lyrics and visualized their vectors along with the 2,500 most common words in English, highlighting words from the lyrics in red. Here is the result for Morrissey:<br />
<br />
<iframe height="480" src="https://drive.google.com/file/d/1NPHgTzbZ8pIK6wOTjIqF6BxtmWs_v2jp/preview" width="100%"></iframe>
<br />
<br />
You'd have to scroll through the graph and look for clusters of red dots, then try to figure out what their common theme is. For example, I've found adjectives describing negative feelings (<i>unhappy, sad, tired, weary</i>, ...), words related to love (<i>hearts, lonely, love, hug, kiss</i>), body parts (<i>body, arms, hands, head</i>), and people (<i>young, children, nephew, girl, boy, woman</i>, ...).<br />
<br />
And here is the result for Muse:<br />
<br />
<iframe height="480" src="https://drive.google.com/file/d/1Cwp-1J7Zczt2pBLISIhLRrSRMUp3qNQ6/preview" width="100%"></iframe>
<br />
<br />
Here I see positive emotions (<i>love, dream, fate</i>), negative emotions (<i>sorrow, shame, greed, apathy, bitterness</i>), evil stuff (<i>daemons, evil, exorcise, sins</i>) and war-related words (<i>war, struggle, fighting, revolt</i>).<br />
<br />
[Some technical details for my technical readers: I took the first 2500 words from <a href="https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa.txt">this</a> list of 10k most common words in English. For the lyrics, I considered the 500 most common words which are adjectives, nouns or verbs. I drew the t-SNE graph using this <a href="https://github.com/vered1986/PythonUtils/blob/master/Fun/lyrics/tsne.py">script</a>, and used the pre-trained 50d GloVe word embeddings].</div>
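The vector-loading step of that pipeline can be sketched in a few lines; the inline "file" and word lists below are toy stand-ins for the real GloVe file and the lyrics vocabulary:

```python
import io

def load_glove(lines, vocabulary):
    """Parse GloVe's text format ("word v1 v2 ... vd" per line),
    keeping only the words we actually want to plot."""
    vectors = {}
    for line in lines:
        word, *values = line.split()
        if word in vocabulary:
            vectors[word] = [float(v) for v in values]
    return vectors

# Toy stand-in for glove.6B.50d.txt (real vectors have 50 dimensions).
toy_glove = io.StringIO("love 0.1 0.2\nwar 0.9 -0.3\nhug 0.15 0.25\n")
lyrics_words = {"love", "hug", "california"}
vectors = load_glove(toy_glove, lyrics_words)
print(sorted(vectors))  # -> ['hug', 'love'] ("california" has no vector here)
```

The resulting vectors are what gets handed to t-SNE (e.g. sklearn.manifold.TSNE) to produce the 2D coordinates plotted in the graphs above.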
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<b>Generating New Songs</b></div>
<div dir="ltr" style="text-align: left;">
As the word clouds may suggest, each artist has a specific style which is reflected in the word choice and topic of their songs. We can train a model that captures this specific style, mimics this artist and generates new songs that would look like they've been written by this artist.<br />
<br />
Unfortunately, for better-quality results, you need a large amount of training data, so forget about generating new songs of artists who tragically died after releasing only a few records (e.g. <a href="https://en.wikipedia.org/wiki/Elliott_Smith">1</a>, <a href="http://www.nirvana.com/">2</a>, <a href="https://en.wikipedia.org/wiki/Amy_Winehouse">3</a>) or of your favorite indie bands that have relatively few songs (e.g. <a href="http://www.marmozets.co.uk/">1</a>, <a href="http://irontom.com/">2</a>, <a href="http://blaenavon.com/">3</a>, <a href="http://www.nbthieves.com/">4</a>, <a href="http://youareallslaves.com/">5</a>, <a href="http://royalbloodband.com/">6</a>, <a href="https://en.wikipedia.org/wiki/Darlia_(band)">7</a>, <a href="https://en.wikipedia.org/wiki/Peace_(band)">8</a>). We'll stick with the more mainstream bands and try to generate new songs by Muse, Weezer, and Red Hot Chili Peppers.<br />
<br />
For that purpose, we are going to learn an artist-specific language model. I've written an elaborate post about <a href="https://veredshwartz.blogspot.co.il/2015/09/language-models.html">language models</a> in the context of machine translation; in short, language models estimate the probability of a certain text in the language (e.g. English, or a more specific domain, like Twitter data or Muse lyrics). Each word in the text depends on the previous words, so in an English language model, for instance, the probability of "she doesn't" is larger than that of "she don't" (although, this may not be the case for English rap songs language models!). Language models can be used to compute the probability of an existing text, but they can also be used to generate new texts by sampling words from the distribution. We're going to use them for generation.<br />
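To make the idea concrete before moving to neural models, here is a toy count-based bigram language model: it estimates, from a made-up three-line "corpus", how often each word follows another, and then samples new lines from those counts (a real model would be trained on thousands of songs and would need smoothing):

```python
import random
from collections import defaultdict, Counter

def train_bigram_lm(corpus):
    """For each word, count which words follow it and how often."""
    counts = defaultdict(Counter)
    for line in corpus:
        tokens = ["<s>"] + line.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def sample_line(counts, rng):
    """Generate a new line by repeatedly sampling the next word
    in proportion to how often it followed the previous one."""
    word, line = "<s>", []
    while True:
        nxt = counts[word]
        word = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        if word == "</s>":
            return " ".join(line)
        line.append(word)

corpus = ["i want you now", "i want to break free", "you want me now"]
lm = train_bigram_lm(corpus)
print(sample_line(lm, random.Random(0)))
```

Because "want" follows "i" in two of the three lines, generated lines will tend to start with "i want" -- that's the whole trick, just with counts instead of a learned network.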
<br />
As opposed to the language models in my blog post, we will train a <i>neural </i>language model. These are explained very clearly in Andrej Karpathy's blog post "<a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">The Unreasonable Effectiveness of Recurrent Neural Networks</a>". In short, a recurrent neural network (RNN) is a model that receives as input a sequence (e.g. of words / characters) and outputs vectors representing each subsequence (i.e. the first item, the first two items, ..., the entire sequence). These vectors can then be used by other machine learning models, e.g. for classification.<br />
<br />
In the context of language models, the RNN learns to model the probability distribution of the next item in the sequence (e.g. the next word in the song). During training, the model goes over the entire text corpus (e.g. all the lyrics of a specific artist) and tries to predict the next item (word). If the predicted next item is incorrect, i.e. different from the actual next item, the model adjusts itself, until it is accurate enough. At test time, once the model parameters are settled, you can use it to generate new texts by sampling from the distribution of possible items (words) and constantly sampling new words conditioned on the already-sampled ones. The result should look similar to the original text corpus it was trained on. Very often, generated sequences will be actual texts from the corpus (and then you've just trained a parrot... Thanks <a href="https://artistdetective.wordpress.com/">Don Patrick</a> for the great metaphor, I'm constantly quoting you on this!).<br />
<br />
[Some technical details for my technical readers: I trained a word-level LSTM using <a href="https://dynet.readthedocs.io/en/latest/">DyNet</a>, largely based on the <a href="https://github.com/clab/dynet/blob/5049e5995f169fe1798139e1ca4dc98a7c0c4317/examples/rnnlm/rnnlm.py">char-level RNN example</a>. My code is available <a href="https://github.com/vered1986/PythonUtils/blob/master/Fun/lyrics/lyrics_lm.py">here</a>.]<br />
<br />
So, let's take a look at the results! After training each model, I sampled a single song. I sampled each sentence separately, so subsequent sentences are not expected to be related to each other. I enforced the song structure by forcing a line break after every 5 lines. Here is the new "Weezer" song:<br />
<br />
<table border="1"><tbody>
<tr><td><span style="font-family: "courier new" , "courier" , monospace;">let me see the joy</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">holding on to what they give,</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">turn it, turn it,</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i'd bury diamonds</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">woo-hoo-hoah</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">you're just smile</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">excuse my manners if i make a scene</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">we're just visiting</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i'm still afloat</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">and i'm lost without your love</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">why are all american girls so rough?</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">i'm a robot</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">and kick you out of sight</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">and if you're up all night</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i cried for you, you were the blast</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">tonight...</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">i'm just meant to be your latest tragedy</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">why are all american girls so rough?</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">you are.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">how man is this in the world</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i feel safe</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">oo-ee-oo i look just like buddy holly</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">wish that they would teach me when our critics</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i don't want your love</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">chicks are trying to freak</span></td></tr>
</tbody></table>
<br />
Some sentences are clearly copied from existing songs (<i>"oo-ee-oo I look just like Buddy Holly"</i>) but others are brand new. Overall it feels like a Weezer song to me!<br />
<br />
Moving on to the new Muse song:<br />
<br />
<table border="1">
<tbody>
<tr>
<td><span style="font-family: "courier new" , "courier" , monospace;">than you could ever give</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">and i want you now</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i wish i could</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">and make a fuss</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">like the evil in your veins</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">you are</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">(your time is now)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">our hopes and expectations</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">we don't belong here</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i won't let you bury it</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i wish i could</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">they will pull us down</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">in your world</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">now i just was to name</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">with who knows who</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i'm growing tired of fighting</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">in my sleep</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">policies</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">loneliness be over</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">vast human and material resources</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">you're unsustainable</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">is it enough</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">killed by drones</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">and our time is running out</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">you and i must fight to survive</span></td>
</tr>
</tbody></table>
<br />
This one is a bit disappointing, because the only reason it feels like a real Muse song is that it's a "summary of Muse songs" created by copying whole sentences from their songs. My intuition is that the amount of training data was too small, leading to "overfitting" (the training data is regenerated perfectly). This calls for an action by Muse to release more albums!<br />
<br />
And the highlight is this new Red Hot Chili Peppers song:<br />
<br />
<table border="1"><tbody>
<tr><td><span style="font-family: "courier new" , "courier" , monospace;">when i find my peace of mind,</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">that i could find the fireflies</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">[m1]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">someone</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">to close a right today</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">that i slept</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">you say the is least my love</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">start jumping and that sherri meet?</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">funky crime funky crime</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">just a mirror for the sun</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i wrote a letter to you</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">[chorus:]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i've been here before</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">stuck in the muck of the pond</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">to be afraid</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">play your hand and glory</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">well, i'm gonna ride a sabertooth horse</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">let's play</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">mother angel in your hand</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">take a star in a telegram</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">upon the places beyond</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">today loves smile for me</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">part of my scenery</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i'll play all night</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i am not wide</span></td></tr>
</tbody></table>
<br />
Wow... this looks nothing like <a href="https://www.youtube.com/watch?v=8-rccN7ZsRE">every Red Hot Chili Peppers song ever</a>. It doesn't even contain the word California! Maybe I should have trained the model for a few more iterations. It is pretty cool, though, that most sentences are new, and they make at least as much sense as actual RHCP lyrics do.<br />
<br />
<b>Statistics</b><br />
Now that we've got the data, we can finally answer the sleep-depriving question: "is there a correlation between the occurrence of rain-related words in lyrics and the artist's country of origin?". For the lyrics that I scraped from the web, I also kept the artists' countries. For these countries, I looked up the annual precipitation statistics. I then looked for the occurrence of any of the following words in the lyrics: <i>rain, raining, rained, rains, storm, stormy, cloud, cloudy, drizzle, flood</i>. I computed the percentage of "rain" songs per country (out of all of the songs by artists from that country). The hypothesis was that artists from countries with high average annual precipitation are more likely to sing about it.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn5VJWZCYWGe-N1NYwN6WYW0q6ElHYqm7r4ddeP8svCxf_pAeEBQF5RKWhAaTOyaHQimtQCYeaX1oH2XNiktxsa0fhCIedmD3Pzhvw8H-f-LcKkAflPhhiRhLc6rmpD-Kf6No4-cuCZE8/s1600/rain.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="577" data-original-width="1445" height="158" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn5VJWZCYWGe-N1NYwN6WYW0q6ElHYqm7r4ddeP8svCxf_pAeEBQF5RKWhAaTOyaHQimtQCYeaX1oH2XNiktxsa0fhCIedmD3Pzhvw8H-f-LcKkAflPhhiRhLc6rmpD-Kf6No4-cuCZE8/s400/rain.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption"><span style="font-size: x-small;">The percentage of songs mentioning rain in each country compared with average annual precipitation.</span></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
I was wrong. There was no correlation. It is also possible that this was a failed experiment because the number of songs for some countries was too small to draw any meaningful statistical conclusions.<br />
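For completeness, the computation itself is short; here is a sketch with made-up songs and precipitation figures standing in for the real scraped data:

```python
RAIN_WORDS = {"rain", "raining", "rained", "rains", "storm", "stormy",
              "cloud", "cloudy", "drizzle", "flood"}

def rain_song_fraction(songs):
    """Fraction of songs mentioning at least one rain-related word."""
    hits = sum(any(w in RAIN_WORDS for w in lyrics.lower().split())
               for lyrics in songs)
    return hits / len(songs)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Made-up songs and annual precipitation (mm), just to exercise the code.
by_country = {"UK": ["why does it always rain on me", "here comes the sun"],
              "US": ["california sun", "summer in the city"]}
precipitation = {"UK": 1220.0, "US": 715.0}
countries = sorted(by_country)
fractions = [rain_song_fraction(by_country[c]) for c in countries]
print(pearson(fractions, [precipitation[c] for c in countries]))
```

(With only two made-up countries the correlation is trivially 1; the real computation ran over many countries and, as said, found no correlation.)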
<br />
Can we answer more interesting questions regarding lyrics? For example, this <a href="https://www.youtube.com/watch?v=8-rccN7ZsRE">"every Red Hot Chili Peppers song ever"</a> video claims that all they ever sing about is California, but this wasn't reflected in the word cloud, nor in the generated song, meaning that this specific word was not very frequent in the corpus. However, if we only check which US states were mentioned in the songs, would California be more frequent?<sup><a href="https://veredshwartz.blogspot.co.il/2018/01/fun-with-lyrics.html#2">2</a></sup> And to make this question more general: do artists tend to sing more about their places of origin, and do some places get more attention regardless of where the artists are originally from?<br />
<br />
This time I focused on American artists, and took the lyrics of the first 200 artists from each state, checking for mentions of any state. I created a 51x51 table in which the columns represent the states mentioned and the rows represent the artists' states of origin. Rather than displaying this messy table, I plotted a <a href="https://en.wikipedia.org/wiki/Heat_map">heatmap</a> where lighter colors represent higher values (and 0 values are colored black).<sup><a href="https://veredshwartz.blogspot.co.il/2018/01/fun-with-lyrics.html#3" name="top3">3</a></sup><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglDImuFPjNomatr60mpN6ElsblO8GBKY55bYmQKf_MEkBvcqiDgqoY0L5imoxJHG5YFAVKV6nQTWNZl1qVAvWP07uy_y9vFrARLJwRqlB7losaL6TF_H2zcfhlre_ueN_6vGReArc57Hw/s1600/state_conf_mat.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1116" data-original-width="1600" height="278" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglDImuFPjNomatr60mpN6ElsblO8GBKY55bYmQKf_MEkBvcqiDgqoY0L5imoxJHG5YFAVKV6nQTWNZl1qVAvWP07uy_y9vFrARLJwRqlB7losaL6TF_H2zcfhlre_ueN_6vGReArc57Hw/s400/state_conf_mat.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Mention of states in lyrics by artists' state of origin. Columns: states mentioned in lyrics. Rows: states of origin.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
Here's how to interpret this heatmap: light values on the diagonal are pretty common, meaning that it's common for artists to sing about their states of origin. Two columns have light values across many rows: California and New York. Those are states which are common in lyrics, regardless of the artist origin.<br />
<br />
Notice that the states are sorted alphabetically, so it's difficult to answer the question of whether artists tend to sing about states in their proximity. A better visualization would place these statistics on a map, and I used the <a href="https://developers.google.com/chart/interactive/docs/gallery/map">Google Maps API</a> to do just that! Click on a state from the list and you'll see the states that sing about it visualized on a map.<br />
<br />
<iframe src="https://vered1986.github.io/misc/htmls/state_list.html" style="height: 600px; overflow: hidden; width: 100%;">
</iframe>
</div>
<ul dir="ltr" style="text-align: left;">
</ul>
<div style="text-align: left;">
<div dir="ltr" style="text-align: left;">
I think I can see a pattern of states singing about their neighbors (this kind of visualization was helpful for someone like me who doesn't know much about US geography...).<br />
<br />
<div dir="ltr" style="text-align: left;">
<b>Sentiment Analysis</b></div>
<div dir="ltr" style="text-align: left;">
Many words in Morrissey's word cloud are notably negative: <i>kill</i>, <i>hate</i>, <i>die</i>, <i>leave</i>, <i>gone</i>, etc. This is no surprise, as anyone who listens to Morrissey or the Smiths knows that most of their songs are gloomy; according to <a href="http://www.manchestereveningnews.co.uk/whats-on/music-nightlife-news/stats-prove-it-smiths-among-8028088">this study</a>, they are among the gloomiest of UK artists.<br />
<br />
This negativity can be "proved" computationally, using software for sentiment analysis. Sentiment analysis takes a text and determines its sentiment: either a binary negative/positive decision, or a score on a scale. Traditional models looked at the words in the text independently and scored the sentence according to the individual words' sentiment, recognizing "good" and "bad" words. For example, <i>"I am happy today"</i> would be considered positive thanks to the positivity of the word <i>happy</i> (and the neutrality of the other words). Today's models are mostly based on neural networks, and sometimes they also take the structure of the sentence into account (which should be helpful in recognizing that <i>"I am not happy today"</i> is negative). The <a href="http://nlp.stanford.edu:8080/sentiment/rntnDemo.html">Stanford Sentiment Analysis</a> system is an example of such a model.<br />
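To make the word-level approach concrete, here is a toy scorer. The lexicon is invented for illustration (no real system uses one this small), and it also shows exactly why ignoring sentence structure fails on negation:

```python
# A tiny hand-made sentiment lexicon; real systems score thousands of words.
LEXICON = {"happy": 1.0, "good": 0.8, "love": 1.0,
           "sad": -1.0, "hate": -1.0, "gloomy": -0.8}

def word_level_sentiment(sentence):
    """Score a sentence as the average sentiment of its individual words."""
    words = sentence.lower().split()
    return sum(LEXICON.get(w, 0.0) for w in words) / len(words)

print(word_level_sentiment("I am happy today"))      # positive, as expected
print(word_level_sentiment("I am not happy today"))  # still positive: negation is missed
```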
<br />
I was planning to compute the sentiment of all of Morrissey's lyrics vs. all the lyrics of a presumably more cheerful artist (e.g. Queen, David Bowie), but I found that most analyzers I tried did pretty badly at recognizing the sentiment of lyrics. To be fair, they are usually trained on movie/restaurant reviews, and lyrics are often more sophisticated (as evidence: at home, we humans disagreed on the sentiment of several Morrissey lines...). Here are some examples from the <a href="http://nlp.stanford.edu:8080/sentiment/rntnDemo.html">Stanford Sentiment Analysis demo</a>:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIo3l5UMHT6-GaVSvoynZtgrDBkEKeK7rKb1KtUksVGdv7kHQ431sEcA4JuLLRlHNF2LHYJIk4i8_rFm_SftpoCnxh6McnHCjHFnX-lOgx1qd00JuYtweqCuXsM66kToh13lFoXgyodq0/s1600/happy.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="357" data-original-width="871" height="163" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIo3l5UMHT6-GaVSvoynZtgrDBkEKeK7rKb1KtUksVGdv7kHQ431sEcA4JuLLRlHNF2LHYJIk4i8_rFm_SftpoCnxh6McnHCjHFnX-lOgx1qd00JuYtweqCuXsM66kToh13lFoXgyodq0/s400/happy.PNG" width="400" /></a></div>
<br />
A positive sentence from David Bowie. Sounds fun.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz-Rxzbg8BfzabGL_yPeO0ukSk-TWUbjVxV4lxnTHCRD4ou9-bZyaxDSIs-O6hHQTTSBX0s6abCzoyCeRadg8CkDWZx7Z5YYNTyv0ycr8oPFJ4K8_-GRELQ5Sc8RNRkNFitsEL5k43pYw/s1600/negative_sentiment_2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="350" data-original-width="781" height="178" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz-Rxzbg8BfzabGL_yPeO0ukSk-TWUbjVxV4lxnTHCRD4ou9-bZyaxDSIs-O6hHQTTSBX0s6abCzoyCeRadg8CkDWZx7Z5YYNTyv0ycr8oPFJ4K8_-GRELQ5Sc8RNRkNFitsEL5k43pYw/s400/negative_sentiment_2.PNG" width="400" /></a></div>
<br />
A negative sentence from Muse. A bit less fun.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjld-wJtT4JRsILqJuPUljx1CB3Xs5CQcgfEOUJAz0XqlEzS8l5_LniOhQUQ68fBn7-d4ZpWJgFZD4WgSFpMRgb5PHIezpc0dJ2qrXHGXc1yirLXpewGWski5SUm7xysCnxc_8ruS7ODZY/s1600/false_positive_sentiment.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="353" data-original-width="824" height="171" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjld-wJtT4JRsILqJuPUljx1CB3Xs5CQcgfEOUJAz0XqlEzS8l5_LniOhQUQ68fBn7-d4ZpWJgFZD4WgSFpMRgb5PHIezpc0dJ2qrXHGXc1yirLXpewGWski5SUm7xysCnxc_8ruS7ODZY/s400/false_positive_sentiment.PNG" width="400" /></a></div>
<br /></div>
</div>
<div dir="ltr" style="text-align: left;">
Finally, this last example is a subtle insult (at least in my interpretation) from Morrissey: "you were good in your time", which the model naively interpreted as a positive statement. This was a difficult one!</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
</div>
<hr />
<div style="direction: ltr; text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;">
<span style="color: black;"><a href="https://www.blogger.com/null" name="1">1</a><b> </b></span>In this post I use the lyrics I downloaded (315,357 songs) along with two lyrics corpora from Kaggle: <a href="https://www.kaggle.com/mousehead/songlyrics">from Sergey Kuznetsov</a> (57,650 songs) and <a href="https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics">from Gyanendra Mishra</a> (380,000 songs). </span><span style="font-size: x-small;">I was planning to share the code for scraping the lyrics from the web, but when I finally started writing this post, I found that the website I had been using had changed, and my scraping code no longer works.</span><sup style="font-size: small;"><a href="https://veredshwartz.blogspot.co.il/2018/01/fun-with-lyrics.html#top1" style="font-size: small;">↩</a></sup><br />
<br />
<span class="Apple-style-span" style="font-size: x-small;"><a href="https://www.blogger.com/null" name="2">2</a><b> </b>It is very, very frequent in general, so the prior probability of the occurrence of <i>California</i> in songs is high, not just the conditional probability given that it's a RHCP song. I never realized how common it is until I came back from California last summer and tried to fill the void by creating and constantly listening to this <a href="https://www.youtube.com/playlist?list=PLOcjrB_jn6-HlceLe-LwzJmlVphRDGNEi">America playlist</a> (biased towards songs about California). </span><sup style="font-size: small;"><a href="https://veredshwartz.blogspot.co.il/2018/01/fun-with-lyrics.html#top2" style="font-size: small;">↩</a></sup><br />
<br />
<span class="Apple-style-span" style="font-size: x-small;"><a href="https://www.blogger.com/null" name="3">3</a><b> </b>One note about the statistics in this post: they are inaccurate. Some states have just a few artists; mentions are counted the same whether they come from one song or from many; I didn't normalize the statistics by the number of artists in each state; I didn't check for mentions of cities; etc. </span><sup style="font-size: small;"><a href="https://veredshwartz.blogspot.co.il/2018/01/fun-with-lyrics.html#top3">↩</a></sup></div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com0tag:blogger.com,1999:blog-9145120678290195131.post-85569454569060086632017-10-30T21:11:00.000+02:002017-10-30T22:08:23.595+02:00Ambiguity<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiin8LDGmnEpbdnYaZEV9FI28b7ehPQUc-JZMkKF0eIbg4ZKRJ-MN_omkFAYjzR5II8L-DoHVf1V5QM9UkMI7O4LIji0tzZILywr6TlQk2FFJnuzwancALgzx3STzVs14vyeKu7125JGMI/s1600/slow_children.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="600" data-original-width="400" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiin8LDGmnEpbdnYaZEV9FI28b7ehPQUc-JZMkKF0eIbg4ZKRJ-MN_omkFAYjzR5II8L-DoHVf1V5QM9UkMI7O4LIji0tzZILywr6TlQk2FFJnuzwancALgzx3STzVs14vyeKu7125JGMI/s200/slow_children.jpg" width="133" /></a></div>
One of the problems with teaching computers to understand natural language is that much of the meaning in what people say is actually hidden in what they <i>don't</i> say. As humans, we trivially interpret the meaning of ambiguous words, written or spoken, according to their context. For example, this blog post is published in a blog that largely discusses natural language processing, so if I write "NLP", you'd know I refer to natural language processing rather than to <a href="https://en.wikipedia.org/wiki/Neuro-linguistic_programming">neuro-linguistic programming</a>. If I told you that the blog post doesn't fit into a tweet because it's too long, you'd know that the <i>blog post is too long</i> and not that <i>the tweet is too long</i>. You would infer that even without having any knowledge about <a href="https://www.recode.net/2017/9/26/16364002/twitter-longer-tweets-character-limit-140-280">Twitter's character limit</a>, because it just doesn't make sense otherwise. Unfortunately, common sense and world knowledge that come so easily to us are not trivial to teach to machines. In this post, I will present a few cases in which ambiguity is a challenge in NLP, along with common ways in which we try to overcome it.</div>
<br />
<hr />
<table>
<tbody>
<tr>
<td><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://memeguy.com/photo/108837/dad-jokes-" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="" height="400" src="https://memeguy.com/photos/images/dad-jokes--108837.jpg" title="Dad Jokes " width="208" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Polysemous words, providing material for dad jokes since... ever.</td></tr>
</tbody></table>
</td>
<td><b>Lexical Ambiguity</b><br />
Lexical ambiguity can occur when a word is <i>polysemous, </i>i.e. has more than one meaning, and the sentence in which it is contained can be interpreted differently depending on its correct sense.<br />
<br />
For example, the word <i>bank</i> has two senses: either a financial institution or the land alongside a river. When we read a sentence with the word <i>bank</i>, we understand which sense of <i>bank</i> the text refers to according to the context:<br />
<br />
(1) <i>Police seek person who robbed bank in downtown Reading.</i><br />
(2) <i>The faster-moving surface water travels along the concave bank.</i><br />
<br />
In these example sentences, "<i>robbed</i>" indicates the first sense while "<i>water</i>" and "<i>concave</i>" indicate the second.<br />
<br />
<br /></td>
</tr>
</tbody></table>
<br />
<table border="1"><tbody>
<tr><td><b>Existing Solutions for Lexical Ambiguity</b><br />
Word embeddings are great, but they conflate all the different senses of a word into one vector. Since word embeddings <a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html">are learned from the occurrences of a word in a text corpus</a>, the word embedding for <i>bank</i> is learned from its occurrences in both senses, and will be affected by neighbors related to the first sense (<i>money, ATM, union</i>) and to the second (<i>river, west, water</i>, etc.). The resulting vector is very likely to tend towards the more common sense of <i>bank</i>, as can be seen in this <a href="http://bionlp-www.utu.fi/wv_demo/">demo</a>: all the nearest words to <i>bank</i> are related to its financial sense.<br />
<br />
Word Sense Disambiguation (WSD) is an NLP task aimed at <i>disambiguating</i> a word in context. Given a list of potential word senses for each word, the correct sense of the word in the given context is determined. Similar to the way humans disambiguate words, WSD systems also rely on the surrounding context. A simple way to do so, in a machine-learning based solution (i.e. learning from examples), is to represent a word-in-context as the average of its context word vectors ("bag-of-words"). In the example above, we get for the first occurrence of <i>bank</i>: <i>feature_vector(bank) = 1/8 (vector(police) + vector(seek) + vector(person) + vector(who) + vector(robbed) + vector(in) + vector(downtown) + vector(reading))</i>, and for the second: <i>feature_vector(bank) = 1/9 (vector(the) + vector(faster) + vector(moving) + vector(surface) + vector(water) + vector(travels) + vector(along) + vector(the) + vector(concave))</i>.</td></tr>
</tbody></table>
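Here is what this bag-of-words feature looks like in code. The 3-dimensional vectors are made up for illustration; real systems use pretrained embeddings with hundreds of dimensions.

```python
import numpy as np

# Made-up 3-d word vectors; in practice these come from pretrained embeddings.
VECTORS = {
    "police":  np.array([0.9, 0.1, 0.0]),
    "robbed":  np.array([0.8, 0.2, 0.0]),
    "water":   np.array([0.0, 0.1, 0.9]),
    "concave": np.array([0.1, 0.0, 0.8]),
}

def context_feature(context_words):
    """Bag-of-words feature: the average of the context word vectors."""
    vecs = [VECTORS[w] for w in context_words if w in VECTORS]
    return np.mean(vecs, axis=0)

# The two occurrences of "bank" get very different feature vectors:
financial_bank = context_feature(["police", "robbed"])
river_bank = context_feature(["water", "concave"])
```

A classifier trained on such features can then tell the two senses apart, because the two occurrences land in different regions of the vector space.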
<br />
<hr />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimJr9l_t5Yonqzo9hKFJJGum0fUpng1gOrMnoQyH0jWUQUDLOPVzuFkRSn1YVyMsGrZexRd9WlNG836hNwp7kLCFIoSUFiqA5RitS6sEesbBm4PpVQ07wdC0ATf9UqHjTnYrnQxZqQE94/s1600/acl.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="891" data-original-width="792" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimJr9l_t5Yonqzo9hKFJJGum0fUpng1gOrMnoQyH0jWUQUDLOPVzuFkRSn1YVyMsGrZexRd9WlNG836hNwp7kLCFIoSUFiqA5RitS6sEesbBm4PpVQ07wdC0ATf9UqHjTnYrnQxZqQE94/s400/acl.PNG" width="355" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Can Google expand the acronym "ACL" correctly for me?</td></tr>
</tbody></table>
<b>Acronyms</b><br />
While many words in English are polysemous, things turn absolutely chaotic with acronyms. Acronyms are highly polysemous, some having dozens of different expansions. To make things even more complicated, as opposed to regular words, whose various senses are recorded in dictionaries and taxonomies like <a href="http://wordnet.princeton.edu/">WordNet</a>, acronyms are often domain-specific and not commonly known.<br />
<br />
Take for example a Google search for "ACL 2017". I get results both for the Annual Meeting of the <b>A</b>ssociation for <b>C</b>omputational <b>L</b>inguistics (which is what I was searching for) and for the <b>A</b>ustin <b>C</b>ity <b>L</b>imits festival. I have no idea whether this happens because (a) these are the two most relevant/popular expansions of "ACL" lately, or the only ones that go with "2017"; or (b) Google successfully disambiguated my query, showing the NLP conference first while also keeping the music festival lower in the search results, since it knows I also like music festivals. Probably (a) :)<br />
<br />
<table border="1"><tbody>
<tr><td><b>Existing Solutions for Acronym Expansion</b><br />
Expanding acronyms is considered a separate task from WSD, since there is no fixed inventory of potential expansions for each acronym. Given enough context (e.g. "<i>2017</i>" is a context word for the acronym <i>ACL</i>), it is possible to find texts that contain the expansion. This can be done either by searching for a pattern (e.g. <i>"Association for Computational Linguistics (ACL)"</i>) or by considering all the word sequences that start with these initials and deciding on the correct one using rules or a machine-learning based solution.</td></tr>
</tbody></table>
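The initials-matching idea can be sketched roughly as follows. This is a simplification: the stopword list and the matching rules are my own toy choices, and real systems are considerably more robust.

```python
import re

# Common function words allowed inside an expansion (e.g. the "for" in ACL).
STOPWORDS = {"of", "for", "the", "and", "in", "on"}

def candidate_expansions(acronym, text):
    """Find word sequences whose content-word initials spell out the acronym,
    allowing function words like "of"/"for" in between."""
    words = re.findall(r"[A-Za-z]+", text)
    acro = acronym.upper()
    results = []
    for start in range(len(words)):
        seq, k = [], 0
        for w in words[start:]:
            if k < len(acro) and w[0].upper() == acro[k]:
                seq.append(w)       # this word contributes the next initial
                k += 1
            elif seq and w.lower() in STOPWORDS:
                seq.append(w)       # function word in the middle is allowed
            else:
                break
        if k == len(acro) and len(seq) > 1:
            while seq[-1].lower() in STOPWORDS:  # trim trailing stopwords
                seq.pop()
            results.append(" ".join(seq))
    return results

text = "The Association for Computational Linguistics (ACL) meets annually."
print(candidate_expansions("ACL", text))
```

A rule-based or machine-learning component would then rank the candidates when more than one sequence matches.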
<br />
<hr />
<b>Syntactic Ambiguity</b>
<br />
<div dir="ltr">
No beginner NLP class is complete without at least one of the following example sentences:</div>
<div dir="ltr">
</div>
<ol style="text-align: left;">
<li><i>They ate pizza with anchovies</i></li>
<li><i><a href="https://youtu.be/NfN_gcjGoJo">I shot an elephant wearing my pajamas</a></i></li>
<li><i>Time flies like an arrow</i></li>
</ol>
What all these examples have in common is that each can be interpreted in multiple different ways, where the meanings differ in the underlying syntax of the sentence. Let's go over the examples.<br />
<br />
The first sentence <i>"They ate pizza with anchovies"</i>, can be interpreted as (i) "they ate pizza and the pizza had anchovies on it", which is the more likely interpretation, illustrated on the left side of the image below. This sentence has at least two more crazy interpretations: (ii) they ate pizza using anchovies (instead of using utensils, or eating with their hands), as in the right side of the image below, and (iii) they ate pizza and their anchovy friends ate pizza with them.<br />
<br />
<div dir="ltr">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfDcZuQjof0iNhxy3np3Qt1bxiCL_PsPiCGwDNs-_7uUD0cfKAnm7i0y_J5s2PKLfVDQoxMMgf8DH87wSBrShoBHIKmPvY8ZpSwzEGsDCBX3ZxPTLvnTW1Z1_jET3xbuk-9eTJn_ZCif4/s1600/pizza_with_anchovies_small.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="291" data-original-width="703" height="165" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfDcZuQjof0iNhxy3np3Qt1bxiCL_PsPiCGwDNs-_7uUD0cfKAnm7i0y_J5s2PKLfVDQoxMMgf8DH87wSBrShoBHIKmPvY8ZpSwzEGsDCBX3ZxPTLvnTW1Z1_jET3xbuk-9eTJn_ZCif4/s400/pizza_with_anchovies_small.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption"><span style="font-size: x-small;">Visual illustration of the interpretations of the sentence <i>"They ate pizza with anchovies"</i>. <br />Image taken from <a href="https://explosion.ai/blog/syntaxnet-in-context">https://explosion.ai/blog/syntaxnet-in-context</a>.</span></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
The first interpretation considers <i>"with anchovies"</i> as describing the <i>pizza, </i>while the other two consider it as describing the <i>eating</i> action. In the output of a <a href="https://veredshwartz.blogspot.co.il/2016/06/linguistic-analysis-of-texts.html">syntactic parser</a>, the interpretations will differ by the tree structure, as illustrated below.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJjptR5meUBWzkMvVq69AGGItFhVWbCDjaKkmy7sQNXSuiR1p5W-Btv5R3eNnFNcqcQaJ3NHcmm_-_v9zfwqH34Qh4EV-MOYZ3vZh_MuGY5_las_K33tYc1Czezxgl-9ANn1P8Z4V45AM/s1600/trees.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="447" data-original-width="888" height="201" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJjptR5meUBWzkMvVq69AGGItFhVWbCDjaKkmy7sQNXSuiR1p5W-Btv5R3eNnFNcqcQaJ3NHcmm_-_v9zfwqH34Qh4EV-MOYZ3vZh_MuGY5_las_K33tYc1Czezxgl-9ANn1P8Z4V45AM/s400/trees.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;">Possible syntactic trees for the sentence <i>"They ate pizza with anchovies"</i>, using <a href="https://demos.explosion.ai/displacy/">displacy</a>.</span></td></tr>
</tbody></table>
<br />
Although this is a classic example, both the <a href="https://demos.explosion.ai/displacy/">Spacy</a> and the <a href="http://nlp.stanford.edu:8080/corenlp/process">Stanford Core NLP</a> demos got it wrong. The difficulty is that syntactically speaking, both trees are likely. Humans know to prefer the first one based on the semantics of the words, and using their knowledge that anchovy is something that you <i>eat </i>rather than <i>eat with</i>. Machines don't come with this knowledge.<br />
<br />
A similar parser decision is crucial in the second sentence, and just in case you haven't managed to find the funny interpretations yet: <i>"I shot an elephant wearing my pajamas"</i> has two ambiguities. First, does <i>shoot</i> mean <i>taking a photo of</i> or <i>pointing a gun at</i>? (a lexical ambiguity). But more importantly, who's wearing the pajamas? That depends on whether <i>wearing</i> is attached to <i>shot</i> (meaning that I wore the pajamas while shooting) or to <i>elephant</i> (meaning that the elephant miraculously managed to squeeze into my pajamas). This entire scene, regardless of the interpretation, is very unlikely, and please don't kill elephants, even if they're stretching your pajamas.<br />
<br />
The third sentence is just plain weird, but it also has multiple interpretations, of which you can read about <a href="https://en.wikipedia.org/wiki/Time_flies_like_an_arrow;_fruit_flies_like_a_banana">here</a>.<br />
<br />
<table border="1"><tbody>
<tr><td><b>Existing Solutions for </b><b>Syntactic Ambiguity</b><br />
In the past, parsers were based on deterministic grammar rules (e.g. a noun and a modifier create a noun-phrase) rather than on machine learning. A possible solution to the ambiguity issue was to add different rules for different words. For more details, you can read my <a href="https://www.quora.com/Natural-Language-Processing-What-does-it-mean-to-lexicalize-PCFGs/answer/Vered-Shwartz?srid=8J1C">answer</a> to <a href="https://www.quora.com/Natural-Language-Processing-What-does-it-mean-to-lexicalize-PCFGs" ref="canonical">Natural Language Processing: What does it mean to lexicalize PCFGs?</a> on <a href="https://www.quora.com/">Quora</a>.
<br />
<br />
Today, similarly to other NLP tasks, parsers are mostly based on neural networks. In addition to other information, the word embeddings of the words in the sentence are used for deciding on the correct output. So potentially, such a parser may learn that "eat * with [y]" yields the output in the left of the image if y is edible (similar to word embeddings of other edible things), otherwise the right one.</td></tr>
</tbody></table>
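The last idea - letting the embedding of <i>y</i> decide the attachment - can be caricatured like this. The 2-d vectors and the prototype words are invented for illustration; a real parser learns this signal from data rather than from a hand-written rule.

```python
import numpy as np

# Toy 2-d embeddings: edible things cluster together, utensils cluster together.
VEC = {"anchovies":  np.array([0.9, 0.1]),
       "mushrooms":  np.array([0.8, 0.2]),
       "forks":      np.array([0.1, 0.9]),
       "chopsticks": np.array([0.0, 1.0])}

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def attachment(noun):
    """Attach 'with <noun>' to the pizza if the noun looks edible, else to 'ate'."""
    if cos(VEC[noun], VEC["mushrooms"]) > cos(VEC[noun], VEC["chopsticks"]):
        return "noun-attachment (a topping on the pizza)"
    return "verb-attachment (an instrument for eating)"

print(attachment("anchovies"))
print(attachment("forks"))
```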
</div>
<div dir="ltr">
<br /></div>
<hr />
<div dir="ltr">
<b>Coreference Ambiguity</b><br />
Very often a text mentions an entity (someone/something), and then refers to it again, possibly in a different sentence, using another word. Take these two paragraphs from a news article as an example:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj64RDdrTrndCc4qCmKFUggxhZu-ZeJR4hn6xw3ulys_NH_y5LpKQbuGjpEzWzCfDB5xEmIJU92vMEGSdTjufE7sEEHEagshDS7IC2Qqr_QtFdZZSLrYIvs-JozLGbzfcHH8XhyphenhyphenDbNHNrc/s1600/coref.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="240" data-original-width="626" height="122" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj64RDdrTrndCc4qCmKFUggxhZu-ZeJR4hn6xw3ulys_NH_y5LpKQbuGjpEzWzCfDB5xEmIJU92vMEGSdTjufE7sEEHEagshDS7IC2Qqr_QtFdZZSLrYIvs-JozLGbzfcHH8XhyphenhyphenDbNHNrc/s320/coref.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;">From <a href="https://www.theguardian.com/sport/2017/sep/22/donald-trump-nfl-national-anthem-protests">https://www.theguardian.com/sport/2017/sep/22/donald-trump-nfl-national-anthem-protests</a>. <br />The various entities participating in the article were marked in different colors.</span></td></tr>
</tbody></table>
<br />
I marked various entities that participate in the article in different colors. I grouped together different mentions of the same entities, including pronouns (<b>"<i>he</i>" </b>as referring to "<i>that son of a bitch</i>"; excuse my language, I'm just quoting Trump) and different descriptions ("<i>Donald Trump</i>", "<i>the president</i>"). To do that, I had to use my common sense (the <i>he</i> must refer to <i>that son of a bitch</i> who disrespected the flag, definitely not to <i>the president</i> or the <i>NFL owners</i>, right?) and my world knowledge (<i>Trump</i> is <i>the president</i>). Again, any task that requires world knowledge and reasoning is difficult for machines.<br />
<br />
<div style="-webkit-text-stroke-width: 0px; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<table border="1"><tbody>
<tr><td><b>Existing Solutions for Coreference Resolution</b><br />
Coreference resolution systems group mentions that refer to the same entity in the text. They go over each mention (e.g. <i>the president</i>), and either link it to an existing group containing previous mentions of the same entity (<i>[Donald Trump, the president]</i>), or start a new entity cluster (<i>[the president]</i>). Systems differ from each other, but in general, given a pair of mentions (e.g. <i>Donald Trump, the president</i>), they extract features referring either to each single mention (e.g. part-of-speech, word vector) or to the pair (e.g. gender/number agreement, etc.), and decide whether these mentions refer to the same entity. <br />
<br />
Note that mentions can be proper-names (<i>Donald Trump</i>), common nouns (<i>the president</i>) and pronouns (<i>he</i>); identifying coreference between pairs of mentions from each type requires different abilities and knowledge. For example, proper-name + common noun may require world knowledge (<i>Donald Trump is the president</i>), while pairs of common nouns can sometimes be solved with semantic similarity (e.g. synonyms like <i>owner </i>and <i>holder</i>). Pronouns can sometimes be matched to their antecedent (original mention) based on proximity and linguistic cues such as gender and number agreement, but very often there is more than one possible option for matching. <br />
<div style="margin: 0px;">
</div>
</td></tr>
</tbody></table>
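A toy version of the pairwise agreement features mentioned above might look like this. The attribute table is hand-written for this one example; real systems derive gender and number from parsers, name lists, and large lexicons.

```python
# Toy gender/number attributes; None means unknown (compatible with anything).
ATTRIBUTES = {
    "donald trump":  {"gender": "male", "number": "sg"},
    "the president": {"gender": None,   "number": "sg"},
    "he":            {"gender": "male", "number": "sg"},
    "nfl owners":    {"gender": None,   "number": "pl"},
}

def pair_features(mention1, mention2):
    """Agreement features for a candidate coreferring mention pair."""
    a, b = ATTRIBUTES[mention1.lower()], ATTRIBUTES[mention2.lower()]
    gender_ok = (a["gender"] is None or b["gender"] is None
                 or a["gender"] == b["gender"])
    number_ok = a["number"] == b["number"]
    return {"gender_agreement": gender_ok, "number_agreement": number_ok}

# "he" may corefer with "Donald Trump", but not with the plural "NFL owners":
print(pair_features("he", "Donald Trump"))
print(pair_features("he", "NFL owners"))
```

These features only rule candidates in or out; the hard cases, as the post notes, still need world knowledge on top of agreement.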
</div>
<br />
<div dir="ltr" style="-webkit-text-stroke-width: 0px; color: black; font-family: "Times New Roman"; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
</div>
A nice example of solving coreference ambiguity is the <a href="http://commonsensereasoning.org/winograd.html">Winograd Schema challenge</a>, which I first heard about in this <a href="https://artistdetective.wordpress.com/2015/12/31/winograd-schema-challenge/">post</a> in the <a href="https://artistdetective.wordpress.com/">Artificial Detective</a> blog. In this contest, computer programs are given a sentence with two nouns and an ambiguous pronoun, and they need to answer which noun the pronoun refers to, as in the following example:<br />
<br />
<i>The trophy would not fit in the brown suitcase because it was too big. What was too big?</i><br />
<span style="background-color: #fff2cc;">Answer 0: the trophy</span><br />
Answer 1: the suitcase<br />
<br />
Answering such questions requires, yes, you guessed correctly - commonsense and world knowledge. In the given example, the computer must reason that for the first object to fit into the second, the first object must be smaller than the second, so if the trophy could not fit into the suitcase, the trophy must be too big. Conversely, if instead of <i>big</i>, the question would have read <i>small</i>, the answer would have been "<i>the suitcase</i>".<br />
<br /></div>
<hr />
<div dir="ltr">
<b>Noun Compounds</b></div>
<div dir="ltr">
Words are usually considered the basic unit of a language, and many NLP applications use <a href="https://veredshwartz.blogspot.co.il/2016/01/representing-words.html">word embeddings</a> to represent the words in the text. Word embeddings do a pretty decent job of capturing the semantics of a single word, and sometimes also its syntactic and morphological properties. The problem starts when we want to capture the semantics of a multi-word expression (or a sentence, or a document). The embedding of a word, for example <i>dog</i>, is learned from its occurrences in a large text corpus; the more common a word is, the more occurrences there are, and the higher the quality of the learned word embedding will be (it would be located "correctly" in the vector space near things that are similar to <i>dog</i>). A bigram like <i>hot dog </i>is already much less frequent, even less frequent is <i>hot dog bun</i>, and so on. The conclusion is clear - we can't learn embeddings for multi-word expressions the same way we do for single words.<br />
<br />
The alternative is to try to somehow combine the word embeddings of the single words in the expression into a meaningful representation. Although there are many approaches for this task, there is no one-size-fits-all solution for this problem; a multi-word expression is not simply the sum of its single word meanings (<i>hot dog</i> is an extreme counter-example!).<br />
<br />
One example out of many would be noun-compounds. A noun-compound is a noun that is made up of two or more words, which usually consists of the head (main) noun and its modifiers, e.g. <i>video conference</i>, <i>pumpkin spice latte</i>, and <i>paper clip</i>. The use of noun-compounds in English is very common, but most noun-compounds don't appear frequently in text corpora. As humans, we can usually interpret the meaning of a new noun-compound if we know the words it is composed of; for example, even though I've never heard of <i>watermelon soup</i>, I can easily infer that it is a <i>soup </i><b>made of</b> <i>watermelon</i>.<br />
<br />
Similarly, if I want my software to have a nice vector representation of <i>watermelon soup</i>, there is no way I can base it on the corpus occurrences of <i>watermelon soup</i> -- it would be too rare. However, I used my commonsense to build a representation of <i>watermelon soup </i>in my head -- how would my software know that there is a <b>made of </b>relation between <i>watermelon </i>and <i>soup</i>? This relation can be one out of many, for example: <i>video conference</i> (<b>means</b>), <i>paper clip </i>(<b>purpose</b>), etc. Note that the relation is implicit, so there is no immediate way for the machine to know what the correct relation between the head and the modifier is.<sup><a href="http://veredshwartz.blogspot.co.il/2017/10/ambiguity.html#1" name="top1">1</a> </sup>To complicate things a bit further, many noun-compounds are non-compositional, i.e. the meaning of the compound is not a straightforward combination of the meaning of its words, as in <i><a href="https://en.wikipedia.org/wiki/Hot_dog#Etymology">hot dog</a>, <a href="https://en.wikipedia.org/wiki/Babysitting#Etymology">baby sitting</a>, and <a href="https://youtu.be/eUTbDa4lzW0?t=2m">banana hammock</a>.</i><br />
<br />
<table border="1">
<tbody>
<tr><td><b>Existing Solutions for Noun-compound Interpretation</b><br />
Automatic methods for interpreting the relation between the head and the modifier of noun-compounds have largely been divided into two approaches:<br />
<br />
(1) machine-learning methods, i.e. hand-labeling a bunch of noun-compounds to a set of pre-defined relations (e.g. part of, made of, means, purpose...), and learning to predict the relation for unseen noun-compounds. The features are either related to each single word (head/modifier), such as their word vectors or lexical properties from <a href="https://wordnet.princeton.edu/">WordNet</a>, or to the noun-compound itself and its corpus occurrences. Some methods also try to learn a vector representation for a noun-compound in the form of applying a function to the word embeddings of its single words (e.g. vector(<i>olive oil</i>) = function(vector(<i>olive</i>), vector(<i>oil</i>))).<br />
<br />
(2) finding joint occurrences of the nouns in a text corpus, some of which would explicitly describe the relation between the head and the modifier. For example "<i>oil </i>made of <i>olives</i>".<br />
<br />
While there has been a lot of work in this area, success on this task is still mediocre. A recent <a href="http://aclweb.org/anthology/W16-1604">paper</a> suggested that current methods succeed mostly by predicting the relation based solely on the head or on the modifier - for example, most noun-compounds with the head "<i>oil</i>" hold the <b>made of</b> relation (<i>olive oil, coconut oil, avocado oil</i>, ...). While this guess can be pretty accurate most of the time, it may cause funny mistakes as in the meme below.</td></tr>
</tbody></table>
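A minimal sketch of approach (1), using made-up word vectors and a tiny hand-labeled training set. A real system would use pretrained embeddings and a proper classifier; this nearest-centroid toy just illustrates the pipeline:

```python
import numpy as np

# Made-up 3-dimensional embeddings (a real system would use pretrained vectors).
emb = {
    "olive": [1.0, 0.1, 0.0], "oil": [0.9, 0.2, 0.1],
    "coconut": [0.8, 0.0, 0.2], "paper": [0.0, 1.0, 0.1],
    "clip": [0.1, 0.9, 0.0], "video": [0.0, 0.1, 1.0],
    "conference": [0.1, 0.0, 0.9],
    "watermelon": [0.9, 0.1, 0.1], "soup": [0.8, 0.3, 0.0],
}

def features(modifier, head):
    # Represent the compound by concatenating its single-word vectors.
    return np.concatenate([emb[modifier], emb[head]])

# A handful of hand-labeled noun-compounds (the "training set").
train = [
    (("olive", "oil"), "MADE-OF"),
    (("coconut", "oil"), "MADE-OF"),
    (("paper", "clip"), "PURPOSE"),
    (("video", "conference"), "MEANS"),
]

# Nearest-centroid "classifier": average the feature vectors per relation.
centroids = {}
for (mod, head), rel in train:
    centroids.setdefault(rel, []).append(features(mod, head))
centroids = {rel: np.mean(vs, axis=0) for rel, vs in centroids.items()}

def predict(modifier, head):
    f = features(modifier, head)
    return min(centroids, key=lambda rel: np.linalg.norm(f - centroids[rel]))

print(predict("watermelon", "soup"))  # MADE-OF
```

Note that this toy reproduces the weakness discussed above: since <i>soup</i> sits close to <i>oil</i> in the invented vector space, the prediction leans heavily on lexical memorization of the head and modifier.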
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbrJLbHqhwBTGpzLJ69vTXxsOkA3eMbl1w7twmz53MLg2VsHfAUOCK54aBcyq_U0j9daXc6zNkhFQnUmcEor7NfGiybq-gexwFYlSNtWpfnSmNaCVMYYAVUaLq924esQj5BOxL_ZbVQ_U/s1600/baby_oil.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="300" data-original-width="300" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbrJLbHqhwBTGpzLJ69vTXxsOkA3eMbl1w7twmz53MLg2VsHfAUOCK54aBcyq_U0j9daXc6zNkhFQnUmcEor7NfGiybq-gexwFYlSNtWpfnSmNaCVMYYAVUaLq924esQj5BOxL_ZbVQ_U/s200/baby_oil.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="direction: ltr;"><span style="font-size: x-small;">From <a href="http://www.quickmeme.com/meme/3r9thy">http://www.quickmeme.com/meme/3r9thy</a>.</span></td></tr>
</tbody></table>
<br />
For the sake of simplicity, I focused on two-word noun-compounds, but noun-compounds with more than two words have an additional, syntactic ambiguity: what are the head-modifier relations within the compound? This problem is often referred to as bracketing. Without getting into too many details, consider the example of <i>hot dog bun</i> from before. It should be interpreted as [[<i>hot dog</i>][<i>bun</i>]]<i> </i>rather than [<i>hot </i>[<i><a href="https://www.google.co.il/search?q=%22dog+bun%22+-hot&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjy58rYgpnXAhULb1AKHStWBwQQ_AUICigB&biw=1280&bih=600">dog bun</a></i>]]<i>.</i><br />
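One classic bracketing heuristic, sketched below with invented corpus counts, is the adjacency model: bracket first whichever adjacent pair co-occurs more strongly in a corpus:

```python
# Toy bigram counts (invented numbers); a real system would use
# frequencies from a large text corpus or the web.
count = {("hot", "dog"): 5000, ("dog", "bun"): 10}

def bracket(w1, w2, w3):
    """Adjacency model: bracket the more strongly associated adjacent pair first."""
    if count.get((w1, w2), 0) >= count.get((w2, w3), 0):
        return f"[[{w1} {w2}] {w3}]"
    return f"[{w1} [{w2} {w3}]]"

print(bracket("hot", "dog", "bun"))  # [[hot dog] bun]
```

Since "hot dog" is vastly more frequent than "dog bun", the left bracketing wins, matching the intended reading.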
<br />
<br />
<br />
<br /></div>
<div dir="ltr">
<hr />
<b>More to read?</b><br />
Yeah, I know it was a long post, but there is so much more ambiguity in language that I haven't discussed. Here is another selected topic, in case you're looking for more to read. We all speak a second language called <i>emoji</i>, which is full of ambiguity. Here are some interesting articles about it: <a href="https://www.thestar.com/business/tech_news/2017/08/04/emoji-could-cause-confusion-trouble-in-the-workplace.html">Emoji could cause confusion, trouble in the workplace</a>, <a href="http://mashable.com/2017/06/03/emoji-twitter-handles-meanings/#JQEeyCKGBkq3">The real meaning of all those emoji in Twitter handles</a>, <a href="http://www.jellyfish.net/en-us/news-and-views/learning-the-language-of-emoji-0">Learning the language of emoji</a>, and <a href="https://www.usatoday.com/story/tech/2017/07/16/emoji-celebrate-all-emojis-worldemojiday/466053001/">Why emojis may be the best thing to happen to language in the digital age</a>. For the older people among us (and in the context of emoji, I consider myself old too, so no offence anyone), if you're not sure about the meaning of an emoji, why don't you check <a href="https://emojipedia.org/">emojipedia</a> first, just to make sure you're not accidentally using <a href="https://emojipedia.org/aubergine/">phallic symbols</a> in your grocery list?<br />
<br />
<hr style="text-align: right;" />
<div style="direction: ltr;">
<span class="Apple-style-span" style="font-size: x-small;"><span style="color: black;"><a href="https://www.blogger.com/null" name="1" style="font-weight: bold;">1</a><b> </b></span></span><span style="font-size: x-small;">I</span><span style="font-size: x-small;">n this very interesting </span><span style="font-size: x-small;"><a href="http://people.ischool.berkeley.edu/~nakov/selected_papers_list/JNLE2013.pdf">paper</a> </span><span style="font-size: x-small;">by Preslav Nakov there is a nice observation: a noun-compound is a "compression device" that allows saying more with fewer words</span><span style="font-size: x-small;">.</span><span style="font-size: x-small;"> </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2017/10/ambiguity.html#top1" style="font-size: small;">↩</a></sup></div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com5tag:blogger.com,1999:blog-9145120678290195131.post-50378742821669237332017-08-09T19:24:00.000+03:002017-08-09T19:28:21.729+03:00Paraphrasing<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
One of the things that make natural language processing so difficult is language variability: there are multiple ways to express the same idea/meaning. I mentioned it several times in this blog, since it is a true challenge for any application that aims to interact with humans. You may program it to understand common things or questions that a human may have, but if the human decides to deviate from the script and phrase it slightly differently, the program is helpless. If you want a good example, take your favorite personal assistant (Google Assistant, Siri, Alexa, etc.) and ask it a question you know it can answer, but this time phrase it differently. Here is mine:</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6p00T0UwLwahzRoY5IJLE3LnrVFsRGBW6amspOe-rSm74jDX3nlzDUM0PgyiXLTVWt8zIPQH_c3NTgF8fh0QElP24gmxTe-UNqKekK3tBFzKquClfAdt3hW_2FY0AOLWrG1QGnyXdstI/s1600/Screenshot_2017-07-28-17-48-02-447_com.google.android.googlequicksearchbox.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1280" data-original-width="720" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6p00T0UwLwahzRoY5IJLE3LnrVFsRGBW6amspOe-rSm74jDX3nlzDUM0PgyiXLTVWt8zIPQH_c3NTgF8fh0QElP24gmxTe-UNqKekK3tBFzKquClfAdt3hW_2FY0AOLWrG1QGnyXdstI/s320/Screenshot_2017-07-28-17-48-02-447_com.google.android.googlequicksearchbox.png" width="180" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOQ7_HCXads0RmQf9msKvLwekD-RtnXG5S7NslsuasfGI9vGQRERwxZaVNwZ1AHYg1G6n85yP4zdJHvWfjsuRZgVWTci9FI0LJ3xJnEtt3hzA5Vzfttqy29QY5hk0SUs4-vZJivY9Brz8/s1600/Screenshot_2017-07-28-15-16-23-941_com.google.android.googlequicksearchbox.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1280" data-original-width="720" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOQ7_HCXads0RmQf9msKvLwekD-RtnXG5S7NslsuasfGI9vGQRERwxZaVNwZ1AHYg1G6n85yP4zdJHvWfjsuRZgVWTci9FI0LJ3xJnEtt3hzA5Vzfttqy29QY5hk0SUs4-vZJivY9Brz8/s320/Screenshot_2017-07-28-15-16-23-941_com.google.android.googlequicksearchbox.png" width="180" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
Both questions I asked have roughly the same meaning, yet Google answers the first perfectly but fails to answer the second, backing off to showing search results. In fact, I just gave you a "free" example of another difficult problem in NLP: ambiguity. It seems that Google interpreted <i>showers</i> as "meteor showers" rather than as light rain.</div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
One way to deal with the language variability difficulty is to construct a huge dictionary that contains groups or pairs of texts with roughly the same meaning: <b><i>paraphrases</i></b>. Then, applications like the assistant can, given a new question, look up in the dictionary any question they were programmed to answer that has the same meaning. Of course, this is a naive idea, given that language is infinite and one can always form a new sentence that has never been said before. But it's a good start, and it may help develop algorithms that can associate a new unseen text with an existing dictionary entry (i.e. generalize). </div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
Several approaches have been used to construct such dictionaries, and in this post I will present some of the simple-but-smart approaches. </div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<b>Translation-based paraphrasing</b></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
The idea behind this approach is super clever and simple: suppose we are interested in collecting paraphrases in English. If two English texts are translated to the same text in a foreign language, then they are likely paraphrases of each other. Here is an example:</div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZEvJMpcyzqFRdR7_FmMtOXwLZU3iCbD6w_SFo3VCrKt7SSpa8Bf8HpmOg3Djquiv0pj1i3MoTgJvhrSMhvsHds7M87D4qOUTUWNQY3l8bnj8isH0LhX8rvluHZyYRAQtLRwPMivsx330/s1600/translation-based.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="227" data-original-width="536" height="135" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZEvJMpcyzqFRdR7_FmMtOXwLZU3iCbD6w_SFo3VCrKt7SSpa8Bf8HpmOg3Djquiv0pj1i3MoTgJvhrSMhvsHds7M87D4qOUTUWNQY3l8bnj8isH0LhX8rvluHZyYRAQtLRwPMivsx330/s320/translation-based.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="direction: ltr; text-align: center;">The English texts on the left are translated into the same Italian text on the right, implying that they have the same meaning.</td></tr>
</tbody></table>
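The pivot step can be sketched as follows. The phrase table here is a toy example I made up (real tables are extracted automatically from parallel corpora and are far noisier); note how an ambiguous pivot word like <i>estacion</i> also produces a false paraphrase pair:

```python
from collections import defaultdict

# Toy phrase table of (English phrase, foreign translation) pairs.
phrase_table = [
    ("I have to go", "devo andare"),
    ("I must go", "devo andare"),
    ("I need to leave", "devo andare"),
    ("season", "estacion"),   # the pivot word is ambiguous...
    ("station", "estacion"),  # ...so these two become false paraphrases
]

# Pivot step: English phrases sharing a foreign translation are candidates.
by_pivot = defaultdict(set)
for english, foreign in phrase_table:
    by_pivot[foreign].add(english)

paraphrases = {frozenset({a, b})
               for group in by_pivot.values()
               for a in group for b in group if a != b}
print(paraphrases)
```

Real systems additionally score each pair with translation probabilities, which down-weights (but does not eliminate) pairs produced by ambiguous pivots.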
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
This approach goes back as far as <a href="http://www.aclweb.org/anthology/P01-1008">2001</a>. The most prominent resource constructed with this approach is the paraphrase database (<a href="http://paraphrase.org/#/">PPDB</a>), a resource containing hundreds of millions of text pairs with roughly the same meanings. Using the online demo, I looked up paraphrases of "<i>nice to meet you</i>", yielding a <a href="http://paraphrase.org/#/search?q=nice%20to%20meet%20you&filter=&lang=en">bunch</a> of friendly variants that may be of use for conference small talk: </div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
</div>
<br />
<table align="center" frame="border">
<tbody>
<tr><td style="text-align: center;"><i>it was nice meeting you</i></td></tr>
<tr><td style="text-align: center;"><i>it was nice talking to you</i></td></tr>
<tr><td style="text-align: center;"><i>nice to see you</i></td></tr>
<tr><td style="text-align: center;"><i>hey, you guys</i></td></tr>
<tr><td style="text-align: center;"><i>it's nice to meet you</i></td></tr>
<tr><td style="text-align: center;"><i>very nice to meet you</i></td></tr>
<tr><td style="text-align: center;"><i>nice to see you</i></td></tr>
<tr><td style="text-align: center;"><i>i'm pleased to meet you</i></td></tr>
<tr><td style="text-align: center;"><i>it's nice to meet you</i></td></tr>
<tr><td style="text-align: center;"><i>how are you</i></td></tr>
<tr><td style="text-align: center;"><i>i'm delighted</i></td></tr>
<tr><td style="text-align: center;"><i>it's been a pleasure</i></td></tr>
<tr><td><hr />
</td></tr>
<tr><td class="tr-caption" style="direction: ltr; text-align: center;">Paraphrases of "nice to meet you", from <a href="http://paraphrase.org/#/search?q=nice%20to%20meet%20you&filter=&lang=en">PPDB</a>.</td></tr>
</tbody></table>
<br />
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
In practice, all these texts appear as paraphrases of "<i>nice to meet you</i>" in the resource, with different scores (to what extent is this text a paraphrase of "<i>nice to meet you</i>"?). These texts were found to be translated to the same text in a single or in <b>multiple </b>foreign languages, and their scores correspond to the translation scores (as explained <a href="http://veredshwartz.blogspot.ca/2015/09/translation-models.html">here</a>), along with other heuristics.<sup><a href="http://veredshwartz.blogspot.co.il/2017/08/paraphrasing.html#2" name="top2">2</a> </sup></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
While this approach provides a ton of very useful paraphrases, as you can guess, it also introduces errors, as does every automatic method. One type of error occurs when the foreign word has more than one sense, each translating into a different, unrelated English word. For example, the Spanish word <i>estacion </i>has two meanings: <i>station </i>and <i>season</i>. When given a Spanish sentence that contains this word, it is translated (hopefully) to the correct English word according to the context. This paraphrase approach, however, does not look at the original sentences in which these words occur, but only at the phrase table -- a huge table of English phrases and their Spanish translations <i>without their original contexts.</i> It therefore has no way to tell that <i>stop </i>and <i>station </i>refer to the same meaning of <i>estacion, </i>and are therefore paraphrases, while <i>season </i>and <i>station</i> are translations of two different senses of <i>estacion.</i></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<i><br /></i></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
Even without making such a horrible mistake of considering two completely unrelated texts as paraphrases, paraphrasing is not well-defined, and the <i>paraphrase </i>relation encompasses many different relations. For example, looking for <a href="http://paraphrase.org/#/search?q=tired&filter=&lang=en">paraphrases of the word <i>tired</i> in PPDB</a>, you will get equivalent phrases like <i>fatigued</i>, more specific phrases like <i>overtired/exhausted</i>, and related but not-quite-the-same phrases like <i>bored. </i>This may occur when the translator likes being creative and does not remain completely faithful to the original sentence, but also when the target language does not contain an exact translation for a word, defaulting to a slightly more specific or more general word. While this phenomenon is not specific to this approach but common to all paraphrasing approaches (for different reasons), it has been studied by the PPDB people, who did <a href="http://www.aclweb.org/anthology/P15-1146">an interesting analysis</a> of the different semantic relations the resource captures.</div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
The following approaches focus on paraphrasing predicates. A <a href="http://www.k12reader.com/term/simple-predicate/">predicate</a> is a text describing an action or a relation between one or more entities/arguments, very often containing a verb. For example: <i>John <b>ate</b> an apple</i> or <i>Amazon <b>acquired</b> Whole Foods</i>. Predicate paraphrases are pairs of predicate templates -- i.e. predicates whose arguments were replaced by placeholders -- that would have roughly the same meaning given an assignment to their arguments. For example, <b>[a]<sub>0</sub> acquired [a]<sub>1</sub></b> and <b>[a]<sub>0</sub> bought [a]<sub>1 </sub></b>are paraphrases given the assignment [a]<sub>0 </sub>= <i>Amazon </i>and [a]<sub>1 </sub>= <i>Whole Foods</i>.<sup><a href="http://veredshwartz.blogspot.co.il/2017/08/paraphrasing.html#1" name="top1">1</a> </sup>Most approaches focus on <span style="text-align: center;">binary predicates (predicates with two arguments).</span></div>
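To make the template notation concrete, here is a tiny sketch of instantiating predicate templates with an argument assignment (the plain-text placeholder format <code>[a]0</code>/<code>[a]1</code> and the helper function are my own, just for illustration):

```python
def instantiate(template, args):
    """Fill a predicate template's placeholders with concrete arguments."""
    out = template
    for i, arg in enumerate(args):
        out = out.replace(f"[a]{i}", arg)
    return out

t1, t2 = "[a]0 acquired [a]1", "[a]0 bought [a]1"
args = ("Amazon", "Whole Foods")
print(instantiate(t1, args))  # Amazon acquired Whole Foods
print(instantiate(t2, args))  # Amazon bought Whole Foods
```

Under this assignment the two instantiated sentences have the same meaning, so the templates count as predicate paraphrases.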
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<b>Argument-distribution paraphrasing</b></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
This approach relies on a simple assumption: if two predicates have the same meaning, they should normally appear with the same arguments. Here is an example:</div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFWoYDi-0hSKKorzrBru9ssPYyxl3oYWuCPLMUqsCoggLp6XuIQTbjuZgz2kxRt_N0PUSf81myBPqKNRfYviO52tvsTu0kREihMBr-TPnX2jQgn81VTAXLDvTu1X6F4uPuia-NdtM8MYg/s1600/distributional.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="243" data-original-width="600" height="162" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFWoYDi-0hSKKorzrBru9ssPYyxl3oYWuCPLMUqsCoggLp6XuIQTbjuZgz2kxRt_N0PUSf81myBPqKNRfYviO52tvsTu0kREihMBr-TPnX2jQgn81VTAXLDvTu1X6F4uPuia-NdtM8MYg/s400/distributional.PNG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="direction: ltr; text-align: center;"></td></tr>
</tbody></table>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<span style="text-align: center;">In this example, the </span><b>[a]<sub>0</sub></b><span style="text-align: center;"> slots in both predicates are expected to contain names of companies that acquired other companies while the </span><b>[a]<sub>1</sub></b><span style="text-align: center;"> slot is expected to contain acquired companies. </span></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<span style="text-align: center;"><br /></span></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<span style="text-align: center;">The <a href="http://dl.acm.org/citation.cfm?id=502559">DIRT</a> method represents each predicate as two vectors: (1) the distribution of words that appeared in its </span><b>[a]<sub>0</sub></b><span style="text-align: center;"> argument slot, and (2) the distribution of words that appeared in its </span><b>[a]<sub>1</sub></b><span style="text-align: center;"> argument slot. For example, the </span><b>[a]<sub>0</sub></b><span style="text-align: center;"> vectors of the predicates in the example will have positive/high values for names of people and names of companies that acquired other companies, and low values for other (small) companies and other unrelated words (<i>cat</i>, <i>cookie, ...</i>). </span><span style="text-align: center;">To measure the similarity between two predicates, the two vector pairs (</span><b>[a]<sub>0</sub></b><span style="text-align: center;"> in each predicate and </span><b>[a]<sub>1</sub></b><span style="text-align: center;"> in each predicate) </span><span style="text-align: center;">are compared using vector similarity measures (e.g. cosine similarity), and a final score averages the per-slot similarities.</span></div>
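A minimal sketch of the DIRT idea over a toy set of (argument, predicate, argument) triples. The triples are invented examples; a real system extracts millions of them from a parsed corpus:

```python
import numpy as np

# Toy "corpus" of ([a]0, predicate, [a]1) triples.
triples = [
    ("Amazon", "acquired", "Whole Foods"), ("Google", "acquired", "YouTube"),
    ("Amazon", "bought", "Whole Foods"), ("Google", "bought", "YouTube"),
    ("John", "ate", "an apple"), ("Mary", "ate", "a sandwich"),
]

vocab = sorted({w for a0, _, a1 in triples for w in (a0, a1)})

def slot_vector(predicate, slot):
    """Count distribution of words seen in one argument slot of a predicate."""
    v = np.zeros(len(vocab))
    for a0, pred, a1 in triples:
        if pred == predicate:
            v[vocab.index(a0 if slot == 0 else a1)] += 1
    return v

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def dirt_similarity(p, q):
    # Average the per-slot cosine similarities, as in DIRT.
    return float(np.mean([cosine(slot_vector(p, s), slot_vector(q, s))
                          for s in (0, 1)]))

print(dirt_similarity("acquired", "bought"))  # 1.0: identical argument slots
print(dirt_similarity("acquired", "ate"))     # 0.0: disjoint arguments
```

In this toy corpus <i>acquired</i> and <i>bought</i> share all their arguments, so they score as perfect paraphrases; note that <i>acquired</i> and <i>sold</i> would score just as high, which is exactly the weakness discussed next.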
<div style="direction: ltr; text-align: left;">
<br /></div>
<div style="direction: ltr; text-align: left;">
Now, while it is true that predicates with the same meaning often share arguments, it is definitely not true that predicates that share a fair amount of their argument instantiations are always paraphrases. A simple counterexample would be of predicates with opposite meanings, that often tend to appear with similar arguments: for instance, <b>"[stock] rise to [30]"</b> and <b>"[stock] fall to [30]" </b>or "<b>[a]<sub>0</sub></b><span style="text-align: center;"> <b>acquired </b></span><b>[a]<sub>1</sub></b>" and "<b>[a]<sub>0</sub></b><span style="text-align: center;"> <b>sold </b></span><b>[a]<sub>1</sub></b>" with any <b>[a]<sub>0 </sub></b>that once bought an <b>[a]<sub>1 </sub></b>and then sold it.</div>
<div style="direction: ltr; text-align: left;">
<br /></div>
<div style="direction: ltr; text-align: left;">
Following this approach, other methods were suggested, such as capturing a directional inference relation between predicates (e.g. <b>[a]<sub>0</sub></b><span style="text-align: center;"> <b>shot </b></span><b>[a]<sub>1</sub></b><b> => [a]<sub>0</sub></b><span style="text-align: center;"> <b>killed </b></span><b>[a]<sub>1</sub></b> but not vice versa), releasing a huge resource of such predicate pairs (see the <a href="http://www.cs.tau.ac.il/~joberant/homepage_files/publications/thesis.pdf">paper</a>); and a method to predict whether one predicate entails the other, given a specific context (see the <a href="https://www.aclweb.org/anthology/C/C16/C16-1273.pdf">paper</a>). </div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<b>Event-based paraphrases</b></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
Another good source for paraphrases is multiple descriptions of the same news event, as various news reporters are likely to choose different words to describe the same event. To automatically group news headlines discussing the same story, it is common to group them according to the publication date and word overlap. Here is an example of some headlines describing the acquisition of <i>Whole Foods</i> by <i>Amazon</i>:</div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUzeiJNcVjlBRQLpo25exvqwcTKgR7DuNkEXpCpggOEM_gdIVB-rldjDXil2qhd7Uy3Ne7qnMVuKve2ubrU0OtOmmxYoGE-fzb4p98dQfN7gvsLSB9tTbTjXEBko0JxABydOcxNvdgg_8/s1600/tweets.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="682" data-original-width="1442" height="188" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUzeiJNcVjlBRQLpo25exvqwcTKgR7DuNkEXpCpggOEM_gdIVB-rldjDXil2qhd7Uy3Ne7qnMVuKve2ubrU0OtOmmxYoGE-fzb4p98dQfN7gvsLSB9tTbTjXEBko0JxABydOcxNvdgg_8/s400/tweets.PNG" width="400" /></a></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
We can stop here and say that all these headlines are sentential paraphrases. However, going a step further, if we've already observed in the past <i>Google </i><b>to acquire </b><i>YouTube</i> / <i>Google</i><b> is buying </b><i>YouTube </i>as sentential paraphrases (and many other similar paraphrases), we can generalize and say that <b>[a]<sub>0</sub></b><span style="text-align: center;"> <b>to</b> <b>acquire </b></span><b>[a]<sub>1</sub></b> and <b>[a]<sub>0</sub></b><span style="text-align: center;"> <b>is buying </b></span><b>[a]<sub>1 </sub></b>are <i>predicate</i> paraphrases.</div>
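The grouping step can be sketched as follows, using invented headlines and a simple Jaccard word-overlap measure (the 0.4 threshold is an arbitrary choice for illustration):

```python
def jaccard(a, b):
    """Word-overlap similarity between two headlines."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

# Toy headlines with publication dates (invented examples).
headlines = [
    ("2017-06-16", "Amazon to acquire Whole Foods"),
    ("2017-06-16", "Amazon is buying Whole Foods"),
    ("2017-06-16", "Stocks fall after tech selloff"),
]

# Greedy grouping: same day + enough word overlap => same event.
groups = []
for date, h in headlines:
    for g in groups:
        if g[0][0] == date and jaccard(h, g[0][1]) >= 0.4:
            g.append((date, h))
            break
    else:
        groups.append([(date, h)])

print([[h for _, h in g] for g in groups])
```

The two <i>Whole Foods</i> headlines end up in one group (they share "Amazon", "Whole", and "Foods"), while the unrelated stocks headline forms its own group; predicate paraphrases are then generalized from the grouped sentences.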
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
Early works relying on this approach are <a href="http://nlp.cs.nyu.edu/pubs/papers/shinyama-hlt02.pdf">1</a>, <a href="https://arxiv.org/pdf/cs/0304006.pdf">2</a>, followed by some more complex methods like <a href="http://www.aclweb.org/website/old_anthology/D/D13/D13-1183.pdf">3</a>. We recently harvested such paraphrases from Twitter, assuming that tweets with links to news web sites that were published on the same day are likely to describe the same news events. If you're interested in more details, here are the <a href="http://aclweb.org/anthology/S/S17/S17-1019.pdf">paper</a>, the <a href="http://u.cs.biu.ac.il/~havivv/papers/chirps_poster.pdf">poster</a> and the <a href="https://github.com/vered1986/Chirps">resource</a>.</div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
This approach is potentially more accurate than the argument-distribution approach: the argument-distribution approach assumes that predicates that often occur with the same arguments are paraphrases, while the event-based approach considers predicates with the same arguments as paraphrases only if it believes that they discuss the same event.</div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<b>What does the future hold?</b> Neural paraphrasing methods, of course. I won't go into technical details (I feel that there are enough "neural networks for dummies" blog posts out there, and I'm by no means an expert on that topic). The idea is to build a model that reads a sequence of words and then generates a different sequence of words with the same meaning. If it sounds like inexplicable magic, it is mostly because even the researchers working on this task can at most make educated guesses about why something works well or not. In any case, if this ever ends up working well, it will be much better than the resources we have today, since it will be capable of providing paraphrases / judging the correctness of paraphrases for new texts that were never observed before.</div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<hr />
<div style="direction: ltr; text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;">
<span style="color: black;"><a href="https://www.blogger.com/null" name="1" style="font-weight: bold;">1</a><b> </b></span>Of course, given a different choice of arguments, these predicates will not be considered paraphrases. For example, <i>Mary acquired a skill</i> is not a paraphrase of <i>Mary bought a skill</i>. The discussed approaches consider predicate pairs as paraphrases if there exists an argument assignment (/context) under which these predicates are paraphrases.</span><span style="font-size: x-small;"> </span><span style="font-size: x-small;"> </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2017/08/paraphrasing.html#top1" style="font-size: small;">↩</a></sup><br />
<span style="font-size: x-small;"><span style="color: black;"><a href="https://www.blogger.com/null" name="2" style="font-weight: bold;">2</a><b> </b></span>See also <a href="http://www.aclweb.org/anthology/E17-1083">more recent work</a> on translation-based paraphrasing.</span><span style="font-size: x-small; text-align: right;"> </span><sup style="font-size: small; text-align: right;"><a href="http://veredshwartz.blogspot.co.il/2017/08/paraphrasing.html#top2" style="font-size: small;">↩</a></sup></div>
</div>
</div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com16tag:blogger.com,1999:blog-9145120678290195131.post-41337017417512194262017-03-01T16:58:00.000+02:002017-03-01T17:06:34.717+02:00Women in STEM*<div dir="rtl" style="text-align: right;" trbidi="on">
<div style="direction: ltr; text-align: left;">
This is a special post for International Women's Day (March 8th). Every year I find myself enthusiastically conveying my thoughts about the topic to the people around me, so I thought I might as well share it with a broader audience. As always, this post presents my very limited knowledge/interpretation of a broadly discussed and studied topic. However, it may be a bit off topic for this blog, so if you're only interested in computational stuff, you can focus on section 3.<br />
<b><br /></b>
<b>1. The Problem</b><br />
Even though we are half of the population, women are quite poorly represented in STEM:<br />
<br />
<table border="1">
<tbody>
<tr>
<td><i><b>USA</b>: the percentage of computing occupations held by women has been declining since 1991, when it reached a high of 36%. The current rate is 25%.</i> [2016, <a href="https://www.ncwit.org/sites/default/files/resources/womenintech_facts_fullreport_05132016.pdf">here</a>]<br />
<i><b><br /></b></i>
<i><b>OECD member countries</b>: While women account for more than half of university graduates in scientific fields in several OECD countries, they account for only 25% to 35% of researchers in most OECD countries.</i> [2006, <a href="https://www.oecd.org/sti/sci-tech/womeninscientificcareersunleashingthepotential.htm">here</a>]</td></tr>
</tbody></table>
<b><br /></b>
<b>2. The Causes (and possible solutions)</b><br />
<b><br /></b>
<b>2.1 Cognitive Differences</b><br />
There is a common conception that female abilities in math are biologically inferior to those of males. Many highly cited psychology papers prove otherwise, for example:<br />
<br />
<i>"Stereotypes that girls and women lack mathematical ability persist, despite mounting evidence of gender similarities in math achievement."</i> [<a href="http://psycnet.apa.org/journals/bul/136/1/103/">1</a>].<br />
<br />
<i>"...provides evidence that mathematical and scientific reasoning develop from a set of biologically based cognitive capacities that males and females share. These capacities lead men and women to develop equal talent for mathematics and science."</i> [<a href="http://psycnet.apa.org/journals/amp/60/9/950/">2</a>]<br />
<br /></div>
<div dir="ltr" style="text-align: left;">
<ol style="text-align: left;"></ol>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://imgs.xkcd.com/comics/how_it_works.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://imgs.xkcd.com/comics/how_it_works.png" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">From <a href="https://imgs.xkcd.com/comics/how_it_works.png">https://imgs.xkcd.com/comics/how_it_works.png</a>.</td></tr>
</tbody></table>
<br />
In addition, if cognitive differences were so prominent, there wouldn't be so many women graduating in scientific fields. It seems that the problem lies in <i>occupational gender segregation</i>, which may be explained by any one of the following:</div>
<div dir="ltr" style="text-align: left;">
<b><br /></b></div>
<div dir="ltr" style="text-align: left;">
<b>2.2 Family Life</b><br />
Here are some references from studies conducted about occupational gender segregation:</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<i>"In some math-intensive fields, women with children are penalized in promotion rates."</i> [<a href="http://psycnet.apa.org/journals/bul/135/2/218">3</a>]<br />
<ol style="text-align: left;"></ol>
<i>"[...] despite the women's movement and more efforts in society to open occupational doors to traditional male-jobs for women, concerns about balancing career and family, together with lower value for science-related domains, continue to steer young women away from occupations in traditionally male-dominated fields, where their abilities and ambitions may lie."</i> [<a href="http://www.tandfonline.com/doi/abs/10.1080/13803610600765786">4</a>]<br />
<br />
<i>"women may 'prefer' those [jobs] with flexible hours in order to allow time for childcare, and may also 'prefer' occupations which are relatively easy to interrupt for a period of time to bear or rear children."</i> [<a href="http://staging.ilo.org/public/libdoc/ilo/2001/101B09_6_engl.pdf">5</a>] (the quotation marks are explained later in the paper, indicating that this is not a personal preference but rather one shaped by learned cultural and social values).<br />
<br />
I'd like to focus the discussion now on my local point of view, the situation in Israel, since I suspect that family life is the most prominent cause of the problem here. I would be very interested in comments about what it is like in other countries.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td><a href="http://files.explosm.net/comics/Rob/rolls.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://files.explosm.net/comics/Rob/rolls.png" height="145" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">From <a href="http://explosm.net/comics/2861/">http://explosm.net/comics/2861/</a></td></tr>
</tbody></table>
<br />
According to the <a href="http://www.cbs.gov.il/reader/?MIval=cw_usr_view_Folder&ID=141">Central Bureau of Statistics</a>, in 2014, 48.9% of the workers in Israel were women (and 51.1% were men). The average salary was 7,439 NIS for women and 11,114 for men. Wait, what?... let me introduce another (crucial) factor.<br />
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
While the fertility rate has decreased in all other <a href="http://stats.oecd.org/Index.aspx?DataSetCode=FAMILY">OECD</a> member countries, in Israel it remained stable for the last decade, with an average of 3.7 children per family. On a personal note, as a married woman without children, I can tell you that it is definitely an issue, and "when are you planning to have children already?" is considered a perfectly valid question here, even from strangers (and my friends with 1 or 2 children often get "when do you plan to have the 2nd/3rd child?").</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
Paid maternity leave is 14 weeks, with the possibility (used by anyone who can afford it) to extend it by 3 more unpaid months. Officially, either parent can take the leave, but in practice, since this law was introduced in 1998, only roughly 0.4% of the parents who took it were fathers. </div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
Here is the number connecting the dots, and explaining the salary gap: in 2014, the average number of work hours per week was 45.2 for men and 36.7 for women. The culture in Israel is torn between traditional family roles (the mother as the main parent) and modern opportunities for women. Most women I know have a career in the morning, and a second job in the afternoon with the kids. With a hard constraint of leaving work before 16:00 to pick up the kids, in a demanding market like Israel's, it is much harder for a woman to get promoted. This makes the high-tech industry, in which working hours are known to be long, a male-dominated environment. Indeed, in 2015, only 36.2% of the high-tech workers in Israel were women.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
This situation is doubly troubling: on the one hand, it is difficult for women who do choose demanding careers. They have to juggle home and work in a way that men are never required to. On the other hand, girls are steered from childhood toward traditionally feminine occupations that are less demanding in working hours. </div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
Don't get me wrong, I'm not here to judge. Being a feminist doesn't entail that the woman must have a career while the man has to stay at home with the children. Each couple can decide on their division of labor as they wish. It's the social expectations and cultural bias that I'm against. I've seen this happening time after time: the man and the woman both study and build up their careers, they live in equality, and then the birth of their first child, and specifically maternity leave, is the slippery slope after which equality is a fantasy. </div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<div style="text-align: left;">
To make a long story short, I think it is not women the market is against, but mothers. When I say "against" I include allegedly good ideas such as allowing a mother to leave work at 16:00. While I'm not against leaving work at 16:00 (modern slavery is a topic for another discussion...), I don't see why this "privilege" should be reserved only for mothers. In my humble opinion, it would benefit mothers, fathers, children and the market if men and women could each get 3 days a week on which they leave work as "early" as 16:00. It wouldn't hurt if both men and women had the right to take parental leave together, developing their parenthood as a shared job. This situation will never change unless the market overcomes outdated social norms and stops treating parenthood as a job for women.<br />
<b><br /></b>
<b>2.3 Male-dominated Working Environments </b><br />
Following the previous point, tech workplaces (everywhere) are dominated by men, so even women who choose to work in this industry might feel uncomfortable in their workplaces. Luckily, I can't attest to this from my own experience: I've never been treated differently as a woman, and have never felt threatened or uncomfortable in situations in which I was the only woman. This <a href="https://www.ft.com/content/51c81a8c-69d9-11e6-a0b1-d87a9fea034f">article</a> exemplifies some of the things that other women have experienced:<br />
<br />
<i>"Many [women] will say that their voice is not heard, they are interrupted or ignored in meetings; that much work takes place on the golf course, at football matches and other male-dominated events; that progress is not based on merit and women have to do better than men to succeed, and that questions are raised in selection processes about whether a woman 'is tough enough'."</i><br />
<div>
<br /></div>
</div>
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://www.phdcomics.com/comics/archive/phd081604s.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://www.phdcomics.com/comics/archive/phd081604s.gif" height="172" style="cursor: move;" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">From <a href="http://www.phdcomics.com/comics/archive.php?comicid=490">http://www.phdcomics.com/comics/archive.php?comicid=490</a></td></tr>
</tbody></table>
<ol style="text-align: left;"></ol>
I've only become aware of these problems recently, which I guess is both a good sign (it might not be too common, or at least not all women experience it) and a bad sign (many women still suffer from it and there's not enough awareness). This <a href="http://m-mitchell.com/gender/papers/5Aug2016.pdf">interesting essay written by Margaret Mitchell</a> suggests some practical steps to make women feel more comfortable in their workplaces.<br />
<br />
Of course, things get much worse when you consider sexual harassment in workplaces. I know that awareness of the subject is very high today, that an employer's duty to prevent sexual harassment is statutory in many countries, and that many big companies require new employees to undergo sexual harassment prevention training. While this surely mitigates the problem, it is still too common, with <a href="https://www.susanjfowler.com/blog/2017/2/19/reflecting-on-one-very-strange-year-at-uber">a disturbing story just from the last week</a> (and many other stories untold). As with every other law, there will always be people breaking it, but it is the employer's duty to investigate any reported case and handle it, even at the cost of losing a valuable worker.<br />
<div>
<br /></div>
<b>2.4 Gender Stereotypes </b><br />
Stereotypes persist simply because it is so difficult to change perceptions: even if some of the reasons why women were previously less likely to work in these industries are no longer relevant, girls will still be steered away from these fields as long as they are considered unsuitable for them.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://scontent-cdg2-1.xx.fbcdn.net/v/t1.0-9/14721569_1320697071283268_7090892967047125448_n.jpg?oh=9a2dfff11b58d9ca10b19ae3dd1df8d0&oe=593365A9" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://scontent-cdg2-1.xx.fbcdn.net/v/t1.0-9/14721569_1320697071283268_7090892967047125448_n.jpg?oh=9a2dfff11b58d9ca10b19ae3dd1df8d0&oe=593365A9" width="211" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">From <a href="https://www.facebook.com/DoodleTimeSarah/">https://www.facebook.com/DoodleTimeSarah/</a></td></tr>
</tbody></table>
<br />
An interesting illustration was provided in this <a href="https://www.google.co.il/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwjsop_uzKHSAhUG2CwKHaFCCjgQFggYMAA&url=https%3A%2F%2Frepository.wlu.edu%2Fbitstream%2Fhandle%2F11021%2F16290%2FCoyle_theses_2010.pdf%3Fsequence%3D5&usg=AFQjCNFpgXa1BiKxnB5VpnDt-kGiIowC-g&sig2=bcVHujXy3TzC4_MOnq-uyA">work</a>, where 26 girls (around 4 years old) were shown different Barbie dolls and asked whether they believed women could do masculine jobs. When the Barbie dolls were dressed in "regular" outfits, many of them replied negatively, but after being shown a Barbie dressed up in a masculine outfit (firefighter, astronaut, etc.), the girls believed that they too could do non-stereotypical jobs.<br />
<br />
This is the vicious circle that people are trying to break by encouraging young girls to study scientific subjects and supporting women already working in these fields. Specifically, by organizing women-only conferences, offering scholarships for women, and making sure that there is a female representative in any professional group (e.g. panel, committee, etc.). While I understand the rationale behind changing the gender distribution, I often feel uncomfortable with these solutions. I'll give an example.<br />
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
Let's say I submitted a paper to the main conference in my field, and that paper was rejected. Then somebody tells me "there's a women-only workshop, why don't you submit your paper there?". If I submit my paper there and it gets accepted, how can I overcome the feeling of "my paper wasn't good enough for a men's conference, but for a woman's paper it was sufficient"?</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
For the same reason, I'm uncomfortable with affirmative action. If I'm a woman applying for a job somewhere and I find out that they prefer women, I might assume that there was a man who was more talented/adequate than me but they settled for me because I was a woman. If that's true, it is also unfair for that man. In general, I want my work to be judged solely based on its quality, preferably without taking gender into consideration, for better and for worse.<br />
<br />
I know I'm presenting a naive approach and that in practice, gender plays a role, even if subconsciously. I also don't really have a better solution for that, but I do hope that if we take care of all the other reasons I discussed, this distribution will eventually change naturally. </div>
</div>
<div dir="ltr" style="text-align: left;">
<div style="text-align: left;">
<script async="" charset="utf-8" src="//platform.twitter.com/widgets.js"></script>
</div>
</div>
<div dir="ltr" style="text-align: left;">
<b style="text-align: left;"><br /></b></div>
<div dir="ltr" style="text-align: left;">
<b style="text-align: left;">3. Statistics and Bias</b></div>
<div dir="ltr" style="text-align: left;">
</div>
<div style="direction: ltr; text-align: left;">
<div style="text-align: left;">
Last year there was an interesting paper [<a href="https://arxiv.org/abs/1606.06121">6</a>], followed by a lengthy discussion, about gender stereotypes in word embeddings. <a href="https://veredshwartz.blogspot.co.il/2016/01/representing-words.html">Word embeddings</a> are trained with the objective of capturing meaning through co-occurrence statistics. In other words, words that often occur next to the same neighboring words in a text corpus are optimized to be close together in the vector space. Word embeddings have proved to be extremely useful for many downstream NLP applications.<br />
<br />
The problem that this paper presented was that these word embeddings also capture "bad" statistics, for example gender stereotypes with regard to professions. For instance, word embeddings have a nice property of capturing analogies like <i>"man:king :: woman:queen"</i>, but these analogies also include gender stereotypes like <i>"father:doctor :: mother:nurse", "man:computer programmer :: woman:homemaker"</i>, and <i>"he:she :: pilot:flight attendant"</i>.<br />
<br />
Why this happens is pretty obvious: word embeddings are not trained to capture "truth", only statistics. If most nurses are women, then <i>nurse</i> will occur in the corpus next to the same context words as feminine words, resulting in a higher similarity between <i>nurse </i>and <i>woman </i>than between <i>nurse </i>and <i>man</i>. In other words, if the input corpus reflects the stereotypes and biases of society, so will the word embeddings.<br />
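To make this concrete, the analogy arithmetic can be sketched with tiny hand-crafted vectors. This is an illustrative toy of my own (two made-up dimensions loosely standing in for "royalty" and "gender"), not real trained embeddings, but the vector arithmetic and nearest-neighbor lookup are the same operations used with word2vec-style vectors:

```python
import numpy as np

# Toy 2-d "embeddings"; the dimensions loosely encode (royalty, gender).
# Real embeddings have hundreds of dimensions learned from co-occurrence counts.
vocab = {
    "man":      np.array([0.0,  1.0]),
    "woman":    np.array([0.0, -1.0]),
    "king":     np.array([1.0,  1.0]),
    "queen":    np.array([1.0, -1.0]),
    "prince":   np.array([0.8,  1.0]),
    "princess": np.array([0.8, -1.0]),
}

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    """Solve a:b :: c:? by vector arithmetic (b - a + c), then return
    the nearest neighbor, excluding the three query words as is standard."""
    target = vocab[b] - vocab[a] + vocab[c]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("man", "king", "woman"))     # queen
print(analogy("king", "queen", "prince"))  # princess
```

With real embeddings, the exact same arithmetic is what surfaces the stereotyped analogies above: the "gender direction" of the space carries both legitimate distinctions (king/queen) and learned societal biases (doctor/nurse).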
<br />
So why is this a problem, anyway? Don't we want word embeddings to capture the statistics of the real world, even the kind of statistics we don't like? If something should be bothering us, it is the bias in society, rather than the bias these word embeddings merely capture. Or in other words:<br />
<br />
<br />
<div style="direction: ltr; orphans: 2; text-align: left; text-indent: 0px; widows: 2;">
<blockquote class="twitter-tweet" data-lang="en" style="-webkit-text-stroke-width: 0px; color: black; font-family: "Times New Roman"; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
<div dir="ltr" lang="en">
<div style="margin: 0px;">
What's this deal with de-biasing embeddings? Shouldn't we rather let embeddings as is and aim at de-biasing society instead?</div>
</div>
<div style="margin: 0px;">
â Angeliki Lazaridou (@aggielaz) <a href="https://twitter.com/aggielaz/status/781872134490644480">September 30, 2016</a></div>
</blockquote>
<br />
I like this tweet because I was wondering just the same when I first heard about this work. The key concern about bias in word embeddings is that these vectors are commonly used in applications, and this might inadvertently amplify unwanted stereotypes. The example in the paper mentions web search aided by word embeddings. The scenario described is of an employer looking for an intern in computer science by searching for terms related to computer science, and the authors suggest that a LinkedIn page of a male researcher might be ranked higher in the results than that of a female researcher, since computer science terms are closer in the vector space to male names than to female names (because of the current bias). In this scenario, and in many other possible scenarios, the word embeddings are not just passively recording the gender bias, but might actively contribute to it!</div>
</div>
</div>
<div style="direction: ltr; text-align: left;">
<br />
Hal Daumé III wrote a blog post called "Language Bias and Black Sheep" about the topic, and suggested that the problem goes even deeper, since corpus co-occurrences don't always capture real-world co-occurrences, but rather statistics of things that are talked about more often:<br />
<br />
<i>"Which leads us to the "black sheep problem." We like to think that language is a reflection of underlying truth, and so if a word embedding (or whatever) is extracted from language, then it reflects some underlying truth about the world. The problem is that even in the simplest cases, this is super false."</i><br />
<br />
Prior to reading this paper (and the discussion and blog posts that followed it), I never realized that we are more than just passive observers of data; the work we do can actually help mitigate biases or inadvertently contribute to them. I think we should all keep this in mind and try to see whether our next work can have any positive or negative effect in that regard -- just like we try to avoid overfitting, cherry-picking, and annoying reviewer 2.<br />
<br />
<hr />
<b><span style="font-size: x-small;">References:</span></b><br />
<span style="font-size: x-small;">[1] <i>Cross-national patterns of gender differences in mathematics: A meta-analysis.</i> Else-Quest, Nicole M.; Hyde, Janet Shibley; Linn, Marcia C. Psychological Bulletin, Vol 136(1), Jan 2010, 103-127.</span><br />
<span style="font-size: x-small;">[2] <i>Sex Differences in Intrinsic Aptitude for Mathematics and Science?: A Critical Review</i>. Spelke, Elizabeth S. American Psychologist, Vol 60(9), Dec 2005, 950-958.</span><br />
<span style="font-size: x-small;">[3] <i>Women's underrepresentation in science: Sociocultural and biological considerations.</i> Ceci, Stephen J.; Williams, Wendy M.; Barnett, Susan M. Psychological Bulletin, Vol 135(2), Mar 2009, 218-261. </span><br />
<span style="font-size: x-small;">[4] <i>Why don't they want a male-dominated job? An investigation of young women who changed their occupational aspirations.</i> Pamela M. Frome, Corinne J. Alfeld, Jacquelynne S. Eccles, and Bonnie L. Barber. Educational Research And Evaluation Vol. 12 , Iss. 4,2006</span><br />
<span style="font-size: x-small;">[5] <i>Women, Gender and Work: What Is Equality and How Do We Get There?</i> Loutfi, Martha Fetherolf. International Labour Office, 1828 L. Street, NW, Washington, DC 20036, 2001.</span><br />
<span style="font-size: x-small;">[6] <i>Quantifying and Reducing Stereotypes in Word Embeddings.</i> Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016 ICML Workshop on #Data4Good: Machine Learning in Social Good Applications.</span><br />
<br /></div>
<div style="direction: ltr; text-align: left;">
<hr />
<span style="font-size: x-small;"><sup>*</sup><a href="https://en.wikipedia.org/wiki/Science,_technology,_engineering,_and_mathematics">STEM</a> = science, technology, engineering and mathematics</span></div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com6tag:blogger.com,1999:blog-9145120678290195131.post-54917025843604946962016-11-23T17:31:00.000+02:002016-11-23T17:31:15.820+02:00Antonymy<div dir="rtl" style="text-align: right;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/cKUvKE3bQlY/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/cKUvKE3bQlY?feature=player_embedded" width="320"></iframe></div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
In the Seinfeld episode "The Opposite", George says that his life is the opposite of everything he wanted it to be, and that every instinct he has is wrong. He decides to go against his instincts and do the opposite of everything. When the waitress asks whether to bring him his usual order, "tuna on toast, coleslaw, and a cup of coffee", he decides to have the opposite: "Chicken salad, on rye, untoasted. With a side of potato salad. And a cup of tea!". Jerry argues with him about what the opposite of tuna is, which according to him is salmon. So which one of them is right? If you ask me, neither salmon nor chicken salad is the opposite of tuna. There is no opposite of tuna. But this funny video demonstrates one of the biggest problems in the task of automatically detecting antonyms: even we humans are terrible at it!</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<b>It's a Bird, It's a Plane, It's Superman (not antonyms)</b></div>
<div dir="ltr" style="text-align: left;">
Many people would categorize a pair of words as opposites if they represent two mutually exclusive options/entities in the world, like <i>male</i> and <i>female</i>, <i>black</i> and <i>white</i>, and <i>tuna</i> and <i>salmon</i>. The intuition is clear when these two words <i>x </i>and <i>y</i> represent the <b>only</b> two options in the world. In set theory, it means that <i>y</i> is the negation/complement of <i>x</i>. In other words, everything in the world which is not <i>x</i> must be <i>y</i> (figure 1).</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicJpZCy16UnxiDie5lzuhe9En6L1HAPYntkilvOrMBeK6QvVf4Yn7jw_5GxjVVLOsj8kiQZ1mpRv0trs7EBhpdQRO8uFGi6UYj4iAnNxIJTu5FIErh_SC0jaA_mRRcVLC10KzqfQ7HvC4/s1600/negation.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicJpZCy16UnxiDie5lzuhe9En6L1HAPYntkilvOrMBeK6QvVf4Yn7jw_5GxjVVLOsj8kiQZ1mpRv0trs7EBhpdQRO8uFGi6UYj4iAnNxIJTu5FIErh_SC0jaA_mRRcVLC10KzqfQ7HvC4/s1600/negation.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1: <i>x</i> and <i>y</i> are the only options in the world U</td></tr>
</tbody></table>
<br />
<div dir="ltr" style="text-align: left;">
In this sense, <i>tuna</i> and <i>salmon</i> are not antonyms - they are actually more accurately defined as co-hyponyms: two words that share a common hypernym (<i>fish</i>). They are indeed mutually exclusive, as one cannot be both a <i>tuna </i>and a <i>salmon</i>. However, if you are not a <i>tuna</i>, you are not necessarily a <i>salmon</i>. You can be another type of fish (<i>mackerel</i>, <i>cod</i>...) or something else which is not a fish at all (e.g. <i>person</i>). See figure 2 for a set theory illustration. </div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_NPH_sYvVmgfNIDTVoTs52N88unwmUMNVBuWq8metIh6BT4MH6tW5OT9E-FfG7D-gm2Jz1ilklfeRLQ2TeuML3kVKW-ghrVH5YCR5dRcW1ZFFvuw4NEk8UqXdAp2Am1dWgkM6AeGzAgw/s1600/cohyponyms.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_NPH_sYvVmgfNIDTVoTs52N88unwmUMNVBuWq8metIh6BT4MH6tW5OT9E-FfG7D-gm2Jz1ilklfeRLQ2TeuML3kVKW-ghrVH5YCR5dRcW1ZFFvuw4NEk8UqXdAp2Am1dWgkM6AeGzAgw/s1600/cohyponyms.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 12.8px;">Figure 2: <i>salmon</i> and <i>tuna</i> are mutually exclusive, but not the only options in the world</td></tr>
</tbody></table>
<br />
</td></tr>
</tbody></table>
<div dir="ltr" style="text-align: left;">
Similarly, George probably had in mind that <i>tuna </i> and <i>chicken salad</i> are mutually exclusive options for sandwich fillings. He was probably right; a tuna-chicken salad sandwich sounds awful. But since there are other options for sandwich fillings (peanut butter, jelly, peanut butter and jelly...), these two can hardly be considered as antonyms, even if we define antonyms as complements within a <b>restricted</b> set of entities in the world (e.g. fish, sandwich fillings). I suggest the "it's a bird, it's a plane, it's superman" binary test for antonymy: if you have more than two options, it's not antonymy!</div>
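The test can even be written down as a tiny predicate. This is my own illustrative sketch, and it assumes we can enumerate the relevant universe of options up front, which is exactly the hard part in practice:

```python
def passes_binary_test(x, y, options):
    """x and y can be complementary antonyms only if they are
    the *only* two options in the (restricted) universe."""
    return {x, y} == set(options)

# Tuna and chicken salad are mutually exclusive sandwich fillings,
# but not the only ones, so they fail the test (co-hyponyms, not antonyms).
fillings = {"tuna", "chicken salad", "peanut butter", "jelly"}
print(passes_binary_test("tuna", "chicken salad", fillings))  # False

# Restricted to a two-option universe, the test passes.
print(passes_binary_test("day", "night", {"day", "night"}))   # True
```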
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<b>Wanted Dead or Alive (complementary antonyms)</b></div>
<div dir="ltr" style="text-align: left;">
What about <i>black</i> and <i>white</i>? These are two colors out of a wide range of colors in the world, failing the bird-plane-Superman test. However, if we narrow our world down to people's skin colors, these two may be considered as antonyms.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
Other examples for complementary antonyms are <i>day</i> and <i>night</i>, <i>republicans</i> and <i>democrats</i>, <i>dead</i> and <i>alive</i>, <i>true</i> and <i>false</i>, <i>stay</i> and <i>go</i>. As you may have noticed, they can be of different parts of speech (noun, adjective, verb), but the two words within each pair both share the same part of speech (comment if you can think of a negative example!).</div>
<div dir="ltr" style="text-align: left;">
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjV9Dd4adV88Ug0AOu9E4_56zRvAZdP74lHnwidSlX4EKmjevKlYNyu9Qcybg6iTjwgwYIUuKvmZoSO-XZZuEefguW_6S9zp2QIrtC9UpVXxNDgAsDfZ8wBbLq8RwRS5F0M94mPdb4ZrSQ/s1600/go_or_stay.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjV9Dd4adV88Ug0AOu9E4_56zRvAZdP74lHnwidSlX4EKmjevKlYNyu9Qcybg6iTjwgwYIUuKvmZoSO-XZZuEefguW_6S9zp2QIrtC9UpVXxNDgAsDfZ8wBbLq8RwRS5F0M94mPdb4ZrSQ/s1600/go_or_stay.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 3: Should I stay or should I go now?</td></tr>
</tbody></table>
<br />
So are we cool with complementary antonyms? Well, not quite. If you say that <i>female</i> and <i>male</i> are complementary antonyms, people might tell you that gender is not binary, but a spectrum. Some of these antonym pairs actually have other, uncommon or hidden options, like <i>in a coma</i> for the <i>dead</i> and <i>alive</i> pair, or <i>libertarians</i> in addition to <i>republicans</i> and <i>democrats</i>. Still, these pairs are commonly considered antonyms, since there are two <b>main </b>options.<br />
<br />
So what have we learned about complementary antonyms? That they are borderline, they depend on the context in which they occur, and they might be offensive to minorities. Use them with caution.<br />
<br />
<b>The Good, the Bad [and the Ugly?] (graded antonyms)</b><br />
Even the strictest definition of antonymy includes pairs of gradable adjectives representing the two ends of a scale. Some examples are <i>hot </i>and <i>cold</i>, <i>fat</i> and <i>skinny</i>, <i>young</i> and <i>old</i>, <i>tall</i> and <i>short</i>, <i>happy</i> and <i>sad</i>. Set theory and my binary test aren't suitable for these types of antonyms.<br />
<br />
Set theory isn't adequate because a gradable adjective can't be represented as a set, e.g. "the set of all tall people in the world". The definition of a graded adjective changes depending on the context and is very subjective. For example, I'm relatively short, so everyone looks tall to me, while my husband is much taller than me, so he is more likely to say someone is short. The set of tall people in the world changes according to the person who defines it.<br />
<br />
In addition, by definition, testing for binarism fails. A cup of coffee can be more than just <i>hot </i>or <i>cold</i>. It can be <i>boiling</i>, <i>very hot</i>, <i>hot</i>, <i>warm</i>, <i>cool</i>, <i>cold</i> or <i>freezing</i>. And we can add more and more discrete options to the scale of coffee temperature.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYdgyp9N3P7lTu0kXniJZz9poYwjvh5_b7uYV2vYvKvTNx0Pe1E_-YP77cp152TnLISiplHFRsDYNLIy3kKtrFsa5HiGn8nnp2JMaykZ5Ubef0cC0ixz1MOIiGEhz1-J7sxLjGpYViv0A/s1600/hot_and_cold.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYdgyp9N3P7lTu0kXniJZz9poYwjvh5_b7uYV2vYvKvTNx0Pe1E_-YP77cp152TnLISiplHFRsDYNLIy3kKtrFsa5HiGn8nnp2JMaykZ5Ubef0cC0ixz1MOIiGEhz1-J7sxLjGpYViv0A/s320/hot_and_cold.jpg" width="320" /></a></div>
<br />
What makes specific pairs of gradable adjectives into antonyms? While the definition requires that they be at the two ends of the scale, intuitively I would say that they only need to be symmetric on the scale, e.g. <i>hot</i> and <i>cold</i>, <i>boiling</i> and <i>freezing</i>, <i>warm</i> and <i>cool</i>, but not <i>hot</i> and <i>freezing</i>.<br />
<br />
<b>Antonymy in NLP</b><br />
While there is a vast linguistics literature about antonyms, I'm less familiar with it, and I'm going to focus on some observations and interesting points about antonymy that appear in NLP papers that I read.<br />
<br />
The natural logic formulation ([1]) makes a distinction between "alternation" - words that are mutually exclusive - and "negation" - words that are both mutually exclusive and cover all the options in the world. While I basically claimed in this post that the former is not antonymy, we've seen that in some cases, if the two words represent the two main options, they may be considered antonyms.<br />
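The alternation/negation distinction can be illustrated with toy set denotations (the domain, entities, and words here are invented for illustration, not taken from [1]):

```python
# Toy denotations: each word picks out a set of entities in a tiny domain.
domain = {"rex", "fido", "felix", "tweety"}
dogs = {"rex", "fido"}
cats = {"felix"}
non_dogs = {"felix", "tweety"}

def alternation(x, y):
    """Mutually exclusive: no entity belongs to both sets."""
    return not (x & y)

def negation(x, y, dom):
    """Mutually exclusive AND jointly exhaustive over the domain."""
    return alternation(x, y) and (x | y) == dom

alternation(dogs, cats)           # True: nothing is both a dog and a cat
negation(dogs, cats, domain)      # False: tweety is neither
negation(dogs, non_dogs, domain)  # True: "dog"/"non-dog" covers everything
```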
<br />
However, people tend to disagree on these borderline word pairs, so sometimes it's easier to conflate them under a looser definition. For example, [2] ran an annotation task in which they asked crowdsourcing workers to choose the semantic relation that holds for a pair of terms. They followed the natural logic relations, but decided to merge "alternation" and "negation" into a weaker notion of "antonyms".<br />
<br />
More interesting observations about antonyms, and references to linguistic papers, can be found in [3], [4], and [5].<br />
<br />
Now that we've established that humans find it difficult to decide whether two words are antonyms, you must be wondering whether automatic methods can do reasonably well on this task. There has been a lot of work on antonymy identification (see the papers in the references, and their related work sections). I will focus on my own limited experience with antonyms. We've just published a new paper ([6]) in which we analyze the roles of the two main information sources used for automatic identification of semantic relations. The task is defined as follows: given a pair of words <i>(x, y)</i>, determine which semantic relation holds between them, if any (e.g. synonymy, hypernymy, antonymy, etc.). As in <a href="https://veredshwartz.blogspot.co.il/2016/05/improving-hypernymy-detection.html">this post</a>, we've used information from <i>x</i> and <i>y</i>'s <i>joint</i> occurrences in a large text corpus, as well as information about the <i>separate</i> occurrences of each of <i>x</i> and <i>y</i>. We found that among all the semantic relations we tested, antonymy was almost the hardest to identify (only synonymy was harder).<br />
<br />
The use of information about separate occurrences of <i>x</i> and <i>y</i> is based on the <a href="https://en.wikipedia.org/wiki/Distributional_semantics#Distributional_hypothesis">distributional hypothesis</a>, which I've mentioned several times in this blog. Basically, the distribution of a word <i>x</i>'s neighboring words may tell us something about the meaning of <i>x</i>. If we'd like to know what the relation between <i>x</i> and <i>y</i> is, we can compute something on top of the neighbor distributions of the two words. For example, we can expect the distributions of <i>x</i> and <i>y</i> to be similar if <i>x</i> and <i>y</i> are antonyms, since one of the properties of antonyms is that they are interchangeable (a word can be replaced with its antonym and the sentence will remain grammatical and meaningful). Think about replacing <i>tall</i> with <i>short</i>, or <i>day</i> with <i>night</i>. The problem is that the same is true for synonyms - you can expect <i>high</i> and <i>tall</i> to also appear with similar neighboring words. So basing the classification on distributional information may lead to confusing antonyms with synonyms.<br />
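As a minimal sketch of this idea (the two-sentence corpus is invented), we can count each word's neighbors and compare the resulting count vectors with cosine similarity. Note how the antonyms <i>tall</i> and <i>short</i> end up with identical neighbor distributions here, which is exactly the confusion described above:

```python
import math
from collections import Counter

def neighbor_counts(word, corpus, window=2):
    """Count the words co-occurring with `word` within a +/-window context."""
    counts = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == word:
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j != i:
                        counts[sent[j]] += 1
    return counts

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# A tiny invented corpus: the antonyms share all their neighbors.
corpus = [["the", "tall", "man", "walked"],
          ["the", "short", "man", "walked"]]
cosine(neighbor_counts("tall", corpus), neighbor_counts("short", corpus))  # 1.0
```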
<br />
The joint occurrences may help identify the relation that holds between the words in a pair, as some patterns indicate a certain semantic relation - for instance, "<i>x</i> is a type of <i>y</i>" may indicate that <i>y</i> is a hypernym of <i>x</i>. The problem is that patterns that are indicative of antonymy, such as "either <i>x</i> or <i>y</i>" (either <i>cold</i> or <i>hot</i>) and "<i>x</i> and <i>y</i>" (<i>day</i> and <i>night</i>), may also be indicative of co-hyponymy (either <i>tuna</i> or <i>chicken salad</i>). In any case, this confusion is far less harmful than confusing antonyms with synonyms; in some applications it may suffice to know that <i>x</i> and <i>y</i> are mutually exclusive, regardless of whether they are antonyms or co-hyponyms. For instance, when you query a search engine, you'd like it to retrieve results including synonyms of your search query (e.g. returning <i>New York City subway map</i> when you search for <i>NYC subway map</i>), but you wouldn't want it to include mutually exclusive words (e.g. <i>Tokyo subway map</i>).<br />
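A minimal sketch of such pattern-based evidence (the pattern inventory and relation labels are invented for illustration, and far from exhaustive):

```python
import re

# Hypothetical indicative patterns per relation. Note that the "either x or y"
# pattern cannot distinguish antonymy from co-hyponymy.
PATTERNS = {
    "hypernymy": [r"{x} is a type of {y}"],
    "antonymy/co-hyponymy": [r"either {x} or {y}", r"{x} and {y}"],
}

def match_patterns(x, y, sentence):
    """Return the relations whose patterns fire for (x, y) in the sentence."""
    hits = []
    for rel, pats in PATTERNS.items():
        for p in pats:
            if re.search(p.format(x=re.escape(x), y=re.escape(y)), sentence):
                hits.append(rel)
    return hits

match_patterns("cold", "hot", "the coffee was either cold or hot")
# ['antonymy/co-hyponymy'] -- the same pattern would fire for co-hyponyms too
```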
<br />
One last thing to remember is that these automatic methods are trained and tested on data collected from humans. If we can't agree on what's considered antonymy, we can't expect these automatic methods to do any better than we do.</div>
<hr />
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<b><span style="font-size: x-small;">References</span></b></div>
<div dir="ltr" style="text-align: left;">
<span style="font-size: x-small;"><br /></span></div>
<div dir="ltr" style="text-align: left;">
<span style="font-size: x-small;">[1] <i>Natural Logic for Textual Inference.</i> Bill MacCartney and Christopher D. Manning. RTE 2007.</span><br />
<span style="font-size: x-small;">[2] <i>Adding Semantics to Data-Driven Paraphrasing.</i> Ellie Pavlick, Johan Bos, Malvina Nissim, Charley Beller, Benjamin Van Durme, and Chris Callison-Burch. ACL 2015.</span></div>
<div dir="ltr" style="text-align: left;">
<span style="font-size: x-small;">[3] <i>Computing Word-Pair Antonymy.</i> Saif Mohammad, Bonnie Dorr and Graeme Hirst. EMNLP 2008.</span><br />
<span style="font-size: x-small;">[4] <i>Computing Lexical Contrast.</i> Saif Mohammad, Bonnie Dorr, Graeme Hirst, and Peter Turney. CL 2013.</span><br />
<span style="font-size: x-small;">[5] <i>Taking Antonymy Mask off in Vector Space.</i> Enrico Santus, Qin Lu, Alessandro Lenci, Chu-Ren Huang. PACLIC 2014.</span><br />
<span style="font-size: x-small;">[6] <i>Path-based vs. Distributional Information in Recognizing Lexical Semantic Relations.</i> Vered Shwartz and Ido Dagan. CogALex 2016.</span></div>
<div dir="ltr" style="text-align: left;">
<br /></div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com4tag:blogger.com,1999:blog-9145120678290195131.post-77923649946837143942016-11-12T19:26:00.001+02:002016-11-23T00:07:48.955+02:00Question Answering<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: left;">
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">In my <a href="https://veredshwartz.blogspot.co.il/2015/07/natural-language-processing.html">introductory post about NLP</a> I introduced the following </span><span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">survey question: when you search for something in Google (or any other search engine of your preference), is your query:</span></span><br />
<span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">(1) a full question, such as "What is the height of Mount Everest?"</span></span><br />
<span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">(2) composed of keywords, such as "height Everest"</span></span><br />
<span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;"><br /></span></span>
<span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">I never published the results since, as I suspected, there were too few answers to the survey, and they were probably not representative of the entire population. However, my intuition back then was that mostly older people search with a full grammatical question, while more tech-savvy people use keywords. Since then, my intuition has been somewhat supported by (a) <a href="https://www.theguardian.com/uk-news/2016/jun/16/grandmother-nan-google-praises-search-thank-you-manners-polite">this lovely grandma</a> who added "please" and "thank you" to her search queries, and (b) <a href="https://www.aclweb.org/anthology/N/N16/N16-1081.pdf">this paper from Yahoo Research</a> showing that search queries with question intent do not form fully syntactic sentences, but are made of segments (e.g. [height] [Mount Everest]). </span></span><br />
<span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;"><br /></span></span>
<span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">Having said that, searching the web to get an answer to a question is not quite the same as actually asking the question and getting a precise answer:</span></span><br />
<span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;"><br /></span></span>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><i>Here's the weird thing about search engines. It was like striking oil in a world that hadn't invented internal combustion. Too much raw material. Nobody knew what to do with it. </i></span><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br />- <a href="https://en.wikipedia.org/wiki/Ex_Machina_(film)">Ex Machina</a></span><br />
<div>
<br /></div>
<span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">It's not enough to formulate your question in a way that the search engine will have any chance of retrieving relevant results. Now you need to process the returned documents and search for the answer. </span></span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhywY7bIOkelHmnj_l7evGKX3_padN2_CIxaXjccn_aAUyI0_fKYzaqyMeMLGr8TTwBei7BXqslMSUovIbYTKap2dPoPaim6Q3KOmz0xbljZALSnbaDklnH76y4Y58N2NN-TrlPpCyUBn0/s1600/lmgtfy.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><span style="color: black;"><img border="0" height="186" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhywY7bIOkelHmnj_l7evGKX3_padN2_CIxaXjccn_aAUyI0_fKYzaqyMeMLGr8TTwBei7BXqslMSUovIbYTKap2dPoPaim6Q3KOmz0xbljZALSnbaDklnH76y4Y58N2NN-TrlPpCyUBn0/s320/lmgtfy.PNG" width="320" /></span></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: xx-small;"><span style="font-family: "roboto" , "arial" , sans-serif; text-align: left;">Getting an answer to a question by querying a search engine is not trivial; I guess this is the reason so many people ask questions in social networks, and some other people insult them with </span><a href="http://lmgtfy.com/" style="font-family: roboto, arial, sans-serif; text-align: left;">Let me Google that for you</a><span style="font-family: "roboto" , "arial" , sans-serif; text-align: left;">. </span></span></td></tr>
</tbody></table>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">The good news is that there are question answering systems, designed to do exactly that: automatically answer a question given as input; the bad news is that like most semantic applications in NLP, it is an extremely difficult task, with limited success. </span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">Question answering systems have been around since the 1960s. Originally, they were developed to support natural language </span><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">queries to</span><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"> databases, before web search was available. Later, question answering systems were able to find and extract answers from free text.</span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><span style="font-size: 13px;">A successful example of a question answering system is </span><a href="https://en.wikipedia.org/wiki/Watson_(computer)" style="font-size: 13px;">IBM Watson</a><span style="font-size: 13px;">. Today Watson is <a href="http://www.ibm.com/watson/">described</a> by IBM as "a cognitive technology that can think like a human", and is used in many of IBM's projects, not just for question answering. Originally, it was trained to answer natural language questions -- or more precisely, to form the correct question to a given answer, as in the </span></span><span style="font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">television game show <a href="https://en.wikipedia.org/wiki/Jeopardy!">Jeopardy</a>. In </span></span><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">February 2011, Watson <a href="https://youtu.be/YgYSv2KSyWg">competed in </a></span><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><a href="https://youtu.be/YgYSv2KSyWg">Jeopardy</a> against </span><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">former winners of the show, and won! It had access to millions of web pages, including Wikipedia, which were processed and saved before the game. During the game, it wasn't connected to the internet (so it couldn't use a search engine, for example). The Jeopardy video is pretty cool, but if you have no patience to watch it all (I understand you...), here's a highlight:</span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<br />
<div style="background-color: white; font-family: arial, helvetica, sans-serif; font-size: 13px; text-align: justify;">
HOST: This trusted friend was the first non-dairy powdered creamer. Watson?</div>
<div style="background-color: white; font-family: arial, helvetica, sans-serif; font-size: 13px; text-align: justify;">
WATSON: What is milk?</div>
<div style="background-color: white; font-family: arial, helvetica, sans-serif; font-size: 13px; text-align: justify;">
HOST: No! That wasn't wrong, that was really wrong, Watson.</div>
<br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: 13px;">Another example is the personal assistants: Apple's <a href="http://www.apple.com/ios/siri/">Siri</a>, Amazon's <a href="http://alexa.amazon.com/spa/index.html">Alexa</a>, Microsoft's <a href="https://support.microsoft.com/en-us/help/17214/windows-10-what-is">Cortana</a>, and <a href="https://assistant.google.com/">Google Assistant</a>. They are capable of answering an impressively wide range of questions, but it seems they are often manually designed to <a href="http://www.cheatsheet.com/gear-style/20-questions-to-ask-siri-for-a-hilarious-response.html/?a=viewall">answer specific questions</a>.</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVZoADZpYCW6KOaCxjXw5CYSXtx86QtI5efVmJ5P7y_Ix63FzC6sK3NQf9XH0SHBOI3SmxUZd6fnFUl5xKgClbU6UD7ipDTeQN5xPn-nYRaOLJR8cQq2fAKhi2Rmhkmdq-7tqdevNXwAs/s1600/qa_flow.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><span style="color: black;"><br /></span></a></div>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">So how does question answering work? I assume that each question answering system employs a somewhat different architecture, and some of the successful ones are proprietary. I'd like to present two approaches. The first is a general architecture for question answering from the web, and the second is question answering from knowledge bases.</span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;"><b><u>Question answering from the web</u></b></span></span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">I'm following a project report I submitted to a course 3 years ago, in which I exemplified this process on the question <i>"When was Mozart born?"</i>. This example was originally taken from some other paper, which is hard to trace now. Apparently, it is a <a href="https://scholar.google.co.il/scholar?hl=en&q=%22when+was+mozart+born%22&btnG=&as_sdt=1%2C5&as_sdtp=">popular example</a> in this field.</span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">The system performs the following steps:</span><br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj88jfiWz1JSHju-x3ROdDdkKVxKs6thCWF-uuR2HjjkaT7Hx2_iVDsrLccGn3k5jPqT188rlkJsa6NRadWG77YEUxNQS3f-s0VZYz6PE360mdhWcUtEkbb2ETBfkEvUzEJKBJ3TNXLdf4/s1600/qa_flow.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><span style="color: black;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj88jfiWz1JSHju-x3ROdDdkKVxKs6thCWF-uuR2HjjkaT7Hx2_iVDsrLccGn3k5jPqT188rlkJsa6NRadWG77YEUxNQS3f-s0VZYz6PE360mdhWcUtEkbb2ETBfkEvUzEJKBJ3TNXLdf4/s640/qa_flow.png" width="310" /></span></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A possible architecture for a question answering system. </td></tr>
</tbody></table>
<ul style="text-align: left;">
<li><b style="font-family: roboto, arial, sans-serif; font-size: 13px;">Question analysis</b><b style="font-family: roboto, arial, sans-serif; font-size: 13px;"> - </b><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">parse the natural language question, and extract some properties:</span></li>
<br />
<ul style="font-family: roboto, arial, sans-serif; font-size: 13px;">
<li><u>Question type</u> - mostly, QA systems support factoid questions (a question whose answer is a fact, as in the given example). Other types of questions, e.g. opinion questions, will be discarded at this point.</li>
<br />
<li><u>Answer type</u> - what is the type of the expected answer, e.g. person, location, date (as in the given example), etc. This can be inferred with simple heuristics using the WH-question word, for example <i>who => person, where => location, when => date. </i></li>
<br />
<li><u>Question subject and object</u> - can be extracted easily by using a <a href="https://veredshwartz.blogspot.co.il/2016/06/linguistic-analysis-of-texts.html">dependency parser</a>. These can be used in the next step of building the query. In this example, the subject is <i>Mozart.</i><i><br /></i></li>
<br />
</ul>
<li><span style="font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;"><b>Search</b> - prepare the search query, and retrieve documents from the search engine. The query can be an expected answer template (which is obtained by applying some transformation to the question), e.g. <i>"Mozart was born in *"</i>. Alternatively, or in case the answer template retrieves no results, the query can consist of keywords (e.g. <i>Mozart</i>, <i>born</i>).<br /><br />Upon retrieving documents (web pages) that answer the query, the system focuses on certain passages that are more likely to contain the answer ("candidate passages"). These are usually ranked according to the number of query words they contain, their word similarity to the query/question, etc.</span></span></li>
<br />
<li><span style="font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;"><b>Answer extraction</b> - try to extract candidate answers from the candidate passages. This can be done by using <a href="https://en.wikipedia.org/wiki/Named-entity_recognition">named entity recognition</a> (NER) that identifies in the text mentions of people, locations, organizations, dates, etc. Every mention whose entity type corresponds to the expected answer type is a candidate answer. In the given example, any entity recognized as DATE in each candidate passage will be marked as a candidate answer, including <i>"27 January 1756"</i> (the correct answer) and <i>"5 December 1791"</i> (Mozart's death date).<br /><br />The system may also keep some lists that can be used to answer closed-domain questions, such as <i>"which city [...]"</i> or <i>"which color [...]"</i> that can be answered using a list of cities and a list of colors, respectively. If the system identified that the answer type is <i>color</i>, for example, it will search the candidate passage for items contained in the list of colors. In addition, for "how much" and "how many" questions, regular expressions identifying numbers and measures can be used.</span></span></li>
<br />
<li><span style="font-family: "roboto" , "arial" , sans-serif;"><b style="font-size: 13px;">Ranking</b><span style="font-size: 13px;"> - assign some score for each candidate answer, rank the candidate answers in descending order according to their scores, and return a list of ranked answers. This phase differs between systems. The simple approach would be to represent an answer by some characteristics (e.g. surrounding words) and learn a supervised classifier to rank the answers.</span><br /><br /><span style="font-size: 13px;">An alternative approach is to try to "prove" the answer logically. In the first phase, the system creates an expected answer template. In our example it would be </span><i style="font-size: 13px;">"Mozart was born in *"</i><span style="font-size: 13px;">. By assigning the candidate answer </span><i style="font-size: 13px;">"27 January 1756" </i><span style="font-size: 13px;">to the expected answer template, we get the hypothesis </span><i style="font-size: 13px;">"Mozart was born in <i>27 January 1756</i>"</i><span style="font-size: 13px;">, which we would like to prove from the candidate passage. Suppose that the candidate passage was </span><i style="font-size: 13px;">"[...] Wolfgang Amadeus Mozart was born in Salzburg, Austria, in January 27, 1756. 
[...]"</i><span style="font-size: 13px;">, a person would know that given the candidate passage, the hypothesis is true, therefore this candidate answer should be ranked high.</span><br /><br /><span style="font-size: 13px;">To do this automatically, Harabagiu and Hickl ([1]) used a </span><a href="https://en.wikipedia.org/wiki/Textual_entailment" style="font-size: 13px;">textual entailment</a><span style="font-size: 13px;"> system: such a system receives two texts and determines whether, if the first text (</span><i style="font-size: 13px;">text</i><span style="font-size: 13px;">) is true, the second one (</span><i style="font-size: 13px;">hypothesis</i><span style="font-size: 13px;">) is also true. Some of these systems return a number, indicating to what extent this is true. This number can be used for ranking answers.<br /><br />While this is a pretty cool idea, the unfortunate truth is that textual entailment systems do not perform better than question answering systems, nor very well in general. So reducing the question answering problem to that of recognizing textual entailment doesn't really solve question answering. </span></span></li>
<br />
</ul>
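The question-analysis and answer-extraction steps above can be sketched as follows; the WH-word table, type names, and candidate list are illustrative heuristics, not an actual system:

```python
# Map the WH-question word to an expected answer type (simple heuristic).
ANSWER_TYPE = {"who": "PERSON", "where": "LOCATION", "when": "DATE"}

def analyze(question):
    """Return the expected answer type of a factoid question."""
    wh = question.strip().lower().split()[0]
    return ANSWER_TYPE.get(wh, "UNKNOWN")

def filter_candidates(question, candidates):
    """Keep only the NER mentions whose type matches the expected answer type."""
    expected = analyze(question)
    return [text for text, ner_type in candidates if ner_type == expected]

# Hypothetical (mention, NER type) pairs extracted from candidate passages.
candidates = [("27 January 1756", "DATE"),
              ("5 December 1791", "DATE"),
              ("Salzburg", "LOCATION")]
filter_candidates("When was Mozart born?", candidates)
# ['27 January 1756', '5 December 1791'] -- ranking must still pick the right one
```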
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><b><u>Question answering from knowledge bases</u></b></span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><b><u><br /></u></b></span>
<span style="font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">A knowledge base, such as Freebase/<a href="http://www.wikidata.org/">Wikidata</a> and <a href="http://dbpedia.org/">DBPedia</a>, is a large-scale set of facts about the world in a machine-readable format. Entities are related to each other via relations, creating triplets like <i>(Donald Trump, spouse, Melania Trump)</i> and <i>(idiocracy, instance of, film)</i> (no association between the two facts whatsoever ;)). </span></span><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">Entities can be people, books and movies, countries, etc. Example relations are </span><i style="font-family: roboto, arial, sans-serif; font-size: 13px;">birth place, spouse, occupation, instance of</i><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">, etc. While these facts are saved in a format which is easy for a machine to read, I never heard of a human who searches for information in knowledge bases. Which is too bad, since they contain an abundance of information.</span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">So some researchers (e.g. [2], following [3]) came up with the great idea of letting people ask a question in natural language (e.g. </span><i style="font-family: roboto, arial, sans-serif; font-size: 13px;">"When was Mozart born?"</i><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">), parsing the question automatically to relate it to a fact in the knowledge base, and answering accordingly.</span><br />
<span style="font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">This reduces the question answering task to understanding the natural language question, whereas querying for the answer from a knowledge base requires no text processing. The task is called </span></span><span style="font-family: "roboto" , "arial" , sans-serif;"><b style="font-size: 13px;">executable semantic parsing</b><span style="font-size: 13px;">. The natural language question is mapped into some logic representation, e.g. </span><a href="https://en.wikipedia.org/wiki/Lambda_calculus" style="font-size: 13px;">Lambda calculus</a><span style="font-size: 13px;">. For example, the example question would be parsed to something like <b>λx.DateOfBirth(Mozart, x)</b>. The logical form is then </span><span style="font-size: 13px;">executed against a knowledge base; for instance, it would search for a fact such as <i>(Mozart, DateOfBirth, x)</i> and return x. </span></span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">Despite having the answer appear in a structured format rather than in free text, this task is still considered hard, because parsing a natural language utterance into a logical form is difficult.* </span></span></div>
</div>
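Executing the logical form against a knowledge base can be sketched like this (the triples and relation names are invented for illustration, not actual Freebase/Wikidata identifiers):

```python
# A toy knowledge base of (subject, relation, object) triples.
KB = [("Mozart", "DateOfBirth", "27 January 1756"),
      ("Mozart", "PlaceOfBirth", "Salzburg"),
      ("Donald Trump", "Spouse", "Melania Trump")]

def execute(subject, relation, kb):
    """Execute a logical form like λx.DateOfBirth(Mozart, x) against the KB."""
    return [o for (s, r, o) in kb if s == subject and r == relation]

# "When was Mozart born?"  ->  λx.DateOfBirth(Mozart, x)
execute("Mozart", "DateOfBirth", KB)  # ['27 January 1756']
```

The hard part, of course, is the mapping from the question to the logical form; once it is obtained, execution is a straightforward lookup.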
<div dir="ltr" style="text-align: left;">
<br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: 13px;">By the way, simply asking Google</span> <i style="font-family: roboto, arial, sans-serif; font-size: 13px;">"When was Mozart born?" </i><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">seems to take away my argument that "</span><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">searching the web to get an answer to a question is not quite the same as actually asking the question and getting a precise answer":</span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDCHz7jYvIIaXJkM4M0zfiQt0xuDFzjr5gjDiydIuo81fpDiJxwgpPDvReYvE3ra-hBFyRinZxwvMWnSXqGEvy2_Risg66dn8OgweVEGqFitWHoWA8VaIjW8iIHDkUoUGc5De23y-ppsg/s1600/google.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><span style="color: black;"><img border="0" height="442" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDCHz7jYvIIaXJkM4M0zfiQt0xuDFzjr5gjDiydIuo81fpDiJxwgpPDvReYvE3ra-hBFyRinZxwvMWnSXqGEvy2_Risg66dn8OgweVEGqFitWHoWA8VaIjW8iIHDkUoUGc5De23y-ppsg/s640/google.PNG" width="640" /></span></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Google understands the question and answers precisely.</td></tr>
</tbody></table>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">Only that it doesn't. Google added this feature to its search engine in 2012: for some queries and questions, it presents information boxes above the regular search results. They parse the natural language query and try to retrieve results from their huge knowledge base, known as the <a href="https://googleblog.blogspot.co.il/2012/05/introducing-knowledge-graph-things-not.html">Google Knowledge Graph</a>. Well, I don't know exactly how they do it, but I guess that, similarly to the previous paragraph, their main effort is in parsing and understanding the query, which can then be matched against facts in the graph.</span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span></div>
<hr dir="ltr" style="text-align: left;" />
<div dir="ltr" style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: x-small;"><u>References</u>:</span></div>
<div dir="ltr" style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: x-small;">[1] <i>Methods for Using Textual Entailment in Open-Domain Question Answering.</i> Sanda Harabagiu and Andrew Hickl. In COLING-ACL 2006.</span></div>
<div dir="ltr" style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: x-small;">[2] <i>Semantic Parsing on Freebase from Question-Answer Pairs.</i> Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. In EMNLP 2013.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: x-small;">[3] <i>Learning to parse database queries using inductive logic programming. </i>John M. Zelle and Raymond J. Mooney. In AAAI 1996.</span></div>
<div>
<hr />
<div dir="ltr" style="text-align: left;">
<span style="font-size: x-small;">* If you're interested in more details, I recommend going over the materials from the very interesting ESSLLI 2016 course on <a href="http://esslli2016.unibz.it/?page_id=356">executable semantic parsing</a>, which was given by Jonathan Berant.</span></div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com2tag:blogger.com,1999:blog-9145120678290195131.post-13922253901072285442016-08-28T12:06:00.000+03:002016-08-28T12:20:08.912+03:00Crowdsourcing (for NLP)<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
Developing new methods to solve scientific tasks is cool, but they usually require data. We researchers often find ourselves collecting data rather than trying to solve new problems. I've collected data for most of my papers, but never thought of it as an interesting blog post topic. Recently, I attended <a href="http://esslli2016.unibz.it/?page_id=346">Chris Biemann's excellent crowdsourcing course</a> at <a href="http://esslli2016.unibz.it/">ESSLLI 2016</a> (the 28th European Summer School in Logic, Language and Information), and was inspired to write about the topic. This blog post will be much less technical and much more high-level than the course, as my posts usually are. Nevertheless, credit for many interesting insights on the topic goes to Chris Biemann.<sup><a href="http://veredshwartz.blogspot.co.il/2016/08/crowdsourcing-for-nlp.html#1" name="top1">1</a> </sup>
</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<b><br />Who needs data anyway?</b><br />
So let's start from the beginning: what is this data and why do we need it? Suppose that I'm working on automatic methods to recognize the semantic relation between words, e.g. I want my model to know that <i>cat</i> is a type of <i>animal</i>, and that <i>wheel </i>is a part of a <i>car</i>.<br />
<br />
At the very basic level, if I already developed such a method, I will want to check <b>how well it does compared to humans</b>. Evaluation of my method requires annotated data, i.e. a set of word pairs and their corresponding true semantic relations, annotated by humans. This will be the "test set"; the human annotations are considered as "gold/true labels". My model will try to predict the semantic relation between each word-pair (without accessing the true labels). Then, I will use some evaluation metric (e.g. precision, recall, F1 or accuracy) to see how well my model predicted the human annotations. For instance, my model would have 80% accuracy if for 80% of the word-pairs it predicted the same relation as the human annotators.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQ4XL95VC9GxN6ktQZNFwVUk1nA06KwrCzyeOyN3RVxEeNUsBWnDqI0nGxARfotuoITkF4xWVRjJbc6ke5cB1qOP_RErwiayiI2dva4l8uRoO_7xhYjRFo4U9HUbLXuEPF-B08BFj38lI/s1600/dataset_example.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="126" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQ4XL95VC9GxN6ktQZNFwVUk1nA06KwrCzyeOyN3RVxEeNUsBWnDqI0nGxARfotuoITkF4xWVRjJbc6ke5cB1qOP_RErwiayiI2dva4l8uRoO_7xhYjRFo4U9HUbLXuEPF-B08BFj38lI/s320/dataset_example.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1: an example of dataset entries for recognizing the semantic relation between words.</td></tr>
</tbody></table>
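To make the accuracy computation concrete, here is a minimal Python sketch; the relation labels and label names are made up for illustration:

```python
def accuracy(predictions, gold_labels):
    # fraction of examples where the model's prediction
    # matches the human-annotated (gold) label
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

gold = ["hypernym", "meronym", "hypernym", "antonym", "hypernym"]
pred = ["hypernym", "meronym", "antonym", "antonym", "hypernym"]
print(accuracy(pred, gold))  # 0.8 -- the model agreed with the annotators on 4 of 5 pairs
```

The same idea extends to precision, recall, and F1; accuracy is simply the easiest metric to state in one line.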
If that were the only data I needed, I would have been lucky: you don't need that many examples to test your method. I could select some word-pairs (randomly or using some heuristics) and annotate them myself, or bribe my colleagues with cookies (as I have successfully done twice). The problem starts when you need training data, i.e., when you want your model to learn to predict something based on labelled examples. That usually requires many more examples, and annotating data is tiring, Sisyphean work.<br />
<br />
What should we do, then? Outsource the annotation process -- i.e., pay with real money, not cookies!<br />
<br />
<b><br />What is crowdsourcing?</b><br />
<br />
The word crowdsourcing is a blend word composed of <b>crowd (intelligence) + (out-)sourcing</b> [1]. The idea is to take a task that can be performed by experts (e.g. translating a document from English to Spanish), and outsource it to a large crowd of non-experts (<i>workers</i>) that can perform it.<br />
<br />
The <i>requester </i>defines the task, and the <i>workers</i> work on it. The requester then decides whether to accept or reject the work, and pays the workers whose work was accepted.<br />
<br />
The benefits of using "regular" people rather than experts are:<br />
<ol style="text-align: left;">
<li>You pay them much less than experts - typically a few cents per question (/<i>task</i>). For example, [2] found that in translation tasks, the crowd reached the same quality as the professionals at less than 12% of the cost.</li>
<li>They are more easily available via crowdsourcing platforms (see below).</li>
<li>By letting multiple people work on the task rather than a single/few experts, the task could be completed in a shorter time. </li>
</ol>
The obvious observation is that the quality of a single <i>worker</i> is not as good as that of an expert; in crowdsourcing, it is <b>not a single worker that replaces the expert, but the</b> <b>crowd</b>. Rather than trusting a single worker, you assign each task to a certain number of workers and combine their results. A common practice is majority voting. For instance, let's say that I ask 5 workers what the semantic relation between <i>cat </i>and <i>dog</i> is, giving them several options. Three of them say that <i>cat </i>and <i>dog </i>are mutually exclusive words (i.e. one cannot be both a cat and a dog), one says that they are opposites, and one says that <i>cat</i> is a type of <i>dog</i>. The majority has voted in favor of mutually exclusive, and this is what I will consider the correct answer.<sup><a href="http://veredshwartz.blogspot.co.il/2016/08/crowdsourcing-for-nlp.html#2" name="top2">2</a></sup><br />
<br />
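The majority-voting aggregation described above fits in a few lines of Python; the answer labels here are illustrative:

```python
from collections import Counter

def aggregate(answers):
    # majority vote over the workers' answers to a single question;
    # most_common breaks ties by order of first appearance
    return Counter(answers).most_common(1)[0][0]

# five workers answered the cat/dog question:
votes = ["exclusive", "opposite", "exclusive", "hyponym", "exclusive"]
print(aggregate(votes))  # exclusive
```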
The main crowdsourcing platforms (out of many others) are <a href="https://www.mturk.com/mturk/welcome">Amazon Mechanical Turk</a> and <a href="https://www.crowdflower.com/">CrowdFlower</a>. In this blog post I will not discuss the technical details of these platforms. If you are interested in a comparison between the two, refer to <a href="http://crowdsourcing-class.org/tutorial_slides/03-crowdsourcing_platforms.pdf">these slides</a> from the NAACL 2015 crowdsourcing tutorial.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiR-c6GFF2XL9lVQ2SHi-hXJFCres_-x3RjC6xwb5UeW0_yvL-6Pv4UzGR7JzLFNO-IK7yQvhrqfVZHIAT1jTbfnFIJZhREHCFT2EQarF8m3mUnIkvhSzbr-7OH0NOJ-a_L5YNoP0p9zlc/s1600/annotation.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="210" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiR-c6GFF2XL9lVQ2SHi-hXJFCres_-x3RjC6xwb5UeW0_yvL-6Pv4UzGR7JzLFNO-IK7yQvhrqfVZHIAT1jTbfnFIJZhREHCFT2EQarF8m3mUnIkvhSzbr-7OH0NOJ-a_L5YNoP0p9zlc/s400/annotation.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 2: An example of a question in Amazon Mechanical Turk, from my project.</td></tr>
</tbody></table>
<b><br />What can be crowdsourced?</b><br />
<br />
Not all the data we need can be collected via crowdsourcing; some data may require expert annotation. For example, if we need to annotate the syntactic trees of sentences in natural language, it's probably a bad idea to ask non-experts to do so.<br />
<br />
The rules of thumb for crowdsourcability are:<br />
<ul style="text-align: left;">
<li>The task is easy to explain, and you as a requester indeed explain it simply. The key idea is to <b>keep it simple</b>. The instructions should be short - do not expect workers to read a 50-page manual; they don't get paid enough for that - and should include examples.<br /></li>
<li>People can easily agree on the "correct" answer, e.g. <i>"is there a cat in this image?"</i> is good, <i>"what is the meaning of life?"</i> is really bad. Everything else is borderline :) One thing to consider is the possible number of correct answers. For instance, if the worker should reply with a sentence (e.g. "describe the following image"), they can do so in many different ways. Always aim for one possible answer per question.</li>
<li>Each question is relatively small.</li>
<li>Bonus: the task is fun. Workers will do better if they enjoy the task. If you can think of a way to <a href="https://en.wikipedia.org/wiki/Gamification">gamify</a> your task, do so!</li>
</ul>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiL_E3XKXzoQbXSNztq1AsoOWsIbIG4NUiolOCOcUtu2zaydGe-Qdw03ZfozS_bYaSaUZu1a_TvrGR5m4XPLtXNNnkEiBV9RUNylUjnkjngU5XjbaFa8XmQCbBK7x67vr7XzvRtszbpJ9Y/s1600/cat.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiL_E3XKXzoQbXSNztq1AsoOWsIbIG4NUiolOCOcUtu2zaydGe-Qdw03ZfozS_bYaSaUZu1a_TvrGR5m4XPLtXNNnkEiBV9RUNylUjnkjngU5XjbaFa8XmQCbBK7x67vr7XzvRtszbpJ9Y/s320/cat.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 3: Is there a cat in this image?</td></tr>
</tbody></table>
<div>
<br />
Some tasks are borderline and may become suitable for crowdsourcing if presented in the right way to the workers. If the task at hand seems too complicated to be crowdsourced, ask yourself: can I break it into smaller tasks that can each be crowdsourced? For example, let workers write a sentence that describes an image, and accept all answers; then let other workers validate the sentences (ask them: does this sentence really describe this image?).</div>
<b><br /></b>
<b>Some examples for (mostly language-related) data collected with crowdsourcing</b><br />
(references omitted, but are available in the course slides in the link above).<br />
<ul style="text-align: left;">
<li>Checking whether a sentence is grammatical or not.</li>
<li>Alignment of dictionary definitions - for instance, if a word has multiple meanings, and hence has multiple definitions in each dictionary - the task was to align the definitions corresponding to the same meaning in different dictionaries.</li>
<li>Translation.</li>
<li>Paraphrase collection - get multiple sentences with the same meaning. These were obtained by asking multiple workers to describe the same short video.</li>
<li><span style="color: red;"><a href="https://en.wikipedia.org/wiki/Duolingo#Crowdsourced_translation">Duolingo started as a crowdsourcing project!</a></span></li>
<li><span style="color: red;"><a href="https://en.wikipedia.org/wiki/ReCAPTCHA#Criticism">And so did reCAPTCHA!</a></span></li>
</ul>
<b>How to control for the quality of data?</b><br />
<br />
OK, so we collected a lot of data. How do we even know if it's good? Can I trust my workers to do well on the task? Could they be as good as experts? And what if they are just after my money, cheating on the task to get paid with minimal effort?<br />
<br />
There are many ways to control for the quality of workers:
<br />
<ol style="text-align: left;">
<li>The crowdsourcing platforms provide some information about the workers, such as the number of tasks they completed in the past, their <b>approval rate</b> (% of their tasks that were approved), location, etc. You can define your requirements from the workers based on this information.</li>
<li><b>Don't trust a single worker</b> -- define that your task should be answered by a certain number of workers (typically 5) and aggregate their answers (e.g. by majority voting).</li>
<li>Create <b>control questions</b> - a few questions for which you know the correct answer. These questions are displayed to the worker just like any other questions. If a worker fails to answer too many control questions, the worker is either not good or trying to cheat you. Don't use this worker's answers (and don't let the worker participate in the task anymore; either by rejecting their work or by blocking them).<sup><a href="http://veredshwartz.blogspot.co.il/2016/08/crowdsourcing-for-nlp.html#3" name="top3">3</a></sup></li>
<li>Create a <b>qualification test</b> - a few questions for which you know the correct answer. You can require that any worker who wants to work on your task must take the test and pass it. As opposed to the control questions, the test questions don't have to be identical in format to the task itself, but should predict the worker's ability to perform the task well.</li>
<li><b>Second-pass reviewing</b> - create another task in which workers validate previous workers' answers. </li>
<li><b>Bonus the good workers</b> - they will want to keep working for you.<b><br /></b></li>
<li><b>Watch out for spammers! </b>Some workers are only after your money, and they don't take your task seriously, e.g. they will click on the same answer for all questions. There is no correlation between the number of questions workers answer and their quality; however, it is worth looking at the most productive workers: some of them may be very good (and you might want to give them bonuses), while others may be spammers.</li>
</ol>
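A sketch of how control questions can filter out unreliable workers; the worker ids, answers, and the 80% threshold are all hypothetical:

```python
def reliable_workers(control_answers, gold, min_accuracy=0.8):
    # keep only workers who answered enough control questions correctly;
    # control_answers maps worker id -> answers aligned with the gold list
    keep = {}
    for worker, answers in control_answers.items():
        correct = sum(a == g for a, g in zip(answers, gold))
        if correct / len(gold) >= min_accuracy:
            keep[worker] = answers
    return keep

gold = ["yes", "no", "yes", "yes", "no"]
workers = {
    "w1": ["yes", "no", "yes", "yes", "no"],    # 5/5 correct -> kept
    "w2": ["yes", "yes", "yes", "yes", "yes"],  # 3/5, same answer everywhere -> dropped
}
print(sorted(reliable_workers(workers, gold)))  # ['w1']
```

In practice you would also combine this with the qualification test and second-pass review described above, rather than rely on control questions alone.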
<b>Ethical issues in crowdsourcing</b><br />
<b><br /></b>
As a requester, you need to make sure you treat your workers properly. Always remember that workers are first of all people. When you consider how much to pay or whether to reject a worker's work, think of the following:<br />
<br />
<ul style="text-align: left;">
<li>Many workers rely on crowdsourcing as their main income. </li>
<li>They have no job security.</li>
<li>Rejection in some cases is unfair - even if the worker was bad in the task, they still spent time working (unless you are sure that they are cheating).</li>
<li>New workers do lower-paid work to build up their reputation, but underpaying is not fair and not ethical.</li>
<li>Are you sure you explained the task well? Maybe it is your fault if all the workers performed badly?</li>
</ul>
The good news is that, from my little experience, paying well pays off for the requester too. If you pay enough (but not too much!), you get good workers that want to do the task well. When you underpay, the good workers don't want to work on your task - they can get better paying tasks. The time to complete the task will be longer. And if you are like me, the thought of underpaying your workers will keep you awake at night. So pay well :)<sup><a href="http://veredshwartz.blogspot.co.il/2016/08/crowdsourcing-for-nlp.html#4" name="top4">4</a></sup><br />
<br />
<b>Important take-outs for successful crowdsourcing:</b><br />
<ul style="text-align: left;">
<li>Work in small batches. If you have 10,000 questions, don't publish all at once. Try some, learn from your mistakes, correct them and publish another batch. Mistakes are bound to happen, and they might cost you good money!</li>
<li>Use worker errors to improve instructions (remember: it might be your fault).</li>
<li>KEEP. IT. SIMPLE.</li>
<li>Use quality control mechanisms.</li>
<li>Don't underpay!</li>
<li>Always expect workers to be sloppy. Repeat guidelines and questions and don't expect workers to remember them.</li>
<li>If your questions are automatically generated, use random order and try to balance the number of questions with each expected answer, otherwise workers will exploit this bias (e.g. if most word-pairs are unrelated, they will mark <b>all of them</b> as unrelated without looking twice).</li>
<li>Make workers' lives easier, and they will perform better. For instance, if you have multiple questions regarding the same word, group them together.</li>
<li>If you find a way to make your task more fun, do so!</li>
</ul>
<hr />
<span style="font-family: inherit; font-size: xx-small;">
<span style="color: black; font-size: small;"><b><u>References</u></b></span></span><br />
<span style="color: black; font-family: inherit; font-size: small;"><span style="color: black; font-size: small;">[1] Howe, Jeff. <i>The rise of crowdsourcing.</i> Wired magazine 14.6 (2006).<br />[2] Omar F. </span>Zaidan and Chris Callison-Burch <i>Crowdsourcing translation: professional quality from non-professionals.</i> In ACL 2011.</span><br />
<br /></div>
<hr />
<div style="direction: ltr; text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;">
<span style="color: black;"><a href="https://www.blogger.com/null" name="1" style="font-weight: bold;">1</a><b> </b></span>And I would also like to mention another wonderful <a href="http://naacl.org/naacl-hlt-2015/tutorial-crowdsourcing-for-nlp.html">crowdsourcing tutorial</a> that I attended last year at NAACL 2015, given by Chris Callison-Burch, Lyle Ungar, and Ellie Pavlick. Unfortunately, at that time I had no personal experience with crowdsourcing, nor did I believe that my university would ever have the budget for it, so I made no effort to remember the technical details; I was completely wrong. A year later I published a <a href="https://aclweb.org/anthology/S/S16/S16-2013.pdf">paper</a> about a dataset collected with crowdsourcing, which even won a best paper award :)</span> <sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2016/08/crowdsourcing-for-nlp.html#top1" style="font-size: small;">↩</a></sup>
<br />
<span class="Apple-style-span" style="font-size: x-small;"><a href="https://www.blogger.com/null" name="2" style="font-weight: bold;">2</a><b> </b>For more sophisticated aggregation methods that assign weights to workers based on their quality, see <a href="http://www.dirkhovy.com/blog/index.php?id=81">MACE</a>.</span><span style="font-size: x-small;"> </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2016/08/crowdsourcing-for-nlp.html#top2" style="font-size: small;">↩</a></sup></div>
<div dir="ltr" style="text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;"><a href="https://www.blogger.com/null" name="3" style="font-weight: bold;">3</a><b> </b></span><span style="font-size: x-small;">Blocking a worker means that they can't work on your tasks anymore. Rejecting a worker means that they are not paid for the work they have already done. As far as I know, it is not recommended to reject a worker, because then they write bad things about you in <a href="http://turkernation.com/">Turker Nation</a> and nobody wants to work for you anymore. In addition, you should always give workers the benefit of the doubt; maybe you didn't explain the task well enough.</span><a href="http://veredshwartz.blogspot.co.il/2016/08/crowdsourcing-for-nlp.html#top3" style="font-size: small;"><sup>↩</sup></a></div>
<div dir="ltr" style="text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;"><a href="https://www.blogger.com/null" name="4" style="font-weight: bold;">4</a><b> </b></span><span style="font-size: x-small;">So how much should you pay? First of all, not less than 2 cents. Second, try to estimate how long a single question takes and aim for an hourly pay of around 6 USD. For example, in <a href="https://aclweb.org/anthology/S/S16/S16-2013.pdf">this paper</a> I paid 5 cents per question, which I've been told is the upper bound for such tasks.</span><a href="http://veredshwartz.blogspot.co.il/2016/08/crowdsourcing-for-nlp.html#top4" style="font-size: small;"><sup>↩</sup></a></div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com2tag:blogger.com,1999:blog-9145120678290195131.post-9077860186856558932016-06-20T17:35:00.000+03:002016-06-22T19:39:44.757+03:00Linguistic Analysis of Texts<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
Not long ago, Google released their new parser, oddly named <a href="http://googleresearch.blogspot.co.il/2016/05/announcing-syntaxnet-worlds-most.html">Parsey McParseface</a>. For a couple of days, popular media was swamped with announcements about Google solving all AI problems with their new magical software that understands language [e.g. <a href="http://www.cnbc.com/2016/05/13/google-launched-an-ai-tool-that-understands-english-called-parsey-mcparseface.html">1</a>, <a href="http://www.telegraph.co.uk/technology/2016/05/17/has-googles-parsey-mcparseface-just-solved-one-of-the-worlds-big/">2</a>].</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
Well, that's not quite what it does. In this post, I will explain about the different steps applied for analyzing sentence structure. These are usually used as a preprocessing step for higher-level tasks that try understanding the meaning of sentences, e.g. <a href="https://en.wikipedia.org/wiki/Intelligent_personal_assistant">intelligent personal assistants</a> like Siri or Google Now.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
The following tools are traditionally used one after the other (also known as the "linguistic annotation/processing pipeline"). Generally speaking, the tasks in this list are ordered by decreasing accuracy of the available tools: some low-level tasks are considered practically solved, while others still have room for improvement.
<div dir="ltr" style="text-align: left;">
</div>
<ol dir="ltr" style="text-align: left;">
<li><b>Sentence splitting</b> - as simple as it sounds: receives a text document/paragraph and returns its partition into sentences. While it sounds like a trivial task -- cut the text on every occurrence of a period -- it is a bit trickier than that; sentences can end with an exclamation / question mark, and periods are also used in acronyms and abbreviations in the middle of the sentence. The simple period rule will fail on <a href="http://articles.chicagotribune.com/1995-06-13/features/9506130094_1_thin-diet-drizzle">this text</a>, for example. Still, sentence splitting is practically considered a solved task, using predefined rules and some learning algorithms. See <a href="http://tech.grammarly.com/blog/posts/How-to-Split-Sentences.html">this</a> for more details.<br /><br /></li>
<li><b>Tokenization </b>- a tokenizer receives a sentence and splits it into tokens. Tokens are mostly words, but words that are short forms of negation or auxiliaries are split into two tokens, e.g. <i>I'm => I 'm</i>, <i>aren't => are n't</i>.<br /><br /></li>
<li><b>Stemming</b> / <b>Lemmatization </b>- Words appear in natural language in many forms, for instance, verbs have different tense suffixes (-ing, -ed, -s), nouns have plurality suffixes (-s), and adding suffixes to words can sometimes change their grammatical categories, as in <i>nation </i>(noun) => <i>national </i>(adjective) => <i>nationalize </i>(verb).<br />The goal of both stemmers and lemmatizers is to "normalize" words to their common base form, such as <i>"cats" => "cat", "eating" => "eat"</i>. This is useful for many text-processing applications, e.g. if you want to count how many times the word <i>cat</i> appears in the text, you may also want to count the occurrences of <i>cats</i>. <br />The difference between these two tools is that <b>stemming</b> removes the affixes of a word, to get its stem (root), which is not necessarily a word on its own, as in <i>driving </i>=> <i>driv</i>. <b>Lemmatization</b>, on the other hand, analyzes the word morphologically and returns its lemma. A lemma is the form in which a word appears in the dictionary (e.g. singular for nouns as in <i>cats </i>=> <i>cat</i>, infinitive for verbs as in <i>driving </i>=> <i>drive</i>). <br />Using a lemmatizer is always preferred, unless there is no accurate lemmatizer for that language, in which case a stemmer is better than nothing.</li>
<br />
<li><b>Part of speech tagging</b> - receives a sentence, and tags each word (token) with its part of speech (POS): noun, verb, adjective, adverb, preposition, etc. For instance, the following sentence: <i>I'm using a part of speech tagger</i> is tagged in <a href="http://nlp.stanford.edu:8080/parser/index.jsp">Stanford Parser</a> as:<br /><i>I/PRP 'm/VBP using/VBG a/DT part/NN of/IN speech/NN tagger/NN ./. </i>Which means that <i>I</i> is a personal pronoun, <i>'m</i> (am) is a verb, non-3rd person singular present, and if you're interested, here's the <a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">list</a> to interpret the rest of the tags.<br />(POS taggers achieve around 97% accuracy).</li>
<br />
<li><b>Syntactic parsing</b> - analyzes the syntactic structure of a sentence, outputting one of two types of parse trees: constituency-based or dependency-based.<br /><br /><b>Constituency</b> - segments the sentence into syntactic phrases: for instance, in the sentence <i>the brown dog ate dog food</i>, [the brown dog] is a noun phrase, [ate dog food] is a verb phrase, and [dog food] is also a noun phrase.<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="text-align: center;"><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRDV3dZ4ccf_Nhr5tTW5_sJmXpG9RnZ8jYne4FUffwdl2rdruk6vZbxdLLsM3w3-epKxcxs35RtA-Yd3PzKx4Nv73fTQ_7gSDt-dh8t_lwLu_tUmPOPPvgTs_QoLxVqcMmkSPipp3BqXI/s1600/constituency.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRDV3dZ4ccf_Nhr5tTW5_sJmXpG9RnZ8jYne4FUffwdl2rdruk6vZbxdLLsM3w3-epKxcxs35RtA-Yd3PzKx4Nv73fTQ_7gSDt-dh8t_lwLu_tUmPOPPvgTs_QoLxVqcMmkSPipp3BqXI/s1600/constituency.png" /></a></div>
<span style="font-size: x-small;">An example of constituency parse tree, parsed manually by me and visualized using <a href="http://mshang.ca/syntree/">syntax tree generator</a>.</span></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<b>Dependency</b> - connects words in the sentence according to their relationship (subject, modifier, object, etc.). For example, in the sentence <i>the brown dog ate dog food</i>, the word <i>brown</i> is a modifier of the word <i>dog</i>, which is the subject of the sentence. I've mentioned dependency trees in the <a href="https://veredshwartz.blogspot.co.il/2016/05/improving-hypernymy-detection.html">previous post</a>: I used them to represent the relation that holds between two words, which is a common use.<br />(Parsey McParseface is a dependency parser. Best dependency parsers achieve around 94% accuracy).</li>
</ol>
<div dir="ltr" style="text-align: left;">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="text-align: center;"><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheZoo891B4NBvqsAiD7v3OyRo4b-IVDuL3q-8lK9QwrsHBiY01AJtpx5t0a04qqHvBoV85BdeW3K3qFwYYFh3d-q4zyYWpccy7uKmoGhhD9BR7zkwsEnkjWaG2pg68HDivIK7qyhknEIk/s1600/dependency.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheZoo891B4NBvqsAiD7v3OyRo4b-IVDuL3q-8lK9QwrsHBiY01AJtpx5t0a04qqHvBoV85BdeW3K3qFwYYFh3d-q4zyYWpccy7uKmoGhhD9BR7zkwsEnkjWaG2pg68HDivIK7qyhknEIk/s1600/dependency.PNG" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
An example of dependency parser output, using Stanford Core NLP.</td></tr>
</tbody></table>
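To give a feeling for the first two steps of the pipeline, here is a deliberately naive Python sketch of sentence splitting and tokenization. Real tools use many more rules and learned models (this splitter, for instance, still fails on abbreviations), so treat it as an illustration only:

```python
import re

def split_sentences(text):
    # naive rule: a period/!/? followed by whitespace and a capital
    # letter ends a sentence; real splitters also handle abbreviations
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

def tokenize(sentence):
    # separate clitics: aren't -> are n't, I'm -> I 'm
    sentence = re.sub(r"n't\b", " n't", sentence)
    sentence = re.sub(r"'(m|s|re|ve|ll|d)\b", r" '\1", sentence)
    # split punctuation off the preceding word
    sentence = re.sub(r"([.,!?;:])", r" \1", sentence)
    return sentence.split()

text = "I'm using a part of speech tagger. It works well."
for sentence in split_sentences(text):
    print(tokenize(sentence))
# ['I', "'m", 'using', 'a', 'part', 'of', 'speech', 'tagger', '.']
# ['It', 'works', 'well', '.']
```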
<br />
Other tools, which are less basic but often used, include:</div>
<div dir="ltr" style="text-align: left;">
<ul style="text-align: left;">
<li><b>Named entity recognition (NER)</b> - receives a text and marks certain words or multi-word expressions in the text with named entity tags, such as PERSON, LOCATION, ORGANIZATION, etc. <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8FrhFO0Q0Ud7TkdbizGx_np4lGT4_F2Y4YHHhWjz7ShcCvwNl5Z06WVX2AYa8bMDpl84h0AUGacG06dQWx6fuT0IW8fdLulWpYLHvYeGwKUwdb84F09Da2qxIwf0k6Lz-yE-kXC8e_Tc/s1600/ner.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8FrhFO0Q0Ud7TkdbizGx_np4lGT4_F2Y4YHHhWjz7ShcCvwNl5Z06WVX2AYa8bMDpl84h0AUGacG06dQWx6fuT0IW8fdLulWpYLHvYeGwKUwdb84F09Da2qxIwf0k6Lz-yE-kXC8e_Tc/s1600/ner.PNG" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: 12.8px;">An example of NER from Stanford Core NLP.</span></td></tr>
</tbody></table>
</li>
<br /><br />
<li><b>Coreference resolution</b> - receives a text and connects words that refer to the same entity (called "mentions"). This includes, but is not limited to:<br />pronouns (he, she, I, they, etc.) - <i>I just read <span style="color: red;">Lullaby</span>. <span style="color: red;">It</span> is a great book.</i><br />different names / abbreviations - e.g., the beginning of the text mentions <span style="color: red;">Barack Obama</span>, who is later referred to as <span style="color: red;">Obama</span>.<br />semantic relatedness - e.g. the beginning of the text mentions <span style="color: red;">Apple </span>which is later referred to as <span style="color: red;">the </span><span style="color: red;">company</span>.<br /><br />This is actually a tougher task than the previous ones, and accordingly, it achieves less accurate results. In particular, sometimes it is difficult to determine which entity a certain mention refers to (while it's easy for a human to tell): e.g. <i>I told John I don't want Bob to join dinner, because I don't like him.</i> Who does <i>him </i>refer to?<br />Another thing is that it is very sensitive to context, e.g. in one context <span style="color: red;">apple </span>can be co-referent with <span style="color: red;">the company</span>, while in another, which discusses the fruit, it is not.<br /><br /></li>
<li><b>Word sense disambiguation (WSD)</b> - receives a text and decides on the correct sense of each word in the given context. For instance, if we return to the <i>apple </i>example, in the sentence <i>Apple released the new iPhone</i>, the correct sense of apple is the company, while in <i>I ate an apple after lunch</i> the correct sense is the fruit. Most WSD systems use <a href="http://wordnetweb.princeton.edu/perl/webwn">WordNet</a> for the sense inventory.<br /><br /></li>
<li><b>Entity linking</b> - and in particular, <b>Wikification</b>: receives a text and links entities in the text to the corresponding Wikipedia articles. For instance, in the sentence <i>1984 is the best book I've ever read, </i>the word 1984 should be linked to <a href="https://en.wikipedia.org/wiki/Nineteen_Eighty-Four">https://en.wikipedia.org/wiki/Nineteen_Eighty-Four</a> (rather than to the <a href="https://en.wikipedia.org/wiki/1984_(disambiguation)">articles</a> discussing the films / TV shows).<br />Entity linking can complement word sense disambiguation, since most proper names (as <i>Apple </i>or <i>1984</i>) are not present in WordNet.<br /><br /></li>
<li><b>Semantic role labeling (SRL)</b> - receives a sentence and detects the predicates and arguments in the sentence. A predicate is usually a verb, and each verb may have several arguments, such as agent / subject (the person who does the action), theme (the person or thing that undergoes the action), instrument (what was used for doing the action), etc. For instance, in the sentence <i>John baked a cake for Mary</i>, the predicate is bake, and the arguments are agent:John, theme:cake, and goal:Mary. This is not just the last task on my list: it is also the task that comes closest to understanding the semantics of a sentence.</li>
</ul>
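As an illustration of the idea behind many WSD systems, here is a simplified version of the classic Lesk algorithm: pick the sense whose dictionary gloss shares the most words with the context. The sense names and glosses below are invented for the example; real systems use WordNet glosses and smarter overlap measures:

```python
def lesk(context_words, senses):
    # pick the sense whose gloss shares the most words with the context;
    # `senses` maps a sense name to its dictionary gloss (a string)
    context = set(w.lower() for w in context_words)
    def overlap(gloss):
        return len(context & set(gloss.lower().split()))
    return max(senses, key=lambda s: overlap(senses[s]))

senses = {
    "apple_company": "technology company that designs phones and computers",
    "apple_fruit": "round fruit of a tree eaten raw or cooked",
}
print(lesk("Apple released the new iPhone phones".split(), senses))  # apple_company
```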
</div>
<div dir="ltr" style="text-align: left;">
<br />
Here is an example for (a partial) analysis of the sentence: <i>The brown dog ate dog food, and now he is going to sleep</i><i>, </i>using <a href="http://nlp.stanford.edu:8080/corenlp/process">Stanford Core NLP</a>:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGj4mR9ec6zbqw0EugTaBiyeUczVTTC3ThPgZ9ZJgQqPXEzfKl4Fb4PoFPHnuN4EGhNJs2HxNP2FAiD_vYojHJPznB2qGwvQhWZLf6qGJYn5_v8tXTywjoZPGUuswLQP4BD21xEIlqKvs/s1600/all.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="215" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGj4mR9ec6zbqw0EugTaBiyeUczVTTC3ThPgZ9ZJgQqPXEzfKl4Fb4PoFPHnuN4EGhNJs2HxNP2FAiD_vYojHJPznB2qGwvQhWZLf6qGJYn5_v8tXTywjoZPGUuswLQP4BD21xEIlqKvs/s400/all.PNG" width="400" /></a></div>
</td></tr>
<tr><td class="tr-caption" style="text-align: center;">Analysis of <i>The brown dog ate dog food</i>, <i>and now he is going to sleep</i>, using Stanford Core NLP.</td></tr>
</tbody></table>
All this effort, and we are not even talking yet about deep understanding of the sentence's meaning, but only about analyzing the sentence's structure, perhaps as a step toward understanding its meaning. As my previous posts show, it's hard enough to understand the meaning of a single word. In one of my next posts I will describe methods that deal with the semantics of a sentence.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
By the way, if you are a potential user of these tools and you are looking for a parser, Google's parser is not the only one available. <a href="https://github.com/elikip/bist-parser">BIST</a> is more accurate and faster than Parsey McParseface, and <a href="http://spacy.io/">spaCy</a> is slightly less accurate, but much faster than both. </div>
<div dir="ltr" style="text-align: left;">
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com4tag:blogger.com,1999:blog-9145120678290195131.post-86789151488467176562016-05-25T18:24:00.002+03:002016-06-22T19:38:52.564+03:00Improving Hypernymy Detection<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
From time to time I actually make some progress with my own research that I think might be interesting or useful to others. Now is such a time,* so let me share it with you.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
If you've read my blog post about <a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html">lexical inference</a>, then you should already be familiar with my research goals. In short: I'm working on automated methods that recognize when the meaning of one term (a word or multi-word expression) can be inferred from that of another. </div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
There are several interesting lexical-semantic relations, such as synonymy/equivalence (<i>intelligent/smart, elevator/lift</i>), hypernymy/subclass (<i>parrot/bird, stare/look</i>), meronymy/part-of (<i>spoon/cutlery, London/England</i>), antonymy/opposite (<i>short/tall, boy/girl</i>), and causality (<i>flu/fever, rain/flood</i>). </div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
These relations are interesting, because whenever we encounter one of the terms in a given sentence, we can use our knowledge to infer new facts. For instance, the sentence "<i>I live in London</i>" (wishful thinking...), could be used to infer "<i>I live in England</i>", knowing that <i>London </i>is a part of <i>England</i>. Of course we also need to know something about the sentence itself, because saying that "<i>I left London</i>" doesn't necessarily entail that "<i>I left England</i>". I might have just taken the train to Reading for the <a href="http://ukandireland-aug14.blogspot.co.il/2014/09/23082014.html">Festival</a> :) But this is another line of research which I haven't explored deeply yet, so we'll leave that for another post.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
In this particular work, we've focused on the common hypernymy relation between nouns (and noun phrases). We developed a method that given a pair of nouns <i>(x, y)</i> (e.g. <i>(cat, animal)</i>, <i>(abbey road, record), (apple, animal)</i>) predicts whether <i>y </i>is a hypernym of <i>x</i> - or in other words, whether <i>x</i> is a subclass of <i>y</i> (e.g. <i>cats </i>are a subclass of <i>animals</i>) or an instance of <i>y </i>(e.g. <i>abbey road</i> is an instance of <i>record</i>). I'll try to keep things simple. If you're interested in more details or the references to other papers, please refer to the <a href="http://arxiv.org/pdf/1603.06076v1.pdf">paper</a>.</div>
<div dir="ltr" style="text-align: left;">
<br />
There are two main approaches in the hypernymy detection literature: path-based and distributional. Path-based methods are very elegant (a matter of taste, I guess). They assume that if <i>y</i> is a hypernym of <i>x</i>, this relation will be expressed frequently enough in a large text corpus. A pair of words that tends to be connected through patterns such as "X and other Y", "Y such as X", or "X is a Y" is likely to hold the hypernymy relation (e.g. <i>cats and other animals, fruit such as apples</i>). To get past adjectives and relative clauses that stand between the important pieces of information (as in<i> Abbey Road is [the 11th studio] album</i>), a <a href="https://en.wikipedia.org/wiki/Dependency_grammar">dependency parser</a> is used, which outputs the syntactic relations between the words in the sentence (e.g. <i>Abbey Road</i> is connected to <i>is</i>, and <i>is </i>is connected to <i>album</i>). See the figure below for an example of such a path.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFAr5cHdNv5kPVEGA4RdOfFmO3zmCVO9TG30vt4osfshRIN5r6BG6KZs9Y2ZjM-MMTa0uLsD-mm2xv6rmQrYrTcaTgtB-dN1kRZX208XWflUC3QcNqgMyqyHfuWlD5tza92M6p9mXiwQE/s1600/x_is_y.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFAr5cHdNv5kPVEGA4RdOfFmO3zmCVO9TG30vt4osfshRIN5r6BG6KZs9Y2ZjM-MMTa0uLsD-mm2xv6rmQrYrTcaTgtB-dN1kRZX208XWflUC3QcNqgMyqyHfuWlD5tza92M6p9mXiwQE/s1600/x_is_y.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="direction: ltr; font-size: 12.8px;">An example of a dependency path between <i>parrot </i>and <i>bird</i></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
These paths serve as features for classification. Path-based methods use training data - a list of <i>(x, y) </i>pairs with their corresponding labels (e.g. <i>(cat, animal, True)</i>, <i>(apple, animal, False)</i>). Each <i>(x, y)</i> pair is represented by all the dependency paths that connected the two words in the corpus: the feature vector holds an entry for each dependency path in the corpus, and the value is the number of times this path connected <i>x </i>and <i>y</i> (e.g. for <i>(cat, animal)</i>, how many times "<i>cat is an animal</i>" occurs in the corpus, how many times "<i>animals such as cats</i>" occurs, etc.). A classifier is trained over these vectors to predict whether <i>y </i>is a hypernym of <i>x</i>.<br />
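As a minimal sketch of this feature scheme, here the (x, y) pair is represented by counts of surface patterns over an invented micro-corpus; the real method uses dependency paths extracted by a parser, not raw string patterns.

```python
# A toy version of path-based hypernymy features: each (x, y) pair is
# represented by how often each pattern connects the two words in the
# corpus. The micro-corpus and surface patterns are invented for
# illustration; the real method uses parser-extracted dependency paths.

CORPUS = [
    "a cat is an animal",
    "my cat is an animal too",
    "animals such as cats and dogs",
    "i ate an apple and fed the animal",
]

PATTERNS = ["{x} is an {y}", "{y}s such as {x}s"]

def path_features(x, y):
    """Feature vector: occurrence count of each pattern instantiated with (x, y)."""
    return [sum(sent.count(p.format(x=x, y=y)) for sent in CORPUS)
            for p in PATTERNS]

print(path_features("cat", "animal"))   # [2, 1] -> evidence for hypernymy
print(path_features("apple", "animal"))  # [0, 0] -> no evidence
```

A classifier trained over such vectors would then predict hypernymy for new pairs.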
<br />
Though this method works nicely, it suffers from one major limitation: it cannot generalize. If <i>x1</i> and <i>y1 </i>are mainly connected by the "X is considered as Y" pattern, and <i>x2</i> and <i>y2 </i>are connected via "X is regarded as Y", they practically share no information - these are considered two different paths. Attempts to generalize such paths by replacing words along them with wild-cards (e.g. "X is * as Y") or part-of-speech tags (e.g. "X is VERB as Y") may end up with paths that are too general (e.g. "X is denied as Y", a negative path, also generalizes to "X is VERB as Y").<br />
<br />
In contrast, the distributional approach considers the separate occurrences of <i>x</i> and <i>y</i> in the corpus. It relies on the distributional hypothesis, which I've already mentioned <a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html">here</a> and <a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html">here</a>. The main outcome of this hypothesis is that words can be represented using vectors, with words that are similar in meaning sharing similar vectors. In recent years, people have been using these vectors in supervised hypernymy detection. To represent each <i>(x, y) </i>pair as a vector, they somehow combined <i>x </i>and <i>y</i>'s vectors (e.g. by concatenating them). They trained a classifier on top of these vectors and predicted for new pairs (e.g. <i>(apple, fruit)</i>) whether <i>y</i> is a hypernym of <i>x</i>. These methods have shown good performance, but it was later found that they tend to overfit to the training data and are pretty bad at generalizing; for example, if you try to predict hypernymy for a new pair <i>(x, y) </i>with rare words <i>x</i> and <i>y</i> that weren't observed in the training data, the prediction will be (only slightly better than) a guess.</div>
<div dir="ltr" style="text-align: left;">
<br />
To sum up recent work - path-based methods can leverage information about the relation between a pair of words, but they do not generalize well. On the other hand, distributional methods might not recognize the relation between a pair of words, but they contain useful information about each of the words. Since these two approaches are complementary, we combined them!</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
We started by improving the path representation, using a <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">recurrent neural network</a>. I can't possibly explain the technical details without writing a long background post about neural networks first, so I'll skip most of them. I'll just say that this is a machine learning model that processes sequences (e.g. of words, letters, or edges) and can, among other things, output a vector representing the entire sequence. In our case, we split the dependency path into edges and let the network learn a vector representation of the entire path by going over the edges sequentially. Then, we replace the traditional path features that represent a term-pair, e.g. "X is defined as Y", with the vector representing the path - the path embedding.<br />
<br />
The nice thing about these path embeddings is that -- can you guess? <b>similar paths have similar path embeddings</b>. This happens thanks to two things. First, the network can learn that certain edges are important for detecting hypernymy, while others are not, which may lead to consolidating paths that differ only by certain unimportant edges.<br />
<br />
Moreover, since neural networks can only work on vectors, all the information we use is encoded as vectors. For instance, the words along the path (e.g. <i>is, defined, as</i>) are all encoded as vectors. We use word embeddings, so similar words have similar vectors. This results in similar vectors for paths that differ by a pair of similar words, e.g. "X is defined as Y" and "X is described as Y".<br />
<br />
Similar paths having similar vectors is helpful for the classifier. In the paper, we show that our method performed better than the prior methods. Just to give an intuition, let's say that the classifier learned that the path "X company is a Y", which was pretty common in the corpus, indicates hypernymy. And let's say that "X ltd is a Y" only occurred once, for a positive <i>(x, y)</i> pair. The previous methods would probably decide that such a path is not indicative of hypernymy, since they don't have enough evidence about it. However, our method recognizes that <i>ltd</i> and <i>company</i> are similar words, yielding similar path vectors for these two paths. If "X company is a Y" is considered indicative, then so is "X ltd is a Y".<br />
<br />
Finally, we combined the complementary path-based and distributional approaches. To add distributional information to our model (information about the separate occurrences of each term <i>x </i>and <i>y</i>), we simply added the word embedding vectors of <i>x</i> and <i>y </i>to the model, allowing it to rely on this information as well. With this simple change we achieved a significant improvement in performance compared to prior methods of both approaches. For so many years people have been saying that these two approaches are complementary, and it turns out it was actually not too difficult to combine them :)<br />
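Schematically, the combined input for a pair is the concatenation [path embedding ; x's word vector ; y's word vector]. In the sketch below, the word embeddings are made-up 4-dimensional random vectors, and a plain average of the word vectors along the path stands in for the learned recurrent path embedding, just to show the shape of the input the classifier sees.

```python
# Schematic of the integrated pair representation. The embeddings are
# random placeholders, and averaging stands in for the recurrent network
# that actually learns the path embedding in the paper.
import random

random.seed(0)
DIM = 4
# Made-up word embeddings standing in for pretrained vectors.
emb = {w: [random.gauss(0, 1) for _ in range(DIM)]
       for w in ["cat", "animal", "is", "an"]}

def path_embedding(path_words):
    # Stand-in for the recurrent network: average the embeddings
    # of the words along the path, coordinate by coordinate.
    return [sum(vals) / len(path_words)
            for vals in zip(*[emb[w] for w in path_words])]

def pair_representation(x, y, path_words):
    # Concatenate: [path embedding ; x's vector ; y's vector]
    return path_embedding(path_words) + emb[x] + emb[y]

v = pair_representation("cat", "animal", ["is", "an"])
print(len(v))  # 12 -- three DIM-sized pieces, fed to a classifier
```

In the real model there is one path embedding per observed path (averaged over all paths connecting the pair), but the shape of the combined input is the same.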
<br />
<br /></div>
<div dir="ltr" style="text-align: left;">
<b>Paper details</b><br />
<i>Improving Hypernymy Detection with an Integrated Path-based and Distributional Method</i>. Vered Shwartz, Yoav Goldberg and Ido Dagan. ACL 2016. <a href="http://arxiv.org/pdf/1603.06076v1.pdf">link</a><br />
<div>
<br /></div>
</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<span style="font-size: x-small;">* Now = a few months ago, but the paper was under review and I couldn't publish anything about it.</span></div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com0tag:blogger.com,1999:blog-9145120678290195131.post-55123779002864080882016-03-08T23:49:00.000+02:002016-03-26T20:28:48.565+03:00Text Classification<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="rtl" style="text-align: right;" trbidi="on">
<div style="direction: ltr; text-align: left;">
</div>
<div style="text-align: left;">
<div dir="ltr" style="text-align: left;">
Given a piece of text (document), can software recognize the topic(s) it discusses? If you're not convinced that such a thing could be helpful, let's just start with two facts:</div>
</div>
<ul dir="ltr" style="text-align: left;">
<li>90% of the data in the world today has been created in the last two years [<a href="http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html">1</a>].</li>
<li>Our attention span is now less than that of a goldfish [<a href="http://www.telegraph.co.uk/news/science/science-news/11607315/Humans-have-shorter-attention-span-than-goldfish-thanks-to-smartphones.html">2</a>], and we almost never read through an article [<a href="http://www.slate.com/articles/technology/technology/2013/06/how_people_read_online_why_you_won_t_finish_this_article.html">3</a>].</li>
</ul>
<div dir="ltr" style="text-align: left;">
These two together lead to sooo much data that might be of interest to you, but that you'll never read. If only there were someone who could read articles for you and decide whether they intersect with your topics of interest. Well, automatic text classification can assist you with that.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
</div>
<div dir="ltr" style="text-align: left;">
<u style="text-align: left;">Representation</u></div>
<div dir="ltr" style="text-align: left;">
<span style="text-align: left;">As in the </span><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html" style="text-align: left;">last post about word representation</a><span style="text-align: left;">, we must first decide how to represent the documents. Intuitively, we want the algorithm to classify a document to a certain topic based on the document's content, as in figure 1. We need the document representation to reflect that.</span><br />
<br style="text-align: left;" />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghEPxIb0IjtzR7kj05h4mpUEhyp6eC-Nlsbxgl3RIMz9pmMcXLEC13_fuIccItBWkyz90seEHXppM2d0a_c9nCWd-QsymjPRL6uiP1TC-aAo08cDqsaLjWJvUThyGsm4Z9AWcRbnJwlF0/s1600/documents.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghEPxIb0IjtzR7kj05h4mpUEhyp6eC-Nlsbxgl3RIMz9pmMcXLEC13_fuIccItBWkyz90seEHXppM2d0a_c9nCWd-QsymjPRL6uiP1TC-aAo08cDqsaLjWJvUThyGsm4Z9AWcRbnJwlF0/s1600/documents.JPG" /></a></td></tr>
<tr><td class="tr-caption" style="direction: ltr; font-size: 12.8px;"><span style="font-family: "cambria math" , serif; font-size: 10pt;">Figure 1: Two example documents, one (Doc1) about computer science and the other (Doc2) about news. </span><br />
<div>
<span style="font-family: "cambria math" , serif; font-size: 10pt;"><br /></span></div>
</td></tr>
</tbody></table>
<span style="text-align: left;">The simplest and most common approach is the "bag-of-words" approach, in which each document is represented by all its words. Some words may be indicative of one topic or another, e.g. </span><i style="text-align: left;">soccer, player, </i><span style="text-align: left;">and </span><i style="text-align: left;">match</i><span style="text-align: left;"> might indicate that a document is about sports, while </span><i style="text-align: left;">government</i><span style="text-align: left;">, </span><i style="text-align: left;">prime minister,</i><span style="text-align: left;"> and </span><i style="text-align: left;">war</i><span style="text-align: left;"> suggest that this is a news article. If you remember the spam filtering example from the </span><a href="http://veredshwartz.blogspot.co.il/2015/08/supervised-learning.html" style="text-align: left;">supervised learning post</a><span style="text-align: left;">, this is exactly the same: some words in the mail message may indicate that it is spam. Spam filtering can be regarded as a text classification task in which each document is a mail message, classified to one of the two topics <i>spam</i> / <i>not spam</i>.</span></div>
<div dir="ltr" style="text-align: left;">
<br />
The bag-of-words approach is simple, yet may yield nice results. However, it ignores multi-word expressions (<i>document classification</i>), some of which are non-compositional, i.e. the phrase's meaning differs from the meanings of its separate words (<i>rock and roll</i>). It also ignores word order and syntax. Other methods can take these features into account as well. In this post we will stick to the simple bag-of-words approach, which is often good enough for this task.<br />
<br />
<u>Methods</u><br />
Choosing the method in which documents will be classified to topics depends, first of all, on the available data. If we have a sample of labeled documents (documents with known topics), we will prefer <a href="http://veredshwartz.blogspot.co.il/2015/08/supervised-learning.html">supervised classification</a>. In supervised learning, the algorithm is given a training set of labeled instances (e.g. document and its topic), and it tries to learn a model that predicts those labels correctly. This model is later used to predict the label (topic) of unseen instances (documents).<br />
<br />
Otherwise, if we only have a bunch of documents without their topics (which is the more common case), we will apply unsupervised classification (<a href="https://en.wikipedia.org/wiki/Cluster_analysis">clustering</a>). Instead of trying to attach a label (topic) to each instance (document), the algorithm groups together similar documents, which seem to be about the same topic. The output of the clustering process is clusters of documents, each cluster represents a topic.<br />
<br />
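As a tiny illustration of the unsupervised setting, here is a crude single-pass grouping of documents by word overlap (Jaccard similarity). The documents and the similarity threshold are invented; real systems use proper clustering algorithms such as k-means.

```python
# A crude stand-in for document clustering: greedily join a document to
# the first cluster whose seed document shares enough words with it
# (Jaccard similarity), otherwise start a new cluster. Documents and the
# 0.2 threshold are invented for illustration.

docs = [
    "soccer player scores in the match",
    "the player won the soccer cup",
    "government passes new war budget",
]

def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

clusters = []
for doc in docs:
    for cluster in clusters:
        if jaccard(doc, cluster[0]) > 0.2:
            cluster.append(doc)
            break
    else:
        clusters.append([doc])

print(len(clusters))  # 2 -- the two soccer documents end up together
```

No topic labels are involved: the output is simply groups of similar documents, each of which presumably discusses one topic.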
<u><b>Supervised Document Classification</b></u><br />
Different methods to classify documents differ in the instance representation and the learning algorithms. The general scheme is to represent each document as a feature vector, use a multi-class classifier and feed it the training instances and labels (documents and their topics) to learn a model. Then, given a new unlabeled document, the classifier can predict the document's topic, with some level of success.<br />
<br />
Following the bag-of-words approach, we may represent each document as a |V|-dimensional vector, where V is our vocabulary. Each cell represents a word which may or may not appear in the document. There are a few variants of what the cells contain; here are some common ones:<br />
<div dir="ltr" style="text-align: left;">
</div>
<ul dir="ltr" style="text-align: left;">
<li><b>Binary </b>- 1 if the word occurred in the document, 0 otherwise. It is a set representation of the document's words.</li>
<li><b>Frequency </b>- the number of times that a word occurred in the document. We can expect that topic prominent words would appear frequently in the topic documents, while other words would occur occasionally. </li>
<li><b>TF-IDF</b> - I might be skipping a few simpler metrics, but this one is very useful. When we count the frequency of each word, we might end up with some words that are frequent in all documents, regardless of the topic. For instance, <a href="https://en.wikipedia.org/wiki/Stop_words">stop words</a> and function words (<i>the, it, and, what...</i>) are never indicative of a certain topic. The TF-IDF metric handles this problem by measuring the importance of a term in a certain document, given all the other documents. The TF (Term-Frequency) measure is proportional to the word frequency in the document. The IDF (Inverse-Document-Frequency) decreases if the word is generally frequent in all documents. This way, a word gets a high score if it is relatively non-common in general, but common in the specific document.</li>
</ul>
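These three variants are easy to compute by hand. Here is a sketch over two invented mini-documents (not the ones in the figure), using the plain TF-IDF formulation described above:

```python
# Binary, frequency, and TF-IDF bag-of-words vectors for two tiny
# invented documents. TF is the in-document relative frequency and IDF
# is log(#documents / #documents containing the word).
import math
from collections import Counter

docs = [
    "the soccer player scored in the match".split(),
    "the government held a press conference".split(),
]

vocab = sorted(set(w for doc in docs for w in doc))

def binary_vector(doc):
    return [1 if w in doc else 0 for w in vocab]

def freq_vector(doc):
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def tfidf_vector(doc):
    counts = Counter(doc)
    vec = []
    for w in vocab:
        tf = counts[w] / len(doc)
        df = sum(1 for d in docs if w in d)
        idf = math.log(len(docs) / df)
        vec.append(tf * idf)
    return vec

# "the" occurs in both documents, so its IDF -- and hence TF-IDF -- is zero:
print(tfidf_vector(docs[0])[vocab.index("the")])  # 0.0
```

Note how the frequent but uninformative word <i>the</i> gets a zero TF-IDF weight, while topical words like <i>soccer</i> keep a positive one.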
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9B0jGzmF4EfBuvnvB3Iyi9F_EFSEBi5pmKr2lcyP2DObtNbzMLCVuvW0ky5KfVflSVp76VMZ0BjdXvSH7LaBt6A-4LCmvWdb-VGcP2NAVHIW9JoZ62GLw-3ZmqxvksGRdA-64llTrLEk/s1600/bag-of-words.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="75" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9B0jGzmF4EfBuvnvB3Iyi9F_EFSEBi5pmKr2lcyP2DObtNbzMLCVuvW0ky5KfVflSVp76VMZ0BjdXvSH7LaBt6A-4LCmvWdb-VGcP2NAVHIW9JoZ62GLw-3ZmqxvksGRdA-64llTrLEk/s400/bag-of-words.JPG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="direction: ltr; font-size: 12.8px;">Figure 2: <span style="font-family: "cambria math" , serif; font-size: 10pt;">The corresponding (partial) bag-of-words vectors for the example documents, with the binary (top) and frequency (bottom) variants.</span></td></tr>
</tbody></table>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
It is no coincidence that this post follows the one about word representations: in word-representation terms, the frequency document vector is the sum of the one-hot vectors of all the words in the document. Can you see it?<br />
Correspondingly, when working with word embeddings (continuous dense vectors), there is a variant of bag-of-words called CBOW (continuous bag-of-words): the sum or average of the word vectors of all the words in the document. This results in a D-dimensional vector representing the entire document, where D is the word vectors' dimension.<br />
<br />
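A CBOW document vector can be sketched in a few lines; the 3-dimensional word vectors below are made up, standing in for pretrained embeddings like GloVe.

```python
# CBOW document representation: the document vector is the average of
# the vectors of its words. The 3-d word vectors are invented
# placeholders for pretrained embeddings.

word_vectors = {
    "soccer": [0.9, 0.1, 0.0],
    "match":  [0.8, 0.2, 0.1],
    "war":    [0.0, 0.9, 0.8],
}

def cbow(doc_words):
    known = [word_vectors[w] for w in doc_words if w in word_vectors]
    # Average the word vectors coordinate-wise.
    return [sum(vals) / len(known) for vals in zip(*known)]

doc_vec = cbow(["soccer", "match"])
print(doc_vec)  # roughly [0.85, 0.15, 0.05]
```

The resulting vector has the word embedding dimension regardless of document length, which is exactly what makes it a convenient fixed-size feature vector.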
Once we represented each document as a feature vector, we let the classifier train over the feature vectors and corresponding labels. It may notice that feature vectors with high values in the cells of <i style="text-align: left;">soccer, player, </i><span style="text-align: left;">and </span><i style="text-align: left;">match</i><span style="text-align: left;"> tend to occur with the label <i>sports</i>, for example. </span><br />
<span style="text-align: left;"><br /></span>
<span style="text-align: left;">There is a broad choice of algorithms, some may perform better than others. Roughly, they all try to do the same thing: t</span><span style="text-align: left;">he multi-class classifier learns a weight for each feature and each label. In our text classification task, it learns the salience of each word in the vocabulary for each topic. For example, </span><i style="text-align: left;">soccer, player, </i><span style="text-align: left;">and </span><i style="text-align: left;">match</i><span style="text-align: left;"> will be assigned high weights and </span><i style="text-align: left;">government </i><span style="text-align: left;">and</span><span style="text-align: left;"> </span><i style="text-align: left;">prime minister </i><span style="text-align: left;">will be assigned low (maybe negative) weights for the <i>sports</i> label, and the opposite will occur for the <i>news</i> label. When the classifier is trained, the objective is to learn the weights that maximize the accuracy of the training set, i.e., classifying as many documents as possible to the correct topic.</span><br />
<br />
For you coders, here is the simplest proof of concept. Other people may skip the code. This is a Python script that uses <a href="http://scikit-learn.org/stable/">scikit-learn</a>, a machine learning package for Python. I used a subset of the topics in the <a href="http://scikit-learn.org/stable/datasets/twenty_newsgroups.html">20 newsgroup dataset</a>. I trained a simple logistic regression classifier, removing stop words and punctuation, and representing each document using CBOW with <a href="http://nlp.stanford.edu/projects/glove/">GloVe word embeddings</a> (my personal favorite). This yields an 83% F1 score, which leaves plenty of room for improvement, but is still nice for such little effort.<br />
<br />
<br />
<html>
<head>
<title>document_classification.py</title>
<style type="text/css">
.ln { color: rgb(0,0,0); font-weight: normal; font-style: normal; }
.s0 { color: rgb(0,0,128); font-weight: bold; }
.s1 { }
.s2 { color: rgb(128,128,128); font-style: italic; }
.s3 { color: rgb(0,128,0); font-weight: bold; }
.s4 { color: rgb(0,0,255); }
</style>
</head>
<body bgcolor="#ffffff">
<table bgcolor="#C0C0C0" cellpadding="5" cellspacing="0" cols="1" style="width: 100%px;">
<tr><td><center>
<span style="color: black; font-family: Arial, Helvetica;">
document_classification.py</span>
</center>
</td></tr>
</table>
<pre>
<a href="https://www.blogger.com/null" name="l1"><span class="ln">1 </span></a><span class="s0">import </span><span class="s1">sys
<a href="https://www.blogger.com/null" name="l2"><span class="ln">2 </span></a></span><span class="s0">import </span><span class="s1">nltk
<a href="https://www.blogger.com/null" name="l3"><span class="ln">3 </span></a></span><span class="s0">import </span><span class="s1">string
<a href="https://www.blogger.com/null" name="l4"><span class="ln">4 </span></a></span><span class="s0">import </span><span class="s1">codecs
<a href="https://www.blogger.com/null" name="l5"><span class="ln">5 </span></a>
<a href="https://www.blogger.com/null" name="l6"><span class="ln">6 </span></a></span><span class="s0">from </span><span class="s1">sklearn.datasets </span><span class="s0">import </span><span class="s1">fetch_20newsgroups
<a href="https://www.blogger.com/null" name="l7"><span class="ln">7 </span></a></span><span class="s0">from </span><span class="s1">sklearn.metrics </span><span class="s0">import </span><span class="s1">precision_recall_fscore_support
<a href="https://www.blogger.com/null" name="l8"><span class="ln">8 </span></a></span><span class="s0">from </span><span class="s1">sklearn.linear_model </span><span class="s0">import </span><span class="s1">LogisticRegression
<a href="https://www.blogger.com/null" name="l9"><span class="ln">9 </span></a>
<a href="https://www.blogger.com/null" name="l10"><span class="ln">10 </span></a></span><span class="s0">import </span><span class="s1">numpy </span><span class="s0">as </span><span class="s1">np
<a href="https://www.blogger.com/null" name="l11"><span class="ln">11 </span></a>
<a href="https://www.blogger.com/null" name="l12"><span class="ln">12 </span></a>
<a href="https://www.blogger.com/null" name="l13"><span class="ln">13 </span></a></span><span class="s0">def </span><span class="s1">main():
<a href="https://www.blogger.com/null" name="l14"><span class="ln">14 </span></a>
<a href="https://www.blogger.com/null" name="l15"><span class="ln">15 </span></a> </span><span class="s2"># Load the word vectors</span><span class="s1">
<a href="https://www.blogger.com/null" name="l16"><span class="ln">16 </span></a> words, wv = load_embeddings(</span><span class="s3">'glove.6B.50d.txt'</span><span class="s1">)
<a href="https://www.blogger.com/null" name="l17"><span class="ln">17 </span></a> word_to_num = { word : i </span><span class="s0">for </span><span class="s1">i, word </span><span class="s0">in </span><span class="s1">enumerate(words) }
<a href="https://www.blogger.com/null" name="l18"><span class="ln">18 </span></a>
<a href="https://www.blogger.com/null" name="l19"><span class="ln">19 </span></a> </span><span class="s2"># Load the stop words</span><span class="s1">
</span>    with codecs.open('English_stop_words.txt', 'r', 'utf-8') as f_in:
        stop_words = set([line.strip() for line in f_in])

    # Load the datasets
    topics = ['talk.politics.guns', 'soc.religion.christian',
              'comp.windows.x', 'rec.sport.baseball', 'sci.med']
    newsgroups_train = fetch_20newsgroups(subset='train',
                                          remove=('headers',
                                                  'footers',
                                                  'quotes'),
                                          categories=topics)
    y_train = list(newsgroups_train.target)
    newsgroups_test = fetch_20newsgroups(subset='test',
                                         remove=('headers',
                                                 'footers',
                                                 'quotes'),
                                         categories=topics)
    y_test = list(newsgroups_test.target)

    # Create the feature vectors
    X_train = create_doc_vectors(newsgroups_train.data,
                                 word_to_num, wv, stop_words)
    X_test = create_doc_vectors(newsgroups_test.data,
                                word_to_num, wv, stop_words)

    # Create the classifier
    classifier = LogisticRegression()

    # Train the classifier
    classifier.fit(X_train, y_train)

    # Predict the topics of the test set and compute
    # the evaluation metrics
    y_pred = classifier.predict(X_test)
    precision, recall, f1, support = \
        precision_recall_fscore_support(y_test, y_pred,
                                        average='weighted')

    print('Precision: %.02f%%, Recall: %.02f%%, F1: %.02f%%'
          % (precision * 100, recall * 100, f1 * 100))


def create_doc_vectors(data, word_to_num, wv, stop_words):
    """
    Create a matrix in which each row is a document,
    and each document is represented as CBOW
    (the average of its word vectors)
    """
    doc_vecs = []

    for doc in data:
        tokens = nltk.word_tokenize(doc.lower())
        tokens = [w for w in tokens
                  if w not in string.punctuation
                  and w not in stop_words]
        tokens = [word_to_num.get(w, -1) for w in tokens]
        doc_vector = [wv[w] for w in tokens if w > -1]
        # Average the word vectors; fall back to a zero vector
        # for documents with no in-vocabulary words
        # (50 is the embedding dimension)
        doc_vector = np.mean(np.vstack(doc_vector), axis=0) \
            if len(doc_vector) > 0 else np.zeros((1, 50))
        doc_vecs.append(doc_vector)

    instances = np.vstack(doc_vecs)
    return instances


def load_embeddings(embedding_file):
    """
    Load the pre-trained embeddings from a file
    """
    with codecs.open(embedding_file, 'r', 'utf-8') as f_in:
        words, vectors = \
            zip(*[line.strip().split(' ', 1) for line in f_in])
    vectors = np.loadtxt(vectors)

    # Normalize each row (word vector) to unit L2 length
    row_norm = np.sum(np.abs(vectors) ** 2, axis=-1) ** (1. / 2)
    vectors /= row_norm[:, np.newaxis]

    return words, vectors


if __name__ == '__main__':
    main()
</pre>
<br />
As an example, I printed one of the documents and the topic that the classifier assigned to it:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">
<textarea cols="75" name="doc" readonly="readonly" rows="9"> I really think that this is the key point. When I saw the incident on Baseball Tonight Sunday, I couldn't believe how far away from the plate Gant went. Then he casually leaned against his bat. I don't blame the umpire at all for telling the pitcher to pitch. The worst part of the whole incident was the Braves coming out onto the field. What were they going to do, attack the umpire? The only people who should've been out there were Cox and maybe the coaches, but NO players. I agree with the person who posted before that Cox should be suspended for having no control over his team.
</textarea></span>
<b>rec.sport.baseball</b><br />
<span style="text-align: left;"><br /></span>
<span style="text-align: left;">I'd say it's an easy one, since it specifically mentions "baseball". However, nothing should be taken for granted in NLP! Anything that works is a miracle :)</span><br />
<span style="text-align: left;"><br /></span>
<u><b>Unsupervised Document Classification</b></u><br />
In some cases, we don't have labeled data, so we can't learn characteristics of instances with specific labels, e.g., words that tend to occur in documents about politics. Instead, we can find common characteristics between documents about the same (unknown) topic, and group documents from the (seemingly) same topic together in one cluster. <br />
<br />
Then, when a new document is presented, we can assign it to the most suitable cluster, based, for example, on how similar it is to the documents in each cluster. This serves several purposes: first, we can let someone look at a few documents from each cluster and infer the topic the cluster represents. We can also automatically take the most common words in the cluster and use them as "tags" describing the topic. Another use is recommending a document to someone who has read other documents in the cluster (assuming they are interested in the cluster's topic).<br />
<br />
While we don't have the true labels (topics) of the training instances (documents), we assume that each document has a certain topic, which is a <i>hidden variable</i> in our model (as opposed to the documents themselves, which are <i>observed</i>). One clustering approach is through <a href="https://en.wikipedia.org/wiki/Generative_model">generative models</a>: we assume that a model generated our existing documents, and we use an algorithm that tries to reconstruct the parameters of the model, in a way that best explains our observed data.<br />
<br />
The assumption of the generative model is that our training data was generated according to probability distributions that we would like to reconstruct. A simple model, called <i>mixture of histograms,</i> assumes that each document has exactly one topic. A document was generated as follows:<br />
<br />
<ul style="text-align: left;">
<li>The document's topic <i>c</i> was drawn from the topics' distribution (probability function) P(C).<br />For example, if we have 3 topics, <i>news, sports</i> and <i>music</i>, with the probabilities 0.5, 0.3, 0.2 respectively, then in half of the cases we are expected to "generate" a document about <i>news.</i></li>
<li>Given the topic <i>c</i>, the words in the document were sampled from a distribution of words given the topic, P(w|c). For example, if the topic is <i>news</i>, the probability of each word <i>w</i> in the vocabulary under this topic is P(w|<i>news</i>). Since there are many words in the vocabulary, the probability of each is quite small (because the probabilities of all words sum up to 1). Still, words that are likely to appear in news documents, such as <i>war</i> and <i>report</i>, will have higher probabilities, e.g. P(<i>war|news</i>) > P(<i>football|news</i>). When we sample words for the generated <i>news</i> document, we will get mostly words discussing news.</li>
</ul>
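The two-step generative story above can be sketched in a few lines. The topics, words, and probabilities below are made up for illustration:

```python
import random

random.seed(0)

# Hypothetical P(C) and P(w|c) tables, for illustration only
p_topic = {'news': 0.5, 'sports': 0.3, 'music': 0.2}
p_word_given_topic = {
    'news':   {'war': 0.4, 'report': 0.4, 'football': 0.1, 'guitar': 0.1},
    'sports': {'football': 0.5, 'report': 0.2, 'war': 0.2, 'guitar': 0.1},
    'music':  {'guitar': 0.6, 'report': 0.2, 'war': 0.1, 'football': 0.1},
}

def generate_document(n_words=10):
    """Draw a topic c ~ P(C), then draw each word w ~ P(w|c)."""
    topics = list(p_topic)
    c = random.choices(topics, weights=[p_topic[t] for t in topics])[0]
    words = list(p_word_given_topic[c])
    weights = [p_word_given_topic[c][w] for w in words]
    return c, random.choices(words, weights=weights, k=n_words)

topic, doc = generate_document()
```

A document generated this way under the <i>news</i> topic will mostly contain <i>war</i> and <i>report</i>, exactly because those are the largest entries in its P(w|<i>news</i>) row.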
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEncMngV_S_2CTb3gnMotAFqhtarDq4c2_eIgfEKQ4q_4zlRJSOi6pUc_ep7tL-LSaBCTP83QKXMhlheWdrr6EGC4lMia92eTlt661NE9jpE4DQDaFzGxyr2R0LhPQYd8FMM7kBEtXT8k/s1600/clustering.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEncMngV_S_2CTb3gnMotAFqhtarDq4c2_eIgfEKQ4q_4zlRJSOi6pUc_ep7tL-LSaBCTP83QKXMhlheWdrr6EGC4lMia92eTlt661NE9jpE4DQDaFzGxyr2R0LhPQYd8FMM7kBEtXT8k/s320/clustering.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 3: An illustration of the probability distribution of word given a topic.</td></tr>
</tbody></table>
<br />
The goal of the algorithm is to learn the probability functions P(c) and P(w|c) for each topic, given solely the documents themselves. These probabilities are estimated using an iterative algorithm called <a href="https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm">EM (expectation maximization)</a>. Since the topic of each document is unknown (hidden), we first decide on the number of topics (how fine-grained should the clustering be? should we distinguish between <i>football </i>and <i>basketball</i> or is <i>sports </i>enough?), and then start with a random guess of the probabilities. The algorithm works in iterations, improving the probabilities at each iteration. Each iteration consists of two steps:<br />
<ol style="text-align: left;">
<li>Assign a topic to each document based on the current probabilities.</li>
<li>Given the documents' topics computed at the previous step, re-estimate the probabilities using relative frequency (e.g., the probability of <i>news</i> is the ratio between the number of documents assigned this topic and the total number of documents).</li>
</ol>
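These two steps can be sketched as a "hard" EM loop, in which each document is assigned its single most likely topic. This is a simplified illustration: it uses add-one smoothing, and a deterministic round-robin initial assignment instead of the random initial guess described above:

```python
import math
from collections import Counter

def hard_em(docs, n_topics, n_iters=20):
    """docs: list of token lists. Returns one topic id per document."""
    vocab = sorted({w for d in docs for w in d})
    assign = [i % n_topics for i in range(len(docs))]  # initial guess
    for _ in range(n_iters):
        # Step 2: re-estimate P(c) and P(w|c) by relative frequency
        # (add-one smoothing keeps every probability non-zero)
        p_c = [(assign.count(c) + 1) / (len(docs) + n_topics)
               for c in range(n_topics)]
        counts = [Counter() for _ in range(n_topics)]
        for d, c in zip(docs, assign):
            counts[c].update(d)
        p_w = [{w: (counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab))
                for w in vocab} for c in range(n_topics)]
        # Step 1: assign each document its most likely topic
        assign = [max(range(n_topics),
                      key=lambda c: math.log(p_c[c])
                      + sum(math.log(p_w[c][w]) for w in d))
                  for d in docs]
    return assign

docs = [['war', 'report', 'war'], ['report', 'war'],
        ['guitar', 'song'], ['song', 'guitar', 'guitar']]
labels = hard_em(docs, n_topics=2)
```

On this toy data, the two war/report documents end up in one cluster and the two guitar/song documents in the other.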
<div>
This algorithm works nicely and it can be used to solve other problems with hidden variables. In the end, each document is assigned to a certain (meaningless) topic. As I mentioned before, the meaning of this cluster of documents can be understood by looking at the common features of several documents in the cluster (e.g. do they all discuss music?).</div>
<br />
That's it about text classification for now. Now, can you code something that automatically infers the topic of this blog post? :)<br />
<br /></div>
</div>
<div style="text-align: left;">
<div dir="ltr" style="text-align: left;">
<hr />
There is so much more that I didn't cover in this post, because I don't want to exhaust anyone. If you're interested in reading more, I recommend:</div>
</div>
</div>
<ul dir="ltr" style="text-align: left;">
<li><a href="http://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-discriminative-algorithm">The difference between generative and discriminative models</a> - it's a general machine learning topic, not specific to document classification. In this post I gave an example of a supervised discriminative model and an unsupervised generative model, but:</li>
<ul>
<li>The k-means algorithm is an unsupervised discriminative model that can be used for <a href="http://scikit-learn.org/stable/auto_examples/text/document_clustering.html">text classification</a>.</li>
<li>Naïve Bayes is a supervised generative classifier that can be used for <a href="http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html">text classification</a>.</li>
</ul>
<li>Document classification can also be used for <a href="http://jmgomezhidalgo.blogspot.co.il/2013/05/language-identification-as-text.html">language identification</a>.</li>
<li>And you can't write a post about text classification without mentioning <a href="http://blog.aylien.com/post/108652969778/text-analysis-101-a-basic-understanding-for">LDA</a>. It was a bit too complex for this post, but here, I mentioned it.</li>
</ul>
</div>
Vered Shwartz · 2016-01-03 · <u><b>Representing Words</b></u><div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
</div>
<div dir="ltr" style="text-align: left;">
We're already six posts into the topic of natural language processing, and I can't believe I haven't discussed this basic topic yet. So today I'm going to discuss words; more accurately, I will discuss how words are represented in natural language processing.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
A word is the most basic unit in semantic tasks. While lower-level tasks, such as part-of-speech tagging (detecting the part of speech of each word in a sentence) and lemmatization (finding the lemma - a word's base form), may be interested in a word's characters (mostly its affixes), semantic tasks are concerned with meaning, and a word is the most basic unit that conveys meaning.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
A word is basically stored in the computer as a string - a sequence of characters - but this representation says nothing about the word's meaning. It allows detecting similarity only in very specific cases. For example, <a href="https://en.wikipedia.org/wiki/Morphological_derivation">morphological derivations</a> (e.g. <i>sing-singer</i>) and <a href="https://en.wikipedia.org/wiki/Inflection">inflections</a> (e.g. <i>cat-cats, listen-listened</i>) modify words, creating related words (with a different part-of-speech, plurality form, etc). Such words are therefore similar both in meaning and in their string representations, since they share common characters. This similarity can be detected using <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a>.</div>
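As a quick illustration, Levenshtein distance itself takes only a few lines of dynamic programming:

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (cs != ct)))    # substitution
        prev = curr
    return prev[-1]

print(levenshtein('cat', 'cats'))      # 1
print(levenshtein('sing', 'singer'))   # 2
```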
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
Needless to say, most words that are similar in meaning do not share many common characters. Synonyms such as <i>elevator </i>and <i>lift</i>, and related words such as <i>food</i> and <i>eat</i>, are not similar at the character level.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
In addition, performing operations on strings is highly inefficient, since computers handle numbers much better than strings; better representations are also needed for faster computation. I assume that I've convinced you <i>why </i>words need better representations. Now I can move on to telling you <i>how </i>words can be better represented as vectors (arrays).</div>
<div dir="ltr" style="text-align: left;">
<span style="text-align: left;"><br /></span></div>
<div dir="ltr" style="text-align: left;">
<span style="text-align: left;"><b>One-hot vectors</b></span></div>
<div dir="ltr" style="text-align: left;">
As I mentioned, working with strings is computationally inefficient. A simple solution to this inefficiency is to convert every string to a number. First, we need to define the vocabulary: this is the set of all distinct words in the language (or at least those that we observed in a very large corpus). Unlike a dictionary, in which the entries are word basic forms, a vocabulary contains different entries for inflections and derivations of the same basic form (e.g., <i>cat </i>and <i>cats</i>).<br />
When a semantic application processes a text, it can now replace every word in the text with its index in the vocabulary, e.g. <i>cat</i> is 12424, <i>cats</i> is 12431, <i>dog</i> is 15879, etc.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
This is the most compact representation for a word - all it takes to store a word is one number. So what do vectors have to do with this? Another way to look at this representation is as a vector of dimension |V| (with V denoting the vocabulary), with zeros in all entries, and one set bit at the word's index (see figure 1).<br />
<br /></div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5710Ebf1fDdyF9vggXtkZRtXIRMxQbmVnON-iMxywyL1zOJhPqHarw5mWS7_-j0bt7NFVEA9FuJBBDiJznR_1Bgp0NEryBLPibVvTGk0cJv1eHmXwrvVJeXvOddHICj0x1ETDOX7JPfk/s1600/1-hot-vector.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5710Ebf1fDdyF9vggXtkZRtXIRMxQbmVnON-iMxywyL1zOJhPqHarw5mWS7_-j0bt7NFVEA9FuJBBDiJznR_1Bgp0NEryBLPibVvTGk0cJv1eHmXwrvVJeXvOddHICj0x1ETDOX7JPfk/s400/1-hot-vector.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="direction: ltr; text-align: center;"><span style="font-size: x-small;">Figure 1: an illustration of one-hot vectors of <i>cat</i>, <i>cats</i> and <i>dog</i>. The only non-zero entry in each vector is the index of the word in the vocabulary.</span></td></tr>
</tbody></table>
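The representation in Figure 1 can be built directly; here is a sketch with a toy three-word vocabulary standing in for the real indices:

```python
import numpy as np

vocabulary = ['cat', 'cats', 'dog']   # toy stand-in for the full vocabulary
word_to_index = {w: i for i, w in enumerate(vocabulary)}

def one_hot(word):
    """A |V|-dimensional vector of zeros with a single set bit
    at the word's index in the vocabulary."""
    vec = np.zeros(len(vocabulary))
    vec[word_to_index[word]] = 1.0
    return vec

# Distinct one-hot vectors are orthogonal, so their dot product
# (and hence cosine similarity) is always zero: the representation
# captures no similarity between words at all
print(one_hot('cat').dot(one_hot('cats')))   # 0.0
```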
<div dir="ltr" style="text-align: left;">
While one-hot vectors could be stored efficiently, their main problem is that they don't capture any information about the similarity between words. They even lose the information about string-similarity, that sometimes indicates semantic similarity (when words share the same lemma / basic form). Since we're interested in representing words for semantic tasks, a better solution is needed.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<b>Distributional vectors</b><br />
One important characteristic of a word is the company it keeps. According to the distributional hypothesis <sup><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html#ref1" name="top-ref1">[1]</a></sup> , words that occur in similar contexts (with the same neighboring words), tend to have similar meanings (e.g. <i style="text-align: center;">elevator </i><span style="text-align: center;">and </span><i style="text-align: center;">lift </i><span style="text-align: center;">will both appear next to <i>down</i>, <i>up</i>, <i>building</i>, <i>floor</i>, and <i>stairs</i>); simply put, "tell me who your friends are and I will tell you who you are" - the words version. </span><br />
<span style="text-align: center;"><br /></span>
<span style="text-align: center;">This idea was leveraged to represent words by their contexts. Suppose that each word is again represented by a |V|-dimensional vector. Each entry in the vector corresponds to an index in the vocabulary. Now, instead of marking a word with its own index, we mark the occurrences of other words next to it. Given a large corpus, we search for all the occurrences of a certain word (e.g. <i>elevator</i>). We take a window of a pre-defined size <i>k</i> around <i>elevator</i>, and every word in this window is considered as a neighbor of <i>elevator</i>. For instance, if the corpus contains the sentence "The left elevator goes down to the second floor", with a window of size 5, we get the following neighbors: <i>the, left, goes, down</i>. A larger <i>k </i>will also include characteristic words such as <i>floor.</i></span><br />
<div style="text-align: left;">
We then update <i>elevator</i>'s vector by increasing the number of occurrences in the indices of <i style="text-align: center;">the, left, goes, </i><span style="text-align: center;">and</span><i style="text-align: center;"> down</i><span style="text-align: center;">. At the end of this process, each word vector contains the frequencies of all its neighbors. We can normalize the vector to get a probability distribution (how likely each word is to appear as a neighbor of the target word). There are some more complex metrics, but this is the main idea. These methods are also referred to as "counting methods"</span><span style="text-align: center;">.</span></div>
<div style="text-align: left;">
<span style="text-align: center;"><br /></span></div>
<div style="text-align: left;">
<span style="text-align: center;">The main advantage of distributional vectors is that they capture similarity between words: similar words => similar neighbors => similar vectors. Measuring similarity between vectors is possible, using measures such as <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a>, for example. We get a simple method of comparing similarities between words - we can expect <i>elevator</i> and <i>lift </i>to yield a higher similarity score than </span><i style="text-align: center;">elevator</i><span style="text-align: center;"> and, say, <i>cat </i>(and I wouldn't miss the chance to watch a <a href="https://www.youtube.com/watch?v=lunMj9zuB18">cat in an elevator</a> video). These simple vector similarities are highly effective in recognizing word similarity in semantic tasks (e.g., to overcome lexical variability).</span></div>
<div style="text-align: left;">
<span style="text-align: center;"><br /></span></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6vVLPvM-Z_3CLjBX02ufvXVQGlnHndX3a_7jL_kxZjI2Tp0sao87E2Gysk2ZVadWOvFk74ZPp27tCGX71FwZCQgd7YpIvDUf1fpbbZYgORoZxyG6FDvBEFNu7LAKxf57MrDBZ727g8tM/s1600/distributional.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6vVLPvM-Z_3CLjBX02ufvXVQGlnHndX3a_7jL_kxZjI2Tp0sao87E2Gysk2ZVadWOvFk74ZPp27tCGX71FwZCQgd7YpIvDUf1fpbbZYgORoZxyG6FDvBEFNu7LAKxf57MrDBZ727g8tM/s400/distributional.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;">Figure 2: an illustration of distributional vectors of <i>food</i>, <i>eat</i> and <i>laptop</i>. Notice that the vectors of the similar words <i>food</i> and <i>eat </i>are similar, while different from the vector of the dissimilar word <i>laptop.</i></span></td></tr>
</tbody></table>
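The counting scheme described above, together with cosine comparison, can be sketched as follows, using the post's elevator sentence as a one-sentence "corpus":

```python
import numpy as np
from collections import Counter

corpus = ['The left elevator goes down to the second floor']

def neighbor_counts(sentences, target, k=5):
    """Count the words inside a size-k window (k // 2 words
    on each side) around every occurrence of target."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        for i, w in enumerate(tokens):
            if w == target:
                counts.update(tokens[max(0, i - k // 2):i])   # left context
                counts.update(tokens[i + 1:i + 1 + k // 2])   # right context
    return counts

print(sorted(neighbor_counts(corpus, 'elevator')))
# ['down', 'goes', 'left', 'the']

# Stacking the counts over the vocabulary gives the distributional
# vector, and two such vectors can be compared by cosine similarity
vocab = sorted({w for s in corpus for w in s.lower().split()})

def distributional_vector(target):
    counts = neighbor_counts(corpus, target)
    return np.array([counts[w] for w in vocab], dtype=float)

def cosine(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))
```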
<div style="text-align: left;">
<span style="text-align: center;"><br /></span></div>
<div style="text-align: left;">
<span style="text-align: center;">Yet, distributional vectors pose computational obstacles: the vocabulary size is typically very high (at least a few hundred thousand words). Storing each of these |V| words in a |V|-dimensional vector results in a |V|</span><sup>2</sup>-cell matrix - quite large, and performing operations on all the words is computationally heavy.</div>
<div style="text-align: left;">
<br /></div>
</div>
<div dir="ltr" style="text-align: left;">
<span style="text-align: left;"><b>Word embeddings</b></span><br />
Word embeddings to the rescue. The basic idea is to store the same contextual information in a low-dimensional vector; each word is now represented by a D-dimensional vector, where D is a relatively small number (typically between 50 and 1000). This approach was first presented in 2003 by Bengio et al. <sup><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html#ref2" name="top-ref2">[2]</a></sup> , but gained extreme popularity with <a href="https://code.google.com/p/word2vec/">word2vec</a> <sup><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html#ref3" name="top-ref3">[3]</a></sup> in 2013. There are also some other variants of word embeddings, like <a href="http://nlp.stanford.edu/projects/glove/">GloVe</a> <sup><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html#ref4" name="top-ref4">[4]</a></sup>.<br />
<br />
Instead of counting occurrences of neighboring words, the vectors are now <i>predicted</i> (learned). Without getting too technical, this is what these algorithms basically do: they start from a random vector for each word in the vocabulary. Then they go over a large corpus and at every step, observe a target word and its context (neighbors within a window). The target word's vector and the context words' vectors are then updated to bring them close together in the vector space (and therefore increase their similarity score). Other vectors (all of them, or a sample of them) are updated to become less close to the target word. After observing many such windows, the vectors become meaningful, yielding similar vectors to similar words.<br />
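The "bring the vectors close together, push the others away" step can be sketched as a single update in the style of skip-gram with negative sampling. This is a bare-bones illustration with made-up words and parameters, not the actual word2vec implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 50                                                    # embedding dimension
words = ['elevator', 'down', 'cat']
vecs = {w: rng.normal(scale=0.1, size=D) for w in words}  # target vectors
ctx = {w: rng.normal(scale=0.1, size=D) for w in words}   # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(target, context, negatives, lr=0.5):
    """Increase the score of (target, context); decrease the
    score of (target, negative) for each sampled negative."""
    t = vecs[target]
    t_grad = np.zeros(D)
    for c, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        g = sigmoid(t.dot(ctx[c])) - label   # gradient of the logistic loss
        t_grad += g * ctx[c]
        ctx[c] -= lr * g * t
    t -= lr * t_grad                         # updates vecs[target] in place

before = sigmoid(vecs['elevator'].dot(ctx['down']))
for _ in range(20):
    sgns_update('elevator', 'down', negatives=['cat'])
after = sigmoid(vecs['elevator'].dot(ctx['down']))
```

After these updates, the score of the (elevator, down) pair is higher than at the random start, while the (elevator, cat) pair has been pushed down.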
<br />
What are the advantages of word embeddings? First of all, despite the low dimensions, the information regarding word similarity is kept. Similar words still have similar vectors - this is what the algorithms aim to do while learning these vectors. Second, they are compact, so any operation on these vectors (e.g. computing similarities) is efficient and fast.<br />
<br />
Moreover, people have done some amazing things with these vectors. Some of these things (if not all) are possible with high-dimensional vectors as well, but are computationally difficult.<br />
<br />
<ul dir="ltr">
<li><b>Most similar words</b> - we can easily find the most similar words to a certain word, by finding the most similar vectors. For instance, the 5 most similar words to <i>July </i>are <i>June</i>, <i>January</i>, <i>October</i>, <i>November</i>, and <i>February</i>. We can also find a word which is similar to a set of other words: for example, the most similar word to <i>Israel </i>+ <i>Lebanon </i>+ <i>Syria </i>+ <i>Jordan </i>is <i>Iraq</i>, while the most similar word to <i>France </i>+ <i>Germany </i>+ <i>Italy </i>+ <i>Switzerland </i>is <i>Austria</i>! Isn't this cool? If that's not impressive enough, using techniques for projecting high-dimensional vectors to 2-dimensional space (<a href="https://lvdmaaten.github.io/tsne/">t-SNE</a>, <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a>), one can visualize word embeddings, and get some really nice insights about what the vectors capture:</li>
</ul>
</div>
<div dir="ltr" style="text-align: left;">
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJgP7lmp3JUXMjT0CxwtdwQ-CkuBjLMvdUqbGjLY3z1XxZS7pGnNVpjJUb74tL4gAdK8BR7wwZeWmqnGzcfbdELXUdKvE7W6wbtYeJ91gSePYep2nBNbhpabQEVV_ut4t7Ti2FBXh_x9Q/s1600/tsne.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="297" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJgP7lmp3JUXMjT0CxwtdwQ-CkuBjLMvdUqbGjLY3z1XxZS7pGnNVpjJUb74tL4gAdK8BR7wwZeWmqnGzcfbdELXUdKvE7W6wbtYeJ91gSePYep2nBNbhpabQEVV_ut4t7Ti2FBXh_x9Q/s400/tsne.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;">Figure 3: a t-SNE visualization of a tiny sample of GloVe word embeddings. It seems to have noticed that <i>bird </i>and <i>fly </i>are related, as well as <i>bird </i>and <i>cat</i>, but <i>love</i>, <i>marriage </i>and <i>wedding </i>are not, for some reason.</span></td></tr>
</tbody></table>
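Finding the most similar words, as above, is just a cosine-similarity search against every vector. Here is a small sketch over a made-up embedding table (in practice you would load pretrained word2vec or GloVe vectors instead of random ones); note that querying with a <i>set</i> of words needs no special machinery - you simply query with the vector sum:

```python
import numpy as np

# Made-up embedding table for illustration; real vectors would be
# loaded from a pretrained word2vec or GloVe model.
words = ["july", "june", "october", "cat", "dog"]
rng = np.random.default_rng(1)
E = rng.normal(size=(len(words), 4))
E /= np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalize each row

def most_similar(query_vec, topn=3):
    """Rank all words by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = E @ q                                 # one dot product per word
    order = np.argsort(-sims)
    return [(words[i], float(sims[i])) for i in order[:topn]]

# Most similar to a single word (the word itself comes out on top):
print(most_similar(E[words.index("july")]))
# Most similar to a set of words: query with the vector sum.
print(most_similar(E[words.index("june")] + E[words.index("october")]))
```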
<ul dir="ltr"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisAhW8nTZIzI20UnSuYHwPOSpXxZOT-SKIwa6-_K7lOmDIcCRU2SCSI2GSN7NTxq4epPSqoju7KgS4_MQWkTA3gyNn-96jxIsI7-0FvhvjwwisDX9Qhw1J-lq9nXiLoLMQRYeptn0uBuo/s1600/word2vec_analogies_1.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="273" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisAhW8nTZIzI20UnSuYHwPOSpXxZOT-SKIwa6-_K7lOmDIcCRU2SCSI2GSN7NTxq4epPSqoju7KgS4_MQWkTA3gyNn-96jxIsI7-0FvhvjwwisDX9Qhw1J-lq9nXiLoLMQRYeptn0uBuo/s400/word2vec_analogies_1.JPG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small; text-align: left;">Figure 4: two-dimensional projection of word2vec vectors of countries and their capitals, from the <a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf">paper</a>. The lines between countries and their capitals are approximately parallel, indicating that the vectors capture this relation. </span></td></tr>
</tbody></table>
<li><b>Analogies </b>- Following the above figure, the authors of this paper presented some nice results regarding the vectors' ability to solve analogy questions: <i>a </i>is to <i>b </i>as <i>c </i>is to <i>d</i>; given <i>a, b, </i>and <i>c</i>, find the missing word <i>d.</i> It turns out that this can be solved pretty accurately by selecting a vector which is similar to <i>b</i> and <i>c</i>, but not to <i>a</i>. This is done by vector addition and subtraction, and the most famous example is "<i>king </i>- <i>man </i>+ <i>woman </i>= <i>queen</i>", demonstrated in the following figure:</li>
</ul>
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRhT0GSluAaMuTFYJGIC1Qp-pQFasNS66AOvZiZZJYZCviiDoZPJFok_e5lDRIFoV09isymi9SBL8YyIoHWkaPh87ca7f9XXgd6x8HxxnEBun34eCXPaZG_6foO0w4NW7q03HxnpBt2vs/s1600/word2vec_analogies_2.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRhT0GSluAaMuTFYJGIC1Qp-pQFasNS66AOvZiZZJYZCviiDoZPJFok_e5lDRIFoV09isymi9SBL8YyIoHWkaPh87ca7f9XXgd6x8HxxnEBun34eCXPaZG_6foO0w4NW7q03HxnpBt2vs/s1600/word2vec_analogies_2.JPG" /></a></td></tr>
<tr><td class="tr-caption" style="direction: ltr; text-align: center;"><span style="font-size: x-small;">Figure 5: an illustration of the analogical male-female relation between word pairs such as (<i>king</i>, <i>queen</i>), (<i>man</i>, <i>woman</i>) and (<i>uncle</i>, <i>aunt</i>), from the <a href="http://research.microsoft.com/pubs/189726/rvecs.pdf">paper</a>.</span></td></tr>
</tbody></table>
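The analogy trick itself is one line of vector arithmetic. The sketch below uses hand-built toy vectors in which the male-female offset is exact by construction; learned embeddings only satisfy such relations approximately:

```python
import numpy as np

# Hand-built toy vectors: the third dimension stands for "royalty",
# so king = man + royal and queen = woman + royal exactly.
vecs = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
    "apple": np.array([0.3, 0.3, 0.0]),
}

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    """a is to b as c is to ?  ->  the word closest to b - a + c,
    excluding the three query words (standard practice)."""
    target = vecs[b] - vecs[a] + vecs[c]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(target, vecs[w]))

print(analogy("man", "king", "woman"))  # king - man + woman -> queen
```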
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
While these results are very impressive (I'm still impressed and I've known about them for a long time now...), it still isn't clear how useful these vectors are. Turns out they are amazingly useful. Almost any semantic task has been re-implemented recently with the help of word embeddings. Many of them benefit from a performance boost, and it's mostly just very easy to use them. In future posts, I'll give some examples of tasks that rely on word embeddings.</div>
<ul dir="ltr" style="text-align: left;"><ul>
</ul>
</ul>
<div dir="ltr" style="text-align: left;">
<hr />
<div style="direction: ltr; text-align: left;">
<b>Suggested Reading:</b><br />
<span style="font-size: x-small;"><a href="http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/">Deep Learning, NLP, and Representations</a>, from <a href="http://colah.github.io/">Christopher Olah's blog</a> (more technical and advanced).</span></div>
<hr />
<div style="direction: ltr; text-align: left;">
<b>References:</b><br />
<div style="text-align: left;">
<a href="https://www.blogger.com/null" name="ref1"><span style="font-size: x-small;">[1]</span></a><span style="font-size: x-small;"> </span><span style="font-size: x-small;">Harris, Zellig S. "Distributional structure." Word. 1954. </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html#top-ref1" style="font-size: small;">â©</a></sup></div>
<a href="https://www.blogger.com/null" name="ref2"><span style="font-size: x-small;">[2]</span></a><span style="font-size: x-small;"> </span><span style="font-size: x-small;">Yoshua Bengio, RĂ©jean Ducharme, Pascal Vincent, and Christian Janvin, "A neural probabilistic language model", The Journal of Machine Learning Research, 2003. </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html#top-ref2" style="font-size: small;">â©</a></sup><br />
<a href="https://www.blogger.com/null" name="ref3"><span style="font-size: x-small;">[3]</span></a><span style="font-size: x-small;"> </span><span style="font-size: x-small;">Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient estimation of word representations in vector space." CoRR, 2013.. </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html#top-ref3" style="font-size: small;">â©</a></sup><br />
<a href="https://www.blogger.com/null" name="ref4" style="text-align: right;"><span style="font-size: x-small;">[4]</span></a><span style="font-size: x-small; text-align: right;"> </span><span style="font-size: x-small; text-align: right;">Jeffrey Pennington, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." EMNLP 2014. </span><sup style="font-size: small; text-align: right;"><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html#top-ref4">â©</a></sup></div>
</div>
</div>Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com28tag:blogger.com,1999:blog-9145120678290195131.post-22732996760653865112015-11-11T17:29:00.001+02:002015-11-17T01:28:54.279+02:00Recommender Systems<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
Recommender systems suggest specific content to users according to their preferences, by predicting the users' ratings of items.<br />
<br />
If this doesn't ring a bell, let me tell you how common recommender systems are in your world.<br />
<ul style="text-align: left;">
<li>Shopping sites, such as <a href="http://www.amazon.com/">Amazon</a>, <a href="http://www.ebay.com/">eBay</a>, and <a href="http://aliexpress.com/">AliExpress</a>, recommend items to purchase based on your previous purchases.<span style="font-size: x-small;">*</span></li>
</ul>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDp52X1483VJdzc677bNE1lx3G7kb_PgopHTW8CWO7a5gWOGxdKWJlTqfIzWRRsBQaEo2DO8z1rzZzatJLSfYy9-AzHTmvsCdT0eB6OrMWPhaTVpwSgnX2tJ-7y6q8ZzBvZkdSo_oQclg/s1600/Amazon_recommended.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="272" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDp52X1483VJdzc677bNE1lx3G7kb_PgopHTW8CWO7a5gWOGxdKWJlTqfIzWRRsBQaEo2DO8z1rzZzatJLSfYy9-AzHTmvsCdT0eB6OrMWPhaTVpwSgnX2tJ-7y6q8ZzBvZkdSo_oQclg/s400/Amazon_recommended.PNG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Amazon recommended me to buy a recycling bin sticker because I had recycle bins in my shopping basket</td></tr>
</tbody></table>
</div>
<ul style="text-align: left;">
<li><a href="http://www.imdb.com/">IMDB</a> recommends movies you might like, based on your other movie ratings.</li>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgo-RHVVbrJbRo2O5xT6OeeqxFuJoGKKmnYJR9QurCGoLkiN7Y1W3v22DP6eEiSEqQliG_eJJRBOSBdCpF5xNct_mHbiwjdAknhyxFXOAmGDsWv8e8RG9lbBe1o-NA1EuApg2jApSw7AQ4/s1600/imdb+recommendation.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="201" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgo-RHVVbrJbRo2O5xT6OeeqxFuJoGKKmnYJR9QurCGoLkiN7Y1W3v22DP6eEiSEqQliG_eJJRBOSBdCpF5xNct_mHbiwjdAknhyxFXOAmGDsWv8e8RG9lbBe1o-NA1EuApg2jApSw7AQ4/s400/imdb+recommendation.PNG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">IMDB recommended my husband to watch "Pirates of Silicon Valley", based on his interest in similar movies.<br />
<div>
<br /></div>
</td></tr>
</tbody></table>
<li>Music sites, such as <a href="http://www.rdio.com/">Rdio</a>, <a href="http://www.spotify.com/">Spotify</a> and the late GrooveShark, recommend artists similar to the artists you listen to.</li>
</ul>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWZHVRP8xfi98PxDhgCQx3U509V-uEdD7x6YaCFAJj0tPggIe-FsCVWNGLqZsGXQl8DUb9Xe-B6RMGTwDiXmdM4uJhtpfeHPriBHZVTr-32tj6Qa8cUzZLpH06-m8rLbxUJwlsPmxGm4c/s1600/Rdio.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWZHVRP8xfi98PxDhgCQx3U509V-uEdD7x6YaCFAJj0tPggIe-FsCVWNGLqZsGXQl8DUb9Xe-B6RMGTwDiXmdM4uJhtpfeHPriBHZVTr-32tj6Qa8cUzZLpH06-m8rLbxUJwlsPmxGm4c/s400/Rdio.PNG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 12.8px;"><span style="font-size: x-small;">Rdio suggested artists that it considered similar to those I've been listening to</span></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<ul style="text-align: left;">
<li>And of course, <a href="http://youtube.com/">YouTube</a> recommends videos based on your previous watches.</li>
</ul>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_0ar89QCgsx9GErRILuewubka-RXwbPc9nte6RAVGDr8qF6WLMjed_TbbaTsg7cnrDPTl5lm8GAx6302hiMBKHi4hhU7458az_FDXljNDFM3ZLriDm9mCIV1cRwNmGCJiU3M-9Dt5yo8/s1600/YouTube.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="221" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_0ar89QCgsx9GErRILuewubka-RXwbPc9nte6RAVGDr8qF6WLMjed_TbbaTsg7cnrDPTl5lm8GAx6302hiMBKHi4hhU7458az_FDXljNDFM3ZLriDm9mCIV1cRwNmGCJiU3M-9Dt5yo8/s320/YouTube.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">YouTube identified that I like budgies and Muse and that I'm a beginner electric guitar player</td></tr>
</tbody></table>
You can't overestimate the business value of these recommendations. Studies have shown that they increase sales. So how does Amazon know which products are likely to interest me? How does Rdio know to recommend I listen to Kings of Leon and would never recommend listening to Taylor Swift? In general, these systems know that I like certain products/artists/movies/videos and would like to predict other products/artists/movies/videos I might like. Many of these systems use a quite simple algorithm, called <a href="https://en.wikipedia.org/wiki/Collaborative_filtering">Collaborative Filtering</a>.<br />
<br />
<b>Collaborative Filtering</b><br />
<br />
Let's take music recommendation as an example. The system has many registered users, and many artists. For simplicity, assume that ratings apply to artists rather than albums or songs. Users can rate artists on a scale of 1-5. An average user rates only a few artists; there are still many other artists that he doesn't rate, either because he doesn't know them or because he didn't listen to them through the website.<br />
<br />
In many cases, the user doesn't actively rate the artist; instead, the system infers a rating based on the user's behavior - for example, if the user listened to a certain artist many times, it counts as a high rating. On the other hand, if the system offered the user this artist a couple of times and he always clicked "skip", he must really not like this artist. The exact rating technique doesn't really matter.<br />
<br />
User preferences are stored in a large matrix (table), in which each row represents a user, and each column represents an artist:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXaNo_cfr8-9xigsFe7WQs0QMjHYfuY1CfafkzSGiJkxRJBylrHzdGu2UTQXxniZyZEjXv5xlajtVPihzuvryFQ34iQzVh-5cfVj7nKY2aGG4xfVbm6sUaSO4hhB3ugcMfLb_e3r00TIE/s1600/matrix_original.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXaNo_cfr8-9xigsFe7WQs0QMjHYfuY1CfafkzSGiJkxRJBylrHzdGu2UTQXxniZyZEjXv5xlajtVPihzuvryFQ34iQzVh-5cfVj7nKY2aGG4xfVbm6sUaSO4hhB3ugcMfLb_e3r00TIE/s1600/matrix_original.PNG" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A sample user artist ranking matrix. with 5 users and 7 artists. The ranking is between 1 (hate) to 5 (love). If a user hasn't ranked a certain artist, the rank is unknown and is marked with a question mark. This is a toy example; much more data is needed for accurate inference.</td></tr>
</tbody></table>
The idea behind the algorithm is basically to get rid of the question marks in the matrix, and replace them with predicted ratings. The basic assumptions are:<br />
<br />
<ol style="text-align: left;">
<li>If I like a certain artist (e.g. Muse), I would like similar artists (e.g. Royal Blood).</li>
<li>If I'm user A, and I agree with user B's ratings on many artists, I might agree with user B's ratings of other artists, which I haven't rated (and maybe don't know yet). For example, if both user B and I really love Taylor Swift and Miley Cyrus, and user B also really likes Bruno Mars, which I haven't rated, then I might also like Bruno Mars.</li>
</ol>
<div>
From these two reasonable assumptions, the algorithm branches into two possible implementations:</div>
<div>
<ol style="text-align: left;">
<li><b>Based on user similarity</b><br />This implementation looks at a certain user and tries to complete missing ratings for this user. According to the second assumption, all we need to do is measure to what extent this user's preferences are similar to those of any other user, and then use the similar users' ratings to complete missing ratings.<br /><br />It is easy to see that each user is represented by a row vector. The dimension of this vector is the number of artists, and every cell in the vector represents the user's rating of a certain artist. For example, this is user 2's vector:<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSM164IW9Kau2VrOAXmUlSNsXQ55utMDhpwAra4e-C6gepAosj51J5bSJRutBybo1aNm5yqd731JQ6jTVz5_4mI0eBArC07rK7vXpaeO-W6wiaHaq2ihBsHnR5Xa3PvopbLzg3zEGi0sc/s1600/user_2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSM164IW9Kau2VrOAXmUlSNsXQ55utMDhpwAra4e-C6gepAosj51J5bSJRutBybo1aNm5yqd731JQ6jTVz5_4mI0eBArC07rK7vXpaeO-W6wiaHaq2ihBsHnR5Xa3PvopbLzg3zEGi0sc/s1600/user_2.PNG" /></a><br /><br />We can measure the correlation of user 2's ratings with another user's ratings, by looking at the subset of artists that both users rated. We treat each rating with respect to the user's average rating of artists. For example, user 2's average rating is (1 + 5 + 4 + 4) / 4 = 3.5, so his rating of Arctic Monkeys is 2.5 below the average (-2.5), while his rating for Taylor Swift is 1.5 above it (+1.5). High correlation between users occurs when they have many mutual ratings with similar distance from the average. I'll spare you the formula, but let's look at an example. It seems that users 1 and 3 share a fairly similar taste in music:<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixFGOOdRgjb-RMBWBfLI6G7pS5D5r3Y2RPQfauLW_bFUkaVdAQRQIPSkyam_pw2c_LwP9IvTHWDFkDkdfNd7QtU6M14hQMEF4jQlzJwmSjUKkNX66f3CRRgYvNlaTyEsJrIoBuyriptMg/s1600/user_similarity.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixFGOOdRgjb-RMBWBfLI6G7pS5D5r3Y2RPQfauLW_bFUkaVdAQRQIPSkyam_pw2c_LwP9IvTHWDFkDkdfNd7QtU6M14hQMEF4jQlzJwmSjUKkNX66f3CRRgYvNlaTyEsJrIoBuyriptMg/s1600/user_similarity.PNG" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Rating vectors of user 1 (top) and 3 (bottom). The mutual ratings are similar.</td></tr>
</tbody></table>
While they both differ from user 5, who seems to hate both Arctic Monkeys and Muse:<br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijcnwE8pwDRRtPMuNP9eP_e6YqFLJETPw2r1s81gt6y1Q9Q2vIu0EHvxcOTdA3iuV-a7ugwcqqskKdnCg4hoOQGgx1D6DA4oVS7iYF7ZZN1juaLqwAzkn1DYRy6pIBGd7tgtsyrOPDwdg/s1600/different_user.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijcnwE8pwDRRtPMuNP9eP_e6YqFLJETPw2r1s81gt6y1Q9Q2vIu0EHvxcOTdA3iuV-a7ugwcqqskKdnCg4hoOQGgx1D6DA4oVS7iYF7ZZN1juaLqwAzkn1DYRy6pIBGd7tgtsyrOPDwdg/s1600/different_user.PNG" /></a><br /><br />If we would like to predict user 1's missing rating of Bruno Mars, we first need to find the <i>k</i> most similar users to user 1 who rated Bruno Mars (<i>k</i> being a parameter of the system). For each of these similar users, we compute the distance of their rating of Bruno Mars from their average rating. For example, +1, -2.5, etc. Then, we compute a weighted average of these distances, with the user similarity as the weight. The weighted average means that the more similar a certain user is to the target user (e.g. user 1), the more weight we put on his preferences.<br /><br />This weighted average is the predicted distance from the average rating of user 1. For example, setting <i>k=1</i> will take into account user 3 as the most similar user. User 3's average rating is (5 + 1 + 5 + 2 + 5 + 1) / 6 = 3.167. The distance of his rating for Bruno Mars from the average is 1 - 3.167 = -2.167. Since we chose <i>k=1</i>, this would be exactly the distance for user 1. His average rating is (4 + 5 + 3) / 3 = 4, so his predicted rating for Bruno Mars is 4 - 2.167 = 1.83. Pretty reasonable, considering that both users like rock, but user 1 shows more tolerance to pop music by liking Adele a bit more than user 3.</li>
<li><b>Based on item similarity</b><br />This variant is pretty straightforward after understanding the user-based similarity. We now look at each artist as a column vector, trying to predict what users would rate this artist. <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0YNNA-62dNHfyCBdTYEh_TgABqp8hKYeBfUFY8a1APafmLV6XOdiAZERwhwi7h49EXrVR_8xEa3yFA1zVFOEdoju9AUG9j83vYF17a88vlfL_E8q617c7BiH6I2U9NRmXgnwGEbuYTNM/s1600/similar_artists.png" imageanchor="1" style="margin-left: auto; margin-right: auto; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0YNNA-62dNHfyCBdTYEh_TgABqp8hKYeBfUFY8a1APafmLV6XOdiAZERwhwi7h49EXrVR_8xEa3yFA1zVFOEdoju9AUG9j83vYF17a88vlfL_E8q617c7BiH6I2U9NRmXgnwGEbuYTNM/s1600/similar_artists.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The column vector for Arctic Monkeys (left) and Muse (right). They are fairly similar in the given toy example (while not so similar in real life, given other rock artists).</td></tr>
</tbody></table>
<br />The <i>k</i> most similar artists of each artist are computed in a similar fashion. Again, with respect to the distance of each user's rating from his average (that's because different users have different rating scales - some are more generous than others). The predicted rating of a certain artist by a certain user is the weighted average of the ratings of this user for similar artists. For example, to predict user 4's rating of Arctic Monkeys with <i>k=1</i>, we simply take his rating of Muse.</li>
</ol>
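The user-based computation described above fits in a short function. This is an illustrative sketch over a small made-up ratings matrix (not the exact table from this post): mean-center each user's ratings, measure similarity over co-rated artists, and predict with a similarity-weighted average of the <i>k</i> nearest users' deviations:

```python
import numpy as np

# Illustrative ratings matrix (rows = users, columns = artists);
# np.nan marks an unknown rating. A made-up example for demonstration.
R = np.array([
    [4.0, 5.0, np.nan, 3.0],
    [1.0, 5.0, 4.0,    4.0],
    [5.0, 5.0, 1.0,    2.0],
])

def predict(R, user, item, k=1):
    """User-based CF: the user's mean plus a similarity-weighted average
    of the k most similar users' mean-centered ratings for the item."""
    mask = ~np.isnan(R)
    means = np.array([R[u][mask[u]].mean() for u in range(len(R))])
    sims = []
    for other in range(len(R)):
        if other == user or np.isnan(R[other, item]):
            continue
        both = mask[user] & mask[other]          # artists both users rated
        if not both.any():
            continue
        a = R[user, both] - means[user]
        b = R[other, both] - means[other]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sims.append((a @ b / denom if denom else 0.0, other))
    top = sorted(sims, reverse=True)[:k]
    num = sum(s * (R[o, item] - means[o]) for s, o in top)
    den = sum(abs(s) for s, _ in top)
    return means[user] + (num / den if den else 0.0)

# Predict user 0's missing rating for artist 2 from the single most
# similar user; user 2's mean-centered ratings align best with user 0's.
print(round(predict(R, user=0, item=2, k=1), 2))  # -> 1.75
```

With <i>k=1</i> the prediction reduces to exactly the recipe in the worked example: the target user's mean plus the nearest neighbor's deviation from his own mean.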
<div>
The next time some website miraculously predicts which artist/video/movie/product you would like, you should know it wasn't a wild guess but a rather simple heuristic.</div>
<div>
<hr />
<span style="font-size: x-small;">* No items were purchased during the writing of this post.</span></div>
</div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com2tag:blogger.com,1999:blog-9145120678290195131.post-21324705657971528602015-10-09T19:11:00.000+03:002015-10-09T19:11:05.735+03:00Fun with Google Ngrams<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
Google Ngrams is a dataset released by Google a few years back, based on Google Books.
The dataset is available for multiple languages. All Google Books in a given language were scanned, and the frequency of each <a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html">n-gram</a>, for n = 1 to 5, was counted by publication year. For example, how many times did the word "no" occur in books from 1800 to 1900? How many times did the trigram "freedom of speech" occur in the year 2000?
These counts were normalized, and the result is an approximate probability of each n-gram over time. The data is available for download, and can also be explored via the <a href="http://books.google.com/ngrams/">viewer</a>.</div>
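The counting step itself is simple. Here is a miniature sketch with a made-up two-year "corpus"; the real dataset does this over millions of scanned books per language and then normalizes each count by the year's total:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sequences in a list of tokens."""
    return zip(*(tokens[i:] for i in range(n)))

# Made-up miniature "corpus by year" for illustration; the real dataset
# aggregates all scanned Google Books for a language.
books_by_year = {
    1999: ["freedom of speech is vital".split()],
    2000: ["freedom of speech matters and freedom of speech".split()],
}

# Count every trigram per year, then normalize by the year's total
# trigram count to get the relative frequency the viewer plots.
counts = {
    year: Counter(" ".join(gram) for book in books for gram in ngrams(book, 3))
    for year, books in books_by_year.items()
}
totals = {year: sum(c.values()) for year, c in counts.items()}

print(counts[2000]["freedom of speech"])                 # raw count: 2
print(counts[2000]["freedom of speech"] / totals[2000])  # relative frequency
```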
<div dir="ltr" style="text-align: left;">
<br />
While this data is very helpful for training NLP models, it can also provide some cultural, historical and sociological insights... and hours of fun!<br />
<br />
Let's warm up with a simple example, exploring the change in the English language throughout time. I took a few synonyms of the word <i>happy</i>, and compared their usage between 1800 and 2008 (the last year available in the viewer).
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=gay%2Cmerry%2Ccheerful%2Cdelighted&year_start=1800&year_end=2015&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cgay%3B%2Cc0%3B.t1%3B%2Cmerry%3B%2Cc0%3B.t1%3B%2Ccheerful%3B%2Cc0%3B.t1%3B%2Cdelighted%3B%2Cc0" vspace="0" width="100%"></iframe>
While the curves of <i>merry</i>, <i>cheerful</i> and <i>delighted</i> look pretty similar, the <i>gay</i> curve departs from the others in the 1980s. There's a reason for that, and it can be explained by the following graph:
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=gay%2Chomosexual&year_start=1800&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cgay%3B%2Cc0%3B.t1%3B%2Chomosexual%3B%2Cc0" vspace="0" width="100%"></iframe>
The word <i>gay</i>, in the sense of <i>homosexual</i>, has been in use since the <a href="http://www.todayifoundout.com/index.php/2010/02/how-gay-came-to-mean-homosexual/">1950s</a>, boosting its frequency ever since.<br />
<br />
The frequency of a certain term sometimes correlates with historical events. For example, while the word <i>war </i>is constantly in use, it was most prominent in books during and after World War I and World War II. See the peaks in the graph:<br />
<br />
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=war&year_start=1800&year_end=2008&corpus=15&smoothing=0&share=&direct_url=t1%3B%2Cwar%3B%2Cc0" vspace="0" width="100%"></iframe><br />
<br />
Another interesting thing to notice is that people (at least authors) are actually peaceful. Whenever they talk about wars, they also talk about peace. Look how similar the <i>war</i> and <i>peace</i> curves are:<br />
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=war%2Cpeace&year_start=1800&year_end=2008&corpus=15&smoothing=0&share=&direct_url=t1%3B%2Cwar%3B%2Cc0%3B.t1%3B%2Cpeace%3B%2Cc0" vspace="0" width="100%"></iframe> <br />
The curve similarity may suggest that the same books that discuss war also mention peace, but since the <i>war</i> curve dominates the <i>peace </i>curve, I can only assume that war is the books' main topic and peace is only mentioned a couple of times. I hope that they say good things about peace.<br />
<br />
Searching for <i>World Trade Center</i> shows that it was first mentioned around its construction in 1973; then there were a few years in which it was hardly discussed, and then 9/11 came along and made it a very common topic.
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=World+Trade+Center&year_start=1950&year_end=2008&corpus=15&smoothing=0&share=&direct_url=t1%3B%2CWorld%20Trade%20Center%3B%2Cc0" vspace="0" width="100%"></iframe>
In some cases, the correlation with historical events comes through new words that describe concepts or products: they start appearing in books around the time of their invention or founding. For example:<br />
<br />
Facebook was founded in 2004.
<br />
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=Facebook&year_start=1980&year_end=2008&corpus=15&smoothing=0&share=&direct_url=t1%3B%2CFacebook%3B%2Cc0" vspace="0" width="100%"></iframe>
Google was founded in 1998.
<br />
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=Google&year_start=1980&year_end=2008&corpus=15&smoothing=0&share=&direct_url=t1%3B%2CGoogle%3B%2Cc0" vspace="0" width="100%"></iframe>
Twitter was founded in 2006. However, <i>twitter</i> is an English word that was already in use before 2006 (and, as it seems, sometimes appeared capitalized, probably at the beginning of a sentence).
<br />
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=twitter%2CTwitter&year_start=1800&year_end=2008&corpus=15&smoothing=3&share=" vspace="0" width="100%"></iframe>
What about some older inventions?
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=television%2Ctelephone&year_start=1800&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Ctelevision%3B%2Cc0%3B.t1%3B%2Ctelephone%3B%2Cc0" vspace="0" width="100%"></iframe>
<br />
The invention of the telephone, commonly attributed to Alexander Graham Bell, in fact involved other inventors such as Antonio Meucci and Thomas Watson. Work started in 1844, but Bell was granted the patent for the telephone in 1876. The television was invented in 1926. Which of them had a greater influence on the world? If there is any correlation between being mentioned in books and having influence on the world, it seems like the television did. Having said that, the telephone is commonly referred to as a phone, and in recent years the word also covers <i>cellphone</i> and <i>smartphone</i>. So putting all these together changes the picture:
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=tv%2Btelevision%2Cphone%2Btelephone%2Bsmartphone%2Bcellphone&case_insensitive=on&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2C(tv%20%2B%20television)%3B%2Cc0%3B.t1%3B%2C(phone%20%2B%20telephone%20%2B%20smartphone%20%2B%20cellphone)%3B%2Cc0" vspace="0" width="100%"></iframe>
<br />
Some words were mentioned for a period of time and then just disappeared. Take, for example, this list of diseases, each relevant in a different era:
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=Tuberculosis%2CSmallpox%2Clung+cancer%2CCholera%2CLeprosy%2CHIV&year_start=1800&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2CTuberculosis%3B%2Cc0%3B.t1%3B%2CSmallpox%3B%2Cc0%3B.t1%3B%2Clung%20cancer%3B%2Cc0%3B.t1%3B%2CCholera%3B%2Cc0%3B.t1%3B%2CLeprosy%3B%2Cc0%3B.t1%3B%2CHIV%3B%2Cc0" vspace="0" width="100%"></iframe>
Beyond historical events, you can also try to use the data to search for correlations between events or phenomena. Judge for yourself:
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=diabetes%2Cobesity&year_start=1800&year_end=2008&corpus=15&smoothing=0&share=&direct_url=t1%3B%2Cdiabetes%3B%2Cc0%3B.t1%3B%2Cobesity%3B%2Cc0" vspace="0" width="100%"></iframe>
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=one+child+policy%2CMissing+Women&year_start=1900&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cone%20child%20policy%3B%2Cc0%3B.t1%3B%2CMissing%20Women%3B%2Cc0" vspace="0" width="100%"></iframe>
Bear in mind that correlation doesn't imply a cause-and-effect relation, nor even that a third factor impacts both phenomena. Sometimes they just happen at the same time.
<br />
<br />
Just for the fun of it, can you guess which is the most important day of the week?
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=Sunday%2CMonday%2CTuesday%2CWednesday%2CThursday%2CFriday%2CSaturday&year_start=1800&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2CSunday%3B%2Cc0%3B.t1%3B%2CMonday%3B%2Cc0%3B.t1%3B%2CTuesday%3B%2Cc0%3B.t1%3B%2CWednesday%3B%2Cc0%3B.t1%3B%2CThursday%3B%2Cc0%3B.t1%3B%2CFriday%3B%2Cc0%3B.t1%3B%2CSaturday%3B%2Cc0" vspace="0" width="100%"></iframe>
It's Sunday! I expected that from English books, but I thought that Saturday would be more prominent in Hebrew books. That wasn't the case: the Hebrew graph was similar, with Sunday way ahead of the other days. Maybe this is because of translated books. Happy weekend, everyone!
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com1tag:blogger.com,1999:blog-9145120678290195131.post-38769048844304237102015-09-29T18:17:00.000+03:002015-09-30T00:51:36.238+03:00Translation Models<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
This is the last part of the <a href="http://veredshwartz.blogspot.co.il/2015/08/machine-translation-1-overview.html">machine translation overview</a>, in which I will discuss translation models. To recall, a statistical machine translation system produces a translation that is required to be both <i>adequate</i>, that is, as close as possible in its meaning to the source sentence, and <i>fluent </i>in the target language. Fluency is the responsibility of the target <a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html">language model</a>, which scores every candidate translation according to its likelihood in the target language. The translation model, which will be presented in this post, takes care of <i>adequacy</i>: it scores candidate translations with respect to the original sentence in the source language - higher scores for sentences that better preserve the meaning of the original sentence.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicBUrBca7z_8sJa-B9A8z_F-Q6e-WoStSzg5bKb_hRLiQmtbe203hG1Hhp-bnDkCalfjOzmoFI5354AWyrkJOR7puGCryXi7OLf9qAuuIApENsNYIAvVrSHj64dzNTljk89IfvETkxWd4/s1600/IMG_1382.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicBUrBca7z_8sJa-B9A8z_F-Q6e-WoStSzg5bKb_hRLiQmtbe203hG1Hhp-bnDkCalfjOzmoFI5354AWyrkJOR7puGCryXi7OLf9qAuuIApENsNYIAvVrSHj64dzNTljk89IfvETkxWd4/s320/IMG_1382.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Toilet sign at a restaurant in Mestre, Italy. Some kind of machine translation was used, translating <i>toilet</i> into Hebrew as <i>the makeup</i>. If you recognize funny translations in other languages, please comment!</td></tr>
</tbody></table>
<div dir="ltr">
As in language models, you don't need an expert to build the translation model. You don't even need to speak either the source or the target language. Using statistical methods, you can (theoretically) build a translation model from <a href="https://en.wikipedia.org/wiki/Swahili_language">Swahili</a> to <a href="https://en.wikipedia.org/wiki/Yiddish_language">Yiddish</a>. The only requirement is a <i>parallel corpus</i> - a large amount of the same text written in both languages, for example, movie subtitles or book translations. The texts are usually aligned at the sentence level, so the corpus can be regarded as a large collection of sentences in the source language and their translations to the target language. For example, the first sentence from George Orwell's novel <a href="https://en.wikipedia.org/wiki/Nineteen_Eighty-Four">1984</a>, in the original edition and in the Hebrew translation:<br />
<br />
<b>en</b>: <i>It was a bright cold day in April, and the clocks were striking thirteen.</i><br />
<b>he</b>: <i>ŚŚŚ ŚŚ€ŚšŚŚ ŚŠŚ ŚŚŠŚŚ Ś, ŚŚ©ŚąŚŚ ŚŚ ŚŚŠŚŚŠŚŚŚ Ś©ŚŚŚ©-ŚąŚ©ŚšŚ.</i><br />
<br />
can be considered mutual translations, and so can the rest of the sentence pairs, as long as the translator is not too creative.</div>
<div dir="ltr">
<br /></div>
</div>
<div dir="ltr" style="text-align: left;">
<u>History Lesson</u><br />
Here's a nice anecdote about using a <i>parallel corpus </i>for translation<i> - </i>it's actually not a modern technique at all; it has been around since the 19th century. The <a href="https://en.wikipedia.org/wiki/Rosetta_Stone">Rosetta Stone</a> is an ancient Egyptian stone inscribed with a decree issued in Egypt in 196 BC. The text on the stone is written in three scripts: ancient Egyptian hieroglyphs, Demotic script, and ancient Greek. Ancient Egyptian hieroglyphs were used until the end of the fourth century CE, after which the knowledge of how to read them was lost. For hundreds of years, scholars tried to decipher them. In 1799, the Rosetta Stone was rediscovered near the town of Rosetta in the Nile Delta, and it brought with it a major advancement in the decipherment. It was the recognition that the stone offered three versions of the same text that enabled this advancement, making it the first parallel corpus used for translation (at that time, without machines). The hieroglyphs were finally deciphered in 1822 by the French scholar Jean-François Champollion. The stone is on public display at the British Museum (and is the most interesting exhibit there, in my opinion).<br />
<div>
<br /></div>
</div>
<div dir="ltr" style="text-align: left;">
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQwOyN-Or6X9A3o0apgH6kqRRWF5L9eFKWKhAh8cnKClOAwrqy7ulHs4MDmhxo9eBrRuw-NjSASkduNxQGVS0X9fS28qqEf1E9ImfvSza3oiL8YVF1lbglX28NZ0UlizQQpfziXXOe1dU/s1600/IMG_6488.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQwOyN-Or6X9A3o0apgH6kqRRWF5L9eFKWKhAh8cnKClOAwrqy7ulHs4MDmhxo9eBrRuw-NjSASkduNxQGVS0X9fS28qqEf1E9ImfvSza3oiL8YVF1lbglX28NZ0UlizQQpfziXXOe1dU/s320/IMG_6488.jpg" width="320" /></a></td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgg0eyoW-ydndrlTZgcRhxbQXwnirrNSJp4P2t1DuCTNxIt09yAlW66BvW-ZyT2SO4nACLTKOtZ7VUmBYYYTdcQc8pV8rqKdrKYHPhqV841glrDrkVq8OjPvSw_HQ8LXnm504HTfJcaDFo/s1600/Rosetta_Stone.JPG" imageanchor="1" style="font-size: 12.8px; margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgg0eyoW-ydndrlTZgcRhxbQXwnirrNSJp4P2t1DuCTNxIt09yAlW66BvW-ZyT2SO4nACLTKOtZ7VUmBYYYTdcQc8pV8rqKdrKYHPhqV841glrDrkVq8OjPvSw_HQ8LXnm504HTfJcaDFo/s200/Rosetta_Stone.JPG" width="170" /></a></td>
</tr>
<tr>
<td class="tr-caption" colspan="2" style="text-align: center;">The Rosetta Stone<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
</td>
</tr>
</tbody></table>
<div style="text-align: center;">
<br /></div>
<u>Learning the translation model</u><br />
Using sentence pairs from a parallel corpus as a translation table is nice, but not enough. You can always generate a sentence in the source language that didn't occur in the corpus, so it wouldn't be in the table. However, a sentence is composed of phrases (words and multi-word expressions), so instead of constructing a sentence translation table, a phrase translation table could be built, enabling a phrase-by-phrase translation. If the corpus is large enough, you can assume that it covers at least most of the common words and phrases in these languages.<br />
<br />
This is what an excerpt from a phrase table from English to Hebrew might look like:<br />
<br />
<table align="center" border="1"><tbody>
<tr><td><b>source</b></td><td><b>target</b></td><td><b>score</b></td></tr>
<tr><td>day</td><td>ŚŚŚ</td><td>1.0</td></tr>
<tr><td>April</td><td>ŚŚ€ŚšŚŚ</td><td>1.0</td></tr>
<tr><td>bright</td><td>ŚŠŚ</td><td>0.58</td></tr>
<tr><td>bright</td><td>ŚŚŚŚš</td><td>0.42</td></tr>
<tr><td>cold</td><td>Ś§Śš</td><td>0.7</td></tr>
<tr><td>cold</td><td>ŚŠŚŚ Ś</td><td>0.3</td></tr>
<tr><td>thirteen</td><td>Ś©ŚŚŚ© ŚąŚ©ŚšŚ</td><td>0.41</td></tr>
<tr><td>thirteen</td><td>Ś©ŚŚŚ©Ś ŚąŚ©Śš</td><td>0.21</td></tr>
<tr><td>thirteen</td><td>13</td><td>0.38</td></tr>
</tbody></table>
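In code, such a phrase table is essentially a mapping from a source phrase to its scored target candidates. Here is a minimal Python sketch of that structure, using transliterated target phrases and the scores from the table above (the dictionary layout and the helper function are illustrative, not a real system's format):

```python
# A hypothetical in-memory phrase table: source phrase -> list of
# (target phrase, translation probability). Targets are transliterated
# Hebrew; the scores match the example table in the post.
phrase_table = {
    "bright": [("tsakh", 0.58), ("bahir", 0.42)],
    "cold": [("kar", 0.7), ("tsonen", 0.3)],
    "thirteen": [("shlosh esre", 0.41), ("13", 0.38), ("shlosha asar", 0.21)],
}

def translations(phrase):
    """Candidate target phrases for a source phrase, most probable first."""
    return sorted(phrase_table.get(phrase, []), key=lambda pair: -pair[1])

print(translations("cold"))  # most likely candidate listed first
```

A source phrase that never occurred in the corpus simply has no entry, which is exactly why phrase-level (rather than sentence-level) tables are needed.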
<div>
Each entry contains a source language phrase, a target language phrase and the score (probability) of translating the source phrase to the target phrase. These are not trivial to compute, since the corpus is aligned at the sentence level. All we know is that "<i>ŚŚŚ ŚŚ€ŚšŚŚ ŚŠŚ ŚŚŠŚŚ Ś, ŚŚ©ŚąŚŚ ŚŚ ŚŚŠŚŚŠŚŚŚ Ś©ŚŚŚ©-ŚąŚ©ŚšŚ</i>" is a (possible) translation of "<i>It was a bright cold day in April, and the clocks were striking thirteen"</i>, but we don't know which words in English are translated to which words in Hebrew. The assumption is that each word in the source sentence is translated to 0, 1 or more words in the target language. In the simple case, it is translated to one word. In other cases, a word may disappear in translation (for example, the determiner "<i>a</i>" in English doesn't exist in Hebrew) or be translated to a multi-word phrase (e.g. the word "<i>thirteen</i>" is translated to "<i>Ś©ŚŚŚ© ŚąŚ©ŚšŚ</i>").<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiV2siT4nj9TUxYtWQmpie91dM_bfqL_c7YTJjvlrtkxtRUtYrGjKuj_he-pBNQ6C_h3mFumXCU1K2tAj19oR1J-48MKA60EZ9Dgw8LjO8afozIYmvnkMfaZ-11XcC5eDsQtcqvw0NM7II/s1600/alignment.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiV2siT4nj9TUxYtWQmpie91dM_bfqL_c7YTJjvlrtkxtRUtYrGjKuj_he-pBNQ6C_h3mFumXCU1K2tAj19oR1J-48MKA60EZ9Dgw8LjO8afozIYmvnkMfaZ-11XcC5eDsQtcqvw0NM7II/s1600/alignment.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The word-level alignment of a sentence-pair.</td></tr>
</tbody></table>
The solution is, again, to use statistical methods. In particular, aligning these sentence pairs at the word level using the corpus statistics. The most basic alignment model is <b>IBM model 1</b>. It goes over all the sentence pairs in the corpus, and counts for each source word its occurrences in the same sentence pair with target words - since every target word could be its translation. In the example sentences-pair, the Hebrew word <i>ŚŚŚ</i> is counted once with every one of the English words <i>It, was, a, bright, cold, day, in, April, and, the, clocks, were, striking, thirteen.</i> If it appears in another sentence pair, for example, "<i>ŚŚŚŚ ŚŚŚ ŚŚ€Ś</i>" and "<i>what a beautiful day</i>", the word <i>day</i> will have two occurrences with <i>ŚŚŚ. </i>Since this is the true translation, the word <i>day</i> will occur in every sentence pair in which the word <i>ŚŚŚ </i>occurs. These counts are used to estimate the probability of translating the source word to a target word. In some cases, an English word may have several possible translations, such as <i>cold</i> that could be translated both to <i>ŚŠŚŚ Ś</i> and <i>Ś§Śš. </i>In this case, the English word <i>cold</i> will appear in some cases with <i>ŚŠŚŚ Ś</i> and in others with <i>Ś§Śš. </i>The probability will be computed accordingly (and will be higher for the more common translation).<br />
<br />
This is the basic model, and there are other IBM models (2-5) that handle some of the problems that the basic model doesn't solve (e.g. considering the distance between aligned words). This phase's output is a word-to-word table, and then another algorithm is applied to create a phrase table, merging multi-word expressions into one phrase (e.g. "<i>hot dog</i>", which is translated differently from "<i>hot</i>" and "<i>dog</i>"). </div>
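The counting scheme of IBM Model 1 can be sketched as a small expectation-maximization loop. Below is a minimal Python illustration on an invented three-sentence English-German toy corpus; the corpus, the number of iterations, and the uniform initialization are all simplifications:

```python
from collections import defaultdict

# Toy parallel corpus (English -> toy German); a real system would use
# millions of aligned sentence pairs.
corpus = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]

# t[(f, e)] ~ P(target word f | source word e). Starting every pair at 1.0
# acts as a uniform initialization, since the E-step normalizes per sentence.
t = defaultdict(lambda: 1.0)

for _ in range(20):                         # EM iterations
    count = defaultdict(float)              # expected co-occurrence counts
    total = defaultdict(float)
    for es, fs in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)  # normalize over the sentence
            for e in es:
                c = t[(f, e)] / z           # expected count (E-step)
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():         # re-estimate (M-step)
        t[(f, e)] = c / total[e]

# "buch" should now align far more strongly with "book" than with "the",
# because "book" appears in every sentence pair that contains "buch".
print(round(t[("buch", "book")], 2), round(t[("buch", "the")], 2))
```

The ambiguity resolves exactly as described in the post: words that co-occur consistently (like <i>buch</i> and <i>book</i>) accumulate probability mass, while spurious co-occurrences (like <i>buch</i> and <i>the</i>) fade.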
<br />
<u>Putting it all together</u><br />
The decoder is responsible for performing the actual translation: given the source sentence, it constructs a new sentence in the target language, using the translation model to offer phrase translations and their scores, and the language model to rank the fluency of the translation.<br />
<br />
There are multiple ways to segment the source sentence to phrases (e.g., should "<i>hot dog</i>" be regarded as a phrase, or segmented to "<i>hot</i>" and "<i>dog</i>"?), and in most cases there are also multiple ways to translate each phrase in the source language to a phrase in the target language (e.g., should "<i>cold</i>" be translated to "<i>ŚŠŚŚ Ś</i>" or to "<i>Ś§Śš</i>"?). In addition, the phrases in the target language may be re-ordered to follow grammar rules in the target language (e.g. adjective before noun in English, but after noun in many languages such as Hebrew, Romanian and French). The decoder tries many of these segmentations, translations and orders and produces candidate translations.<br />
<br />
Each candidate translation is scored by three components: the language model scores the translation according to its fluency in the target language. The re-ordering model (which we haven't discussed in detail) gives a score based on the changes in the order of words between the two languages. The last score is the one given by the translation model. Each phrase-to-phrase translation score is the probability of translating one phrase to the other, so the translation model's score for the entire sentence is the product of all phrase translation scores. For example, if the source sentence is "It's not cold in April":<br />
<br />
<b>score(<i>ŚŚ Ś§Śš ŚŚŚ€ŚšŚŚ</i>)</b> = TM(<i>ŚŚ</i>, <i>not</i>) · TM(<i>Ś§Śš</i>, <i>cold</i>) · TM(<i>Ś</i>, <i>in</i>) · TM(<i>ŚŚ€ŚšŚŚ</i>, <i>April</i>) · LM(<i>ŚŚ Ś§Śš ŚŚŚ€ŚšŚŚ</i>) · RM(<i>ŚŚ Ś§Śš ŚŚŚ€ŚšŚŚ</i>, <i>It's not cold in April</i>)<br />
<br />
And eventually the decoder would select the candidate translation with the highest score it could find.<br />
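This candidate-scoring step can be sketched in a few lines of Python. Everything below is invented for illustration (the target phrases are transliterated, the probabilities are made up, and the re-ordering score is omitted); a real decoder also searches over segmentations and word orders rather than enumerating a fixed list:

```python
from math import prod

# Hypothetical phrase-translation probabilities TM[(source, target)]
# and toy language-model scores LM[target sentence]. All numbers invented.
TM = {("not", "lo"): 1.0,
      ("cold", "kar"): 0.7, ("cold", "tsonen"): 0.3,
      ("in April", "be'april"): 1.0}
LM = {"lo kar be'april": 0.010,     # fluent phrasing
      "lo tsonen be'april": 0.002}  # grammatical, but less common

def score(candidate):
    """Product of the phrase-translation scores and the LM score."""
    target = " ".join(t for _, t in candidate)
    return prod(TM[pair] for pair in candidate) * LM[target]

# Two candidate translations of "It's not cold in April",
# each a list of (source phrase, target phrase) pairs.
candidates = [
    [("not", "lo"), ("cold", "kar"), ("in April", "be'april")],
    [("not", "lo"), ("cold", "tsonen"), ("in April", "be'april")],
]
best = max(candidates, key=score)
print(" ".join(t for _, t in best))
```

The decoder keeps whichever candidate maximizes this combined score, which is exactly the selection rule described above.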
<div>
<div>
<br /></div>
<div>
<br /></div>
<div>
As always, I'll end the post by hedging: I really haven't presented the entire world of translation, just given you a taste of it. I tried to simplify the basic models I told you about, but they are a bit less simple than I described. Also, there are newer and more accurate models that involve machine learning techniques, or consider the syntax of the source and target sentences. I hope I could convey the basics clearly and interestingly enough :)</div>
</div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com0tag:blogger.com,1999:blog-9145120678290195131.post-20098040950096747992015-09-12T19:26:00.000+03:002015-09-12T20:21:43.085+03:00Language Models<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
</div>
<div dir="ltr" style="text-align: left;">
<span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;">In my previous post about </span><a href="http://veredshwartz.blogspot.co.il/2015/08/machine-translation-1-overview.html" style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">Machine Translation</a>, <span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;">I mentioned how language models are used in </span><span style="background-color: white; font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">statistical machine translation</span><span style="background-color: white; font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">. Language models are used in many NLP applications. In this post, I will explain about language models in general, and how to learn a certain kind of language models: n-gram language models.</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span>
<span style="font-family: 'Trebuchet MS', sans-serif;">A language model is for a specific language, for example, an English language model. It receives as input a sequence of words in English (sentence / phrase / word). For simplicity, let's say it receives a sentence. The language model score for a sentence </span><i style="font-family: 'Trebuchet MS', sans-serif;">s</i><span style="font-family: 'Trebuchet MS', sans-serif;">, P(s), is a score between 0 and 1, that can be interpreted as the</span><span style="background-color: white; font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;"> <a href="http://veredshwartz.blogspot.co.il/2015/08/probability_21.html">probability</a> of composing this sentence in English</span><span style="background-color: white; font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">. This score determines </span><span style="background-color: white; font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">how fluent <i>s</i> is in English; the higher the score, the more
fluent the sentence is. Language models can capture some interesting language </span><span style="background-color: white; font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;">phenomena:</span> </span><br />
<ul style="text-align: left;">
<li><span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;">Which sentence is grammatically correct? - </span><span style="background-color: white; line-height: 18.4799995422363px;">P("he eat pizza") < P("he eats pizza")</span><span style="background-color: white; line-height: 18.4799995422363px;"> </span></span></li>
</ul>
<ul style="text-align: left;">
<li><span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;">Which word order is correct? - </span><span style="background-color: white; line-height: 18.4799995422363px;">P("love I cats") < P("I love cats")</span></span></li>
</ul>
<div>
<span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;">and even some logic and world knowledge:</span></div>
<ul style="text-align: left;">
<li><span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;">What is more likely? - P("good British food") < P("good Italian food")</span></li>
</ul>
<div>
<div dir="ltr">
<span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;"><br /></span>
<span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;">It can also tell you that pdf is the fourth largest religion:</span><br />
<span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;"><br /></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQu3ij6FdLWssBT__mKR2XnCT_wArQpJSCWRlArDLcQ2sa3V2ibrgGpzoikfD7gUuq0SU2YGS2owxIk-866XI962PodJi4-TesTXgtgg2OBjsluz0oniDhPa4dNU-fcn6paInC6Asi-Fk/s1600/5flmIpp.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQu3ij6FdLWssBT__mKR2XnCT_wArQpJSCWRlArDLcQ2sa3V2ibrgGpzoikfD7gUuq0SU2YGS2owxIk-866XI962PodJi4-TesTXgtgg2OBjsluz0oniDhPa4dNU-fcn6paInC6Asi-Fk/s1600/5flmIpp.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 12.8000001907349px;">Google suggests words that are likely to complete the query. From <a href="http://imgur.com/gallery/5flmIpp">here</a>.</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<br />
<u style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">Learning a language model</u><br />
<span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;">What does it take to build such a language model? Just a large English text corpus (</span><span style="line-height: 18.4799995422363px;">a large and structured set of texts). We are interested in the probability of sentences, words and phrases in the language, but we don't know the <b>real </b></span><span style="line-height: 18.4799995422363px;">distribution of words and sentences in the language. We can use a large-enough corpus to <b>estimate </b>this probability. The basic method is to use relative frequency (<a href="http://veredshwartz.blogspot.co.il/2015/08/probability_21.html">Maximum Likelihood</a>). </span><span style="line-height: 18.4799995422363px;">The probability of a certain word </span><i style="line-height: 18.4799995422363px;">w</i><span style="line-height: 18.4799995422363px;"> to occur in English, p(w) is approximated by the ratio of the occurrences of </span><i style="line-height: 18.4799995422363px;">w</i><span style="line-height: 18.4799995422363px;"> in the corpus (the number of occurrences of </span><i style="line-height: 18.4799995422363px;">w</i><span style="line-height: 18.4799995422363px;"> / the number of any word occurrence). For example: </span></span><br />
<ul style="text-align: left;">
<li><span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;">The word <i>cat</i> occurred 3853 times, out of a total of 100,000,000 words, so its estimated probability is 0.00003853.</span></span></li>
<li><span style="font-family: Trebuchet MS, sans-serif;">The word <i>no</i>, on the other hand, occurs more frequently: 226,985 times. So its probability is 0.00226985, and therefore when you compose a sentence in English, you are much more likely to say the word <i>no</i> than <i>cat</i>.</span></li>
</ul>
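This relative-frequency estimate takes only a few lines of Python. The toy corpus below is made up; real estimates, like those in the bullets above, come from corpora of hundreds of millions of words:

```python
from collections import Counter

# A tiny stand-in corpus; a real model would be estimated from a much
# larger text collection.
corpus = "i love my cat my cat loves me and no one says no to my cat".split()
counts = Counter(corpus)
total = len(corpus)

def p(word):
    """Maximum-likelihood (relative-frequency) estimate of P(word)."""
    return counts[word] / total

print(p("cat"), p("no"))  # 3/16 and 2/16 in this 16-word corpus
```

A word that never occurred gets probability zero under this estimate, a limitation that smoothing techniques address in practice.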
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;">But we are also interested in computing the probability of multi-word expressions, phrases and sentences. Since any of them is simply a sequence of words, we can use the chain rule to compute the probability:</span><sup style="line-height: 18.4799995422363px;"><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#1" name="top1">1</a> </sup></span></div>
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
<div dir="ltr" style="text-align: center;">
<span style="font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;">(1) P(A<sub>1</sub>,A<sub>2</sub>,...,A<sub>m</sub>) = P(A<sub>1</sub>) P(A<sub>2</sub>|A<sub>1</sub>) ... P(A<sub>m</sub>|A<sub>1</sub>,A<sub>2</sub>,...,A<sub>m-1</sub>)</span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br />
</span><br />
<div style="text-align: left;">
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;">where </span></span><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;">P(</span><span style="line-height: 18.4799995422363px;">A</span><sub>i</sub><span style="line-height: 18.4799995422363px;">|</span><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">1</sub><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">2,...,</sub><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">i-1</sub><span style="line-height: 18.4799995422363px;">) denotes the probability that the word </span></span></span></span></span><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;">A</span><sub>i</sub></span></span></span><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"> is the next word in the sequence </span></span></span></span><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">1</sub><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">2,...,</sub><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">i-1</sub></span></span></span><span style="line-height: 18.4799995422363px;"><span style="line-height: 
18.4799995422363px;">.</span><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"> For example, P(</span></span></span><span style="background-color: white; line-height: 18.4799995422363px;">I love my cat) = P(I) P(love|I) P(my |I love) P(cat |I love my). We can assume that the words are independent of each other, and get a much simpler formula: </span></div>
</div>
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
<div dir="ltr" style="text-align: center;">
<span style="font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;">(2) P(A<sub>1,</sub></span><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">2,...,</sub><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">m</sub><span style="line-height: 18.4799995422363px;">) = </span><span style="line-height: 18.4799995422363px;">P(A</span><sub style="line-height: 18.4799995422363px;">1</sub><span style="line-height: 18.4799995422363px;">) P(</span><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">2</sub><span style="line-height: 18.4799995422363px;">) ... P(</span><span style="line-height: 18.4799995422363px;">A</span><sub>m</sub><span style="line-height: 18.4799995422363px;">)</span></span></div>
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;">So whenever you pick an extra word to continue your sentence, you choose it by its distribution in the language and regardless of the previous words. This doesn't make much sense though. The probability of the word <i>cat </i>is</span> lower than that of the word <i>no</i>. However, in the context of the incomplete sentence "I love my", the word <span style="background-color: white; line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><i>cat </i>is</span></span> much more likely to complete the sentence than the word <i>no</i>. </span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br />
<span style="background-color: white; line-height: 18.4799995422363px;">To estimate the conditional probability of a word A<sub>i</sub> (<i>cat</i>) given any number of preceding words A<sub>1</sub>,A<sub>2</sub>,...,A<sub>i-1</sub> (<i>I love my</i>), we need to count the number of occurrences of A<sub>i</sub> after A<sub>1</sub>,A<sub>2</sub>,...,A<sub>i-1</sub> (how many times the sentence "<i>I love my cat</i>" appears in the corpus) and divide it by the number of times that A<sub>1</sub>,A<sub>2</sub>,...,A<sub>i-1</sub> occurred with any following word (how many times "<i>I love my *</i>" appears in the corpus, for any word *). You would expect P(cat|I love my) to be higher than P(no|I love my).</span></span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;">You would also see that the conditional probability P(cat|I love my) is different from the prior probability P(cat). I'm not sure whether it would be higher, but I'm sure that P(cat|Persian) > P(cat): you are more likely to say "cat" if you already said "Persian" than just like that, out of the blue.</span></span></div>
</div>
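The counting scheme above can be sketched in a few lines of Python. This is a toy illustration with a three-sentence corpus and made-up function names, not the code behind the demo below:

```python
from collections import defaultdict

# Toy corpus; a real language model is estimated from a much larger one.
corpus = [
    "i love my cat".split(),
    "i love my dog".split(),
    "i love my cat".split(),
]

# Count each history (the preceding words) and each history + next-word sequence.
history_counts = defaultdict(int)
sequence_counts = defaultdict(int)
for sentence in corpus:
    for i in range(1, len(sentence)):
        history = tuple(sentence[:i])
        history_counts[history] += 1
        sequence_counts[history + (sentence[i],)] += 1

def cond_prob(word, history):
    """P(word | history): count(history followed by word) / count(history followed by anything)."""
    history = tuple(history)
    if history_counts[history] == 0:
        return 0.0
    return sequence_counts[history + (word,)] / history_counts[history]

print(cond_prob("cat", "i love my".split()))  # 0.6666666666666666 ("cat" follows in 2 of 3 cases)
print(cond_prob("no", "i love my".split()))   # 0.0
```

Note that with longer histories most counts are zero, which is exactly the sparsity problem discussed next.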
<span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;"><span style="background-color: white; line-height: 18.4799995422363px;"><br /></span></span>
</span><br />
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;">However, assuming that every word in the sentence depends on all the previous words is not necessary, and it causes a problem of <i>sparsity</i>: there is simply not enough data to estimate the probabilities. To compute the probability of the word <i>cat</i> completing the sentence <i>My friend John once had a black</i>, you would need the sequences "<i>My friend John once had a black</i>" and "<i>My friend John once had a black cat</i>" to actually appear in the corpus. The corpus is big, but it doesn't contain every sentence that anyone has ever said. </span></span></div>
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
<div dir="ltr">
<span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;">What's the solution? The <i>Markov assumption</i>: we can assume that every word depends only on the k preceding words. For example, for k=1 we get:</span></span></span></div>
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
<div dir="ltr" style="text-align: center;">
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;">(3) P(</span></span></span><span style="background-color: white; line-height: 18.4799995422363px;">I love my cat) = P(I) P(love|I) P(my|love) P(cat|my)</span></span></div>
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
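Equation (3) can be turned into code directly. A minimal sketch with a toy corpus (as in footnote 1, the prior P(w1) is approximated here by the unigram frequency, and bigram probabilities are normalized by unigram counts for simplicity):

```python
from collections import Counter

# Toy corpus; a real model would be estimated from a much larger one.
corpus = [
    "i love my cat".split(),
    "i love my dog".split(),
    "my cat is black".split(),
]

unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter(pair for s in corpus for pair in zip(s, s[1:]))
total_words = sum(unigrams.values())

def bigram_sentence_prob(sentence):
    """P(w1..wn) = P(w1) * product of P(wi | wi-1): the k=1 Markov assumption."""
    words = sentence.split()
    prob = unigrams[words[0]] / total_words
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen word: zero probability without smoothing
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

print(bigram_sentence_prob("i love my cat"))  # P(i) P(love|i) P(my|love) P(cat|my)
```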
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;">This kind of language model is called an <i>n-gram language model</i>, where an <i><span class="st">n-gram</span></i><span class="st"> is a contiguous sequence of n words.<sup style="line-height: 18.4799995422363px;"><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#2" name="top2">2</a> </sup> The model works with n-grams, so the assumption is that every word depends on the preceding (n-1) words. </span>For example, a <i>unigram</i> (n=1) language model considers the words independent of each other (P(I love my cat) = P(I) P(love) P(my) P(cat)). A <i>bigram</i> (n=2) language model assumes that every word depends on the previous word (P(I love my cat) = P(I) P(love|I) P(my|love) P(cat|my)). There are also trigram (n=3) and 4-gram language models; larger <i>n</i>s are less commonly used, to the best of my knowledge.</span></span></div>
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;"><br /></span>
<u><span style="line-height: 18.4799995422363px;">Smoothing</span></u><span style="line-height: 18.4799995422363px;"><br /></span><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;">While choosing a small <i>n</i> reduces the <i>sparsity</i>, it doesn't solve the problem completely. Some rare words (e.g. </span></span></span><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><i>absorbefacient, </i>yes, it's an actual <a href="http://www.thefreedictionary.com/absorbefacient">word</a>) or n-grams (e.g. <i>blue wine</i>) may never occur in the corpus, but still be valid in the language. If we use a word's relative frequency as its probability, a word that never occurs in the corpus receives zero probability. If a sentence contains such a word (e.g. <i>I went to the pub and ordered a glass of blue wine</i>), its probability will be zero. While we would probably like this sentence to have a very low probability, we wouldn't want it to be zero; we are aware of the fact that our corpus may be missing some valid English words.</span></span></span></span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><br /></span></span></span></span>
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;">Smoothing solves this problem. The simplest smoothing technique "hallucinates" <i>k</i> additional occurrences of every word in the vocabulary. For example, <i>add-1 smoothing</i> would consider the word <i>absorbefacient</i> to have occurred once (if it hasn't occurred at all in the corpus), and the word <i>cat</i> to have occurred 3854 times (when it actually occurred 3853 times). The new probability is:</span></span><br />
<span style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;"><br /></span>
<br />
<div style="text-align: center;">
<span style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">(4) P(cat) = (3853 + 1) / (100,000,000 + V)</span></div>
<span style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;"><span style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;"><br /></span>
<span style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">where V is the size of the vocabulary (the number of words we "added" to the corpus).</span><br />
The same applies to n-grams. With this new formula, the probability of unseen words (and n-grams) is small, but never zero.</span><br />
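Formula (4) in code, using the post's counts for <i>cat</i> and an assumed vocabulary size V (the 50,000 below is made up for illustration):

```python
from collections import Counter

# The post's example numbers: "cat" occurs 3853 times in a corpus of
# 100,000,000 words. The vocabulary size V below is an assumed value.
counts = Counter({"cat": 3853})
total_words = 100_000_000
V = 50_000  # illustrative vocabulary size

def add_one_prob(word):
    """Add-1 smoothed unigram probability: (count + 1) / (total + V)."""
    return (counts[word] + 1) / (total_words + V)

print(add_one_prob("cat"))             # 3854 / 100,050,000
print(add_one_prob("absorbefacient"))  # unseen, but non-zero: 1 / 100,050,000
```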
<span style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;"><br /></span>
<span style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">And as always, there are more complex smoothing techniques (<span style="font-family: 'Trebuchet MS', sans-serif;"><a href="https://en.wikipedia.org/wiki/Katz%27s_back-off_model">Back-</a><span style="font-family: 'Trebuchet MS', sans-serif;"><a href="https://en.wikipedia.org/wiki/Katz%27s_back-off_model">off</a>, </span></span><a href="https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing">Kneser<span style="font-family: 'Trebuchet MS', sans-serif;">-</span>Ne</a><span style="font-family: 'Trebuchet MS', sans-serif;"><a href="https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing">y</a><span style="font-family: 'Trebuchet MS', sans-serif;"><span style="font-family: 'Trebuchet MS', sans-serif;">, </span></span></span>etc.), that I will not discuss in this post.</span><br />
<br />
<hr />
<span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;">Do you want to try it yourself? I implemented a simple language model for this post.<sup style="line-height: 18.4799995422363px;"><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#3" name="top3">3</a> </sup>Type a sentence, hit the button and you'll get the probability of the sentence (after a while...). Try it!</span><br />
<span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;"><br /></span><iframe height="100" src="https://u.cs.biu.ac.il/~havivv/LanguageModel/LM.html" width="100%"></iframe><span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;"><br /></span><br />
<div>
<br />
<hr />
<br />
<u><span style="font-family: Trebuchet MS, sans-serif;"></span></u>
<u><span style="font-family: Trebuchet MS, sans-serif;">What can you do with language models?</span></u><br />
<span style="font-family: Trebuchet MS, sans-serif;">As the demo shows, you can compute the probability of a sentence in a certain language.</span><br />
<span style="font-family: Trebuchet MS, sans-serif;">As I explained in the <a href="http://veredshwartz.blogspot.co.il/2015/08/machine-translation-1-overview.html">previous post</a>, statistical machine translation systems use a language model of the target language to prefer translations that are more fluent in the target language.</span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br />In the other direction, language models can be used to </span><span style="font-family: Trebuchet MS, sans-serif;">generate the next word in a sequence of words, by sampling from the distribution of words (given the previous word). It can complete your search query or suggest corrections to your text messages. One of the funnest things is to generate a whole new sentence. To illustrate, I used my language model and generated the not-very-sensible sentences</span><sup style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;"><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#4" name="top4">4</a></sup><span style="font-family: 'Trebuchet MS', sans-serif;">:</span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<br />
<table border="1"><tbody>
<tr><td><i>Who indulge yourself.</i></td></tr>
<tr><td><i>The american opinion have attachments of miners from hard lump sum of his search far as to ? and considerable number of the cavity.</i></td></tr>
<tr><td><i>As in Massachusetts.</i></td></tr>
<tr><td><i>To start a pretext for which the orbit and rostov smiled at mrs. rucastle seemed to give the society it must be the European settlement?'' said he was all men were to himself to your monstrosity such things he had already passed in the contrary to which alone , beyond the tarsal bones is she 's eyes were drawn up because he wanted.</i></td></tr>
<tr><td><i>It was about 87,000 soldiers.</i></td></tr>
</tbody></table>
<div style="text-align: center;">
<span style="font-family: Trebuchet MS, sans-serif; font-size: x-small;">sentences generated with a bigram language model</span></div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<span style="font-family: Trebuchet MS, sans-serif;">Better sentences could be generated with larger <i>n</i> or with smarter (not n-gram) language models. Anyway, generating sentences can fill up hours of fun.</span><br />
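The generation procedure described in footnote 4 (start from the special sign and repeatedly sample the next word given the previous one, until a period is sampled) can be sketched like this, with a toy corpus and illustrative names:

```python
import random
from collections import defaultdict

# Toy corpus with an explicit end-of-sentence period; the sentences in the
# table above were generated from a much larger corpus.
corpus = [
    "i love my cat .".split(),
    "i love my dog .".split(),
    "my cat is black .".split(),
]

# next_words[w] lists every word observed right after w (with repetition,
# so uniform sampling from the list follows the bigram distribution).
next_words = defaultdict(list)
for sentence in corpus:
    for prev, cur in zip(["<S>"] + sentence, sentence):
        next_words[prev].append(cur)

def generate(max_len=20):
    """Start from <S> and sample the next word given the previous one,
    until a period is sampled (the procedure from footnote 4)."""
    words, prev = [], "<S>"
    while len(words) < max_len:
        cur = random.choice(next_words[prev])
        words.append(cur)
        if cur == ".":
            break
        prev = cur
    return " ".join(words)

print(generate())  # e.g. "i love my cat ."
```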
<br />
<span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;">Choosing the corpus from which you learn the language model greatly affects the final outcome. Needless to say, choosing the language of the corpus is crucial. If you want a French language model, you need a French corpus, etc. Furthermore, if you base your language model on Shakespeare's writings, and then try to use it to estimate the probabilities of recent posts by your Facebook friends, those posts will probably come out very unlikely. If your corpus is from the medical domain, sentences with medical terms will have a higher probability than those discussing rock bands. So you must choose your corpus carefully according to your needs. For purposes such as machine translation, the corpus should be general and contain texts from diverse domains. However, if you develop a machine translation system for a specific application, e.g. a medical application, you may want your corpus to contain relevant documents, for instance medical documents.</span></span><br />
<br />
<span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;">Having trained your language model on a very specific corpus, e.g. a corpus of recipes or a corpus of all the songs by The Smiths, you can go ahead and generate a whole new sequence of words. If your language model is good enough, you might get a brand new recipe (dare to try it?) or a song by The Smiths that Morrissey has never heard about.<sup style="line-height: 18.4799995422363px;"><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#5" name="top5">5</a> </sup> In fact, language models don't have to be trained on a text corpus at all. You can train them on musical notes and compose a new melody. <a href="http://people.csail.mit.edu/yalesong/6.863-Music.Improviser/">Here</a> are some examples of melodies generated by an n-gram language model trained on musical notes.<sup style="line-height: 18.4799995422363px;"><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#6" name="top6">6</a> </sup></span></span></div>
</div>
<hr />
<div dir="ltr" style="text-align: left;">
</div>
<div dir="ltr" style="text-align: left;">
<span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;"><span style="background-color: white; line-height: 18.4799995422363px;"></span></span>
</span></div>
<div dir="ltr" style="text-align: left;">
<span style="font-family: Trebuchet MS, sans-serif;">
<br />
</span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;">And just so you won't think that n-gram language models are the state of the art: there are other language models, some of which perform much better. Maybe I'll mention some of them in other posts.</span></span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><br /></span></span></span></span>
</div>
<div dir="ltr" style="text-align: left;">
<hr />
<span class="Apple-style-span" style="font-size: x-small;"><span style="font-family: Trebuchet MS, sans-serif;">
<a href="https://www.blogger.com/null" name="1"><b>1</b></a> To be more accurate, P(A<sub>1</sub>) represents the prior probability of A<sub>1</sub> (the probability of the word A<sub>1</sub> occurring in English), while we are interested in the conditional probability of A<sub>1</sub> given the beginning of a sentence. Therefore, the beginning of each sentence in the corpus is marked with a special sign <S>, and P(A<sub>1</sub>) is replaced by P(A<sub>1</sub>|<S>). This was omitted from the rest of the formulas for simplicity. <a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#top1"><sup>↩</sup></a></span></span><br />
<br />
<span class="Apple-style-span" style="font-size: x-small;"><span style="font-family: Trebuchet MS, sans-serif;"><a href="https://www.blogger.com/null" name="2"><b>2</b></a> A good source for n-gram counts is <a href="https://books.google.com/ngrams">Google Ngrams</a>, extracted from Google Books. <sup><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#top2">↩</a></sup></span></span><br />
<br />
<span class="Apple-style-span" style="font-size: x-small;"><span style="font-family: Trebuchet MS, sans-serif;"><a href="https://www.blogger.com/null" name="3"><b>3</b></a> The language model in the demo is a bigram language model with add-1 smoothing. I trained it using the corpus <a href="http://norvig.com/big.txt">big.txt</a> from Peter Norvig's website. <a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#top3"><sup>↩</sup></a></span></span><br />
<br />
<a href="https://www.blogger.com/null" name="4" style="font-family: 'Trebuchet MS', sans-serif; font-size: x-small;"><b>4</b></a><span style="font-family: 'Trebuchet MS', sans-serif; font-size: x-small;"> I started with the special sign <S> and sampled the next word from the distribution given the previous word, until a period was sampled. </span><sup style="font-family: 'Trebuchet MS', sans-serif;"><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#top4">↩</a></sup>
<br />
<br />
<span class="Apple-style-span" style="font-size: x-small;"><span style="font-family: Trebuchet MS, sans-serif;"><a href="https://www.blogger.com/null" name="5"><b>5</b></a> In fact, I tried it, but it didn't work well because the corpus was too small. The Smiths were only active for 5 years and they don't have enough songs. <sup><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#top5">↩</a></sup></span></span><br />
<span class="Apple-style-span" style="font-size: x-small;"><span style="font-family: Trebuchet MS, sans-serif;"><br /></span></span>
<span class="Apple-style-span" style="font-size: x-small;"><span style="font-family: Trebuchet MS, sans-serif;">
<a href="https://www.blogger.com/null" name="6"><b>6</b></a> These examples are taken from <a href="http://www.mit.edu/~6.863/fall2012/projects/writeups/musicngram.pdf">Implementing A Music Improviser Using N-Gram Models</a> by Kristen Felch and Yale Song. They were not the first to implement a musical n-gram model (I found a previous <a href="http://www.academia.edu/3499500/Evolving_Musical_Sequences_with_N-Gram_Based_Trainable_Fitness_Functions">work</a>, and I'm sure there are others), but they published some sample songs that are pretty good. <a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#top6"><sup>↩</sup></a></span></span><br />
<br /></div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com3tag:blogger.com,1999:blog-9145120678290195131.post-40064814802061875132015-08-31T15:56:00.000+03:002015-09-29T18:20:31.156+03:00Machine Translation Overview<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
<div>
Imagine you are at a restaurant in a foreign country, and by trying to avoid tourist traps, you found yourself at a nice local restaurant in a quiet neighborhood, no tourists except for you. The only problem is that the menu is in a foreign language... no English menu. What's the problem, actually? Pick your favorite machine translation system (<a href="https://translate.google.com/#">Google Translate</a>, <a href="http://www.bing.com/translator/">Bing Translator</a>, <a href="https://www.babelfish.com/">BabelFish</a>, etc.) and translate the menu to a language you understand!</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
So, there's no need to elaborate on the motivation for translation. What I would like to do is give you an overview of how this magic works, and some idea of why it doesn't always work as well as you would expect.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIRsdM2qemRBjRK3blGgHxUVYS1lhFisZW_zvKZekCPj62n919Na0Ln9-9CfIOBYQ6GA2Ab7EZOAw-nFWYvyKmeTGJ_jt5uJbM0lXAjKcaRs-d4ECmEGO4_cC25smU1Acmt2ZNNRzdKMg/s1600/Tower+of+Babel.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIRsdM2qemRBjRK3blGgHxUVYS1lhFisZW_zvKZekCPj62n919Na0Ln9-9CfIOBYQ6GA2Ab7EZOAw-nFWYvyKmeTGJ_jt5uJbM0lXAjKcaRs-d4ECmEGO4_cC25smU1Acmt2ZNNRzdKMg/s1600/Tower+of+Babel.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><div style="direction: ltr;">
<a href="https://en.wikipedia.org/wiki/The_Tower_of_Babel">The Tower of Babel</a>, by Pieter Bruegel the Elder. Oil on board, 1563.</div>
</td></tr>
</tbody></table>
<div dir="ltr" style="text-align: left;">
I'm going to focus on statistical machine translation. <i>Translation </i>means taking a sentence in one language (the source language) and producing a sensible sentence in another language (the target language) that has the same meaning. <i>Machine </i>means that it's done by software rather than a human translator. What does <i>statistical </i>mean?</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
It means that rather than coding expert knowledge into software and creating a lexicon and grammatical rules for translation between two specific languages, these systems are based on statistics on texts in the source and target languages. This is what makes it possible to produce translation between any source and target languages without additional effort, and without having to hire someone that actually speaks these languages. The only thing you need is a large amount of text in both languages.<br />
<br /></div>
<div dir="ltr" style="text-align: left;">
<div>
<span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><u>Statistical Machine Translation</u></span></div>
<div>
What makes a translation good?</div>
<div>
<ul style="text-align: left;">
<li>It is as similar as possible in meaning to the original sentence in the source language.</li>
<li>It sounds correct in the target language, e.g., grammatically.</li>
</ul>
</div>
The first demands that the translation is <i>adequate</i> and the second that it is <i>fluent</i>.<br />
<br />
SMT systems have a component for each of these requirements. The <i>translation model</i> makes sure that the translation is <i>adequate</i> and the <i>language model</i> is responsible for the <i>fluency </i>of the translation in the target language.<br />
<br />
<div dir="ltr">
<u>Language Model</u></div>
<div dir="ltr" style="-webkit-text-stroke-width: 0px; font-style: normal; font-variant: normal; letter-spacing: normal; orphans: auto; text-align: left; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;">
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; margin: 0px;">
<span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; font-weight: normal; line-height: 18.4799995422363px;">I mentioned language models in my <a href="http://veredshwartz.blogspot.co.il/2015/07/natural-language-processing.html">NLP overview post</a>. They are used for various NLP applications. A language model (of a specific language, say English) receives as input a sentence in English and returns the</span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; font-weight: normal; line-height: 18.4799995422363px;"> <a href="http://veredshwartz.blogspot.co.il/2015/08/probability_21.html">probability</a> of composing this sentence in the language</span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; font-weight: normal; line-height: 18.4799995422363px;">. This is a score between 0 and 1, </span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; font-weight: normal; line-height: 18.4799995422363px;">determining how fluent a sentence is in English - the higher the score, the more fluent the sentence is. 
Language models (LM) can capture grammatical rules </span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">(LM("she eat pizza") < LM("she <b>eats </b>pizza"))</span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; font-weight: normal; line-height: 18.4799995422363px;">, correct word order </span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; font-weight: normal; line-height: 18.4799995422363px;">(LM("love I cats") < LM("I love cats"))</span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; font-weight: normal; line-height: 18.4799995422363px;">, better word choice (</span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">LM("<b>powerful </b>coffee") < LM("<b>strong </b>coffee"))</span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">,</span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"> and even some logic and world knowledge (LM("good </span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><b>British </b></span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; 
font-size: 13.1999998092651px; line-height: 18.4799995422363px;">food") < LM("good </span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><b>Italian </b></span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">food")). </span><br />
<span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">These models are obtained from a large corpus (structured set of texts) in the target language (e.g. English). <strike>In the next post I will elaborate on how this is done</strike> (edit 12/09/15: you can read in the <a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html">next post</a> how this is done).</span></div>
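To make the fluency comparisons concrete, here is a toy bigram language model in Python in which the corpus attests "she eats pizza" but not "she eat pizza". The corpus and names are illustrative, and the scores are only meaningful relative to each other:

```python
from collections import Counter

# Tiny corpus in which "she eats pizza" is attested but "she eat pizza"
# is not; a real language model would be trained on a large corpus.
corpus = ["she eats pizza", "he eats pasta", "she eats pasta"]

unigrams = Counter(w for s in corpus for w in s.split())
bigrams = Counter(p for s in corpus for p in zip(s.split(), s.split()[1:]))
total = sum(unigrams.values())

def lm_score(sentence):
    """Bigram LM score between 0 and 1: higher means more fluent
    under the corpus statistics (no smoothing, for brevity)."""
    words = sentence.split()
    score = unigrams[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0
        score *= bigrams[(prev, cur)] / unigrams[prev]
    return score

print(lm_score("she eat pizza") < lm_score("she eats pizza"))  # True
```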
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; font-weight: normal; line-height: normal; margin: 0px;">
<span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></div>
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; font-weight: normal; line-height: normal; margin: 0px;">
<span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><u>Translation Model</u></span></div>
<div style="margin: 0px;">
<div style="font-weight: normal;">
<span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">A translation model (from the source language to the target language) receives as input a pair of sentences / words / phrases, one in each language, and returns the probability of translating the first to the second. Like the language model, it gives a score between 0 and 1 that measures how adequate the translation is - the higher the score, the more adequate the translation.</span></div>
<span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">These models are learned from parallel corpora ("corpora" being the plural of "corpus") - pairs of corpora that contain the same texts, one in the source language and one in the target language. <strike>I will elaborate on how this is done in another post</strike> (edit 29/09/15: you can read in <a href="http://veredshwartz.blogspot.co.il/2015/09/translation-models.html">this post</a> how it is done).</span></div>
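In code, you can picture a translation model's phrase table as a nested mapping from source phrases to scored target phrases. A minimal sketch - every phrase and score here is invented for illustration (and the targets are in English rather than Hebrew, for readability); a real table is estimated from parallel corpora and holds millions of entries:

```python
# A toy phrase table, in the spirit of what a translation model stores.
# All phrases and scores are made up for illustration.
phrase_table = {
    "machine translation": {"automatic translation": 0.7, "machine rendering": 0.3},
    "piece of cake":       {"something very easy": 0.6, "slice of cake": 0.4},
}

def tm_score(source_phrase, target_phrase):
    """Translation model score: probability of translating
    source_phrase into target_phrase (0.0 if the pair is unseen)."""
    return phrase_table.get(source_phrase, {}).get(target_phrase, 0.0)

print(tm_score("piece of cake", "slice of cake"))       # 0.4
print(tm_score("piece of cake", "piece of furniture"))  # 0.0
```

Note that for each source phrase the scores of its translations sum to 1, as befits a probability distribution.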
<div style="font-weight: normal; margin: 0px;">
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal;">
<span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></div>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="background-color: white; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><u>Translating</u></span></span></div>
<div style="margin: 0px;">
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif; font-weight: normal;"><span style="background-color: white; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">Given these two components, the language model and the translation model, how does the translation work? </span></span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">The translation model provides a table with words and phrases in the source language and their possible translations to the target language, each with a score. </span><span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif; font-weight: normal;"><span style="background-color: white; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">Given a sentence in the source language, the system uses this table to translate phrases from the source sentence to phrases in the target language. </span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif; font-weight: normal;"><span style="background-color: white; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">There are multiple possible translations for the source sentence; first, because the source sentence can be segmented into phrases in multiple ways. For example, take the sentence </span></span><i style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">Machine translation is a piece of cake. </i><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">The most intuitive thing to do would be to split it into words. This yields a very literal translation (in Hebrew: ŚȘŚšŚŚŚ ŚŚŚŚ Ś ŚŚŚ ŚŚȘŚŚŚ Ś©Ś ŚąŚŚŚ), which doesn't make much sense. The translation table probably also has an entry for the phrase <i>piece of cake</i>, translating it to a word or an idiom with the same meaning in the target language (in Hebrew: Ś§ŚŚ Ś§ŚŚŚȘ. Ask Google).</span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">Second, even for a given segmentation of the source sentence, some phrases have multiple translations in the target language. This happens both because a word in the source language can be </span></span><i>polysemous </i><span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">(i.e. have more than one meaning, e.g. <a href="https://www.google.co.il/search?q=piece+definition">piece</a>), and because one word in the source language can have many synonyms in the target language (e.g. <a href="https://translate.google.com/#en/iw/cake">cake</a>).</span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">The translation system chooses how to segment the source sentence and how to translate each of its phrases to the target language, using the scores that the two models give the translation. It multiplies the translation score for each phrase, and the language model score for the entire target sentence, f</span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">or example:</span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px; text-align: center;"><br /></span>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px; text-align: center;">P(ŚȘŚšŚŚŚ ŚŚŚŚ Ś ŚŚŚ ŚŚȘŚŚŚȘ ŚąŚŚŚ|</span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px; text-align: center;">Machine translation is a piece of cake) = TM(ŚȘŚšŚŚŚ,translation) TM(ŚŚŚŚ Ś,machine) TM(ŚŚŚ,is) TM(ŚŚȘŚŚŚȘ,piece) TM(ŚąŚŚŚ,cake) LM(ŚȘŚšŚŚŚ ŚŚŚŚ Ś ŚŚŚ ŚŚȘŚŚŚȘ ŚąŚŚŚ)</span><br />
<div style="text-align: center;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></div>
<div style="text-align: left;">
<div style="text-align: left;">
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">This score could be understood as the conditional probability of translating </span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;"><i>Machine translation is a piece of cake</i> to </span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">ŚȘŚšŚŚŚ ŚŚŚŚ Ś ŚŚŚ ŚŚȘŚŚŚȘ ŚąŚŚŚ, but </span><span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">I'll spare you the formulas. The intuition behind multiplying the scores of the different translation components is the joint probability of independent events: if we assume the phrase translations are independent of each other, the probability that they all occur together is the product of their individual probabilities.</span></span></div>
<div style="text-align: left;">
<br /></div>
</div>
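The scoring above can be sketched in a few lines of Python. The two candidate translations and all the numbers are invented for illustration; the point is only the arithmetic - multiply the phrase-level translation scores with the language model score of the whole target sentence, and keep the candidate with the highest product:

```python
from math import prod

# Each candidate: (target sentence label, per-phrase TM scores, LM score).
# All numbers are invented for illustration.
candidates = [
    # Word-by-word segmentation: decent phrase scores, but the LM
    # dislikes the disfluent literal output.
    ("literal word-by-word translation", [0.9, 0.8, 0.9, 0.7, 0.6, 0.5, 0.6], 0.05),
    # Segmentation that keeps "piece of cake" as one phrase: the
    # idiomatic output is fluent, so the LM score is much higher.
    ("idiomatic translation", [0.9, 0.8, 0.9, 0.4], 0.40),
]

def total_score(tm_scores, lm_score):
    # Product of the phrase translation scores times the LM score of
    # the whole target sentence, as in the formula above.
    return prod(tm_scores) * lm_score

best = max(candidates, key=lambda c: total_score(c[1], c[2]))
print(best[0])  # the idiomatic candidate wins despite a lower TM product
```

In this toy setting the language model is what tips the balance toward the fluent, idiomatic output - exactly its job in a real SMT system.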
<div style="text-align: left;">
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">Some things to note: the word <i>of</i> disappeared from the translation, and the words <i>machine</i> and <i>translation</i> switched places in the target sentence. These things happen and are allowed. Machine translation is a bit more complex than what I've told you. Just a bit :)</span></span></div>
<div style="text-align: left;">
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span></div>
<div style="text-align: left;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">So each possible translation receives a final score, indicating both how adequate the translation is and how fluent it is in the target language, and the system chooses the translation with the highest score. Ironically, Google gets this one wrong.</span></div>
<div style="text-align: left;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8_KC5SI47gqoId-h0e3bOIqpAj8O_AkjHSX1Ylk6WiAgzhA1lwJy0Oam9_fuZVQN1P6fa4AxdgBwM10jPtuKlUWCQOosFYEORAKtP3bBXDX6je_-4P2UNF1O4wmuf2IYW_J7viqkvhRg/s1600/piece+of+cake.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="74" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8_KC5SI47gqoId-h0e3bOIqpAj8O_AkjHSX1Ylk6WiAgzhA1lwJy0Oam9_fuZVQN1P6fa4AxdgBwM10jPtuKlUWCQOosFYEORAKtP3bBXDX6je_-4P2UNF1O4wmuf2IYW_J7viqkvhRg/s320/piece+of+cake.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Google ironically translates "Machine translation is a piece of cake" incorrectly.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b>Why is it a really bad idea to rely on machine translation when you wish to speak / write in a language that you don't speak?</b></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Because you may say things that you don't mean. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<div style="text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/akbflkF_1zY/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/akbflkF_1zY?feature=player_embedded" width="320"></iframe></div>
<br />
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
I'll give some examples of problems in translation.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
</div>
</div>
</div>
<div>
<div dir="ltr" style="text-align: left;">
<u style="text-align: left;">Ambiguity</u><span style="text-align: left;"> </span><span style="text-align: left;">- as you probably remember, this problem keeps coming back in every NLP task. In translation, the problem is that a </span><i style="text-align: left;">polysemous </i><span style="text-align: left;">word in the source language may be translated to different words in the target language, depending on its sense. For example, <i>wood </i>can be translated into Hebrew as ŚąŚ„ (a piece of a tree) or as ŚŚąŚš (a geographical area with many trees). While a human translator can pick the correct translation according to the context, machines find this more difficult.</span></div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
It gets even worse when you use a polysemous word in its less common meaning. A few months ago, I needed to send an email to the PC (program committee) chairs of the conference where I published my paper. I noticed something funny about my email, and had to check how Google Translate would handle it. My email started with "Dear PC chairs". I translated it to Hebrew (and back to English, for the non-Hebrew speakers in the audience):</div>
<div dir="ltr" style="text-align: left;">
<br style="text-align: left;" />
<span style="text-align: left;">Dear PC chairs => ŚŚĄŚŚŚȘ ŚŚŚ©Ś ŚŚ§ŚšŚŚ => </span><span class="short_text" id="result_box" lang="en" style="text-align: left;" tabindex="-1"><span class="hps">expensive computer chairs</span></span><br />
<span class="short_text" lang="en" style="text-align: left;" tabindex="-1"><span class="hps"><br /></span></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8bLj-i_YNyQ7-WrRWEC1AHD6F8u7r7cnokqIzDSyL0q1MlxSeD-6LLhNT1XBNMf_TqcRTU1sYGQtBPXG32BlvA90WHKOdceYOaeazYM6N3zkRVoqBp4skVAwVFc3vnPGNQsfw293BFvM/s1600/dear+pc+chairs.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="76" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8bLj-i_YNyQ7-WrRWEC1AHD6F8u7r7cnokqIzDSyL0q1MlxSeD-6LLhNT1XBNMf_TqcRTU1sYGQtBPXG32BlvA90WHKOdceYOaeazYM6N3zkRVoqBp4skVAwVFc3vnPGNQsfw293BFvM/s320/dear+pc+chairs.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Don't expect SMT systems to always understand what you mean</td></tr>
</tbody></table>
<br />
So what happened here? The word <i>chair</i> has two meanings; I meant the less common one, <i>chairman</i>, while Google translated it to the more common sense (<i>furniture</i>). Acronyms are much worse when it comes to polysemy, and PC refers, almost 100% of the time, to <i>Personal Computer</i>. On top of that, the adjective <i>dear</i> is translated in Hebrew to ŚŚ§Śš, which means both <i>dear</i> and <i>expensive</i>. Google chose the wrong sense, creating a funny translation. However, given what we know about how SMT systems work, it's understandable that selecting the more common senses of words yields better scores for both the language model and the translation model. I can't blame Google for this translation.<br />
<br />
This is just one example of a problem in machine translation. There are many other problems: different languages have different word order (e.g. the adjective comes before the noun in English, but after the noun in Hebrew, French and many other languages); in some languages nouns have gender while in others they don't; and idioms are really tough for SMT systems - sometimes they are translated literally, like the <i>piece of cake</i> example (when it was part of a sentence).<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0RMz6zyu_yRjwT7HO5_ztR662AiRkJcy_UYmYbhyDxyAHUnieDQA4VrpmzfZirmLhRsYRGYu1G9OTIOhVITItRmLPyrP2fq3T1DGLANijhTminf85KEpeWsDkftpGHSAKU9zg6-4q1bg/s1600/out+of+sight.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="91" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0RMz6zyu_yRjwT7HO5_ztR662AiRkJcy_UYmYbhyDxyAHUnieDQA4VrpmzfZirmLhRsYRGYu1G9OTIOhVITItRmLPyrP2fq3T1DGLANijhTminf85KEpeWsDkftpGHSAKU9zg6-4q1bg/s320/out+of+sight.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A good translation for an idiom.</td></tr>
</tbody></table>
</div>
<div dir="ltr" style="text-align: left;">
<span style="text-align: left;">These problems are handled by more complex machine translation systems that enable word re-ordering and translation at the syntactic level. Nevertheless, as you probably notice from time to time, this task is still far from being performed perfectly.</span></div>
<div dir="ltr" style="text-align: left;">
<hr />
<b><br /></b>
Since machine translation systems are not very accurate, it is very funny to translate a sentence to a random foreign language and back to English several times, and see how you often end up with a totally different sentence (sometimes a meaningless one) at the end of this process. This is what <a href="http://ackuna.com/badtranslator">Bad translator</a> does. I tried it several times, and it was very amusing. Their example from the Ten Commandments inspired me to try other commandments, resulting in very funny bad translations:<br />
<br />
<span style="color: #274e13;">Thou shalt not make unto thee any graven image </span>=><span style="color: #274e13;"> </span><span style="color: red;">You can move the portrait</span></div>
<div dir="ltr" style="text-align: left;">
<div>
<span style="color: #274e13;">Thou shalt not kill </span>=><span style="color: #274e13;"> </span><span style="color: red;">You must remove.</span><br />
<div>
<span style="color: #274e13;">Thou shalt not commit adultery </span>=><span style="color: #274e13;"> </span><span style="color: red;">Because you're here, try three</span></div>
<div>
<span style="color: #274e13;">Thou shalt not steal </span>=><span style="color: #274e13;"> </span><span style="color: red;">woman</span></div>
<br />
And some good ones:<br />
<br />
<span style="color: #274e13;">Remember the sabbath day, to keep it holy</span> =><span style="color: #274e13;"> </span><span style="color: #0b5394;">Don't forget to consider Saturday.</span></div>
<div>
<span style="color: #274e13;">Honour thy father and thy mother </span>=><span style="color: #274e13;"> </span><span style="color: #0b5394;">honor your father and mother</span></div>
<div>
<br /></div>
<div>
You are welcome to try it and post your funny bad translations in the comments!</div>
</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com4tag:blogger.com,1999:blog-9145120678290195131.post-2131191192708466192015-08-21T15:43:00.001+03:002015-08-21T17:46:27.273+03:00Probability<div dir="rtl" style="text-align: right;" trbidi="on">
<div style="text-align: left;">
<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
<div style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">
How likely are you to read this through? If your answer is a <span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">numerical </span>value between 0 and 1, you may skip this post. You already know the material.<br />
<br /></div>
<div style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">
Why am I writing about probability? First of all, because I really LOVE probability. If you don't, I hope that by the end of this post you will like it a little bit more. Second, we use probability in everyday life: when we plan an outdoor activity, we estimate the probability of rain. When we make life decisions, we think of the probable consequences, since we can't tell the future... Unfortunately, most people use probability incorrectly, without a basic understanding of it.</div>
<div style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">
And last, I was about to write a post about Machine Translation, and realized that I can't explain anything without first introducing probability. Probability is widely used in NLP, as in many other computer science fields.</div>
<div style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">
<br /></div>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">Probability is a numerical value between 0 and 1 measuring the likeliness that an event will occur. 0 means that the event will not occur, and 1 means that the event will certainly occur. An intuitive example is tossing a coin; there are two possible <i>outcomes</i>: "heads" and "tails". If the coin is fair, the probability (chance) of each outcome is œ (50%): P(heads) = P(tails) = œ.</span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcK-_OvZeLdpdIK2dbG3TqFlLSFm1X__BJMdsWMp3PsYwPpJHWYnx4JmlMeNOlQ4WyC1RAkjPGVObh8ctjRQ5cZ92n5Ja0m7soqhwBF5GYR69n4EfF64Z0KisgYBPnPT2ZqG01IMkpB9I/s1600/coin.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcK-_OvZeLdpdIK2dbG3TqFlLSFm1X__BJMdsWMp3PsYwPpJHWYnx4JmlMeNOlQ4WyC1RAkjPGVObh8ctjRQ5cZ92n5Ja0m7soqhwBF5GYR69n4EfF64Z0KisgYBPnPT2ZqG01IMkpB9I/s1600/coin.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A fair coin and the two outcomes of its tossing.</td></tr>
</tbody></table>
<br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">In general, when conducting an experiment, such as tossing a coin, there are several possible outcomes (e.g. { heads, tails }). Every outcome's event (e.g. "the coin landed on heads") has a probability between 0 and 1. Since every experiment must have an outcome, the probability that any of the possible outcomes occurred is 1. In this example, P("heads or tails") = 1. This event represents the entire <i>"probability space"</i>.</span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">As you can see from the example above, an event can be composed of several outcomes. Think about another experiment: rolling a die. The possible outcomes are: {1, 2, 3, 4, 5, 6}. The event "the outcome is an odd number" is composed of the outcomes 1, 3, and 5. We can write it as A={1,3,5}.</span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBx3Ux59_k6Iy5KumuJ4aQ8I_NG7Yzt5fZ1lm_lA8TdJbdQV5Ok2lyezt2FtlPpkiZOu9pM0_yLw_imL2ekXaH5Kj9ZFccrRKHMg4yCEGvtmlcbAjv4gmPxN19rp6ENxKbzNV_Ld1FzYc/s1600/die.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="128" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBx3Ux59_k6Iy5KumuJ4aQ8I_NG7Yzt5fZ1lm_lA8TdJbdQV5Ok2lyezt2FtlPpkiZOu9pM0_yLw_imL2ekXaH5Kj9ZFccrRKHMg4yCEGvtmlcbAjv4gmPxN19rp6ENxKbzNV_Ld1FzYc/s200/die.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A die with 3 out of 6 possible outcomes showing.</td></tr>
</tbody></table>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">If two events have no common outcomes, they are called <i>disjoint</i>, and the probability that any one of them will occur is the sum of the probabilities that each of them will occur. For example, A={1} and B={2}. P(A or B), denoted P(A âȘ B), is the probability that either A or B occurred. We know that a die can only show one number, so A and B can't both occur, and:</span><br />
<div style="text-align: center;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">P(A âȘ B) = P(A) + P(B).</span></div>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">We already know that the probability of the entire <i>probability space</i> (all possible outcomes) is 1, so: </span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">P({1, 2, 3, 4, 5, 6}) = 1. We also know that the events are disjoint, therefore P({1, 2, 3, 4, 5, 6}) equals the sum of the probabilities of all the events. If the die is fair (it is not biased towards a certain outcome), then the probability of every outcome is equal. Therefore, P(1) = ... = P(6) = ⅙. This is called a <i>uniform distribution</i>. In most real-world examples, this is not the case, otherwise, probability would have been boring (and probability is fascinating! Really!).</span><br />
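As a small sanity check, here is the fair die in Python, using exact fractions so nothing is lost to rounding (the helper names are mine):

```python
from fractions import Fraction

# A fair die: uniform distribution over six outcomes.
p = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

# The whole probability space: the outcomes are disjoint, so the
# probability that any of them occurs is the sum, which must be 1.
assert sum(p.values()) == 1

# An event is a set of outcomes; its probability is the sum of theirs.
def prob(event):
    return sum(p[o] for o in event)

print(prob({1, 3, 5}))  # P(odd) = 1/2
```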
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">Every event has a <i>complement</i>. For example, the event A="the die shows an odd number"={1, 3, 5} has a complement AÌ„="the die shows an even number"={2, 4, 6}. The event B={1} has a complement BÌ„={2, 3, 4, 5, 6}. The complement of an event is all the other possible outcomes. Now you must notice that by definition, "A or AÌ„" is the entire probability space, and A and AÌ„ </span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">are disjoint. Using the two properties we've just discussed:</span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">1) the probability of the entire probability space is 1.</span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">2) the probability that any of disjoint events occurred is the sum of their probabilities.</span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">We can tell that P(A) + P(AÌ„) = 1. So if you know the probability that it would rain tomorrow P(R), you also know the probability that it won't rain tomorrow: 1 - P(R).</span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><u>Joint & </u></span></span><u style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">Conditional</u><u style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"> Probabilities</u><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">We can also discuss the <i>joint probability</i> of events A and B: this is the probability that both events will occur. For example, what is the probability of rolling an even number which is bigger than 2? Let's define two events. A is the event of even outcomes: A = {2, 4, 6}, and B is the event of outcomes larger than 2: B = {3, 4, 5, 6}. Then C is the intersection of A and B, denoted A ∩ B: it contains all the outcomes that are both even (in A) and larger than 2 (in B): C = {4, 6}. Since {4} and {6} are disjoint, and the probability of each outcome is ⅙, then P(C), which is also denoted as the joint probability of A and B, P(A, B) = P({4}) + P({6}) = ⅙ + ⅙ = ⅓.</span></span><br />
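To make this concrete, here is a minimal Python sketch (mine, not from the original post) that enumerates the outcomes of a fair die and verifies the joint probability by summing over the intersection:

```python
from fractions import Fraction

# Sample space of a fair die; each outcome has probability 1/6.
p = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

A = {2, 4, 6}      # even outcomes
B = {3, 4, 5, 6}   # outcomes larger than 2

# Joint probability P(A, B) = P(A ∩ B): sum the probabilities
# of the outcomes that belong to both events.
joint = sum(p[o] for o in A & B)
print(joint)  # 1/3
```

Using `Fraction` instead of floats keeps the arithmetic exact, so the result prints as 1/3 rather than 0.3333....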
<br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">Say that you know event A occurred, for example, you know that it rains today. Does it change the probability that another event B will occur, for example, that you will be late to work today? The probability that event B occurs, given that event A occurred, is the <i>conditional probability</i> P(B|A) (B given A). If A and B are <i>dependent</i>, this probability is different from the <i>prior probability</i> of B, P(B) (the probability of B, without having any knowledge about A). The conditional probability of B given A is the ratio between the probability that A and B occur together and the probability that A occurs:</span></span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<br />
<div style="text-align: center;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">(1) P(B|A) = P(A,B) / P(A)</span></div>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">For example, when rolling a die, let A = {1, 3, 5} (odd outcome) and B = {4, 5, 6} (outcome greater than 3). Then P(A,B) = P(A ∩ B) = P({5}) = ⅙. P(A) = ⅙ + ⅙ + ⅙ = ½.</span></span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">Therefore, P(B|A) = P(A,B) / P(A) = ⅙ / ½ = ⅓ &lt; P(B) = P({4, 5, 6}) = ⅙ + ⅙ + ⅙ = ½. So if you know that the outcome was odd, the probability that it was greater than 3 has decreased.</span><br />
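The same die example can be checked with a short Python sketch (illustrative, not from the original post) that computes equation (1) directly:

```python
from fractions import Fraction

# Fair die: every outcome in {1, ..., 6} has probability 1/6.
p = {o: Fraction(1, 6) for o in range(1, 7)}

A = {1, 3, 5}   # odd outcome
B = {4, 5, 6}   # outcome greater than 3

def P(event):
    """Probability of an event = sum over its outcomes."""
    return sum(p[o] for o in event)

# Equation (1): P(B|A) = P(A,B) / P(A)
p_b_given_a = P(A & B) / P(A)
print(p_b_given_a, P(B))  # 1/3 1/2 -- knowing A lowered the probability of B
```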
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">On the other hand, some events may be <i>independent</i>. For example, if B is the event of outcomes larger than 2: B = {3, 4, 5, 6}, and A remains the same, then P(A,B) = P(A ∩ B) = P({3, 5}) = ⅙ + ⅙ = ⅓. P(A) remains ½, and P(B|A) = P(A,B) / P(A) = ⅓ / ½ = ⅔ = P(B) = P({3, 4, 5, 6}) = ⅙ + ⅙ + ⅙ + ⅙ = ⅔. So knowing that the outcome was odd doesn't affect the chances that the outcome is greater than 2, and A and B are <i>independent</i>.</span><br />
<i style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></i>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">If two events A and B are <i>independent</i>, then P(B|A) = P(B), P(A|B) = P(A), and using equation (1) we get that P(A,B) = P(A)P(B). So if you know that two events are independent, and you want to know the probability that both of them will occur, you need to multiply the probabilities that each of them will occur. For example, what is the probability that a die will have an odd outcome (A) and a coin will show heads (B)? Intuitively, these two experiments are independent, so P(A) = ½, P(B) = ½, and P(A,B) = ½ * ½ = ¼. But don't trust your intuition, and always make sure that these events are really independent. Sometimes two events seem independent while they are actually not (as in the <a href="https://en.wikipedia.org/wiki/Butterfly_effect">butterfly effect</a>).</span></span><br />
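For events with small sample spaces, independence can be verified by enumeration rather than intuition. Here is a sketch of mine (not from the original post) for the die-and-coin example, enumerating all twelve equally likely (die, coin) pairs:

```python
from fractions import Fraction
from itertools import product

# Joint sample space of one die roll and one coin flip:
# all 12 (die, coin) pairs are equally likely.
outcomes = list(product(range(1, 7), ["H", "T"]))
p_each = Fraction(1, len(outcomes))

A = [(d, c) for d, c in outcomes if d % 2 == 1]  # die is odd
B = [(d, c) for d, c in outcomes if c == "H"]    # coin shows heads

def P(event):
    return p_each * len(event)

joint = P([o for o in A if o in B])
print(P(A), P(B), joint)  # 1/2 1/2 1/4 -> P(A,B) = P(A) * P(B)
```

Since the product of the marginal probabilities equals the joint probability, A and B are indeed independent.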
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPtwIeJXN8VaffqnNXGSWzXuljFn-PGbm0NUdKiOvdZb34EwwDxI17M_nYAsKXL8FIyfr6nSBD-b78V27-tC_9_uxX-GeJeq85sH9iEj413pNHQaPwflS3R-BJZrt5C5bs92zTsuuw6jc/s1600/butterfly.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPtwIeJXN8VaffqnNXGSWzXuljFn-PGbm0NUdKiOvdZb34EwwDxI17M_nYAsKXL8FIyfr6nSBD-b78V27-tC_9_uxX-GeJeq85sH9iEj413pNHQaPwflS3R-BJZrt5C5bs92zTsuuw6jc/s200/butterfly.jpg" width="188" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">If this butterfly flapped its wings yesterday, what are the chances I will be late to work next week?</td></tr>
</tbody></table>
<i style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></i><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><u>Bayes Rule</u></span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">Using equation (1) in both directions, we get that P(A,B) = P(A) * P(B|A) = P(B) * P(A|B), and dividing by P(B) gives us Bayes rule:</span><br />
<div style="text-align: left;">
<br /></div>
<div style="text-align: center;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">(2) P(</span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px; text-align: left;">A|B</span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">) = </span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px; text-align: left;">P(A) * </span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">P(B|A) / P(B)</span></div>
<br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">This can be useful in cases when you know the conditional probability in one direction and would like to compute the other. For example, let's say that there is a clinical test that should diagnose a specific illness. This test is not 100% accurate: if a person is ill, it will come out positive 98% of the time. If the person is healthy, it will come out positive 1% of the time. The ratio of ill people in the population is 2%. Say that someone took this test, and it came out positive. Does it necessarily mean that he is ill? No, but there is some probability that he is, which we can compute.</span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">A is the event that a person has this illness. B is the event that the test came out positive. We know that P(B|A)=0.98 (the probability that the test comes out positive for an ill person). We also know that P(A)=0.02 (the probability of having this illness). We would like to compute the probability P(A|B). We can use equation (2), but we need to know P(B) - the probability that the test comes out positive.</span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">We can use the <i>law of total probability</i> according to which</span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">:</span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<br />
<div style="text-align: center;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">(3) P(B) = P(A,B) + P(AÌ„,B) = P(B|A) P(A) + P(B|AÌ„) P(AÌ„)</span><br />
<div>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span></div>
</div>
<div style="text-align: left;">
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">In this example, what is the probability that the test came out positive? There are two cases, one in which the person is ill, and another in which he is healthy. These events are disjoint (because a person is either ill or healthy but never both). </span></span></div>
<div style="text-align: left;">
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span></div>
<div style="text-align: left;">
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">So we get that P(B) = P(B|A) P(A) + P(B|AÌ„) P(AÌ„) = 0.98*0.02 + 0.01*(1-0.02) = 0.0294, and using Bayes rule, P(A|B) = P(A) * P(B|A) / P(B) = 0.02*0.98 / 0.0294 = ⅔. Since the test is not very accurate, and the illness is so rare, if someone is tested positive for this illness, there is a probability of ⅓ that he is actually healthy!</span></span></div>
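The clinical-test computation fits in a few lines of Python; this sketch (mine, for illustration) applies the law of total probability and then Bayes rule:

```python
# Posterior probability of illness given a positive test.
p_ill = 0.02          # P(A): prior probability of the illness
p_pos_ill = 0.98      # P(B|A): test positive when ill
p_pos_healthy = 0.01  # P(B|not A): false positive when healthy

# Law of total probability, equation (3): P(B)
p_pos = p_pos_ill * p_ill + p_pos_healthy * (1 - p_ill)

# Bayes rule, equation (2): P(A|B) = P(A) * P(B|A) / P(B)
p_ill_given_pos = p_ill * p_pos_ill / p_pos

print(round(p_pos, 4), round(p_ill_given_pos, 4))  # 0.0294 0.6667
```

Changing `p_ill` is a quick way to see how strongly the prior drives the posterior: the rarer the illness, the less a positive test means.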
<div style="text-align: left;">
<br /></div>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><u>The Chain Rule</u></span>
<br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">We've seen that P(A,B) = P(A) * P(B|A). Sometimes it is useful in this direction as well. This is called the <i>chain rule</i>, and it can be extended to more than two events. In some cases, we would like to compute the probability of multiple events, rather than just one or two. For example, say that you know the probabilities of first names in a certain country. When a child is born, he has a certain probability of being named John (P(John)), and other probabilities for other names. If you know the names of his older siblings, this may affect the probability of his name; if his older sibling is called John, it reduces the probability that he will also be named John. And if his sister's name is Ablah, then he is more likely to be named Mohammad than David. If you want to compute the joint probability of the names of all children in the family, for example P(John, Jane, David), you can use the chain rule. You will need to know the <i>prior probability</i> of the name John (the probability that a kid is called John if you don't have any knowledge about his siblings, or if he is the first child). Then, you will need to know the probability of a girl being called Jane, given that her brother's name is John. Last, you will need to know the probability of a boy being called David, given that he has two siblings named John and Jane. In general, the probability that events A<sub>1</sub>, A<sub>2</sub>, ..., A<sub>n</sub> occurred is the product of the probabilities of each event given that the previous events occurred:</span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<br />
<div style="text-align: center;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">(4) P(A<sub>1</sub>, A<sub>2</sub>, ..., A<sub>n</sub>) = P(A<sub>1</sub>) P(A<sub>2</sub>|A<sub>1</sub>) ... P(A<sub>n</sub>|A<sub>1</sub>, A<sub>2</sub>, ..., A<sub>n-1</sub>)</span></div>
<br />
<br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif;"><span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">This, again, is called </span></span></span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px; text-align: center;">the </span><i style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px; text-align: center;">chain rule.</i><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif;"><span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span></span></span>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">In some cases, we can make a <i>Markov assumption</i>: that the probability of an event depends only on the preceding <i>k</i> events (for some fixed number <i>k</i>). For example, if a family has 5 children, the joint probability of their names is:</span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">P(A<sub>1</sub>,A<sub>2</sub>,A<sub>3</sub>,A<sub>4</sub>,A<sub>5</sub>) = P(A<sub>1</sub>) P(A<sub>2</sub>|A<sub>1</sub>) P(A<sub>3</sub>|A<sub>1</sub>A<sub>2</sub>) P(A<sub>4</sub>|A<sub>1</sub>A<sub>2</sub>A<sub>3</sub>) P(A<sub>5</sub>|A<sub>1</sub>A<sub>2</sub>A<sub>3</sub>A<sub>4</sub>)</span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">But if we assume that a child's name depends only on their two immediate older siblings' names (k = 2), then we get:</span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">P(A<sub>1</sub>,A<sub>2</sub>,A<sub>3</sub>,A<sub>4</sub>,A<sub>5</sub>) = P(A<sub>1</sub>) P(A<sub>2</sub>|A<sub>1</sub>) P(A<sub>3</sub>|A<sub>1</sub>A<sub>2</sub>) P(A<sub>4</sub>|A<sub>2</sub>A<sub>3</sub>) P(A<sub>5</sub>|A<sub>3</sub>A<sub>4</sub>)</span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">which is easier to compute.</span></span><br />
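To make this concrete, here is a minimal Python sketch of computing a joint probability under a k-th order Markov assumption. The names and the probability values in the toy table are made up purely for illustration.

```python
def joint_probability(names, cond_prob, k=2):
    """Joint probability of a name sequence under a k-th order Markov
    assumption: each name depends on at most the k preceding names."""
    total = 1.0
    for i, name in enumerate(names):
        context = tuple(names[max(0, i - k):i])
        total *= cond_prob(name, context)
    return total

# Toy conditional probabilities, made up for illustration.
TABLE = {
    ("John", ()): 0.1,
    ("Jane", ("John",)): 0.2,
    ("Mary", ("John", "Jane")): 0.3,
}

def cond_prob(name, context):
    return TABLE.get((name, context), 0.05)

# p = P(John) * P(Jane|John) * P(Mary|John,Jane)
p = joint_probability(["John", "Jane", "Mary"], cond_prob, k=2)
```

Note that with k = 2 the context passed to cond_prob never grows beyond two names, which is exactly why the factored form is cheaper to estimate and compute.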
<br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><u>Approximation</u></span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">Let's return to the names example. What if you don't know the probability function of names, but you do have access to the list of all names given in a certain time and country? You can <i>estimate </i>(approximate) the probability. One simple way of doing this is by counting. This method is called <i>Maximum Likelihood Estimation</i> (MLE). If you want to know the probability of a child being named John, check the ratio of people called John in the entire population, so that: P*(John) = #John/N, where N is the number of people in the list you have. Since this is not the true probability but an approximation, it is denoted by P* and not by P.</span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif;"><span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span></span></span>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif;"><span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">If you want to know the probability of someone being called Jane given that her </span></span></span></span><span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">immediate older sibling's name is John, count all the pairs of John followed by Jane and divide by all pairs of John followed by any name: </span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"> </span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">P*(Jane|John) = #(John,Jane)/#(John,*).</span><br />
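The counting described above takes only a few lines of Python. The family name lists below are made up for illustration; a real estimate would use a full national name registry.

```python
from collections import Counter

# Toy list of sibling name sequences, oldest first (made up).
families = [
    ["John", "Jane"],
    ["John", "Mary"],
    ["John", "Jane", "Bob"],
]

# Unigram MLE: P*(name) = #name / N
names = [n for fam in families for n in fam]
unigrams = Counter(names)
N = len(names)
p_john = unigrams["John"] / N

# Bigram MLE: P*(Jane|John) = #(John, Jane) / #(John, *)
pairs = Counter((a, b) for fam in families for a, b in zip(fam, fam[1:]))
p_jane_given_john = pairs[("John", "Jane")] / sum(
    count for (first, _), count in pairs.items() if first == "John"
)
```

Here `zip(fam, fam[1:])` produces the adjacent (older sibling, younger sibling) pairs, so the two counters are exactly the #name and #(name, name) counts in the formulas above.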
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">There are more complex methods to approximate a probability function, but I think that's enough for one post. </span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">So, given that you are reading this sentence now, what is the probability that you read the entire post?</span></span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">
</span></span></div>
</div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com0tag:blogger.com,1999:blog-9145120678290195131.post-5676976371688059812015-08-10T17:05:00.000+03:002015-08-10T17:10:02.587+03:00Supervised Learning<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: left;">
If mathematics is both the queen and servant of science, then <a href="https://en.wikipedia.org/wiki/Machine_learning">machine learning</a> must be the princess and the maid of AI (artificial intelligence). The main goal in AI is to develop software capable of intelligent behavior, for example, a <a href="https://en.wikipedia.org/wiki/Autonomous_car">self-driving car</a>. One of the definitions of intelligent behavior is the ability to <i>learn </i>from "experience" (given data) and develop a model that can understand and react to new data. This goal is achieved with <i>machine learning</i>.</div>
<div style="text-align: left;">
<br />
This post is a high-level overview of <i>supervised learning</i>, which is the simplest category of machine learning. In future posts, I can elaborate on the actual algorithms and I might also write about other categories, such as unsupervised learning and deep learning -- but I have to admit that I'll have to study them in more depth for that.<br />
<br />
I think the best way to explain supervised learning is with a motivating example: <a href="https://en.wikipedia.org/wiki/Anti-spam_techniques">spam detection</a>. Your mailbox contains a spam folder, and when spam is detected, it is stored there rather than in the inbox. How does your mail provider recognize that a certain email is spam?<br />
<br />
Suppose you had to manually decide whether a certain email is spam or not. You would probably check whether the sender is known, and whether the message contains suspicious words and phrases such as "free", "cash bonus", and other <a href="http://blog.hubspot.com/blog/tabid/6307/bid/30684/The-Ultimate-List-of-Email-SPAM-Trigger-Words.aspx">spam-triggering words</a>. Then you can define rules on top of these observations, for instance: "classify as spam if the message is sent from an unknown sender and contains at least 2 spam-triggering words, and as non-spam otherwise". In the same way, you can define these rules and let the software apply them automatically to new emails. This approach is called <i>rule-based</i>. While it can lead to accurate results, it requires the effort of defining the rules, which in some tasks must be done by experts (spam detection is a relatively easy task).<br />
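A minimal sketch of such a rule-based classifier. The trigger words and the threshold of 2 are just the illustrative rule from above, not a real filter.

```python
# Made-up trigger words for illustration; real spam filters use
# much longer, curated lists.
SPAM_TRIGGERS = {"free", "cash bonus", "winner", "urgent"}

def is_spam(sender_known, message):
    """Rule: spam if the sender is unknown and the message contains
    at least 2 spam-triggering words; non-spam otherwise."""
    text = message.lower()
    hits = sum(1 for trigger in SPAM_TRIGGERS if trigger in text)
    return (not sender_known) and hits >= 2

flagged = is_spam(False, "You are a WINNER! Claim your cash bonus now")
kept = is_spam(True, "Free lunch tomorrow?")
```

The weakness is visible even in this toy: every new spam trick means another hand-written rule, which is exactly the effort that learning from examples avoids.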
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiS2_W87wv3DF2He1u9vE7w40r0Y8l8YThyphenhyphenS65eDe9a8JL2uidK-wdgaVxTomzUktzNwUVNGYtgeSVouhghf3kWnjlO7RPqsqGQdMRibkya1xsgpWqMaK0WQy2x3HTgQ4QD2sQMm9Fucps/s1600/My+spam.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiS2_W87wv3DF2He1u9vE7w40r0Y8l8YThyphenhyphenS65eDe9a8JL2uidK-wdgaVxTomzUktzNwUVNGYtgeSVouhghf3kWnjlO7RPqsqGQdMRibkya1xsgpWqMaK0WQy2x3HTgQ4QD2sQMm9Fucps/s400/My+spam.PNG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: 12.8000001907349px;">An example for accurate spam detection</span></td></tr>
</tbody></table>
<br />
The common solution is to let the machine learn a model (a function) that receives an email as input and returns whether or not it is a spam message, based on observed features of the message, such as the sender and content.<br />
<br />
In supervised learning, the machine is provided with a set of labeled examples, called the <i>training set</i>. This is a small set of data items whose classification is known in advance (e.g. annotated by humans). Each instance describes one data item (e.g. an email message), using predefined <i>features</i> that are relevant for the task at hand (e.g. the sender address, each of the words that occur in the subject of the message, etc.). In addition, in the training set, each item has a matching <i>true label</i>, which is the item's known class (e.g. spam / non-spam). The machine performs a <i>learning</i> phase, during which it learns a function (model) that receives an unlabeled instance (e.g. a new email message) and returns its <i>predicted label</i> (e.g. spam / non-spam). <br />
<br />
The learning phase is performed once, and then the model is ready to use for <i>inference</i> as many times as you want. You can give it a new unlabeled instance (e.g. a new email message that just arrived) and it will predict its <i>class </i>(e.g. spam / non-spam) by applying the learned function.<br />
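The learning and inference phases can be sketched with a tiny Naive Bayes classifier written from scratch (one of the simplest supervised learning algorithms, and a natural fit here since it is built on the probability estimates discussed in an earlier post). The four training messages are made up; a real system would use far more data and features.

```python
from collections import Counter
import math

# Tiny made-up training set of (message, true label) pairs.
train = [
    ("free cash bonus click now", "spam"),
    ("winner claim your free prize", "spam"),
    ("meeting notes attached", "non-spam"),
    ("lunch tomorrow at noon", "non-spam"),
]

# Learning phase (done once): count word frequencies per class.
word_counts = {"spam": Counter(), "non-spam": Counter()}
class_counts = Counter()
for message, label in train:
    class_counts[label] += 1
    word_counts[label].update(message.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(message):
    """Inference: return the most probable class for an unlabeled message."""
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log P(label) + sum of log P(word|label), with add-one smoothing
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for word in message.split():
            score += math.log((word_counts[label][word] + 1) /
                              (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

prediction = predict("claim your free cash bonus")
```

Note the separation: `train` is used only once to build the counts, after which `predict` can be applied to as many new messages as you want.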
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdT3prKZOzKyXj-cB5wcDpKGAD_klBbVMucUEAKtaucYNjU3PgUo_KFUn8J_YZLP11H7a2smybxso72hGXv7WJihVuw7_UAbgIIC0bChlm-t3VM9RbjUdkYphIohnbtj7mK6GqjG7gW20/s1600/pipeline.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="183" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdT3prKZOzKyXj-cB5wcDpKGAD_klBbVMucUEAKtaucYNjU3PgUo_KFUn8J_YZLP11H7a2smybxso72hGXv7WJihVuw7_UAbgIIC0bChlm-t3VM9RbjUdkYphIohnbtj7mK6GqjG7gW20/s400/pipeline.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: 12.8000001907349px;">Supervised learning pipeline (picture taken from <a href="http://www.cse.iitk.ac.in/users/se367/10/presentation_local/Binary%20Classification.html">here</a>).</span></td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
As you may have noticed, spam detection does not perform perfectly; sometimes a spam message is missed and stays in the inbox (the model classifies a "positive" spam as "negative" non-spam - a <i>false negative</i>). At other times, a valid message unjustifiably finds its way to the spam folder (the model classifies a "negative" non-spam as "positive" spam - a <i>false positive</i>).<br />
<br />
In order for the algorithm to perform well, it needs to learn a model that best describes the training set, with the assumption that the training set is representative of the real-world instances. In order to assess how successful a learned model is (in comparison with other models or in general), an <i>evaluation</i> is performed. This requires an additional set of labeled examples, used to test the model, which is called the <i>test set</i>. This set is disjoint from the <i>training set </i>and not used during the learning phase. The model is applied to each of the instances in the test set, and the predicted label is compared with the true label (<i>gold standard</i>) given in the test set. An <i>evaluation measure</i> is then computed - for example precision<a href="http://veredshwartz.blogspot.co.il/2015/08/supervised-learning.html#1" name="top1"><sup>1</sup></a>, recall<a href="http://veredshwartz.blogspot.co.il/2015/08/supervised-learning.html#2" name="top2"><sup>2</sup></a> or F1<a href="http://veredshwartz.blogspot.co.il/2015/08/supervised-learning.html#3" name="top3"><sup>3</sup></a>.<br />
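A sketch of computing these evaluation measures from a model's predictions on a test set. The gold and predicted labels below are made up for illustration.

```python
def evaluate(true_labels, predicted_labels, positive="spam"):
    """Precision, recall and F1 for the positive class."""
    tp = sum(1 for t, p in zip(true_labels, predicted_labels)
             if t == positive and p == positive)   # true positives
    fp = sum(1 for t, p in zip(true_labels, predicted_labels)
             if t != positive and p == positive)   # false positives
    fn = sum(1 for t, p in zip(true_labels, predicted_labels)
             if t == positive and p != positive)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up gold standard and model predictions for a 5-item test set.
gold = ["spam", "spam", "non-spam", "non-spam", "spam"]
pred = ["spam", "non-spam", "non-spam", "spam", "spam"]
p, r, f = evaluate(gold, pred)
```

Here the model caught 2 of the 3 true spam messages (recall) and 2 of its 3 spam predictions were correct (precision), with F1 as their harmonic mean.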
<br />
Of course, spam detection is only one of many examples of supervised learning. Others include:<br />
<ul style="text-align: left;">
<li>Medical diagnosis - predict whether a patient suffers from a certain disease, based on their symptoms</li>
<li>Detecting fraudulent credit card transactions</li>
<li><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html">Lexical inference</a> - predict whether two terms hold a certain semantic relation, based on the relations between them in knowledge resources</li>
</ul>
In addition, there are more complex variations of supervised learning. The examples I gave were of <i>binary classification</i>, where each instance is classified to one of two classes: either positive (e.g. spam) or negative (e.g. non-spam). Other tasks require <i>multi-class classification</i>, in which every instance can be classified to one of several predefined classes; for instance, in <a href="https://en.wikipedia.org/wiki/Optical_character_recognition">optical character recognition</a> (OCR), each hand-written character should be classified as one of the possible characters or digits in the alphabet.<br />
<br />
In other tasks, any instance can be classified to multiple classes from a predefined set of classes, for instance, determining the different topics of a document, from a predefined set of topics (this post can be classified as computer science, machine learning, supervised learning, etc). This is called <i>multi-label classification</i>.<br />
<br />
More complex tasks require outputting a structure rather than a class - this is called <i>structured prediction</i>. One such task is <a href="https://en.wikipedia.org/wiki/Part-of-speech_tagging">part-of-speech (POS) tagging</a>: given a sentence, predict the part-of-speech of every word in the sentence (e.g. noun, verb, adjective). Rather than predicting the POS tag of every word separately, the sequence is predicted together, taking advantage of dependencies between preceding POS tags; e.g. if the previous word is tagged as a determiner, it is more likely that this word is a noun.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwaygg8iA_mekhQbxmMqFwsPsROVDHGpWABT8d4lxS5ZhYGokR9y8lcFLnfQMJ_soDEKD73OdglatHTauxoT9Ykp_MAesiFxZpGeuFf2iKr68wBowP_4M2Jsn7HHTdkt9-cBKCrPVhuPM/s1600/tagger.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="205" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwaygg8iA_mekhQbxmMqFwsPsROVDHGpWABT8d4lxS5ZhYGokR9y8lcFLnfQMJ_soDEKD73OdglatHTauxoT9Ykp_MAesiFxZpGeuFf2iKr68wBowP_4M2Jsn7HHTdkt9-cBKCrPVhuPM/s320/tagger.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">An example of POS tagging, from <a href="http://nlp.stanford.edu:8080/corenlp/">Stanford Parser</a></td></tr>
</tbody></table>
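Predicting the whole tag sequence jointly is typically done with dynamic programming (the Viterbi algorithm). Here is a minimal sketch over a toy hand-built model; the tag set, transition and emission probabilities are all invented for illustration:

```python
# A toy Viterbi decoder: choose the best tag sequence jointly, rather than
# tagging each word in isolation. All probabilities are made up.
tags = ["DT", "NN", "VB"]

# transition[prev][cur]: e.g. a determiner is usually followed by a noun.
transition = {
    "<s>": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
    "DT":  {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN":  {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB":  {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
# emission[tag][word]: how likely each tag is to produce each word.
emission = {
    "DT": {"the": 0.9, "dog": 0.0, "barks": 0.0},
    "NN": {"the": 0.0, "dog": 0.8, "barks": 0.2},
    "VB": {"the": 0.0, "dog": 0.1, "barks": 0.7},
}

def viterbi(words):
    # best[t] = (score of the best path ending in tag t, that path)
    best = {t: (transition["<s>"][t] * emission[t][words[0]], [t]) for t in tags}
    for w in words[1:]:
        best = {
            t: max(
                ((score * transition[prev][t] * emission[t][w], path + [t])
                 for prev, (score, path) in best.items()),
                key=lambda x: x[0],
            )
            for t in tags
        }
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["the", "dog", "barks"]))  # ['DT', 'NN', 'VB']
```

Even though "barks" could in principle be a noun, the transition probability from NN to VB pushes the decoder toward the verb reading; that is exactly the benefit of predicting the sequence together.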
No post about machine learning is complete without mentioning <i>overfitting </i>and <i>regularization. </i>During the learning phase, the machine tries to learn a model that fits the training set. However, it might <i>overfit</i> the training set, by memorizing all the instances instead of learning the main trends in the data. In this case, the evaluation results on the training set (trying to predict the labels of the instances without looking at the true label) will be very good. On a separate test set, however, the model is expected to perform worse, since the algorithm learned a very specific function which is not good at handling unseen data. For example, suppose that our training set contains the following 6 emails (training sets are usually much larger, this is for simplicity):<br />
<br />
<table border="1" style="text-align: center;"><tbody>
<tr><th>Subject</th><th>True Label</th></tr>
<tr><td>earn extra cash</td><td>spam</td></tr>
<tr><td>our meeting on Monday</td><td>non-spam</td></tr>
<tr><td>the slides you requested</td><td>non-spam</td></tr>
<tr><td>get cash today</td><td>spam</td></tr>
<tr><td>hi</td><td>non-spam</td></tr>
<tr><td>cash bonus</td><td>spam</td></tr>
</tbody></table>
<br />
A good algorithm will learn that "cash" in the mail's subject is indicative of spam. A bad algorithm will only recognize emails with the exact subjects "earn extra cash", "get cash today" and "cash bonus" as spam. Then, if it sees a new mail with the subject "get your cash immediately", it won't know it is also spam.<br />
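The difference between the two can be sketched in a few lines of Python; this is a hypothetical illustration of memorization vs. generalization, not a real spam filter:

```python
# The 6-email training set from the table above.
train = {
    "earn extra cash": "spam",
    "our meeting on Monday": "non-spam",
    "the slides you requested": "non-spam",
    "get cash today": "spam",
    "hi": "non-spam",
    "cash bonus": "spam",
}

def overfit_classify(subject):
    # Memorizes the training set: perfect on seen subjects, clueless otherwise.
    return train.get(subject, "non-spam")

def general_classify(subject):
    # Learns the trend: "cash" in the subject is indicative of spam.
    return "spam" if "cash" in subject.split() else "non-spam"

unseen = "get your cash immediately"
print(overfit_classify(unseen))  # non-spam: the memorizer misses it
print(general_classify(unseen))  # spam: the general rule catches it
```

Both classifiers score perfectly on the training set itself; only the general one handles the unseen subject correctly, which is why evaluation must use a held-out test set.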
<br />
The solution is to apply <i>regularization</i>. Without going into overly technical detail, regularization penalizes the algorithm for overfitting the training set, causing it to prefer a simpler, more general model.
<br />
<br />
This was just the tip of the iceberg of machine learning. Stay tuned for more about it!</div>
<hr dir="ltr" style="text-align: left;" />
<div dir="ltr" style="text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;"><a href="https://www.blogger.com/null" name="1"><b>1</b></a> The fraction of instances that were classified as positive (e.g. predicted to be spam) that are actually positive (e.g. actual spam messages). A numeric value between 0 and 1, 1 being the best precision, in which there are no negative instances falsely classified as positive. <a href="http://veredshwartz.blogspot.co.il/2015/08/supervised-learning.html#top1"><sup>↩</sup></a></span></div>
<div dir="ltr" style="text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;"><a href="https://www.blogger.com/null" name="2"><b>2</b></a> The fraction of positive instances (e.g. spam messages) that were also classified as positive (e.g. predicted to be spam). A numeric value between 0 and 1, 1 being the best recall, in which there are no positive instances falsely classified as negative. <a href="http://veredshwartz.blogspot.co.il/2015/08/supervised-learning.html#top2"><sup>↩</sup></a></span></div>
<div dir="ltr" style="text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;"><a href="https://www.blogger.com/null" name="3"><b>3</b></a> A measure that balances between precision and recall. A numeric value between 0 and 1, 1 being the best F1. <a href="http://veredshwartz.blogspot.co.il/2015/08/supervised-learning.html#top3"><sup>↩</sup></a>
</span></div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com6tag:blogger.com,1999:blog-9145120678290195131.post-51009602641421808922015-07-13T17:46:00.000+03:002015-07-13T18:32:34.526+03:00Lexical Inference<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
After I dedicated the <a href="http://veredshwartz.blogspot.co.il/2015/07/natural-language-processing.html">previous post</a> to the awesome field of natural language processing, in this post I will drill down and tell you about the specific task that I'm working on: recognizing <b>lexical inference</b>. Most of the work that I will describe was done by other talented people. You can see references to their papers at the bottom of the post, in case you would like to read more about a certain work.</div>
<div dir="ltr" style="text-align: left;">
<br />
<b>What?</b></div>
<div dir="ltr" style="text-align: left;">
I'll start by defining what lexical inference is. We are given two terms, <i>x</i> and <i>y</i> (a term is a word such as <i>cat</i> or a multi-word expression such as <i>United States of America</i>). We would like to know whether we can infer the meaning of <i>y</i> from <i>x </i>(denoted by <i style="text-align: center;">x </i><span style="text-align: center;">→ <i>y </i>throughout this post).</span><br />
<span style="text-align: center;"><br />For example, we can infer </span><i style="text-align: center;">animal</i><span style="text-align: center;"> from <i>cat</i>, because when we talk about a <i>cat</i> we refer to an <i>animal</i>. In general, </span><i>y</i> can be inferred from <i>x</i> if they hold a certain lexical or semantic relation; for example, if <i>x </i>is a <i>y</i> (<i>cat </i><span style="text-align: center;">→ <i>animal</i>, <i>Lady Gaga </i></span><span style="text-align: center;">→ <i>singer</i>), if <i>x</i> causes <i>y</i> (<i>flu </i></span><span style="text-align: center;">→ <i>fever</i>), if <i>x </i>is a part of <i>y </i>(<i>London </i></span><span style="text-align: center;">→ <i>England</i>), etc. </span><br />
<span style="text-align: center;"><br /><b>Why?</b></span><br />
<span style="text-align: center;">Now would be a good time to ask - why is this task important? We know that a <i>cat </i>is an <i>animal</i>. How would it help us if the computer can automatically infer that? I'll give a usage example. Let's say you use a search engine and type the query "actor Scientology" (or "actors engaged in </span><span style="text-align: center;">Scientology", if you don't search by keywords). </span><span style="text-align: center;">You expect the search engine to retrieve the following documents:</span><br />
<span style="text-align: center;"><br /></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDjCd47en1__psgzSiKHpaE_1bMxiMW84IsZxwiQQIJBfb5P-c1JD9DK4Awo5QKDsD23A4CXdD7i9EljIuxbPtdlXe_NbK3DJuZ4xYysfPebUPea5JwoBY8GRDf7WsrIKRMcdFSNwXjMY/s1600/query_expansion.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="125" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDjCd47en1__psgzSiKHpaE_1bMxiMW84IsZxwiQQIJBfb5P-c1JD9DK4Awo5QKDsD23A4CXdD7i9EljIuxbPtdlXe_NbK3DJuZ4xYysfPebUPea5JwoBY8GRDf7WsrIKRMcdFSNwXjMY/s400/query_expansion.PNG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1: search results for the query "actor Scientology" that don't directly involve the word "actor"</td></tr>
</tbody></table>
<span style="text-align: center;"><br /></span>
<span style="text-align: center;">since they are talking about a certain actor (Tom Cruise or John Travolta) and </span><span style="text-align: center;">Scientology. However, what if these documents don't contain the word <i>actor</i>? The search engine needs to know that </span><span style="text-align: center;"><i>Tom Cruise </i></span><span style="text-align: center;"><i>→ actor</i> to retrieve the first document, and that </span><span style="text-align: center;"><i>John Travolta</i></span><span style="text-align: center;"><i> </i></span><span style="text-align: center;"><i>→ actor </i>to retrieve the second.</span><br />
There are many other applications, and in general, knowing that one term infers another helps deal with language variability (there is more than one way of saying the same thing).<br />
<br />
<b>How?</b><br />
People have been working on this task for many years. As many other NLP tasks, this one is also difficult. There are two main approaches to recognize lexical inference:<br />
<ul style="text-align: left;">
<li><b>Resource-based:</b> in this approach, the inference is based on knowledge from hand-crafted resources that specify the semantic or lexical relations between words or entities in the world. In particular, the resource which is usually used for this task is <a href="http://wordnet.princeton.edu/">WordNet</a>, a lexical database of the English language. WordNet contains words which are connected to each other via different relations, such as (<i>tail</i>, part of,<i style="text-align: center;"> cat</i><span style="text-align: center;">) and </span>(<i style="text-align: center;">cat</i>, subclass of,<i style="text-align: center;"> feline</i><span style="text-align: center;">).<sup><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#1" name="top1">1</a> </sup></span><span style="text-align: center;">See figure 2 for an illustration of WordNet.<br /><br />This approach is usually very precise (it is correct most of the time when it says that </span><i style="text-align: center;">x </i><span style="text-align: center;">→ <i>y</i>), because it relies on knowledge which is quite precise. However, its coverage (the percentage of times in which it recognizes that </span><i style="text-align: center;">x </i><span style="text-align: center;">→ <i>y</i>, out of all the times that <i>x </i>→ <i>y </i>is true) is limited, because some of the knowledge needed for the inference may be absent from the resource.</span><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnoAjV1p2Pe4rl7QMHYkUnhFFGyBBhB_95At4muNskJ7-ekigFWZkgdhyGXLugFtHPY0otR0CWdI0ZNRxf3opAxabASuU4maC9vh1sgc5jVbmzY7EvRLf_7lZ7TCEzKExS_edCuZZfXwQ/s1600/wordnet.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnoAjV1p2Pe4rl7QMHYkUnhFFGyBBhB_95At4muNskJ7-ekigFWZkgdhyGXLugFtHPY0otR0CWdI0ZNRxf3opAxabASuU4maC9vh1sgc5jVbmzY7EvRLf_7lZ7TCEzKExS_edCuZZfXwQ/s320/wordnet.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 2: an excerpt of WordNet - a lexical database of the English language</td></tr>
</tbody></table>
<br />
</li>
<li><b>Corpus-based:</b> this approach uses a huge text called a "corpus" (e.g. all the English articles in Wikipedia), which is supposed to be representative of the language. The inference is based on the statistics of occurrences of <i>x</i> and <i>y </i>in the corpus. There are several ways to use a corpus to recognize lexical inference:<br /><br />
<ul>
<li><b>pattern-based approach</b> - there are some patterns such as "<i>x </i>and other <i>y</i>" or "<i>y</i> such as <i>x</i>" that indicate that <i style="text-align: center;">x </i><span style="text-align: center;">→ <i>y</i>;</span> if you find it difficult to understand, think about "<i>animals such as cats</i>" and "<i>cat and other animals</i>" and ignore the plural/singular. If <i>x</i> and <i>y</i> frequently occur in the corpus in such patterns, this approach will recognize that <i style="text-align: center;">x </i><span style="text-align: center;">→ <i>y</i>. It is not enough to observe one or two occurrences; think about the sentence "<i>my brother and other students</i>". It may occur in the corpus, but this is not a general phenomenon: <i>student</i> is not a common attribute of <i>brother</i>. Positive examples such as <i>cat</i> and <i>animal</i> will probably occur more frequently in these patterns in the corpus. </span><br /><br />The first method defined these patterns manually <sup><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#ref1" name="top-ref1">[1]</a> </sup><span style="text-align: center;">. A later work found such patterns automatically <sup style="text-align: left;"><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#ref2" name="top-ref2">[2]</a></sup>. This work was highly referenced and used. It is quite precise and also has good coverage. However, it requires that <i>x</i> and <i>y</i> occur together in the corpus, and some words tend not to occur together, even though they are highly related; for instance, synonyms (e.g. <i>elevator </i>and <i>lift</i>).<br /></span></li>
</ul>
<ul>
<li><b>distributional approach</b> - the second approach solves exactly this. It is based on a linguistic hypothesis <sup><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#ref3" name="top-ref3">[3]</a></sup> that says that if words occur with similar neighboring words, then they tend to have similar meanings (e.g. <i style="text-align: center;">elevator </i><span style="text-align: center;">and </span><i style="text-align: center;">lift </i><span style="text-align: center;">will both appear next to <i>down</i>, <i>up</i>, <i>building</i>, <i>floor</i>, and <i>stairs</i>). There has been plenty of work in this approach: earlier methods defined some similarity measure between words which was based on the neighbors (the more common neighbors they share, the more similar they are)<span style="text-align: left;"> </span><sup style="text-align: left;"><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#ref4" name="top-ref4">[4]</a>,<a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#ref5" name="top-ref5">[5]</a></sup>. In recent years, some automatic methods (that don't require defining a similarity measure) were developed (I might elaborate on these in another post, but it requires knowledge in topics that I haven't covered yet).<br /></span></li>
</ul>
<span style="text-align: center;">Corpus-based methods, and in particular distributional ones, have a much higher coverage than resource-based methods, because they utilize huge texts. The amount of text available on the web is incredible, as opposed to structured knowledge. However, they are much less precise. The distributional <span style="text-align: left;">hypothesis says something about the <b>similarity </b>of <i>x</i> and <i>y</i>, which is a vague notion. Just because <i>x </i>and <i>y</i> are similar (what does that even mean?) it doesn't mean that we can infer <i>x</i> from <i>y</i> or vice versa; for instance, the words <i>football</i> and <i>basketball</i> are similar, and will probably share some common neighbors such as <i>ball</i>, <i>player</i>, <i>team</i>, <i>match</i>, and <i>win</i>. However, you can't infer one from the other. Moreover, </span>distributional methods may say that <i>hot</i> and <i>cold</i> are similar, because both occur with <i>weather</i>, <i>temperature</i>, <i>drink, </i><i>water</i>, etc. Now this is too much. Not only is it the case that <i>hot </i></span><span style="text-align: center;">↛ </span><span style="text-align: center;"><i>cold </i>and </span><span style="text-align: center;"><i>cold </i>↛ </span><span style="text-align: center;"><i>hot</i>, but they mean exactly the opposite!</span></li>
</ul>
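The pattern-based idea above can be sketched in a few lines of Python. This is a toy illustration over a tiny made-up corpus, not the actual method of [1] or [2]: match Hearst-style patterns, count how often each candidate pair occurs, and keep only pairs above a frequency threshold:

```python
import re
from collections import Counter

# A tiny, made-up corpus (real methods use e.g. all of Wikipedia).
corpus = [
    "animals such as cats are kept as pets",
    "cats and other animals roam the streets",
    "my brother and other students passed the exam",
    "she adores animals such as cats and dogs",
]

# Two Hearst-style patterns: "y such as x" and "x and other y".
patterns = [
    re.compile(r"(?P<y>\w+) such as (?P<x>\w+)"),
    re.compile(r"(?P<x>\w+) and other (?P<y>\w+)"),
]

# Count how often each candidate (x, y) pair occurs in these patterns.
counts = Counter()
for sentence in corpus:
    for pattern in patterns:
        for m in pattern.finditer(sentence):
            counts[(m.group("x"), m.group("y"))] += 1

# (cats, animals) appears three times; (brother, students) only once,
# so a frequency threshold filters out the one-off match.
inferred = {pair for pair, c in counts.items() if c >= 2}
print(inferred)  # {('cats', 'animals')}
```

The threshold is exactly what separates the genuine phenomenon (<i>cats</i> → <i>animals</i>) from an incidental co-occurrence like "my brother and other students".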
<div>
<b><br /></b>
<b>So what have we been doing?</b><br />
We developed a new resource-based method for recognizing lexical inference <sup><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#ref6" name="top-ref6">[6]</a></sup>. We weren't going to compromise on precision, but we still wanted to improve upon the coverage of prior methods. In particular, we found that prior methods are incapable of recognizing inferences that contain recent terminology (e.g. <i>social networks</i>) and named-entities (called <a href="https://en.wikipedia.org/wiki/Proper_name_(philosophy)">proper-names</a>, e.g. <i>Lady Gaga</i>). This simply happens because prior methods are based on WordNet, and these terms are absent from WordNet; WordNet is an "ontology of the English language", so by definition it's not supposed to contain world-knowledge about named entities. Also, it hasn't been updated in years, so it doesn't cover recent terminology.<br />
<br />
We used other structured knowledge resources that contain exactly this kind of information, are much larger than WordNet and are frequently updated. These resources contain information such as (<i>Lady Gaga</i>, occupation, <i>singer</i>) and (<i>singer</i>, subclass, <i>person</i>), that can indicate that <span style="text-align: center;"><i>Lady Gaga </i></span><span style="text-align: center;">→ <i>singer </i>and </span><span style="text-align: center;"><i>Lady Gaga </i></span><span style="text-align: center;">→ <i>person</i>. However, they may also contain information such as </span>(<i>Lady Gaga</i>, producer, <i>Giorgio Moroder</i>), but that does not indicate that <span style="text-align: center;"><i>Lady Gaga </i></span><span style="text-align: center;">→ </span><i>Giorgio Moroder</i>. As in WordNet, we needed to define which relations in the resource are relevant for lexical inference. For instance, the occupation relation is relevant, because a person infers their occupation (<span style="text-align: center;"><i>Lady Gaga </i></span><span style="text-align: center;">→ <i>singer</i>, </span><span style="text-align: center;"><i>Barack Obama </i></span><span style="text-align: center;">→ <i>president</i></span><span style="text-align: center;">).</span><br />
<br />
As opposed to WordNet-based methods, which only need to select relevant relations out of the few relations WordNet defines, it would be excruciating to do the same for the resources we used: they contain thousands of relations. So we developed a method that automatically recognizes which resource relations are indicative of lexical inference. Then, if it finds that <i>x</i> and <i>y</i> are connected to each other via a path containing only relevant relations, it predicts that <i style="text-align: center;">x </i><span style="text-align: center;">→ <i>y</i>. So in our previous example, since </span>occupation and subclass were found indicative of lexical inference, then <span style="text-align: center;"><i>Lady Gaga </i></span><span style="text-align: center;">→ <i>person. </i></span><br />
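In simplified form, once the indicative relations are known, checking whether <i>x</i> → <i>y</i> becomes a graph search over the resource. A minimal sketch with a hypothetical three-triple resource and a hand-picked relation whitelist (the real resources and the learning step in the paper are, of course, richer):

```python
from collections import deque

# Toy knowledge resource: (term, relation, term) triples, made up for illustration.
triples = [
    ("Lady Gaga", "occupation", "singer"),
    ("singer", "subclass", "person"),
    ("Lady Gaga", "producer", "Giorgio Moroder"),
]

# Relations assumed (in this sketch) to be indicative of lexical inference;
# in the actual method these are recognized automatically.
relevant = {"occupation", "subclass"}

def infers(x, y):
    # BFS from x following only relevant relations;
    # x -> y holds if y is reachable through such a path.
    frontier, seen = deque([x]), {x}
    while frontier:
        node = frontier.popleft()
        if node == y:
            return True
        for s, rel, o in triples:
            if s == node and rel in relevant and o not in seen:
                seen.add(o)
                frontier.append(o)
    return False

print(infers("Lady Gaga", "person"))           # True, via occupation + subclass
print(infers("Lady Gaga", "Giorgio Moroder"))  # False: producer is not relevant
```

The producer edge exists in the resource but is never followed, which is exactly why relation selection matters: the path, not mere connectivity, licenses the inference.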
<span style="text-align: center;"><br /></span>
<span style="text-align: center;">Similarly to the example, we've made successful inferences, and in particular inferences containing proper-names that were not captured by previous methods. We also maintained a very high precision. </span><span style="text-align: center;">This is basically the simplified version of our paper.</span><br />
<br /></div>
<div>
<b>So, is it perfect now?</b></div>
<div>
Well... not exactly. First of all, our coverage is still lower than that of the corpus-based methods (but with higher precision, usually). Second, there are still some open issues left. I'll give one of them as an example, as this post is already very long (and I challenge you to tl;dr it).<br />
<br />
Answer the following question:</div>
</div>
<div dir="ltr" style="text-align: left;">
<div class="" style="clear: both; text-align: center;">
<b>apple __ fruit?</b></div>
</div>
<div dir="ltr" style="text-align: center;">
(a) →<br />
(b) ↛<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div style="text-align: left;">
Well, I know this seems like a trivial question, but the answer is - it depends! <br />
Are we talking about <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjegW5MgHa9w5cnyZjlQLTYCDkikNg81jeoOmMsyLUf6ktuDInFaIdKvWTKKqDW9eaBznoDf3M0GY1XXgnZ-UhcW5Xh7K3CbvYnQdyX3BgD7_9lEgKcLUEVh0gFt2IT4jJT7Y3_hLSOEWw/s1600/apple.png" imageanchor="1" style="clear: left; display: inline !important; margin-bottom: 1em; margin-right: 1em; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjegW5MgHa9w5cnyZjlQLTYCDkikNg81jeoOmMsyLUf6ktuDInFaIdKvWTKKqDW9eaBznoDf3M0GY1XXgnZ-UhcW5Xh7K3CbvYnQdyX3BgD7_9lEgKcLUEVh0gFt2IT4jJT7Y3_hLSOEWw/s1600/apple.png" /></a> or about<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwiCzfmTTK06R6xDUJJkvl5QbefwFYW321pnCgpN7ciqIkYe4nxBVGj-amgtw6nQNJOcN_RbMv9gBYoWO4K4-lQtSnoQHmfzpgKLNdOHfiK8KvhPfUcWmQDe3xdGr7D-ma1GmULG2peXM/s1600/apple-logo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwiCzfmTTK06R6xDUJJkvl5QbefwFYW321pnCgpN7ciqIkYe4nxBVGj-amgtw6nQNJOcN_RbMv9gBYoWO4K4-lQtSnoQHmfzpgKLNdOHfiK8KvhPfUcWmQDe3xdGr7D-ma1GmULG2peXM/s1600/apple-logo.png" /></a>?<br />
The problem in determining whether <i><span style="text-align: center;">apple </span><span style="text-align: center;">→ </span></i><span style="text-align: center;"><i>fruit</i>, is that the word <i>apple </i>has two senses (meanings). In one of its senses, </span><i><span style="text-align: center;">apple </span><span style="text-align: center;">→ </span></i><span style="text-align: center;"><i>fruit</i>, and in the other, </span><i><span style="text-align: center;">apple </span></i><span style="text-align: center;">↛ </span><span style="text-align: center;"><i>fruit</i>. </span><span style="text-align: center;">In order to decide correctly, we need to know which of the senses of </span><i style="text-align: center;">apple </i><span style="text-align: center;">is the one we are being asked about. </span><br />
<span style="text-align: center;"><br /></span>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEhbWPRaaBj0wflOxmToRrD_VD2hXKYj1ZZk4cvTl4mJM3qNvloNWFR1z8UTa2UgpYzbnIaGvYoU30fPCoH2JvluBBXWD_v7u0ZEi8Djl2ky00rKBIfIiv2KGNUUIcbVNr3Of4W1llZi0/s1600/I+hate+apple.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="135" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEhbWPRaaBj0wflOxmToRrD_VD2hXKYj1ZZk4cvTl4mJM3qNvloNWFR1z8UTa2UgpYzbnIaGvYoU30fPCoH2JvluBBXWD_v7u0ZEi8Djl2ky00rKBIfIiv2KGNUUIcbVNr3Of4W1llZi0/s320/I+hate+apple.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 3: I've just seen this on my Facebook feed after publishing the post and I had to add it :)</td></tr>
</tbody></table>
<span style="text-align: center;"><br /></span>
<span style="text-align: center;">As I mentioned before, recognizing lexical inference is usually a component in some NLP application. In such application, <i>x</i>, <i>y </i>or both <i>x</i> and <i>y</i> are part of a text, and the application asks "does <i>x</i> infer <i>y</i>?", "what can we infer from <i>x</i>?" or "what infers <i>y</i>?". If <i>x=apple</i>, and we would like to know whether it infers <i>y=fruit</i>, the solution (for humans) would be to look at the texts. </span><br />
<br />
Say we have the sentence <i>I ate a green apple for breakfast. </i>We can easily understand that the correct sense of <i>apple </i>in this sentence is fruit. How did we know that? We noticed words like <i>ate</i>,<i> breakfast </i>and <i>green</i> that are related to <i>apple</i> in the sense of fruit (and unrelated to Apple the company). There are already automatic methods that do that (with some success, of course). So one of the next challenges is to incorporate them and apply context-sensitive lexical inference. In this case, infer that <i>I ate a fruit </i>and not that <i>I ate a company</i>. I promise to update in case I have any progress with that.</div>
</div>
<div dir="ltr" style="text-align: center;">
<br /></div>
</div>
<hr />
<div style="direction: ltr; text-align: left;">
<b>References:</b><br />
<div style="text-align: left;">
<span style="font-size: x-small;"><span style="color: black;"><a href="https://www.blogger.com/null" name="ref1">[1]</a> </span>Hearst, Marti A. "Automatic acquisition of hyponyms from large text corpora." Proceedings of the 14th conference on Computational linguistics-Volume 2. Association for Computational Linguistics, 1992. <a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#top-ref1"><sup>↩</sup></a> </span><br />
<span style="font-size: x-small;"><span style="color: black;"><a href="https://www.blogger.com/null" name="ref2">[2]</a> </span>Snow, Rion, Daniel Jurafsky, and Andrew Y. Ng. "Learning syntactic patterns for automatic hypernym discovery." Advances in Neural Information Processing Systems 17. 2004. <sup><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#top-ref2">↩</a></sup></span><br />
<a href="https://www.blogger.com/null" name="ref3"><span style="font-size: x-small;">[3]</span></a><span style="font-size: x-small;"> </span><span style="font-size: x-small;">Harris, Zellig S. "Distributional structure." Word. 1954. </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#top-ref3" style="font-size: small;">↩</a></sup><br />
<span style="font-size: x-small;"><a href="https://www.blogger.com/null" name="ref4">[4]</a> Weeds, Julie, and David Weir. "A general framework for distributional similarity." Proceedings of the 2003 conference on Empirical methods in natural language processing. Association for Computational Linguistics, 2003. </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#top-ref4">↩</a></sup><br />
<a href="https://www.blogger.com/null" name="ref5"><span style="font-size: x-small;">[5]</span></a><span style="font-size: x-small;"> </span><span style="font-size: x-small;">Kotlerman, Lili, et al. "Directional distributional similarity for lexical inference." Natural Language Engineering 16.04: 359-389. 2010. </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#top-ref5" style="font-size: small;">↩</a></sup><br />
<a href="https://www.blogger.com/null" name="ref6"><span style="font-size: x-small;">[6]</span></a><span style="font-size: x-small;"> Shwartz, Vered, Omer Levy, Ido Dagan, and Jacob Goldberger. "Learning to Exploit Structured Resources for Lexical Inference." Proceedings of the Nineteenth Conference on Computational Natural Language Learning. Association </span><span style="font-size: x-small;">for Computational Linguistics. 2015. </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#top-ref6">↩</a></sup></div>
</div>
<br />
<hr />
<div style="direction: ltr; text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;">
<a href="https://www.blogger.com/null" name="1"><b><span style="color: black;">1</span> </b></a>These relations actually have less friendly names: holonym/meronym and hyponym/hypernym. <a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#top1"><sup>↩</sup></a>
</span></div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com3