Tuesday, October 5, 2021

Interpretation of Time Expressions in Different Cultures

This is a section from a nonfiction book I'm writing (...I guess this public announcement will now pressure me to finish it and find a publisher). Thanks Anna Pryslopska for initiating the interesting Twitter discussion!


It’s common knowledge that Americans use the date format mm-dd-yyyy, in which the month appears before the day, unlike the rest of the world. There is no clear answer to why that is, but some hypothesize that this format was once used in the UK, was brought to the US by the Brits, and was later changed in the UK to the European format dd-mm-yyyy. This is certainly a cause for confusion, which is almost inevitable when the same date means July 4th to one person and April 7th to another. With that said, it may also be useful in some very specific scenarios. A young colleague who travelled to a conference in the US a few weeks before his 21st birthday on July 4th successfully purchased alcohol: the bartender read the dd-mm birthday printed on his non-US passport as mm-dd, and concluded he had already turned 21 on April 7th.

I was more surprised to learn that Americans use a 12-hour clock (with AM and PM distinctions as needed) rather than the 24-hour clock. I think part of the reason it was less noticeable is that both clocks are acceptable outside the US. I only realized it when, during a conference, I texted an American friend that I would meet him at 18:00 next to the escalator, which made him chuckle and inform me I was using “military time”.


It was the same friend with whom, only a couple of years earlier, I was supposed to go sightseeing on the day after a conference ended. In the morning he texted me that we could meet in the afternoon. I was surprised, because my interpretation of “afternoon”, based on the norm associated with its literal translation to Hebrew, is around 4 or 5 pm, which seemed like quite a late hour to begin sightseeing. He, of course, literally meant any time after 12 pm, which is a reasonable time to leave your hotel room after a week of exhausting conferencing.


Indeed, people vary in how they interpret time expressions. A 2002 study from the University of Aberdeen analyzed human-written weather forecasts along with the weather data they described. The study found significant individual differences between forecasters in the interpretation of some time phrases, such as “by evening”, but full agreement on other expressions, such as “midday”.


I couldn’t find any study regarding the cultural differences in interpretation of time expressions, so I conducted my own. I built a very simple survey with the following questions:

  1. Where are you from, or where have you lived most of your life?

  2. What is the range of time you consider as morning (or the equivalent of morning in your native language)?

  3. What is the range of time you consider as noon (or the equivalent of noon in your native language)?

  4. What is the range of time you consider as afternoon (or the equivalent of afternoon in your native language)?

  5. What is the range of time you consider as evening (or the equivalent of evening in your native language)?

  6. What is the range of time you consider as night (or the equivalent of night in your native language)?


I published the survey on Amazon Mechanical Turk, a crowdsourcing platform that enables recruiting workers to perform discrete tasks. To get answers from a range of countries, I published several batches of questionnaires, each time limiting them to workers from specific regions of the world.


Before I dive into the results, I would like to point out that this study was not conducted with my usual level of scientific rigour, mostly for budgetary reasons. In less subtle words: I did not conduct this experiment for my work, so I couldn’t pay for it with my research budget; I paid with my own money and went cheap. Because my budget was limited, I collected only 349 answers, which means that for most countries I collected only a handful of answers or none at all, so conclusions about those countries are poorly supported statistically. Moreover, some countries have many more Mechanical Turk workers than others, so I ended up collecting a very uneven number of responses from each country.


In addition, I live in the Pacific time zone (PST), and the time at which I published the batches affected the distribution of countries of the workers who responded to the survey. For example, I thoughtlessly published the North American survey at 4 pm PST on a Friday, which likely meant most of the answers came from people living on the West Coast. People living in countries where it was nighttime when the survey was made available were either not well represented or, worse, distorted the data. Think of a person answering a survey at 2 am their time: do you really trust them as representative of their culture with respect to time? Go to sleep, dude.


So, if you would still like to discover the results of my very unscientific study, here they are. This was the country distribution:


United States of America: 103
India: 83
Brazil: 48
Italy: 37
United Kingdom: 19
Spain: 8
France: 7
Philippines: 4
Canada: 3
Australia: 3
American Samoa: 2
Israel: 2
Macedonia: 2
Ireland: 2
Germany: 2
Greece: 2
Romania: 2
Taiwan: 1
Saudi Arabia: 1
Thailand: 1
Hong Kong: 1
Austria: 1
Andorra: 1
Barbados: 1
Azerbaijan: 1
Equatorial Guinea: 1
Anguilla: 1
Ethiopia: 1
Netherlands: 1
Malta: 1
Poland: 1
Sri Lanka: 1
Belgium: 1
Lithuania: 1
Sweden: 1
Pakistan: 1
Singapore: 1

Since the US is dominant in the survey, let’s first analyze the results received among participants in the US. The following figure presents the average start and end times for each time expression, along with error bars marking the standard deviation, which measures the dispersion of the data relative to the average.



Americans considered morning on average to span from 4:45 to 11:27 am, noon from 11:47 am to 12:41 pm, afternoon from 1:06 pm to 4:27 pm, evening from 4:27 to 7:04 pm, and night from 7:19 pm to 9:30 am. If you’re wondering about the apparent contradiction between the early morning start time (4:45 am) and the late night end time (9:30 am), the error bars explain the discrepancy. The night end time had the largest standard deviation, with many outliers, such as people who considered the night to end at 11:59 pm for some reason. A more informative statistic that is less sensitive to outliers is the median: the time such that half of the people consider the night to end at or before it, and half consider it to end at or after it. The median night end time was much earlier, at 5:45 am.
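To make the mean-versus-median point concrete, here is a minimal Python sketch, using made-up night end times rather than the actual survey responses, of how a couple of 11:59 pm answers drag the average hours later while the median barely moves:

# A toy illustration (with made-up answers, not the actual survey data) of how a few
# late-night outliers pull the average night end time much later than the median.
from statistics import mean, median

# Hypothetical night end times, in hours on a 0-24 scale: most respondents say night
# ends around 5:30-6:30 am, two say 11:59 pm (23.98).
night_end_hours = [5.5, 5.75, 6.0, 6.25, 6.5, 5.5, 6.0, 23.98, 23.98]

def to_clock(h):
    """Format a fractional hour as hh:mm."""
    return f"{int(h):02d}:{int(round((h % 1) * 60)):02d}"

print("mean:  ", to_clock(mean(night_end_hours)))    # pulled hours later by the two outliers
print("median:", to_clock(median(night_end_hours)))  # stays around 6 am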


At this point, I’ve already empirically shown that Americans indeed consider “noon” as a very narrow time slot around 12 pm, although a small number of them were extremely early risers for whom 10 am already feels like noon, and some considered noon to end as late as 2 pm. Another observation that stands out for me here is the early evening beginning. It explains the early US dinner. If the evening starts at 4:30 pm, “The Cadillac” episode in season 7 of Seinfeld seems slightly less crazy. In this episode, Jerry visits his retired parents in Florida. They are getting ready to go to dinner at 4:30, to make it to the early-bird rate. Jerry says he can’t “force feed himself a steak at 4:30” and convinces them to wait for the regular priced dinner at 6. 


Even if you treat the retiree population in Florida as an outlier, dinner in the US is eaten rather early, around 6 pm. I’ve had work dinners at 5:30 pm as well. I’ve heard about restaurants that are so busy that you must book a table for dinner… unless you are willing to eat as late as 8 pm. Needless to say, 8 pm seems like a perfectly good time for dinner to me. I’ve often used “dinner time” as an example for temporal commonsense, e.g. “dinner is typically eaten at around 8 pm”. But giving it a second thought, I realize this is rather culture-specific. On trips to some countries in Europe we wandered around hungry at 9 pm, unable to find anywhere to eat because all the restaurants were already closed. In other countries, such as Spain, it’s customary to eat very late.


What makes the dinner time convention more confusing is that the meaning of the word dinner is not exactly “the evening meal”. Today, people typically use “dinner” and “supper” interchangeably to refer to the last meal of the day. However, Merriam-Webster classifies supper as a lighter meal, or “the evening meal especially when dinner is taken at midday”, while dinner is "the principal meal of the day" regardless of its time. 


In 2019, my birthday happened to fall on Thanksgiving. We tried to book a table in a restaurant for dinner. The options were limited because many restaurants were closed for the holiday and others only served Thanksgiving dinner. I don’t eat chicken, nor am I a fan of holiday food (blatantly generalizing from my experience with Jewish holidays). By the time we found a restaurant that served its usual menu, they had no available tables for dinner. Right after hanging up the phone I had second thoughts about the way I had phrased the question. I called again and asked whether they had available tables at 8 pm. They did. We had a great meal. It was only intuition that made me recheck, but when I dug deeper into this, I learned the difference between dinner and supper, and I found out that Thanksgiving dinner is often eaten at around 2 to 4 pm, hours that I would consider lunch time. This ACL 2019 paper, in which textual mentions and their corresponding grounded values were automatically extracted from a large English text corpus, also supports this observation. In a figure showing the time of day at which meals are typically eaten, dinner seemed to start, according to some people, as early as 1 pm.


Before all this talk about dinner makes me hungry, I will get back to the survey results. So how is the US different from the rest of the world? We don’t have enough data for a fine-grained analysis country-by-country, but we can group countries by continent, for example looking at all answers from Europe.  

 


In Europe, the average morning was between 5:19 and 11:08 am, noon between 11:47 am and 1:30 pm, afternoon between 1:31 and 5:28 pm, evening between 5:51 and 6:50 pm, and night from 5:32 pm to 6:26 am. I was quite surprised by how early people considered the night to start, and in particular by the overlap between evening and night. I’ve heard people saying “good night” at 5 pm in the US, but expected Europe to party harder. Luckily, I allowed the survey respondents to add a free-text comment, and thankfully, many of them did. Two Spanish workers commented that Spanish makes no distinction between afternoon and evening, and doesn’t really have a word for evening. The word “tarde” (afternoon) is used to describe the range of hours from 1 pm to 8 pm, after which it is “noche” (night).


There are two other countries with enough responses for a meaningful statistical analysis: India and Brazil. Here is the same figure, for India: 



In India, morning starts at 4:53 and ends at 10:05 am, noon starts at 11:14 am and ends at 11:55 am, afternoon is between 12 and 1:37 pm, evening between 2:21 and 4:56 pm, and night from 5:45 pm to 8:54 am. Largely, all time expressions referred to earlier times than in the US, with the night spanning over 15 of the 24 hours.



In Brazil, morning is between 5:21 and 11:29 am, noon is between 11:20 am and 12:16 pm, afternoon from 12:39 to 5:20 pm, evening from 5:28 to 5:49 pm, and night from 2:50 pm to 6:20 am. Again, the evening was swallowed by the night, and again, the comments explain it. First, many commented that there is no concept of evening in Brazil. One person elaborated and said that it gets dark early, and once it’s dark, it is already considered night. In addition, some people mentioned that there is no concept of “noon” either.


My own interpretation of these time expressions was as follows: morning from 6 am to 12 pm, noon from 12 to 3 pm, afternoon from 3 to 6 pm, evening from 6 to 10 pm, and night from 10 pm to 6 am. I had almost perfect agreement with my husband, except that he considered morning to start at 4 am. Interestingly, in Hebrew I would use “morning” to describe 4 am, i.e. “4 in the morning”, but because I don’t consider it a reasonable waking time, I made it part of the night. Indeed, this is “early morning”, a time expression I didn’t think of when I designed the survey. Many workers commented that they divide the time from dark to dawn into two or more different segments. Two workers from the Philippines indicated that the length of day and the length of night are equal, and that midnight marks the beginning of the new day, hence the morning. A worker from India commented that in their native language, there is a word for “early morning” used for the time range from 4 am to 6 am, though another Indian worker, possibly speaking a different native language, referred to this time as 12 am to 5 am. A third worker from India referred to 12 am to 4 am as “midnight”. That was surprising to me, because I consider midnight to be the exact time 12 am, although I realize I’m inconsistent with my interpretation of noon. Maybe it would be clearer if it were more common to call it “midday” instead of “noon”.


Apart from the answers from Europe, which were diverse in terms of countries, the other regions were mostly dominated by a single country. The answers from North America were dominated by the US (93.6%), Asia and the Pacific by India (85.6%), South America by Brazil (100%), and Africa and the Middle East had only 5 responses. It would also be interesting to study how the interpretation of time differs between states in the US, and at different times of the day, days of the week, and seasons. Do people tend to say “good night” earlier in the day during the winter, when it gets dark early in the northern hemisphere, or is it always equivalent to “goodbye” after a certain hour, when “have a good day” doesn’t make much sense anymore? To avoid the confusion, Americans often use the generic “have a good one” greeting, allowing the recipient to decide what “one” means in their own schedule.


Tuesday, January 12, 2021

Commonsense Reasoning for Natural Language Processing

This long-overdue blog post is based on the Commonsense Tutorial taught by Maarten Sap, Antoine Bosselut, Yejin Choi, Dan Roth, and myself at ACL 2020. Credit for much of the content goes to the co-instructors, but any errors are mine. 

In the last 5 years, popular media has made it seem that AI is nearly (if not already) solved by deep learning, with reports on super-human performance on speech recognition, image captioning, and object recognition. The release of Google Translate’s neural models in 2016 reported large performance improvements: “60% reduction in translation errors on several popular language pairs”. But looking under the hood, these numbers are misleading. Neural models find shortcuts to the correct answers through dataset-specific input-output correlations, essentially solving the dataset rather than the underlying task. When models are challenged with adversarial out-of-domain examples, they perform poorly. Small, unnoticeable noise added to images confuses object recognition models and changes their predictions. Visual question answering models guess the answer based on the frequency of answers for the same type of question in the training set, e.g. replying "2" to any "how many" question. Image captioning models often learn to recognize objects based solely on their typical environment, and fail to recognize them outside it. In NLP, dialogue systems generate highly generic responses such as “I don’t know” even for simple questions. Open-ended generation is prone to repetition. Question answering systems are easily distracted by the addition of an unrelated sentence to the passage. And more.


Figure 1: adversarial examples in computer vision (left) and natural language processing tasks (right).


Machine learning models today perform reasonably well on perception tasks (image and speech recognition). However, they mostly lack the ability to perform simple intuitive commonsense inferences that humans do in every minute of their waking hours, regarding pre- and post-conditions of events, understanding other people's motivations and intents, mental and emotional states, etc. 


The boundaries of commonsense are quite challenging to define, but we will go with this working definition:
Commonsense is the basic level of practical knowledge and reasoning concerning everyday situations and events that are commonly shared among most people. 
For example, it's commonsense that it's OK to keep the closet door open, but not the fridge door, as the food inside might go bad. 

Types of commonsense: 

Commonsense knowledge can be categorized according to types, including but not limited to:
  • Social commonsense: people are capable of making inferences about other people's mental states, e.g. what motivates them, what they are likely to do next, etc. This kind of inference is captured by the ATOMIC knowledge base, discussed later. In addition, we each have a set of social norms of accepted behavior, e.g. knowing that “it's impolite to comment on someone's weight”. While these are often implicit in our actions and decisions, machines need to be taught them explicitly.

  • Temporal commonsense: natural language rarely communicates explicit temporal information. Instead, it is vague and relies on the commonsense knowledge of the listener. For example, when told that "Dr. Porter is taking a vacation" we can predict that Dr. Porter will not be able to see us soon, as opposed to when "Dr. Porter is taking a walk". This requires knowing the typical duration of "taking a walk" (minutes) and that of "taking a vacation" (days). Other temporal knowledge includes the typical times, order, frequency, etc. of events, which are addressed by the MC-TACO dataset and the TACO-LM time-aware contextual language model.

  • Physical commonsense: that a glass will likely shatter if it falls to the floor is a fact most people (and arguably cats) know. Physical commonsense includes knowledge about the physical properties and affordances of everyday objects, as tested in the PIQA dataset.

Commonsense is essential for humans to navigate everyday situations seamlessly and interact with each other in a reasonable and safe way, and for AI to understand human needs and actions better. Yet, endowing machines with such human-like commonsense reasoning capabilities has remained an elusive goal of AI research for decades. Past attempts, in the 1960s and 1970s, resulted in an AI winter, i.e. reduced interest and funding for AI research due to failed, over-hyped research directions. In recent years, new interest in machine commonsense has emerged with the availability of stronger computing power and huge amounts of data. With that said, the path to machine commonsense is unlikely to be brute-force training of larger neural networks with deeper layers.

Is commonsense knowledge already captured by pre-trained language models?

In the last 3 years, language models have been ubiquitous in NLP. Language models are pre-trained once, in a self-supervised manner that requires only a large text corpus. Traditionally, language models are trained to predict the next word in a sentence (top part of Figure 2, in blue), but they can also predict hidden (masked) words in the middle of the sentence, as in Google's BERT model (top part of Figure 2, in orange). This pre-training phase yields a function that gets a sequence of words (sentence, short paragraph) and returns a vector for each word in the sequence. 
  

Figure 2: Language models pre-training and fine-tuning.


As opposed to word embeddings, which are static, language model-based word vectors are dynamic and re-computed for each context. At the most basic level, they assign different vectors to words when they are used in different senses, as in Figure 3.


Figure 3: Static vs. dynamic word representations.
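To make the static-versus-dynamic distinction concrete, here is a small sketch, assuming the Hugging Face transformers library (my choice, not something this post prescribes), that compares the contextual vectors BERT assigns to the same word in different senses:

# Compare BERT's contextual vectors for "bank" in a river sense vs. a money sense.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Return the contextual vector of the first occurrence of `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v_river = word_vector("the fisherman sat on the bank of the river .", "bank")
v_money = word_vector("she deposited the check at the bank downtown .", "bank")
v_money2 = word_vector("the bank approved my loan application .", "bank")

cos = torch.nn.functional.cosine_similarity
print("river vs. money sense:", cos(v_river, v_money, dim=0).item())
print("money vs. money sense:", cos(v_money, v_money2, dim=0).item())

The two money-sense occurrences should end up closer to each other than to the river-sense occurrence, which is exactly what a single static embedding for "bank" cannot express.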



Do off-the-shelf pre-trained language models already capture commonsense knowledge? 

✅  They are capable, to some extent, of filling in incomplete commonsense facts or ranking candidate facts. For example, the language model score (≈ statement plausibility) of a fact like "a musician plays a musical instrument" is higher than that of "a dancer plays a musical instrument". This is evidence that, in addition to lexical and syntactic knowledge, language models capture general knowledge about the world.

✅  They can, to some extent, associate concepts with their properties. They distinguish concepts associated with a given set of properties, i.e. complete a statement such as "       has fur, is big, and has claws, has teeth, is an animal, ..." with bear (just like playing the "20 questions" game). They perform better when they are shown encyclopedic properties (e.g. is an animal) as opposed to perceptual properties (e.g. smooth). They can also, pretty successfully, list the properties associated with given concepts, e.g. complete the sentence "Everyone knows that a bear has       " with fur, claws, teeth, etc.

However, knowledge generated from language models is noisy! 

馃毇 Several papers have shown that language models are not sensitive to negation, i.e. they consider the negated version of facts ("birds can't fly") as similarly plausible. 

馃毇 They are sensitive to phrasing:


馃毇  In distributional word vectors, the vector representing a (sub-)word is learned from the contexts in which it appeared, leading to similar representation for semantically-similar words. In language models, the representation of similar contexts is similar, so the model learns which type of word should appear next (or instead of a masked token). This is generally a positive thing, but it sometimes over-generalizes, leading to examples such as this: 


Figure 4: BERT guesses that the masked token should be a color, but fails to predict the correct color. Using the AllenNLP demo.


Here, BERT has seen in its training corpus enough sentences of the type "The color of something is [color]" to know to suggest different colors as substitutes for the masked word. Unfortunately, not every color is suitable in every context that calls for a color. BERT likely didn't see enough sentences discussing the color of a dove, so it defaults to predicting just any color.
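If you want to poke at this behavior yourself, a fill-mask query is a one-liner. This sketch uses the Hugging Face transformers pipeline rather than the AllenNLP demo shown in Figure 4, and the exact predictions will of course depend on the checkpoint:

# Query BERT for the most likely fillers of the masked slot.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT reliably suggests colors for the masked slot, but whether the *right* color
# ("white"/"grey") ranks first depends on what it saw in pre-training.
for prediction in fill_mask("The color of a dove is [MASK]."):
    print(f'{prediction["token_str"]:>10}  {prediction["score"]:.3f}')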

So knowledge in language models is not the most accurate and reliable. Is it still useful?

Yes, to some extent. One way to show it is through evaluation on tasks requiring commonsense knowledge. We will discuss several such tasks, but for now let's focus on WinoGrande as an example. It is the large-scale version of the Winograd Schema Challenge. Given a sentence with a cloze, the goal is to fill in the blank with a previously mentioned entity or concept, out of two answer choices. For example: 

Because Brett found an internship while in college but Ian was unable to, _____ found a job less quickly after graduation. 
Choices: Brett, Ian

What makes this task especially difficult is that every instance has a twin sentence which is minimally changed such that the correct answer is the other one (for instance, replacing "less quickly" with "more quickly" will change the correct answer from Ian to Brett). 

Language model-based models top the leaderboards of WinoGrande and other commonsense tasks, but since they are trained on task-specific training data, which often contains tens or hundreds of thousands of training examples, it's hard to attribute the success to the knowledge captured in language models during the pre-training step. A better way to estimate it is with zero-shot (unsupervised) models. Typically, zero-shot models address multiple-choice tasks by phrasing a statement from the instance and each answer choice, and computing the language model score as a proxy for plausibility:

P_LM(The answer is answer_1)
P_LM(The answer is answer_2)
...
P_LM(The answer is answer_k)

The model then predicts the answer choice with the best language model score (the highest probability or, equivalently, the lowest perplexity).
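Here is a minimal sketch of that zero-shot recipe, using GPT-2 as the scorer (the post doesn't prescribe a particular language model) and the WinoGrande example from above: each candidate is substituted into the sentence, and the candidate with the lowest average negative log-likelihood (i.e. lowest perplexity) wins.

# Zero-shot multiple-choice scoring with a causal language model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def lm_loss(text):
    """Average per-token negative log-likelihood of `text` under the language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

context = ("Because Brett found an internship while in college but Ian was unable to, "
           "{} found a job less quickly after graduation.")
choices = ["Brett", "Ian"]

scores = {c: lm_loss(context.format(c)) for c in choices}
prediction = min(scores, key=scores.get)   # lowest loss = lowest perplexity = most plausible
print(scores, "->", prediction)

Nothing guarantees that a small off-the-shelf model gets this particular instance right; the point is only the scoring mechanism.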

In our recent EMNLP paper, we took it one step further and asked whether we can use language models to generate what would otherwise be missing or implicit knowledge needed for solving a multiple-choice commonsense question answering instance. We proposed the unsupervised "self-talk" framework, which uses language models to generate information-seeking questions such as "what is the definition of..." and their corresponding answers (clarifications) to discover additional background knowledge. In the example in Figure 5, knowing that internship experience may help a person get a job is crucial for answering the question (which of Brett and Ian found a job less quickly?). On most benchmarks, the self-talk model performed better than unsupervised models with no additional knowledge, while competing with models that have access to knowledge bases. This is despite the inaccurate and noisy knowledge language models generate. However, when we showed people some of the clarifications that helped the model choose the correct answer choice, they judged only 40% of them as actually providing helpful information. This discrepancy means that our model doesn't imitate the human reasoning process; it works differently. Check out our demo! It's not always accurate but it's often funny :)

Figure 5: An example of clarification generation for an instance from WinoGrande.


The best performance on commonsense tasks is achieved by fine-tuning language models, i.e. training them on task-specific data. Let's look at some of the benchmarks and the issues we face with supervised learning.  

How to measure commonsense reasoning capabilities? 

Multiple commonsense benchmarks have been released over the last few years. Some of them will be discussed here (see examples in Figure 6), along with the main differences and design choices when creating a benchmark.

Figure 6: Some commonsense benchmarks along with an example instance. 


Type of knowledge: some benchmarks focus on a specific type of commonsense knowledge, such as social commonsense (e.g. Social IQa),  physical commonsense (e.g. PIQA), temporal commonsense (e.g. MC-TACO),  or causes and effects (e.g. COPA), while others target a broader domain of general commonsense knowledge and reasoning (e.g. WSC, WinoGrande, CommonsenseQA, ROCStories).  

Size: most recent datasets include a large training set, in order to facilitate training large neural models. One way to create a benchmark is to hire experts to curate a high-quality dataset, as was done for WSC and COPA. These datasets are rather expensive to collect and are therefore typically small. The common alternative is to collect data through crowdsourcing or semi-automatically, and split it randomly into train, validation, and test sets. Models that learned data-specific shortcuts in the training set instead of generalized phenomena are likely to perform well on a test set drawn from the same distribution, but this performance is misleading and likely much better than their performance on real-world instances of the task. Despite this understanding, this is still the dominant approach.

Format: the vast majority of datasets are in the format of multiple-choice questions, as exemplified in Figure 6. This format is the easiest to evaluate automatically: models are judged by their accuracy, i.e. the percentage of questions they answered correctly. Unfortunately, this type of task also makes it possible for a model to guess the correct answer. We're not talking about a random guess, which would leave enough room for improvement. A random guess is expected to result in an accuracy of 100/k %, where k is the number of answer choices, e.g. 50% accuracy for binary tests, 33.3% for tests with 3 choices, 25% for 4 choices, etc. The risk is that the model makes an "educated guess" based on (yes, you guessed it correctly) spurious correlations between the questions and the correct/incorrect answers.

How do you make sure a model is right for the right reasons?

That's the million-dollar question. We don't have a perfect solution for this problem yet. For a start, when collecting a new benchmark, the process of collecting incorrect answers (=distractors) should be well-designed such that distractors are plausible but unlikely. Using random answers as distractors (e.g. naturally-occurring sentences or correct answers of different questions) would create topically-different distractors, which are easy to detect (remember, relatedness is one of the strengths of distributional text representations). Asking people to come up with the distractors may introduce other annotation artifacts, such as exaggerations, going off-topic, or producing overly emotional texts, which are easy for models to detect. Some solutions have been proposed: for example, the distractors in Social IQa are answers to different questions asked about the same context. In Figure 7, the context "Alex spilt food all over the floor and it made a huge mess." appears in the dataset with two questions: "what happens next?" and "what happened before?". The distractors of "what happens next?" are the correct answers of "what happened before?", e.g. that Alex has slippery hands. A similar approach is taken in CommonsenseQA.

Figure 7: Creating distractors for a Social IQa instance. Image credit: Maarten Sap.

An alternative solution is to filter out easy questions through "adversarial filtering", i.e. training a weaker model and iteratively removing instances that it succeeds in answering. Variants of adversarial filtering were applied to WinoGrande and PIQA. 

Finally, I believe the future is in generative tasks, in which the model needs to produce a free-text answer without being provided with candidate answers. Several recent benchmarks are generative, such as TimeTravel (counterfactual reasoning), ART (abductive reasoning), CommonGen, and ProtoQA. The challenge in generative tasks is the lack of reliable automatic evaluation metrics. Given the gold standard reference answer(s), we would like a metric to (1) reward correct generated answers that are different from the reference answer, while (2) penalizing incorrect answers that are similar (e.g. lexically) to the reference. Human evaluation is reliable, but it is costly and is typically done once on the test set. In order to be able to improve models during development, we need automatic metrics. We currently settle for lexical-overlap metrics such as BLEU and ROUGE, which are pretty terrible at (1) and correlate poorly with human judgements, or model-based metrics such as BERTScore, which are not great at (2).
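As a tiny illustration of problems (1) and (2), here is a sketch using the sacrebleu package (my choice of implementation; the sentences are invented): a correct answer phrased differently from the reference scores poorly, while a wrong answer that copies most of the reference's wording scores well.

# Lexical-overlap metrics reward surface similarity, not correctness.
import sacrebleu

reference = "He went to the animal shelter and picked out a cat."
good_but_different = "He adopted a kitten from the local rescue."           # correct, low overlap
wrong_but_similar = "He went to the animal shelter and picked out a dog."   # wrong, high overlap

for hyp in (good_but_different, wrong_but_similar):
    print(f"{sacrebleu.sentence_bleu(hyp, [reference]).score:5.1f}  {hyp}")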

How to gather and represent machine readable commonsense knowledge?

Commonsense resources provide machine-readable knowledge about the world. Resources are expected to be large-scale and accurate, consist of diverse knowledge types, and be usable in downstream tasks. ConceptNet is a large (21 million assertions), commonly-used resource consisting of general commonsense knowledge, in over 85 languages. ATOMIC consists of 880,000 triplets reasoning about causes and effects of everyday situations. Other resources are listed in Figure 8.

Figure 8: Overview of existing commonsense resources. Image credit: Maarten Sap. 


Existing resources differ in several aspects:

Representation: how is knowledge represented in the resource? ConceptNet and ATOMIC represent knowledge in natural language (Figure 9), while NELL and Cyc represent knowledge in symbolic logic:

(#$implies (#$and (#$isa ?OBJ ?SUBSET) (#$genls ?SUBSET ?SUPERSET)) (#$isa ?OBJ ?SUPERSET)) 


Figure 9: example knowledge extracted from ConceptNet and ATOMIC. Image credit: Maarten Sap. 


Knowledge type: ConceptNet consists of semantic knowledge, i.e. properties of concepts (e.g. reading is a type of activity). ATOMIC, on the other hand, is inferential: given a templated event with "PersonX" representing the subject and "PersonY" an optional object (e.g. PersonX yells at PersonY), and one of 9 pre-defined relation dimensions (e.g. PersonX's motivation), it provides a second event (e.g. PersonX wanted to express anger).

Collection method: knowledge can be collected from humans, either experts or crowdsourcing workers. Expert-curated resources are more uniform and accurate, and may use complex representations, but this collection method is expensive and very time-consuming. Alternatively, non-experts can write knowledge in natural language, making the collection faster and more scalable.

The alternative approach is to extract knowledge automatically from texts, as in NELL. This approach works, but it produces less accurate knowledge. In addition, the approach suffers from reporting bias: over-representing the rare at the expense of the trivial. For example, people are reported to murder more often than they are reported to breathe. Default properties of concepts (yellow banana) are mentioned less often than their alternatives (green banana), etc. 



How to enhance neural models for commonsense reasoning tasks with symbolic knowledge?

Most models developed for solving commonsense benchmarks today are based on language models. Typically, each answer choice, along with the context, forms a statement. The language model computes a vector representing each statement. These vectors are then fed into a classifier that assigns a plausibility score for each candidate answer:


Figure 10: An illustration of using BERT to score the answer choices of a WinoGrande instance.
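One way to realize the setup in Figure 10 is with the transformers multiple-choice head, which encodes each (context, choice) pair with BERT and scores it with a linear layer. A sketch follows; note that the scoring layer is freshly initialized here and only becomes meaningful after fine-tuning on the task's training data, and that simply pairing the choice with the context (rather than substituting it into the blank) is a simplification chosen for brevity.

# Score each (context, choice) pair with a BERT-based multiple-choice head.
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

context = ("Because Brett found an internship while in college but Ian was unable to, "
           "_ found a job less quickly after graduation.")
choices = ["Brett", "Ian"]

# One (context, choice) pair per candidate, batched along a "num_choices" dimension.
encoded = tokenizer([context] * len(choices), choices, return_tensors="pt", padding=True)
inputs = {k: v.unsqueeze(0) for k, v in encoded.items()}   # (1, num_choices, seq_len)

with torch.no_grad():
    logits = model(**inputs).logits        # (1, num_choices): one plausibility score per choice
print(dict(zip(choices, logits[0].tolist())))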


Static neuro-symbolic integration

The knowledge in commonsense resources may enhance models built for solving commonsense benchmarks. For example, we can extract from ConceptNet the assertions that job is used for making money, that spending money requires making money, that buying requires spending money, and that car is something you can buy. Ideally we would also need the knowledge that a high-paying job is a type of job, specifically one used for making a lot of money, which is required for spending a lot of money, which is required for buying something that costs a lot of money, car being one of them. Finally, we may want to remove the edge from "buy" to "car" so we can only get to "car" from the node "buy something that costs a lot of money". 


Figure 12: Knowledge extracted from ConceptNet for the WinoGrande instance discussed above.


How do we incorporate knowledge from knowledge resources into a neural model?

The simple recipe (success not guaranteed) calls for 4 ingredients: the task addressed, the knowledge resource used, the neural component, and the combination method. We have already discussed tasks and knowledge resources, so I would only add here that ConceptNet is the main resource utilized by downstream models, although some models incorporate other knowledge sources, such as other knowledge bases (WordNet, ATOMIC), knowledge mined from text, and tools (knowledge base embeddings, sentiment analysis models, COMET; see below).


Figure 13: Resources used by most knowledge-informed commonsense models.

The neural component is the shiny new neural architecture: language models in the last 3 years, biLSTMs in the years prior, etc. The more interesting component is the combination method. We will look at 3 examples:

Incorporating into the scoring function: Lin et al. (2017) extracted probabilistic "rules" connecting pairs of terms from multiple sources such as WordNet (restaurant→eatery: 1.0), Wikipedia categories (restaurant→business: 1.0), script knowledge mined from text (X went to a restaurant→X ate: 0.32), word embedding-based relatedness scores (restaurant→food: 0.71), and more. The model scores each candidate answer according to the scores of the inference rules used to get from the context (e.g. "Mary walked to a restaurant" in Figure 14) to the candidate answer (e.g. "She ordered foods.").  


Figure 14: "covering" each candidate answer by the original context and the rules extracted from various sources. Image credit: Lin et al. (2017).


Representing symbolic knowledge as vectors: Lin et al. (2019) used BERT as the neural component to represent the instance (statement vector). For their symbolic component, they extracted subgraphs from ConceptNet pertaining to concepts mentioned in the instance and learned to represent them as a vector (graph vector). These two vectors were provided as input to the answer scorer which was trained to predict the correct answer choice. 

Figure 15: extracting subgraphs from ConceptNet pertaining to concepts mentioned in the instance. Image credit: Lin et al. (2019).

Multi-task learning: Xia et al. (2019) fine-tuned a BERT model to solve the multiple-choice questions. They also trained two auxiliary tasks supervised by ConceptNet, in which two concepts were given as input and the classifier had to predict whether they are related, and which specific ConceptNet property connects them. The BERT model was shared between the main and the auxiliary tasks, so that commonsense knowledge from ConceptNet was instilled into BERT, improving its performance on the main task.


Figure 16: multi-task learning aimed at instilling knowledge from ConceptNet into BERT.
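The following is an illustrative sketch of this kind of multi-task setup, not Xia et al.'s actual code: a shared BERT encoder feeds both the multiple-choice scorer and one auxiliary head that predicts whether two concepts are related in ConceptNet (the second auxiliary task, predicting the specific relation, would just be another head).

# A shared encoder with a main-task head and an auxiliary ConceptNet head.
import torch.nn as nn
from transformers import AutoModel

class SharedEncoderMultiTask(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)   # shared between tasks
        hidden = self.encoder.config.hidden_size
        self.answer_scorer = nn.Linear(hidden, 1)    # main task: score one (context, choice) pair
        self.relation_head = nn.Linear(hidden, 2)    # auxiliary task: related / not related

    def encode(self, **inputs):
        # Use the [CLS] vector as a summary of the input pair.
        return self.encoder(**inputs).last_hidden_state[:, 0]

    def main_loss(self, inputs, labels):
        # inputs hold (batch * num_choices, seq_len) tensors; reshape scores to (batch, num_choices).
        scores = self.answer_scorer(self.encode(**inputs)).view(labels.size(0), -1)
        return nn.functional.cross_entropy(scores, labels)

    def auxiliary_loss(self, concept_pair_inputs, related_labels):
        logits = self.relation_head(self.encode(**concept_pair_inputs))
        return nn.functional.cross_entropy(logits, related_labels)

# During training, both losses update the shared encoder:
#   loss = model.main_loss(qa_batch, qa_labels) + model.auxiliary_loss(cn_batch, cn_labels)
#   loss.backward()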


Dynamic neuro-symbolic integration

There are two main limitations to the neuro-symbolic integration discussed above:
  1. Coverage: relevant knowledge is often not found as-is in commonsense knowledge resources. As we've seen earlier, commonsense knowledge is immeasurably vast, so much of it is not documented. 

  2. Precision and context: knowledge found in the knowledge base about concept X doesn't necessarily apply to all contexts in which X appears. For example, when provided with "PersonX adopts a cat", ATOMIC says that PersonX had to go to the shelter first (Figure 17), but that's not always the case. It could just as well be that PersonX adopted a cat they found on the street, or got the cat from a friend who was no longer able to care for it.

Figure 17: ATOMIC inferences for the event "PersonX adopted a cat".


How do we provide machines with large-scale, contextualized commonsense knowledge?

The solution is to leverage manually curated commonsense knowledge resources, such as ConceptNet and ATOMIC, to train a model that can dynamically produce such knowledge for a given context. Commonsense knowledge resources are typically sparse, making training a knowledge base completion model to extend the resource less efficient. Pre-trained language models and their inherent knowledge come in handy here. Language models (such as GPT) implicitly represent knowledge, so you can re-train them on completing knowledge base assertions (e.g. from ATOMIC) to teach them the structure of knowledge. This is what COMET (COMmonsEnse Transformers) does, as illustrated in Figure 18. 


Figure 18: Illustration of the training process of COMET: The language model is fine-tuned to predict the "tail entity" (e.g. inference in ATOMIC) given the "head entity" and the relation. Image credit: Antoine Bosselut.
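A minimal sketch of the general recipe, not the released COMET code: each knowledge triple is serialized as plain text, and a causal language model (GPT-2 here) is fine-tuned with the usual next-token objective so that it learns to generate the tail. The [GEN] separator and the loss masking are simplifications (COMET computes the loss only on the tail tokens; this sketch does not), and the two example triples just paraphrase inferences mentioned in this post.

# Fine-tune a causal LM to complete serialized knowledge triples, COMET-style.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A couple of ATOMIC-style examples, written for illustration only.
triples = [
    ("PersonX adopts a cat", "xNeed", "to go to the shelter"),
    ("PersonX yells at PersonY", "xIntent", "to express anger"),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for head, relation, tail in triples:
    text = f"{head} {relation} [GEN] {tail}{tokenizer.eos_token}"
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss   # next-token prediction over the serialized triple
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# After training on the full resource, generation continues a "head relation [GEN]" prompt:
prompt = tokenizer("PersonX adopts a cat xNeed [GEN]", return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(prompt, max_new_tokens=10)[0]))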


COMET is capable of dynamically generating inferences for any context. For example, if we modify the context from ATOMIC to "David adopted his sister's cat because they found out her husband was allergic.", which for obvious reasons does not appear in ATOMIC, COMET no longer predicts that PersonX (David) had to go to the shelter, but instead that he, for example, needed to find out about it.

COMET has been used successfully in various downstream tasks requiring commonsense knowledge. Models trained on ATOMIC or on ConceptNet are available, and the demo for both ATOMIC and COMET can be found here. There is also a Visual COMET that can generate inferences from images. 

Summary

We talked about ways to acquire and represent commonsense knowledge in machine-readable format, ways to measure commonsense reasoning abilities, and ways to integrate this kind of knowledge into models. None of these is solved yet. Manually collecting all commonsense knowledge is infeasible, while extracting it from texts or from language models suffers from inaccuracies, reporting bias, and societal biases. Looking forward, a promising research direction is multi-modal commonsense knowledge acquisition, e.g. learning from text along with images and videos. For example, looking through enough class photos, you might learn that the kids in the front row typically sit (especially if the kids in the last row are also seated).


Machines may reach human performance on commonsense benchmarks, but it's often due to being right for the wrong reasons rather than actually possessing and successfully applying commonsense knowledge and reasoning abilities. Generative tasks are somewhat less prone to this issue, but we would have to develop reliable automatic evaluation metrics to make them the standard.

Machine commonsense reasoning is becoming more and more popular within NLP so I am optimistic about future breakthroughs!