tag:blogger.com,1999:blog-91451206782901951312024-03-16T16:51:42.528+02:00Probably Approximately a Scientific BlogHuman-interpretable computer science and other ramblingsVered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.comBlogger27125tag:blogger.com,1999:blog-9145120678290195131.post-23365820176342609202021-10-05T03:52:00.003+03:002021-10-06T19:45:35.986+03:00Interpretation of Time Expressions in Different Cultures <p style="text-align: left;"><span style="font-family: Arial; white-space: pre-wrap;"><i>This is a section from a nonfiction book I'm writing (...I guess this public announcement will now pressure me to finish it and find a publisher </i>😬<i>). Thanks </i></span><span style="font-family: Arial; font-style: italic;"><span style="white-space: pre-wrap;">Anna Pryslopska for initiating the <a href="https://twitter.com/anna_pryslopska/status/1444923131345448961?s=20">interesting Twitter discussion</a>!</span></span></p><p><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">It's common knowledge that Americans use the unique date format mm-dd-yyyy, in which the month appears before the day, unlike the rest of the world. 
There is no clear answer to why that is, but <a href="https://iso.mit.edu/americanisms/date-format-in-the-united-states/">some hypothesize</a> that this format was used in the UK, brought to the US by the Brits, and later changed in the UK to the European format dd-mm-yyyy.</span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> This is certainly a cause for confusion, which is almost inevitable when a date signifies July 4th for one person and April 7th for another. With that said, it may also be useful in some very specific scenarios. A young colleague, who travelled to a conference in the US a few weeks before his 21st birthday on July 4th, successfully purchased alcohol by misleading the bartender into thinking that his birthday, which was printed on a non-US passport, was April 7th. </span></p><span id="docs-internal-guid-202ee0de-7fff-321d-5fb1-9e3b32413f77"><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">I was more surprised to learn that Americans use a 12-hour clock (with AM and PM distinctions as needed) rather than the 24-hour clock. I think part of the reason this was less noticeable is that both clocks are acceptable outside the US. I didn't notice until, during a conference, I texted an American friend that I would meet him at 18:00 next to the escalator, which made him chuckle and inform me that I was using "military time". 
</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">It was the same friend with whom, only a couple of years earlier, I was supposed to go sightseeing on the day after the conference ended; in the morning he texted me that we could meet in the afternoon. I was surprised, because my interpretation of "afternoon", based on the norm associated with its literal translation to Hebrew, was around 4 or 5 pm, which seemed quite a late hour to begin sightseeing. He, of course, literally meant any time after 12 pm, which is a reasonable time to leave your hotel room after a week of exhausting conferencing. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Indeed, people vary in how they interpret time expressions. <a href="https://direct.mit.edu/coli/article/28/4/545/1785/Human-Variation-and-Lexical-Choice">A 2002 study by the University of Aberdeen</a> analyzed human-written weather forecasts along with the weather data they described.</span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> The study found significant individual differences between forecasters in the interpretation of some time phrases, such as "by evening", but full agreement on other expressions, such as "midday". 
</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">I couldn't find any study on cultural differences in the interpretation of time expressions, so I conducted my own. I built a very simple survey with the following questions:</span><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /><br /></span></p><ol style="margin-bottom: 0px; margin-top: 0px; padding-inline-start: 48px;"><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: decimal; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Where are you from, or where have you lived most of your life?</span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: decimal; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">What is the range of time you consider as </span><span 
style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">morning</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> (or the equivalent of </span><span style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">morning</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> in your native language)?</span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: decimal; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">What is the range of time you consider as </span><span style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">noon</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> (or the equivalent of </span><span style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">noon</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> in your native language)?</span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; 
font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: decimal; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">What is the range of time you consider as </span><span style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">afternoon</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> (or the equivalent of </span><span style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">afternoon</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> in your native language)?</span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: decimal; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">What is the range of time you consider as </span><span style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">evening</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> (or the equivalent of </span><span 
style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">evening</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> in your native language)?</span></p></li><li aria-level="1" dir="ltr" style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: decimal; vertical-align: baseline; white-space: pre;"><p dir="ltr" role="presentation" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">What is the range of time you consider as </span><span style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">night</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> (or the equivalent of </span><span style="font-size: 11pt; font-style: italic; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">night</span><span style="font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> in your native language)?</span></p></li></ol><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">I published the survey on Amazon Mechanical Turk, a crowdsourcing platform that enables recruiting workers to perform discrete tasks. 
To get answers from a range of countries, I published several batches of questionnaires, each time limiting them to workers from specific regions of the world.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Before I dive into the results, I would like to point out that this study was not conducted with my usual level of scientific rigour, mostly for budgetary reasons. To put it less subtly: I did not conduct this experiment for my work, so I couldn't pay for it with my research budget; I paid with my own money, and I went cheap. Because my budget was limited, I collected only 349 answers, which means that for most countries I collected only a handful of answers or no answers at all, so conclusions about those countries have weaker statistical support. Moreover, some countries have many more Mechanical Turk workers than others, so I ended up collecting a very uneven number of responses from each country. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In addition, I live in the Pacific time zone (PST), and the time I published the batches affected the country distribution of the workers who responded to the survey. For example, I thoughtlessly published the North American survey at 4 pm PST on a Friday, which likely meant most of the answers came from people living on the West Coast. People living in countries where it was nighttime when the survey was made available were either underrepresented or, worse, distorted the data. 
Think of a person answering a survey at 2 am their time: do you really trust them as a representative of their culture with respect to time? Go to sleep, dude. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">So, if you would still like to discover the results of my very unscientific study, here they are. This was the country distribution:</span></p><br /><table border="1" cellpadding="0" cellspacing="0" dir="ltr" style="border-collapse: collapse; border: none; font-family: Arial; font-size: 10pt; table-layout: fixed; width: 0px;" xmlns="http://www.w3.org/1999/xhtml"><colgroup><col width="100"></col><col width="100"></col></colgroup><tbody><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"United States of America"}" style="border: 1px solid rgb(0, 0, 0); overflow: hidden; padding: 2px 3px; vertical-align: bottom;">United States of America</td><td data-sheets-value="{"1":3,"3":103}" style="border-color: rgb(0, 0, 0) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">103</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"India"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">India</td><td data-sheets-value="{"1":3,"3":83}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; 
vertical-align: bottom;">83</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Brazil"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Brazil</td><td data-sheets-value="{"1":3,"3":48}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">48</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Italy"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Italy</td><td data-sheets-value="{"1":3,"3":37}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">37</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"United Kingdom"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">United Kingdom</td><td data-sheets-value="{"1":3,"3":19}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">19</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Spain"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Spain</td><td 
data-sheets-value="{"1":3,"3":8}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">8</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"France"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">France</td><td data-sheets-value="{"1":3,"3":7}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">7</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Philippines"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Philippines</td><td data-sheets-value="{"1":3,"3":4}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">4</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Canada"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Canada</td><td data-sheets-value="{"1":3,"3":3}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">3</td></tr><tr style="height: 21px;"><td 
data-sheets-value="{"1":2,"2":"Australia"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Australia</td><td data-sheets-value="{"1":3,"3":3}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">3</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"American Samoa"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">American Samoa</td><td data-sheets-value="{"1":3,"3":2}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">2</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Israel"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Israel</td><td data-sheets-value="{"1":3,"3":2}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">2</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Macedonia"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Macedonia</td><td data-sheets-value="{"1":3,"3":2}" style="border-color: rgb(204, 204, 
204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">2</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Ireland"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Ireland</td><td data-sheets-value="{"1":3,"3":2}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">2</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Germany"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Germany</td><td data-sheets-value="{"1":3,"3":2}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">2</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Greece"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Greece</td><td data-sheets-value="{"1":3,"3":2}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">2</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Romania"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 
0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Romania</td><td data-sheets-value="{"1":3,"3":2}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">2</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Taiwan"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Taiwan</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Saudi Arabia"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Saudi Arabia</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Thailand"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Thailand</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; 
overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Hong Kong"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Hong Kong</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Austria"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Austria</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Andorra"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Andorra</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Barbados"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; 
vertical-align: bottom;">Barbados</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Azerbaijan"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Azerbaijan</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Equatorial Guinea"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Equatorial Guinea</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Anguilla"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Anguilla</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: 
bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Ethiopia"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Ethiopia</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Netherlands"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Netherlands</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Malta"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Malta</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Poland"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Poland</td><td data-sheets-value="{"1":3,"3":1}" 
style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Sri Lanka"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Sri Lanka</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Belgium"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Belgium</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Lithuania"}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Lithuania</td><td data-sheets-value="{"1":3,"3":1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Sweden"}" style="border-color: 
rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Sweden</td><td data-sheets-value="{&quot;1&quot;:3,&quot;3&quot;:1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;Pakistan&quot;}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Pakistan</td><td data-sheets-value="{&quot;1&quot;:3,&quot;3&quot;:1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr><tr style="height: 21px;"><td data-sheets-value="{&quot;1&quot;:2,&quot;2&quot;:&quot;Singapore&quot;}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; vertical-align: bottom;">Singapore</td><td data-sheets-value="{&quot;1&quot;:3,&quot;3&quot;:1}" style="border-color: rgb(204, 204, 204) rgb(0, 0, 0) rgb(0, 0, 0) rgb(204, 204, 204); border-image: initial; border-style: solid; border-width: 1px; overflow: hidden; padding: 2px 3px; text-align: right; vertical-align: bottom;">1</td></tr></tbody></table><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Since the US is dominant in the survey, let's first analyze the results from participants in the US. 
The following figure presents the average start and end times for each time expression, along with error bars marking the standard deviation, which measures the dispersion of the data relative to the average. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 231px; overflow: hidden; width: 602px;"><img height="231" src="https://lh6.googleusercontent.com/9XRXM8faauL4GNoeLKkUQiyIVNGGjbTsa28JEOE-18l8LRj_GdUtTZfBVG1pyDhSrw8AQPAhw0kfcSak0kLZYQ6Lsho3TYWfzhv0wcVwNVPjX3dxhBeMWru-RkqUEKavQ8CqkqNa=s0" style="margin-left: 0px; margin-top: 0px;" width="602" /></span></span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Americans considered morning on average to span from 4:45 to 11:27 am, noon from 11:47 am to 12:41 pm, afternoon from 1:06 pm to 4:27 pm, evening from 4:27 to 7:04 pm, and night from 7:19 pm to 9:30 am. If you're wondering about the apparent contradiction between the early morning start time (4:45 am) and the late night end time (9:30 am), the error bars can explain this discrepancy. The night end time data had the largest standard deviation, with many outliers, such as people who considered the night to end at 11:59 pm. A more informative statistic that is less sensitive to outliers is the median: the time such that half of the respondents chose it or an earlier time as the end of the night, and half chose it or a later time. The median was much earlier, at 5:45 am. 
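The robustness of the median to outliers is easy to demonstrate in a few lines of Python. This is a minimal sketch with made-up "end of night" answers (not the actual survey data), using only the standard library:

```python
from statistics import mean, median

def to_minutes(hhmm: str) -> int:
    """Convert 'HH:MM' (24-hour clock) to minutes past midnight."""
    h, m = map(int, hhmm.split(":"))
    return h * 60 + m

def to_hhmm(minutes: float) -> str:
    """Convert minutes past midnight back to 'HH:MM'."""
    total = round(minutes) % (24 * 60)
    return f"{total // 60:02d}:{total % 60:02d}"

# Hypothetical "end of night" answers; 23:59 plays the role of the
# outliers observed in the survey.
answers = ["05:30", "05:45", "06:00", "06:30", "07:00", "23:59"]
mins = [to_minutes(a) for a in answers]

print("mean:  ", to_hhmm(mean(mins)))    # 09:07 -- dragged late by the outlier
print("median:", to_hhmm(median(mins)))  # 06:15 -- barely affected
```

Remove the single 23:59 answer and the mean drops to 6:09, right next to the median; one extreme respondent is enough to shift the average by hours.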
</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">At this point, Iâve already empirically shown that Americans indeed consider ânoonâ as a very narrow time slot around 12 pm, although a small number of them were extremely early risers for whom 10 am already feels like noon, and some considered noon to end as late as 2 pm. Another observation that stands out for me here is the early evening beginning. It explains the early US dinner. If the evening starts at 4:30 pm, âThe Cadillacâ episode in season 7 of Seinfeld seems slightly less crazy. In this episode, Jerry visits his retired parents in Florida. They are getting ready to go to dinner at 4:30, to make it to the early-bird rate. Jerry says he canât âforce feed himself a steak at 4:30â and convinces them to wait for the regular priced dinner at 6. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Even if you treat the retiree population in Florida as an outlier, Dinner in the US is eaten rather early, around 6pm. Iâve had work dinners at 5:30pm as well. </span><span style="font-family: Arial; font-size: 11pt; white-space: pre-wrap;">Iâve heard about restaurants that are so busy that you must book a table for dinner⊠unless you are willing to eat as late as 8 pm. Needless to say, 8 pm seems like a perfectly good time for dinner to me. Iâve often used âdinner timeâ as an example for temporal commonsense, e.g. âdinner is typically eaten at around 8 pmâ. But giving it a second thought, I realize this is rather culture-specific. 
On trips to some countries in Europe, we wandered around hungry at 9 pm, unable to find anywhere to eat because all the restaurants were already closed. In other countries, such as Spain, it's customary to eat very late. </span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">What makes the dinner time convention more confusing is that the meaning of the word dinner is not exactly "the evening meal". Today, people typically use "dinner" and "supper" interchangeably to refer to the last meal of the day. However, Merriam-Webster classifies supper as a lighter meal, or "the evening meal especially when dinner is taken at midday", while dinner is "the principal meal of the day" regardless of its time. </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In 2019, my birthday happened to fall on Thanksgiving. We tried to book a table in a restaurant for dinner. The options were limited because many restaurants were closed for the holiday and others only served Thanksgiving dinner. I don't eat chicken, nor am I a fan of holiday food (blatantly generalizing from my experience with Jewish holidays). By the time we found a restaurant that served its usual menu, they had no available tables for dinner. Right after hanging up the phone, I had second thoughts about the way I had phrased the question. 
I called again and asked whether they had available tables at 8 pm. They did. We had a great meal. It was only intuition that made me recheck, but when I dug deeper into this, I learned the difference between dinner and supper, and found out that Thanksgiving dinner is often eaten at around 2 to 4 pm, hours that I would consider lunch time. </span><a href="https://aclanthology.org/P19-1388/" style="font-family: Arial; font-size: 14.6667px; white-space: pre-wrap;">This ACL 2019 paper</a>, <span style="font-family: Arial; font-size: 14.6667px; white-space: pre-wrap;">in which textual mentions and their corresponding grounded values were automatically extracted from a large English text corpus</span><span style="font-family: Arial; font-size: 14.6667px; white-space: pre-wrap;">,</span><span style="font-family: Arial; font-size: 14.6667px; white-space: pre-wrap;"> also supports this observation. In a figure showing the </span><span style="font-family: Arial; font-size: 14.6667px; white-space: pre-wrap;">time of day at which meals are typically eaten, dinner seemed to start, according to some people, as early as 1 pm.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Before all this talk about dinner makes me hungry, I will get back to the survey results. So how is the US different from the rest of the world? We don't have enough data for a fine-grained country-by-country analysis, but we can group countries by continent, for example looking at all the answers from Europe. 
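Grouping answers by continent amounts to a simple group-then-aggregate step. Here is a minimal sketch; the data layout and the numbers are hypothetical, not the real survey responses:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (country, continent, start of "evening" in minutes past midnight).
responses = [
    ("US",      "North America",    16 * 60 + 30),
    ("US",      "North America",    17 * 60),
    ("Germany", "Europe",           18 * 60),
    ("Spain",   "Europe",           19 * 60),
    ("India",   "Asia and Pacific", 14 * 60 + 30),
]

# Group the answers by continent, ignoring the country.
by_continent = defaultdict(list)
for _country, continent, start in responses:
    by_continent[continent].append(start)

# Aggregate each group into an average start time.
for continent, starts in sorted(by_continent.items()):
    avg = int(mean(starts))
    print(f"{continent}: evening starts on average at {avg // 60:02d}:{avg % 60:02d}")
```

The same grouping works for any region granularity (country, continent, or US state), as long as each response carries that field.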
</span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"> </span></p><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 231px; overflow: hidden; width: 602px;"><img height="231" src="https://lh3.googleusercontent.com/DDZVWV2LtsnFEQfav9yfUQ14BreYB_eejgmGkFSOo3zUO_nLMXhWv4PfvVgvLFBuPsIICHAdAGYpkE33cPfI2tayaPVU8AGgfAM8bwEoY_AmIlAcJpcgSnIvCOHznOZuaAgXIMQ7=s0" style="margin-left: 0px; margin-top: 0px;" width="602" /></span></span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In Europe, the average morning was between 5:19 and 11:08 am, noon between 11:47 am and 1:30 pm, afternoon between 1:31 and 5:28 pm, evening between 5:51 and 6:50 pm, and night from 5:32 pm to 6:26 am. I was quite surprised by how early people considered the night to start, and in particular the intersection between the evening and night. Iâve heard people saying âgood nightâ at 5 pm in the US, but expected Europe to party harder. Luckily, I allowed the survey respondents to add a free-text comment, and thankfully, many of them did. Two Spanish workers commented that in Spanish, there is no distinction between afternoon and evening, and that Spanish doesn't really have a word for evening. The word âtardeâ (afternoon) is used to describe the range of hours from 1 pm to 8 pm, after which it is ânocheâ (night). 
</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">There are two other countries with enough responses for a meaningful statistical analysis: India and Brazil. Here is the same figure, for India: </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 231px; overflow: hidden; width: 602px;"><img height="231" src="https://lh3.googleusercontent.com/Na9Je6d8-ywbA78VZgybeAVkjiWaiRefYiqFzuHrxwkTfQLtTKYBVEWd7tf8QmmNhL17Cp30cjz8Zn_6ECwU0spq7hewfkN-dnrZmtt8oCwVSVBnP4X5HbyaREZT-7CWTUndm8rF=s0" style="margin-left: 0px; margin-top: 0px;" width="602" /></span></span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In India, morning starts at 4:53 and ends at 10:05 am, noon starts at 11:14 am and ends at 11:55 pm, afternoon is between 12 and 1:37 pm, evening between 2:21 and 4:56 pm, and night from 5:45 pm to 8:54 am. Largely, all time expressions referred to earlier times than in the US, with the night spanning over 15 of the 24 hours. 
</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><span style="border: none; display: inline-block; height: 231px; overflow: hidden; width: 602px;"><img height="231" src="https://lh3.googleusercontent.com/ymEGyoM15pu1iZOMj-NNM9YW48DLB3H3wLuiYbVm-svO1TeV5CWygC49iwTtBrrogRGCI19dgpMke9NKYFX5o6aOuDmsobRz83glN3fOF0CSqwa88mChHpK4tDmLcuGeTnJ6dlFp=s0" style="margin-left: 0px; margin-top: 0px;" width="602" /></span></span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">In Brazil, morning is between 5:21 and 11:29 am, noon is between 11:20 am and 12:16, afternoon from 12:39 to 5:20 pm, evening from 5:28 to 5:49 pm, and night from 2:50 pm to 6:20 am. Again, the evening was swallowed by the night, and again, the comments explain it. First, many commented that there is no concept of evening in Brazil. One person elaborated and said that it gets dark early, and once itâs dark, it is already considered night. In addition, some people mentioned that there is no concept of ânoonâ either.</span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">My own interpretation of these time expressions was as follows: morning at 6 am to 12 pm, noon from 12 to 3 pm, afternoon from 3 to 6 pm, evening from 6 to 10 pm, and night from 10 pm to 6 am. I had almost perfect agreement with my husband, except that he considered morning to start at 4 am. 
Interestingly, in Hebrew I would use "morning" to describe 4 am, i.e. "4 in the morning", but because I don't consider it a reasonable waking time, I made it part of the night. Indeed, this is "early morning", a time expression I didn't think of when I designed the survey. Many workers commented that they divide the time from dark to dawn into two or more different segments. Two workers from the Philippines indicated that the length of day and the length of night are equal, and that midnight marks the beginning of the new day, hence the morning. A worker from India commented that in their native language, there is a word for "early morning" used for the time range between 4 am and 6 am, though another Indian worker, possibly speaking a different native language, referred to this time as 12 am to 5 am. A third worker from India referred to 12 am to 4 am as "midnight". That was surprising to me because I consider midnight to be the exact time 12 am, although I realize I'm inconsistent with my interpretation of noon. Maybe it would be clearer if it were more common to call it "midday" instead of "noon". </span></p><br /><p dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;"><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;">Apart from the answers from Europe, which were diverse in terms of countries, the other regions were mostly dominated by a single country. The answers from North America were dominated by the US (93.6%), Asia and Pacific was dominated by India (85.6%), South America by Brazil (100%), and Africa and the Middle East only had 5 responses. It would also be interesting to study how the interpretation of time differs between states in the US, and at different times of the day, days of the week, and seasons. 
Do people tend to say "good night" earlier in the day during the winter, when it gets dark early in the northern hemisphere, or is it always equivalent to "goodbye" after a certain hour, when "have a good day" doesn't make much sense anymore? To solve the confusion, Americans often use the generic "have a good one" greeting, allowing the recipient to decide what "one" means in their own schedule. </span></p><div><span style="font-family: Arial; font-size: 11pt; font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline; white-space: pre-wrap;"><br /></span></div></span>Vered Shwartz, 2021-01-12<br /><br /><b>Commonsense Reasoning for Natural Language Processing</b><br /><br />This long-overdue blog post is based on the <a href="https://homes.cs.washington.edu/~msap/acl2020-commonsense/">Commonsense Tutorial</a> taught by Maarten Sap, Antoine Bosselut, Yejin Choi, Dan Roth, and myself at <a href="https://acl2020.org/">ACL 2020</a>. Credit for much of the content goes to the co-instructors, but any errors are mine. <div><br /></div><div>In the last 5 years, popular media has made it seem that AI is nearly---if not already---solved by deep learning, with reports on
super-human performance on speech recognition, image captioning, and object recognition. The release of Google Translate's neural models in 2016 <a href="https://arxiv.org/pdf/1609.08144.pdf">reported</a> large performance improvements: "60% reduction in translation errors on several popular language pairs". But looking under the hood, these numbers seem to be misleading. Neural models find shortcuts to the correct answers through <a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#artificial_data">dataset-specific input-output correlations</a>, essentially solving the dataset but not the underlying task. When models are challenged with adversarial out-of-domain examples, they perform poorly. Small, unnoticeable noise added to images <a href="https://ieeexplore.ieee.org/document/7298594">confuses object recognition models and changes their predictions</a>. Visual question answering models <a href="https://www.aclweb.org/anthology/D16-1203">guess the answer</a> based on the frequency of answers for the same type of question in the training set, e.g. replying "2" to any "how many" question. Image captioning models often <a href="https://dl.acm.org/doi/abs/10.1145/3025453.3025814?casa_token=nZpp3uz8qrgAAAAA:y8osF83HhDrG4mhKoq4SEeAWXJFGmjBjFUzHtrbsR3go3m4orm6lH70MUlpQoSUrUSjUhDtCPKfDXg">learn to recognize objects based solely on their typical environment</a>, and fail to recognize them outside their typical environment. In NLP, dialogue systems generate <a href="https://www.aclweb.org/anthology/D16-1127/">highly generic responses</a> such as "I don't know" even for simple questions. Open-ended generation is <a href="https://openreview.net/forum?id=rygGQyrFvH">prone to repetition</a>. Question answering systems are <a href="https://www.aclweb.org/anthology/D17-1215/">easily distracted by the addition of an unrelated sentence</a> to the passage. And more. 
</div><div><br /></div><div><br /></div><div><div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHzZzuRpxTHlVNtU5plQl9VpmKj3Awg4lQEPZXlc4xAJ4QoD_0M0PRWI7wZiUkq_MLmc7PMkH4t55YvRi79mDpe5JGDe7aX023sTk9H1p8w_Toyu5_fI6rQev_CWXMrXpa9aWwhEwZu9I/" style="margin-left: auto; margin-right: auto;"><img alt="" data-original-height="1788" data-original-width="2048" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHzZzuRpxTHlVNtU5plQl9VpmKj3Awg4lQEPZXlc4xAJ4QoD_0M0PRWI7wZiUkq_MLmc7PMkH4t55YvRi79mDpe5JGDe7aX023sTk9H1p8w_Toyu5_fI6rQev_CWXMrXpa9aWwhEwZu9I/w344-h300/image.png" width="344" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 1: adversarial examples in computer vision (left) and natural language processing tasks (right).</td></tr></tbody></table><br /></div><div><br /></div><div>Machine learning models today perform reasonably well on perception tasks (image and speech recognition). However, they mostly lack the ability to perform simple intuitive commonsense inferences that humans do in every minute of their waking hours, regarding pre- and post-conditions of events, understanding other people's motivations and intents, mental and emotional states, etc. </div><div><br /></div><h4 style="text-align: left;"><b>Table of contents:</b> </h4><div><ol style="text-align: left;"><li><a href="#definition">What is commonsense?</a> </li><li><a href="#lms">Is commonsense knowledge already captured by pre-trained language models?</a> </li><li><a href="#benchmarks">How to create benchmarks to measure commonsense reasoning capabilities?</a> </li><li><a href="#kbs">How to gather and represent machine readable commonsense knowledge?</a> </li><li>How to enhance neural models for commonsense reasoning tasks with symbolic knowledge? 
</li><ul><li><a href="#static">Static integration</a></li><li><a href="#dynamic">Dynamic integration</a> </li></ul><li><a href="#summary">Summary</a></li></ol><div><br /></div><h4 id="definition" style="text-align: left;"><b>What is commonsense? </b></h4></div><div><div>The boundaries of commonsense are quite challenging to define, but we will go with this working definition:</div><div><blockquote><i>Commonsense is the basic level of <b>practical knowledge</b> and <b>reasoning</b> concerning everyday <b>situations</b> and <b>events</b> that are <b>commonly</b> shared among <b>most</b> people. </i></blockquote></div></div></div><div>For example, it's commonsense that it's OK to keep the closet door open, but not the fridge door, as the food inside might go bad. </div><div><br /></div><div><b>Types of commonsense: </b></div><div><br /></div><div>Commonsense knowledge can be categorized according to types, including but not limited to:</div><div><ul style="text-align: left;"><li><b>Social commonsense: </b>people are capable of making inferences about other people's mental states, e.g. what motivates them, what they are likely to do next, etc. This kind of inferences is captured by the <a href="https://homes.cs.washington.edu/~msap/atomic/">ATOMIC</a> knowledge base, discussed later. In addition, we each have a set of social norms of accepted behavior, e.g. knowing that âit's impolite to comment on someone's weightâ. While these are often implicit in our actions and decisions, machines need to be taught them <a href="https://www.aclweb.org/anthology/2020.emnlp-main.48/">explicitly</a>. <br /><br /></li><li><b>Temporal commonsense: </b>natural language rarely communicates explicit temporal information. Instead it's vague and relies on the commonsense knowledge of the listener. For example, when told that "<i>Dr. Porter is taking a vacation</i>" we can predict that Dr. Porter <b>will not</b> <b>be</b> able to see us soon, as opposed to when "<i>Dr. 
Porter is taking a walk</i>". This requires knowing the typical duration of "taking a walk" (minutes) and that of "taking a vacation" (days). Other temporal knowledge includes the typical times, order, and frequency of events, which are addressed by the <a href="https://leaderboard.allenai.org/mctaco/submissions/get-started">MC-TACO</a> dataset and the <a href="https://www.aclweb.org/anthology/2020.acl-main.678.pdf">TACO-LM</a> time-aware contextual language model. <br /><br /></li><li><b>Physical commonsense: </b>that a glass will likely shatter if it falls to the floor is a fact most people (and <a href="https://youtu.be/ccK3usCWmTo">arguably cats</a>) know. Physical commonsense includes knowledge about the physical properties and affordances of everyday objects, as tested in the <a href="https://yonatanbisk.com/piqa/">PIQA</a> dataset.</li></ul><div><br /></div></div><div>Commonsense is essential for humans to navigate everyday situations seamlessly and interact with each other in a reasonable and safe way, and for AI to understand human needs and actions better. Yet, endowing machines with such human-like commonsense reasoning capabilities has remained an elusive goal of AI research for decades. Past attempts, in the 1960s and 1970s, resulted in an AI winter, i.e. reduced interest and funding for AI research due to failed, over-hyped research directions. In recent years, new interest in machine commonsense has emerged, with the availability of stronger computing power and huge amounts of data. With that said, the path to machine commonsense is unlikely to be brute-force training of larger neural networks with deeper layers. </div><div><br /></div><h4 id="lms" style="text-align: left;">Is commonsense knowledge already captured by pre-trained language models?</h4><div>In the last 3 years, language models have been ubiquitous in NLP. 
<a href="http://veredshwartz.blogspot.com/2019/08/text-generation.html#lms">Language models</a> are pre-trained once, in a self-supervised manner that requires only a large text corpus. Traditionally, language models are trained to predict the next word in a sentence (top part of Figure 2, in blue), but they can also predict hidden (masked) words in the middle of the sentence, as in Google's <a href="https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html">BERT</a> model (top part of Figure 2, in orange). This pre-training phase yields a function that gets a sequence of words (sentence, short paragraph) and returns a vector for each word in the sequence. </div><div> </div><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk85ia_RAfE9ajGDWwl00WVISVHovCnEjeHtpyaXg8u2GmPONaBlRpJyyhcZKtBMbLVa4kp86aWGg6aOooPQz73IaMWuMRxN_OwrCOD-b_MbaDszil6I2OIAzDbwCYhleCls9YK0PXMZM/s1315/Screen+Shot+2020-09-14+at+5.00.37+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="605" data-original-width="1315" height="184" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk85ia_RAfE9ajGDWwl00WVISVHovCnEjeHtpyaXg8u2GmPONaBlRpJyyhcZKtBMbLVa4kp86aWGg6aOooPQz73IaMWuMRxN_OwrCOD-b_MbaDszil6I2OIAzDbwCYhleCls9YK0PXMZM/w400-h184/Screen+Shot+2020-09-14+at+5.00.37+PM.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 2: Language models pre-training and fine-tuning.</td></tr></tbody></table><div><br /><br /></div><div>As opposed to word embeddings which are static, language model-based word vectors are dynamic and re-computed for each context. At the very basic level, they assign different vectors to words when they are used in different senses, as in Figure 3. 
</div><div><br /></div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUErc9XJRN201Ke0hY_p6gAmjODEDzq315U0r8ziFz3_o0G1Gfcq1kHNyjv7pfMnk4G7pqH7OxOm52RsboU27b1tbkzqwH7PwCxsFy5GcbH3Iqlsb-XF2XORFiaz8AKEq4S84nbPGOYcg/s1675/static_vs_dynamic.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="606" data-original-width="1675" height="145" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUErc9XJRN201Ke0hY_p6gAmjODEDzq315U0r8ziFz3_o0G1Gfcq1kHNyjv7pfMnk4G7pqH7OxOm52RsboU27b1tbkzqwH7PwCxsFy5GcbH3Iqlsb-XF2XORFiaz8AKEq4S84nbPGOYcg/w400-h145/static_vs_dynamic.jpg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 3: Static vs. dynamic word representations.</td></tr></tbody></table><div style="text-align: right;"><br /></div><div><br /></div><div><br /></div><div>Do off-the-shelf pre-trained language models <i>already</i> capture commonsense knowledge? </div><div><br /></div><div>✅
They are capable, to some extent, of <a href="https://www.aclweb.org/anthology/D19-1250/">filling incomplete commonsense facts</a> or <a href="https://www.aclweb.org/anthology/D19-1109/">ranking candidate facts</a>. For example, the language model score (≈ statement plausibility) of a fact like "a <i>musician</i> plays a musical instrument" is higher than "a <i>dancer</i> plays a musical instrument". This is evidence that, in addition to lexical and syntactic knowledge, language models capture general knowledge about the world. </div><div><br /></div><div>✅
They can, to some extent, <a href="https://cognitivesciencesociety.org/cogsci20/papers/0070/0070.pdf">associate concepts with their properties</a>. They distinguish concepts associated with a given set of properties, i.e. complete a statement such as "<i>A <u> </u> has fur, is big, has claws, has teeth, is an animal, ...</i>" with <i>bear</i> (just like playing the "20 questions" game). They perform better when they are shown encyclopedic properties (e.g. <i>is an animal</i>) as opposed to perceptual properties (e.g. <i>smooth</i>). They can also, pretty successfully, list the properties associated with given concepts, e.g. complete the sentence "Everyone knows that a bear has <u> </u>" with fur, claws, teeth, etc. </div><div><br /></div><div>However, knowledge generated from language models is noisy! </div><div style="text-align: left;"><br /></div><div style="text-align: left;">🚫 <a href="https://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00298">Several</a> <a href="https://www.aclweb.org/anthology/2020.acl-main.698/">papers</a> have shown that language models are not sensitive to negation, i.e. they consider the negated version of facts ("<i>birds can't fly</i>") as similarly plausible. 
</div><div style="text-align: left;"><br /></div><div style="text-align: left;">🚫 They are sensitive to phrasing:</div><div style="text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9ghq5tppQ_OxZUinGFpmYODuLIyWKoJ3DbY808Gy6c3XZpVtYFECHlTvLUhPjGqNaqoE48CwMAlkyFiaVN5ZAL6YMpjmbzu0TH4Mhu-gBOhs27Z3ZAVx1NMA38ogxHqhEDb0gnZJKGIk/s458/Screen+Shot+2020-10-23+at+2.17.48+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="217" data-original-width="458" height="95" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9ghq5tppQ_OxZUinGFpmYODuLIyWKoJ3DbY808Gy6c3XZpVtYFECHlTvLUhPjGqNaqoE48CwMAlkyFiaVN5ZAL6YMpjmbzu0TH4Mhu-gBOhs27Z3ZAVx1NMA38ogxHqhEDb0gnZJKGIk/w200-h95/Screen+Shot+2020-10-23+at+2.17.48+PM.png" width="200" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVyAQnNCjvAEqYBMUIxsKgzqhbVUHcOOYcgkDYvFC8ldjRpqDHJ-ocC9UQiXpRTIHEauk9RfX4eKrXkpkPBaMJJ2qvWlioBgHElA63tj3GLVHthvCA0mhNkPHQI5YEbwxSjC38D_Ug8zo/s453/Screen+Shot+2020-10-23+at+2.18.04+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="219" data-original-width="453" height="97" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVyAQnNCjvAEqYBMUIxsKgzqhbVUHcOOYcgkDYvFC8ldjRpqDHJ-ocC9UQiXpRTIHEauk9RfX4eKrXkpkPBaMJJ2qvWlioBgHElA63tj3GLVHthvCA0mhNkPHQI5YEbwxSjC38D_Ug8zo/w200-h97/Screen+Shot+2020-10-23+at+2.18.04+PM.png" width="200" /></a></div><div style="text-align: left;"><br /></div><div style="text-align: left;">🚫 In <a href="http://veredshwartz.blogspot.com/2016/01/representing-words.html">distributional word vectors</a>, the vector representing a (sub-)word is learned from the contexts in which it appeared, leading to similar representations for semantically-similar words. 
In language models, the representation of similar contexts is similar, so the model learns which type of word should appear next (or instead of a masked token). This is generally a positive thing, but it sometimes <a href="https://www.aclweb.org/anthology/2020.coling-main.605/">over-generalizes</a>, leading to examples such as this: </div><div style="text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjc7CZMBIjNA1OLQ9K3pFeDU7YOaobFq1JRlu3B-Kx0J9xEfyW5wHJ2JZCWB5ZuzFUlnnSruUq-ilJjWPFghYLvxpGera8jV7SUJBr5Sjx8VVTLi2oj-hM8quatL6nFM4pEsqRZ_bq9dD8/s2/Screen+Shot+2021-01-08+at+4.16.25+PM.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1" data-original-width="2" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjc7CZMBIjNA1OLQ9K3pFeDU7YOaobFq1JRlu3B-Kx0J9xEfyW5wHJ2JZCWB5ZuzFUlnnSruUq-ilJjWPFghYLvxpGera8jV7SUJBr5Sjx8VVTLi2oj-hM8quatL6nFM4pEsqRZ_bq9dD8/s0/Screen+Shot+2021-01-08+at+4.16.25+PM.png" /></a></div><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrnjrV-1sef8BD4mok3B9NwYHXUQuPh6mpV56iVW3UtMzMapgQcoQGsXgEEo9FvjcPBjout4O3_DrOk7oA_0qB_cHhHo8wLp1Fbj4zHUvQL2B1FmvMtoKgkA-a6h8urqQ6lM77Ngh3l_U/s610/Screen+Shot+2021-01-08+at+4.28.27+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="215" data-original-width="610" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrnjrV-1sef8BD4mok3B9NwYHXUQuPh6mpV56iVW3UtMzMapgQcoQGsXgEEo9FvjcPBjout4O3_DrOk7oA_0qB_cHhHo8wLp1Fbj4zHUvQL2B1FmvMtoKgkA-a6h8urqQ6lM77Ngh3l_U/s320/Screen+Shot+2021-01-08+at+4.28.27+PM.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: justify;"><span style="text-align: 
left;">Figure 4: BERT guesses that the masked token should be a color, but fails to predict the correct color. Using the </span><a href="https://demo.allennlp.org/masked-lm" style="text-align: left;">AllenNLP demo</a><span style="text-align: left;">. </span></td></tr></tbody></table><div style="text-align: left;"><br /></div><div style="text-align: left;"><br />Here, BERT has seen in its training corpus enough sentences of the type <i>"The color of something is [color]" </i>to know to suggest different colors as substitutes for the masked word. Unfortunately, not every color is suitable in every context that calls for a color. BERT likely didn't see enough sentences discussing the color of a dove, thus it defaults to just predicting <i>any </i>color. <br /><br /></div><div style="text-align: left;">So knowledge in language models is not the most accurate or reliable. Is it still useful?</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Yes, to some extent. One way to show this is through evaluation on tasks requiring commonsense knowledge. We will discuss several such tasks, but for now let's focus on <a href="https://leaderboard.allenai.org/winogrande/submissions/public">WinoGrande</a> as an example. It is the large-scale version of the <a href="https://en.wikipedia.org/wiki/Winograd_Schema_Challenge">Winograd Schema Challenge</a>. Given a sentence with a cloze, the goal is to fill in the blank with a previously mentioned entity or concept, out of two answer choices. For example: </div><div style="text-align: left;"><br /></div><div style="text-align: left;"><div><span style="font-family: courier;">Because Brett found an internship while in college but Ian was unable to, _____ found a job less quickly after graduation. 
</span></div><div>Choices: Brett, <u>Ian</u></div></div><div style="text-align: left;"><br /></div><div style="text-align: left;">What makes this task especially difficult is that every instance has a twin sentence which is minimally changed such that the correct answer is the other one (for instance, replacing "less quickly" with "more quickly" will change the correct answer from Ian to Brett). </div><div style="text-align: left;"><br /></div><div style="text-align: left;">Language model-based models top the leaderboards of WinoGrande and other commonsense tasks, but since they are trained on task specific training data, which often contains tens or hundreds of thousands of training examples, it's hard to attribute the success to the knowledge captured in language models from the pre-training step. A better way to estimate it is with zero-shot (unsupervised) models. Typically, the way zero-shot models address multiple-choice tasks is by phrasing a statement from the instance and each answer choice, and computing the language model score as a proxy for plausibility:</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="font-family: courier;">P<span style="font-size: xx-small;">LM</span>(The answer is <span style="color: #ffa400;">answer<span style="font-size: xx-small;">1</span></span>) </span></div><div style="text-align: left;"><span style="font-family: courier;">P</span><span style="font-family: courier; font-size: xx-small;">LM</span><span style="font-family: courier;">(The answer is <span style="color: #93c47d;">answer<span style="font-size: xx-small;">2</span></span>) </span></div><div style="text-align: left;"><span style="font-family: courier;">...</span></div><div style="text-align: left;"><span style="font-family: courier;">P</span><span style="font-family: courier; font-size: xx-small;">LM</span><span style="font-family: courier;">(The answer is <span style="color: #b4a7d6;">answer<span style="font-size: 
xx-small;">k</span></span>)</span></div><div style="text-align: left;"><br /></div><div style="text-align: left;">And then predicting the answer choice with the best language model score (highest probability, which is usually computed as the lowest <a href="http://veredshwartz.blogspot.com/2019/08/text-generation.html">perplexity</a>). </div><div style="text-align: left;"><br /></div><div style="text-align: left;">In our <a href="https://arxiv.org/abs/2004.05483">recent EMNLP paper</a>, we took it one step further and asked whether we can use language models to generate what would otherwise be missing or implicit knowledge needed for solving a multiple-choice commonsense question answering instance. We proposed the unsupervised "self-talk" framework, which uses language models to generate information-seeking questions such as "<i>what is the definition of..."</i> and their corresponding answers (clarifications) to discover additional background knowledge. In the example in Figure 5, knowing that internship experience may help a person get a job is crucial for answering the question (which of Brett and Ian found a job less quickly?). On most benchmarks, the self-talk model performed better than unsupervised models with no additional knowledge, while competing with models that have access to knowledge bases. This is despite the inaccurate and noisy knowledge language models generate. However, when we showed people some of the clarifications that helped the model choose the correct answer choice, they judged only 40% of them as actually providing helpful information. This discrepancy means that our model doesn't imitate the human reasoning process - it works differently. Check out our <a href="https://self-talk.apps.allenai.org/">demo</a>! 
It's not always accurate but it's often funny :) </div><div style="text-align: left;"><div><br /></div></div><div style="text-align: left;"><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7vBqUcPviqJcMUzhtT-9FF-7h3e7_WLT0hLXVV-isxqSUbwjAWxOoFwqAOhGaik_ztTkujfkx_9F-hNZ_gq7qrEg6T_idoGcYRXLWnQQu9Ae3GjYfNIPuY5wytDAo1QRXKTxFSdMrhlY/s673/Screen+Shot+2020-10-23+at+2.29.04+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="561" data-original-width="673" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7vBqUcPviqJcMUzhtT-9FF-7h3e7_WLT0hLXVV-isxqSUbwjAWxOoFwqAOhGaik_ztTkujfkx_9F-hNZ_gq7qrEg6T_idoGcYRXLWnQQu9Ae3GjYfNIPuY5wytDAo1QRXKTxFSdMrhlY/s320/Screen+Shot+2020-10-23+at+2.29.04+PM.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: left;">Figure 5: An example of clarification generation for an instance from WinoGrande.<br /><br /></td></tr></tbody></table></div><div style="text-align: left;"><br /></div><div style="text-align: left;">The best performance on commonsense tasks is achieved by fine-tuning language models, i.e. training them on task-specific data. Let's look at some of the benchmarks and the issues we face with supervised learning. </div><div style="text-align: left;"><br /></div><h4 id="benchmarks" style="text-align: left;">How to measure commonsense reasoning capabilities? </h4><div>Multiple commonsense benchmarks have been released over the last few years. 
Some of them will be discussed here (see examples in Figure 6), along with the main differences and design choices when creating a benchmark.</div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-xfeZ28BuZnbnjjv2eDZmAFVXdSdVXAgBpZ1bW7gZmybYKmkvIK_2TLKqqK0CymRCeD2N2wXy57KLji772TAYBbuGlZ4HuRNFJyVfavP2dDh0BipaxYKYWBfqvtKWicXPAsJb8iCCr44/s2394/Screen+Shot+2021-01-01+at+1.34.35+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="896" data-original-width="2394" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-xfeZ28BuZnbnjjv2eDZmAFVXdSdVXAgBpZ1bW7gZmybYKmkvIK_2TLKqqK0CymRCeD2N2wXy57KLji772TAYBbuGlZ4HuRNFJyVfavP2dDh0BipaxYKYWBfqvtKWicXPAsJb8iCCr44/w640-h240/Screen+Shot+2021-01-01+at+1.34.35+PM.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 6: Some commonsense benchmarks along with an example instance. </td></tr></tbody></table><div><br /></div><div><br /></div><div><b>Type of knowledge:</b> some benchmarks focus on a specific type of commonsense knowledge, such as social commonsense (e.g. <a href="https://leaderboard.allenai.org/socialiqa/submissions/get-started">Social IQa</a>), physical commonsense (e.g. <a href="https://yonatanbisk.com/piqa/">PIQA</a>), temporal commonsense (e.g. <a href="https://leaderboard.allenai.org/mctaco/submissions/get-started">MC-TACO</a>), or causes and effects (e.g. <a href="https://people.ict.usc.edu/~gordon/copa.html">COPA</a>), while others target a broader domain of general commonsense knowledge and reasoning (e.g. 
<a href="https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html">WSC</a>, <a href="https://winogrande.allenai.org/">WinoGrande</a>, <a href="https://www.tau-nlp.org/commonsenseqa">CommonsenseQA</a>, <a href="https://cs.rochester.edu/nlp/rocstories/">ROCStories</a>). </div><div><br /></div><div><b>Size: </b>most recent datasets include a large training set in order to facilitate training large neural models. One way to create a benchmark is to hire experts to curate a high-quality dataset, as was done for WSC and COPA. These datasets are rather expensive to collect and are therefore typically small. The common alternative is to collect data through <a href="http://veredshwartz.blogspot.com/2016/08/crowdsourcing-for-nlp.html">crowdsourcing</a> or semi-automatically, and split it randomly into train, validation, and test sets. Models that learned data-specific shortcuts in the training set instead of generalized phenomena are likely to perform well on a test set <a href="https://www.aclweb.org/anthology/2020.acl-main.465.pdf">drawn from the same distribution</a>, but this performance is misleading and is likely a lot better than on real-world instances of the task. Despite this understanding, this is still the dominant approach. </div><div><br /></div><div><b>Format:</b> the vast majority of datasets are in the format of multiple choice questions, as exemplified in Figure 6. This format is the easiest to evaluate automatically: models are judged by their accuracy, i.e. what percent of the questions they answered correctly. Unfortunately, this type of task also makes it possible for a model to guess the correct answer. We're not talking about a random guess, which would still leave enough room for improvement. A random guess is expected to result in an accuracy of 100/k %, where k is the number of answer choices, e.g. 50% accuracy for binary tests, 33.3% for tests with 3 choices, 25% for 4 choices, etc. 
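To make the 100/k arithmetic concrete, here is a tiny sketch (illustrative only, not from any benchmark codebase) that also confirms the expectation empirically:

```python
import random

def random_guess_accuracy(k: int) -> float:
    """Expected accuracy (%) of guessing uniformly among k answer choices."""
    return 100.0 / k

def simulate(k: int, trials: int = 100_000) -> float:
    """Empirical check: guess randomly on many questions whose gold answer
    is (without loss of generality) choice 0."""
    rng = random.Random(0)
    correct = sum(rng.randrange(k) == 0 for _ in range(trials))
    return 100.0 * correct / trials

print(round(random_guess_accuracy(3), 1))  # 33.3
print(abs(simulate(4) - random_guess_accuracy(4)) < 1.0)  # True
```

The worry discussed next is precisely that a model can beat this baseline without any commonsense, by exploiting artifacts in the data.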
The risk is that the model makes an "educated guess" based on - yes, you guessed it correctly - spurious correlations between the questions and the correct/incorrect answers. </div><div><br /></div><div>How do you make sure a model is right for the right reasons?</div><div><br /></div><div>That's the million-dollar question. We don't have a perfect solution for this problem yet. For a start, when collecting a new benchmark, the process of collecting incorrect answers (=distractors) should be well-designed such that distractors are <i>plausible but unlikely</i>. Using random answers as distractors (e.g. naturally-occurring sentences or correct answers of different questions) would create topically-different distractors, which are easy to detect (remember, <a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#we">relatedness is one of the strengths of distributional text representations</a>). Asking people to come up with the distractors may introduce other annotation artifacts, such as exaggerations, going off-topic, or producing overly emotional texts, which are <a href="https://www.aclweb.org/anthology/W17-0907.pdf">easy for models to detect</a>. Some solutions have been proposed: for example, the distractors in Social IQa are answers to different questions asked about the same context. In Figure 7, the context "<i>Alex spilt food all over the floor and it made a huge mess.</i>" appears in the dataset with two questions: "<i>what happens next?</i>" and "<i>what happened before?</i>". The distractors of "<i>what happens next?</i>" are the correct answers of "<i>what happened before?</i>", e.g. that Alex has slippery hands. A similar approach is taken in CommonsenseQA. 
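The pairing trick can be sketched in a few lines (the answer strings below are invented for illustration, and the data format is simplified relative to the real dataset):

```python
# Sketch of the Social IQa distractor-pairing idea: the correct answer to a
# *different* question about the same context becomes a distractor, so it
# stays on-topic (hard to reject from topicality alone) but is wrong for
# this question.
context = "Alex spilt food all over the floor and it made a huge mess."
gold_answers = {
    "what happens next?": "Alex cleans up the mess",      # invented example answer
    "what happened before?": "Alex had slippery hands",   # from the post's example
}

def build_instance(question: str, other_question: str) -> dict:
    """Pair a question with a distractor taken from the other question's gold answer."""
    return {
        "context": context,
        "question": question,
        "correct": gold_answers[question],
        "distractors": [gold_answers[other_question]],
    }

instance = build_instance("what happens next?", "what happened before?")
print(instance["distractors"])  # ['Alex had slippery hands']
```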
</div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGAZ0EXzLMVJSoy93TJ3MXXSZ0sK2fZXAHw__R8kBMCMU4I2150JsCFswFsr7aICkBqquIPRgooeYFHSJqAs1RtfhS2oSnmY2ni4Pf7JaHAvZ6phcC9vD2cqRrCn3_p4g4acmbneGR1cA/s1458/Screen+Shot+2021-01-01+at+1.58.54+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="690" data-original-width="1458" height="189" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGAZ0EXzLMVJSoy93TJ3MXXSZ0sK2fZXAHw__R8kBMCMU4I2150JsCFswFsr7aICkBqquIPRgooeYFHSJqAs1RtfhS2oSnmY2ni4Pf7JaHAvZ6phcC9vD2cqRrCn3_p4g4acmbneGR1cA/w400-h189/Screen+Shot+2021-01-01+at+1.58.54+PM.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 7: Creating distractors for a Social IQa instance. Image credit: Maarten Sap.</td></tr></tbody></table><br /><div>An alternative solution is to filter out easy questions through "<a href="https://rowanzellers.com/swag/">adversarial filtering</a>", i.e. training a weaker model and iteratively removing instances that it succeeds in answering. Variants of adversarial filtering were applied to WinoGrande and PIQA. </div><div><br /></div><div>Finally, I believe the future is in generative tasks, in which the model needs to produce a free text answer without being provided with the candidate answers. Several recent benchmarks are generative, such as <a href="https://www.aclweb.org/anthology/D19-1509/">TimeTravel</a> (counterfactual reasoning), <a href="https://openreview.net/pdf?id=Byg1v1HKDB">ART</a> (abductive reasoning), <a href="https://inklab.usc.edu/CommonGen/">CommonGen</a>, and <a href="https://www.aclweb.org/anthology/2020.emnlp-main.85.pdf">ProtoQA</a>. The challenge in generative tasks is the lack of reliable automatic evaluation metrics. 
Given the gold standard reference answer(s), we would like a metric to (1) reward correct generated answers that are different from the reference answer, while (2) penalizing incorrect answers that are similar (e.g. lexically) to the reference. Human evaluation is reliable, but it is costly and is typically done once on the test set. In order to be able to improve models during development, we need automatic metrics. We currently settle for metrics based on lexical overlap such as <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a> and <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE</a> which are pretty terrible at (1) and have <a href="https://www.aclweb.org/anthology/D17-1238/">little correlation with human judgements</a>, or model-based metrics such as <a href="https://openreview.net/forum?id=SkeHuCVFDr">BERT score</a> that are not great at (2). </div><div style="text-align: left;"><div><br /></div></div><h4 id="kbs" style="text-align: left;">How to gather and represent machine readable commonsense knowledge?</h4><div style="text-align: left;"><div>Commonsense resources provide machine-readable knowledge about the world. Resources are expected to be large-scale and accurate, consist of diverse knowledge types, and be usable in downstream tasks. <a href="https://conceptnet.io/">ConceptNet</a> is a large (21 million assertions), commonly-used resource consisting of general commonsense knowledge, in over 85 languages. <a href="https://homes.cs.washington.edu/~msap/atomic/">ATOMIC</a> consists of 880,000 triplets reasoning about causes and effects of everyday situations. 
Other resources are listed in Figure 8.</div><div><br /></div></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkD2fzDHpTTydRC9aS3YX0hnoXK6FJJNS__-iaGFtfpHfnDFhCI_AhknrsVeFBZuY2hOp77wWnF5kmwVnupNDLW-7XYlqMj1GVrPBljgrwoabIYWB77-bVCBnABkRPnPq1VuqSnHp6ecE/s1614/Screen+Shot+2021-01-01+at+3.06.10+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="734" data-original-width="1614" height="293" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhkD2fzDHpTTydRC9aS3YX0hnoXK6FJJNS__-iaGFtfpHfnDFhCI_AhknrsVeFBZuY2hOp77wWnF5kmwVnupNDLW-7XYlqMj1GVrPBljgrwoabIYWB77-bVCBnABkRPnPq1VuqSnHp6ecE/w640-h293/Screen+Shot+2021-01-01+at+3.06.10+PM.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 8: Overview of existing commonsense resources. Image credit: Maarten Sap. </td></tr></tbody></table><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">Existing resources differ in several aspects:</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><b>Representation</b>: how is knowledge represented in the resource? ConceptNet and ATOMIC represent knowledge in natural language (Figure 9), while NELL and Cyc represent knowledge in symbolic logic:</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="font-family: courier;">(#$implies
(#$and
(#$isa ?OBJ ?SUBSET)
(#$genls ?SUBSET ?SUPERSET))
(#$isa ?OBJ ?SUPERSET)) </span></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZWsHPKeDk47tb49OXpgmJzTalNaiB2kCq8YfM6FsBeNFFcUFwwPoJMH08o16L7dymp_o0DhDbGQgxXUWsyVgJtNK85kdEsuTsiAweIpAgy-8e6xS2IOueS__bH0qfHiPZ3Hqo941VCAU/s1642/Screen+Shot+2021-01-01+at+3.33.17+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="898" data-original-width="1642" height="350" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZWsHPKeDk47tb49OXpgmJzTalNaiB2kCq8YfM6FsBeNFFcUFwwPoJMH08o16L7dymp_o0DhDbGQgxXUWsyVgJtNK85kdEsuTsiAweIpAgy-8e6xS2IOueS__bH0qfHiPZ3Hqo941VCAU/w640-h350/Screen+Shot+2021-01-01+at+3.33.17+PM.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 9: example knowledge extracted from ConceptNet and ATOMIC. Image credit: Maarten Sap. </td></tr></tbody></table><br /><div style="text-align: left;"><div><b><br /></b></div><div><b>Knowledge type: </b>ConceptNet consists of semantic knowledge, i.e. properties of concepts (e.g. <i>reading is a type of activity</i>). ATOMIC, on the other hand, is inferential: given a templated event with "PersonX" representing the subject and "PersonY" an optional object(s) (e.g. <i>PersonX yells at PersonY</i>), and one of 9 pre-defined relation dimensions (e.g. PersonX's motivation) it provides a second event (e.g. <i>PersonX wanted to express anger</i>). </div><div><br /></div><div><b>Collection method:</b> knowledge can be collected from humans, either experts or crowdsourcing workers. 
Expert-curated resources are more uniform and accurate, and may use complex representations, but this collection method is expensive and very <a href="https://emerj.com/ai-future-outlook/a-30-year-old-ai-project-hits-the-market/">time-consuming</a>. Alternatively, non-experts can write knowledge in natural language, making the collection faster and more scalable.</div><div><br /></div><div>The alternative approach is to extract knowledge automatically from texts, as in <a href="http://rtw.ml.cmu.edu/rtw/">NELL</a>. This approach works, but it produces less accurate knowledge. In addition, the approach suffers from <a href="https://openreview.net/pdf?id=AzxEzvpdE3Wcy">reporting bias</a>: over-representing the rare at the expense of the trivial. For example, people are reported to murder more often than they are reported to breathe. Default properties of concepts (<i>yellow banana</i>) are mentioned less often than their alternatives (<i>green banana</i>), etc. </div><div><br /></div></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><b>How to enhance neural models for commonsense reasoning tasks with symbolic knowledge?</b></div><h4 id="static" style="text-align: left;"><span style="font-weight: 400;">Most models developed for solving commonsense benchmarks today are based on language models. Typically, each answer choice, along with the context, forms a statement. The </span><span style="font-weight: normal;">language model computes a vector representing each statement. 
These vectors are then fed into a classifier that assigns a plausibility score for each candidate answer:</span></h4><div><span><b><br /></b></span></div><h4 id="static" style="text-align: left;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvy3JZi-V5JR5Cz7s_W3d-SVUZlVWyL0pw2yCaqTVbkOdOdKeFBNR68g7mK4kxH39bJFj1qFxQqYiwlXg4_xTQqiZsdywaZ6rX9VOYRNp89ixv0fIepuqZtttAuUtd2O_fmew4t0oR0Rw/s933/Screen+Shot+2020-10-29+at+4.08.14+PM.png" style="margin-left: auto; margin-right: auto;"><b><img border="0" data-original-height="163" data-original-width="933" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvy3JZi-V5JR5Cz7s_W3d-SVUZlVWyL0pw2yCaqTVbkOdOdKeFBNR68g7mK4kxH39bJFj1qFxQqYiwlXg4_xTQqiZsdywaZ6rX9VOYRNp89ixv0fIepuqZtttAuUtd2O_fmew4t0oR0Rw/w640-h112/Screen+Shot+2020-10-29+at+4.08.14+PM.png" width="640" /></b></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 10: An illustration of using BERT to score the answer choices of a WinoGrande instance.</td></tr></tbody></table></h4><div><span style="font-weight: normal;"><br /></span></div><h4 id="static" style="text-align: left;">Static neuro-symbolic integration</h4><div>The knowledge in commonsense resources may enhance models built for solving commonsense benchmarks. For example, we can extract from ConceptNet the assertions that job is used for making money, that spending money requires making money, that buying requires spending money, and that car is something you can buy. Ideally we would also need the knowledge that a high-paying job is a type of job, specifically one used for making a lot of money, which is required for spending a lot of money, which is required for buying something that costs a lot of money, car being one of them. 
Finally, we may want to remove the edge from "buy" to "car" so we can only get to "car" from the node "buy something that costs a lot of money". </div><div><br /></div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk0pAcEtYfpeR9l8Vl3StifAmcr6l_1mkVgsytveBbVrIh99zrQPgXx0wxP2wOtWahgn2DBCi2ZgoLp71B9_VhZKs1Naq7m_ZKY7Rxm-ubqwyhKPINa2vP87F6eAnXIi2dyOxQirR-kyk/s746/Screen+Shot+2020-10-29+at+4.10.56+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="516" data-original-width="746" height="277" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk0pAcEtYfpeR9l8Vl3StifAmcr6l_1mkVgsytveBbVrIh99zrQPgXx0wxP2wOtWahgn2DBCi2ZgoLp71B9_VhZKs1Naq7m_ZKY7Rxm-ubqwyhKPINa2vP87F6eAnXIi2dyOxQirR-kyk/w400-h277/Screen+Shot+2020-10-29+at+4.10.56+PM.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 12: Knowledge extracted from ConceptNet for the WinoGrande instance discussed above.<br /></td></tr></tbody></table><div><br /></div><div><br /></div><div>How do we incorporate knowledge from knowledge resources into a neural model?</div><div><br /></div><div>The simple recipe (success not guaranteed) calls for 4 ingredients: the task addressed, the knowledge resource used, the neural component, and the combination method. We have already discussed tasks and knowledge resources, so I would only add here that ConceptNet is the main resource utilized for downstream models, although some models incorporate other knowledge sources, such as other knowledge bases (WordNet, ATOMIC), knowledge mined from text, and tools (knowledge base embeddings, sentiment analysis models, COMET - see below). 
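The chain of assertions above, from job all the way to car, is essentially a path in the knowledge graph. A minimal sketch of retrieving it (toy edges based on the assertions just mentioned; relation labels such as UsedFor and HasPrerequisite are omitted for brevity):

```python
from collections import deque

# Toy directed graph over the ConceptNet assertions discussed above.
graph = {
    "job": ["make money"],         # a job is used for making money
    "make money": ["spend money"], # spending money requires making money
    "spend money": ["buy"],        # buying requires spending money
    "buy": ["car"],                # a car is something you can buy
}

def find_path(graph, start, goal):
    """Breadth-first search for a reasoning chain between two concepts."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain connects the two concepts

print(find_path(graph, "job", "car"))
# ['job', 'make money', 'spend money', 'buy', 'car']
```

This is of course a best-case sketch: as noted above, the chain a model actually needs (through "a lot of money") is often missing from the resource.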
</div><div><br /></div><div><br /></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinv-vE5mqVPn937_dJ0ae0519YxRZKbD_1V_FIunt4t15gVDN9g0ngglQ5sSzeloBcNNOrcLJlsAzuJaFSydRNMlSHx3X8NemJzPxDzjOwBM6cMusmBmCyZXBX93cfI0uBatwBtjoDgUA/s542/Screen+Shot+2020-10-29+at+4.13.10+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="361" data-original-width="542" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinv-vE5mqVPn937_dJ0ae0519YxRZKbD_1V_FIunt4t15gVDN9g0ngglQ5sSzeloBcNNOrcLJlsAzuJaFSydRNMlSHx3X8NemJzPxDzjOwBM6cMusmBmCyZXBX93cfI0uBatwBtjoDgUA/s320/Screen+Shot+2020-10-29+at+4.13.10+PM.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 13: Resources used by most knowledge-informed commonsense models.</td></tr></tbody></table><span><br /><span style="font-weight: normal;">The neural component is the </span></span><span style="font-weight: normal;">shiny new neural architecture - language models in the last 3 years, biLSTMs in the years prior, etc. The more interesting component is the combination method. We will look at 3 examples:</span><div><span style="font-weight: normal;"><br /></span></div><div><b>Incorporating into the scoring function:</b> <a href="https://www.aclweb.org/anthology/D17-1216.pdf">Lin et al. (2017)</a> extracted probabilistic "rules" connecting pairs of terms from multiple sources such as WordNet (restaurantâeatery: 1.0), Wikipedia categories (restaurantâbusiness: 1.0), script knowledge mined from text (X went to a restaurantâX ate: 0.32), word embedding-based relatedness scores (restaurantâfood: 0.71), and more. The model scores each candidate answer according to the scores of the inference rules used to get from the context (e.g. 
"<i>Mary walked to a restaurant</i>" in Figure 14) to the candidate answer (e.g. "<i>She ordered foods.</i>"). </div><div><br /></div><div><br /></div><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7SoCI_h-kAQIKSo88zWbvz9f8rE8mDMLuX11Zz674Gb8jioPpTwsiO1Vin5wVUi0FqUHQSRlHnGqfNT0-P_IjEf7Kec9u1U1yAB3jM3jYmDVW9wJAi4Ca5TDvOE_cFzHIMCpI0d9WbVE/s1228/Screen+Shot+2021-01-01+at+4.59.24+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="694" data-original-width="1228" height="226" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7SoCI_h-kAQIKSo88zWbvz9f8rE8mDMLuX11Zz674Gb8jioPpTwsiO1Vin5wVUi0FqUHQSRlHnGqfNT0-P_IjEf7Kec9u1U1yAB3jM3jYmDVW9wJAi4Ca5TDvOE_cFzHIMCpI0d9WbVE/w400-h226/Screen+Shot+2021-01-01+at+4.59.24+PM.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: left;">Figure 14: "covering" each candidate answer by the original context and the rules extracted from various sources. Image credit: <a href="https://www.aclweb.org/anthology/D17-1216.pdf">Lin et al. (2017)</a>.</td></tr></tbody></table><div><br /></div><div><br /></div><div><b>Representing symbolic knowledge as vectors: </b><a href="https://www.aclweb.org/anthology/D19-1282.pdf">Lin et al. (2019)</a> used BERT as the neural component to represent the instance (statement vector). For their symbolic component, they extracted subgraphs from ConceptNet pertaining to concepts mentioned in the instance and learned to represent them as a vector (graph vector). These two vectors were provided as input to the answer scorer which was trained to predict the correct answer choice. 
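A toy sketch of this combination method (the dimensions, vectors, and weights are made up; in the real model the statement vector comes from BERT, the graph vector from a learned graph encoder, and the scorer is trained end-to-end):

```python
# Illustrative dimensions only; e.g. BERT-base statement vectors are 768-dim.
STATEMENT_DIM, GRAPH_DIM = 4, 3

def score_answer(statement_vec, graph_vec, weights):
    """Linear scorer over the concatenated [statement; graph] features,
    standing in for the trained answer classifier."""
    features = statement_vec + graph_vec  # list concatenation
    return sum(w * f for w, f in zip(weights, features))

weights = [0.5, -0.2, 0.8, 0.1, 0.3, 0.9, -0.4]  # hand-set for illustration
candidates = {
    "answer_a": ([0.2, 0.1, 0.9, 0.4], [0.3, 0.8, 0.1]),
    "answer_b": ([0.7, 0.5, 0.2, 0.1], [0.6, 0.2, 0.4]),
}
best = max(candidates, key=lambda a: score_answer(*candidates[a], weights))
print(best)  # answer_a scores higher under these toy weights
```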
</div><div><br /></div><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeFi4Q05VbYhv9zj_fve6lsFrh2Zfjt6xH_dOyGLsYTc63AdfvdEyv2KQBds3W5ZBKKDDs2yg74Cz6M5bEcz82ZdLedDcfagGlGhdUl5Sp9s0wa0xoZLmxWE7ss5G-y6qSdfbIQGjowCI/s790/Screen+Shot+2021-01-03+at+4.12.09+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="630" data-original-width="790" height="255" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeFi4Q05VbYhv9zj_fve6lsFrh2Zfjt6xH_dOyGLsYTc63AdfvdEyv2KQBds3W5ZBKKDDs2yg74Cz6M5bEcz82ZdLedDcfagGlGhdUl5Sp9s0wa0xoZLmxWE7ss5G-y6qSdfbIQGjowCI/w320-h255/Screen+Shot+2021-01-03+at+4.12.09+PM.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: left;">Figure 15: extracting subgraphs from ConceptNet pertaining to concepts mentioned in the instance. Image credit: <a href="https://www.aclweb.org/anthology/D19-1282.pdf">Lin et al. (2019)</a>.</td></tr></tbody></table><div><br /></div><h4 id="dynamic" style="text-align: left;">Multi-task learning: <span style="font-weight: 400;"><a href="https://dl.acm.org/doi/10.1145/3357384.3358165">Xia et al. (2019)</a> fine-tuned a BERT model to solve the multiple choice questions. They also trained two auxiliary tasks supervised by </span><span style="font-weight: normal;">ConceptNet, in which two concepts were given as input and the classifier had to predict whether they are related or not, and the specific ConceptNet property that connects them. 
The BERT model was shared between the main and the auxiliary tasks, so that commonsense knowledge from ConceptNet was instilled into BERT, improving its performance on the main task.</span></h4><div><span style="font-weight: normal;"><br /></span></div><div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBRKpSrahy_2YiwhvDX1NsBBbSD6EjUjk_kSIa3FUYqIpSE_g90DIdTqXTS-H0dVWr0RA9fGphmH6XlFMQzO470SuoumNTTDk6RO9KgaS3tfuh_b4kfHHovO78gClRlVV9XeFR2tIZPcw/s1960/Screen+Shot+2021-01-01+at+5.10.43+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="936" data-original-width="1960" height="306" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBRKpSrahy_2YiwhvDX1NsBBbSD6EjUjk_kSIa3FUYqIpSE_g90DIdTqXTS-H0dVWr0RA9fGphmH6XlFMQzO470SuoumNTTDk6RO9KgaS3tfuh_b4kfHHovO78gClRlVV9XeFR2tIZPcw/w640-h306/Screen+Shot+2021-01-01+at+5.10.43+PM.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 16: multi-task learning aimed at instilling knowledge from ConceptNet into BERT.</td></tr></tbody></table><br /><span style="font-weight: normal;"><br /></span></div><h4 id="dynamic" style="text-align: left;">Dynamic neuro-symbolic integration</h4><div><div><div>There are two main limitations to the neuro-symbolic integration discussed above:</div><div><ol style="text-align: left;"><li><b>Coverage:</b> relevant knowledge is often not found as-is in commonsense knowledge resources. As we've seen earlier, commonsense knowledge is immeasurably vast, so much of it is not documented. <br /><br /></li><li><b>Precision and context:</b> knowledge found in the knowledge base about concept X doesn't necessarily apply to all contexts in which X appears. 
For example, when provided with "<i>PersonX adopts a cat</i>", ATOMIC says that PersonX had to go to the shelter first (Figure 17), but that's not always the case. It may as well be that PersonX adopted a cat they found on the street or got the cat from a friend who was no longer able to care for it. </li></ol><div><br /></div></div><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_qJpp93jIgQnHd0S8Dve6RHGPRN5IwJq_51ieuCMOYiaeqOezMQ7H2CpQqogTExuVXYWuMK6U9yTIqJbxyqH1hqK44EG28MWqWDxdvfnBfItttJ3fM1Re8qydrPtTQ67U-n88DiBnXq8/s1634/Screen+Shot+2021-01-03+at+4.32.30+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1634" data-original-width="1620" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_qJpp93jIgQnHd0S8Dve6RHGPRN5IwJq_51ieuCMOYiaeqOezMQ7H2CpQqogTExuVXYWuMK6U9yTIqJbxyqH1hqK44EG28MWqWDxdvfnBfItttJ3fM1Re8qydrPtTQ67U-n88DiBnXq8/w634-h640/Screen+Shot+2021-01-03+at+4.32.30+PM.png" width="634" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 17: ATOMIC inferences for the event "PersonX adopted a cat".</td></tr></tbody></table><br /><div><br /></div><div><div>How do we provide machines with large-scale, contextualized commonsense knowledge?</div></div><div><br /></div><div>The solution is to leverage manually curated commonsense knowledge resources, such as ConceptNet and ATOMIC, to train a model that can dynamically produce such knowledge for a given context. Commonsense knowledge resources are typically sparse, making training a knowledge base completion model to extend the resource less efficient. Pre-trained language models and their inherent knowledge come in handy here. 
Language models (such as GPT) implicitly represent knowledge, so you can re-train them on completing knowledge base assertions (e.g. from ATOMIC) to teach them the structure of knowledge. This is what <a href="https://www.aclweb.org/anthology/P19-1470/">COMET</a> (COMmonsEnse Transformers) does, as illustrated in Figure 18. </div><div><br /></div><div><br /></div><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNgWqcB0gK8mLpe3rzsGnIiQ2b1O6s6lkuk35l_6MJO4y5OCm3VyXCt74qxL9VKmIIw-kcORmsdA5YKGX8c1go4aPqUqmf7emM1LO91b4I7KnDhoGKPBQp80fAjwcdjLrozhl0bVseV9w/s2561/Screen+Shot+2021-01-03+at+4.48.11+PM.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1228" data-original-width="2561" height="191" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNgWqcB0gK8mLpe3rzsGnIiQ2b1O6s6lkuk35l_6MJO4y5OCm3VyXCt74qxL9VKmIIw-kcORmsdA5YKGX8c1go4aPqUqmf7emM1LO91b4I7KnDhoGKPBQp80fAjwcdjLrozhl0bVseV9w/w400-h191/Screen+Shot+2021-01-03+at+4.48.11+PM.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: left;">Figure 18: Illustration of the training process of COMET: The language model is fine-tuned to predict the "tail entity" (e.g. inference in ATOMIC) given the "head entity" and the relation. Image credit: Antoine Bosselut.</td></tr></tbody></table><br /><div><br /></div><div>COMET is capable of <i>dynamically</i> generating inferences for any context. 
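To make the fine-tuning setup more concrete, here is a minimal sketch (not the actual COMET implementation; the special tokens and the exact input format are only illustrative) of how knowledge base triples from a resource like ATOMIC can be linearized into text, so that a left-to-right language model can be trained to generate the "tail" given the "head" and the relation:

```python
# Illustrative sketch only - not the official COMET code. The [GEN] and
# [EOS] markers and the relation names are assumptions for demonstration.

def linearize(head, relation, tail):
    """Format an ATOMIC-style (head, relation, tail) triple as a single
    training string; the LM is trained to generate everything after [GEN]."""
    return f"{head} {relation} [GEN] {tail} [EOS]"

# A few assertions in the style of ATOMIC.
triples = [
    ("PersonX adopts a cat", "xNeed", "to go to the shelter"),
    ("PersonX adopts a cat", "xEffect", "gets scratched"),
]

training_examples = [linearize(*t) for t in triples]

# At inference time, the fine-tuned LM is given only a prefix such as
# "PersonX adopts a cat xNeed [GEN]" and decodes the tail, which is what
# lets it produce inferences for events that never appear in the resource.
inference_prefix = "PersonX adopts a cat xNeed [GEN]"
```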
For example, if we modify the context from ATOMIC to "<i>David adopted his sister's cat because they found out her husband was allergic.</i>", which for obvious reasons does not appear in ATOMIC, COMET no longer predicts that PersonX (David) had to go to the shelter, but instead that he, for example, needed to find out about it.</div></div><div><br /></div><div>COMET has been used successfully in various downstream tasks requiring commonsense knowledge. Models trained on ATOMIC or on ConceptNet are available, and the demo for both ATOMIC and COMET can be found <a href="https://mosaickg.apps.allenai.org/">here</a>. There is also a <a href="https://visualcomet.xyz/">Visual COMET</a> that can generate inferences from images. </div></div><div><br /></div><div><h4 id="summary" style="text-align: left;">Summary</h4><div>We talked about ways to acquire and represent commonsense knowledge in machine readable format, ways to measure commonsense reasoning abilities, and ways to integrate this kind of knowledge into models. None of these is solved yet. Manually collecting all the commonsense knowledge is infeasible, while extracting it from texts or from language models suffers from inaccuracies, reporting bias, and societal biases. Looking forward, a promising research direction is multi-modal commonsense knowledge acquisition, e.g. learning from texts along images and videos. For example, looking through enough class photos, you might learn that the kids in the front row typically sit (especially if the kids in the last row are also seated). 
</div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTcBOuG6Dxx5XM2JwRVv1LxNBz4ws0RsoVmdZRmaLAmLPCHLRW3VFmlPT1RU3bKJl1p9IebdX-oVO7dN02gRJSHrBLt7zHWW9ViGBLoL9TpCg6Vr1SxjekAQkYzhFYDXSFoTgAdmwO_EY/" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="846" data-original-width="1364" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTcBOuG6Dxx5XM2JwRVv1LxNBz4ws0RsoVmdZRmaLAmLPCHLRW3VFmlPT1RU3bKJl1p9IebdX-oVO7dN02gRJSHrBLt7zHWW9ViGBLoL9TpCg6Vr1SxjekAQkYzhFYDXSFoTgAdmwO_EY/" width="320" /></a></div><br /></div><div>Machines may reach human performance on commonsense benchmarks but it's often due to being right for the wrong reasons rather than actually possessing and successfully applying commonsense knowledge and reasoning abilities. Generative tasks are somewhat less prone to this issue, but we would have to develop reliable automatic evaluation metrics to make them the standard. </div></div><div><br /></div><div>Machine commonsense reasoning is becoming more and more popular within NLP so I am optimistic about future breakthroughs! <br /></div>Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com4tag:blogger.com,1999:blog-9145120678290195131.post-60753689796882451212019-08-30T20:51:00.001+03:002021-01-11T21:43:02.842+02:00Text GenerationEarly this year, OpenAI <a href="https://openai.com/blog/better-language-models/">announced</a> a very powerful language model they developed that can generate human-like text. While such announcements are usually followed by the release of a model to the public, this one suggested that such a powerful tool will pose a danger, and therefore only a smaller and less powerful version of it was released. 
Soon enough, in addition to the usual buzz on academic Twitter, the news made it to popular media, where it was described in a rather simplistic and exaggerated way. This caused some fear among the general public, some criticism from other NLP researchers, and many questions from their relatives ("hey, look at this article I've found - did they just solve NLP? Are you going to be unemployed?"). Six months later, OpenAI finally <a href="https://openai.com/blog/gpt-2-6-month-follow-up/">decided to release the full model</a>.<br />
<br />
While I might be late to the dangerous language models party, I thought this blog lacks a basic post about text generation. How are these models trained? How are they used? Are they really that good? And dangerous?<br />
<br />
<a href="#scope">Scope</a><br />
<a href="#lms">Language Models</a><br />
<a href="#generation">Generating Text</a><br />
<a href="#training">Training a language model</a><br />
<a href="#eval">Evaluating text generation</a><br />
<a href="#dangerous">Are language models dangerous?</a><br />
<br />
<h4 dir="ltr" id="scope" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<a href="https://www.blogger.com/null" name="scope">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Scope</span></a></h4>
<br />
The reason that everyone is talking about language models (LMs) lately is not so much that they're all working on text generation, but because pre-trained LMs (like the OpenAI <a href="https://openai.com/blog/better-language-models/">GPT-2</a> or Google's <a href="https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html">BERT</a>) are used to produce text representations across various NLP applications, greatly improving their performance. The effect is similar to the one that pre-trained word embeddings had on NLP in 2013. I recommend reading Sebastian Ruder's article <a href="https://thegradient.pub/nlp-imagenet/">NLP's ImageNet moment has arrived</a> that summarizes it very nicely. This blog post will focus on text generation.<br />
<br />
There is an important distinction between two main types of applications of text generation:<br />
<b><br /></b>
<b>1.</b> <b>Open-ended generation:</b> the purpose is to generate <i>any</i> text. It could be on some specific topic or continuing a previous paragraph, but the model is given the artistic freedom to generate <i>any </i>text.<br />
<div>
<br /></div>
<b>2. Non open-ended generation:</b> the model is expected to generate <i>a specific text</i>. More formally, given some input, the model should generate text that is strictly derived from it. A good example is translation: given a sentence in French, for instance, the model must generate a sentence in English - but not just any sentence - it should have the same meaning as the French sentence. Other examples include summarization (given a long document, generate a short text that consists of the important details in the document); image captioning (given an image, generate a text describing it); speech to text (transcribing); and converting text to code or SQL queries.<br />
<ol>
</ol>
<div>
This post focuses on open-ended text generation.<br />
<br /></div>
<h4 dir="ltr" id="lms" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<a href="https://www.blogger.com/null" name="lms">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Language Models</span></a></h4>
<div>
<br />
I've discussed LMs in one of the <a href="https://veredshwartz.blogspot.com/2015/09/language-models.html">earlier posts in this blog</a>, in the context of machine translation. Simply put, a language model is a probability distribution over the next word in the text, given the previous words. The distribution is over all the words in the vocabulary, which is typically very large (a few hundred thousand words or more).</div>
<div>
<br />
For example, what can be the next word in the sentence "<i>I'm tired, I want to</i>"? A good language model would assign a high score to <span style="font-family: "georgia" , "times new roman" , serif;">p(<i>sleep</i>|<i>I'm tired, I want to</i>)</span>. The probability of a word like "<i>bed</i>" should be low - although it is a related term, it doesn't form a grammatical sentence - as should that of "<i>party</i>", which is syntactically correct but contradicts common sense. The probability of an entire sentence is the product of the conditional probability of each word given the previous words, using the chain rule:<br />
<br />
<span style="font-family: "georgia" , "times new roman" , serif;">p(<i>I'm tired, I want to sleep</i>) = p(<i>I'm</i>|&lt;s&gt;) * p(<i>tired</i>|&lt;s&gt; <i>I'm</i>) * p(,|&lt;s&gt; <i>I'm tired</i>) * p(<i>I</i>|&lt;s&gt; <i>I'm tired,</i>) * p(<i>want</i>|&lt;s&gt; <i>I'm tired, I</i>) * p(<i>to</i>|&lt;s&gt; <i>I'm tired, I want</i>) * p(<i>sleep</i>|&lt;s&gt; <i>I'm tired, I want to</i>) * p(&lt;/s&gt;|&lt;s&gt;<i> I'm tired, I want to sleep</i>)</span><br />
<br />
where &lt;s&gt; and &lt;/s&gt; mark the beginning and the end of the sentence, respectively. Note that I used a word-based LM for demonstration purposes in this post; however, it's possible to define the basic token as a character or a "<a href="https://www.aclweb.org/anthology/P16-1162">word piece</a>" / "subword unit".<br />
<br /></div>
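The chain rule above is easy to compute directly. Here is a tiny sketch in which a toy LM is just a table of conditional probabilities (the numbers are made up for illustration):

```python
# Toy illustration of the chain rule: the probability of a sentence is
# the product of each word's conditional probability given the preceding
# words. All probabilities below are invented for demonstration.

cond_probs = {
    ("<s>", "I'm"): 0.2,
    ("<s> I'm", "tired"): 0.1,
    ("<s> I'm tired", ","): 0.5,
    ("<s> I'm tired ,", "I"): 0.4,
    ("<s> I'm tired , I", "want"): 0.3,
    ("<s> I'm tired , I want", "to"): 0.9,
    ("<s> I'm tired , I want to", "sleep"): 0.6,
    ("<s> I'm tired , I want to sleep", "</s>"): 0.8,
}

def sentence_probability(tokens):
    prob, history = 1.0, "<s>"
    for token in tokens:
        prob *= cond_probs[(history, token)]  # p(token | previous words)
        history += " " + token
    return prob

p = sentence_probability(["I'm", "tired", ",", "I", "want", "to", "sleep", "</s>"])
```

Note how the probability shrinks with every multiplication - this is why longer sentences get lower probabilities, and why perplexity (discussed below) normalizes by the number of words.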
<br />
<h4 dir="ltr" id="generation" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<a href="https://www.blogger.com/null" name="generation">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Generating Text</span></a></h4>
<br />
While LMs can be used to score a certain text on its likelihood in the language, in this post we will discuss another common usage of them which is to generate new text. Assuming we've already trained a language model, how do we generate text? We will demonstrate it with this very simple toy LM, which has a tiny vocabulary and very few probable utterances:<br />
<br />
<script src="https://gist.github.com/vered1986/e40f15d69f74715fba62e51ab9b76133.js"></script>
To generate text using a language model, one must generate token by token, each time deciding on the next token using the distribution defined by the previous tokens. The most basic way is to simply take the most probable word at each step. The code will look like this:<br />
<script src="https://gist.github.com/vered1986/3f390192bf16852a028f9bdd3e4b7d26.js"></script>
Our toy LM only generates the sentence <i>This LM is cool</i>. In general, this generation method is pretty limited because it has very little diversity; in particular, it favors very frequent words, some of which are function words like determiners (<i>the</i>, <i>a</i>, ...), prepositions (<i>on</i>, <i>in</i>, <i>of</i>, ...) and so on. Moreover, this <a href="https://arxiv.org/pdf/1904.09751.pdf">interesting study</a> showed that text generated by maximizing the probability is very different from human-generated text. People tend not to produce the most likely and obvious utterances, but rather to convey just the amount of helpful information that is not already known to the listener (according to <a href="https://psychology.wikia.org/wiki/Gricean_maxims">Grice's Cooperative Principle</a>).<br />
<br />
An alternative is to sample from the distribution, i.e., randomly select a word from the vocabulary, proportional to its probability given the previous words, according to the language model. The code will look something like this:<br />
<br />
<script src="https://gist.github.com/vered1986/ee16a0333b761f8313b5490c03012ce4.js"></script>You may notice that running this code multiple times sometimes generates <i>This LM is stupid </i>and sometimes <i>This LM is cool</i>. While this sampling method tends to generate more diverse texts, it's not perfect either, because now there is a chance to sample a rare or unrelated word at each time step - and once the model does, the generation of the next word is conditioned on that rare word and it quickly goes downhill.<br />
<br />
A simple solution is to combine the two approaches and sample only from the top k most probable words in the distribution, for some pre-defined k (as done in <a href="https://arxiv.org/pdf/1904.09751.pdf">this work</a>). This is what it would look like:<br />
<br />
<br />
<script src="https://gist.github.com/vered1986/1f6ae30a87dfd080cd28bbee1cbc1464.js"></script>
<br />
Notice that after keeping only k words in the distribution, we need to make sure again that they form a valid probability distribution, i.e. each entry is between 0 and 1, and the sum is 1.<br />
<br />
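For illustration, here is a minimal standalone sketch of a single top k decoding step, including that renormalization, over a made-up toy distribution:

```python
import random

# Minimal sketch of one top-k sampling step over a toy distribution:
# keep the k most probable words, renormalize so the kept probabilities
# form a valid distribution (sum to 1), and sample from them.
def top_k_sample(distribution, k, rng=random):
    top = sorted(distribution.items(), key=lambda kv: -kv[1])[:k]
    total = sum(p for _, p in top)
    words = [w for w, _ in top]
    probs = [p / total for _, p in top]  # renormalize the kept mass
    return rng.choices(words, weights=probs, k=1)[0]

# Made-up next-word distribution for some prefix.
toy = {"sleep": 0.5, "rest": 0.3, "party": 0.15, "banana": 0.05}
word = top_k_sample(toy, k=2)  # only "sleep" or "rest" can be drawn
```

With k=2, the rare word "banana" can never be sampled, which is exactly the failure mode of pure sampling that top k avoids.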
An alternative way to sample from the top of the distribution is <a href="https://arxiv.org/pdf/1904.09751.pdf">top p</a>: sort the tokens by their probability from highest to lowest, and take tokens until the sum of probabilities (which is exactly the probability of generating <i>any</i> of these tokens) reaches some pre-defined value p between 0 and 1. A small number close to 0 is similar to always taking the most probable token, while a large number close to 1 is similar to sampling from the entire distribution. This method is more flexible than top k because the number of candidate tokens may change according to the generated prefix. For example, a general text like <i>I want to </i>may have many valid continuations (with a relatively small probability for each), while a more specific text like <i>The bride and the groom got </i>will have far fewer, with the obvious next token <i>married </i>taking most of the probability mass.<br /><br />
<b>Update 01/11/21:</b> a top p snippet is now available - thanks to Saptarshi Sengupta for the contribution!
<br />
<script src="https://gist.github.com/saptarshi059/0408483297fdecfc08e26e0312eb1a37.js"></script>
<br />
<h4 dir="ltr" id="training" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<a href="https://www.blogger.com/null" name="training">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Training a language model</span></a></h4>
<br />
I've already discussed <a href="https://veredshwartz.blogspot.com/2015/09/language-models.html">N-gram language models</a>, but by the time I wrote that post (4 years ago), they were already obsolete and replaced by neural language models. The basic algorithm for training a neural LM is as follows:<br />
<div>
<br /></div>
<table border="1">
<tbody>
<tr>
<td>A large amount of text is dedicated for training (<i>training corpus</i>).<br />
<div>
The model goes over the corpus, sentence by sentence. </div>
<div>
For a given sentence w<sub>1</sub>... w<sub>n</sub>, for each word w<sub>i</sub>:</div>
<div>
<ol>
<ol>
<li>A representation for the context of the word (for example, the previous words in the sentence) is computed: </li>
<ul>
<li>Each word in the sequence w<sub>1</sub>... w<sub>i-1</sub> is represented with a vector, i.e. word embedding. </li>
<li>These word embeddings are fed into the <i>encoder</i>, which returns a single vector representing this sequence.</li>
</ul>
<li>This vector is fed into a classifier whose goal is to predict the next word (each word is a class).</li>
<li>During training, the predicted word w' is compared with the gold label (the actual next word w<sub>i</sub>) and the model parameters are updated accordingly.</li>
</ol>
</ol>
</div>
</td>
</tr>
</tbody></table>
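The training loop in the box above can be sketched in a few lines of numpy. This is a deliberately simplified toy, not a real LM implementation: the "encoder" here is just the average of the context word embeddings (a real model would use an RNN or similar), the classifier is a single softmax layer, and the corpus is one made-up sentence.

```python
import numpy as np

# Toy sketch of neural LM training: encode the prefix, predict the next
# word with a softmax classifier, update parameters by gradient descent.
rng = np.random.default_rng(0)
vocab = ["<s>", "this", "lm", "is", "cool", "</s>"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, d = len(vocab), 8

E = rng.normal(0, 0.1, (V, d))   # word embeddings
W = rng.normal(0, 0.1, (d, V))   # next-word classifier (one class per word)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def examples():
    # Each prefix of the training sentence predicts the next word.
    ids = [word_to_id[w] for w in "<s> this lm is cool </s>".split()]
    for i in range(1, len(ids)):
        yield ids[:i], ids[i]

def total_loss():
    # Cross-entropy: negative log-probability of each actual next word.
    return -sum(np.log(softmax(E[c].mean(axis=0) @ W)[t]) for c, t in examples())

loss_before = total_loss()
for epoch in range(200):
    for context, target in examples():
        h = E[context].mean(axis=0)       # "encoder": average of embeddings
        p = softmax(h @ W)                # predicted distribution over vocab
        grad = p.copy()
        grad[target] -= 1.0               # cross-entropy gradient (p - one-hot)
        E[context] -= 0.1 * (W @ grad) / len(context)  # update embeddings
        W -= 0.1 * np.outer(h, grad)                   # update classifier
loss_after = total_loss()                 # should be much lower than before
```

The point of the sketch is the loop structure, not the encoder: swapping the averaging step for an RNN or Transformer gives the real architectures, while the predict-compare-update cycle stays the same.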
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
The various neural LMs differ in their choice of basic token (e.g. word, character, or word piece) and encoder. The encoder takes a sequence of word embeddings and returns a single vector representing the corresponding sequence of words (e.g. <i>... tired, I want to</i>). I may have a separate post in the future that focuses on ways to encode text into a vector. For the purpose of this post, let's treat it as a black box function. The following figure illustrates the training (specifically for an encoder based on an <a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#rnns">RNN</a>):<br />
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimd2bTUDeCsElVYpB9QwpkMekUh216Z5AK88ubvOJ_ZBuZq_KP6UNS8cLJBp9TQuIswzG4U_g9Web7tejc1uZaKCpXnLoXcslhghQPhKNCSnQZGc2WfTcJwl6-AuzHc5bnRMReye48XCI/s1600/neural_LM_predict.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="556" data-original-width="698" height="317" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimd2bTUDeCsElVYpB9QwpkMekUh216Z5AK88ubvOJ_ZBuZq_KP6UNS8cLJBp9TQuIswzG4U_g9Web7tejc1uZaKCpXnLoXcslhghQPhKNCSnQZGc2WfTcJwl6-AuzHc5bnRMReye48XCI/s400/neural_LM_predict.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px; text-align: left;"><b>Figure 1: </b>an excerpt of a neural language model in action. The word embeddings are fed in-order to a recurrent neural network (RNN) that represents each prefix of the sentence. The representation of the previous words (the output of the RNN) is fed to a classifier (MLP) that predicts the next word: each word in the vocabulary is a class. During training, the loss function updates the parameters (of the MLP, RNN, and word embeddings) so that it would guess the next word correctly the next time. </td></tr>
</tbody></table>
<br />
The two main advantages of neural LMs over N-gram LMs are:<br />
<br />
(1) N-gram LMs predict the next word based on a history of N-1 words, e.g. given <i>I'm tired, I want to,</i> a 4-gram LM will predict the next word based only on the last 3 words "I want to", completely ignoring the crucial word "tired". N-gram LMs were usually based on small Ns (2-4) (see the post about <a href="https://veredshwartz.blogspot.com/2015/09/language-models.html">N-gram language models</a> for explanation).<br />
<br />
(2) N-gram LMs are based on the statistics of how many times each text appeared in the data, and the match has to be verbatim, i.e. the occurrences of <i>I'm tired </i>are disjoint from those of <i>I'm exhausted</i>. Neural LMs, on the other hand, learn to represent a fragment of text as a vector and to predict the next word based on it. They may generalize across semantically-similar texts by assigning them similar vector representations (resulting in the same prediction).<br />
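Point (2) is easy to see in code: a count-based bigram LM estimates probabilities from verbatim counts, so related phrases share no statistics. A toy sketch with a made-up corpus:

```python
from collections import Counter

# Toy count-based bigram LM: probabilities are maximum-likelihood
# estimates from verbatim counts, so "I'm tired" and "I'm exhausted"
# are counted separately and share no statistics.
corpus = "I'm tired today . I'm tired now . I'm exhausted today .".split()

bigrams = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs
contexts = Counter(corpus[:-1])             # counts of each preceding word

def p_bigram(word, prev):
    # MLE: count(prev, word) / count(prev)
    return bigrams[(prev, word)] / contexts[prev]

p_tired = p_bigram("tired", "I'm")          # 2/3: "I'm tired" appears twice
p_exhausted = p_bigram("exhausted", "I'm")  # 1/3: "I'm exhausted" appears once
```

The model has no way of knowing that "tired" and "exhausted" are near-synonyms; a neural LM with similar embeddings for the two words would treat the contexts similarly.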
<br />
Important note: some LMs today are trained with a different training objective, i.e. not predicting the next word in the sentence. Specifically, BERT has a "masked LM objective", i.e. hiding random words in the sentence and guessing them from their surrounding context - tokens before and after these hidden words. Text GANs (Generative Adversarial Networks) consist of two competing components: a generator that generates human-like text and a discriminator trained to distinguish between human-generated and generator-generated texts. In practice, current GAN-based text generation doesn't perform as well as generation from language models (see <a href="https://www.aclweb.org/anthology/N19-1233">here</a> and <a href="https://arxiv.org/abs/1811.02549">here</a>).<br />
<br />
<h4 dir="ltr" id="eval" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<a href="https://www.blogger.com/null" name="eval">
<span style="color: #666666; font-family: "arial"; font-size: 12pt;">Evaluating text generation</span></a></h4>
<br />
Comparing the performance of two classifiers that were trained to solve the same task is easy - we have a test set with the true label of each data point; we predict the test labels using each model, and compute the accuracy of each model compared to the true labels. We then have exactly two numeric values - the higher the accuracy, the better the model. This is not the case for text generation.<br />
<br />
Since we are talking about open-ended generation, there is no gold standard text the model is expected to produce (we have a test set, but we really just want to make sure the generated text looks like it), so how can we judge the model's quality? The best we can do is to manually examine some of the model outputs and decide whether we think it's a good (human-like?) text or not. To do so more systematically we can perform a more proper human evaluation by showing people texts generated by our model vs. texts generated by some baseline model (or by humans...), asking them to rate which is better, and aggregating across multiple judgements on multiple texts. While this is probably the best evaluation method, it is costly and takes a long time to obtain. As a result, it is usually applied to a relatively small number of texts at the final stages of the model development, and isn't used to validate texts in the intermediate steps (which can potentially help improving the model).<br />
<br />
The alternative and commonly used metric is <a href="https://en.wikipedia.org/wiki/Perplexity">perplexity</a>: by definition, it is the inverse probability of the test set, normalized by the number of words. So we want as low a perplexity score as possible, which means the probability of the test set is maximized - i.e., the LM learned a probability distribution which is similar to the "truth". The test set is just a bunch of texts which the LM has not seen before, and its probability is computed by going over it word by word and computing the LM probability of predicting each word given its past. A good LM will assign high probability to the "correct" (actual) next word and a low probability to other words.<br />
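Equivalently, perplexity is the exponent of the negative average log-probability per word. A small sketch with made-up word probabilities:

```python
import math

# Perplexity as described above: the inverse probability of the test set,
# normalized by the number of words; equivalently, exp of the negative
# average log-probability per word. Lower is better.
def perplexity(word_probs):
    avg_log_prob = sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(-avg_log_prob)

# Probabilities a LM assigned to each actual word of a toy test set
# (invented numbers for illustration).
good_lm = [0.6, 0.4, 0.7, 0.5]    # confident about the actual next words
bad_lm = [0.1, 0.05, 0.2, 0.1]    # spreads its mass over other words

assert perplexity(good_lm) < perplexity(bad_lm)
```

Intuitively, a perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 equally likely next words.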
<br />
Although perplexity is the most common evaluation metric for text generation, it is criticized for various reasons. Mainly because it has been shown that improvement in perplexity doesn't always mean an improvement in applications using the language model (it's basically not a good indicator of quality). And also because perplexity can't be used to evaluate text generation models <a href="https://nlpers.blogspot.com/2014/05/perplexity-versus-error-rate-for.html">that don't produce a distribution of words</a> under the hood, like <a href="https://www.aclweb.org/anthology/N19-1233">GANs</a>. And if you thought that evaluation metrics for non open-ended generation are better, think twice!<sup><a href="#1" name="top1">1</a></sup><br />
<br />
<h4 dir="ltr" id="dangerous" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<a href="https://www.blogger.com/null" name="dangerous">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Are language models dangerous?</span></a></h4>
<br />
In the previous <a href="https://veredshwartz.blogspot.com/2018/09/ethical-machine-learning.html">post</a> I discussed the potential misuses of machine learning models, so the starting point should be that yes, if used by people with malicious intentions, LMs may pose a danger. More specifically, the announcement from OpenAI expressed the concern that such a model, if released, may be used to generate fake news at scale. While this is not completely unreasonable, there are currently two limitations of text generation that may help reduce the fear of LMs enhancing disinformation, at least temporarily.<br />
<div>
<br /></div>
When humans generate fake news, they have certain goals - typically either to promote some propaganda or to maximize ad revenue through clicks. Unlike humans, language models don't have an agenda. The language models mentioned here were designed to generate text that looks realistic, coherent, and topically related given some human-written opening passage. There is no easy way to use them to generate controllable fake news at scale.<br />
<div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjESq4dKSxoFPZukW-L4qtKZkDw-K0yDEAVF86Q7fvHiI1st41W4boOi6z4ZrcUMf4PjOW-_DUBOCb4m078YNxuxw93jQnroLNV2D9T3DUWc0rUHr5wDjjRuwARY2prPYvrMVfEggkd1ic/s1600/fake-news-cartoon.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="386" data-original-width="1165" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjESq4dKSxoFPZukW-L4qtKZkDw-K0yDEAVF86Q7fvHiI1st41W4boOi6z4ZrcUMf4PjOW-_DUBOCb4m078YNxuxw93jQnroLNV2D9T3DUWc0rUHr5wDjjRuwARY2prPYvrMVfEggkd1ic/s1600/fake-news-cartoon.png" width="900" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div>
<br /></div>
The exception is <a href="https://medium.com/ai2-blog/counteracting-neural-disinformation-with-grover-6cf6690d463b">Grover</a>, which was designed to generate controlled text. Specifically, it was designed to generate fake news, controlled by several parameters: domain (e.g. New York Times), date, authors, and headline. Nevertheless, most importantly, the model can be used to discriminate between fake and real news very accurately. It learns to recognize the small differences between text generated by machines and by humans, and it accurately distinguishes them even when the text was generated by another language model. Which leads to the second point: machine-generated text is just not good enough yet (if good equals "human-like"). </div>
<div>
<br /></div>
<div>
Yes, generated text today is quite impressive. It is grammatical and in most cases it doesn't deviate from the topic. But it is not <a href="https://www.aclweb.org/anthology/P19-1598">fact-aware</a> (see how it <a href="https://demo.allennlp.org/gpt2?text=GPT-2%20is%20a%20language%20model%20">continues</a> the following sentence: <i>GPT-2 is a language model</i> ___), it has little common sense (and <a href="https://demo.allennlp.org/gpt2?text=She%20fell%20and%20broke%20her%20leg%20because%20someone%20left%20a%20banana%20peel%20">this one</a>:<i> she fell and broke her leg because someone left a banana peel </i>____), and as previously mentioned, often just doesn't read "human-like". Even when it does and humans can't tell that it's machine-generated, there are models that are good at detecting that. The robots may fail us humans, but not each other 🤖</div>
<div>
<br /></div>
<div>
Fear of disinformation is justified, but at least in its current state, I'm more concerned about the humans involved in it. Those who initiate and generate it, those who spread it with evil intent, and especially the many others who spread it ignorantly and naively. Perhaps, in parallel to the arms race between technology developed for and against disinformation, we can also train humans to think more critically?<br />
<div>
<br /></div>
<br />
<hr style="text-align: right;" />
<div style="direction: ltr;">
<span style="font-size: 90%;">I learned a lot of what I know about text generation pretty recently, thanks to my awesome collaborators on the text GAN evaluation <a href="https://www.aclweb.org/anthology/N19-1233">paper</a> and my teammates at AI2/UW (especially Ari Holtzman and Rowan Zellers). Thanks!</span><br />
<span style="font-size: 90%;"><br /></span>
<span class="Apple-style-span" style="font-size: 90%;"><a href="https://www.blogger.com/null" name="1" style="font-weight: bold;">1</a> The evaluation of non open-ended generation depends on the task, yet suffers from a major issue: the gold standard is a given text, but it may not be the <i>only </i>correct text due to variability in language. In machine translation, for example, the standard evaluation metric is <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a>, which basically compares chunks of text in the reference (gold standard) translation to the system's predicted translation. Various correct translations may differ in their syntactic structure or in the choice of words. Penalizing a model for not predicting the exact sentence that the human translators suggested (and which is found in the test set) is unfair, yet this is the standard way to evaluate machine translation models today. The same issue exists for summarization with the <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE</a> metric. For a much more elaborate discussion on this topic, see Rachael Tatman's <a href="https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213">blog post</a>. <sup><a href="#top1">↩</a></sup></span><br />
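To make this concrete, here is a toy Python sketch of clipped unigram precision, a crude stand-in for BLEU (real BLEU combines higher-order n-grams and a brevity penalty). An equally correct paraphrase scores far below the output that happens to match the single reference:

```python
from collections import Counter

def unigram_precision(reference, candidate):
    """Clipped unigram precision: the fraction of candidate tokens that
    also appear in the reference (each reference token usable once)."""
    ref_counts = Counter(reference.split())
    cand = candidate.split()
    matches = Counter()
    for tok in cand:
        if matches[tok] < ref_counts[tok]:
            matches[tok] += 1
    return sum(matches.values()) / len(cand)

reference  = "the cat sat on the mat"
exact      = "the cat sat on the mat"
paraphrase = "a cat was sitting on a rug"  # an equally valid rendering

print(unigram_precision(reference, exact))       # 1.0
print(unigram_precision(reference, paraphrase))  # ~0.29, unfairly low
```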
<br /></div>
<br /></div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com4tag:blogger.com,1999:blog-9145120678290195131.post-55610565361151181422018-09-14T19:33:00.000+03:002018-09-15T01:10:23.491+03:00Ethical Machine LearningWith machine learning increasingly automating many previously manual decision-making processes, it’s time to reflect not only on the algorithms’ performance but also on the ethical issues involved. Here are some of the questions these concerns raise:<br />
<div>
<br /></div>
<div>
<ul>
<li><b>Fairness</b>: are the outputs of our algorithms fair towards everyone? Is it possible that they discriminate against people based on characteristics such as gender, race, and sexual orientation?<br /></li>
<li>Are developers <b>responsible</b> for potential bad usages of their algorithms?<br /></li>
<li><b>Accountability</b>: who is responsible for the output of the algorithm?<br /></li>
<li><b>Interpretability</b> and <b>transparency</b>: in sensitive applications, can we get an explanation for the algorithm’s decision?<br /></li>
<li>Are we aware of <b>human biases</b> found in our training data? Can we reduce them?<br /></li>
<li>What should we do to use <b>user data</b> cautiously and respect user privacy?</li>
</ul>
<b style="font-weight: normal;"><br /></b>
Attesting to the importance of ethics in machine learning, it is now <a href="https://fairmlclass.github.io/">taught in classes</a>; it has a <a href="https://www.fatml.org/">growing community of researchers</a> working on it, dedicated <a href="http://ethicsinnlp.org/">workshops</a> and <a href="https://sites.google.com/view/srnlp">tutorials</a>, and a Google <a href="https://developers.google.com/machine-learning/fairness-overview/">team</a> entirely devoted to it.</div>
<div>
<br />
We are going to look into several examples.</div>
<div>
<br /></div>
<div>
<h3>
To train or not to train? That is the question</h3>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="font-variant-east-asian: normal; font-variant-numeric: normal; vertical-align: baseline;">Machine learning has evolved dramatically over the past few years, and together with the availability of data, it’s possible to do many things more accurately than before. But prior to considering implementation details, we need to pause for a second and ask ourselves: so, we <i>can</i> develop this model - but <i>should</i> we do it?<br /><br />The models we develop can have bad implications. Assuming that none of my readers is a villain, let’s think in terms of “the road to hell is paved with good intentions”. How can (sometimes seemingly innocent) ML models be used for bad purposes?</span></div>
<ul>
<li>A model that <a href="https://medium.com/@blaisea/do-algorithms-reveal-sexual-orientation-or-just-expose-our-stereotypes-d998fafdf477">detects sexual orientation</a> can be used to out people against their will</li>
<li>A model to <a href="https://arxiv.org/pdf/1611.04135v1.pdf">detect criminality using face images</a> can put innocent people behind bars</li>
<li>A model that <a href="https://en.wikipedia.org/wiki/Deepfake">creates fake videos</a> can be used as âevidenceâ for fake news</li>
<li>A text generation model can be used to generate fake (positive or negative) reviews</li>
</ul>
<br />
In some cases, the answer is obvious (do we <i>really</i> want to determine that someone is a potential criminal based on their looks?). In other cases, it’s not straightforward to weigh all the potential malicious usages of our algorithm against the good purposes it can serve. In any case, it’s worth asking ourselves this question before we start coding.</div>
<div>
<br /></div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://vignette.wikia.nocookie.net/smurfs/images/2/2e/Blue_Plague_Song_Dance.jpg/revision/latest?cb=20120508002530" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="490" data-original-width="640" height="305" src="https://vignette.wikia.nocookie.net/smurfs/images/2/2e/Blue_Plague_Song_Dance.jpg/revision/latest?cb=20120508002530" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Would you develop a model that can recognize smurfs if you knew it could be used by Gargamel? (<a href="https://vignette.wikia.nocookie.net/smurfs/images/2/2e/Blue_Plague_Song_Dance.jpg/revision/latest?cb=20120508002530">Image source</a>)</td></tr>
</tbody></table>
<div>
<br />
<br />
<h3>
Underrepresented groups in the data</h3>
So our model passed the “should we train it” phase and now it’s time to gather some data! What can go wrong in this phase?<br />
<br />
In the <a href="http://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html">previous post</a> we saw some examples of seemingly solved tasks whose models work well only for certain populations. <a href="https://www.newscientist.com/article/2141940-donate-your-voice-so-siri-doesnt-just-work-for-white-men/">Speech recognition</a> works well for white males with an American accent but less so for other populations. <a href="https://arxiv.org/pdf/1707.00061.pdf">Text analysis tools</a> don’t recognize African-American English as English. <a href="https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html">Face recognition</a> works well for white men but far less accurately for dark-skinned women. In 2015, <a href="https://www.theverge.com/2018/1/12/16882408/google-racist-gorillas-photo-recognition-algorithm-ai">Google Photos automatically labelled pictures of black people as “gorillas”</a>. <br />
<br />
The common root of the problem in all these examples is insufficient representation of certain groups in the training data: not enough speech by women, black people, and people with non-American English accents. Text analysis tools are often trained on news text, which is mostly written by adult white males. Finally, not enough facial images of black people. If you think about it, it’s not surprising. This goes all the way back to photographic film, which <a href="https://youtu.be/d16LNHIEJzs">had problems rendering dark skin</a>. I don’t actually think there were bad intentions behind any of these, just maybe - ignorance? We are all guilty of being self-centered, so we develop models that work well for people like us. In the case of the software industry, this mostly means “work well for white males”. </div>
<div>
<br /></div>
<div>
<h3>
Biased supervision</h3>
When we train a model using supervised learning, we train it to perform similarly to humans. Unfortunately, it comes with all the disadvantages of humans, and we often train models to mimic human biases. <br />
<br />
Let’s start with the classic example. Say that we would like to automate the process of mortgage applications, i.e. training a classifier to decide whether or not someone is eligible for a mortgage. The classifier is trained using previous mortgage applications with their human-made decisions (accepted/rejected) as gold labels. It’s important to note that we don’t exactly train a classifier to accurately predict an individual’s ability to pay back the loan; instead, we train a classifier to predict what a human would decide when presented with the application.<br />
<br />
We already know that humans have implicit biases and that sensitive attributes such as race and gender may affect these decisions negatively. For example, in the US, <a href="https://www.nytimes.com/2015/10/31/nyregion/hudson-city-bank-settlement.html">black people are less likely to get a mortgage</a>. Since we don’t want our classifier to learn this bad practice (i.e. rejecting a mortgage merely because the applicant is black), we leave out those sensitive attributes from our feature vectors. The model has no access to these attributes.<br />
<br />
However, analyzing the classifier’s predictions with respect to the sensitive attributes may yield surprising results: for example, that black applicants are less likely than white applicants to be found eligible for a mortgage. The model is biased against black people. How could this happen?<br />
<br />
Apparently, the classifier gets access to the excluded sensitive attributes through included attributes that are correlated with them. For example, the applicant’s address may indicate their race (<a href="https://books.google.co.il/books?id=ZvR4rB3mKB8C&pg=PA259&lpg=PA259&dq=In+the+US,+zip+code+it+is+highly+correlated+with+race&source=bl&ots=bRbSeKIJUa&sig=11GhvewUzXIqHqj014nPuRw4iPA&hl=en&sa=X&ved=2ahUKEwiEtZremuzcAhXLDuwKHVi7DucQ6AEwA3oECAcQAQ#v=onepage&q=In%20the%20US%2C%20zip%20code%20it%20is%20highly%20correlated%20with%20race&f=false">in the US, zip code is highly correlated with race</a>). Things can get even more complicated when using deep learning algorithms on text. We no longer have control over the features the classifier learns. Let’s say that the classifier now gets a textual mortgage application as input. It may be able to detect race through <a href="https://en.wikipedia.org/wiki/African-American_Vernacular_English">writing style and word choice</a>, and this time we can’t even remove specific suspicious features from the classifier. </div>
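A minimal sketch of this leakage (all numbers are hypothetical and purely illustrative): race never appears in the classifier's features, yet a rule learned over a correlated zip code reproduces a racial disparity:

```python
import random

random.seed(0)

def make_applicant():
    """Hypothetical applicant: race is recorded here only for auditing;
    the classifier never sees it. Residential segregation makes zip code
    a strong proxy: most applicants in zip 10001 are black."""
    race = random.choice(["black", "white"])
    if race == "black":
        zip_code = "10001" if random.random() < 0.9 else "10002"
    else:
        zip_code = "10002" if random.random() < 0.9 else "10001"
    return race, zip_code

applicants = [make_applicant() for _ in range(10000)]

def biased_classifier(zip_code):
    """A rule a classifier might learn from biased past decisions.
    Race is not among its inputs -- only the correlated zip code."""
    return "reject" if zip_code == "10001" else "accept"

# Auditing the decisions by race shows they are far from race-neutral:
reject_rate = {}
for race in ("black", "white"):
    group = [z for r, z in applicants if r == race]
    reject_rate[race] = sum(biased_classifier(z) == "reject" for z in group) / len(group)

print(reject_rate)  # black applicants rejected ~90% of the time, white ~10%
```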
<div>
<br /></div>
<div>
<h4>
Adversarial Removal</h4>
What can we do? We can try to actively remove anything that indicates race. <br />
<br />
We have a model that gets as input a mortgage application (X), learns to represent it as f(X) (f encodes the application text, or extracts discrete features), and predicts a decision (Y) - accept or reject. We would like to remove information about some sensitive feature Z, in our case race, from the intermediate representation f(X). <br />
<br />
This can be done by jointly training a second classifier, an “adversarial” classifier, which tries to predict race (Z) from the representation f(X). The adversary’s goal is to predict race successfully, while at the same time, the main classifier aims both to predict the decision (Y) with high accuracy and to fail the adversary. To fail the adversary, the main classifier has to learn a representation function f which does not include any signal pertaining to Z.<br />
<br />
The idea of removing features from the representation using adversarial training was presented in <a href="https://arxiv.org/abs/1505.07818">this paper</a>. Later, <a href="https://arxiv.org/abs/1707.00075">this paper</a> used the same technique to remove <i>sensitive</i> features. Finally, <a href="https://arxiv.org/pdf/1808.06640.pdf">this paper</a> experimented with textual input, and found that demographic information about the authors is indeed encoded in the latent representation. Although they managed to “fail” the adversary (as the architecture requires), a post-hoc classifier trained on the encoded texts could still detect race fairly well. They concluded that adversarial training isn’t reliable for completely removing sensitive features from the representation.</div>
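Below is a toy numpy sketch of this adversarial setup (hypothetical data, a linear encoder, and logistic heads, far simpler than the architectures in the cited papers). The encoder update follows the main head's gradient but reverses the adversary's, and the sketch compares the adversary's accuracy with and without the reversal:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Hypothetical data: y (the decision) depends on features 0-2 only;
# feature 3 is a near-perfect proxy for the sensitive attribute z.
n, d, h = 1000, 4, 8
z = rng.integers(0, 2, n).astype(float)
x = rng.normal(size=(n, d))
x[:, 3] = z + 0.1 * rng.normal(size=n)
y = (x[:, :3].sum(axis=1) > 0).astype(float)

def train(lam, steps=1500, lr=0.3):
    """Jointly train encoder f(X) = xW, main head wy, adversary head wz.
    lam > 0 reverses the adversary's gradient into the encoder."""
    rng_ = np.random.default_rng(1)
    W = rng_.normal(scale=0.1, size=(d, h))
    wy = rng_.normal(scale=0.1, size=h)
    wz = rng_.normal(scale=0.1, size=h)
    for _ in range(steps):
        f = x @ W
        gy = (sigmoid(f @ wy) - y) / n  # d(loss_y) / d(logit_y)
        gz = (sigmoid(f @ wz) - z) / n  # d(loss_z) / d(logit_z)
        grad_W_y = x.T @ (gy[:, None] * wy[None, :])
        grad_W_z = x.T @ (gz[:, None] * wz[None, :])
        wy -= lr * (f.T @ gy)           # main head minimizes its loss
        wz -= lr * (f.T @ gz)           # adversary minimizes its loss
        # Encoder follows the y-gradient but REVERSES the z-gradient,
        # pushing f(X) to stay useful for y yet uninformative about z.
        W -= lr * (grad_W_y - lam * grad_W_z)
    f = x @ W
    main_acc = ((sigmoid(f @ wy) > 0.5) == y).mean()
    adv_acc = ((sigmoid(f @ wz) > 0.5) == z).mean()
    return main_acc, adv_acc

main_plain, adv_plain = train(lam=0.0)  # no removal
main_rev, adv_rev = train(lam=1.0)      # adversarial removal
print(f"no removal:   main={main_plain:.2f} adversary={adv_plain:.2f}")
print(f"with removal: main={main_rev:.2f} adversary={adv_rev:.2f}")
```

Consistent with the finding of the last paper above, in a sketch like this the reversal typically weakens the adversary rather than reducing it all the way to chance.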
<div>
<br /></div>
<div>
<br /></div>
<div>
<h3>
Biased input representations</h3>
We’re living in an amazing time with positive societal changes. I’ll focus on one example that I personally relate to: gender equality. Every once in a while, my father emails me an article about some successful woman (CEO/professor/entrepreneur/etc.). He is genuinely happy to see more women in these jobs because he remembers a time when there were almost none. As for me, I wish for a time when this will be a non-issue - when my knowledge that women <i>can</i> do these jobs and the number of women <i>actually doing</i> these jobs finally make sense together.</div>
<div>
<br /></div>
<div>
In the <a href="http://ethicsinnlp.org/ethnlp-2017">Ethics in NLP workshop at EACL 2017</a>, Joanna Bryson distinguished between <a href="https://twitter.com/j2bryson/status/849549737766289408">3 related terms</a>: <i>bias</i> is knowing what “doctor” means, including that more doctors are male than female (if someone tells me they’re going to the doctor, I normally imagine they’re going to see a male doctor). <i>Stereotype</i> is thinking that doctors <b>should</b> be male (and consequently, that women are unfit to be doctors). Finally, <i>prejudice</i> is only using (going to / hiring) male doctors. The thing is, while we as humans--or at least some of us--can distinguish between the three, algorithms can’t tell the difference.</div>
<div>
<br /></div>
<div>
One of the places where this algorithmic inability to tell bias from stereotype shows up is word embeddings. We discussed in <a href="http://veredshwartz.blogspot.com/2017/03/women-in-stem.html">a previous post</a> <a href="https://arxiv.org/abs/1606.06121">this paper</a>, which showed that word embeddings capture gender stereotypes. For instance, when using embeddings to solve analogy problems (a toy problem often used to evaluate the quality of word embeddings), they may suggest that <i>father</i> is to <i>doctor</i> as <i>mother</i> is to <i>nurse</i>, and that <i>man</i> is to <i>computer programmer</i> as <i>woman</i> is to <i>homemaker</i>. This happens because, statistically, there are more female nurses and homemakers and more male doctors and computer programmers than vice versa, which is reflected in the training data.<br />
<br /></div>
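A toy sketch of how such analogies are computed (hand-crafted 2-d vectors that exaggerate the corpus statistics; real embeddings are learned from corpora and have hundreds of dimensions):

```python
import math

# Hand-crafted toy embeddings (hypothetical, for illustration only):
# dimension 0 ~ "gender" (female = +1, male = -1), dimension 1 ~ "medical".
emb = {
    "father":  [-1.0, 0.0],
    "mother":  [ 1.0, 0.0],
    "doctor":  [-0.7, 1.0],  # skewed male in the (imagined) corpus
    "surgeon": [-0.5, 1.0],  # skewed male
    "nurse":   [ 0.8, 1.0],  # skewed female
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(a, b, c):
    """a is to b as c is to ? -- nearest neighbor of v(b) - v(a) + v(c)."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = set(emb) - {a, b, c}
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("father", "doctor", "mother"))  # 'nurse'
```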
<div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibwUKFWR0AXfwxr0pkoAc34XoJqdbsEWrmGKs1_iE0IwnyP9OZcXu77CCDjEGxgvGlzONdKh-_mhKKLbm-GDKMn-Dy_gwgjj5yqYm1YvFhhC7TAwblHvJOuQKGDZVKgJg1iJsiAA4VSX8/s1600/doctor-nurse.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="416" data-original-width="1600" height="166" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibwUKFWR0AXfwxr0pkoAc34XoJqdbsEWrmGKs1_iE0IwnyP9OZcXu77CCDjEGxgvGlzONdKh-_mhKKLbm-GDKMn-Dy_gwgjj5yqYm1YvFhhC7TAwblHvJOuQKGDZVKgJg1iJsiAA4VSX8/s640/doctor-nurse.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 12.8px;">Google image search for “doctor” (left) and “nurse” (right): there are many more female than male nurse images. </td><td class="tr-caption" style="font-size: 12.8px;"></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<br />
However, we treat word embeddings as representing meaning. By doing so, we engrave “male” into the meaning of “doctor” and “female” into the meaning of “nurse”. These embeddings are then commonly used in applications, which might inadvertently amplify these unwanted stereotypes.</div>
<div>
<br />
The suggested solution in that paper was to “debias” the embeddings, i.e., to try to remove the bias from them. There are two problems with this approach: first, you can only remove biases that you are aware of. Second, and I find this worse, it removes some of the characteristics of a concept. As opposed to removing sensitive features from classification models, where the removed features (e.g. race) have nothing to contribute to the classification, here we are removing an important part of a word’s meaning. We still want to know that most doctors are men; we just don’t want a meaning representation in which <i>woman</i> and <i>doctor</i> are incompatible concepts.<br />
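A minimal sketch of the kind of projection such debiasing performs (toy 2-d vectors, hypothetical numbers; the actual paper estimates a gender direction from word pairs and includes further steps). It also illustrates the information loss described above: after removing the gender component, doctor and nurse become indistinguishable:

```python
def project_out(v, g):
    """Remove from vector v its component along the unit direction g."""
    dot = sum(a * b for a, b in zip(v, g))
    return [a - dot * b for a, b in zip(v, g)]

gender = [1.0, 0.0]   # hypothetical unit "she - he" direction (dim 0)
doctor = [-0.7, 1.0]  # leans "male" along dim 0
nurse  = [ 0.8, 1.0]  # leans "female" along dim 0

print(project_out(doctor, gender))  # [0.0, 1.0]
print(project_out(nurse, gender))   # [0.0, 1.0] -- now identical to doctor
```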
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;"><br class="kix-line-break" /></span></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre;"><br /></span></div>
<b style="font-weight: normal;"><br /></b>
The sad and trivial take-home message is that algorithms only do what we teach them to do, so “racist algorithms” (e.g. the <a href="https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist">Microsoft chatbot</a>) are only racist because they learned it from people. If we want machine learning to help build a better reality, we need to research not just techniques for improved learning, but also ways to teach algorithms what <b>not</b> to learn from us.</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com10tag:blogger.com,1999:blog-9145120678290195131.post-63285960973420044382018-08-14T21:21:00.001+03:002018-08-15T13:35:33.527+03:00Deep Learning in NLP<div style="line-height: 1.38; margin-bottom: 6pt; margin-top: 20pt;">
This post is an old debt. Since I started this blog 3 years ago, I’ve been refraining from writing about deep learning (DL), except for occasionally discussing a method that uses it, without going into details. It’s a challenge to explain deep learning using simple concepts without remaining at a very high level. But perhaps worse, DL somewhat contradicts the description of my blog: “human-<i>interpretable</i> computer science”. We don’t really know how or why it works, and attempts to interpret it still only scratch the surface. On the other hand, it has been so prominent in NLP in the last few years (Figure 1) that it’s no longer reasonable to ignore it in a blog about NLP. So here’s my attempt to talk about it. </div>
<div style="line-height: 1.38; margin-bottom: 6pt; margin-top: 20pt;">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0MTCaIo4Tf1OsYC9jFkxHHviKrTTtHG3lCSRdT8knU43CaT-mR11-cIp9C7-ILhFOShwgN9uqAqtIbU70vFUJsZXoRQ_Y2Vazkh4diSiaUUsL7xOv3-0CTESGn_hDggLsuy7zKzmxFlc/s1600/word_cloud_ACL.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="682" data-original-width="648" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0MTCaIo4Tf1OsYC9jFkxHHviKrTTtHG3lCSRdT8knU43CaT-mR11-cIp9C7-ILhFOShwgN9uqAqtIbU70vFUJsZXoRQ_Y2Vazkh4diSiaUUsL7xOv3-0CTESGn_hDggLsuy7zKzmxFlc/s400/word_cloud_ACL.png" width="380" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: left;">Figure 1: word cloud of the words in the titles of the accepted papers of ACL 2018, from <a href="https://acl2018.org/2018/07/31/conference-stats/">https://acl2018.org/2018/07/31/conference-stats/</a>. Note the prevalence of deep learning-related words such as “embeddings”, “network”, “rnns”.</td></tr>
</tbody></table>
</div>
<div>
<br /></div>
This writing is based on many resources and most of the material is a summary of the main points I have taken from them. Credit is given in the form of linking to the original source. I would be happy to get corrections and comments!<br />
<div>
<br />
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#meant_for">Who is this post meant for?</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#what_is_not">What this post is not</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#lets_talk">Let’s talk about deep learning already!</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#improved">What has improved?</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#going_deep">Going Deep</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#rep_learning">Representation Learning</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#rnns">Recurrent Neural Networks</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#bad">What is not yet working perfectly?</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#amount">The Need for Unreasonable Amount of Data</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#overfitting">The Risk of Overfitting</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#artificial_data">The Artificial Data Problem</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#we">The Shaky Ground of Distributional Word Embeddings</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#cram">The Unsatisfactory Representations Beyond the Word Level</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#robustness">The Non Existing Robustness</a></div>
<div>
<a href="https://veredshwartz.blogspot.com/2018/08/deep-learning-in-nlp.html#interpretability">The Lack of Interpretability</a>
<br />
<h3 dir="ltr" id="meant_for" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 13.999999999999998pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Who is this post meant for?</span></h3>
As always, this post will probably be too basic for most NLP researchers, but you’re welcome to distribute it to people who are new to the field! And if you’re like me and you enjoy reading people’s views about things you already know, you’re welcome to read too!<br />
<h3 dir="ltr" id="what_is_not" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 13.999999999999998pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">What this post is not</span></h3>
This post is NOT an extensive list of all the most up-to-date recent innovations in deep learning for NLP; I don’t know many of them myself. I tried to remain at a relatively high level, so there are no formulas or formal definitions. If you want an extensive, detailed overview of how deep learning methods are used in NLP, I strongly recommend Yoav Goldberg’s “<a href="https://www.morganclaypool.com/doi/abs/10.2200/S00762ED1V01Y201703HLT037">Neural Network Methods for Natural Language Processing</a>” book. This post will also not teach you anything practical. If you’re looking to learn DL for NLP in 3 days and strike a fortune, there are tons of useful guides on the web. Not this post!<br />
<h3 dir="ltr" id="lets_talk" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: “arial”; font-size: 13.999999999999998pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">Let’s talk about deep learning already!</span></h3>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
OK! Deep learning is a subfield of machine learning. As with machine learning, the goal is to give computers the ability to “learn” to solve some task without being explicitly programmed to solve it, but rather by using data and applying statistical methods to it. In the common case of <a href="http://veredshwartz.blogspot.com/2015/08/supervised-learning.html">supervised learning</a>, the idea is to make the computer estimate some function that takes some input and returns a prediction with respect to some question. I know this is very vague, so let’s take the task of <a href="https://en.wikipedia.org/wiki/Named-entity_recognition">named entity recognition</a> (NER) as an example. The task is to determine, for each word in a sequence of words (say, a sentence), whether it is part of a named entity, and if so, of which type. In the text “<i>Prince Harry attended a wedding with a hole in his shoe</i>” (I didn’t make <a href="https://www.harpersbazaar.com/celebrity/latest/a22665613/prince-harry-hole-in-shoe-wedding/">it</a> up), the words “Prince” and “Harry” should be classified as PERSON, and the other words as NONE. A machine learning algorithm expects as input a set of properties pertaining to each word (called a “feature vector”), and predicts a label for this word, out of a set of possible labels (e.g. PERSON, ORGANIZATION, LOCATION, NONE).<br />
<br />
In the context of NLP, machine learning has been used for a long time, so what is new? Before we go through the differences, let’s think of the characteristics of a traditional machine learning algorithm, specifically a supervised classifier. In general, a traditional supervised classification pipeline was as follows.</div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<br />
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"><br /></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnG0obRAlKDd7clZgextOaEAvnOEIqd_JFhSsg_eO0qliEHAot5Y9u8IhtGzCa0v_nlckJsosOjkcWau33ivHYe20SrIvgkzPo8KBeOQwlD7ezrl1NoXSt-CmKbtlVkv3Ao9juy24ONrU/s1600/traditional_ML.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="335" data-original-width="977" height="136" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnG0obRAlKDd7clZgextOaEAvnOEIqd_JFhSsg_eO0qliEHAot5Y9u8IhtGzCa0v_nlckJsosOjkcWau33ivHYe20SrIvgkzPo8KBeOQwlD7ezrl1NoXSt-CmKbtlVkv3Ao9juy24ONrU/s400/traditional_ML.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 12.8px;"><span style="font-size: 12.8px;">Figure 2: pipeline of a traditional supervised classifier - extracting features, then approximating the function.</span></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
</div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
The input is raw (e.g. text), along with the gold labels. Phase (1) is the extraction of human-designed features from the raw input, representing it as vectors of features. Going back to the NER example, the person designing the learning algorithm would ask themselves: “what can indicate the type of entity of a given word?”. For example, a capitalized word is a good indication that a word is part of a named entity in general. When the previous word is “The”, it can hint that the entity is an organization rather than a person. Someone had to come up with all these ideas. Here is an example of features that were previously used for NER:</div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<br />
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"><br /></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpUxdZ_bQ2BDmKwW398sNAnjKLi8ST9NqKbCEhtJ31gYZ57wYzqEMAFQZLv4v7h2w1G6ZQoPq89QvLnJcBONBZOP5rXtyH51tHv5FcavLl4i06w9SlQJ4rp3iB_BWomDvphrnJ2Bu_05o/s1600/ner_features.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="228" data-original-width="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpUxdZ_bQ2BDmKwW398sNAnjKLi8ST9NqKbCEhtJ31gYZ57wYzqEMAFQZLv4v7h2w1G6ZQoPq89QvLnJcBONBZOP5rXtyH51tHv5FcavLl4i06w9SlQJ4rp3iB_BWomDvphrnJ2Bu_05o/s1600/ner_features.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: left;">Figure 3: features for NER, from <a href="https://nlp.stanford.edu/manning/papers/gibbscrf3.pdf">this paper</a>. The example is taken from Stanford <a href="http://cs224d.stanford.edu/">CS224d</a> course: Deep Learning for Natural Language Processing. </td></tr>
</tbody></table>
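To make phase (1) concrete, here is a minimal sketch (in Python, with made-up feature names) of what hand-designed feature extraction might look like; real feature sets, like the one in figure 3, were much richer:

```python
# A toy sketch of hand-designed NER feature extraction. The feature names
# and the helper itself are hypothetical, for illustration only.
def extract_features(words, i):
    """Return a feature dict for the word at position i in the sentence."""
    word = words[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[0].isupper(),
        "prev_word": words[i - 1].lower() if i > 0 else "<START>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "<END>",
        "suffix3": word[-3:].lower(),   # e.g. "-ion" hints at common nouns
    }

sentence = "The Post Office opened".split()
feats = extract_features(sentence, 1)   # features for "Post"
```

Every such feature had to be invented, implemented, and validated by a person before any learning happened.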
<span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline;"><br /></span>
Phase (2) is the learning/training phase, in which the computer tries to approximate a function that takes as input the feature vectors and predicts the correct labels. The function is now a function of the features, i.e., in the simple case of a <a href="https://en.wikipedia.org/wiki/Linear_classifier">linear classifier</a>, it assigns a weight for each feature with respect to each label. Features which are highly indicative of a label (e.g. previous word = The for ORGANIZATION) would be assigned a high weight for that label, and those which are highly indicative of not belonging to that label would be assigned a very low weight. The final prediction is done by summing up all the weighted feature values, and choosing the label that got the highest score.</div>
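The prediction step of such a linear classifier can be sketched as follows (the weights here are hand-picked toy values, not learned ones):

```python
# A toy sketch of a linear classifier's prediction: sum the weighted
# feature values per label, then choose the highest-scoring label.
def predict(feature_vector, weights, labels):
    scores = {}
    for label in labels:
        scores[label] = sum(weights[label].get(f, 0.0) * value
                            for f, value in feature_vector.items())
    return max(scores, key=scores.get)

labels = ["PERSON", "ORGANIZATION"]
weights = {
    # "prev_word=The" is highly indicative of ORGANIZATION,
    # and slightly indicative of *not* being a PERSON.
    "PERSON": {"is_capitalized": 1.0, "prev_word=The": -0.5},
    "ORGANIZATION": {"is_capitalized": 1.0, "prev_word=The": 2.0},
}
x = {"is_capitalized": 1.0, "prev_word=The": 1.0}
predict(x, weights, labels)  # → "ORGANIZATION"
```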
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
Note that I’m not discussing <i>how</i> these weights are learned from the training examples and the gold labels. This is not super important for the current discussion, and you can read about it elsewhere (e.g. <a href="https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/">here</a>).</div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<br /></div>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
Now let’s talk about the differences between traditional machine learning and deep learning.</div>
<h3 dir="ltr" id="improved" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 13.999999999999998pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">What has improved?</span></h3>
First of all, it works, empirically. The state-of-the-art performance in nearly every NLP task today is achieved by a neural model. Literally every task listed on this website “<a href="https://nlpprogress.com/">Tracking Progress in Natural Language Processing</a>” has a neural model as the best performing model. Why is it working so much better than previous approaches?<br />
<h4 dir="ltr" id="going_deep" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Going Deep</span></h4>
“Deep learning” is a buzzword referring to deep neural networks - powerful learning algorithms inspired by the brain’s computation mechanism. A simple single-layered feed-forward neural network is basically the same as the linear classifier we discussed. A deep neural network is a network that contains one or more hidden layers, which are also learned. This is the visual difference between a linear classifier (left, a neural network with no hidden layers) and a neural network with a single hidden layer (right):</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3R18AXB7EP2nJmd0ruOB0S8pyPdA6ATUZ7rowfyOPBTdPZjYrr5gAUmqpV3uPCZQwEJO7fAZ8WWL8BjZm3gmYSFjFV3WHAkuqgWYvSKP_5othx3uq0as5h9cAgyoCMKDFSJMQMMCPiR8/s1600/shallow_vs_deep.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="314" data-original-width="900" height="138" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3R18AXB7EP2nJmd0ruOB0S8pyPdA6ATUZ7rowfyOPBTdPZjYrr5gAUmqpV3uPCZQwEJO7fAZ8WWL8BjZm3gmYSFjFV3WHAkuqgWYvSKP_5othx3uq0as5h9cAgyoCMKDFSJMQMMCPiR8/s400/shallow_vs_deep.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 4: shallow neural network (left) vs. deep (1 hidden layer) neural network (right).</td></tr>
</tbody></table>
<div>
<br />
In general, the deeper the network is, the more complex functions it can estimate, and the better it can (theoretically) approximate them.</div>
<div>
<br /></div>
<div>
In mathematical notation, each layer is a multiplication by another learned matrix, but more importantly, it also goes through a non-linear activation function (e.g. the <a href="https://en.wikipedia.org/wiki/Hyperbolic_function">hyperbolic tangent</a> or the simple <a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">rectifier</a>, which sets negative values to zero). Some functions can’t be approximated accurately enough using linear models. I’m deliberately not going into details, but you can read about linear separability <a href="http://www.ece.utep.edu/research/webfuzzy/docs/kk-thesis/kk-thesis-html/node19.html">here</a>. With the help of the multiple layers and the non-linear activations, neural networks can better approximate the functions needed for many tasks, resulting in improved performance.</div>
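As a minimal illustration (with random toy matrices standing in for learned parameters, not a trained model), the forward pass of a one-hidden-layer network from figure 4 looks like this:

```python
import numpy as np

# A sketch of a one-hidden-layer forward pass: two learned matrices
# with a non-linear activation (the rectifier, ReLU) in between.
def relu(z):
    return np.maximum(z, 0)       # negative values become zero

def forward(x, W1, b1, W2, b2):
    h = relu(W1 @ x + b1)         # hidden layer: matrix multiply + non-linearity
    return W2 @ h + b2            # output layer: one score per label

rng = np.random.default_rng(0)
x = rng.normal(size=4)                              # a 4-dimensional input
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)       # input -> 8 hidden units
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)       # hidden -> 3 label scores
scores = forward(x, W1, b1, W2, b2)
```

Removing the hidden layer (keeping only `W2 @ x + b2`-style computation) collapses this back into the linear classifier on the left of figure 4.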
<div>
<br /></div>
<div>
<h4 dir="ltr" id="rep_learning" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Representation Learning</span></h4>
Another key aspect that changed is how the input is represented. Traditional machine learning methods worked because someone designed a well-thought feature vector to represent the inputs, as we can see in figure 3. Deep learning obviates the need to come up with a meaningful representation, and learns a representation from raw input (e.g. words). The new pipeline looks like this:</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBwXycVyn9dLCldCpZb-qo041-L-5_1qeG6EDUL5vjekYm5Pt1xp2udkCCAVmvJQgxbN7RWcwCq3kuHTmyHIcRZGFC76Mx6dtXOpvmKTyK6LOsW2r2_U1XTQIYd_z1WuI7r2BICbjG3Ps/s1600/deep_learning.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="327" data-original-width="748" height="173" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBwXycVyn9dLCldCpZb-qo041-L-5_1qeG6EDUL5vjekYm5Pt1xp2udkCCAVmvJQgxbN7RWcwCq3kuHTmyHIcRZGFC76Mx6dtXOpvmKTyK6LOsW2r2_U1XTQIYd_z1WuI7r2BICbjG3Ps/s400/deep_learning.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 12.8px;"></td><td class="tr-caption">Figure 5: pipeline of a deep supervised classifier - learning both the input representation and the other model parameters.</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
Now, rather than feeding the learning algorithm hand-engineered feature vectors, we only give it the raw text. In the case of NER, we can now just feed as features a window of words around each target word. For example, in the sentence “<i>John worked at the Post Office in the city until last year</i>”, the feature vector of the target word “Office” with a window of size 3 would be [Post, Office, in]. The learning algorithm, in addition to learning the network parameters (the function from representation to output), as it did before, also learns the word representations suitable for the task at hand. In other words, one of the additional parameters that the network learns is the word embeddings. We can think of it as a lookup table whose index is a word (string) and its output is a vector.</div>
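A minimal sketch of the lookup table idea, with toy random vectors standing in for learned embeddings:

```python
import numpy as np

# The embedding "lookup table": each vocabulary word maps to a vector.
# The vectors here are random placeholders; in a real network they are
# parameters, updated during training like any other weights.
dim = 5
vocab = ["john", "worked", "at", "the", "post", "office", "in"]
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=dim) for w in vocab}

# The window [Post, Office, in] becomes one concatenated input vector.
window = ["post", "office", "in"]
x = np.concatenate([embeddings[w] for w in window])   # 3 * 5 = 15 dimensions
```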
<div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
It is common to initialize this lookup table with pre-trained word embeddings. We discussed word embeddings in <a href="http://veredshwartz.blogspot.com/2016/01/representing-words.html">this blog post</a>. They are trained using a large text collection, based on a linguistic hypothesis which states that words with similar meanings appear in the same contexts (next to the same “neighbour” words). Pre-trained embeddings are useful because they are often trained on a lot more data than is available for the end task itself, and the more data, the higher-quality the vectors.</div>
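Similarity between embedding vectors is typically measured with cosine similarity. Here is a toy illustration (the vectors are invented for the example, not real pre-trained embeddings):

```python
import numpy as np

# Cosine similarity: 1 for vectors pointing the same way, -1 for opposite.
def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors: "good" and "bad" share contexts ("the food was ___"),
# so distributional training places them close together.
good = np.array([0.9, 0.8, 0.1])
bad = np.array([0.85, 0.75, 0.2])
piano = np.array([-0.3, 0.1, 0.9])

cosine(good, bad) > cosine(good, piano)  # → True
```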
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
On the other hand, pre-trained embeddings provide a general notion of “similarity” between words, which is not necessarily the same similarity our task needs. Think for example about <a href="https://en.wikipedia.org/wiki/Sentiment_analysis">sentiment analysis</a>. Let’s say we are trying to predict the sentiment of restaurant reviews. Generic pre-trained embeddings will tell us that <i>good</i> and <i>bad</i> are highly similar, because they appear near the same words. But in the context of sentiment analysis we’d like their vectors to be further apart. The developer can choose between initializing the lookup table with the pre-trained embeddings or randomly, and also between updating the embeddings as additional network parameters, to fit the task better, or keeping them fixed. We’ll touch upon that when we discuss overfitting. </div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
Back to the NER example, this is what a network with a window size of 3 would look like:</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaohGoIfG6ZwK_pm5piWkM1LwJgmP2dKhk8nJRkB7_KXkfMTQvh9DExqbLOuIVuzU-z3bP0jsp8HobIn0BP4ZKS9OAcPCITgF9zK62U9yVjAmYCh6DjebnEGvm1_wKbD5MTjFdAwUSvpQ/s1600/ner_nn.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="596" data-original-width="766" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjaohGoIfG6ZwK_pm5piWkM1LwJgmP2dKhk8nJRkB7_KXkfMTQvh9DExqbLOuIVuzU-z3bP0jsp8HobIn0BP4ZKS9OAcPCITgF9zK62U9yVjAmYCh6DjebnEGvm1_wKbD5MTjFdAwUSvpQ/s400/ner_nn.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 6: a neural network for NER, which uses a window of 3 words to classify a word.</td></tr>
</tbody></table>
</div>
<div>
<br /></div>
<div>
In a traditional model, the feature “next word” is a discrete variable that can accept as its value any word in the vocabulary. Let’s say that during training, the model encountered a word like “ltd” many times as the next word of an organization, and figured that it is a good indication of the ORGANIZATION class. If during test time the model needs to classify a word followed by “Inc”, it may have no information about this word, and can’t generalize the knowledge about the similar word “ltd”. When the feature vector is composed of word embeddings, since the word embeddings of “ltd” and “Inc” are similar, the inputs are now similar, and the model can use knowledge about similar words to output the correct prediction.<br />
<br />
<div>
<h4 dir="ltr" id="rnns" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">Recurrent Neural Networks</span></h4>
<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
The network we discussed so far is called a feed-forward neural network. In the NER example we used a fixed-size window of words around each target word. But what if we want to use the entire sentence? For example, in the sentence “<i>John worked at the Post Office in the city until last year, and hated this organization</i>” it is beneficial for the model to be aware of the last word “organization” while predicting the labels of “Post” and “Office”. One problem is that we don’t know in advance how many words each input sentence will contain. Another problem is that the representation of each word in the window model is independent of the other words - and we would like the representation to be more contextualized. For instance, the representation of “post” in the context of “post office” should be different from its representations in “blog post” or “post doc”.<br />
<br />
Recurrent neural networks (RNNs) solve both problems. An RNN is a network that takes as input a sequence (e.g. a sentence as a sequence of words, or a sequence of characters, etc.) and outputs vectors representing each subsequence (i.e. the first item, the first two items, ..., the entire sequence). At each time step, the RNN considers both the previous memory (of the sequence up to the previous input) and the current input. The last output vector can be considered as representing the entire sequence, like a sentence embedding. Intermediate vectors can represent a word in its context.<br />
<br />
The output vectors can then be used for many purposes, among them: representation - a fixed-size vector representation for arbitrarily long texts; feature vectors for classification (e.g. representing a word in context, and predicting its NER tag); or generating new sequences (e.g. in translation, a sequence-to-sequence or seq2seq model encodes the source sentence and then decodes, i.e. generates, the translation). In the case of the NER example, we can now use the output vector corresponding to each word as the word’s feature vector, and predict the label based on the entire preceding context.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBNHM7ZyO51YnKS0BO1QE7sbF3hoQJsiaGPlI9tuzPlmSHMo6DVq1n6JmwZXnoZh-iw_SIYK1vX93tnFtXwHmu3wBPYvu6ub5WNjTLTfMkkArgHqEiAJ_QzbAO6ITQZVJh_XQ2tJCiSU0/s1600/ner_rnn.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="338" data-original-width="958" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBNHM7ZyO51YnKS0BO1QE7sbF3hoQJsiaGPlI9tuzPlmSHMo6DVq1n6JmwZXnoZh-iw_SIYK1vX93tnFtXwHmu3wBPYvu6ub5WNjTLTfMkkArgHqEiAJ_QzbAO6ITQZVJh_XQ2tJCiSU0/s640/ner_rnn.jpg" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 7: a NER model that uses an RNN to represent each word in its context.<span style="white-space: pre;"> </span></td></tr>
</tbody></table>
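The recurrence at the heart of an RNN can be sketched in a few lines (a simple Elman-style RNN with toy random weights; real implementations like LSTMs have a more elaborate internal memory):

```python
import numpy as np

# At each time step: combine the previous hidden state (the "memory")
# with the current input embedding, through a non-linearity.
def rnn(inputs, W_h, W_x, b):
    h = np.zeros(W_h.shape[0])              # initial (empty) memory
    outputs = []
    for x in inputs:                        # one word embedding per step
        h = np.tanh(W_h @ h + W_x @ x + b)  # new memory = f(old memory, input)
        outputs.append(h)                   # h represents the prefix so far
    return outputs                          # outputs[-1] ~ sentence embedding

rng = np.random.default_rng(0)
dim, hidden = 4, 6
sentence = [rng.normal(size=dim) for _ in range(5)]   # 5 toy "word" vectors
W_h = rng.normal(size=(hidden, hidden))
W_x = rng.normal(size=(hidden, dim))
outs = rnn(sentence, W_h, W_x, np.zeros(hidden))
```

Note that the same three weight matrices are reused at every time step, which is what lets the network handle sentences of any length.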
</div>
<div>
<br /></div>
Technical notes: an LSTM (long short-term memory) is a specific type of RNN that works particularly well and is commonly used. The differences between various RNN architectures are in the implementation of the internal memory. A bidirectional RNN/LSTM or a biLSTM processes the sequences from both sides - right to left and left to right, such that each output vector contains information pertaining to both the previous and the subsequent items in the sequence. For a much more complete overview of RNNs, I refer you to Andrej Karpathy's blog post <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">"The Unreasonable Effectiveness of Recurrent Neural Networks"</a>.</div>
<div>
<br /></div>
<div>
<br />
So these are the main differences. There are many other new techniques on top of them, such as <a href="https://talbaumel.github.io/blog/attention/">attention</a>, <a href="https://medium.freecodecamp.org/an-intuitive-introduction-to-generative-adversarial-networks-gans-7a2264a81394">Generative Adversarial Networks</a> (GAN), <a href="https://www.cs.ucsb.edu/~william/papers/ACL2018DRL4NLP.pdf">deep reinforcement learning</a>, and <a href="http://ruder.io/multi-task/">multi-task learning</a>, but we won’t discuss them.<br />
<br />
Interestingly, neural networks are not a new idea. They have been around since the 1950s, but have become increasingly popular in recent years thanks to the advances in computing power and the amount of available text on the web.<br />
<br />
<h3 dir="ltr" id="bad" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 16pt;">
<span style="background-color: transparent; color: #434343; font-family: "arial"; font-size: 13.999999999999998pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;">What is not yet working perfectly?</span></h3>
Although the popular media likes to paint an optimistic picture of “<a href="http://www.itpro.co.uk/business-intelligence/artificial-intelligence-google/26540/google-to-change-ai-forever-with-open">AI is solved</a>” (or rather, a very pessimistic, false picture of “<a href="https://www.forbes.com/sites/tonybradley/2017/07/31/facebook-ai-creates-its-own-language-in-creepy-preview-of-our-potential-future/#1d1be7b2292c">AI is taking over humanity</a>”), in practice there are still many limitations to current deep methods. Here are several of them, mostly from the point of view of someone who works on semantics (feel free to add more limitations in the comments!).<b style="font-weight: normal;"><br /></b>
<br />
<h4 dir="ltr" id="amount" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">The Need for an Unreasonable Amount of Data</span></h4>
It’s difficult to convey the challenges in our work to people outside the NLP field (and related fields), when there are already many products out there that perform so well. Indeed, we have come a long way to be able to generally claim that some tasks are “solved”. This is especially true of low-level <a href="http://veredshwartz.blogspot.com/2016/06/linguistic-analysis-of-texts.html">text analysis</a> tasks such as part of speech tagging. But the performance on more complex semantic tasks like machine translation is surprisingly good as well. So what is the secret sauce of success, and why am I still unsatisfied?</div>
<div>
<br />
First, let’s look at a few examples of tasks “solved” by DL, not necessarily in NLP:</div>
<div>
<ul>
<li><u>Automatic Speech Recognition (ASR)</u> - also known as speech-to-text. Deep learning-based methods reported human-level performance last year, but this interesting <a href="https://awni.github.io/speech-recognition/">blog post</a> tells us differently. According to this post, while the recent improvements are impressive, the claims about human-level performance are too broad. ASR works very well on <b>American-accented English with high signal-to-noise ratios</b>. It has been trained on conversations by mostly American native English speakers with little background noise, which are available at large scale. It doesn’t work well, and definitely not at human-level performance, for other languages, accents, non-native speakers, etc.<br /></li>
</ul>
<div dir="ltr" style="font-family: arial; font-size: 11pt; line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<div style="text-align: center;">
<span style="color: black; font-family: "arial"; font-size: 11pt; vertical-align: baseline;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/sAz_UvnUeuU/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/sAz_UvnUeuU?feature=player_embedded" width="320"></iframe><br /><br /></span></div>
</div>
<ul style="margin-bottom: 0pt; margin-top: 0pt;">
<li><u>Facial Recognition</u> - another task claimed to be solved, but this <a href="https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html">article</a> from the New York Times (based on a study from MIT) says that it’s only solved for <b>white men</b>. An algorithm trained to identify gender from images was 99% accurate on images of white men, but far less accurate -- only 65% -- for dark-skinned women. Why is that so? A widely used dataset for facial recognition was estimated to be more than 75% male and more than 80% white.<br /><br /></li>
<li dir="ltr" style="font-variant-east-asian: normal; font-variant-numeric: normal; list-style-type: disc; vertical-align: baseline;"><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
<div class="separator" style="clear: both; text-align: left;">
<u>Machine Translation</u> - not claimed to be solved yet, but the release of <a href="https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html">Google Translate</a> neural models in 2016 reported large performance improvements. The <a href="https://arxiv.org/pdf/1609.08144.pdf">paper</a> reported a “60% reduction in translation errors on several popular language pairs”. The language pairs were <b>English→Spanish, English→French, English→Chinese, Spanish→English, French→English, and Chinese→English.</b> All these languages are considered “high-resource” languages, or in simple words, languages for which there is a massive amount of training data. As we discussed in the <a href="http://veredshwartz.blogspot.com/2015/09/translation-models.html">blog post about translation models</a>, the training data for machine translation systems is a large collection of the same texts written in both the source language (e.g. English) and the target language (e.g. French). Think of book translations as an example of a source of training data.</div>
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"><br class="kix-line-break" /></span>In other news, popular media was recently worried that <a href="https://mashable.com/2018/07/23/google-translate-glitch-ominous-religious-prophecies/#uRhHltM46kqy">Google Translate spits out some religious nonsense</a>, completely unrelated to the source text. Here are two examples: the top one contains religious content, while the bottom one is not religious, just unrelated to the source text.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxRqnzXgFSp2xjQI7GKhZrgLwKEyJVGDYb8spmd9mmiJHYFgEWzRvXgAC-cYD9e-01mGAIKmO-KHfN1xJhQWPtqTeFCH0A4vBU7R8yM_ByhwwzZgNVxcVQycz1_eMBTGy3QBTKBxZSXaw/s1600/MT_bullshit.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="360" data-original-width="947" height="150" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxRqnzXgFSp2xjQI7GKhZrgLwKEyJVGDYb8spmd9mmiJHYFgEWzRvXgAC-cYD9e-01mGAIKmO-KHfN1xJhQWPtqTeFCH0A4vBU7R8yM_ByhwwzZgNVxcVQycz1_eMBTGy3QBTKBxZSXaw/s400/MT_bullshit.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">Figure 8: “who has been using these technologies for a long time”? Hopefully not the Igbo speakers, translating to English. Google Translate makes up things when translating gibberish from the low-resource language Igbo.</td></tr>
</tbody></table>
This excellent <a href="http://deliprao.com/archives/301">blog post</a> offers a simple explanation for this phenomenon: “low-resource” language pairs (e.g. Igbo and English), for which there is not a lot of available data, have worse neural translation models. Not surprising so far. The reason the translator generates nonsense is that when it is given unknown inputs (the input nonsense in figure 8), the system tries to provide a fluent translation and ends up “hallucinating” sentences. Why religious texts? Because religious texts like the Bible and the Koran exist in many languages, and they are probably the major part of the available training data for translations between low-resource languages.</div>
</li>
</ul>
<br />
<br />
Can you spot a pattern among the successful tasks? Having a tremendous amount of training data increases the chances of training high-quality neural models for the task. Of course, it is a necessary-but-not-sufficient condition. In her <a href="https://www.dropbox.com/s/it1e4ndrcuevl04/Repl4NLP.pdf?dl=0">RepL4NLP</a> keynote, Yejin Choi talked about “solved” tasks and said that they all have in common a lot of training data and enough layers. But with respect to NLP tasks, there is another factor. Performing well on machine translation is possible without a model with deep text understanding abilities, but rather by relying on the strong alignment between the input and the output. Other tasks, which require deeper understanding, such as recognizing fake news, summarizing a document, or making conversation, have not been solved yet. (And may not be solvable just by adding more training data or more layers?)<br />
<br />
The models of these “solved” tasks are only applicable to inputs which are similar to the training data. ASR for a Scottish accent, facial recognition for black women, and translation from Igbo to English are not solved yet. Translation is not the only NLP example; whenever someone tells you some NLP task is solved, it probably only applies to very specific domains in English.<br />
<br />
<h4 dir="ltr" id="overfitting" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">The Risk of Overfitting</span></h4>
Immediately following the previous point, when the training data is limited, we have a risk of overfitting. By definition, overfitting happens when a model performs extremely well on the training set while performing poorly on the test set. It happens because the model memorizes specific patterns in the training data instead of looking at the big picture. If these patterns are not indicative of the actual task, and are not present in the test data, the model will perform badly on the test data. For example, let’s say you’re working on a very simplistic variant of text classification, in which you need to distinguish between news and sports articles. Your training data contains articles about sports and tweets from news agencies. Your model may learn that an article with fewer than 280 characters is news. The performance would be great on the training set, but what your model actually learned is not to distinguish between news and sports articles but rather between tweets and other texts. This definitely won’t be helpful in production, when your news examples can be full-length articles. <br />
<br />
Overfitting is not new to DL, but what’s changed from traditional machine learning are two main aspects: (1) it’s more difficult to “debug” DL models and detect overfitting, because we no longer have nice manually-designed features, but automatically learned representations; and (2) the models have many more parameters than traditional machine learning models used to have - the more layers, the more parameters. This means that a model can now learn more complex functions, but it’s not guaranteed to learn the best function for the task - rather, the best function for the given data. Unfortunately, these are not always the same, and this is something to keep in mind!</div>
<div>
<br /></div>
<div>
With respect to the first aspect, updating the pre-trained word embeddings during training can lead to overfitting. Given that the training set is limited and doesn’t cover the entire vocabulary, we’re only moving some words in the embedding space while keeping others in their place. The words we move (e.g. <i>kangaroo</i>) now have better vectors for the specific task, but they’re further away from their (distributional) neighbours which are not found in the training set. This hurts the model’s generalization abilities: when it encounters an unobserved word (e.g. <i>wallaby</i>), its vector is no longer located next to similar words (kangaroo) which have been moved, so the model doesn’t know much about it. Updating the embeddings during training is a good idea only if your task has very different needs from its embeddings than plain similarity (and then you may want to start with a random initialization rather than pre-trained embeddings), and only if you have enough training data to cover a broad vocabulary.</div>
<div>
<br /></div>
<div>
For the next, related point, I’m going to broaden the definition of “overfitting” to the phenomenon of a model memorizing the peculiarities of the data rather than what’s actually important for the task, <i>regardless of its performance on the test set</i>.<br />
<br />
<h4 dir="ltr" id="artificial_data" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: "arial"; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">The Artificial Data Problem </span></h4>
Machine learning enthusiasts like to say that given enough training data, DL can learn to estimate any function. I personally take the more pessimistic stand that some language tasks are too complex and nuanced to solve using DL with any reasonable amount of training data. Nevertheless, like everyone else in the field, I’m constantly busy thinking of ways to get more data quickly and without spending too much money. </div>
<div>
<br /></div>
<div>
One representative example (of many) of a task in which the release of a large amount of training data raised researchers’ interest is <a href="https://en.wikipedia.org/wiki/Textual_entailment">recognizing textual entailment</a> (RTE, sometimes called “natural language inference” or NLI). This is an artificial task that was invented because many language understanding abilities we develop can be reduced to it. In this task, two sentences -- a premise and a hypothesis -- are given. The task is for a model to automatically determine what a human reading the premise could say with respect to the hypothesis. Let’s look at the definitions along with a simple example from the <a href="https://nlp.stanford.edu/projects/snli/">SNLI dataset</a>. For the premise “<i>A street performer is doing his act for kids</i>” and the hypothesis:</div>
<div>
<br />
<ul>
<li><b>Entailment</b>: the hypothesis must also be true. <br />Hypothesis = “<i>A person performing for children on the street</i>”. <br />The premise tells us there is a street performer, so we can infer that he is a person and he is on the street. The premise also tells us that he is doing his act - therefore performing, and for kids = for children. The hypothesis only repeats the information conveyed in the premise and information which can be inferred from it.<br /><br /></li>
<li><b>Neutral</b>: the hypothesis may or may not be true.<br />Hypothesis = â<i>A juggler entertaining a group of children on the street</i>â. <br />In addition to repeating information from the premise, the hypothesis now tells us that itâs a juggler. The premise is more general and can refer to other types of street performers (e.g. a guitar player), so this may or may not be a juggler.<br /><br /></li>
<li><b>Contradiction</b>: the hypothesis must be false.<br />Hypothesis = â<i>A magician performing for an audience in a nightclub</i>â.<br />The hypothesis describes a completely different event happening at a nightclub.<br /><br /></li>
</ul>
This task is very difficult for both humans and machines. It requires background knowledge, common sense, knowledge about the relationships between words (whether they mean the same thing, whether one is more specific than the other, or whether they contradict each other), recognizing that different mentions refer to the same entity (coreference), dealing with syntactic variations, etc. For an extensive list, I recommend reading this really good <a href="https://gluebenchmark.com/diagnostics">summary</a>. </div>
<div>
<br /></div>
<div>
In the early days, methods' performance was mediocre. The available data for this task contained a few hundred annotated examples. They required many <a href="https://arxiv.org/abs/1806.03561">different types of knowledge</a> to answer correctly, and were diverse in their topics. Unfortunately, there were too few of them to throw a neural network at... A few years ago, the huge <a href="https://nlp.stanford.edu/projects/snli/">SNLI dataset</a> was released, containing half a million examples. What enabled scaling up to such a large dataset was (1) taking the premises from an already available collection of image captions; and (2) asking people to generate entailed, neutral, and contradicting hypotheses. This made the data collection simple enough to rely on crowdsourcing workers rather than experts. </div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div>
Following the release of this dataset, the NLP community's interest in RTE surged. In particular, many neural methods have been developed. These basically encode each sentence (premise and hypothesis) into a vector, normally by running it through an RNN. The premise and hypothesis vectors are then combined using arithmetic operations and fed into a classifier that outputs the label (entailment, neutral, or contradiction). This approach (left side of figure 9) can get you as far as ~87% accuracy on the test set, and the differences between the specific methods are mostly technicalities. More sophisticated methods encode the hypothesis conditioned on the premise, using the neural attention mechanism (right side of figure 9). In simple words, the hypothesis encoder is allowed to "look at" the premise words, and it roughly aligns each hypothesis word to a related premise word. For example, given the premise "<i>A street performer is doing his act for kids</i>" and the hypothesis "<i>A juggler entertaining a group of children on the street</i>", the alignments would be <i>juggler-performer, act-entertaining, street-street</i>, etc. (In practice, it's not a one-to-one alignment, but a weighted one-to-many attention.) This approach gets you to over 90% accuracy today, which is beyond human performance on this dataset. Yes, you read correctly: a statistical DL method is better than humans on this dataset. </div>
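To make the sentence-encoding architecture (left side of figure 9) concrete, here is a minimal PyTorch sketch. This is not any specific published model: the dimensions, the LSTM encoder, and the combination features (concatenating the two vectors with their absolute difference and elementwise product, a common choice in this line of work) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SentenceEncodingRTE(nn.Module):
    """Sketch of the 'encode each sentence separately' architecture.
    All sizes are arbitrary choices, not tuned values."""
    def __init__(self, vocab_size=1000, embed_dim=50, hidden_dim=64, num_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # features: [p; h; |p - h|; p * h] -> 3 labels
        self.classifier = nn.Linear(4 * hidden_dim, num_labels)

    def encode(self, token_ids):
        _, (h_n, _) = self.encoder(self.embed(token_ids))
        return h_n[-1]  # final hidden state as the sentence vector

    def forward(self, premise_ids, hypothesis_ids):
        p, h = self.encode(premise_ids), self.encode(hypothesis_ids)
        features = torch.cat([p, h, (p - h).abs(), p * h], dim=-1)
        return self.classifier(features)  # logits: entail / neutral / contradict

model = SentenceEncodingRTE()
premise = torch.randint(0, 1000, (2, 8))     # batch of 2 "sentences", 8 tokens each
hypothesis = torch.randint(0, 1000, (2, 6))
logits = model(premise, hypothesis)
print(logits.shape)  # torch.Size([2, 3])
```

The attention variant (right side of figure 9) would replace `encode` for the hypothesis with an encoder that attends over the premise states, but the overall encode-combine-classify shape stays the same.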
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGoTGM1Hj5_WZSderBEwN_vrJlSqvA4O0aRo4mIP56goFA68VsAZcA0pNfhBQvEc1zZ7KCnZMyE29VIMLdaQkFmCD5CPXvlqCaJDzpZeyQG73I4DbxXbycpkcf7orcxfK_5aZiD7jw-d8/s1600/rte.png" imageanchor="1" style="margin-left: auto; margin-right: auto; text-align: center;"><img border="0" data-original-height="245" data-original-width="526" height="185" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGoTGM1Hj5_WZSderBEwN_vrJlSqvA4O0aRo4mIP56goFA68VsAZcA0pNfhBQvEc1zZ7KCnZMyE29VIMLdaQkFmCD5CPXvlqCaJDzpZeyQG73I4DbxXbycpkcf7orcxfK_5aZiD7jw-d8/s400/rte.png" width="400" /></a></td></tr>
<tr><td class="tr-caption"><div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<span style="font-size: 12.8px;">Figure 9: two common architectures of neural RTE systems. Left: sentence encoding models that encode each sentence separately. Right: attention model that encodes the hypothesis conditioned on the premise. Both models extract features from the premise and hypothesis vectors and use them for classification.</span></div>
<div style="text-align: left;">
<br /></div>
</td></tr>
</tbody></table>
<div>
Anyone who's ever worked on textual entailment and is even a tiny bit skeptical of DL as the solution to everything had to be suspicious. Indeed, <a href="https://twitter.com/emhusband/status/907202918616530944">many of us were</a>. Fast forward a few months, and a flood of papers confirmed that the task is indeed not solved. Instead of solving the general, very difficult textual entailment task, our models are memorizing very specific peculiarities of the training data (which are also common in the test data). <a href="http://aclweb.org/anthology/N18-2017">1</a>, <a href="http://aclweb.org/anthology/S18-2023">2</a> and <a href="http://www.lrec-conf.org/proceedings/lrec2018/pdf/786.pdf">3</a> concurrently showed that a model which has access <b>only to the hypothesis</b> can solve the task much better than a random guess (which is what we'd expect from a model that has no access to the premise). They all pointed out peculiarities in the data that enable this. For example, hypotheses of contradiction examples tend to contain more negative words. This happens because the premises are image captions ("<i>a dog is running in the park</i>"), and image captions rarely describe something that doesn't happen. The easiest thing for a person asked to write a contradicting sentence is to add negation: "<i>a dog is <b>not</b> running in the park</i>". </div>
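To see how little it takes to exploit such artifacts, here is a toy sketch of a hypothesis-only baseline: a bag-of-words classifier that never sees a premise. The sentences and labels below are invented to mimic the negation artifact; the actual papers trained on the real SNLI hypotheses.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented mini-dataset mimicking the artifact: contradiction hypotheses
# contain negation words, the others do not.
hypotheses = [
    "a dog is not running in the park",
    "nobody is performing on the street",
    "a man is not sleeping",
    "a person is performing for children on the street",
    "a juggler is entertaining a group of children",
    "a dog is running",
]
labels = ["contradiction", "contradiction", "contradiction",
          "entailment", "neutral", "entailment"]

# Note: no premise appears anywhere below.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(hypotheses)
clf = LogisticRegression().fit(X, labels)

# The classifier latches onto "not" / "nobody" as contradiction cues,
# so it confidently labels an unseen hypothesis without any premise.
new = vectorizer.transform(["a cat is not sleeping"])
print(clf.predict(new))
```

A model that scores well this way has learned the annotators' habits, not entailment.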
<div>
<br /></div>
<div>
Funnily, <a href="http://aclweb.org/anthology/N18-2017">1</a> also showed that the appearance of the word "cat" in the hypothesis can indicate contradiction, as there were many dog images, and what contradicts a dog better than a cat? In reality, cats are lovely creatures, and a sentence with a cat doesn't immediately contradict any other sentence.</div>
<div>
<br /></div>
<div>
<a href="http://aclweb.org/anthology/P18-2103">We</a> also showed that state-of-the-art models fail on examples that require knowledge about the relations between words, even when the example is super simple. For instance, the models would think that "<i>a man starts his day in India</i>" and "<i>a man starts his day in Malaysia</i>" are entailing, just because India and Malaysia are similar (although mutually exclusive) words. We showed that the models only learn to distinguish between such words if the specific words appear enough times in the training data. For example, in many contradiction examples in the training data, a <i>man</i> in the premise is doing something and a <i>woman</i> in the hypothesis is doing the same thing. Having observed enough of these examples, the models learn that man and woman are mutually exclusive. But they fail on the India/Malaysia example, because they didn't observe this exact pair of words in enough training examples. Since it's unreasonable to rely on the training set to provide enough examples of every possible pair of words, a different solution is needed, probably involving incorporating external knowledge from dictionaries and taxonomies. </div>
<div>
<br /></div>
<div>
The main lesson from this story should not be that DL methods are unsophisticated parrots that can only repeat exactly what they saw in the training data. Instead, there are several things to consider:</div>
<div>
<br /></div>
<div>
<ol>
<li>Good performance on the test set doesn't necessarily indicate solving the task. Whenever our training and test data are not "natural" but rather generated in a somewhat artificial way, we run the risk that they will both contain the same peculiarities, which are not properties of the actual task. A model learning these peculiarities is wasting energy on remembering things that are unhelpful in real-world usage, creating an illusion of solving an unsolved task. I'm not saying we should never generate or preprocess our data - just that we should be aware of this risk. If your model is getting really good performance on a really difficult task, there's reason to be suspicious. <br /><br /></li>
<li>DL methods can only be as good as the input they get. When we make inferences, we employ a lot of common sense and world knowledge. This knowledge is simply not available in the training data, and we can never expect the training data to be extensive enough to contain it. Domain knowledge is not redundant, and in the near future someone will come up with smart ways to incorporate it into a neural model, and achieve good performance on newer, less simple datasets.</li>
</ol>
<b style="font-weight: normal;"><br /></b>
<br />
<h4 dir="ltr" id="we" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: arial; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">The Shaky Ground of Distributional Word Embeddings</span></h4>
At the core of many deep learning methods lie pre-trained word embeddings. They are a great tool for capturing semantic similarity between words. They mostly capture topical similarity, or relatedness (e.g. <i>elevator-floor</i>), but they can also capture functional similarity (e.g. <i>elevator-escalator</i>). Relying on pre-trained word embeddings is a great way to make your model generalize to new inputs, and I recommend it for any task that doesn't have loads of available data.</div>
<div>
<br /></div>
<div>
With that said, it's not perfect. In many tasks we need to know the exact relationship between two words. It's not enough to know that <i>elevator</i> and <i>escalator</i> (or <i>India</i> and <i>Malaysia</i>) are similar - we need to know that they are mutually exclusive. And word embeddings don't tell us that. In fact, they conflate many semantic relations together.</div>
<div>
<br /></div>
<div>
I think I have a good way to simulate this, and I've been using it in my talks for the last few months. The idea is to take some text, say the lyrics of a song, a script of a TV series, a famous speech, anything you like. Then, go over the text and replace each noun with its most similar word in <a href="https://code.google.com/archive/p/word2vec/">word2vec</a>. (It doesn't have to be word2vec and doesn't have to be only for nouns - this is just what I did. The code and some other examples are available <a href="https://github.com/vered1986/PythonUtils/blob/master/Fun/Fun%20with%20word2vec.ipynb">here</a>.) Here is a part of my favorite example: Martin Luther King's "<a href="http://www.americanrhetoric.com/speeches/mlkihaveadream.htm">I Have a Dream</a>" speech. This is what you get when you replace words with their word2vec neighbors:</div>
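Here is a self-contained toy version of this experiment. A few hand-made vectors stand in for the real word2vec embeddings (the numbers are invented for illustration; the linked notebook uses the actual pretrained model):

```python
import numpy as np

# Toy stand-ins for word2vec vectors; the values are made up so that
# "daydream" lands near "dream" and "country" near "nation".
vectors = {
    "dream":    np.array([0.9, 0.1, 0.0]),
    "daydream": np.array([0.85, 0.15, 0.05]),
    "nation":   np.array([0.1, 0.9, 0.2]),
    "country":  np.array([0.15, 0.85, 0.25]),
    "table":    np.array([0.0, 0.1, 0.9]),
}

def most_similar(word):
    """Return the nearest neighbor of `word` by cosine similarity."""
    v = vectors[word]
    best, best_sim = None, -1.0
    for other, u in vectors.items():
        if other == word:
            continue
        sim = np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u))
        if sim > best_sim:
            best, best_sim = other, sim
    return best

sentence = "I have a dream that this nation will rise up"
replaced = " ".join(most_similar(w) if w in vectors else w
                    for w in sentence.split())
print(replaced)  # I have a daydream that this country will rise up
```

As in the real experiment, the replacement preserves *similarity* but scrambles the *relation*: a daydream is a kind of dream, not a synonym of it.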
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcxrcCMVQN2HBt3SqPFwu4JQuexuKyM4XMNHOaDGKrvjzkjwvj-SjInIRJKEgj4Z6mF073-cfhOuPkBzMx1cDxU9wvelmUwbTTY_ViBxeY6NHI7wgcY8vsaeXBLpmqRLkbtik7Xvj3SuY/s1600/mlk.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="775" data-original-width="806" height="307" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcxrcCMVQN2HBt3SqPFwu4JQuexuKyM4XMNHOaDGKrvjzkjwvj-SjInIRJKEgj4Z6mF073-cfhOuPkBzMx1cDxU9wvelmUwbTTY_ViBxeY6NHI7wgcY8vsaeXBLpmqRLkbtik7Xvj3SuY/s320/mlk.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption">Figure 10: a part of Martin Luther King's "I Have a Dream" speech after replacing nouns with the most similar word2vec words.</td></tr>
</tbody></table>
<div>
<br />
<br />
Apart from being funny, this is a good illustration of the phenomenon: words have been replaced by other words standing in various relations to them. For example, <i>country</i> instead of <i>nation</i> is not quite the same, but synonymous enough. <i>Kids</i> instead of <i>children</i> is perfectly fine. But a <i>daydream</i> is a type of <i>dream</i>, <i>week</i> and <i>day</i> are mutually exclusive (they share the common category of time units), <i>Classical.com</i> is completely unrelated to <i>content</i> (yes, statistical methods have errors...), and <i>protagonist</i> is synonymous with the original word <i>character</i> - but in the sense of a character in a book, not of an individual's qualities.</div>
<div>
<br /></div>
<div>
In the last three years, multiple methods have also emerged for learning a different type of word embedding that captures -- in addition to, or instead of, this fuzzy similarity -- semantic relations from taxonomies like WordNet. For example, the <a href="https://www.cs.cmu.edu/~hovy/papers/15HLT-retrofitting-word-vectors.pdf">Retrofitting</a> method starts with distributional (regular) word embeddings and then moves vectors in the space such that two words that appear together in WordNet as synonyms end up close to each other in the vector space. The <a href="https://arxiv.org/abs/1706.00374">Attract-Repel</a> method does the same but also makes sure that antonyms end up further apart (e.g. think again of the good/bad vectors in sentiment analysis). Other methods include <a href="https://arxiv.org/pdf/1511.06361.pdf">Order Embeddings</a>, <a href="https://arxiv.org/pdf/1705.08039.pdf">Poincaré Embeddings</a>, <a href="https://arxiv.org/pdf/1710.06371.pdf">LEAR</a>, and many more. While these methods are elegant, and have been shown to capture the semantic relations they get as input, they have yet to improve the performance of NLP applications over a version of the same system that uses regular embeddings. </div>
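To give a flavor of how retrofitting works, here is a minimal numpy sketch of its iterative update with a toy lexicon, using the simplest setting where the "stay close to the original vector" and "move toward the neighbors" weights are both 1. The vectors and the tiny synonym list are made up.

```python
import numpy as np

# Toy distributional vectors (invented numbers) and a toy "WordNet".
original = {
    "elevator":  np.array([1.0, 0.0]),
    "escalator": np.array([0.0, 1.0]),
    "lift":      np.array([0.9, 0.3]),
}
synonyms = {"elevator": ["lift"], "lift": ["elevator"], "escalator": []}

retrofitted = {w: v.copy() for w, v in original.items()}
for _ in range(10):  # a few iterations are enough to converge here
    for word, neighbors in synonyms.items():
        if not neighbors:
            continue
        neighbor_sum = sum(retrofitted[n] for n in neighbors)
        # With both weights set to 1, each update averages the original
        # vector with the current neighbor vectors.
        retrofitted[word] = (original[word] + neighbor_sum) / (1 + len(neighbors))

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synonyms end up closer than they were in the original space.
print(cos(original["elevator"], original["lift"]),
      cos(retrofitted["elevator"], retrofitted["lift"]))
```

Attract-Repel adds a repelling term of the same flavor for antonym pairs, pushing them apart instead of together.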
<div>
<br />
<br />
<h4 dir="ltr" id="cram" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: arial; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">The Unsatisfactory Representations Beyond the Word Level</span></h4>
Recurrent neural networks allow us to process texts of arbitrary length: from phrases consisting of several words to sentences, paragraphs, and even full documents. Does this also mean that we can have phrase, sentence, and document embeddings that capture the meaning of these texts?</div>
<div>
<br /></div>
<div>
Now is a good time to remember the famous quote from <a href="http://yoavartzi.com/sp14/slides/mooney.sp14.pdf">Ray Mooney</a>:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtxfLOZR7Sj_OT2Aq6g5GrTIqb_rdLpYBUBFNKKcZIwS-Y4SdXzo4YqaU1mICQrSsykcVN8YVB_rXrA04ln9KzIqQt40eEUztRmBewxi8d9lL-Y6Uz98_dKbsLsEvBdk9RUF80_ZoH9H4/s1600/cram.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="78" data-original-width="612" height="50" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgtxfLOZR7Sj_OT2Aq6g5GrTIqb_rdLpYBUBFNKKcZIwS-Y4SdXzo4YqaU1mICQrSsykcVN8YVB_rXrA04ln9KzIqQt40eEUztRmBewxi8d9lL-Y6Uz98_dKbsLsEvBdk9RUF80_ZoH9H4/s400/cram.png" width="400" /></a></div>
<br />
To be honest, I completely agree with this opinion when talking about general-purpose sentence embeddings. While we have good methods to represent sentences for <i>specific tasks and objectives</i>, it's not clear to me what a "generic" sentence embedding needs to capture, nor how to learn such a thing.<br />
<br />
Many researchers think differently, and sentence embeddings have been pretty common in the last few years. To name one, <a href="https://arxiv.org/abs/1506.06726">Skip-Thought vectors</a> build upon the assumption that a meaningful sentence representation can help predict the next sentence. Given that even I, as a human, can rarely predict the next sentence in a text I'm reading, I think this is a very naive assumption (could you predict that I'd say that?...). But such a model probably predicts the next sentence more accurately than it would a completely unrelated sentence, creating a kind of <b>topical</b>, rather than <b>meaning</b>, representation. In the example in figure 11, the model considered lexically similar sentences (i.e. sentences that share the same words or contain similar words) to be more similar to the target sentence than a sentence with a similar meaning but a very different phrasing. I'm not surprised.</div>
<div>
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSNMgHrpf_KKI4RHV0oGdCnmmKUQJqWo2sU2z4fmqQaZtLQUaJCUEWvR3GXwanL9LxKNyBf1EUdxgFGq1MAilqWrW85crJXrECXKscf5Ah3m1lQpuXfaSzq32m1Jml7EsHuKhs-a1KB1Y/s1600/sent2vec.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="717" data-original-width="848" height="337" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSNMgHrpf_KKI4RHV0oGdCnmmKUQJqWo2sU2z4fmqQaZtLQUaJCUEWvR3GXwanL9LxKNyBf1EUdxgFGq1MAilqWrW85crJXrECXKscf5Ah3m1lQpuXfaSzq32m1Jml7EsHuKhs-a1KB1Y/s400/sent2vec.png" width="400" /></a></td></tr>
<tr><td class="tr-caption">Figure 11: the most similar sentences to the target sentence "A man starts his day in India", out of the 3 other sentences that were encoded, using the <a href="http://sent2vec.quantumsense.ai/">sent2vec demo</a>. </td></tr>
</tbody></table>
<div>
<br /></div>
<div>
Another approach is the <a href="http://www.aclweb.org/anthology/D11-1014">Autoencoder</a>, which creates a vector representing the input text and is trained to reproduce the input from that vector. The core assumption is that in order to reproduce the original sentence, the representation must capture important aspects of the sentence's meaning. You can think of it as a type of compression.</div>
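The idea can be sketched with a tiny (non-recurrent, for brevity) autoencoder in PyTorch. A real sentence autoencoder would use an RNN encoder and decoder over word sequences, but the compress-then-reconstruct principle is the same; all sizes here are arbitrary.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Compress the input into a small vector, then reconstruct the input."""
    def __init__(self, input_dim=100, bottleneck_dim=16):
        super().__init__()
        self.encoder = nn.Linear(input_dim, bottleneck_dim)  # the "sentence vector"
        self.decoder = nn.Linear(bottleneck_dim, input_dim)

    def forward(self, x):
        z = torch.tanh(self.encoder(x))   # compressed representation
        return self.decoder(z), z

model = TinyAutoencoder()
x = torch.randn(4, 100)                   # stand-in for 4 encoded input "sentences"
reconstruction, z = model(x)
# The training signal: how well can the input be rebuilt from the bottleneck?
loss = nn.functional.mse_loss(reconstruction, x)
print(z.shape, loss.item())
```

After training, the bottleneck `z` is taken as the text's representation; whatever the decoder needs to rebuild the input has to survive the compression.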
<div>
<br /></div>
<div>
Finally, sentence embeddings come as a byproduct of tasks that represent sentences as vectors, like textual entailment models (for classification) or machine translation models (for text generation). Yes, these are trained to capture a specific aspect relating to their end task (entailment / translation), but assuming that these tasks require a deep understanding of the meaning of a sentence, the embeddings can be used as general representations. </div>
<div>
<br /></div>
<div>
So what aspects of the sentence do these representations capture? <a href="https://arxiv.org/pdf/1805.01070.pdf">This paper</a> did a pretty extensive analysis of various types of sentence embeddings. The authors defined some interesting properties which may be conveyed in a sentence, starting from shallow things like the sentence length (number of words) and whether a specific word is in the sentence or not; moving on to syntactic properties such as the order of words in the sentence; and finally semantic properties like whether the sentence is topically coherent (in other words, whether it is possible to distinguish between a "real" sentence and a sentence in which one word was replaced with a completely random word). To find out which of these properties are encoded in which sentence embedding approach, they used the sentence embeddings as inputs to very simple classifiers, each trained to recognize a single property (e.g. using the vector of "<i>this is a vector</i>" in the sentence length classifier to predict 4). The performance of the various sentence embeddings on all tasks was somewhere between that of a very simple baseline (e.g. using random vectors) and human performance. Not surprisingly, there is more room for improvement on the more complex, semantic tasks.</div>
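Here is a toy sketch of the probing methodology with one of the shallow properties: whether a specific word appears in the sentence. The "sentence embedding" below is just a sum of made-up word vectors, not any of the models from the paper; the point is the setup, not the numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
dim, vocab_size = 50, 200
word_vectors = rng.normal(size=(vocab_size, dim))  # made-up "word embeddings"

def embed(sentence):
    # a deliberately naive sentence embedding: the sum of the word vectors
    return word_vectors[sentence].sum(axis=0)

# Probing property: does the sentence contain a specific word (id 0)?
sentences, labels = [], []
for _ in range(1000):
    s = rng.choice(np.arange(1, vocab_size), size=7).tolist()
    if rng.random() < 0.5:
        s[0] = 0  # plant the probed word
    sentences.append(s)
    labels.append(int(0 in s))

X = np.stack([embed(s) for s in sentences])
X_train, X_test, y_train, y_test = train_test_split(X, labels, random_state=0)

# The probe is intentionally simple: if a linear classifier can recover
# the property from the embedding alone, the embedding encodes it.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(probe.score(X_test, y_test))  # well above the 50% chance level
```

The paper repeats exactly this recipe, property by property and embedding model by embedding model, and compares each probe's accuracy to a trivial baseline and to humans.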
<div>
<br /></div>
<div>
At this point I'd like to repeat Ray Mooney's quote and say that you still can't cram the meaning of a whole sentence into a single vector. It's impressive that we have gone far enough to have representations that capture all these properties, but is this all there is to a sentence's meaning? Here are some things that I don't know whether the embeddings capture, but mostly assume they don't:</div>
<div>
<br /></div>
<div>
<ol>
<li>Do they capture things which are not said explicitly, but can be implied? (<i>I didn't eat anything since the morning</i> implies <i>I'm hungry</i>).<br /><br /></li>
<li>Do they capture the meaning of the sentence in the context in which it is said? (<i>No, thanks</i> can mean <i>I don't want to eat</i> if you've just offered me food).<br /><br /></li>
<li>Do they always assume that the meaning of a sentence is literal and compositional (composed of the meanings of the individual words), or do they have good representations for idioms? (<i>I will clean the house when pigs fly</i> means <i>I will never clean the house</i>).<br /><br /></li>
<li>Do they capture pragmatics? (<i>Can you tell me about yourself</i> actually means <i>tell me about yourself</i>. You don't want your sentence vectors to be like the interviewee in <a href="https://www.reddit.com/r/Jokes/comments/5044l4/interviewer_whats_your_greatest_weakness/">this joke</a>). <br /><br /></li>
<li>Do they capture things which are not said explicitly because the speakers share a common background? (If I tell a local friend that <i>the prime minister must go home</i>, we both know the <a href="https://en.wikipedia.org/wiki/Benjamin_Netanyahu">specific</a> prime minister I'm talking about).<br /></li>
</ol>
<br />
I could go on and on, but I think these are enough examples to show that we have yet to come up with a meaning representation that mimics whatever representation we have in our heads, which is derived by making inferences and drawing on common sense and world knowledge. If you need more examples, I recommend taking a look at the slides from Emily Bender's "<a href="http://faculty.washington.edu/ebender/papers/Bender-ACL2018-tutorial.pdf">100 Things You Always Wanted to Know about Semantics & Pragmatics But Were Afraid to Ask</a>" tutorial. <br />
<br />
<h4 dir="ltr" id="robustness" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: arial; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">The Non-Existent Robustness</span></h4>
Anyone who's ever tried to reimplement neural models and reproduce published results knows we have a problem. More often than not, you won't get the exact same published results. Sometimes not even close. This often happens because of differences in the values of "hyper-parameters": technical settings that have to do with training, such as the <a href="http://www.fon.hum.uva.nl/praat/manual/epoch.html">number of epochs</a>, the <a href="https://www.quora.com/What-is-regularization-in-machine-learning">regularization</a> values and methods, and others. While they seem like mere details, in practice hyper-parameter values can make large performance differences.</div>
<div>
<br /></div>
<div>
The problem starts when training new models. You come up with an elegant model, you implement and train it, but the results are not as expected. This doesn't necessarily mean that the architecture or the data is bad; it often means that you need to tweak the hyper-parameter values and re-train the model, yielding completely different and hopefully better performance. Unfortunately, it's almost impossible to tell in advance which values will yield better performance. There is no best practice, just a lot of trial and error. We train many different models with various settings, test their performance on the validation set (a set of examples separate from the training and test sets), and choose the best-performing model (which is then evaluated on the test set). </div>
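The tuning loop described above can be sketched as a simple grid search. `train_and_evaluate` is a placeholder for your actual training code, and the hyper-parameter names and values are arbitrary examples:

```python
import itertools
import random

def train_and_evaluate(learning_rate, dropout, num_epochs):
    """Placeholder: in reality this trains a model with these settings
    and returns its accuracy on the validation set."""
    random.seed(hash((learning_rate, dropout, num_epochs)) % 10000)
    return random.uniform(0.6, 0.9)  # pretend validation accuracy

search_space = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "dropout": [0.2, 0.5],
    "num_epochs": [5, 20],
}

best_score, best_config = -1.0, None
for values in itertools.product(*search_space.values()):
    config = dict(zip(search_space.keys(), values))
    score = train_and_evaluate(**config)  # accuracy on the validation set
    if score > best_score:
        best_score, best_config = score, config

# Only this one winning model gets evaluated once on the test set.
print(best_config, best_score)
```

Even this tiny space already means 12 training runs; real searches are far larger, which is exactly where patience and compute budgets run out.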
<div>
<br /></div>
<div>
Hyper-parameter tuning is an exhausting and often frustrating process. I'm sure that many good models get lost along the way because the researcher lost patience or ran out of computational resources (yes, neural models also take longer to train... and strong machines cost a lot of money).</div>
<div>
<br /></div>
<div>
It's pretty discouraging to think that achieving good performance on some test set is sometimes due to arbitrary settings rather than to sound scientific ideas and model design.</div>
<div>
<br />
<h4 dir="ltr" id="interpretability" style="line-height: 1.38; margin-bottom: 4pt; margin-top: 14pt;">
<span style="background-color: transparent; color: #666666; font-family: arial; font-size: 12pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline;">The Lack of Interpretability</span></h4>
Last but not least, the interpretability issue is probably the worst caveat of DL. </div>
<div>
<br /></div>
<div>
Machines don't give explanations for their predictions. While this was also true for traditional ML, it used to be easier to analyze the predictions and come up with an explanation in retrospect. Algorithms that also learn the representations, and networks with far more parameters, make this much more difficult. To put it simply, we generally have no idea what's happening inside the networks we train; they are "black box" models. </div>
<div>
<br /></div>
<div>
Why is this a problem? First, being able to interpret our models would help us debug them and understand why they are not working and what exactly it is they are learning. It would make the development cycle much shorter and our models more robust and trustworthy. While it's nearly impossible today, there are people <a href="https://blackboxnlp.github.io/">working to change this</a>. </div>
<div>
<br /></div>
<div>
Second, and more importantly, in some tasks transparency and accountability are crucial - specifically, tasks concerned with safety, or which can discriminate against particular groups. Sometimes there is even a <a href="https://syncedreview.com/2018/01/31/will-new-eu-regulations-starve-data-hungry-deep-learning-models/">legal requirement to provide an explanation</a>. Think of the following examples (not necessarily NLP): </div>
<div>
<br /></div>
<div>
<ul>
<li>Self-driving cars </li>
<li>Predicting the probability that a prisoner will commit another crime (who should be released?) </li>
<li>Predicting the probability of death for patients (who is a better healthcare "financial investment"?) </li>
<li>Deciding who should be approved for a loan </li>
<li>more... </li>
</ul>
<br />
In the next post I will elaborate on some of these examples and how ML sometimes discriminates against particular groups. This post is long enough already!<br />
<div>
<span style="background-color: transparent; color: black; font-family: "arial"; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline;"><br /></span></div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com9tag:blogger.com,1999:blog-9145120678290195131.post-41127181424493588232018-04-13T13:23:00.000+03:002018-08-12T23:36:33.944+03:00Targeted Content<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
You must have heard of, or have suspected firsthand, the famous conspiracy theory that the Facebook app listens to your phone's microphone in order to better target ads that match your current interests. I've had the funniest experience with this myself: a friend in the cosmetics business told me about this conspiracy, and in the same conversation she mentioned that an advertising agent had called her to offer to advertise her business. Later that day, I got a Facebook ad saying "advertise your cosmetics business". What the heck? What are the odds of that? And I don't even have the Facebook app installed, just Facebook Messenger.<br />
<br />
Although <a href="https://www.theverge.com/2018/4/10/17221478/zuckerberg-facebook-senate-listening-tapping-microphone">Mark Zuckerberg denied this conspiracy theory in his senate hearing</a>, I doubt that people will stop believing it as long as the ads algorithm keeps surprising them. Choosing to believe Zuckerberg that they don't listen to our microphones (yet, <a href="https://research.fb.com/publications/towards-end-to-end-spoken-language-understanding/">I suspect</a>), I'm pretty confident that they, as well as other companies, are using our written content (emails, social media posts, search queries).</div>
<div dir="ltr" style="text-align: left;">
<br />
Most people are alarmed by these suspicions from the privacy aspect: <i>what data does this company hold about me? how do they use it? who do they share it with?</i> This post will <b>not</b> be about that. Instead, this post will be about the technical aspect, which is what interests me most as an NLP researcher. If we assume that our apps constantly listen to us and that our written content is monitored and analyzed, what does it say about the text understanding capabilities of these companies?<br />
<br />
Oh, and expect no answers. This post is all about questions and conspiracy theories!<br />
<br />
<b>What is personalized content?</b><br />
Personalized content doesn't have to come in the form of an ad. It can take the form of recommendations (products to buy based on previous purchases, songs to listen to, as in this <a href="https://veredshwartz.blogspot.co.il/2015/11/recommender-systems.html">post</a>). It can be relevant professional content from <a href="https://www.linkedin.com/">LinkedIn</a>, discounts on services you've previously consumed, cheap flights to your planned destinations, and so on. Some of this will be a direct result of the preferences and settings you defined in the website. For example, I've registered in several websites to get updates on concerts of my favorite bands, and I get healthy vegetarian recipes from <a href="https://www.yummly.com/">Yummly</a>. Some of this content will be based on inferences that the system makes, assuming that certain content is relevant for you. Here is one <a href="https://twitter.com/VeredShwartz/status/920196182898610176">example</a>:<br />
<!--<iframe allow="encrypted-media" allowtransparency="true" frameborder="0" height="231" scrolling="no" src="https://vered1986.github.io/misc/htmls/tweet.html" style="border: none; overflow: hidden;" width="500"></iframe>-->
<br />
<blockquote> <!-- class="twitter-tweet"> -->
<div dir="ltr" lang="en">
Lately I've been getting <a href="https://twitter.com/Quora?ref_src=twsrc%5Etfw">@quora</a> digest emails on topics related to conversations I had with people (in Hebrew!). 1/5</div>
â Vered Shwartz (@VeredShwartz) <a href="https://twitter.com/VeredShwartz/status/920196182898610176?ref_src=twsrc%5Etfw">October 17, 2017</a></blockquote>
<script async charset="utf-8" src="https://platform.twitter.com/widgets.js"></script>
In that case I was amazed by the accuracy of the <a href="https://www.quora.com/">Quora</a> digest emails I was getting. Specifically, I had a conversation with my husband about the confidence it takes to admit you don't know something, and he mentioned he likes to say something more helpful than "I don't know" to someone who needs help. The next day, I got a personally-tailored Quora digest email that contained an answer to the question "Could you say something nice instead of 'I don't know'?". It wasn't under any of the topics that I follow (computer science related topics and <a href="https://www.quora.com/topic/Parakeets">parakeets</a>).<br />
<br />
In what follows I will exemplify most of my points using ads.<br />
<br />
<b>What we think these algorithms do</b><br />
OK, so I have to put my knowledge about the limitations of this technology, and my skepticism, aside for a second and think like the average person. In that case, I would assume that:<br />
<ul style="text-align: left;">
<li>If the ad is about a topic that I discussed in a spoken conversation, then there must be a recorder, and a speech-to-text component that converts the speech into written text.</li>
<li>It works in whichever language I spoke or wrote at the time. If this happened in more than one language, the company presumably has different algorithms (or at least differently-trained models of the same algorithm) for each language.</li>
<li>Written content and transcribed speech are processed to match with the available content/ads.</li>
<li>In some cases, it seems that even simple keyword matching leads to nice results. E.g., if you mentioned a vacation in Thailand, you will be matched with ads containing the words <i>vacation</i> and <i>Thailand</i> (I will let you know if I get any such ads after writing this post...). This takes no text understanding capabilities; it only requires recognizing that a bunch of words said in the same sentence (or within a short period of time) also appear in some ad, perhaps with the help of <a href="https://en.wikipedia.org/wiki/Information_retrieval">information retrieval</a> (IR) algorithms to recognize the most important words.</li>
<li>In other cases, it seems that a deeper understanding of the meaning of my queries and conversations is required in order to match them to the relevant content. A good example is the Quora digest example from above. Based on IR algorithms, searching for common words like <i>I, don't, know, helpful, nice, say, something</i> will not get you as far as searching for rarer content words like <i>vacation</i> and <i>Thailand</i>. So it must be that the algorithm has built some meaning representation of our conversation, and compared it with that of the Quora answer, which was phrased in slightly different words. On top of everything, our conversation was in Hebrew, so it must have a universal, multi-lingual meaning representation mechanism. </li>
</ul>
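If the keyword-matching hypothesis above is right, its core could be as simple as the following toy sketch. To be clear, everything in it -- the ads, their keywords, and the stop word list -- is made up for illustration; no ad platform publishes its actual matching code.

```python
# Toy keyword matching between a (transcribed) utterance and candidate ads.
# The ads, keywords, and stop word list are all made up for illustration.
STOP_WORDS = {"i", "a", "an", "the", "to", "in", "we", "should",
              "next", "month", "need", "new"}

ADS = {
    "thai_beach_resort": {"vacation", "thailand", "beach", "resort"},
    "office_chairs": {"office", "chair", "ergonomic", "desk"},
}

def content_words(utterance):
    """Keep only the words likely to carry topical content."""
    return {w.strip(".,!?").lower() for w in utterance.split()} - STOP_WORDS

def best_ad(utterance):
    """Return the ad whose keyword set overlaps most with the utterance."""
    words = content_words(utterance)
    return max(ADS, key=lambda ad: len(ADS[ad] & words))

print(best_ad("We should book a vacation in Thailand next month"))
# -> thai_beach_resort ("vacation" and "thailand" both match)
```

A real system would of course weight words by importance (IR-style, e.g. tf-idf) rather than counting raw overlap, but the basic mechanism needs no language understanding at all.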
<br />
<b>Alternative explanations</b><br />
Skepticism returns; I can believe that my speech is recorded and transcribed fairly accurately to text when I speak English. It's a bit harder to believe when it happens in other languages (e.g. Hebrew in my case), but I can still find it somewhat reasonable; automatic speech recognition (ASR), <a href="https://awni.github.io/speech-recognition/">although it isn't perfect</a>, still works reasonably well. It's the text understanding component I'm much, much more skeptical about. Despite the constant progress, and although popular media makes it seem like AI is solved and computers completely understand human language, I know that is definitely not the case yet. So what other explanations can there be for the targeted content we see?<br />
<br />
<b>By Chance.</b> None of this actually happens and we're just imagining. Well, OK, not <i>none</i> of this, but in some cases, it's really just chance.<br />
<br />
One of the reasons that we're not easily convinced by this "by chance" argument is that we generally tend to pay attention only to the true-positive cases ("hits") in which we talked about something and immediately got an ad about it. It's much harder to notice the "misses": an ad that seems off (false positive) or all the things that we discussed and got no ads about (false negative).<br />
<br />
At the end of the day, we're all just ordinary people who share many common interests. Advertisers may reach us because they try to reach a large audience and we happen to fall under the very broad categories they target (e.g. age group). It could be that, purely by chance, we see ads for exactly the product or service we currently need.</div>
<div dir="ltr" style="text-align: left;">
<br />
<b>Other Means. </b>Technically speaking, rather than understanding text, it's much easier to consider other parameters such as your location, your declared interests (i.e. pages you've liked on Facebook, search results you clicked on in Google), your age, gender, marital status, and more. If you didn't provide one or more of these details, no worries! Your friends have, and <a href="https://research.fb.com/publications/find-me-if-you-can-improving-geographical-prediction-with-social-and-spatial-proximity/">it's likely you share some of these details with them</a>!<br />
<br /></div>
<div dir="ltr" style="text-align: left;">
Here is one good example:<br />
<iframe allow="encrypted-media" allowtransparency="true" frameborder="0" height="231" scrolling="no" src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2Fvered.shwartz%2Fposts%2F1582371045206680&width=500" style="border: none; overflow: hidden;" width="500"></iframe><br />
I keep getting baby and pregnancy ads on Facebook. I'm a married woman in my 30s; both facts are available in my Facebook profile, and that alone is enough to assume this topic is relevant for me (personally, it is not, but women like me are a small enough minority that the error rate doesn't matter, and I totally accept that). Add to this that many of my Facebook friends are other people my age who are members of parenting groups, have liked pages of baby-related stuff, etc. I can't make this stop, but I guess it will stop naturally when I'm in my late forties.<br />
<br />
<br />
I'd like to finish with an anecdote about how unsophisticated targeted content can sometimes be, to the point where you rub your eyes in disbelief and ask "how stupid can these algorithms be?". A few days ago I wrote to someone in an email, "I'll be in Seattle on May 30". Minutes later, I got an email from Booking.com with the title "Vered, Seattle has some last-minute deals!". That would have been smart, had I not already used Booking.com to book a hotel room in Seattle for <i>exactly</i> those dates.<br />
<br />
I may be way off, and it may be that these companies have killer AI abilities which are kept very well secret. In that case, some of my readers who work for these companies must be giggling now. To paraphrase <a href="https://www.goodreads.com/quotes/665107-just-because-you-re-paranoid-doesn-t-mean-they-aren-t-after-you">Joseph Heller</a> (or whoever said it first), just because you're paranoid doesn't mean they're not after you, but hey, there's no way their technology is good enough to do what you think it does, so some of it is just pure chance. Not as catchy as the original quote, I know.</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com0tag:blogger.com,1999:blog-9145120678290195131.post-80107250282174458882018-01-02T16:03:00.000+02:002018-01-02T16:03:22.901+02:00Fun with lyrics<div dir="rtl" style="text-align: right;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
</div>
<div dir="ltr" style="text-align: left;">
</div>
<div dir="ltr" style="text-align: left;">
This post stems from a (very boring) casual thought I had about a year ago: "Hmm... I wonder whether there is more rain in British songs?", which later generalized into "Is there any correlation between song lyrics and the weather in the artists' countries of origin?". I spent an entire weekend writing code to scrape lyrics from the web, and then life got in the way and I never finished this (uninteresting) project.<br />
<br />
Since I already have a very large corpus of lyrics,<sup><a href="https://veredshwartz.blogspot.co.il/2018/01/fun-with-lyrics.html#1">1</a></sup> I figured, why not combine two of my loves -- text analysis and music -- into one blog post? So in this post I will show you some fun analyses that people commonly do with lyrics.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<b>Word Clouds</b></div>
<div dir="ltr" style="text-align: left;">
Word clouds provide a nice illustration of word frequencies. Given a text, the word cloud contains the <i>k</i> most common words in the text, where more frequent words appear larger and closer to the center of the cloud. In this case, I chose an artist and created a word cloud from the lyrics of all of that artist's songs. I lowercased all the words and removed punctuation, stop words (very common function words like "and" and "the"), and the word "chorus". I used <a href="https://worditout.com/word-cloud">worditout</a> to draw the word clouds. Here are a few examples (click on the links to enlarge):<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJEJvebiydozg5C0TP8wp1zbzPV4v6NEw8KtSh1nF6jWpNkdWSfEemhHFfdmi0mDCjAP3aYO_TtXTA6HfJegm4w2xIuqaVxvWsOGT3XFwjob3hKKnXF4yQofhhnYiuGmr_rv_81t5eBnY/s1600/three.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="478" data-original-width="1600" height="117" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJEJvebiydozg5C0TP8wp1zbzPV4v6NEw8KtSh1nF6jWpNkdWSfEemhHFfdmi0mDCjAP3aYO_TtXTA6HfJegm4w2xIuqaVxvWsOGT3XFwjob3hKKnXF4yQofhhnYiuGmr_rv_81t5eBnY/s400/three.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 12.8px;">Left to right: word clouds for the lyrics of <a href="https://worditout.com/word-cloud/2688029/private/7db63e5c8800dfc4319c85a33e4afffe" style="font-size: 12.8px;">Red Hot Chili Peppers</a><span style="font-size: 12.8px;">, </span><a href="https://worditout.com/word-cloud/2687682/private/10e0d8909ce911950d0961f8ee8f5d93" style="font-size: 12.8px;">Morrissey</a><span style="font-size: 12.8px;">, and </span><a href="https://worditout.com/word-cloud/2687679/private/0ca9ff8de25fc051785ca00b1b2293f3" style="font-size: 12.8px;">Eminem</a><span style="font-size: 12.8px;">.</span></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
A few interesting, though expected, observations: Red Hot Chili Peppers often sing about love, and Morrissey mostly moans. When he doesn't moan ("<i>Oh</i>"), he sings about serious topics such as <i>war</i>, the <i>world </i>and <i>life</i>. Eminem curses a lot. Funnily, since I kept the words in their inflected forms, we get multiple variations of the F word in his word cloud. </div>
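For the technically curious, the counting behind these clouds boils down to a few lines; here is a minimal sketch of the preprocessing described above (the stop word list is a tiny stand-in for a real one, and the lyrics line is made up):

```python
import re
from collections import Counter

# A tiny stand-in stop word list; a real one (e.g. NLTK's) is much longer.
STOP_WORDS = {"the", "and", "a", "to", "i", "you", "me", "it", "of", "chorus"}

def top_words(lyrics, k=3):
    """Lowercase, strip punctuation, drop stop words, count what's left."""
    words = re.findall(r"[a-z']+", lyrics.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(k)

print(top_words("Love, love me do. You know I love you! [Chorus] Love..."))
# -> [('love', 4), ('do', 1), ('know', 1)]
```

The word cloud tool then just maps these counts to font sizes.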
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<b>Topics</b><br />
Now that we see which words are common in each artist's lyrics, we can take it a step further and try to visualize the <i>topics </i>that they sing about. There are many ways to do that; we'll do it simply by visualizing their word embeddings using <a href="https://lvdmaaten.github.io/tsne/">t-SNE</a>, a technique for projecting high-dimensional vectors into 2-dimensional space. The underlying assumption of word embeddings is that words with similar meanings, or those that belong to the same topics, have similar vectors. This should also be reflected in their 2-dimensional visualization.<br />
<br />
To give the lyrics some context and demonstrate how they relate to all the possible topics in the world, I took the words from the lyrics and visualized their vectors along with the 2,500 most common words in English, highlighting words from the lyrics in red. Here is the result for Morrissey:<br />
<br />
<iframe height="480" src="https://drive.google.com/file/d/1NPHgTzbZ8pIK6wOTjIqF6BxtmWs_v2jp/preview" width="100%"></iframe>
<br />
<br />
You'd have to scroll through the graph and look for clusters of red dots, then try to figure out what their common theme is. For example, I've found adjectives describing negative feelings (<i>unhappy, sad, tired, weary</i>, ...), words related to love (<i>hearts, lonely, love, hug, kiss</i>), body parts (<i>body, arms, hands, head</i>), and people (<i>young, children, nephew, girl, boy, woman</i>, ...).<br />
<br />
And here is the result for Muse:<br />
<br />
<iframe height="480" src="https://drive.google.com/file/d/1Cwp-1J7Zczt2pBLISIhLRrSRMUp3qNQ6/preview" width="100%"></iframe>
<br />
<br />
Here I see positive emotions (<i>love, dream, fate</i>), negative emotions (<i>sorrow, shame, greed, apathy, bitterness</i>), evil stuff (<i>daemons, evil, exorcise, sins</i>) and war-related words (<i>war, struggle, fighting, revolt</i>).<br />
<br />
[Some technical details for my technical readers: I took the first 2500 words from <a href="https://github.com/first20hours/google-10000-english/blob/master/google-10000-english-usa.txt">this</a> list of 10k most common words in English. For the lyrics, I considered the 500 most common words which are adjectives, nouns or verbs. I drew the t-SNE graph using this <a href="https://github.com/vered1986/PythonUtils/blob/master/Fun/lyrics/tsne.py">script</a>, and used the pre-trained 50d GloVe word embeddings].</div>
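The vector-loading step of that pipeline can be sketched in a few lines; the inline "file" and word lists below are toy stand-ins for the real GloVe file and the lyrics vocabulary:

```python
import io

def load_glove(lines, vocabulary):
    """Parse GloVe's text format ("word v1 v2 ... vd" per line),
    keeping only the words we actually want to plot."""
    vectors = {}
    for line in lines:
        word, *values = line.split()
        if word in vocabulary:
            vectors[word] = [float(v) for v in values]
    return vectors

# Toy stand-in for glove.6B.50d.txt (real vectors have 50 dimensions).
toy_glove = io.StringIO("love 0.1 0.2\nwar 0.9 -0.3\nhug 0.15 0.25\n")
lyrics_words = {"love", "hug", "california"}
vectors = load_glove(toy_glove, lyrics_words)
print(sorted(vectors))  # -> ['hug', 'love'] ("california" has no vector here)
```

The resulting vectors are what gets handed to t-SNE (e.g. sklearn.manifold.TSNE) to produce the 2D coordinates plotted in the graphs above.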
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<b>Generating New Songs</b></div>
<div dir="ltr" style="text-align: left;">
As the word clouds may suggest, each artist has a specific style which is reflected in the word choice and topic of their songs. We can train a model that captures this specific style, mimics this artist and generates new songs that would look like they've been written by this artist.<br />
<br />
Unfortunately, for better-quality results, you need a large amount of training data, so forget about generating new songs of artists who tragically died after releasing only a few records (e.g. <a href="https://en.wikipedia.org/wiki/Elliott_Smith">1</a>, <a href="http://www.nirvana.com/">2</a>, <a href="https://en.wikipedia.org/wiki/Amy_Winehouse">3</a>) or of your favorite indie bands that have relatively few songs (e.g. <a href="http://www.marmozets.co.uk/">1</a>, <a href="http://irontom.com/">2</a>, <a href="http://blaenavon.com/">3</a>, <a href="http://www.nbthieves.com/">4</a>, <a href="http://youareallslaves.com/">5</a>, <a href="http://royalbloodband.com/">6</a>, <a href="https://en.wikipedia.org/wiki/Darlia_(band)">7</a>, <a href="https://en.wikipedia.org/wiki/Peace_(band)">8</a>). We'll stick with the more mainstream bands and try to generate new songs by Muse, Weezer, and Red Hot Chili Peppers.<br />
<br />
For that purpose, we are going to learn an artist-specific language model. I've written an elaborate post about <a href="https://veredshwartz.blogspot.co.il/2015/09/language-models.html">language models</a> in the context of machine translation; in short, language models estimate the probability of a certain text in the language (e.g. English, or a more specific domain, like Twitter data or Muse lyrics). Each word in the text depends on the previous words, so in an English language model, for instance, the probability of "she doesn't" is larger than that of "she don't" (although, this may not be the case for English rap songs language models!). Language models can be used to compute the probability of an existing text, but they can also be used to generate new texts by sampling words from the distribution. We're going to use them for generation.<br />
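To make the idea concrete before moving to neural models, here is a toy count-based bigram language model: it estimates, from a made-up three-line "corpus", how often each word follows another, and then samples new lines from those counts (a real model would be trained on thousands of songs and would need smoothing):

```python
import random
from collections import defaultdict, Counter

def train_bigram_lm(corpus):
    """For each word, count which words follow it and how often."""
    counts = defaultdict(Counter)
    for line in corpus:
        tokens = ["<s>"] + line.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def sample_line(counts, rng):
    """Generate a new line by repeatedly sampling the next word
    in proportion to how often it followed the previous one."""
    word, line = "<s>", []
    while True:
        nxt = counts[word]
        word = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        if word == "</s>":
            return " ".join(line)
        line.append(word)

corpus = ["i want you now", "i want to break free", "you want me now"]
lm = train_bigram_lm(corpus)
print(sample_line(lm, random.Random(0)))
```

Because "want" follows "i" in two of the three lines, generated lines will tend to start with "i want" -- that's the whole trick, just with counts instead of a learned network.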
<br />
As opposed to the language models in my blog post, we will train a <i>neural </i>language model. These are explained very clearly in Andrej Karpathy's blog post "<a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">The Unreasonable Effectiveness of Recurrent Neural Networks</a>". In short, a recurrent neural network (RNN) is a model that receives as input a sequence (e.g. of words / characters) and outputs vectors representing each subsequence (i.e. the first item, the first two items, ..., the entire sequence). These vectors can then be used by other machine learning models, e.g. for classification.<br />
<br />
In the context of language models, the RNN learns to model the probability distribution of the next item in the sequence (e.g. the next word in the song). During training, the model goes over the entire text corpus (e.g. all the lyrics of a specific artist) and tries to predict the next item (word). If the predicted next item is incorrect, i.e. different from the actual next item, the model adjusts itself, until it is accurate enough. At test time, once the model parameters are settled, you can use it to generate new texts by sampling from the distribution of possible items (words) and constantly sampling new words conditioned on the already-sampled ones. The result should look similar to the original text corpus it was trained on. Very often, generated sequences will be actual texts from the corpus (and then you've just trained a parrot... Thanks <a href="https://artistdetective.wordpress.com/">Don Patrick</a> for the great metaphor, I'm constantly quoting you on this!).<br />
<br />
[Some technical details for my technical readers: I trained a word-level LSTM using <a href="https://dynet.readthedocs.io/en/latest/">DyNet</a>, largely based on the <a href="https://github.com/clab/dynet/blob/5049e5995f169fe1798139e1ca4dc98a7c0c4317/examples/rnnlm/rnnlm.py">char-level RNN example</a>. My code is available <a href="https://github.com/vered1986/PythonUtils/blob/master/Fun/lyrics/lyrics_lm.py">here</a>.]<br />
<br />
So, let's take a look at the results! After training each model, I sampled a single song. I sampled each sentence separately, so subsequent sentences are not expected to be related to each other. I enforced the song structure by forcing a line break after every 5 lines. Here is the new "Weezer" song:<br />
<br />
<table border="1"><tbody>
<tr><td><span style="font-family: "courier new" , "courier" , monospace;">let me see the joy</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">holding on to what they give,</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">turn it, turn it,</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i'd bury diamonds</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">woo-hoo-hoah</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">you're just smile</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">excuse my manners if i make a scene</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">we're just visiting</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i'm still afloat</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">and i'm lost without your love</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">why are all american girls so rough?</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">i'm a robot</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">and kick you out of sight</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">and if you're up all night</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i cried for you, you were the blast</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">tonight...</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">i'm just meant to be your latest tragedy</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">why are all american girls so rough?</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">you are.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">how man is this in the world</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i feel safe</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">oo-ee-oo i look just like buddy holly</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">wish that they would teach me when our critics</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i don't want your love</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">chicks are trying to freak</span></td></tr>
</tbody></table>
<br />
Some sentences are clearly copied from existing songs (<i>"oo-ee-oo I look just like Buddy Holly"</i>) but others are brand new. Overall it feels like a Weezer song to me!<br />
<br />
Moving on to the new Muse song:<br />
<br />
<table border="1">
<tbody>
<tr>
<td><span style="font-family: "courier new" , "courier" , monospace;">than you could ever give</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">and i want you now</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i wish i could</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">and make a fuss</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">like the evil in your veins</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">you are</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">(your time is now)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">our hopes and expectations</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">we don't belong here</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i won't let you bury it</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i wish i could</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">they will pull us down</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">in your world</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">now i just was to name</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">with who knows who</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i'm growing tired of fighting</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">in my sleep</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">policies</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">loneliness be over</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">vast human and material resources</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">you're unsustainable</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">is it enough</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">killed by drones</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">and our time is running out</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">you and i must fight to survive</span></td>
</tr>
</tbody></table>
<br />
This one is a bit disappointing, because the only reason it feels like a real Muse song is that it's a "summary of Muse songs" created by copying whole sentences from their songs. My intuition is that the amount of training data was too small, leading to "overfitting" (the training data is regenerated perfectly). This calls for an action by Muse to release more albums!<br />
<br />
And the highlight is this new Red Hot Chili Peppers song:<br />
<br />
<table border="1"><tbody>
<tr><td><span style="font-family: "courier new" , "courier" , monospace;">when i find my peace of mind,</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">that i could find the fireflies</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">[m1]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">someone</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">to close a right today</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">that i slept</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">you say the is least my love</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">start jumping and that sherri meet?</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">funky crime funky crime</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">just a mirror for the sun</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i wrote a letter to you</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">[chorus:]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i've been here before</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">stuck in the muck of the pond</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">to be afraid</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">play your hand and glory</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">well, i'm gonna ride a sabertooth horse</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">let's play</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">mother angel in your hand</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">take a star in a telegram</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">upon the places beyond</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">today loves smile for me</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">part of my scenery</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i'll play all night</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">i am not wide</span></td></tr>
</tbody></table>
<br />
Wow... this looks nothing like <a href="https://www.youtube.com/watch?v=8-rccN7ZsRE">every Red Hot Chili Peppers song ever</a>. It doesn't even contain the word California! Maybe I should have trained the model for a few more iterations. It is pretty cool, though, that most sentences are new, and they make at least as much sense as actual RHCP lyrics do.<br />
<br />
<b>Statistics</b><br />
Now that we've got the data, we can finally answer the sleep-depriving question: "is there a correlation between the occurrence of rain-related words in lyrics and the artist's country of origin?". For the lyrics that I scraped from the web, I also kept the artists' countries. For these countries, I looked up the annual precipitation statistics. I then looked for the occurrence of any of the following words in the lyrics: <i>rain, raining, rained, rains, storm, stormy, cloud, cloudy, drizzle, flood</i>. I computed the percentage of "rain" songs per country (out of all of the songs by artists from that country). The hypothesis was that artists from countries with high average annual precipitation are more likely to sing about it.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn5VJWZCYWGe-N1NYwN6WYW0q6ElHYqm7r4ddeP8svCxf_pAeEBQF5RKWhAaTOyaHQimtQCYeaX1oH2XNiktxsa0fhCIedmD3Pzhvw8H-f-LcKkAflPhhiRhLc6rmpD-Kf6No4-cuCZE8/s1600/rain.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="577" data-original-width="1445" height="158" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn5VJWZCYWGe-N1NYwN6WYW0q6ElHYqm7r4ddeP8svCxf_pAeEBQF5RKWhAaTOyaHQimtQCYeaX1oH2XNiktxsa0fhCIedmD3Pzhvw8H-f-LcKkAflPhhiRhLc6rmpD-Kf6No4-cuCZE8/s400/rain.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption"><span style="font-size: x-small;">The percentage of songs mentioning rain in each country compared with average annual precipitation.</span></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
I was wrong. There was no correlation. It is also possible that this was a failed experiment because the number of songs for some countries was too small to draw any meaningful statistical conclusions.<br />
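For completeness, the computation itself is short; here is a sketch with made-up songs and precipitation figures standing in for the real scraped data:

```python
RAIN_WORDS = {"rain", "raining", "rained", "rains", "storm", "stormy",
              "cloud", "cloudy", "drizzle", "flood"}

def rain_song_fraction(songs):
    """Fraction of songs mentioning at least one rain-related word."""
    hits = sum(any(w in RAIN_WORDS for w in lyrics.lower().split())
               for lyrics in songs)
    return hits / len(songs)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Made-up songs and annual precipitation (mm), just to exercise the code.
by_country = {"UK": ["why does it always rain on me", "here comes the sun"],
              "US": ["california sun", "summer in the city"]}
precipitation = {"UK": 1220.0, "US": 715.0}
countries = sorted(by_country)
fractions = [rain_song_fraction(by_country[c]) for c in countries]
print(pearson(fractions, [precipitation[c] for c in countries]))
```

(With only two made-up countries the correlation is trivially 1; the real computation ran over many countries and, as said, found no correlation.)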
<br />
Can we answer more interesting questions regarding lyrics? For example, this <a href="https://www.youtube.com/watch?v=8-rccN7ZsRE">"every Red Hot Chili Peppers song ever"</a> video claims that all they ever sing about is California, but this wasn't reflected in the word cloud, nor in the generated song, meaning that this specific word was not very frequent in the corpus. However, if we only check which US states were mentioned in the songs, would California be more frequent?<sup><a href="https://veredshwartz.blogspot.co.il/2018/01/fun-with-lyrics.html#2">2</a></sup> And to make this question more general: do artists tend to sing more about their places of origin, and do some places get more attention regardless of where the artists are originally from?<br />
<br />
This time I focused on American artists, and took the lyrics of the first 200 artists from each state, checking for mentions of any state. I created a 51x51 table in which the columns represent the states mentioned and the rows represent the artists' states of origin. Rather than displaying this messy table, I plotted a <a href="https://en.wikipedia.org/wiki/Heat_map">heatmap</a> where lighter colors represent higher values (and 0 values are colored black).<sup><a href="https://veredshwartz.blogspot.co.il/2018/01/fun-with-lyrics.html#3" name="top3">3</a></sup><br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglDImuFPjNomatr60mpN6ElsblO8GBKY55bYmQKf_MEkBvcqiDgqoY0L5imoxJHG5YFAVKV6nQTWNZl1qVAvWP07uy_y9vFrARLJwRqlB7losaL6TF_H2zcfhlre_ueN_6vGReArc57Hw/s1600/state_conf_mat.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1116" data-original-width="1600" height="278" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglDImuFPjNomatr60mpN6ElsblO8GBKY55bYmQKf_MEkBvcqiDgqoY0L5imoxJHG5YFAVKV6nQTWNZl1qVAvWP07uy_y9vFrARLJwRqlB7losaL6TF_H2zcfhlre_ueN_6vGReArc57Hw/s400/state_conf_mat.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Mention of states in lyrics by artists' state of origin. Columns: states mentioned in lyrics. Rows: states of origin.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
Here's how to interpret this heatmap: light values on the diagonal are pretty common, meaning that it's common for artists to sing about their states of origin. Two columns have light values across many rows: California and New York. Those are states which are common in lyrics, regardless of the artist origin.<br />
<br />
Notice that the states are sorted alphabetically, so it's difficult to answer the question of whether artists tend to sing about states in their proximity. A better visualization would place these statistics on a map, and I used the <a href="https://developers.google.com/chart/interactive/docs/gallery/map">Google Maps API</a> to do just that! Click on a state from the list and you'll see the states that sing about it visualized on a map.<br />
<br />
<iframe src="https://vered1986.github.io/misc/htmls/state_list.html" style="height: 600px; overflow: hidden; width: 100%;">
</iframe>
</div>
<ul dir="ltr" style="text-align: left;">
</ul>
<div style="text-align: left;">
<div dir="ltr" style="text-align: left;">
I think I can see a pattern of states singing about their neighbors (this kind of visualization was helpful for someone like me who doesn't know much about US geography...).<br />
<br />
<div dir="ltr" style="text-align: left;">
<b>Sentiment Analysis</b></div>
<div dir="ltr" style="text-align: left;">
Many words in Morrissey's word cloud are notably negative: <i>kill</i>, <i>hate</i>, <i>die</i>, <i>leave</i>, <i>gone</i>, etc. This is no surprise, as anyone who listens to Morrissey or the Smiths knows that most of their songs are gloomy; according to <a href="http://www.manchestereveningnews.co.uk/whats-on/music-nightlife-news/stats-prove-it-smiths-among-8028088">this study</a>, they are among the gloomiest of UK artists.<br />
<br />
This negativity can be "proved" computationally, using software for sentiment analysis. Sentiment analysis takes a text and determines its sentiment: either a binary negative/positive decision, or a score on a scale. Traditional models looked at the words in the text independently and scored the sentence according to the individual words' sentiment, recognizing "good" and "bad" words. For example, <i>"I am happy today"</i> would be considered positive thanks to the positivity of the word <i>happy</i> (and the neutrality of the other words). Today's models are mostly based on neural networks, and sometimes they also take the structure of the sentence into account (which should be helpful in recognizing that <i>"I am not happy today"</i> is negative). The <a href="http://nlp.stanford.edu:8080/sentiment/rntnDemo.html">Stanford Sentiment Analysis</a> system is an example of such a model.<br />
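To make the word-level approach concrete, here is a toy scorer. The lexicon is invented for illustration (no real system uses one this small), and it also shows exactly why ignoring sentence structure fails on negation:

```python
# A tiny hand-made sentiment lexicon; real systems score thousands of words.
LEXICON = {"happy": 1.0, "good": 0.8, "love": 1.0,
           "sad": -1.0, "hate": -1.0, "gloomy": -0.8}

def word_level_sentiment(sentence):
    """Score a sentence as the average sentiment of its individual words."""
    words = sentence.lower().split()
    return sum(LEXICON.get(w, 0.0) for w in words) / len(words)

print(word_level_sentiment("I am happy today"))      # positive, as expected
print(word_level_sentiment("I am not happy today"))  # still positive: negation is missed
```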
<br />
I was planning to compute the sentiment of all of Morrissey's lyrics vs. all the lyrics of a presumably more cheerful artist (e.g. Queen, David Bowie), but I found that most analyzers I tried did pretty badly at recognizing the sentiment of lyrics. To be fair, they are usually trained on movie/restaurant reviews, and lyrics are often more sophisticated (as evidence: at home, we humans disagreed on the sentiment of several Morrissey lines...). Here are some examples from the <a href="http://nlp.stanford.edu:8080/sentiment/rntnDemo.html">Stanford Sentiment Analysis demo</a>:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIo3l5UMHT6-GaVSvoynZtgrDBkEKeK7rKb1KtUksVGdv7kHQ431sEcA4JuLLRlHNF2LHYJIk4i8_rFm_SftpoCnxh6McnHCjHFnX-lOgx1qd00JuYtweqCuXsM66kToh13lFoXgyodq0/s1600/happy.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="357" data-original-width="871" height="163" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIo3l5UMHT6-GaVSvoynZtgrDBkEKeK7rKb1KtUksVGdv7kHQ431sEcA4JuLLRlHNF2LHYJIk4i8_rFm_SftpoCnxh6McnHCjHFnX-lOgx1qd00JuYtweqCuXsM66kToh13lFoXgyodq0/s400/happy.PNG" width="400" /></a></div>
<br />
A positive sentence from David Bowie. Sounds fun.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz-Rxzbg8BfzabGL_yPeO0ukSk-TWUbjVxV4lxnTHCRD4ou9-bZyaxDSIs-O6hHQTTSBX0s6abCzoyCeRadg8CkDWZx7Z5YYNTyv0ycr8oPFJ4K8_-GRELQ5Sc8RNRkNFitsEL5k43pYw/s1600/negative_sentiment_2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="350" data-original-width="781" height="178" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz-Rxzbg8BfzabGL_yPeO0ukSk-TWUbjVxV4lxnTHCRD4ou9-bZyaxDSIs-O6hHQTTSBX0s6abCzoyCeRadg8CkDWZx7Z5YYNTyv0ycr8oPFJ4K8_-GRELQ5Sc8RNRkNFitsEL5k43pYw/s400/negative_sentiment_2.PNG" width="400" /></a></div>
<br />
A negative sentence from Muse. A bit less fun.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjld-wJtT4JRsILqJuPUljx1CB3Xs5CQcgfEOUJAz0XqlEzS8l5_LniOhQUQ68fBn7-d4ZpWJgFZD4WgSFpMRgb5PHIezpc0dJ2qrXHGXc1yirLXpewGWski5SUm7xysCnxc_8ruS7ODZY/s1600/false_positive_sentiment.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="353" data-original-width="824" height="171" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjld-wJtT4JRsILqJuPUljx1CB3Xs5CQcgfEOUJAz0XqlEzS8l5_LniOhQUQ68fBn7-d4ZpWJgFZD4WgSFpMRgb5PHIezpc0dJ2qrXHGXc1yirLXpewGWski5SUm7xysCnxc_8ruS7ODZY/s400/false_positive_sentiment.PNG" width="400" /></a></div>
<br /></div>
</div>
<div dir="ltr" style="text-align: left;">
Finally, this last example is a subtle insult (at least in my interpretation) from Morrissey: "you were good in your time", which the model naively interpreted as a positive statement. This was a difficult one!</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
</div>
<hr />
<div style="direction: ltr; text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;">
<span style="color: black;"><a href="https://www.blogger.com/null" name="1">1</a><b> </b></span>In this post I use the lyrics I downloaded (315,357 songs) along with two lyrics corpora from Kaggle: <a href="https://www.kaggle.com/mousehead/songlyrics">from Sergey Kuznetsov</a> (57,650 songs) and <a href="https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics">from Gyanendra Mishra</a> (380,000 songs). </span><span style="font-size: x-small;">I was planning to share the code for scraping the lyrics from the web, but when I finally started writing this post, I found that the website I had been using had changed, and my scraping code no longer works.</span><sup style="font-size: small;"><a href="https://veredshwartz.blogspot.co.il/2018/01/fun-with-lyrics.html#top1" style="font-size: small;">↩</a></sup><br />
<br />
<span class="Apple-style-span" style="font-size: x-small;"><a href="https://www.blogger.com/null" name="2">2</a><b> </b>It is very, very frequent in general, so the prior probability of the occurrence of <i>California</i> in songs is high, not just the conditional probability given that it's a RHCP song. I never realized how common it is until I came back from California last summer and tried to fill the void by creating and constantly listening to this <a href="https://www.youtube.com/playlist?list=PLOcjrB_jn6-HlceLe-LwzJmlVphRDGNEi">America playlist</a> (biased towards songs about California). </span><sup style="font-size: small;"><a href="https://veredshwartz.blogspot.co.il/2018/01/fun-with-lyrics.html#top2" style="font-size: small;">↩</a></sup><br />
<br />
<span class="Apple-style-span" style="font-size: x-small;"><a href="https://www.blogger.com/null" name="3">3</a><b> </b>One note about the statistics in this post: they are inaccurate. Some states have just a few artists; mentions are counted the same whether they come from one song or from many; I didn't normalize the statistics by the number of artists in each state; I didn't check for mentions of cities; etc. </span><sup style="font-size: small;"><a href="https://veredshwartz.blogspot.co.il/2018/01/fun-with-lyrics.html#top3">↩</a></sup></div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com0tag:blogger.com,1999:blog-9145120678290195131.post-85569454569060086632017-10-30T21:11:00.000+02:002017-10-30T22:08:23.595+02:00Ambiguity<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiin8LDGmnEpbdnYaZEV9FI28b7ehPQUc-JZMkKF0eIbg4ZKRJ-MN_omkFAYjzR5II8L-DoHVf1V5QM9UkMI7O4LIji0tzZILywr6TlQk2FFJnuzwancALgzx3STzVs14vyeKu7125JGMI/s1600/slow_children.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="600" data-original-width="400" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiin8LDGmnEpbdnYaZEV9FI28b7ehPQUc-JZMkKF0eIbg4ZKRJ-MN_omkFAYjzR5II8L-DoHVf1V5QM9UkMI7O4LIji0tzZILywr6TlQk2FFJnuzwancALgzx3STzVs14vyeKu7125JGMI/s200/slow_children.jpg" width="133" /></a></div>
One of the problems with teaching computers to understand natural language is that much of the meaning in what people say is actually hidden in what they <i>don't</i> say. As humans, we trivially interpret the meaning of ambiguous words, written or spoken, according to their context. For example, this blog post is published in a blog that largely discusses natural language processing, so if I write "NLP", you'd know I refer to natural language processing rather than to <a href="https://en.wikipedia.org/wiki/Neuro-linguistic_programming">neuro-linguistic programming</a>. If I told you that the blog post doesn't fit into a tweet because it's too long, you'd know that the <i>blog post is too long</i> and not that <i>the tweet is too long</i>. You would infer that even without having any knowledge about <a href="https://www.recode.net/2017/9/26/16364002/twitter-longer-tweets-character-limit-140-280">Twitter's character limit</a>, because it just doesn't make sense otherwise. Unfortunately, common sense and world knowledge that come so easily to us are not trivial to teach to machines. In this post, I will present a few cases in which ambiguity is a challenge in NLP, along with common ways in which we try to overcome it.</div>
<br />
<hr />
<table>
<tbody>
<tr>
<td><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://memeguy.com/photo/108837/dad-jokes-" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="" height="400" src="https://memeguy.com/photos/images/dad-jokes--108837.jpg" title="Dad Jokes " width="208" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Polysemous words, providing material for dad jokes since... ever.</td></tr>
</tbody></table>
</td>
<td><b>Lexical Ambiguity</b><br />
Lexical ambiguity can occur when a word is <i>polysemous, </i>i.e. has more than one meaning, and the sentence in which it is contained can be interpreted differently depending on its correct sense.<br />
<br />
For example, the word <i>bank</i> has two senses: either a financial institution or the land alongside a river. When we read a sentence with the word <i>bank</i>, we understand which sense of <i>bank</i> the text refers to according to the context:<br />
<br />
(1) <i>Police seek person who robbed bank in downtown Reading.</i><br />
(2) <i>The faster-moving surface water travels along the concave bank.</i><br />
<br />
In these example sentences, "<i>robbed</i>" indicates the first sense while "<i>water</i>" and "<i>concave</i>" indicate the second.<br />
<br />
<br /></td>
</tr>
</tbody></table>
<br />
<table border="1"><tbody>
<tr><td><b>Existing Solutions for Lexical Ambiguity</b><br />
Word embeddings are great, but they conflate all the different senses of a word into one vector. Since word embeddings <a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html">are learned from the occurrences of a word in a text corpus</a>, the word embedding for <i>bank</i> is learned from its occurrences in both senses, and will be affected by neighbors related to the first sense (<i>money, ATM, union</i>) and to the second (<i>river, west, water</i>, etc.). The resulting vector is very likely to tend towards the more common sense of <i>bank</i>, as can be seen in this <a href="http://bionlp-www.utu.fi/wv_demo/">demo</a>: all the nearest words to <i>bank</i> are related to its financial sense.<br />
<br />
Word Sense Disambiguation (WSD) is an NLP task aimed at <i>disambiguating</i> a word in context. Given a list of potential word senses for each word, the correct sense of the word in the given context is determined. Similar to the way humans disambiguate words, WSD systems also rely on the surrounding context. A simple way to do so, in a machine-learning based solution (i.e. learning from examples), is to represent a word-in-context as the average of its context word vectors ("bag-of-words"). In the example above, we get for the first occurrence of <i>bank</i>: <i>feature_vector(bank) = 1/8 (vector(police) + vector(seek) + vector(person) + vector(who) + vector(robbed) + vector(in) + vector(downtown) + vector(reading))</i>, and for the second: <i>feature_vector(bank) = 1/9 (vector(the) + vector(faster) + vector(moving) + vector(surface) + vector(water) + vector(travels) + vector(along) + vector(the) + vector(concave))</i>.</td></tr>
</tbody></table>
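Here is what this bag-of-words feature looks like in code. The 3-dimensional vectors are made up for illustration; real systems use pretrained embeddings with hundreds of dimensions.

```python
import numpy as np

# Made-up 3-d word vectors; in practice these come from pretrained embeddings.
VECTORS = {
    "police":  np.array([0.9, 0.1, 0.0]),
    "robbed":  np.array([0.8, 0.2, 0.0]),
    "water":   np.array([0.0, 0.1, 0.9]),
    "concave": np.array([0.1, 0.0, 0.8]),
}

def context_feature(context_words):
    """Bag-of-words feature: the average of the context word vectors."""
    vecs = [VECTORS[w] for w in context_words if w in VECTORS]
    return np.mean(vecs, axis=0)

# The two occurrences of "bank" get very different feature vectors:
financial_bank = context_feature(["police", "robbed"])
river_bank = context_feature(["water", "concave"])
```

A classifier trained on such features can then tell the two senses apart, because the two occurrences land in different regions of the vector space.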
<br />
<hr />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimJr9l_t5Yonqzo9hKFJJGum0fUpng1gOrMnoQyH0jWUQUDLOPVzuFkRSn1YVyMsGrZexRd9WlNG836hNwp7kLCFIoSUFiqA5RitS6sEesbBm4PpVQ07wdC0ATf9UqHjTnYrnQxZqQE94/s1600/acl.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="891" data-original-width="792" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimJr9l_t5Yonqzo9hKFJJGum0fUpng1gOrMnoQyH0jWUQUDLOPVzuFkRSn1YVyMsGrZexRd9WlNG836hNwp7kLCFIoSUFiqA5RitS6sEesbBm4PpVQ07wdC0ATf9UqHjTnYrnQxZqQE94/s400/acl.PNG" width="355" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Can Google expand the acronym "ACL" correctly for me?</td></tr>
</tbody></table>
<b>Acronyms</b><br />
While many words in English are polysemous, things turn absolutely chaotic with acronyms. Acronyms are highly polysemous, some having dozens of different expansions. To make things even more complicated, as opposed to regular words, whose various senses are recorded in dictionaries and taxonomies like <a href="http://wordnet.princeton.edu/">WordNet</a>, acronyms are often domain-specific and not commonly known.<br />
<br />
Take for example a Google search for "ACL 2017". I get results both for the Annual Meeting of the <b>A</b>ssociation for <b>C</b>omputational <b>L</b>inguistics (which is what I was searching for) and for the <b>A</b>ustin <b>C</b>ity <b>L</b>imits festival. I have no idea whether this happens because (a) these are the two most relevant/popular expansions of "ACL" lately, or the only ones that go with "2017"; or (b) Google successfully disambiguated my query, showing the NLP conference first while also keeping the music festival lower in the search results, since it knows I also like music festivals. Probably (a) :)<br />
<br />
<table border="1"><tbody>
<tr><td><b>Existing Solutions for Acronym Expansion</b><br />
Expanding acronyms is considered a separate task from WSD, since there is no fixed inventory of potential expansions for each acronym. Given enough context (e.g. "<i>2017</i>" is a context word for the acronym <i>ACL</i>), it is possible to find texts that contain the expansion. This can be done either by searching for a pattern (e.g. <i>"Association for Computational Linguistics (ACL)"</i>) or by considering all the word sequences that start with these initials and deciding on the correct one using rules or a machine-learning based solution.</td></tr>
</tbody></table>
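The initials-matching idea can be sketched roughly as follows. This is a simplification: the stopword list and the matching rules are my own toy choices, and real systems are considerably more robust.

```python
import re

# Common function words allowed inside an expansion (e.g. the "for" in ACL).
STOPWORDS = {"of", "for", "the", "and", "in", "on"}

def candidate_expansions(acronym, text):
    """Find word sequences whose content-word initials spell out the acronym,
    allowing function words like "of"/"for" in between."""
    words = re.findall(r"[A-Za-z]+", text)
    acro = acronym.upper()
    results = []
    for start in range(len(words)):
        seq, k = [], 0
        for w in words[start:]:
            if k < len(acro) and w[0].upper() == acro[k]:
                seq.append(w)       # this word contributes the next initial
                k += 1
            elif seq and w.lower() in STOPWORDS:
                seq.append(w)       # function word in the middle is allowed
            else:
                break
        if k == len(acro) and len(seq) > 1:
            while seq[-1].lower() in STOPWORDS:  # trim trailing stopwords
                seq.pop()
            results.append(" ".join(seq))
    return results

text = "The Association for Computational Linguistics (ACL) meets annually."
print(candidate_expansions("ACL", text))
```

A rule-based or machine-learning component would then rank the candidates when more than one sequence matches.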
<br />
<hr />
<b>Syntactic Ambiguity</b>
<br />
<div dir="ltr">
No beginner NLP class is complete without at least one of the following example sentences:</div>
<div dir="ltr">
</div>
<ol style="text-align: left;">
<li><i>They ate pizza with anchovies</i></li>
<li><i><a href="https://youtu.be/NfN_gcjGoJo">I shot an elephant wearing my pajamas</a></i></li>
<li><i>Time flies like an arrow</i></li>
</ol>
What all these examples have in common is that each can be interpreted in multiple different ways, where the meanings differ in the underlying syntax of the sentence. Let's go over the examples.<br />
<br />
The first sentence <i>"They ate pizza with anchovies"</i>, can be interpreted as (i) "they ate pizza and the pizza had anchovies on it", which is the more likely interpretation, illustrated on the left side of the image below. This sentence has at least two more crazy interpretations: (ii) they ate pizza using anchovies (instead of using utensils, or eating with their hands), as in the right side of the image below, and (iii) they ate pizza and their anchovy friends ate pizza with them.<br />
<br />
<div dir="ltr">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfDcZuQjof0iNhxy3np3Qt1bxiCL_PsPiCGwDNs-_7uUD0cfKAnm7i0y_J5s2PKLfVDQoxMMgf8DH87wSBrShoBHIKmPvY8ZpSwzEGsDCBX3ZxPTLvnTW1Z1_jET3xbuk-9eTJn_ZCif4/s1600/pizza_with_anchovies_small.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="291" data-original-width="703" height="165" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfDcZuQjof0iNhxy3np3Qt1bxiCL_PsPiCGwDNs-_7uUD0cfKAnm7i0y_J5s2PKLfVDQoxMMgf8DH87wSBrShoBHIKmPvY8ZpSwzEGsDCBX3ZxPTLvnTW1Z1_jET3xbuk-9eTJn_ZCif4/s400/pizza_with_anchovies_small.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption"><span style="font-size: x-small;">Visual illustration of the interpretations of the sentence <i>"They ate pizza with anchovies"</i>. <br />Image taken from <a href="https://explosion.ai/blog/syntaxnet-in-context">https://explosion.ai/blog/syntaxnet-in-context</a>.</span></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
The first interpretation considers <i>"with anchovies"</i> as describing the <i>pizza, </i>while the other two consider it as describing the <i>eating</i> action. In the output of a <a href="https://veredshwartz.blogspot.co.il/2016/06/linguistic-analysis-of-texts.html">syntactic parser</a>, the interpretations will differ by the tree structure, as illustrated below.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJjptR5meUBWzkMvVq69AGGItFhVWbCDjaKkmy7sQNXSuiR1p5W-Btv5R3eNnFNcqcQaJ3NHcmm_-_v9zfwqH34Qh4EV-MOYZ3vZh_MuGY5_las_K33tYc1Czezxgl-9ANn1P8Z4V45AM/s1600/trees.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="447" data-original-width="888" height="201" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJjptR5meUBWzkMvVq69AGGItFhVWbCDjaKkmy7sQNXSuiR1p5W-Btv5R3eNnFNcqcQaJ3NHcmm_-_v9zfwqH34Qh4EV-MOYZ3vZh_MuGY5_las_K33tYc1Czezxgl-9ANn1P8Z4V45AM/s400/trees.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;">Possible syntactic trees for the sentence <i>"They ate pizza with anchovies"</i>, using <a href="https://demos.explosion.ai/displacy/">displacy</a>.</span></td></tr>
</tbody></table>
<br />
Although this is a classic example, both the <a href="https://demos.explosion.ai/displacy/">Spacy</a> and the <a href="http://nlp.stanford.edu:8080/corenlp/process">Stanford Core NLP</a> demos got it wrong. The difficulty is that syntactically speaking, both trees are likely. Humans know to prefer the first one based on the semantics of the words, and using their knowledge that anchovy is something that you <i>eat </i>rather than <i>eat with</i>. Machines don't come with this knowledge.<br />
<br />
A similar parser decision is crucial in the second sentence, and just in case you haven't managed to find the funny interpretations yet: <i>"I shot an elephant wearing my pajamas"</i> has two ambiguities. First, does <i>shoot</i> mean <i>taking a photo of</i> or <i>pointing a gun at</i>? (a lexical ambiguity). But more importantly, who's wearing the pajamas? That depends on whether <i>wearing</i> is attached to <i>shot</i> (meaning that I wore the pajamas while shooting) or to <i>elephant</i> (meaning that the elephant miraculously managed to squeeze into my pajamas). This entire scene, regardless of the interpretation, is very unlikely, and please don't kill elephants, even if they're stretching your pajamas.<br />
<br />
The third sentence is just plain weird, but it also has multiple interpretations, of which you can read about <a href="https://en.wikipedia.org/wiki/Time_flies_like_an_arrow;_fruit_flies_like_a_banana">here</a>.<br />
<br />
<table border="1"><tbody>
<tr><td><b>Existing Solutions for </b><b>Syntactic Ambiguity</b><br />
In the past, parsers were based on deterministic grammar rules (e.g. a noun and a modifier create a noun-phrase) rather than on machine learning. A possible solution to the ambiguity issue was to add different rules for different words. For more details, you can read my <a href="https://www.quora.com/Natural-Language-Processing-What-does-it-mean-to-lexicalize-PCFGs/answer/Vered-Shwartz?srid=8J1C">answer</a> to <a href="https://www.quora.com/Natural-Language-Processing-What-does-it-mean-to-lexicalize-PCFGs" ref="canonical">Natural Language Processing: What does it mean to lexicalize PCFGs?</a> on <a href="https://www.quora.com/">Quora</a>.
<br />
<br />
Today, similarly to other NLP tasks, parsers are mostly based on neural networks. In addition to other information, the word embeddings of the words in the sentence are used for deciding on the correct output. So potentially, such a parser may learn that "eat * with [y]" yields the output in the left of the image if y is edible (similar to word embeddings of other edible things), otherwise the right one.</td></tr>
</tbody></table>
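The last idea - letting the embedding of <i>y</i> decide the attachment - can be caricatured like this. The 2-d vectors and the prototype words are invented for illustration; a real parser learns this signal from data rather than from a hand-written rule.

```python
import numpy as np

# Toy 2-d embeddings: edible things cluster together, utensils cluster together.
VEC = {"anchovies":  np.array([0.9, 0.1]),
       "mushrooms":  np.array([0.8, 0.2]),
       "forks":      np.array([0.1, 0.9]),
       "chopsticks": np.array([0.0, 1.0])}

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def attachment(noun):
    """Attach 'with <noun>' to the pizza if the noun looks edible, else to 'ate'."""
    if cos(VEC[noun], VEC["mushrooms"]) > cos(VEC[noun], VEC["chopsticks"]):
        return "noun-attachment (a topping on the pizza)"
    return "verb-attachment (an instrument for eating)"

print(attachment("anchovies"))
print(attachment("forks"))
```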
</div>
<div dir="ltr">
<br /></div>
<hr />
<div dir="ltr">
<b>Coreference Ambiguity</b><br />
Very often a text mentions an entity (someone/something), and then refers to it again, possibly in a different sentence, using another word. Take these two paragraphs from a news article as an example:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj64RDdrTrndCc4qCmKFUggxhZu-ZeJR4hn6xw3ulys_NH_y5LpKQbuGjpEzWzCfDB5xEmIJU92vMEGSdTjufE7sEEHEagshDS7IC2Qqr_QtFdZZSLrYIvs-JozLGbzfcHH8XhyphenhyphenDbNHNrc/s1600/coref.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="240" data-original-width="626" height="122" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj64RDdrTrndCc4qCmKFUggxhZu-ZeJR4hn6xw3ulys_NH_y5LpKQbuGjpEzWzCfDB5xEmIJU92vMEGSdTjufE7sEEHEagshDS7IC2Qqr_QtFdZZSLrYIvs-JozLGbzfcHH8XhyphenhyphenDbNHNrc/s320/coref.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;">From <a href="https://www.theguardian.com/sport/2017/sep/22/donald-trump-nfl-national-anthem-protests">https://www.theguardian.com/sport/2017/sep/22/donald-trump-nfl-national-anthem-protests</a>. <br />The various entities participating in the article were marked in different colors.</span></td></tr>
</tbody></table>
<br />
I marked various entities that participate in the article in different colors. I grouped together different mentions of the same entities, including pronouns (<b>"<i>he</i>" </b>as referring to "<i>that son of a bitch</i>"; excuse my language, I'm just quoting Trump) and different descriptions ("<i>Donald Trump</i>", "<i>the president</i>"). To do that, I had to use my common sense (the <i>he</i> must refer to <i>that son of a bitch</i> who disrespected the flag, definitely not to <i>the president</i> or the <i>NFL owners</i>, right?) and my world knowledge (<i>Trump</i> is <i>the president</i>). Again, any task that requires world knowledge and reasoning is difficult for machines.<br />
<br />
<div style="-webkit-text-stroke-width: 0px; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<table border="1"><tbody>
<tr><td><b>Existing Solutions for Coreference Resolution</b><br />
Coreference resolution systems group mentions that refer to the same entity in the text. They go over each mention (e.g. <i>the president</i>), and either link it to an existing group containing previous mentions of the same entity (<i>[Donald Trump, the president]</i>), or start a new entity cluster (<i>[the president]</i>). Systems differ from each other, but in general, given a pair of mentions (e.g. <i>Donald Trump, the president</i>), they extract features referring either to each single mention (e.g. part-of-speech, word vector) or to the pair (e.g. gender/number agreement, etc.), and decide whether these mentions refer to the same entity. <br />
<br />
Note that mentions can be proper-names (<i>Donald Trump</i>), common nouns (<i>the president</i>) and pronouns (<i>he</i>); identifying coreference between pairs of mentions from each type requires different abilities and knowledge. For example, proper-name + common noun may require world knowledge (<i>Donald Trump is the president</i>), while pairs of common nouns can sometimes be solved with semantic similarity (e.g. synonyms like <i>owner </i>and <i>holder</i>). Pronouns can sometimes be matched to their antecedent (original mention) based on proximity and linguistic cues such as gender and number agreement, but very often there is more than one possible option for matching. <br />
<div style="margin: 0px;">
</div>
</td></tr>
</tbody></table>
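A toy version of the pairwise agreement features mentioned above might look like this. The attribute table is hand-written for this one example; real systems derive gender and number from parsers, name lists, and large lexicons.

```python
# Toy gender/number attributes; None means unknown (compatible with anything).
ATTRIBUTES = {
    "donald trump":  {"gender": "male", "number": "sg"},
    "the president": {"gender": None,   "number": "sg"},
    "he":            {"gender": "male", "number": "sg"},
    "nfl owners":    {"gender": None,   "number": "pl"},
}

def pair_features(mention1, mention2):
    """Agreement features for a candidate coreferring mention pair."""
    a, b = ATTRIBUTES[mention1.lower()], ATTRIBUTES[mention2.lower()]
    gender_ok = (a["gender"] is None or b["gender"] is None
                 or a["gender"] == b["gender"])
    number_ok = a["number"] == b["number"]
    return {"gender_agreement": gender_ok, "number_agreement": number_ok}

# "he" may corefer with "Donald Trump", but not with the plural "NFL owners":
print(pair_features("he", "Donald Trump"))
print(pair_features("he", "NFL owners"))
```

These features only rule candidates in or out; the hard cases, as the post notes, still need world knowledge on top of agreement.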
</div>
<br />
<div dir="ltr" style="-webkit-text-stroke-width: 0px; color: black; font-family: "Times New Roman"; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: left; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
</div>
A nice example of solving coreference ambiguity is the <a href="http://commonsensereasoning.org/winograd.html">Winograd Schema challenge</a>, which I first heard about in this <a href="https://artistdetective.wordpress.com/2015/12/31/winograd-schema-challenge/">post</a> in the <a href="https://artistdetective.wordpress.com/">Artificial Detective</a> blog. In this contest, computer programs are given a sentence with two nouns and an ambiguous pronoun, and they need to answer which noun the pronoun refers to, as in the following example:<br />
<br />
<i>The trophy would not fit in the brown suitcase because it was too big. What was too big?</i><br />
<span style="background-color: #fff2cc;">Answer 0: the trophy</span><br />
Answer 1: the suitcase<br />
<br />
Answering such questions requires, yes, you guessed correctly - commonsense and world knowledge. In the given example, the computer must reason that for the first object to fit into the second, the first object must be smaller than the second, so if the trophy could not fit into the suitcase, the trophy must be too big. Conversely, if instead of <i>big</i>, the question would have read <i>small</i>, the answer would have been "<i>the suitcase</i>".<br />
<br /></div>
<hr />
<div dir="ltr">
<b>Noun Compounds</b></div>
<div dir="ltr">
Words are usually considered the basic unit of a language, and many NLP applications use <a href="https://veredshwartz.blogspot.co.il/2016/01/representing-words.html">word embeddings</a> to represent the words in the text. Word embeddings do a pretty decent job of capturing the semantics of a single word, and sometimes also its syntactic and morphological properties. The problem starts when we want to capture the semantics of a multi-word expression (or a sentence, or a document). The embedding of a word, for example <i>dog</i>, is learned from its occurrences in a large text corpus; the more common a word is, the more occurrences there are, and the higher the quality of the learned word embedding will be (it would be located "correctly" in the vector space near things that are similar to <i>dog</i>). A bigram like <i>hot dog </i>is already much less frequent, even less frequent is <i>hot dog bun</i>, and so on. The conclusion is clear - we can't learn embeddings for multi-word expressions the same way we do for single words.<br />
<br />
The alternative is to try to somehow combine the word embeddings of the single words in the expression into a meaningful representation. Although there are many approaches for this task, there is no one-size-fits-all solution for this problem; a multi-word expression is not simply the sum of its single word meanings (<i>hot dog</i> is an extreme counter-example!).<br />
<br />
One example out of many would be noun-compounds. A noun-compound is a noun that is made up of two or more words, which usually consists of the head (main) noun and its modifiers, e.g. <i>video conference</i>, <i>pumpkin spice latte</i>, and <i>paper clip</i>. The use of noun-compounds in English is very common, but most noun-compounds don't appear frequently in text corpora. As humans, we can usually interpret the meaning of a new noun-compound if we know the words it is composed of; for example, even though I've never heard of <i>watermelon soup</i>, I can easily infer that it is a <i>soup </i><b>made of</b> <i>watermelon</i>.<br />
<br />
Similarly, if I want my software to have a nice vector representation of <i>watermelon soup</i>, there is no way I can base it on the corpus occurrences of <i>watermelon soup</i> -- it would be too rare. However, I used my commonsense to build a representation of <i>watermelon soup </i>in my head -- how would my software know that there is a <b>made of </b>relation between <i>watermelon </i>and <i>soup</i>? This relation can be one out of many, for example: <i>video conference</i> (<b>means</b>), <i>paper clip </i>(<b>purpose</b>), etc. Note that the relation is implicit, so there is no immediate way for the machine to know what the correct relation between the head and the modifier is.<sup><a href="http://veredshwartz.blogspot.co.il/2017/10/ambiguity.html#1" name="top1">1</a> </sup>To complicate things a bit further, many noun-compounds are non-compositional, i.e. the meaning of the compound is not a straightforward combination of the meaning of its words, as in <i><a href="https://en.wikipedia.org/wiki/Hot_dog#Etymology">hot dog</a>, <a href="https://en.wikipedia.org/wiki/Babysitting#Etymology">baby sitting</a>, and <a href="https://youtu.be/eUTbDa4lzW0?t=2m">banana hammock</a>.</i><br />
<br />
<table border="1">
<tbody>
<tr><td><b>Existing Solutions for Noun-compound Interpretation</b><br />
Automatic methods for interpreting the relation between the head and the modifier of noun-compounds have largely been divided into two approaches:<br />
<br />
(1) machine-learning methods, i.e. hand-labeling a bunch of noun-compounds to a set of pre-defined relations (e.g. part of, made of, means, purpose...), and learning to predict the relation for unseen noun-compounds. The features are either related to each single word (head/modifier), such as their word vectors or lexical properties from <a href="https://wordnet.princeton.edu/">WordNet</a>, or to the noun-compound itself and its corpus occurrences. Some methods also try to learn a vector representation for a noun-compound in the form of applying a function to the word embeddings of its single words (e.g. vector(<i>olive oil</i>) = function(vector(<i>olive</i>), vector(<i>oil</i>))).<br />
<br />
(2) finding joint occurrences of the nouns in a text corpus, some of which would explicitly describe the relation between the head and the modifier. For example "<i>oil </i>made of <i>olives</i>".<br />
<br />
While there has been a lot of work in this area, success on this task is still mediocre. A recent <a href="http://aclweb.org/anthology/W16-1604">paper</a> suggested that current methods succeed mostly by predicting the relation based solely on the head or on the modifier - for example, most noun-compounds with the head "<i>oil</i>" hold the <b>made of</b> relation (<i>olive oil, coconut oil, avocado oil</i>, ...). While this guess can be pretty accurate most of the time, it may cause funny mistakes as in the meme below.</td></tr>
</tbody></table>
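A minimal sketch of approach (1), using made-up word vectors and a tiny hand-labeled training set. A real system would use pretrained embeddings and a proper classifier; this nearest-centroid toy just illustrates the pipeline:

```python
import numpy as np

# Made-up 3-dimensional embeddings (a real system would use pretrained vectors).
emb = {
    "olive": [1.0, 0.1, 0.0], "oil": [0.9, 0.2, 0.1],
    "coconut": [0.8, 0.0, 0.2], "paper": [0.0, 1.0, 0.1],
    "clip": [0.1, 0.9, 0.0], "video": [0.0, 0.1, 1.0],
    "conference": [0.1, 0.0, 0.9],
    "watermelon": [0.9, 0.1, 0.1], "soup": [0.8, 0.3, 0.0],
}

def features(modifier, head):
    # Represent the compound by concatenating its single-word vectors.
    return np.concatenate([emb[modifier], emb[head]])

# A handful of hand-labeled noun-compounds (the "training set").
train = [
    (("olive", "oil"), "MADE-OF"),
    (("coconut", "oil"), "MADE-OF"),
    (("paper", "clip"), "PURPOSE"),
    (("video", "conference"), "MEANS"),
]

# Nearest-centroid "classifier": average the feature vectors per relation.
centroids = {}
for (mod, head), rel in train:
    centroids.setdefault(rel, []).append(features(mod, head))
centroids = {rel: np.mean(vs, axis=0) for rel, vs in centroids.items()}

def predict(modifier, head):
    f = features(modifier, head)
    return min(centroids, key=lambda rel: np.linalg.norm(f - centroids[rel]))

print(predict("watermelon", "soup"))  # MADE-OF
```

Note that this toy reproduces the weakness discussed above: since <i>soup</i> sits close to <i>oil</i> in the invented vector space, the prediction leans heavily on lexical memorization of the head and modifier.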
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbrJLbHqhwBTGpzLJ69vTXxsOkA3eMbl1w7twmz53MLg2VsHfAUOCK54aBcyq_U0j9daXc6zNkhFQnUmcEor7NfGiybq-gexwFYlSNtWpfnSmNaCVMYYAVUaLq924esQj5BOxL_ZbVQ_U/s1600/baby_oil.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="300" data-original-width="300" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbrJLbHqhwBTGpzLJ69vTXxsOkA3eMbl1w7twmz53MLg2VsHfAUOCK54aBcyq_U0j9daXc6zNkhFQnUmcEor7NfGiybq-gexwFYlSNtWpfnSmNaCVMYYAVUaLq924esQj5BOxL_ZbVQ_U/s200/baby_oil.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="direction: ltr;"><span style="font-size: x-small;">From <a href="http://www.quickmeme.com/meme/3r9thy">http://www.quickmeme.com/meme/3r9thy</a>.</span></td></tr>
</tbody></table>
<br />
For the sake of simplicity, I focused on two-word noun-compounds, but noun-compounds with more than two words have an additional, syntactic ambiguity: what are the head-modifier relations within the compound? This problem is often referred to as bracketing. Without getting into too many details, consider the example of <i>hot dog bun</i> from before. It should be interpreted as [[<i>hot dog</i>][<i>bun</i>]]<i> </i>rather than [<i>hot </i>[<i><a href="https://www.google.co.il/search?q=%22dog+bun%22+-hot&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjy58rYgpnXAhULb1AKHStWBwQQ_AUICigB&biw=1280&bih=600">dog bun</a></i>]]<i>.</i><br />
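One classic bracketing heuristic, sketched below with invented corpus counts, is the adjacency model: bracket first whichever adjacent pair co-occurs more strongly in a corpus:

```python
# Toy bigram counts (invented numbers); a real system would use
# frequencies from a large text corpus or the web.
count = {("hot", "dog"): 5000, ("dog", "bun"): 10}

def bracket(w1, w2, w3):
    """Adjacency model: bracket the more strongly associated adjacent pair first."""
    if count.get((w1, w2), 0) >= count.get((w2, w3), 0):
        return f"[[{w1} {w2}] {w3}]"
    return f"[{w1} [{w2} {w3}]]"

print(bracket("hot", "dog", "bun"))  # [[hot dog] bun]
```

Since "hot dog" is vastly more frequent than "dog bun", the left bracketing wins, matching the intended reading.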
<br />
<br />
<br />
<br /></div>
<div dir="ltr">
<hr />
<b>More to read?</b><br />
Yeah, I know it was a long post, but there is so much more ambiguity in language that I haven't discussed. Here is another selected topic, in case you're looking for more to read. We all speak a second language called <i>emoji</i>, which is full of ambiguity. Here are some interesting articles about it: <a href="https://www.thestar.com/business/tech_news/2017/08/04/emoji-could-cause-confusion-trouble-in-the-workplace.html">Emoji could cause confusion, trouble in the workplace</a>, <a href="http://mashable.com/2017/06/03/emoji-twitter-handles-meanings/#JQEeyCKGBkq3">The real meaning of all those emoji in Twitter handles</a>, <a href="http://www.jellyfish.net/en-us/news-and-views/learning-the-language-of-emoji-0">Learning the language of emoji</a>, and <a href="https://www.usatoday.com/story/tech/2017/07/16/emoji-celebrate-all-emojis-worldemojiday/466053001/">Why emojis may be the best thing to happen to language in the digital age</a>. For the older people among us (and in the context of emoji, I consider myself old too, so no offence anyone), if you're not sure about the meaning of an emoji, why don't you check <a href="https://emojipedia.org/">emojipedia</a> first, just to make sure you're not accidentally using <a href="https://emojipedia.org/aubergine/">phallic symbols</a> in your grocery list?<br />
<br />
<hr style="text-align: right;" />
<div style="direction: ltr;">
<span class="Apple-style-span" style="font-size: x-small;"><span style="color: black;"><a href="https://www.blogger.com/null" name="1" style="font-weight: bold;">1</a><b> </b></span></span><span style="font-size: x-small;">I</span><span style="font-size: x-small;">n this very interesting </span><span style="font-size: x-small;"><a href="http://people.ischool.berkeley.edu/~nakov/selected_papers_list/JNLE2013.pdf">paper</a> </span><span style="font-size: x-small;">by Preslav Nakov there is a nice observation: a noun-compound is a "compression device" that allows saying more with fewer words</span><span style="font-size: x-small;">.</span><span style="font-size: x-small;"> </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2017/10/ambiguity.html#top1" style="font-size: small;">↩</a></sup></div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com5tag:blogger.com,1999:blog-9145120678290195131.post-50378742821669237332017-08-09T19:24:00.000+03:002017-08-09T19:28:21.729+03:00Paraphrasing<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
One of the things that make natural language processing so difficult is language variability: there are multiple ways to express the same idea/meaning. I mentioned it several times in this blog, since it is a true challenge for any application that aims to interact with humans. You may program it to understand common things or questions that a human may have, but if the human decides to deviate from the script and phrase it slightly differently, the program is helpless. If you want a good example, take your favorite personal assistant (Google Assistant, Siri, Alexa, etc.) and ask it a question you know it can answer, but this time phrase it differently. Here is mine:</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6p00T0UwLwahzRoY5IJLE3LnrVFsRGBW6amspOe-rSm74jDX3nlzDUM0PgyiXLTVWt8zIPQH_c3NTgF8fh0QElP24gmxTe-UNqKekK3tBFzKquClfAdt3hW_2FY0AOLWrG1QGnyXdstI/s1600/Screenshot_2017-07-28-17-48-02-447_com.google.android.googlequicksearchbox.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1280" data-original-width="720" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6p00T0UwLwahzRoY5IJLE3LnrVFsRGBW6amspOe-rSm74jDX3nlzDUM0PgyiXLTVWt8zIPQH_c3NTgF8fh0QElP24gmxTe-UNqKekK3tBFzKquClfAdt3hW_2FY0AOLWrG1QGnyXdstI/s320/Screenshot_2017-07-28-17-48-02-447_com.google.android.googlequicksearchbox.png" width="180" /></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOQ7_HCXads0RmQf9msKvLwekD-RtnXG5S7NslsuasfGI9vGQRERwxZaVNwZ1AHYg1G6n85yP4zdJHvWfjsuRZgVWTci9FI0LJ3xJnEtt3hzA5Vzfttqy29QY5hk0SUs4-vZJivY9Brz8/s1600/Screenshot_2017-07-28-15-16-23-941_com.google.android.googlequicksearchbox.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1280" data-original-width="720" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOQ7_HCXads0RmQf9msKvLwekD-RtnXG5S7NslsuasfGI9vGQRERwxZaVNwZ1AHYg1G6n85yP4zdJHvWfjsuRZgVWTci9FI0LJ3xJnEtt3hzA5Vzfttqy29QY5hk0SUs4-vZJivY9Brz8/s320/Screenshot_2017-07-28-15-16-23-941_com.google.android.googlequicksearchbox.png" width="180" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
Both questions I asked have roughly the same meaning, yet Google answers the first perfectly but fails to answer the second, backing off to showing search results. In fact, I just gave you a "free" example of another difficult problem in NLP: ambiguity. It seems that Google interpreted <i>showers</i> as "meteor showers" rather than as light rain.</div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
One way to deal with the language variability difficulty is to construct a huge dictionary that contains groups or pairs of texts with roughly the same meaning: <b><i>paraphrases</i></b>. Then, applications like the assistant can, given a new question, look up in the dictionary any question they were programmed to answer that has the same meaning. Of course, this is a naive idea, given that language is infinite and one can always form a new sentence that has never been said before. But it's a good start, and it may help develop algorithms that can associate a new unseen text with an existing dictionary entry (i.e. generalize). </div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
Several approaches have been used to construct such dictionaries, and in this post I will present some of the simple-but-smart approaches. </div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<b>Translation-based paraphrasing</b></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
The idea behind this approach is super clever and simple: suppose we are interested in collecting paraphrases in English. If two English texts are translated to the same text in a foreign language, then they are likely paraphrases of each other. Here is an example:</div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZEvJMpcyzqFRdR7_FmMtOXwLZU3iCbD6w_SFo3VCrKt7SSpa8Bf8HpmOg3Djquiv0pj1i3MoTgJvhrSMhvsHds7M87D4qOUTUWNQY3l8bnj8isH0LhX8rvluHZyYRAQtLRwPMivsx330/s1600/translation-based.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="227" data-original-width="536" height="135" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZEvJMpcyzqFRdR7_FmMtOXwLZU3iCbD6w_SFo3VCrKt7SSpa8Bf8HpmOg3Djquiv0pj1i3MoTgJvhrSMhvsHds7M87D4qOUTUWNQY3l8bnj8isH0LhX8rvluHZyYRAQtLRwPMivsx330/s320/translation-based.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="direction: ltr; text-align: center;">The English texts on the left are translated into the same Italian text on the right, implying that they have the same meaning.</td></tr>
</tbody></table>
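The pivot step can be sketched as follows. The phrase table here is a toy example I made up (real tables are extracted automatically from parallel corpora and are far noisier); note how an ambiguous pivot word like <i>estacion</i> also produces a false paraphrase pair:

```python
from collections import defaultdict

# Toy phrase table of (English phrase, foreign translation) pairs.
phrase_table = [
    ("I have to go", "devo andare"),
    ("I must go", "devo andare"),
    ("I need to leave", "devo andare"),
    ("season", "estacion"),   # the pivot word is ambiguous...
    ("station", "estacion"),  # ...so these two become false paraphrases
]

# Pivot step: English phrases sharing a foreign translation are candidates.
by_pivot = defaultdict(set)
for english, foreign in phrase_table:
    by_pivot[foreign].add(english)

paraphrases = {frozenset({a, b})
               for group in by_pivot.values()
               for a in group for b in group if a != b}
print(paraphrases)
```

Real systems additionally score each pair with translation probabilities, which down-weights (but does not eliminate) pairs produced by ambiguous pivots.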
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
This approach goes back as far as <a href="http://www.aclweb.org/anthology/P01-1008">2001</a>. The most prominent resource constructed with this approach is the paraphrase database (<a href="http://paraphrase.org/#/">PPDB</a>), a resource containing hundreds of millions of text pairs with roughly the same meanings. Using the online demo, I looked up paraphrases of "<i>nice to meet you</i>", yielding a <a href="http://paraphrase.org/#/search?q=nice%20to%20meet%20you&filter=&lang=en">bunch</a> of friendly variants that may be of use for conference small talk: </div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
</div>
<br />
<table align="center" frame="border">
<tbody>
<tr><td style="text-align: center;"><i>it was nice meeting you</i></td></tr>
<tr><td style="text-align: center;"><i>it was nice talking to you</i></td></tr>
<tr><td style="text-align: center;"><i>nice to see you</i></td></tr>
<tr><td style="text-align: center;"><i>hey, you guys</i></td></tr>
<tr><td style="text-align: center;"><i>it's nice to meet you</i></td></tr>
<tr><td style="text-align: center;"><i>very nice to meet you</i></td></tr>
<tr><td style="text-align: center;"><i>nice to see you</i></td></tr>
<tr><td style="text-align: center;"><i>i'm pleased to meet you</i></td></tr>
<tr><td style="text-align: center;"><i>it's nice to meet you</i></td></tr>
<tr><td style="text-align: center;"><i>how are you</i></td></tr>
<tr><td style="text-align: center;"><i>i'm delighted</i></td></tr>
<tr><td style="text-align: center;"><i>it's been a pleasure</i></td></tr>
<tr><td><hr />
</td></tr>
<tr><td class="tr-caption" style="direction: ltr; text-align: center;">Paraphrases of "nice to meet you", from <a href="http://paraphrase.org/#/search?q=nice%20to%20meet%20you&filter=&lang=en">PPDB</a>.</td></tr>
</tbody></table>
<br />
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
In practice, all these texts appear as paraphrases of "<i>nice to meet you</i>" in the resource, with different scores (to what extent is this text a paraphrase of "<i>nice to meet you</i>"?). These texts were found to be translated to the same text in a single or in <b>multiple </b>foreign languages, and their scores correspond to the translation scores (as explained <a href="http://veredshwartz.blogspot.ca/2015/09/translation-models.html">here</a>), along with other heuristics.<sup><a href="http://veredshwartz.blogspot.co.il/2017/08/paraphrasing.html#2" name="top2">2</a> </sup></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
While this approach provides a ton of very useful paraphrases, as you can guess, it also introduces errors, as does every automatic method. One type of error occurs when the foreign word has more than one sense, each translating into a different, unrelated English word. For example, the Spanish word <i>estacion </i>has two meanings: <i>station </i>and <i>season</i>. When given a Spanish sentence that contains this word, it is translated (hopefully) to the correct English word according to the context. This paraphrase approach, however, does not look at the original sentences in which these words occur, but only at the phrase table -- a huge table of English phrases and their Spanish translations <i>without their original contexts.</i> It therefore has no way to tell that <i>stop </i>and <i>station </i>refer to the same meaning of <i>estacion, </i>and are therefore paraphrases, while <i>season </i>and <i>station</i> are translations of two different senses of <i>estacion.</i></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<i><br /></i></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
Even without making such a horrible mistake of considering two completely unrelated texts as paraphrases, paraphrasing is not well-defined, and the <i>paraphrase </i>relation encompasses many different relations. For example, looking for <a href="http://paraphrase.org/#/search?q=tired&filter=&lang=en">paraphrases of the word <i>tired</i> in PPDB</a>, you will get equivalent phrases like <i>fatigued</i>, more specific phrases like <i>overtired/exhausted</i>, and related but not-quite-the-same phrases like <i>bored. </i>This may occur when the translator likes being creative and does not remain completely faithful to the original sentence, but also when the target language does not contain an exact translation for a word, defaulting to a slightly more specific or more general word. While this phenomenon is not specific to this approach but common to all paraphrasing approaches (for different reasons), it has been studied by the PPDB people, who did <a href="http://www.aclweb.org/anthology/P15-1146">an interesting analysis</a> of the different semantic relations the resource captures.</div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
The following approaches focus on paraphrasing predicates. A <a href="http://www.k12reader.com/term/simple-predicate/">predicate</a> is a text describing an action or a relation between one or more entities/arguments, very often containing a verb. For example: <i>John <b>ate</b> an apple</i> or <i>Amazon <b>acquired</b> Whole Foods</i>. Predicate paraphrases are pairs of predicate templates -- i.e. predicates whose arguments were replaced by placeholders -- that would have roughly the same meaning given an assignment to their arguments. For example, <b>[a]<sub>0</sub> acquired [a]<sub>1</sub></b> and <b>[a]<sub>0</sub> bought [a]<sub>1 </sub></b>are paraphrases given the assignment [a]<sub>0 </sub>= <i>Amazon </i>and [a]<sub>1 </sub>= <i>Whole Foods</i>.<sup><a href="http://veredshwartz.blogspot.co.il/2017/08/paraphrasing.html#1" name="top1">1</a> </sup>Most approaches focus on <span style="text-align: center;">binary predicates (predicates with two arguments).</span></div>
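To make the template notation concrete, here is a tiny sketch of instantiating predicate templates with an argument assignment (the plain-text placeholder format <code>[a]0</code>/<code>[a]1</code> and the helper function are my own, just for illustration):

```python
def instantiate(template, args):
    """Fill a predicate template's placeholders with concrete arguments."""
    out = template
    for i, arg in enumerate(args):
        out = out.replace(f"[a]{i}", arg)
    return out

t1, t2 = "[a]0 acquired [a]1", "[a]0 bought [a]1"
args = ("Amazon", "Whole Foods")
print(instantiate(t1, args))  # Amazon acquired Whole Foods
print(instantiate(t2, args))  # Amazon bought Whole Foods
```

Under this assignment the two instantiated sentences have the same meaning, so the templates count as predicate paraphrases.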
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<b>Argument-distribution paraphrasing</b></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
This approach relies on a simple assumption: if two predicates have the same meaning, they should normally appear with the same arguments. Here is an example:</div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFWoYDi-0hSKKorzrBru9ssPYyxl3oYWuCPLMUqsCoggLp6XuIQTbjuZgz2kxRt_N0PUSf81myBPqKNRfYviO52tvsTu0kREihMBr-TPnX2jQgn81VTAXLDvTu1X6F4uPuia-NdtM8MYg/s1600/distributional.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="243" data-original-width="600" height="162" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFWoYDi-0hSKKorzrBru9ssPYyxl3oYWuCPLMUqsCoggLp6XuIQTbjuZgz2kxRt_N0PUSf81myBPqKNRfYviO52tvsTu0kREihMBr-TPnX2jQgn81VTAXLDvTu1X6F4uPuia-NdtM8MYg/s400/distributional.PNG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="direction: ltr; text-align: center;"></td></tr>
</tbody></table>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<span style="text-align: center;">In this example, the </span><b>[a]<sub>0</sub></b><span style="text-align: center;"> slots in both predicates are expected to contain names of companies that acquired other companies while the </span><b>[a]<sub>1</sub></b><span style="text-align: center;"> slot is expected to contain acquired companies. </span></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<span style="text-align: center;"><br /></span></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<span style="text-align: center;">The <a href="http://dl.acm.org/citation.cfm?id=502559">DIRT</a> method represents each predicate as two vectors: (1) the distribution of words that appeared in its </span><b>[a]<sub>0</sub></b><span style="text-align: center;"> argument slot, and (2) the distribution of words that appeared in its </span><b>[a]<sub>1</sub></b><span style="text-align: center;"> argument slot. For example, the </span><b>[a]<sub>0</sub></b><span style="text-align: center;"> vectors of the predicates in the example will have positive/high values for names of people and names of companies that acquired other companies, and low values for other (small) companies and other unrelated words (<i>cat</i>, <i>cookie, ...</i>). </span><span style="text-align: center;">To measure the similarity between two predicates, the two vector pairs (</span><b>[a]<sub>0</sub></b><span style="text-align: center;"> in each predicate and </span><b>[a]<sub>1</sub></b><span style="text-align: center;"> in each predicate) </span><span style="text-align: center;">are compared using vector similarity measures (e.g. cosine similarity), and a final score averages the per-slot similarities.</span></div>
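A minimal sketch of the DIRT idea over a toy set of (argument, predicate, argument) triples. The triples are invented examples; a real system extracts millions of them from a parsed corpus:

```python
import numpy as np

# Toy "corpus" of ([a]0, predicate, [a]1) triples.
triples = [
    ("Amazon", "acquired", "Whole Foods"), ("Google", "acquired", "YouTube"),
    ("Amazon", "bought", "Whole Foods"), ("Google", "bought", "YouTube"),
    ("John", "ate", "an apple"), ("Mary", "ate", "a sandwich"),
]

vocab = sorted({w for a0, _, a1 in triples for w in (a0, a1)})

def slot_vector(predicate, slot):
    """Count distribution of words seen in one argument slot of a predicate."""
    v = np.zeros(len(vocab))
    for a0, pred, a1 in triples:
        if pred == predicate:
            v[vocab.index(a0 if slot == 0 else a1)] += 1
    return v

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def dirt_similarity(p, q):
    # Average the per-slot cosine similarities, as in DIRT.
    return float(np.mean([cosine(slot_vector(p, s), slot_vector(q, s))
                          for s in (0, 1)]))

print(dirt_similarity("acquired", "bought"))  # 1.0: identical argument slots
print(dirt_similarity("acquired", "ate"))     # 0.0: disjoint arguments
```

In this toy corpus <i>acquired</i> and <i>bought</i> share all their arguments, so they score as perfect paraphrases; note that <i>acquired</i> and <i>sold</i> would score just as high, which is exactly the weakness discussed next.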
<div style="direction: ltr; text-align: left;">
<br /></div>
<div style="direction: ltr; text-align: left;">
Now, while it is true that predicates with the same meaning often share arguments, it is definitely not true that predicates that share a fair amount of their argument instantiations are always paraphrases. A simple counterexample would be of predicates with opposite meanings, that often tend to appear with similar arguments: for instance, <b>"[stock] rise to [30]"</b> and <b>"[stock] fall to [30]" </b>or "<b>[a]<sub>0</sub></b><span style="text-align: center;"> <b>acquired </b></span><b>[a]<sub>1</sub></b>" and "<b>[a]<sub>0</sub></b><span style="text-align: center;"> <b>sold </b></span><b>[a]<sub>1</sub></b>" with any <b>[a]<sub>0 </sub></b>that once bought an <b>[a]<sub>1 </sub></b>and then sold it.</div>
<div style="direction: ltr; text-align: left;">
<br /></div>
<div style="direction: ltr; text-align: left;">
Following this approach, other methods were suggested, such as capturing a directional inference relation between predicates (e.g. <b>[a]<sub>0</sub></b><span style="text-align: center;"> <b>shot </b></span><b>[a]<sub>1</sub></b><b> => [a]<sub>0</sub></b><span style="text-align: center;"> <b>killed </b></span><b>[a]<sub>1</sub></b> but not vice versa), releasing a huge resource of such predicate pairs (see the <a href="http://www.cs.tau.ac.il/~joberant/homepage_files/publications/thesis.pdf">paper</a>); and a method to predict whether one predicate entails the other, given a specific context (see the <a href="https://www.aclweb.org/anthology/C/C16/C16-1273.pdf">paper</a>). </div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<b>Event-based paraphrases</b></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
Another good source for paraphrases is multiple descriptions of the same news event, as various news reporters are likely to choose different words to describe the same event. To automatically group news headlines discussing the same story, it is common to group them according to the publication date and word overlap. Here is an example of some headlines describing the acquisition of <i>Whole Foods</i> by <i>Amazon</i>:</div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUzeiJNcVjlBRQLpo25exvqwcTKgR7DuNkEXpCpggOEM_gdIVB-rldjDXil2qhd7Uy3Ne7qnMVuKve2ubrU0OtOmmxYoGE-fzb4p98dQfN7gvsLSB9tTbTjXEBko0JxABydOcxNvdgg_8/s1600/tweets.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="682" data-original-width="1442" height="188" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUzeiJNcVjlBRQLpo25exvqwcTKgR7DuNkEXpCpggOEM_gdIVB-rldjDXil2qhd7Uy3Ne7qnMVuKve2ubrU0OtOmmxYoGE-fzb4p98dQfN7gvsLSB9tTbTjXEBko0JxABydOcxNvdgg_8/s400/tweets.PNG" width="400" /></a></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
We can stop here and say that all these headlines are sentential paraphrases. However, going a step further, if we've already observed in the past <i>Google </i><b>to acquire </b><i>YouTube</i> / <i>Google</i><b> is buying </b><i>YouTube </i>as sentential paraphrases (and many other similar paraphrases), we can generalize and say that <b>[a]<sub>0</sub></b><span style="text-align: center;"> <b>to</b> <b>acquire </b></span><b>[a]<sub>1</sub></b> and <b>[a]<sub>0</sub></b><span style="text-align: center;"> <b>is buying </b></span><b>[a]<sub>1 </sub></b>are <i>predicate</i> paraphrases.</div>
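The grouping step can be sketched as follows, using invented headlines and a simple Jaccard word-overlap measure (the 0.4 threshold is an arbitrary choice for illustration):

```python
def jaccard(a, b):
    """Word-overlap similarity between two headlines."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

# Toy headlines with publication dates (invented examples).
headlines = [
    ("2017-06-16", "Amazon to acquire Whole Foods"),
    ("2017-06-16", "Amazon is buying Whole Foods"),
    ("2017-06-16", "Stocks fall after tech selloff"),
]

# Greedy grouping: same day + enough word overlap => same event.
groups = []
for date, h in headlines:
    for g in groups:
        if g[0][0] == date and jaccard(h, g[0][1]) >= 0.4:
            g.append((date, h))
            break
    else:
        groups.append([(date, h)])

print([[h for _, h in g] for g in groups])
```

The two <i>Whole Foods</i> headlines end up in one group (they share "Amazon", "Whole", and "Foods"), while the unrelated stocks headline forms its own group; predicate paraphrases are then generalized from the grouped sentences.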
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
Early works relying on this approach are <a href="http://nlp.cs.nyu.edu/pubs/papers/shinyama-hlt02.pdf">1</a>, <a href="https://arxiv.org/pdf/cs/0304006.pdf">2</a>, followed by some more complex methods like <a href="http://www.aclweb.org/website/old_anthology/D/D13/D13-1183.pdf">3</a>. We recently harvested such paraphrases from Twitter, assuming that tweets with links to news web sites that were published on the same day are likely to describe the same news events. If you're interested in more details, here are the <a href="http://aclweb.org/anthology/S/S17/S17-1019.pdf">paper</a>, the <a href="http://u.cs.biu.ac.il/~havivv/papers/chirps_poster.pdf">poster</a> and the <a href="https://github.com/vered1986/Chirps">resource</a>.</div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
This approach is potentially more accurate than the argument-distribution approach: the argument-distribution approach assumes that predicates that often occur with the same arguments are paraphrases, while the event-based approach considers predicates with the same arguments as paraphrases only if it believes that they discuss the same event.</div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<b>What does the future hold?</b> Neural paraphrasing methods, of course. I won't go into technical details (I feel that there are enough "neural networks for dummies" blog posts out there, and I'm by no means an expert on that topic). The idea is to build a model that reads a sequence of words and then generates a different sequence of words with the same meaning. If it sounds like inexplicable magic, it is mostly because even the researchers working on this task can at most make educated guesses about why something works well or not. In any case, if this ever ends up working well, it will be much better than the resources we have today, since it will be capable of providing paraphrases / judging the correctness of paraphrases for new texts that were never observed before.</div>
<div class="separator" dir="ltr" style="clear: both; text-align: left;">
<br /></div>
<hr />
<div style="direction: ltr; text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;">
<span style="color: black;"><a href="https://www.blogger.com/null" name="1" style="font-weight: bold;">1</a><b> </b></span>Of course, given a different choice of arguments, these predicates will not be considered paraphrases. For example, <i>Mary acquired a skill</i> is not a paraphrase of <i>Mary bought a skill</i>. The discussed approaches consider predicate pairs as paraphrases if there exists an argument assignment (/context) under which these predicates are paraphrases.</span><span style="font-size: x-small;"> </span><span style="font-size: x-small;"> </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2017/08/paraphrasing.html#top1" style="font-size: small;">↩</a></sup><br />
<span style="font-size: x-small;"><span style="color: black;"><a href="https://www.blogger.com/null" name="2" style="font-weight: bold;">2</a><b> </b></span>See also <a href="http://www.aclweb.org/anthology/E17-1083">more recent work</a> on translation-based paraphrasing.</span><span style="font-size: x-small; text-align: right;"> </span><sup style="font-size: small; text-align: right;"><a href="http://veredshwartz.blogspot.co.il/2017/08/paraphrasing.html#top2" style="font-size: small;">↩</a></sup></div>
</div>
</div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com16tag:blogger.com,1999:blog-9145120678290195131.post-41337017417512194262017-03-01T16:58:00.000+02:002017-03-01T17:06:34.717+02:00Women in STEM*<div dir="rtl" style="text-align: right;" trbidi="on">
<div style="direction: ltr; text-align: left;">
This is a special post for International Women's Day (March 8th). Every year I find myself enthusiastically conveying my thoughts about the topic to the people around me, so I thought I might as well share it with a broader audience. As always, this post presents my very limited knowledge/interpretation of a broadly discussed and studied topic. However, it may be a bit off topic for this blog, so if you're only interested in computational stuff, you can focus on section 3.<br />
<b><br /></b>
<b>1. The Problem</b><br />
Even though we are half of the population, women are quite poorly represented in STEM:<br />
<br />
<table border="1">
<tbody>
<tr>
<td><i><b>USA</b>: the percentage of computing occupations held by women has been declining since 1991, when it reached a high of 36%. The current rate is 25%.</i> [2016, <a href="https://www.ncwit.org/sites/default/files/resources/womenintech_facts_fullreport_05132016.pdf">here</a>]<br />
<i><b><br /></b></i>
<i><b>OECD member countries</b>: While women account for more than half of university graduates in scientific fields in several OECD countries, they account for only 25% to 35% of researchers in most OECD countries.</i> [2006, <a href="https://www.oecd.org/sti/sci-tech/womeninscientificcareersunleashingthepotential.htm">here</a>]</td></tr>
</tbody></table>
<b><br /></b>
<b>2. The Causes (and possible solutions)</b><br />
<b><br /></b>
<b>2.1 Cognitive Differences</b><br />
There is a common conception that female abilities in math are biologically inferior to those of males. Many highly cited psychology papers prove otherwise, for example:<br />
<br />
<i>"Stereotypes that girls and women lack mathematical ability persist, despite mounting evidence of gender similarities in math achievement."</i> [<a href="http://psycnet.apa.org/journals/bul/136/1/103/">1</a>].<br />
<br />
<i>"...provides evidence that mathematical and scientific reasoning develop from a set of biologically based cognitive capacities that males and females share. These capacities lead men and women to develop equal talent for mathematics and science."</i> [<a href="http://psycnet.apa.org/journals/amp/60/9/950/">2</a>]<br />
<br /></div>
<div dir="ltr" style="text-align: left;">
<ol style="text-align: left;"></ol>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://imgs.xkcd.com/comics/how_it_works.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://imgs.xkcd.com/comics/how_it_works.png" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">From <a href="https://imgs.xkcd.com/comics/how_it_works.png">https://imgs.xkcd.com/comics/how_it_works.png</a>.</td></tr>
</tbody></table>
<br />
In addition, if cognitive differences were so prominent, there wouldn't be so many women graduating in scientific fields. It seems that the problem lies in <i>occupational gender segregation</i>, which may be explained by any one of the following:</div>
<div dir="ltr" style="text-align: left;">
<b><br /></b></div>
<div dir="ltr" style="text-align: left;">
<b>2.2 Family Life</b><br />
Here are some references from studies conducted about occupational gender segregation:</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<i>"In some math-intensive fields, women with children are penalized in promotion rates."</i> [<a href="http://psycnet.apa.org/journals/bul/135/2/218">3</a>]<br />
<ol style="text-align: left;"></ol>
<i>"[...] despite the women's movement and more efforts in society to open occupational doors to traditional male-jobs for women, concerns about balancing career and family, together with lower value for science-related domains, continue to steer young women away from occupations in traditionally male-dominated fields, where their abilities and ambitions may lie."</i> [<a href="http://www.tandfonline.com/doi/abs/10.1080/13803610600765786">4</a>]<br />
<br />
<i>"women may 'prefer' those [jobs] with flexible hours in order to allow time for childcare, and may also 'prefer' occupations which are relatively easy to interrupt for a period of time to bear or rear children."</i> [<a href="http://staging.ilo.org/public/libdoc/ilo/2001/101B09_6_engl.pdf">5</a>] (the quotation marks are explained later in the paper, indicating that this is not a personal preference but rather one shaped by learned cultural and social values).<br />
<br />
I'd like to focus the discussion now on my local point of view, the situation in Israel, since I suspect that family life is the most prominent cause of the problem here. I would be very interested in comments about what it is like in other countries.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td><a href="http://files.explosm.net/comics/Rob/rolls.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://files.explosm.net/comics/Rob/rolls.png" height="145" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">From <a href="http://explosm.net/comics/2861/">http://explosm.net/comics/2861/</a></td></tr>
</tbody></table>
<br />
According to the <a href="http://www.cbs.gov.il/reader/?MIval=cw_usr_view_Folder&ID=141">Central Bureau of Statistics</a>, in 2014, 48.9% of the workers in Israel were women (and 51.1% were men). The average salary was 7,439 NIS for women and 11,114 for men. Wait, what?... let me introduce another (crucial) factor.<br />
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
While the fertility rate has decreased in all other <a href="http://stats.oecd.org/Index.aspx?DataSetCode=FAMILY">OECD</a> member countries, in Israel it remained stable for the last decade, with an average of 3.7 children per family. On a personal note, as a married woman without children, I can tell you that it is definitely an issue, and "when are you planning to have children already?" is considered a perfectly valid question here, even from strangers (and my friends with 1 or 2 children often get "when do you plan to have the 2nd/3rd child?").</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
Paid maternity leave is 14 weeks, with the possibility (used by anyone who can afford it) to extend it by 3 more unpaid months. Officially, either parent can take the leave, but in practice, since this law was introduced in 1998, only roughly 0.4% of the parents who took it were fathers. </div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
Here is the number connecting the dots, and explaining the salary gap: in 2014, the average number of work hours per week was 45.2 for men and 36.7 for women. The culture in Israel is torn between traditional family roles (the mother as the main parent) and modern opportunities for women. Most women I know have a career in the morning, and a second job in the afternoon with the kids. With a hard constraint of leaving work before 16:00 to pick up the kids, in a demanding market like Israel's, it is much harder for a woman to get promoted. This makes the high-tech industry, in which working hours are known to be long, a male-dominated environment. Indeed, in 2015, only 36.2% of the high-tech workers in Israel were women.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
This situation is doubly troubling: on the one hand, it is difficult for women who do choose demanding careers. They have to juggle home and work in a way that men are never required to. On the other hand, girls are steered from childhood toward traditionally feminine occupations that are less demanding in working hours. </div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
Don't get me wrong, I'm not here to judge. Being a feminist doesn't entail that the woman must have a career while the man has to stay at home with the children. Each couple can decide on their division of labor as they wish. It's the social expectations and cultural bias that I'm against. I've seen this happening time after time: the man and the woman both study and build up their careers, they live in equality, and then the birth of their first child, and specifically maternity leave, is the slippery slope after which equality is a fantasy. </div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<div style="text-align: left;">
To make a long story short, I think it is not women the market is against, but mothers. When I say "against" I include allegedly good ideas such as allowing a mother to leave work at 16:00. While I'm not against leaving work at 16:00 (modern slavery is a topic for another discussion...), I don't see why this "privilege" should be reserved only for mothers. In my humble opinion, it would benefit mothers, fathers, children and the market if men and women could each get 3 days a week on which they leave work as "early" as 16:00. It wouldn't hurt if both men and women had the right to take parental leave together, developing their parenthood as a shared job. This situation will never change unless the market overcomes outdated social norms and stops treating parenthood as a job for women.<br />
<b><br /></b>
<b>2.3 Male-dominated Working Environments </b><br />
Following the previous point, tech workplaces (everywhere) are dominated by men, so even women who choose to work in this industry might feel uncomfortable in their workplaces. Luckily, I can't attest to this from my own experience: I've never been treated differently as a woman, and have never felt threatened or uncomfortable in situations in which I was the only woman. This <a href="https://www.ft.com/content/51c81a8c-69d9-11e6-a0b1-d87a9fea034f">article</a> exemplifies some of the things that other women have experienced:<br />
<br />
<i>"Many [women] will say that their voice is not heard, they are interrupted or ignored in meetings; that much work takes place on the golf course, at football matches and other male-dominated events; that progress is not based on merit and women have to do better than men to succeed, and that questions are raised in selection processes about whether a woman 'is tough enough'."</i><br />
<div>
<br /></div>
</div>
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://www.phdcomics.com/comics/archive/phd081604s.gif" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://www.phdcomics.com/comics/archive/phd081604s.gif" height="172" style="cursor: move;" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 12.8px;">From <a href="http://www.phdcomics.com/comics/archive.php?comicid=490">http://www.phdcomics.com/comics/archive.php?comicid=490</a></td></tr>
</tbody></table>
<ol style="text-align: left;"></ol>
I've only become aware of these problems recently, which I guess is both a good sign (it might not be too common, or at least not all women experience it) and a bad sign (many women still suffer from it and there's not enough awareness). This <a href="http://m-mitchell.com/gender/papers/5Aug2016.pdf">interesting essay written by Margaret Mitchell</a> suggests some practical steps to make women feel more comfortable in their workplaces.<br />
<br />
Of course, things get much worse when you consider sexual harassment in workplaces. I know that awareness of the subject is very high today, that an employer's duty to prevent sexual harassment is statutory in many countries, and that many big companies require new employees to undergo sexual harassment prevention training. While this surely mitigates the problem, it is still too common, with <a href="https://www.susanjfowler.com/blog/2017/2/19/reflecting-on-one-very-strange-year-at-uber">a disturbing story just from the last week</a> (and many other stories untold). As with every other law, there will always be people breaking it, but it is the employer's duty to investigate any reported case and handle it, even at the cost of losing a valuable worker.<br />
<div>
<br /></div>
<b>2.4 Gender Stereotypes </b><br />
Stereotypes persist simply because it is so difficult to change perceptions: even if some of the reasons why women were previously less likely to work in these industries are no longer relevant, girls will still be steered away from these fields as long as they are considered unsuitable for them.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://scontent-cdg2-1.xx.fbcdn.net/v/t1.0-9/14721569_1320697071283268_7090892967047125448_n.jpg?oh=9a2dfff11b58d9ca10b19ae3dd1df8d0&oe=593365A9" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="320" src="https://scontent-cdg2-1.xx.fbcdn.net/v/t1.0-9/14721569_1320697071283268_7090892967047125448_n.jpg?oh=9a2dfff11b58d9ca10b19ae3dd1df8d0&oe=593365A9" width="211" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">From <a href="https://www.facebook.com/DoodleTimeSarah/">https://www.facebook.com/DoodleTimeSarah/</a></td></tr>
</tbody></table>
<br />
An interesting illustration was provided in this <a href="https://www.google.co.il/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwjsop_uzKHSAhUG2CwKHaFCCjgQFggYMAA&url=https%3A%2F%2Frepository.wlu.edu%2Fbitstream%2Fhandle%2F11021%2F16290%2FCoyle_theses_2010.pdf%3Fsequence%3D5&usg=AFQjCNFpgXa1BiKxnB5VpnDt-kGiIowC-g&sig2=bcVHujXy3TzC4_MOnq-uyA">work</a>, where 26 girls (around 4 years old) were shown different Barbie dolls and asked whether they believed women could do masculine jobs. When the Barbie dolls were dressed in "regular" outfits, many of them replied negatively, but after being shown a Barbie dressed up in a masculine outfit (firefighter, astronaut, etc.), the girls believed that they too could do non-stereotypical jobs.<br />
<br />
This is the vicious circle that people are trying to break by encouraging young girls to study scientific subjects and supporting women already working in these fields. Specifically, by organizing women-only conferences, offering scholarships for women, and making sure that there is a female representative in any professional group (e.g. panel, committee, etc.). While I understand the rationale behind changing the gender distribution, I often feel uncomfortable with these solutions. I'll give an example.<br />
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
Let's say I submitted a paper to the main conference in my field, and that paper was rejected. Then somebody tells me "there's a women-only workshop, why don't you submit your paper there?". If I submit my paper there and it gets accepted, how can I overcome the feeling of "my paper wasn't good enough for a men's conference, but for a woman's paper it was sufficient"?</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
For the same reason, I'm uncomfortable with affirmative action. If I'm a woman applying for a job somewhere and I find out that they prefer women, I might assume that there was a man who was more talented/adequate than me but they settled for me because I was a woman. If that's true, it is also unfair for that man. In general, I want my work to be judged solely based on its quality, preferably without taking gender into consideration, for better and for worse.<br />
<br />
I know I'm presenting a naive approach and that in practice, gender plays a role, even if subconsciously. I also don't really have a better solution for that, but I do hope that if we take care of all the other reasons I discussed, this distribution will eventually change naturally. </div>
</div>
<div dir="ltr" style="text-align: left;">
<div style="text-align: left;">
<script async="" charset="utf-8" src="//platform.twitter.com/widgets.js"></script>
</div>
</div>
<div dir="ltr" style="text-align: left;">
<b style="text-align: left;"><br /></b></div>
<div dir="ltr" style="text-align: left;">
<b style="text-align: left;">3. Statistics and Bias</b></div>
<div dir="ltr" style="text-align: left;">
</div>
<div style="direction: ltr; text-align: left;">
<div style="text-align: left;">
Last year there was an interesting paper [<a href="https://arxiv.org/abs/1606.06121">6</a>], followed by a lengthy discussion, about gender stereotypes in word embeddings. <a href="https://veredshwartz.blogspot.co.il/2016/01/representing-words.html">Word embeddings</a> are trained with the objective of capturing meaning through co-occurrence statistics. In other words, words that often occur next to the same neighboring words in a text corpus are optimized to be close together in the vector space. Word embeddings have proved to be extremely useful for many downstream NLP applications.<br />
<br />
The problem that this paper presented was that these word embeddings also capture "bad" statistics, for example gender stereotypes with regard to professions. For instance, word embeddings have a nice property of capturing analogies like <i>"man:king :: woman:queen"</i>, but these analogies also include gender stereotypes like <i>"father:doctor :: mother:nurse", "man:computer programmer :: woman:homemaker"</i>, and <i>"he:she :: pilot:flight attendant"</i>.<br />
<br />
Why this happens is pretty obvious: word embeddings are not trained to capture "truth", only statistics. If most nurses are women, then <i>nurse</i> will occur in the corpus next to the same context words as feminine words, resulting in a higher similarity between <i>nurse </i>and <i>woman </i>than between <i>nurse </i>and <i>man</i>. In other words, if the input corpus reflects the stereotypes and biases of society, so will the word embeddings.<br />
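To make this concrete, the analogy arithmetic can be sketched with tiny hand-crafted vectors. This is an illustrative toy of my own (two made-up dimensions loosely standing in for "royalty" and "gender"), not real trained embeddings, but the vector arithmetic and nearest-neighbor lookup are the same operations used with word2vec-style vectors:

```python
import numpy as np

# Toy 2-d "embeddings"; the dimensions loosely encode (royalty, gender).
# Real embeddings have hundreds of dimensions learned from co-occurrence counts.
vocab = {
    "man":      np.array([0.0,  1.0]),
    "woman":    np.array([0.0, -1.0]),
    "king":     np.array([1.0,  1.0]),
    "queen":    np.array([1.0, -1.0]),
    "prince":   np.array([0.8,  1.0]),
    "princess": np.array([0.8, -1.0]),
}

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    """Solve a:b :: c:? by vector arithmetic (b - a + c), then return
    the nearest neighbor, excluding the three query words as is standard."""
    target = vocab[b] - vocab[a] + vocab[c]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("man", "king", "woman"))     # queen
print(analogy("king", "queen", "prince"))  # princess
```

With real embeddings, the exact same arithmetic is what surfaces the stereotyped analogies above: the "gender direction" of the space carries both legitimate distinctions (king/queen) and learned societal biases (doctor/nurse).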
<br />
So why is this a problem, anyway? Don't we want word embeddings to capture the statistics of the real world, even the kind of statistics we don't like? If something should be bothering us, it is the bias in society, rather than the bias these word embeddings merely capture. Or in other words:<br />
<br />
<br />
<div style="direction: ltr; orphans: 2; text-align: left; text-indent: 0px; widows: 2;">
<blockquote class="twitter-tweet" data-lang="en" style="-webkit-text-stroke-width: 0px; color: black; font-family: "Times New Roman"; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
<div dir="ltr" lang="en">
<div style="margin: 0px;">
What's this deal with de-biasing embeddings? Shouldn't we rather let embeddings as is and aim at de-biasing society instead?</div>
</div>
<div style="margin: 0px;">
â Angeliki Lazaridou (@aggielaz) <a href="https://twitter.com/aggielaz/status/781872134490644480">September 30, 2016</a></div>
</blockquote>
<br />
I like this tweet because I was wondering just the same when I first heard about this work. The key concern about bias in word embeddings is that these vectors are commonly used in applications, and this might inadvertently amplify unwanted stereotypes. The example in the paper mentions web search aided by word embeddings. The scenario described is of an employer looking for an intern in computer science by searching for terms related to computer science, and the authors suggest that a LinkedIn page of a male researcher might be ranked higher in the results than that of a female researcher, since computer science terms are closer in the vector space to male names than to female names (because of the current bias). In this scenario, and in many other possible scenarios, the word embeddings are not just passively recording the gender bias, but might actively contribute to it!</div>
</div>
</div>
<div style="direction: ltr; text-align: left;">
<br />
Hal Daumé III wrote a blog post called "Language Bias and Black Sheep" about the topic, and suggested that the problem goes even deeper, since corpus co-occurrences don't always capture real-world co-occurrences, but rather statistics of things that are talked about more often:<br />
<br />
<i>"Which leads us to the "black sheep problem." We like to think that language is a reflection of underlying truth, and so if a word embedding (or whatever) is extracted from language, then it reflects some underlying truth about the world. The problem is that even in the simplest cases, this is super false."</i><br />
<br />
Prior to reading this paper (and the discussion and blog posts that followed it), I never realized that we are more than just passive observers of data; the work we do can actually help mitigate biases or inadvertently contribute to them. I think we should all keep this in mind and try to see whether our next work can have any positive or negative effect in that regard -- just like we try to avoid overfitting, cherry-picking, and annoying reviewer 2.<br />
<br />
<hr />
<b><span style="font-size: x-small;">References:</span></b><br />
<span style="font-size: x-small;">[1] <i>Cross-national patterns of gender differences in mathematics: A meta-analysis.</i> Else-Quest, Nicole M.; Hyde, Janet Shibley; Linn, Marcia C. Psychological Bulletin, Vol 136(1), Jan 2010, 103-127.</span><br />
<span style="font-size: x-small;">[2] <i>Sex Differences in Intrinsic Aptitude for Mathematics and Science?: A Critical Review</i>. Spelke, Elizabeth S. American Psychologist, Vol 60(9), Dec 2005, 950-958.</span><br />
<span style="font-size: x-small;">[3] <i>Women's underrepresentation in science: Sociocultural and biological considerations.</i> Ceci, Stephen J.; Williams, Wendy M.; Barnett, Susan M. Psychological Bulletin, Vol 135(2), Mar 2009, 218-261. </span><br />
<span style="font-size: x-small;">[4] <i>Why don't they want a male-dominated job? An investigation of young women who changed their occupational aspirations.</i> Pamela M. Frome, Corinne J. Alfeld, Jacquelynne S. Eccles, and Bonnie L. Barber. Educational Research And Evaluation Vol. 12 , Iss. 4,2006</span><br />
<span style="font-size: x-small;">[5] <i>Women, Gender and Work: What Is Equality and How Do We Get There?</i> Loutfi, Martha Fetherolf. International Labour Office, 1828 L. Street, NW, Washington, DC 20036, 2001.</span><br />
<span style="font-size: x-small;">[6] <i>Quantifying and Reducing Stereotypes in Word Embeddings.</i> Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016 ICML Workshop on #Data4Good: Machine Learning in Social Good Applications.</span><br />
<br /></div>
<div style="direction: ltr; text-align: left;">
<hr />
<span style="font-size: x-small;"><sup>*</sup><a href="https://en.wikipedia.org/wiki/Science,_technology,_engineering,_and_mathematics">STEM</a> = science, technology, engineering and mathematics</span></div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com6tag:blogger.com,1999:blog-9145120678290195131.post-54917025843604946962016-11-23T17:31:00.000+02:002016-11-23T17:31:15.820+02:00Antonymy<div dir="rtl" style="text-align: right;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/cKUvKE3bQlY/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/cKUvKE3bQlY?feature=player_embedded" width="320"></iframe></div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
In the Seinfeld episode "The Opposite", George says that his life is the opposite of everything he wanted it to be, and that every instinct he has is wrong. He decides to go against his instincts and do the opposite of everything. When the waitress asks whether to bring him his usual order, "tuna on toast, coleslaw, and a cup of coffee", he decides to have the opposite: "Chicken salad, on rye, untoasted. With a side of potato salad. And a cup of tea!". Jerry argues with him about what the opposite of tuna is, which according to him is salmon. So which one of them is right? If you ask me, neither salmon nor chicken salad is the opposite of tuna. There is no opposite of tuna. But this funny video demonstrates one of the biggest problems in the task of automatically detecting antonyms: even we humans are terrible at it!</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<b>It's a Bird, It's a Plane, It's Superman (not antonyms)</b></div>
<div dir="ltr" style="text-align: left;">
Many people would categorize a pair of words as opposites if they represent two mutually exclusive options/entities in the world, like <i>male</i> and <i>female</i>, <i>black</i> and <i>white</i>, and <i>tuna</i> and <i>salmon</i>. The intuition is clear when these two words <i>x </i>and <i>y</i> represent the <b>only</b> two options in the world. In set theory, it means that <i>y</i> is the negation/complement of <i>x</i>. In other words, everything in the world which is not <i>x</i> must be <i>y</i> (figure 1).</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicJpZCy16UnxiDie5lzuhe9En6L1HAPYntkilvOrMBeK6QvVf4Yn7jw_5GxjVVLOsj8kiQZ1mpRv0trs7EBhpdQRO8uFGi6UYj4iAnNxIJTu5FIErh_SC0jaA_mRRcVLC10KzqfQ7HvC4/s1600/negation.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicJpZCy16UnxiDie5lzuhe9En6L1HAPYntkilvOrMBeK6QvVf4Yn7jw_5GxjVVLOsj8kiQZ1mpRv0trs7EBhpdQRO8uFGi6UYj4iAnNxIJTu5FIErh_SC0jaA_mRRcVLC10KzqfQ7HvC4/s1600/negation.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1: <i>x</i> and <i>y</i> are the only options in the world U</td></tr>
</tbody></table>
<br />
<div dir="ltr" style="text-align: left;">
In this sense, <i>tuna</i> and <i>salmon</i> are not antonyms - they are actually more accurately defined as co-hyponyms: two words that share a common hypernym (<i>fish</i>). They are indeed mutually exclusive, as one cannot be both a <i>tuna </i>and a <i>salmon</i>. However, if you are not a <i>tuna</i>, you are not necessarily a <i>salmon</i>. You can be another type of fish (<i>mackerel</i>, <i>cod</i>...) or something else which is not a fish at all (e.g. <i>person</i>). See figure 2 for a set theory illustration. </div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_NPH_sYvVmgfNIDTVoTs52N88unwmUMNVBuWq8metIh6BT4MH6tW5OT9E-FfG7D-gm2Jz1ilklfeRLQ2TeuML3kVKW-ghrVH5YCR5dRcW1ZFFvuw4NEk8UqXdAp2Am1dWgkM6AeGzAgw/s1600/cohyponyms.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_NPH_sYvVmgfNIDTVoTs52N88unwmUMNVBuWq8metIh6BT4MH6tW5OT9E-FfG7D-gm2Jz1ilklfeRLQ2TeuML3kVKW-ghrVH5YCR5dRcW1ZFFvuw4NEk8UqXdAp2Am1dWgkM6AeGzAgw/s1600/cohyponyms.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 12.8px;">Figure 2: <i>salmon</i> and <i>tuna</i> are mutually exclusive, but not the only options in the world</td></tr>
</tbody></table>
<br />
</td></tr>
</tbody></table>
<div dir="ltr" style="text-align: left;">
Similarly, George probably had in mind that <i>tuna </i> and <i>chicken salad</i> are mutually exclusive options for sandwich fillings. He was probably right; a tuna-chicken salad sandwich sounds awful. But since there are other options for sandwich fillings (peanut butter, jelly, peanut butter and jelly...), these two can hardly be considered as antonyms, even if we define antonyms as complements within a <b>restricted</b> set of entities in the world (e.g. fish, sandwich fillings). I suggest the "it's a bird, it's a plane, it's superman" binary test for antonymy: if you have more than two options, it's not antonymy!</div>
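The test can even be written down as a tiny predicate. This is my own illustrative sketch, and it assumes we can enumerate the relevant universe of options up front, which is exactly the hard part in practice:

```python
def passes_binary_test(x, y, options):
    """x and y can be complementary antonyms only if they are
    the *only* two options in the (restricted) universe."""
    return {x, y} == set(options)

# Tuna and chicken salad are mutually exclusive sandwich fillings,
# but not the only ones, so they fail the test (co-hyponyms, not antonyms).
fillings = {"tuna", "chicken salad", "peanut butter", "jelly"}
print(passes_binary_test("tuna", "chicken salad", fillings))  # False

# Restricted to a two-option universe, the test passes.
print(passes_binary_test("day", "night", {"day", "night"}))   # True
```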
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<b>Wanted Dead or Alive (complementary antonyms)</b></div>
<div dir="ltr" style="text-align: left;">
What about <i>black</i> and <i>white</i>? These are two colors out of a wide range of colors in the world, failing the bird-plane-Superman test. However, if we narrow our world down to people's skin colors, these two may be considered as antonyms.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
Other examples for complementary antonyms are <i>day</i> and <i>night</i>, <i>republicans</i> and <i>democrats</i>, <i>dead</i> and <i>alive</i>, <i>true</i> and <i>false</i>, <i>stay</i> and <i>go</i>. As you may have noticed, they can be of different parts of speech (noun, adjective, verb), but the two words within each pair both share the same part of speech (comment if you can think of a negative example!).</div>
<div dir="ltr" style="text-align: left;">
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjV9Dd4adV88Ug0AOu9E4_56zRvAZdP74lHnwidSlX4EKmjevKlYNyu9Qcybg6iTjwgwYIUuKvmZoSO-XZZuEefguW_6S9zp2QIrtC9UpVXxNDgAsDfZ8wBbLq8RwRS5F0M94mPdb4ZrSQ/s1600/go_or_stay.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjV9Dd4adV88Ug0AOu9E4_56zRvAZdP74lHnwidSlX4EKmjevKlYNyu9Qcybg6iTjwgwYIUuKvmZoSO-XZZuEefguW_6S9zp2QIrtC9UpVXxNDgAsDfZ8wBbLq8RwRS5F0M94mPdb4ZrSQ/s1600/go_or_stay.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 3: Should I stay or should I go now?</td></tr>
</tbody></table>
<br />
So are we cool with complementary antonyms? Well, not quite. If you say that <i>female</i> and <i>male</i> are complementary antonyms, people might tell you that gender is not binary, but a spectrum. Some of these antonym pairs actually have other, uncommon or hidden options, like <i>in a coma</i> for the <i>dead</i> and <i>alive</i> pair, or <i>libertarians</i> in addition to <i>republicans</i> and <i>democrats</i>. Still, these pairs are commonly considered antonyms, since there are two <b>main </b>options.<br />
<br />
So what have we learned about complementary antonyms? That they are borderline, they depend on the context in which they occur, and they might be offensive to minorities. Use them with caution.<br />
<br />
<b>The Good, the Bad [and the Ugly?] (graded antonyms)</b><br />
Even the strictest definition of antonymy includes pairs of gradable adjectives representing the two ends of a scale. Some examples are <i>hot </i>and <i>cold</i>, <i>fat</i> and <i>skinny</i>, <i>young</i> and <i>old</i>, <i>tall</i> and <i>short</i>, <i>happy</i> and <i>sad</i>. Set theory and my binary test aren't suitable for these types of antonyms.<br />
<br />
Set theory isn't adequate because a gradable adjective can't be represented as a set, e.g. "the set of all tall people in the world". The definition of a graded adjective changes depending on the context and is very subjective. For example, I'm relatively short, so everyone looks tall to me, while my husband is much taller than me, so he is more likely to say someone is short. The set of tall people in the world changes according to the person who defines it.<br />
<br />
In addition, by definition, testing for binarism fails. A cup of coffee can be more than just <i>hot </i>or <i>cold</i>. It can be <i>boiling</i>, <i>very hot</i>, <i>hot</i>, <i>warm</i>, <i>cool</i>, <i>cold</i> or <i>freezing</i>. And we can add more and more discrete options to the scale of coffee temperature.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYdgyp9N3P7lTu0kXniJZz9poYwjvh5_b7uYV2vYvKvTNx0Pe1E_-YP77cp152TnLISiplHFRsDYNLIy3kKtrFsa5HiGn8nnp2JMaykZ5Ubef0cC0ixz1MOIiGEhz1-J7sxLjGpYViv0A/s1600/hot_and_cold.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYdgyp9N3P7lTu0kXniJZz9poYwjvh5_b7uYV2vYvKvTNx0Pe1E_-YP77cp152TnLISiplHFRsDYNLIy3kKtrFsa5HiGn8nnp2JMaykZ5Ubef0cC0ixz1MOIiGEhz1-J7sxLjGpYViv0A/s320/hot_and_cold.jpg" width="320" /></a></div>
<br />
What makes specific pairs of gradable adjectives into antonyms? While the definition requires that they be at the two ends of the scale, intuitively I would say that they only need to be symmetric on the scale, e.g. <i>hot</i> and <i>cold</i>, <i>boiling</i> and <i>freezing</i>, <i>warm</i> and <i>cool</i>, but not <i>hot</i> and <i>freezing</i>.<br />
<br />
<b>Antonymy in NLP</b><br />
While there is a vast linguistics literature about antonyms, I'm less familiar with it, and I'm going to focus on some observations and interesting points about antonymy that appear in NLP papers that I read.<br />
<br />
The natural logic formulation ([1]) makes a distinction between "alternation" - words that are mutually exclusive - and "negation" - words that are both mutually exclusive and cover all the options in the world. While I basically claimed in this post that the former is not antonymy, we've seen that in some cases, if the two words represent the two main options, they may be considered antonyms.<br />
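The alternation/negation distinction can be illustrated with toy set denotations (the domain, entities, and words here are invented for illustration, not taken from [1]):

```python
# Toy denotations: each word picks out a set of entities in a tiny domain.
domain = {"rex", "fido", "felix", "tweety"}
dogs = {"rex", "fido"}
cats = {"felix"}
non_dogs = {"felix", "tweety"}

def alternation(x, y):
    """Mutually exclusive: no entity belongs to both sets."""
    return not (x & y)

def negation(x, y, dom):
    """Mutually exclusive AND jointly exhaustive over the domain."""
    return alternation(x, y) and (x | y) == dom

alternation(dogs, cats)           # True: nothing is both a dog and a cat
negation(dogs, cats, domain)      # False: tweety is neither
negation(dogs, non_dogs, domain)  # True: "dog"/"non-dog" covers everything
```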
<br />
However, people tend to disagree on these borderline word pairs, so sometimes it's easier to conflate them under a looser definition. For example, [2] ran an annotation task in which they asked crowdsourcing workers to choose the semantic relation that holds for a pair of terms. They followed the natural logic relations, but decided to merge "alternation" and "negation" into a weaker notion of "antonyms".<br />
<br />
More interesting observations about antonyms, and references to linguistic papers, can be found in [3], [4], and [5].<br />
<br />
Now that we've established that humans find it difficult to decide whether two words are antonyms, you must be wondering whether automatic methods can do reasonably well on this task. There has been a lot of work on antonymy identification (see the papers in the references, and their related work sections). I will focus on my own limited experience with antonyms. We've just published a new paper ([6]) in which we analyze the roles of the two main information sources used for automatic identification of semantic relations. The task is defined as follows: given a pair of words <i>(x, y)</i>, determine which semantic relation holds between them, if any (e.g. synonymy, hypernymy, antonymy, etc.). As in <a href="https://veredshwartz.blogspot.co.il/2016/05/improving-hypernymy-detection.html">this post</a>, we've used information from <i>x</i> and <i>y</i>'s <i>joint</i> occurrences in a large text corpus, as well as information about the <i>separate</i> occurrences of each of <i>x</i> and <i>y</i>. We found that among all the semantic relations we tested, antonymy was almost the hardest to identify (only synonymy was harder).<br />
<br />
The use of information about separate occurrences of <i>x</i> and <i>y</i> is based on the <a href="https://en.wikipedia.org/wiki/Distributional_semantics#Distributional_hypothesis">distributional hypothesis</a>, which I've mentioned several times in this blog. Basically, the distribution of a word <i>x</i>'s neighboring words may tell us something about the meaning of <i>x</i>. If we'd like to know what the relation between <i>x</i> and <i>y</i> is, we can compute something on top of the neighbor distributions of the two words. For example, we can expect the distributions of <i>x</i> and <i>y</i> to be similar if <i>x</i> and <i>y</i> are antonyms, since one of the properties of antonyms is that they are interchangeable (a word can be replaced with its antonym and the sentence will remain grammatical and meaningful). Think about replacing <i>tall</i> with <i>short</i>, or <i>day</i> with <i>night</i>. The problem is that the same is true for synonyms - you can expect <i>high</i> and <i>tall</i> to also appear with similar neighboring words. So basing the classification on distributional information may lead to confusing antonyms with synonyms.<br />
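As a minimal sketch of this idea (the two-sentence corpus is invented), we can count each word's neighbors and compare the resulting count vectors with cosine similarity. Note how the antonyms <i>tall</i> and <i>short</i> end up with identical neighbor distributions here, which is exactly the confusion described above:

```python
import math
from collections import Counter

def neighbor_counts(word, corpus, window=2):
    """Count the words co-occurring with `word` within a +/-window context."""
    counts = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == word:
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j != i:
                        counts[sent[j]] += 1
    return counts

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# A tiny invented corpus: the antonyms share all their neighbors.
corpus = [["the", "tall", "man", "walked"],
          ["the", "short", "man", "walked"]]
cosine(neighbor_counts("tall", corpus), neighbor_counts("short", corpus))  # 1.0
```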
<br />
The joint occurrences may help identify the relation that holds between the words in a pair, as some patterns indicate a certain semantic relation - for instance, "<i>x</i> is a type of <i>y</i>" may indicate that <i>y</i> is a hypernym of <i>x</i>. The problem is that patterns that are indicative of antonymy, such as "either <i>x</i> or <i>y</i>" (either <i>cold</i> or <i>hot</i>) and "<i>x</i> and <i>y</i>" (<i>day</i> and <i>night</i>), may also be indicative of co-hyponymy (either <i>tuna</i> or <i>chicken salad</i>). In any case, this confusion is far less harmful than confusing antonyms with synonyms; in some applications it may suffice to know that <i>x</i> and <i>y</i> are mutually exclusive, regardless of whether they are antonyms or co-hyponyms. For instance, when you query a search engine, you'd like it to retrieve results including synonyms of your search query (e.g. returning <i>New York City subway map</i> when you search for <i>NYC subway map</i>), but you wouldn't want it to include mutually exclusive words (e.g. <i>Tokyo subway map</i>).<br />
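A minimal sketch of such pattern-based evidence (the pattern inventory and relation labels are invented for illustration, and far from exhaustive):

```python
import re

# Hypothetical indicative patterns per relation. Note that the "either x or y"
# pattern cannot distinguish antonymy from co-hyponymy.
PATTERNS = {
    "hypernymy": [r"{x} is a type of {y}"],
    "antonymy/co-hyponymy": [r"either {x} or {y}", r"{x} and {y}"],
}

def match_patterns(x, y, sentence):
    """Return the relations whose patterns fire for (x, y) in the sentence."""
    hits = []
    for rel, pats in PATTERNS.items():
        for p in pats:
            if re.search(p.format(x=re.escape(x), y=re.escape(y)), sentence):
                hits.append(rel)
    return hits

match_patterns("cold", "hot", "the coffee was either cold or hot")
# ['antonymy/co-hyponymy'] -- the same pattern would fire for co-hyponyms too
```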
<br />
One last thing to remember is that these automatic methods are trained and tested on data collected from humans. If we can't agree on what's considered antonymy, we can't expect these automatic methods to do any better than we do.</div>
<hr />
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<b><span style="font-size: x-small;">References</span></b></div>
<div dir="ltr" style="text-align: left;">
<span style="font-size: x-small;"><br /></span></div>
<div dir="ltr" style="text-align: left;">
<span style="font-size: x-small;">[1] <i>Natural Logic for Textual Inference.</i> Bill MacCartney and Christopher D. Manning. RTE 2007.</span><br />
<span style="font-size: x-small;">[2] <i>Adding Semantics to Data-Driven Paraphrasing.</i> Ellie Pavlick, Johan Bos, Malvina Nissim, Charley Beller, Benjamin Van Durme, and Chris Callison-Burch. ACL 2015.</span></div>
<div dir="ltr" style="text-align: left;">
<span style="font-size: x-small;">[3] <i>Computing Word-Pair Antonymy.</i> Saif Mohammad, Bonnie Dorr and Graeme Hirst. EMNLP 2008.</span><br />
<span style="font-size: x-small;">[4] <i>Computing Lexical Contrast.</i> Saif Mohammad, Bonnie Dorr, Graeme Hirst, and Peter Turney. CL 2013.</span><br />
<span style="font-size: x-small;">[5] <i>Taking Antonymy Mask off in Vector Space.</i> Enrico Santus, Qin Lu, Alessandro Lenci, Chu-Ren Huang. PACLIC 2014.</span><br />
<span style="font-size: x-small;">[6] <i>Path-based vs. Distributional Information in Recognizing Lexical Semantic Relations.</i> Vered Shwartz and Ido Dagan. CogALex 2016.</span></div>
<div dir="ltr" style="text-align: left;">
<br /></div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com4tag:blogger.com,1999:blog-9145120678290195131.post-77923649946837143942016-11-12T19:26:00.001+02:002016-11-23T00:07:48.955+02:00Question Answering<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: left;">
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">In my <a href="https://veredshwartz.blogspot.co.il/2015/07/natural-language-processing.html">introductory post about NLP</a> I introduced the following </span><span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">survey question: when you search for something in Google (or any other search engine of your preference), is your query:</span></span><br />
<span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">(1) a full question, such as "What is the height of Mount Everest?"</span></span><br />
<span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">(2) composed of keywords, such as "height Everest"</span></span><br />
<span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;"><br /></span></span>
<span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">I never published the results since, as I suspected, there were too few answers to the survey, and they were probably not representative of the entire population. However, my intuition back then was that mostly older people search with a full grammatical question, while more tech-savvy people use keywords. Since then, my intuition has been somewhat supported by (a) <a href="https://www.theguardian.com/uk-news/2016/jun/16/grandmother-nan-google-praises-search-thank-you-manners-polite">this lovely grandma</a> who added "please" and "thank you" to her search queries, and (b) <a href="https://www.aclweb.org/anthology/N/N16/N16-1081.pdf">this paper from Yahoo Research</a> showing that search queries with question intent do not form fully syntactic sentences, but are made of segments (e.g. [height] [Mount Everest]). </span></span><br />
<span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;"><br /></span></span>
<span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">Having said that, searching the web to get an answer to a question is not quite the same as actually asking the question and getting a precise answer:</span></span><br />
<span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;"><br /></span></span>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><i>Here's the weird thing about search engines. It was like striking oil in a world that hadn't invented internal combustion. Too much raw material. Nobody knew what to do with it. </i></span><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br />- <a href="https://en.wikipedia.org/wiki/Ex_Machina_(film)">Ex Machina</a></span><br />
<div>
<br /></div>
<span style=" font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">It's not enough to formulate your question in a way that the search engine will have any chance of retrieving relevant results. Now you need to process the returned documents and search for the answer. </span></span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhywY7bIOkelHmnj_l7evGKX3_padN2_CIxaXjccn_aAUyI0_fKYzaqyMeMLGr8TTwBei7BXqslMSUovIbYTKap2dPoPaim6Q3KOmz0xbljZALSnbaDklnH76y4Y58N2NN-TrlPpCyUBn0/s1600/lmgtfy.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><span style="color: black;"><img border="0" height="186" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhywY7bIOkelHmnj_l7evGKX3_padN2_CIxaXjccn_aAUyI0_fKYzaqyMeMLGr8TTwBei7BXqslMSUovIbYTKap2dPoPaim6Q3KOmz0xbljZALSnbaDklnH76y4Y58N2NN-TrlPpCyUBn0/s320/lmgtfy.PNG" width="320" /></span></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: xx-small;"><span style="font-family: "roboto" , "arial" , sans-serif; text-align: left;">Getting an answer to a question by querying a search engine is not trivial; I guess this is the reason so many people ask questions in social networks, and some other people insult them with </span><a href="http://lmgtfy.com/" style="font-family: roboto, arial, sans-serif; text-align: left;">Let me Google that for you</a><span style="font-family: "roboto" , "arial" , sans-serif; text-align: left;">. </span></span></td></tr>
</tbody></table>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">The good news is that there are question answering systems, designed to do exactly that: automatically answer a question given as input; the bad news is that like most semantic applications in NLP, it is an extremely difficult task, with limited success. </span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">Question answering systems have been around since the 1960s. Originally, they were developed to support natural language </span><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">queries to</span><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"> databases, before web search was available. Later, question answering systems were able to find and extract answers from free text.</span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><span style="font-size: 13px;">A successful example of a question answering system is </span><a href="https://en.wikipedia.org/wiki/Watson_(computer)" style="font-size: 13px;">IBM Watson</a><span style="font-size: 13px;">. Today Watson is <a href="http://www.ibm.com/watson/">described</a> by IBM as "a cognitive technology that can think like a human", and is used in many of IBM's projects, not just for question answering. Originally, it was trained to answer natural language questions -- or more precisely, to form the correct question to a given answer, as in the </span></span><span style="font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">television game show <a href="https://en.wikipedia.org/wiki/Jeopardy!">Jeopardy</a>. In </span></span><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">February 2011, Watson <a href="https://youtu.be/YgYSv2KSyWg">competed in </a></span><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><a href="https://youtu.be/YgYSv2KSyWg">Jeopardy</a> against </span><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">former winners of the show, and won! It had access to millions of web pages, including Wikipedia, which were processed and saved before the game. During the game, it wasn't connected to the internet (so it couldn't use a search engine, for example). The Jeopardy video is pretty cool, but if you have no patience to watch it all (I understand you...), here's a highlight:</span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<br />
<div style="background-color: white; font-family: arial, helvetica, sans-serif; font-size: 13px; text-align: justify;">
HOST: This trusted friend was the first non-dairy powdered creamer. Watson?</div>
<div style="background-color: white; font-family: arial, helvetica, sans-serif; font-size: 13px; text-align: justify;">
WATSON: What is milk?</div>
<div style="background-color: white; font-family: arial, helvetica, sans-serif; font-size: 13px; text-align: justify;">
HOST: No! That wasn't wrong, that was really wrong, Watson.</div>
<br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: 13px;">Another example is the personal assistants: Apple's <a href="http://www.apple.com/ios/siri/">Siri</a>, Amazon's <a href="http://alexa.amazon.com/spa/index.html">Alexa</a>, Microsoft's <a href="https://support.microsoft.com/en-us/help/17214/windows-10-what-is">Cortana</a>, and <a href="https://assistant.google.com/">Google Assistant</a>. They are capable of answering an impressively wide range of questions, but it seems they are often manually designed to <a href="http://www.cheatsheet.com/gear-style/20-questions-to-ask-siri-for-a-hilarious-response.html/?a=viewall">answer specific questions</a>.</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVZoADZpYCW6KOaCxjXw5CYSXtx86QtI5efVmJ5P7y_Ix63FzC6sK3NQf9XH0SHBOI3SmxUZd6fnFUl5xKgClbU6UD7ipDTeQN5xPn-nYRaOLJR8cQq2fAKhi2Rmhkmdq-7tqdevNXwAs/s1600/qa_flow.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><span style="color: black;"><br /></span></a></div>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">So how does question answering work? I assume that each question answering system employs a somewhat different architecture, and some of the successful ones are proprietary. I'd like to present two approaches. The first is a general architecture for question answering from the web, and the second is question answering from knowledge bases.</span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;"><b><u>Question answering from the web</u></b></span></span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">I'm following a project report I submitted to a course 3 years ago, in which I exemplified this process on the question <i>"When was Mozart born?"</i>. This example was originally taken from some other paper, which is hard to trace now. Apparently, it is a <a href="https://scholar.google.co.il/scholar?hl=en&q=%22when+was+mozart+born%22&btnG=&as_sdt=1%2C5&as_sdtp=">popular example</a> in this field.</span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">The system performs the following steps:</span><br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj88jfiWz1JSHju-x3ROdDdkKVxKs6thCWF-uuR2HjjkaT7Hx2_iVDsrLccGn3k5jPqT188rlkJsa6NRadWG77YEUxNQS3f-s0VZYz6PE360mdhWcUtEkbb2ETBfkEvUzEJKBJ3TNXLdf4/s1600/qa_flow.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><span style="color: black;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj88jfiWz1JSHju-x3ROdDdkKVxKs6thCWF-uuR2HjjkaT7Hx2_iVDsrLccGn3k5jPqT188rlkJsa6NRadWG77YEUxNQS3f-s0VZYz6PE360mdhWcUtEkbb2ETBfkEvUzEJKBJ3TNXLdf4/s640/qa_flow.png" width="310" /></span></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A possible architecture for a question answering system. </td></tr>
</tbody></table>
<ul style="text-align: left;">
<li><b style="font-family: roboto, arial, sans-serif; font-size: 13px;">Question analysis</b><b style="font-family: roboto, arial, sans-serif; font-size: 13px;"> - </b><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">parse the natural language question, and extract some properties:</span></li>
<br />
<ul style="font-family: roboto, arial, sans-serif; font-size: 13px;">
<li><u>Question type</u> - mostly, QA systems support factoid questions (a question whose answer is a fact, as in the given example). Other types of questions, e.g. opinion questions, will be discarded at this point.</li>
<br />
<li><u>Answer type</u> - what is the type of the expected answer, e.g. person, location, date (as in the given example), etc. This can be inferred with simple heuristics using the WH-question word, for example <i>who => person, where => location, when => date. </i></li>
<br />
<li><u>Question subject and object</u> - can be extracted easily by using a <a href="https://veredshwartz.blogspot.co.il/2016/06/linguistic-analysis-of-texts.html">dependency parser</a>. These can be used in the next step of building the query. In this example, the subject is <i>Mozart.</i><i><br /></i></li>
<br />
</ul>
<li><span style="font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;"><b>Search</b> - prepare the search query, and retrieve documents from the search engine. The query can be an expected answer template (which is obtained by applying some transformation to the question), e.g. <i>"Mozart was born in *"</i>. Alternatively, or in case the answer template retrieves no results, the query can consist of keywords (e.g. <i>Mozart</i>, <i>born</i>).<br /><br />Upon retrieving documents (web pages) that answer the query, the system focuses on certain passages that are more likely to contain the answer ("candidate passages"). These are usually ranked according to the number of query words they contain, their word similarity to the query/question, etc.</span></span></li>
<br />
<li><span style="font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;"><b>Answer extraction</b> - try to extract candidate answers from the candidate passages. This can be done by using <a href="https://en.wikipedia.org/wiki/Named-entity_recognition">named entity recognition</a> (NER) that identifies in the text mentions of people, locations, organizations, dates, etc. Every mention whose entity type corresponds to the expected answer type is a candidate answer. In the given example, any entity recognized as DATE in each candidate passage will be marked as a candidate answer, including <i>"27 January 1756"</i> (the correct answer) and <i>"5 December 1791"</i> (Mozart's death date).<br /><br />The system may also keep some lists that can be used to answer closed-domain questions, such as <i>"which city [...]"</i> or <i>"which color [...]"</i> that can be answered using a list of cities and a list of colors, respectively. If the system identified that the answer type is <i>color</i>, for example, it will search the candidate passage for items contained in the list of colors. In addition, for "how much" and "how many" questions, regular expressions identifying numbers and measures can be used.</span></span></li>
<br />
<li><span style="font-family: "roboto" , "arial" , sans-serif;"><b style="font-size: 13px;">Ranking</b><span style="font-size: 13px;"> - assign some score for each candidate answer, rank the candidate answers in descending order according to their scores, and return a list of ranked answers. This phase differs between systems. The simple approach would be to represent an answer by some characteristics (e.g. surrounding words) and learn a supervised classifier to rank the answers.</span><br /><br /><span style="font-size: 13px;">An alternative approach is to try to "prove" the answer logically. In the first phase, the system creates an expected answer template. In our example it would be </span><i style="font-size: 13px;">"Mozart was born in *"</i><span style="font-size: 13px;">. By assigning the candidate answer </span><i style="font-size: 13px;">"27 January 1756" </i><span style="font-size: 13px;">to the expected answer template, we get the hypothesis </span><i style="font-size: 13px;">"Mozart was born in <i>27 January 1756</i>"</i><span style="font-size: 13px;">, which we would like to prove from the candidate passage. Suppose that the candidate passage was </span><i style="font-size: 13px;">"[...] Wolfgang Amadeus Mozart was born in Salzburg, Austria, in January 27, 1756. 
[...]"</i><span style="font-size: 13px;">, a person would know that given the candidate passage, the hypothesis is true, therefore this candidate answer should be ranked high.</span><br /><br /><span style="font-size: 13px;">To do this automatically, Harabagiu and Hickl ([1]) used a </span><a href="https://en.wikipedia.org/wiki/Textual_entailment" style="font-size: 13px;">textual entailment</a><span style="font-size: 13px;"> system: such a system receives two texts and determines whether, if the first text (</span><i style="font-size: 13px;">text</i><span style="font-size: 13px;">) is true, the second one (</span><i style="font-size: 13px;">hypothesis</i><span style="font-size: 13px;">) is also true. Some of these systems return a number, indicating to what extent this is true. This number can be used for ranking answers.<br /><br />While this is a pretty cool idea, the unfortunate truth is that textual entailment systems do not perform better than question answering systems, nor very well in general. So reducing the question answering problem to that of recognizing textual entailment doesn't really solve question answering. </span></span></li>
<br />
</ul>
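The question-analysis and answer-extraction steps above can be sketched as follows; the WH-word table, type names, and candidate list are illustrative heuristics, not an actual system:

```python
# Map the WH-question word to an expected answer type (simple heuristic).
ANSWER_TYPE = {"who": "PERSON", "where": "LOCATION", "when": "DATE"}

def analyze(question):
    """Return the expected answer type of a factoid question."""
    wh = question.strip().lower().split()[0]
    return ANSWER_TYPE.get(wh, "UNKNOWN")

def filter_candidates(question, candidates):
    """Keep only the NER mentions whose type matches the expected answer type."""
    expected = analyze(question)
    return [text for text, ner_type in candidates if ner_type == expected]

# Hypothetical (mention, NER type) pairs extracted from candidate passages.
candidates = [("27 January 1756", "DATE"),
              ("5 December 1791", "DATE"),
              ("Salzburg", "LOCATION")]
filter_candidates("When was Mozart born?", candidates)
# ['27 January 1756', '5 December 1791'] -- ranking must still pick the right one
```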
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><b><u>Question answering from knowledge bases</u></b></span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><b><u><br /></u></b></span>
<span style="font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">A knowledge base, such as Freebase/<a href="http://www.wikidata.org/">Wikidata</a> and <a href="http://dbpedia.org/">DBPedia</a>, is a large-scale set of facts about the world in a machine-readable format. Entities are related to each other via relations, creating triplets like <i>(Donald Trump, spouse, Melania Trump)</i> and <i>(idiocracy, instance of, film)</i> (no association between the two facts whatsoever ;)). </span></span><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">Entities can be people, books and movies, countries, etc. Example relations are </span><i style="font-family: roboto, arial, sans-serif; font-size: 13px;">birth place, spouse, occupation, instance of</i><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">, etc. While these facts are saved in a format which is easy for a machine to read, I never heard of a human who searches for information in knowledge bases. Which is too bad, since they contain an abundance of information.</span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">So some researchers (e.g. [2], following [3]) came up with the great idea of letting people ask a question in natural language (e.g. </span><i style="font-family: roboto, arial, sans-serif; font-size: 13px;">"When was Mozart born?"</i><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">), parsing the question automatically to relate it to a fact in the knowledge base, and answering accordingly.</span><br />
<span style="font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">This reduces the question answering task to understanding the natural language question, whereas querying for the answer from a knowledge base requires no text processing. The task is called </span></span><span style="font-family: "roboto" , "arial" , sans-serif;"><b style="font-size: 13px;">executable semantic parsing</b><span style="font-size: 13px;">. The natural language question is mapped into some logic representation, e.g. </span><a href="https://en.wikipedia.org/wiki/Lambda_calculus" style="font-size: 13px;">Lambda calculus</a><span style="font-size: 13px;">. For example, the example question would be parsed to something like <b>λx.DateOfBirth(Mozart, x)</b>. The logical form is then </span><span style="font-size: 13px;">executed against a knowledge base; for instance, it would search for a fact such as <i>(Mozart, DateOfBirth, x)</i> and return x. </span></span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif;"><span style="font-size: 13px;">Despite having the answer appear in a structured format rather than in free text, this task is still considered hard, because parsing a natural language utterance into a logical form is difficult.* </span></span></div>
</div>
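Executing the logical form against a knowledge base can be sketched like this (the triples and relation names are invented for illustration, not actual Freebase/Wikidata identifiers):

```python
# A toy knowledge base of (subject, relation, object) triples.
KB = [("Mozart", "DateOfBirth", "27 January 1756"),
      ("Mozart", "PlaceOfBirth", "Salzburg"),
      ("Donald Trump", "Spouse", "Melania Trump")]

def execute(subject, relation, kb):
    """Execute a logical form like λx.DateOfBirth(Mozart, x) against the KB."""
    return [o for (s, r, o) in kb if s == subject and r == relation]

# "When was Mozart born?"  ->  λx.DateOfBirth(Mozart, x)
execute("Mozart", "DateOfBirth", KB)  # ['27 January 1756']
```

The hard part, of course, is the mapping from the question to the logical form; once it is obtained, execution is a straightforward lookup.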
<div dir="ltr" style="text-align: left;">
<br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: 13px;">By the way, simply asking Google</span> <i style="font-family: roboto, arial, sans-serif; font-size: 13px;">"When was Mozart born?" </i><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">seems to take away my argument that "</span><span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">searching the web to get an answer to a question is not quite the same as actually asking the question and getting a precise answer":</span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDCHz7jYvIIaXJkM4M0zfiQt0xuDFzjr5gjDiydIuo81fpDiJxwgpPDvReYvE3ra-hBFyRinZxwvMWnSXqGEvy2_Risg66dn8OgweVEGqFitWHoWA8VaIjW8iIHDkUoUGc5De23y-ppsg/s1600/google.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><span style="color: black;"><img border="0" height="442" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDCHz7jYvIIaXJkM4M0zfiQt0xuDFzjr5gjDiydIuo81fpDiJxwgpPDvReYvE3ra-hBFyRinZxwvMWnSXqGEvy2_Risg66dn8OgweVEGqFitWHoWA8VaIjW8iIHDkUoUGc5De23y-ppsg/s640/google.PNG" width="640" /></span></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Google understands the question and answers precisely.</td></tr>
</tbody></table>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span>
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;">Only that it doesn't. Google added this feature to its search engine in 2012: for some queries and questions, it presents information boxes above the regular search results. They parse the natural language query and try to retrieve results from their huge knowledge base, known as the <a href="https://googleblog.blogspot.co.il/2012/05/introducing-knowledge-graph-things-not.html">Google Knowledge Graph</a>. Well, I don't know exactly how they do it, but I guess that, similarly to the previous paragraph, their main effort is in parsing and understanding the query, which can then be matched against facts in the graph.</span><br />
<span style="font-family: "roboto" , "arial" , sans-serif; font-size: 13px;"><br /></span></div>
<hr dir="ltr" style="text-align: left;" />
<div dir="ltr" style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: x-small;"><u>References</u>:</span></div>
<div dir="ltr" style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: x-small;">[1] <i>Methods for Using Textual Entailment in Open-Domain Question Answering.</i> Sanda Harabagiu and Andrew Hickl. In COLING-ACL 2006.</span></div>
<div dir="ltr" style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif; font-size: x-small;">[2] <i>Semantic Parsing on Freebase from Question-Answer Pairs.</i> Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. In EMNLP 2013.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: x-small;">[3] <i>Learning to parse database queries using inductive logic programming. </i>John M. Zelle and Raymond J. Mooney. In AAAI 1996.</span></div>
<div>
<hr />
<div dir="ltr" style="text-align: left;">
<span style="font-size: x-small;">* If you're interested in more details, I recommend going over the materials from the very interesting ESSLLI 2016 course on <a href="http://esslli2016.unibz.it/?page_id=356">executable semantic parsing</a>, which was given by Jonathan Berant.</span></div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com2tag:blogger.com,1999:blog-9145120678290195131.post-13922253901072285442016-08-28T12:06:00.000+03:002016-08-28T12:20:08.912+03:00Crowdsourcing (for NLP)<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
Developing new methods to solve scientific tasks is cool, but they usually require data. We researchers often find ourselves collecting data rather than trying to solve new problems. I've collected data for most of my papers, but never thought of it as an interesting blog post topic. Recently, I attended <a href="http://esslli2016.unibz.it/?page_id=346">Chris Biemann's excellent crowdsourcing course</a> at <a href="http://esslli2016.unibz.it/">ESSLLI 2016</a> (the 28th European Summer School in Logic, Language and Information), and was inspired to write about the topic. This blog post will be much less technical and much more high-level than the course, as my posts usually are. Nevertheless, credit for many interesting insights on the topic goes to Chris Biemann.<sup><a href="http://veredshwartz.blogspot.co.il/2016/08/crowdsourcing-for-nlp.html#1" name="top1">1</a> </sup>
</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<b><br />Who needs data anyway?</b><br />
So let's start from the beginning: what is this data and why do we need it? Suppose that I'm working on automatic methods to recognize the semantic relation between words, e.g. I want my model to know that <i>cat</i> is a type of <i>animal</i>, and that <i>wheel </i>is a part of a <i>car</i>.<br />
<br />
At the very basic level, if I already developed such a method, I will want to check <b>how well it does compared to humans</b>. Evaluation of my method requires annotated data, i.e. a set of word pairs and their corresponding true semantic relations, annotated by humans. This will be the "test set"; the human annotations are considered as "gold/true labels". My model will try to predict the semantic relation between each word-pair (without accessing the true labels). Then, I will use some evaluation metric (e.g. precision, recall, F1 or accuracy) to see how well my model predicted the human annotations. For instance, my model would have 80% accuracy if for 80% of the word-pairs it predicted the same relation as the human annotators.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQ4XL95VC9GxN6ktQZNFwVUk1nA06KwrCzyeOyN3RVxEeNUsBWnDqI0nGxARfotuoITkF4xWVRjJbc6ke5cB1qOP_RErwiayiI2dva4l8uRoO_7xhYjRFo4U9HUbLXuEPF-B08BFj38lI/s1600/dataset_example.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="126" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQ4XL95VC9GxN6ktQZNFwVUk1nA06KwrCzyeOyN3RVxEeNUsBWnDqI0nGxARfotuoITkF4xWVRjJbc6ke5cB1qOP_RErwiayiI2dva4l8uRoO_7xhYjRFo4U9HUbLXuEPF-B08BFj38lI/s320/dataset_example.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1: an example of dataset entries for recognizing the semantic relation between words.</td></tr>
</tbody></table>
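To make the accuracy computation concrete, here is a minimal Python sketch; the relation labels and label names are made up for illustration:

```python
def accuracy(predictions, gold_labels):
    # fraction of examples where the model's prediction
    # matches the human-annotated (gold) label
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

gold = ["hypernym", "meronym", "hypernym", "antonym", "hypernym"]
pred = ["hypernym", "meronym", "antonym", "antonym", "hypernym"]
print(accuracy(pred, gold))  # 0.8 -- the model agreed with the annotators on 4 of 5 pairs
```

The same idea extends to precision, recall, and F1; accuracy is simply the easiest metric to state in one line.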
If that were the only data I needed, I would have been lucky: you don't need that many examples to test your method. I could select some word-pairs (randomly or using some heuristics) and annotate them myself, or bribe my colleagues with cookies (as I have successfully done twice). The problem starts when you need training data, i.e., when you want your model to learn to predict something based on labelled examples. That usually requires many more examples, and annotating data is tiring, Sisyphean work.<br />
<br />
What should we do, then? Outsource the annotation process -- i.e., pay with real money, not cookies!<br />
<br />
<b><br />What is crowdsourcing?</b><br />
<br />
The word crowdsourcing is a blend word composed of <b>crowd (intelligence) + (out-)sourcing</b> [1]. The idea is to take a task that can be performed by experts (e.g. translating a document from English to Spanish), and outsource it to a large crowd of non-experts (<i>workers</i>) that can perform it.<br />
<br />
The <i>requester </i>defines the task, and the <i>workers</i> work on it. The requester then decides whether to accept or reject the work, and pays the workers whose work was accepted.<br />
<br />
The benefits of using "regular" people rather than experts are:<br />
<ol style="text-align: left;">
<li>You pay them much less than experts - typically a few cents per question (/<i>task</i>). For example, [2] found that in translation tasks, the crowd reached the same quality as the professionals at less than 12% of the cost.</li>
<li>They are more easily available via crowdsourcing platforms (see below).</li>
<li>By letting multiple people work on the task rather than a single/few experts, the task could be completed in a shorter time. </li>
</ol>
The obvious observation is that the quality of a single <i>worker</i> is not as good as that of an expert; in crowdsourcing, it is <b>not a single worker that replaces the expert, but the</b> <b>crowd</b>. Rather than trusting a single worker, you assign each task to a certain number of workers and combine their results. A common practice is majority voting. For instance, let's say that I ask 5 workers what the semantic relation between <i>cat </i>and <i>dog</i> is, giving them several options. Three of them say that <i>cat </i>and <i>dog </i>are mutually exclusive words (i.e. one cannot be both a cat and a dog), one says that they are opposites, and one says that <i>cat</i> is a type of <i>dog</i>. The majority has voted in favor of mutually exclusive, and this is what I will consider the correct answer.<sup><a href="http://veredshwartz.blogspot.co.il/2016/08/crowdsourcing-for-nlp.html#2" name="top2">2</a></sup><br />
<br />
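The majority-voting aggregation described above fits in a few lines of Python; the answer labels here are illustrative:

```python
from collections import Counter

def aggregate(answers):
    # majority vote over the workers' answers to a single question;
    # most_common breaks ties by order of first appearance
    return Counter(answers).most_common(1)[0][0]

# five workers answered the cat/dog question:
votes = ["exclusive", "opposite", "exclusive", "hyponym", "exclusive"]
print(aggregate(votes))  # exclusive
```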
The main crowdsourcing platforms (out of many others) are <a href="https://www.mturk.com/mturk/welcome">Amazon Mechanical Turk</a> and <a href="https://www.crowdflower.com/">CrowdFlower</a>. In this blog post I will not discuss the technical details of these platforms. If you are interested in a comparison between the two, refer to <a href="http://crowdsourcing-class.org/tutorial_slides/03-crowdsourcing_platforms.pdf">these slides</a> from the NAACL 2015 crowdsourcing tutorial.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiR-c6GFF2XL9lVQ2SHi-hXJFCres_-x3RjC6xwb5UeW0_yvL-6Pv4UzGR7JzLFNO-IK7yQvhrqfVZHIAT1jTbfnFIJZhREHCFT2EQarF8m3mUnIkvhSzbr-7OH0NOJ-a_L5YNoP0p9zlc/s1600/annotation.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="210" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiR-c6GFF2XL9lVQ2SHi-hXJFCres_-x3RjC6xwb5UeW0_yvL-6Pv4UzGR7JzLFNO-IK7yQvhrqfVZHIAT1jTbfnFIJZhREHCFT2EQarF8m3mUnIkvhSzbr-7OH0NOJ-a_L5YNoP0p9zlc/s400/annotation.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 2: An example of a question in Amazon Mechanical Turk, from my project.</td></tr>
</tbody></table>
<b><br />What can be crowdsourced?</b><br />
<br />
Not all the data we need can be collected via crowdsourcing; some data may require expert annotation. For example, if we need to annotate the syntactic trees of sentences in natural language, it's probably a bad idea to ask non-experts to do so.<br />
<br />
The rules of thumb for crowdsourcability are:<br />
<ul style="text-align: left;">
<li>The task is easy to explain, and you as a requester indeed explain it simply. The key idea is to <b>keep it simple</b>. The instructions should be short - do not expect workers to read a 50-page manual; they don't get paid enough for that - and should include examples.<br /></li>
<li>People can easily agree on the "correct" answer, e.g. <i>"is there a cat in this image?"</i> is good, <i>"what is the meaning of life?"</i> is really bad. Everything else is borderline :) One thing to consider is the possible number of correct answers. For instance, if the worker should reply with a sentence (e.g. "describe the following image"), they can do so in many different ways. Always aim for one possible answer per question.</li>
<li>Each question is relatively small.</li>
<li>Bonus: the task is fun. Workers will do better if they enjoy the task. If you can think of a way to <a href="https://en.wikipedia.org/wiki/Gamification">gamify</a> your task, do so!</li>
</ul>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiL_E3XKXzoQbXSNztq1AsoOWsIbIG4NUiolOCOcUtu2zaydGe-Qdw03ZfozS_bYaSaUZu1a_TvrGR5m4XPLtXNNnkEiBV9RUNylUjnkjngU5XjbaFa8XmQCbBK7x67vr7XzvRtszbpJ9Y/s1600/cat.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiL_E3XKXzoQbXSNztq1AsoOWsIbIG4NUiolOCOcUtu2zaydGe-Qdw03ZfozS_bYaSaUZu1a_TvrGR5m4XPLtXNNnkEiBV9RUNylUjnkjngU5XjbaFa8XmQCbBK7x67vr7XzvRtszbpJ9Y/s320/cat.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 3: Is there a cat in this image?</td></tr>
</tbody></table>
<div>
<br />
Some tasks are borderline and may become suitable for crowdsourcing if presented in the right way to the workers. If the task at hand seems too complicated to be crowdsourced, ask yourself: can I break it into smaller tasks that can each be crowdsourced? For example, let workers write a sentence that describes an image, and accept all answers; then let other workers validate the sentences (ask them: does this sentence really describe this image?).</div>
<b><br /></b>
<b>Some examples for (mostly language-related) data collected with crowdsourcing</b><br />
(references omitted, but are available in the course slides in the link above).<br />
<ul style="text-align: left;">
<li>Checking whether a sentence is grammatical or not.</li>
<li>Alignment of dictionary definitions - for instance, if a word has multiple meanings, and hence has multiple definitions in each dictionary - the task was to align the definitions corresponding to the same meaning in different dictionaries.</li>
<li>Translation.</li>
<li>Paraphrase collection - get multiple sentences with the same meaning. These were obtained by asking multiple workers to describe the same short video.</li>
<li><span style="color: red;"><a href="https://en.wikipedia.org/wiki/Duolingo#Crowdsourced_translation">Duolingo started as a crowdsourcing project!</a></span></li>
<li><span style="color: red;"><a href="https://en.wikipedia.org/wiki/ReCAPTCHA#Criticism">And so did reCAPTCHA!</a></span></li>
</ul>
<b>How to control for the quality of data?</b><br />
<br />
OK, so we collected a lot of data. How do we even know if it's good? Can I trust my workers to do well on the task? Could they be as good as experts? And what if they are just after my money, cheating on the task to get paid with minimal effort?<br />
<br />
There are many ways to control for the quality of workers:
<br />
<ol style="text-align: left;">
<li>The crowdsourcing platforms provide some information about the workers, such as the number of tasks they completed in the past, their <b>approval rate</b> (% of their tasks that were approved), location, etc. You can define your requirements from the workers based on this information.</li>
<li><b>Don't trust a single worker</b> -- define that your task should be answered by a certain number of workers (typically 5) and aggregate their answers (e.g. by majority voting).</li>
<li>Create <b>control questions</b> - a few questions for which you know the correct answer. These questions are displayed to the worker just like any other questions. If a worker fails to answer too many control questions, the worker is either not good or trying to cheat you. Don't use this worker's answers (and don't let the worker participate in the task anymore; either by rejecting their work or by blocking them).<sup><a href="http://veredshwartz.blogspot.co.il/2016/08/crowdsourcing-for-nlp.html#3" name="top3">3</a></sup></li>
<li>Create a <b>qualification test</b> - a few questions for which you know the correct answer. You can require that any worker who wants to work on your task must take the test and pass it. As opposed to the control questions, the test questions don't have to be identical in format to the task itself, but should predict the worker's ability to perform the task well.</li>
<li><b>Second-pass reviewing</b> - create another task in which workers validate previous workers' answers. </li>
<li><b>Bonus the good workers</b> - they will want to keep working for you.<b><br /></b></li>
<li><b>Watch out for spammers! </b>Some workers are only after your money, and they don't take your task seriously, e.g. they will click on the same answer for all questions. There is no correlation between the number of questions workers answer and their quality; however, it is worth looking at the most productive workers: some of them may be very good (and you might want to give them bonuses), while others may be spammers.</li>
</ol>
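A sketch of how control questions can filter out unreliable workers; the worker ids, answers, and the 80% threshold are all hypothetical:

```python
def reliable_workers(control_answers, gold, min_accuracy=0.8):
    # keep only workers who answered enough control questions correctly;
    # control_answers maps worker id -> answers aligned with the gold list
    keep = {}
    for worker, answers in control_answers.items():
        correct = sum(a == g for a, g in zip(answers, gold))
        if correct / len(gold) >= min_accuracy:
            keep[worker] = answers
    return keep

gold = ["yes", "no", "yes", "yes", "no"]
workers = {
    "w1": ["yes", "no", "yes", "yes", "no"],    # 5/5 correct -> kept
    "w2": ["yes", "yes", "yes", "yes", "yes"],  # 3/5, same answer everywhere -> dropped
}
print(sorted(reliable_workers(workers, gold)))  # ['w1']
```

In practice you would also combine this with the qualification test and second-pass review described above, rather than rely on control questions alone.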
<b>Ethical issues in crowdsourcing</b><br />
<b><br /></b>
As a requester, you need to make sure you treat your workers properly. Always remember that workers are first of all people. When you consider how much to pay or whether to reject a worker's work, think of the following:<br />
<br />
<ul style="text-align: left;">
<li>Many workers rely on crowdsourcing as their main income. </li>
<li>They have no job security.</li>
<li>Rejection in some cases is unfair - even if the worker was bad in the task, they still spent time working (unless you are sure that they are cheating).</li>
<li>New workers do lower-paid work to build up their reputation, but underpaying is not fair and not ethical.</li>
<li>Are you sure you explained the task well? Maybe it is your fault if all the workers performed badly?</li>
</ul>
The good news is that, from my little experience, paying well pays off for the requester too. If you pay enough (but not too much!), you get good workers that want to do the task well. When you underpay, the good workers don't want to work on your task - they can get better paying tasks. The time to complete the task will be longer. And if you are like me, the thought of underpaying your workers will keep you awake at night. So pay well :)<sup><a href="http://veredshwartz.blogspot.co.il/2016/08/crowdsourcing-for-nlp.html#4" name="top4">4</a></sup><br />
<br />
<b>Important take-outs for successful crowdsourcing:</b><br />
<ul style="text-align: left;">
<li>Work in small batches. If you have 10,000 questions, don't publish all at once. Try some, learn from your mistakes, correct them and publish another batch. Mistakes are bound to happen, and they might cost you good money!</li>
<li>Use worker errors to improve instructions (remember: it might be your fault).</li>
<li>KEEP. IT. SIMPLE.</li>
<li>Use quality control mechanisms.</li>
<li>Don't underpay!</li>
<li>Always expect workers to be sloppy. Repeat guidelines and questions and don't expect workers to remember them.</li>
<li>If your questions are automatically generated, use random order and try to balance the number of questions with each expected answer, otherwise workers will exploit this bias (e.g. if most word-pairs are unrelated, they will mark <b>all of them</b> as unrelated without looking twice).</li>
<li>Make workers' lives easier, and they will perform better. For instance, if you have multiple questions regarding the same word, group them together.</li>
<li>If you find a way to make your task more fun, do so!</li>
</ul>
<hr />
<span style="font-family: inherit; font-size: xx-small;">
<span style="color: black; font-size: small;"><b><u>References</u></b></span></span><br />
<span style="color: black; font-family: inherit; font-size: small;"><span style="color: black; font-size: small;">[1] Howe, Jeff. <i>The rise of crowdsourcing.</i> Wired magazine 14.6 (2006).<br />[2] Omar F. </span>Zaidan and Chris Callison-Burch <i>Crowdsourcing translation: professional quality from non-professionals.</i> In ACL 2011.</span><br />
<br /></div>
<hr />
<div style="direction: ltr; text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;">
<span style="color: black;"><a href="https://www.blogger.com/null" name="1" style="font-weight: bold;">1</a><b> </b></span>And I would also like to mention another wonderful <a href="http://naacl.org/naacl-hlt-2015/tutorial-crowdsourcing-for-nlp.html">crowdsourcing tutorial</a> that I attended last year at NAACL 2015, given by Chris Callison-Burch, Lyle Ungar, and Ellie Pavlick. Unfortunately, at that time I had no personal experience with crowdsourcing, nor did I believe that my university would ever have the budget for it, so I made no effort to remember the technical details; I was completely wrong. A year later I published a <a href="https://aclweb.org/anthology/S/S16/S16-2013.pdf">paper</a> about a dataset collected with crowdsourcing, which even won a best paper award :)</span> <sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2016/08/crowdsourcing-for-nlp.html#top1" style="font-size: small;">↩</a></sup>
<br />
<span class="Apple-style-span" style="font-size: x-small;"><a href="https://www.blogger.com/null" name="2" style="font-weight: bold;">2</a><b> </b>For more sophisticated aggregation methods that assign weights to workers based on their quality, see <a href="http://www.dirkhovy.com/blog/index.php?id=81">MACE</a>.</span><span style="font-size: x-small;"> </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2016/08/crowdsourcing-for-nlp.html#top2" style="font-size: small;">↩</a></sup></div>
<div dir="ltr" style="text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;"><a href="https://www.blogger.com/null" name="3" style="font-weight: bold;">3</a><b> </b></span><span style="font-size: x-small;">Blocking a worker means that they can't work on your tasks anymore. Rejecting a worker means that they are not paid for the work they have already done. As far as I know, it is not recommended to reject a worker, because then they write bad things about you in <a href="http://turkernation.com/">Turker Nation</a> and nobody wants to work for you anymore. In addition, you should always give workers the benefit of the doubt; maybe you didn't explain the task well enough.</span><a href="http://veredshwartz.blogspot.co.il/2016/08/crowdsourcing-for-nlp.html#top3" style="font-size: small;"><sup>↩</sup></a></div>
<div dir="ltr" style="text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;"><a href="https://www.blogger.com/null" name="4" style="font-weight: bold;">4</a><b> </b></span><span style="font-size: x-small;">So how much should you pay? First of all, not less than 2 cents. Second, try to estimate how long a single question takes and aim for an hourly pay of around 6 USD. For example, in <a href="https://aclweb.org/anthology/S/S16/S16-2013.pdf">this paper</a> I paid 5 cents per question, which I've been told is the upper bound for such tasks.</span><a href="http://veredshwartz.blogspot.co.il/2016/08/crowdsourcing-for-nlp.html#top4" style="font-size: small;"><sup>↩</sup></a></div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com2tag:blogger.com,1999:blog-9145120678290195131.post-9077860186856558932016-06-20T17:35:00.000+03:002016-06-22T19:39:44.757+03:00Linguistic Analysis of Texts<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
Not long ago, Google released their new parser, oddly named <a href="http://googleresearch.blogspot.co.il/2016/05/announcing-syntaxnet-worlds-most.html">Parsey McParseface</a>. For a couple of days, popular media was swamped with announcements about Google solving all AI problems with their new magical software that understands language [e.g. <a href="http://www.cnbc.com/2016/05/13/google-launched-an-ai-tool-that-understands-english-called-parsey-mcparseface.html">1</a>, <a href="http://www.telegraph.co.uk/technology/2016/05/17/has-googles-parsey-mcparseface-just-solved-one-of-the-worlds-big/">2</a>].</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
Well, that's not quite what it does. In this post, I will explain about the different steps applied for analyzing sentence structure. These are usually used as a preprocessing step for higher-level tasks that try understanding the meaning of sentences, e.g. <a href="https://en.wikipedia.org/wiki/Intelligent_personal_assistant">intelligent personal assistants</a> like Siri or Google Now.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
The following tools are traditionally used one after the other (also known as the "linguistic annotation/processing pipeline"). Generally speaking, the tasks in this list are ordered by decreasing accuracy of the available tools: some low-level tasks are considered practically solved, while others still have room for improvement.
<div dir="ltr" style="text-align: left;">
</div>
<ol dir="ltr" style="text-align: left;">
<li><b>Sentence splitting</b> - as simple as it sounds: receives a text document/paragraph and returns its partition into sentences. While it sounds like a trivial task -- cut the text on every occurrence of a period -- it is a bit trickier than that; sentences can end with an exclamation / question mark, and periods are also used in acronyms and abbreviations in the middle of the sentence. The simple period rule will fail on <a href="http://articles.chicagotribune.com/1995-06-13/features/9506130094_1_thin-diet-drizzle">this text</a>, for example. Still, sentence splitting is practically considered a solved task, using predefined rules and some learning algorithms. See <a href="http://tech.grammarly.com/blog/posts/How-to-Split-Sentences.html">this</a> for more details.<br /><br /></li>
<li><b>Tokenization </b>- a tokenizer receives a sentence and splits it into tokens. Tokens are mostly words, but words that are short forms of negation or auxiliaries are split into two tokens, e.g. <i>I'm => I 'm</i>, <i>aren't => are n't</i>.<br /><br /></li>
<li><b>Stemming</b> / <b>Lemmatization </b>- Words appear in natural language in many forms, for instance, verbs have different tense suffixes (-ing, -ed, -s), nouns have plurality suffixes (-s), and adding suffixes to words can sometimes change their grammatical categories, as in <i>nation </i>(noun) => <i>national </i>(adjective) => <i>nationalize </i>(verb).<br />The goal of both stemmers and lemmatizers is to "normalize" words to their common base form, such as <i>"cats" => "cat", "eating" => "eat"</i>. This is useful for many text-processing applications, e.g. if you want to count how many times the word <i>cat</i> appears in the text, you may also want to count the occurrences of <i>cats</i>. <br />The difference between these two tools is that <b>stemming</b> removes the affixes of a word, to get its stem (root), which is not necessarily a word on its own, as in <i>driving </i>=> <i>driv</i>. <b>Lemmatization</b>, on the other hand, analyzes the word morphologically and returns its lemma. A lemma is the form in which a word appears in the dictionary (e.g. singular for nouns as in <i>cats </i>=> <i>cat</i>, infinitive for verbs as in <i>driving </i>=> <i>drive</i>). <br />Using a lemmatizer is always preferred, unless there is no accurate lemmatizer for that language, in which case a stemmer is better than nothing.</li>
<br />
<li><b>Part of speech tagging</b> - receives a sentence, and tags each word (token) with its part of speech (POS): noun, verb, adjective, adverb, preposition, etc. For instance, the following sentence: <i>I'm using a part of speech tagger</i> is tagged in <a href="http://nlp.stanford.edu:8080/parser/index.jsp">Stanford Parser</a> as:<br /><i>I/PRP 'm/VBP using/VBG a/DT part/NN of/IN speech/NN tagger/NN ./. </i>Which means that <i>I</i> is a personal pronoun, <i>'m</i> (am) is a verb, non-3rd person singular present, and if you're interested, here's the <a href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">list</a> to interpret the rest of the tags.<br />(POS taggers achieve around 97% accuracy).</li>
<br />
<li><b>Syntactic parsing</b> - analyzes the syntactic structure of a sentence, outputting one of two types of parse trees: constituency-based or dependency-based.<br /><br /><b>Constituency</b> - segments the sentence into syntactic phrases: for instance, in the sentence <i>the brown dog ate dog food</i>, [the brown dog] is a noun phrase, [ate dog food] is a verb phrase, and [dog food] is also a noun phrase.<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="text-align: center;"><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRDV3dZ4ccf_Nhr5tTW5_sJmXpG9RnZ8jYne4FUffwdl2rdruk6vZbxdLLsM3w3-epKxcxs35RtA-Yd3PzKx4Nv73fTQ_7gSDt-dh8t_lwLu_tUmPOPPvgTs_QoLxVqcMmkSPipp3BqXI/s1600/constituency.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRDV3dZ4ccf_Nhr5tTW5_sJmXpG9RnZ8jYne4FUffwdl2rdruk6vZbxdLLsM3w3-epKxcxs35RtA-Yd3PzKx4Nv73fTQ_7gSDt-dh8t_lwLu_tUmPOPPvgTs_QoLxVqcMmkSPipp3BqXI/s1600/constituency.png" /></a></div>
<span style="font-size: x-small;">An example of constituency parse tree, parsed manually by me and visualized using <a href="http://mshang.ca/syntree/">syntax tree generator</a>.</span></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<b>Dependency</b> - connects words in the sentence according to their relationship (subject, modifier, object, etc.). For example, in the sentence <i>the brown dog ate dog food</i>, the word <i>brown</i> is a modifier of the word <i>dog</i>, which is the subject of the sentence. I've mentioned dependency trees in the <a href="https://veredshwartz.blogspot.co.il/2016/05/improving-hypernymy-detection.html">previous post</a>: I used them to represent the relation that holds between two words, which is a common use.<br />(Parsey McParseface is a dependency parser. Best dependency parsers achieve around 94% accuracy).</li>
</ol>
<div dir="ltr" style="text-align: left;">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="text-align: center;"><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheZoo891B4NBvqsAiD7v3OyRo4b-IVDuL3q-8lK9QwrsHBiY01AJtpx5t0a04qqHvBoV85BdeW3K3qFwYYFh3d-q4zyYWpccy7uKmoGhhD9BR7zkwsEnkjWaG2pg68HDivIK7qyhknEIk/s1600/dependency.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheZoo891B4NBvqsAiD7v3OyRo4b-IVDuL3q-8lK9QwrsHBiY01AJtpx5t0a04qqHvBoV85BdeW3K3qFwYYFh3d-q4zyYWpccy7uKmoGhhD9BR7zkwsEnkjWaG2pg68HDivIK7qyhknEIk/s1600/dependency.PNG" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
An example of dependency parser output, using Stanford Core NLP.</td></tr>
</tbody></table>
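To give a feeling for the first two steps of the pipeline, here is a deliberately naive Python sketch of sentence splitting and tokenization. Real tools use many more rules and learned models (this splitter, for instance, still fails on abbreviations), so treat it as an illustration only:

```python
import re

def split_sentences(text):
    # naive rule: a period/!/? followed by whitespace and a capital
    # letter ends a sentence; real splitters also handle abbreviations
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

def tokenize(sentence):
    # separate clitics: aren't -> are n't, I'm -> I 'm
    sentence = re.sub(r"n't\b", " n't", sentence)
    sentence = re.sub(r"'(m|s|re|ve|ll|d)\b", r" '\1", sentence)
    # split punctuation off the preceding word
    sentence = re.sub(r"([.,!?;:])", r" \1", sentence)
    return sentence.split()

text = "I'm using a part of speech tagger. It works well."
for sentence in split_sentences(text):
    print(tokenize(sentence))
# ['I', "'m", 'using', 'a', 'part', 'of', 'speech', 'tagger', '.']
# ['It', 'works', 'well', '.']
```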
<br />
Other tools, which are less basic but often used, include:</div>
<div dir="ltr" style="text-align: left;">
<ul style="text-align: left;">
<li><b>Named entity recognition (NER)</b> - receives a text and marks certain words or multi-word expressions in the text with named entity tags, such as PERSON, LOCATION, ORGANIZATION, etc. <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8FrhFO0Q0Ud7TkdbizGx_np4lGT4_F2Y4YHHhWjz7ShcCvwNl5Z06WVX2AYa8bMDpl84h0AUGacG06dQWx6fuT0IW8fdLulWpYLHvYeGwKUwdb84F09Da2qxIwf0k6Lz-yE-kXC8e_Tc/s1600/ner.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8FrhFO0Q0Ud7TkdbizGx_np4lGT4_F2Y4YHHhWjz7ShcCvwNl5Z06WVX2AYa8bMDpl84h0AUGacG06dQWx6fuT0IW8fdLulWpYLHvYeGwKUwdb84F09Da2qxIwf0k6Lz-yE-kXC8e_Tc/s1600/ner.PNG" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: 12.8px;">An example of NER from Stanford Core NLP.</span></td></tr>
</tbody></table>
</li>
<br /><br />
<li><b>Coreference resolution</b> - receives a text and connects words that refer to the same entity (called "mentions"). This includes, but is not limited to:<br />pronouns (he, she, I, they, etc.) - <i>I just read <span style="color: red;">Lullaby</span>. <span style="color: red;">It</span> is a great book.</i><br />different names / abbreviations - e.g., the beginning of the text mentions <span style="color: red;">Barack Obama</span>, who is later referred to as <span style="color: red;">Obama</span>.<br />semantic relatedness - e.g. the beginning of the text mentions <span style="color: red;">Apple </span>which is later referred to as <span style="color: red;">the </span><span style="color: red;">company</span>.<br /><br />This is actually a tougher task than the previous ones, and accordingly, it achieves less accurate results. In particular, sometimes it is difficult to determine which entity a certain mention refers to (while it's easy for a human to tell): e.g. <i>I told John I don't want Bob to join dinner, because I don't like him.</i> Who does <i>him </i>refer to?<br />Another thing is that it is very sensitive to context, e.g. in one context <span style="color: red;">apple </span>can be co-referent with <span style="color: red;">the company</span>, while in another, which discusses the fruit, it is not.<br /><br /></li>
<li><b>Word sense disambiguation (WSD)</b> - receives a text and decides on the correct sense of each word in the given context. For instance, if we return to the <i>apple </i>example, in the sentence <i>Apple released the new iPhone</i>, the correct sense of apple is the company, while in <i>I ate an apple after lunch</i> the correct sense is the fruit. Most WSD systems use <a href="http://wordnetweb.princeton.edu/perl/webwn">WordNet</a> for the sense inventory.<br /><br /></li>
<li><b>Entity linking</b> - and in particular, <b>Wikification</b>: receives a text and links entities in the text to the corresponding Wikipedia articles. For instance, in the sentence <i>1984 is the best book I've ever read, </i>the word 1984 should be linked to <a href="https://en.wikipedia.org/wiki/Nineteen_Eighty-Four">https://en.wikipedia.org/wiki/Nineteen_Eighty-Four</a> (rather than to the <a href="https://en.wikipedia.org/wiki/1984_(disambiguation)">articles</a> discussing the films / TV shows).<br />Entity linking can complement word sense disambiguation, since most proper names (as <i>Apple </i>or <i>1984</i>) are not present in WordNet.<br /><br /></li>
<li><b>Semantic role labeling (SRL)</b> - receives a sentence and detects the predicates and arguments in the sentence. A predicate is usually a verb, and each verb may have several arguments, such as agent / subject (the person who does the action), theme (the person or thing that undergoes the action), instrument (what was used for doing the action), etc. For instance, in the sentence <i>John baked a cake for Mary</i>, the predicate is bake, and the arguments are agent:John, theme:cake, and goal:Mary. This is not just the last task on my list: it is also the task that comes closest to understanding the semantics of a sentence.</li>
</ul>
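As an illustration of the idea behind many WSD systems, here is a simplified version of the classic Lesk algorithm: pick the sense whose dictionary gloss shares the most words with the context. The sense names and glosses below are invented for the example; real systems use WordNet glosses and smarter overlap measures:

```python
def lesk(context_words, senses):
    # pick the sense whose gloss shares the most words with the context;
    # `senses` maps a sense name to its dictionary gloss (a string)
    context = set(w.lower() for w in context_words)
    def overlap(gloss):
        return len(context & set(gloss.lower().split()))
    return max(senses, key=lambda s: overlap(senses[s]))

senses = {
    "apple_company": "technology company that designs phones and computers",
    "apple_fruit": "round fruit of a tree eaten raw or cooked",
}
print(lesk("Apple released the new iPhone phones".split(), senses))  # apple_company
```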
</div>
<div dir="ltr" style="text-align: left;">
<br />
Here is an example for (a partial) analysis of the sentence: <i>The brown dog ate dog food, and now he is going to sleep</i><i>, </i>using <a href="http://nlp.stanford.edu:8080/corenlp/process">Stanford Core NLP</a>:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGj4mR9ec6zbqw0EugTaBiyeUczVTTC3ThPgZ9ZJgQqPXEzfKl4Fb4PoFPHnuN4EGhNJs2HxNP2FAiD_vYojHJPznB2qGwvQhWZLf6qGJYn5_v8tXTywjoZPGUuswLQP4BD21xEIlqKvs/s1600/all.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="215" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGj4mR9ec6zbqw0EugTaBiyeUczVTTC3ThPgZ9ZJgQqPXEzfKl4Fb4PoFPHnuN4EGhNJs2HxNP2FAiD_vYojHJPznB2qGwvQhWZLf6qGJYn5_v8tXTywjoZPGUuswLQP4BD21xEIlqKvs/s400/all.PNG" width="400" /></a></div>
</td></tr>
<tr><td class="tr-caption" style="text-align: center;">Analysis of <i>The brown dog ate dog food</i>, <i>and now he is going to sleep</i>, using Stanford Core NLP.</td></tr>
</tbody></table>
All this effort, and we are not even talking yet about deep understanding of the sentence's meaning, but only about analyzing the sentence's structure, perhaps as a step toward understanding its meaning. As my previous posts show, it's hard enough to understand the meaning of a single word. In one of my next posts I will describe methods that deal with the semantics of a sentence.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
By the way, if you are a potential user of these tools and you are looking for a parser, Google's parser is not the only one available. <a href="https://github.com/elikip/bist-parser">BIST</a> is more accurate and faster than Parsey McParseface, and <a href="http://spacy.io/">spaCy</a> is slightly less accurate, but much faster than both. </div>
<div dir="ltr" style="text-align: left;">
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com4tag:blogger.com,1999:blog-9145120678290195131.post-86789151488467176562016-05-25T18:24:00.002+03:002016-06-22T19:38:52.564+03:00Improving Hypernymy Detection<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
From time to time I actually make some progress with my own research that I think might be interesting or useful to others. Now is such a time,* so let me share it with you.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
If you've read my blog post about <a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html">lexical inference</a>, then you should already be familiar with my research goals. In short: I'm working on automated methods that recognize when the meaning of one term (a word or multi-word expression) can be inferred from that of another. </div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
There are several interesting lexical-semantic relations, such as synonymy/equivalence (<i>intelligent/smart, elevator/lift</i>), hypernymy/subclass (<i>parrot/bird, stare/look</i>), meronymy/part-of (<i>spoon/cutlery, London/England</i>), antonymy/opposite (<i>short/tall, boy/girl</i>), and causality (<i>flu/fever, rain/flood</i>). </div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
These relations are interesting, because whenever we encounter one of the terms in a given sentence, we can use our knowledge to infer new facts. For instance, the sentence "<i>I live in London</i>" (wishful thinking...), could be used to infer "<i>I live in England</i>", knowing that <i>London </i>is a part of <i>England</i>. Of course we also need to know something about the sentence itself, because saying that "<i>I left London</i>" doesn't necessarily entail that "<i>I left England</i>". I might have just taken the train to Reading for the <a href="http://ukandireland-aug14.blogspot.co.il/2014/09/23082014.html">Festival</a> :) But this is another line of research which I haven't explored deeply yet, so we'll leave that for another post.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
In this particular work, we've focused on the common hypernymy relation between nouns (and noun phrases). We developed a method that given a pair of nouns <i>(x, y)</i> (e.g. <i>(cat, animal)</i>, <i>(abbey road, record), (apple, animal)</i>) predicts whether <i>y </i>is a hypernym of <i>x</i> - or in other words, whether <i>x</i> is a subclass of <i>y</i> (e.g. <i>cats </i>are a subclass of <i>animals</i>) or an instance of <i>y </i>(e.g. <i>abbey road</i> is an instance of <i>record</i>). I'll try to keep things simple. If you're interested in more details or the references to other papers, please refer to the <a href="http://arxiv.org/pdf/1603.06076v1.pdf">paper</a>.</div>
<div dir="ltr" style="text-align: left;">
<br />
There are two main approaches in the hypernymy detection literature: path-based and distributional. Path-based methods are very elegant (a matter of taste, I guess). They assume that if <i>y</i> is a hypernym of <i>x</i>, this relation will be expressed frequently enough in a large text corpus. A pair of words that tends to be connected through patterns such as "X and other Y", "Y such as X", or "X is a Y" is likely to hold the hypernymy relation (e.g. <i>cats and other animals, fruit such as apples</i>). To get past adjectives and relative clauses that stand between the important pieces of information (as in<i> Abbey Road is [the 11th studio] album</i>), a <a href="https://en.wikipedia.org/wiki/Dependency_grammar">dependency parser</a> is used, which outputs the syntactic relations between the words in the sentence (e.g. <i>Abbey Road</i> is connected to <i>is</i>, and <i>is </i>is connected to <i>album</i>). See the figure below for an example of such a path.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFAr5cHdNv5kPVEGA4RdOfFmO3zmCVO9TG30vt4osfshRIN5r6BG6KZs9Y2ZjM-MMTa0uLsD-mm2xv6rmQrYrTcaTgtB-dN1kRZX208XWflUC3QcNqgMyqyHfuWlD5tza92M6p9mXiwQE/s1600/x_is_y.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFAr5cHdNv5kPVEGA4RdOfFmO3zmCVO9TG30vt4osfshRIN5r6BG6KZs9Y2ZjM-MMTa0uLsD-mm2xv6rmQrYrTcaTgtB-dN1kRZX208XWflUC3QcNqgMyqyHfuWlD5tza92M6p9mXiwQE/s1600/x_is_y.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="direction: ltr; font-size: 12.8px;">An example of a dependency path between <i>parrot </i>and <i>bird</i></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
These paths serve as features for classification. Path-based methods use training data - a list of <i>(x, y) </i>pairs with their corresponding labels (e.g. <i>(cat, animal, True)</i>, <i>(apple, animal, False)</i>). Each <i>(x, y)</i> pair is represented by all the dependency paths that connected the two words in the corpus: the feature vector holds an entry for each dependency path in the corpus, and the value is the number of times this path connected <i>x </i>and <i>y</i> (e.g. for <i>(cat, animal)</i>, how many times "<i>cat is an animal</i>" occurs in the corpus, how many times "<i>animals such as cats</i>" occurs, etc.). A classifier is trained over these vectors to predict whether <i>y </i>is a hypernym of <i>x</i>.<br />
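As a minimal sketch of this feature scheme, here the (x, y) pair is represented by counts of surface patterns over an invented micro-corpus; the real method uses dependency paths extracted by a parser, not raw string patterns.

```python
# A toy version of path-based hypernymy features: each (x, y) pair is
# represented by how often each pattern connects the two words in the
# corpus. The micro-corpus and surface patterns are invented for
# illustration; the real method uses parser-extracted dependency paths.

CORPUS = [
    "a cat is an animal",
    "my cat is an animal too",
    "animals such as cats and dogs",
    "i ate an apple and fed the animal",
]

PATTERNS = ["{x} is an {y}", "{y}s such as {x}s"]

def path_features(x, y):
    """Feature vector: occurrence count of each pattern instantiated with (x, y)."""
    return [sum(sent.count(p.format(x=x, y=y)) for sent in CORPUS)
            for p in PATTERNS]

print(path_features("cat", "animal"))   # [2, 1] -> evidence for hypernymy
print(path_features("apple", "animal"))  # [0, 0] -> no evidence
```

A classifier trained over such vectors would then predict hypernymy for new pairs.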
<br />
Though this method works nicely, it suffers from one major limitation: it cannot generalize. If <i>x1</i> and <i>y1 </i>are mainly connected by the "X is considered as Y" pattern, and <i>x2</i> and <i>y2 </i>are connected via "X is regarded as Y", they practically share no information - these are considered two different paths. Attempts to generalize such paths by replacing words along them with wild-cards (e.g. "X is * as Y") or part-of-speech tags (e.g. "X is VERB as Y") may end up with paths that are too general (e.g. "X is denied as Y", a negative path, also generalizes to "X is VERB as Y").<br />
<br />
In contrast, the distributional approach considers the separate occurrences of <i>x</i> and <i>y</i> in the corpus. It relies on the distributional hypothesis, which I've already mentioned <a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html">here</a> and <a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html">here</a>. The main outcome of this hypothesis is that words can be represented using vectors, with words that are similar in meaning sharing similar vectors. In recent years, people have been using these vectors in supervised hypernymy detection. To represent each <i>(x, y) </i>pair as a vector, they somehow combined <i>x </i>and <i>y</i>'s vectors (e.g. by concatenating them). They trained a classifier on top of these vectors and predicted for new pairs (e.g. <i>(apple, fruit)</i>) whether <i>y</i> is a hypernym of <i>x</i>. These methods have shown good performance, but it was later found that they tend to overfit to the training data and are pretty bad at generalizing; for example, if you try to predict hypernymy for a new pair <i>(x, y) </i>with rare words <i>x</i> and <i>y</i> that weren't observed in the training data, the prediction will be (only slightly better than) a guess.</div>
<div dir="ltr" style="text-align: left;">
<br />
To sum up recent work - path-based methods can leverage information about the relation between a pair of words, but they do not generalize well. On the other hand, distributional methods might not recognize the relation between a pair of words, but they contain useful information about each of the words. Since these two approaches are complementary, we combined them!</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
We started by improving the path representation, using a <a href="https://en.wikipedia.org/wiki/Recurrent_neural_network">recurrent neural network</a>. I can't possibly explain the technical details without writing a long background post about neural networks first, so I'll skip most of them. I'll just say that this is a machine learning model that processes sequences (e.g. of words, letters, or edges) and can, among other things, output a vector representing the entire sequence. In our case, we split the dependency path into edges and let the network learn a vector representation of the entire path by going over the edges sequentially. Then, we replace the traditional path features that represent a term-pair, e.g. "X is defined as Y", with the vector representing the path - the path embedding.<br />
<br />
The nice thing about these path embeddings is that -- can you guess? <b>similar paths have similar path embeddings</b>. This happens thanks to two things. First, the network can learn that certain edges are important for detecting hypernymy, while others are not, which may lead to consolidating paths that differ only by certain unimportant edges.<br />
<br />
Moreover, since neural networks can only work on vectors, all the information we use is encoded as vectors. For instance, the words along the path (e.g. <i>is, defined, as</i>) are all encoded as vectors. We use word embeddings, so similar words have similar vectors. This results in similar vectors for paths that differ by a pair of similar words, e.g. "X is defined as Y" and "X is described as Y".<br />
<br />
Similar paths having similar vectors is helpful for the classifier. In the paper, we show that our method performed better than the prior methods. Just to give an intuition, let's say that the classifier learned that the path "X company is a Y", which was pretty common in the corpus, indicates hypernymy. And let's say that "X ltd is a Y" only occurred once, for a positive <i>(x, y)</i> pair. The previous methods would probably decide that such a path is not indicative of hypernymy, since they don't have enough evidence about it. However, our method recognizes that <i>ltd</i> and <i>company</i> are similar words, yielding similar path vectors for these two paths. If "X company is a Y" is considered indicative, then so is "X ltd is a Y".<br />
<br />
Finally, we combined the complementary path-based and distributional approaches. To add distributional information to our model (information about the separate occurrences of each term <i>x </i>and <i>y</i>), we simply added the word embedding vectors of <i>x</i> and <i>y </i>to the model, allowing it to rely on this information as well. With this simple change we achieved a significant improvement in performance compared to prior methods of both approaches. For so many years people have been saying that these two approaches are complementary, and it turns out it was actually not too difficult to combine them :)<br />
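Schematically, the combined input for a pair is the concatenation [path embedding ; x's word vector ; y's word vector]. In the sketch below, the word embeddings are made-up 4-dimensional random vectors, and a plain average of the word vectors along the path stands in for the learned recurrent path embedding, just to show the shape of the input the classifier sees.

```python
# Schematic of the integrated pair representation. The embeddings are
# random placeholders, and averaging stands in for the recurrent network
# that actually learns the path embedding in the paper.
import random

random.seed(0)
DIM = 4
# Made-up word embeddings standing in for pretrained vectors.
emb = {w: [random.gauss(0, 1) for _ in range(DIM)]
       for w in ["cat", "animal", "is", "an"]}

def path_embedding(path_words):
    # Stand-in for the recurrent network: average the embeddings
    # of the words along the path, coordinate by coordinate.
    return [sum(vals) / len(path_words)
            for vals in zip(*[emb[w] for w in path_words])]

def pair_representation(x, y, path_words):
    # Concatenate: [path embedding ; x's vector ; y's vector]
    return path_embedding(path_words) + emb[x] + emb[y]

v = pair_representation("cat", "animal", ["is", "an"])
print(len(v))  # 12 -- three DIM-sized pieces, fed to a classifier
```

In the real model there is one path embedding per observed path (averaged over all paths connecting the pair), but the shape of the combined input is the same.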
<br />
<br /></div>
<div dir="ltr" style="text-align: left;">
<b>Paper details</b><br />
<i>Improving Hypernymy Detection with an Integrated Path-based and Distributional Method</i>. Vered Shwartz, Yoav Goldberg and Ido Dagan. ACL 2016. <a href="http://arxiv.org/pdf/1603.06076v1.pdf">link</a><br />
<div>
<br /></div>
</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<span style="font-size: x-small;">* Now = a few months ago, but the paper was under review and I couldn't publish anything about it.</span></div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com0tag:blogger.com,1999:blog-9145120678290195131.post-55123779002864080882016-03-08T23:49:00.000+02:002016-03-26T20:28:48.565+03:00Text Classification<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="rtl" style="text-align: right;" trbidi="on">
<div style="direction: ltr; text-align: left;">
</div>
<div style="text-align: left;">
<div dir="ltr" style="text-align: left;">
Given a piece of text (document), can software recognize the topic(s) it discusses? If you're not convinced that such a thing could be helpful, let's just start with two facts:</div>
</div>
<ul dir="ltr" style="text-align: left;">
<li>90% of the data in the world today has been created in the last two years [<a href="http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html">1</a>].</li>
<li>Our attention span is now less than that of a goldfish [<a href="http://www.telegraph.co.uk/news/science/science-news/11607315/Humans-have-shorter-attention-span-than-goldfish-thanks-to-smartphones.html">2</a>], and we almost never read through an article [<a href="http://www.slate.com/articles/technology/technology/2013/06/how_people_read_online_why_you_won_t_finish_this_article.html">3</a>].</li>
</ul>
<div dir="ltr" style="text-align: left;">
These two together lead to sooo much data that might be of interest to you, but that you'll never read. If only there were someone who could read articles for you and decide whether they intersect with your topics of interest. Well, automatic text classification can assist you with that.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
</div>
<div dir="ltr" style="text-align: left;">
<u style="text-align: left;">Representation</u></div>
<div dir="ltr" style="text-align: left;">
<span style="text-align: left;">As in the </span><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html" style="text-align: left;">last post about word representation</a><span style="text-align: left;">, we must first decide how to represent the documents. Intuitively, we want the algorithm to classify a document to a certain topic based on the document's content, as in figure 1. We need the document representation to reflect that.</span><br />
<br style="text-align: left;" />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghEPxIb0IjtzR7kj05h4mpUEhyp6eC-Nlsbxgl3RIMz9pmMcXLEC13_fuIccItBWkyz90seEHXppM2d0a_c9nCWd-QsymjPRL6uiP1TC-aAo08cDqsaLjWJvUThyGsm4Z9AWcRbnJwlF0/s1600/documents.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghEPxIb0IjtzR7kj05h4mpUEhyp6eC-Nlsbxgl3RIMz9pmMcXLEC13_fuIccItBWkyz90seEHXppM2d0a_c9nCWd-QsymjPRL6uiP1TC-aAo08cDqsaLjWJvUThyGsm4Z9AWcRbnJwlF0/s1600/documents.JPG" /></a></td></tr>
<tr><td class="tr-caption" style="direction: ltr; font-size: 12.8px;"><span style="font-family: "cambria math" , serif; font-size: 10pt;">Figure 1: Two example documents, one (Doc1) about computer science and the other (Doc2) about news. </span><br />
<div>
<span style="font-family: "cambria math" , serif; font-size: 10pt;"><br /></span></div>
</td></tr>
</tbody></table>
<span style="text-align: left;">The simplest and most common approach is the "bag-of-words" approach, in which each document is represented by all its words. Some words may be indicative of one topic or another, e.g. </span><i style="text-align: left;">soccer, player, </i><span style="text-align: left;">and </span><i style="text-align: left;">match</i><span style="text-align: left;"> might indicate that a document is about sports, while </span><i style="text-align: left;">government</i><span style="text-align: left;">, </span><i style="text-align: left;">prime minister,</i><span style="text-align: left;"> and </span><i style="text-align: left;">war</i><span style="text-align: left;"> suggest that this is a news article. If you remember the spam filtering example from the </span><a href="http://veredshwartz.blogspot.co.il/2015/08/supervised-learning.html" style="text-align: left;">supervised learning post</a><span style="text-align: left;">, this is exactly the same: some words in the mail message may indicate that it is spam. Spam filtering can be regarded as a text classification task in which each document is a mail message, classified to one of the two topics <i>spam</i> / <i>not spam</i>.</span></div>
<div dir="ltr" style="text-align: left;">
<br />
The bag-of-words approach is simple, yet may yield nice results. However, it ignores multi-word expressions (<i>document classification</i>), some of which are non-compositional, i.e. the phrase's meaning differs from the meanings of its separate words (<i>rock and roll</i>). It also ignores word order and syntax. Other methods can take these features into account as well. In this post we will stick to the simple bag-of-words approach, which is often good enough for this task.<br />
<br />
<u>Methods</u><br />
Choosing the method in which documents will be classified to topics depends, first of all, on the available data. If we have a sample of labeled documents (documents with known topics), we will prefer <a href="http://veredshwartz.blogspot.co.il/2015/08/supervised-learning.html">supervised classification</a>. In supervised learning, the algorithm is given a training set of labeled instances (e.g. document and its topic), and it tries to learn a model that predicts those labels correctly. This model is later used to predict the label (topic) of unseen instances (documents).<br />
<br />
Otherwise, if we only have a bunch of documents without their topics (which is the more common case), we will apply unsupervised classification (<a href="https://en.wikipedia.org/wiki/Cluster_analysis">clustering</a>). Instead of trying to attach a label (topic) to each instance (document), the algorithm groups together similar documents, which seem to be about the same topic. The output of the clustering process is clusters of documents, each cluster represents a topic.<br />
<br />
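As a tiny illustration of the unsupervised setting, here is a crude single-pass grouping of documents by word overlap (Jaccard similarity). The documents and the similarity threshold are invented; real systems use proper clustering algorithms such as k-means.

```python
# A crude stand-in for document clustering: greedily join a document to
# the first cluster whose seed document shares enough words with it
# (Jaccard similarity), otherwise start a new cluster. Documents and the
# 0.2 threshold are invented for illustration.

docs = [
    "soccer player scores in the match",
    "the player won the soccer cup",
    "government passes new war budget",
]

def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

clusters = []
for doc in docs:
    for cluster in clusters:
        if jaccard(doc, cluster[0]) > 0.2:
            cluster.append(doc)
            break
    else:
        clusters.append([doc])

print(len(clusters))  # 2 -- the two soccer documents end up together
```

No topic labels are involved: the output is simply groups of similar documents, each of which presumably discusses one topic.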
<u><b>Supervised Document Classification</b></u><br />
Different methods to classify documents differ in the instance representation and the learning algorithms. The general scheme is to represent each document as a feature vector, use a multi-class classifier and feed it the training instances and labels (documents and their topics) to learn a model. Then, given a new unlabeled document, the classifier can predict the document's topic, with some level of success.<br />
<br />
Following the bag-of-words approach, we may represent each document as a |V|-dimensional vector, where V is our vocabulary. Each cell represents a word which may or may not appear in the document. There are a few variants of what the cells contain; here are some common ones:<br />
<div dir="ltr" style="text-align: left;">
</div>
<ul dir="ltr" style="text-align: left;">
<li><b>Binary </b>- 1 if the word occurred in the document, 0 otherwise. It is a set representation of the document's words.</li>
<li><b>Frequency </b>- the number of times that a word occurred in the document. We can expect that topic prominent words would appear frequently in the topic documents, while other words would occur occasionally. </li>
<li><b>TF-IDF</b> - I might be skipping a few simpler metrics, but this one is very useful. When we count the frequency of each word, we might end up with some words that are frequent in all documents, regardless of the topic. For instance, <a href="https://en.wikipedia.org/wiki/Stop_words">stop words</a> and function words (<i>the, it, and, what...</i>) are never indicative of a certain topic. The TF-IDF metric handles this problem by measuring the importance of a term in a certain document, given all the other documents. The TF (Term-Frequency) measure is proportional to the word frequency in the document. The IDF (Inverse-Document-Frequency) decreases if the word is generally frequent in all documents. This way, a word gets a high score if it is relatively non-common in general, but common in the specific document.</li>
</ul>
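These three variants are easy to compute by hand. Here is a sketch over two invented mini-documents (not the ones in the figure), using the plain TF-IDF formulation described above:

```python
# Binary, frequency, and TF-IDF bag-of-words vectors for two tiny
# invented documents. TF is the in-document relative frequency and IDF
# is log(#documents / #documents containing the word).
import math
from collections import Counter

docs = [
    "the soccer player scored in the match".split(),
    "the government held a press conference".split(),
]

vocab = sorted(set(w for doc in docs for w in doc))

def binary_vector(doc):
    return [1 if w in doc else 0 for w in vocab]

def freq_vector(doc):
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def tfidf_vector(doc):
    counts = Counter(doc)
    vec = []
    for w in vocab:
        tf = counts[w] / len(doc)
        df = sum(1 for d in docs if w in d)
        idf = math.log(len(docs) / df)
        vec.append(tf * idf)
    return vec

# "the" occurs in both documents, so its IDF -- and hence TF-IDF -- is zero:
print(tfidf_vector(docs[0])[vocab.index("the")])  # 0.0
```

Note how the frequent but uninformative word <i>the</i> gets a zero TF-IDF weight, while topical words like <i>soccer</i> keep a positive one.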
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9B0jGzmF4EfBuvnvB3Iyi9F_EFSEBi5pmKr2lcyP2DObtNbzMLCVuvW0ky5KfVflSVp76VMZ0BjdXvSH7LaBt6A-4LCmvWdb-VGcP2NAVHIW9JoZ62GLw-3ZmqxvksGRdA-64llTrLEk/s1600/bag-of-words.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="75" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9B0jGzmF4EfBuvnvB3Iyi9F_EFSEBi5pmKr2lcyP2DObtNbzMLCVuvW0ky5KfVflSVp76VMZ0BjdXvSH7LaBt6A-4LCmvWdb-VGcP2NAVHIW9JoZ62GLw-3ZmqxvksGRdA-64llTrLEk/s400/bag-of-words.JPG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="direction: ltr; font-size: 12.8px;">Figure 2: <span style="font-family: "cambria math" , serif; font-size: 10pt;">The corresponding (partial) bag-of-words vectors for the example documents, with the binary (top) and frequency (bottom) variants.</span></td></tr>
</tbody></table>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
It is no coincidence that this post follows the one about word representations: in word-representation terms, the frequency document vector is the sum of the one-hot vectors of all the words in the document. Can you see it?<br />
Correspondingly, when working with word embeddings (continuous dense vectors), there is a variant of bag-of-words called CBOW (continuous bag-of-words): the sum or average of the word vectors of all the words in the document. This results in a D-dimensional vector representing the entire document, where D is the word vectors' dimension.<br />
<br />
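A CBOW document vector can be sketched in a few lines; the 3-dimensional word vectors below are made up, standing in for pretrained embeddings like GloVe.

```python
# CBOW document representation: the document vector is the average of
# the vectors of its words. The 3-d word vectors are invented
# placeholders for pretrained embeddings.

word_vectors = {
    "soccer": [0.9, 0.1, 0.0],
    "match":  [0.8, 0.2, 0.1],
    "war":    [0.0, 0.9, 0.8],
}

def cbow(doc_words):
    known = [word_vectors[w] for w in doc_words if w in word_vectors]
    # Average the word vectors coordinate-wise.
    return [sum(vals) / len(known) for vals in zip(*known)]

doc_vec = cbow(["soccer", "match"])
print(doc_vec)  # roughly [0.85, 0.15, 0.05]
```

The resulting vector has the word embedding dimension regardless of document length, which is exactly what makes it a convenient fixed-size feature vector.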
Once we represented each document as a feature vector, we let the classifier train over the feature vectors and corresponding labels. It may notice that feature vectors with high values in the cells of <i style="text-align: left;">soccer, player, </i><span style="text-align: left;">and </span><i style="text-align: left;">match</i><span style="text-align: left;"> tend to occur with the label <i>sports</i>, for example. </span><br />
<span style="text-align: left;"><br /></span>
<span style="text-align: left;">There is a broad choice of algorithms, some may perform better than others. Roughly, they all try to do the same thing: t</span><span style="text-align: left;">he multi-class classifier learns a weight for each feature and each label. In our text classification task, it learns the salience of each word in the vocabulary for each topic. For example, </span><i style="text-align: left;">soccer, player, </i><span style="text-align: left;">and </span><i style="text-align: left;">match</i><span style="text-align: left;"> will be assigned high weights and </span><i style="text-align: left;">government </i><span style="text-align: left;">and</span><span style="text-align: left;"> </span><i style="text-align: left;">prime minister </i><span style="text-align: left;">will be assigned low (maybe negative) weights for the <i>sports</i> label, and the opposite will occur for the <i>news</i> label. When the classifier is trained, the objective is to learn the weights that maximize the accuracy of the training set, i.e., classifying as many documents as possible to the correct topic.</span><br />
<br />
For you coders, here is the simplest proof of concept. Other people may skip the code. This is a Python script that uses <a href="http://scikit-learn.org/stable/">scikit-learn</a>, a machine learning package for Python. I used a subset of the topics in the <a href="http://scikit-learn.org/stable/datasets/twenty_newsgroups.html">20 newsgroup dataset</a>. I trained a simple logistic regression classifier, removing stop words and punctuation, and representing each document using CBOW with <a href="http://nlp.stanford.edu/projects/glove/">GloVe word embeddings</a> (my personal favorite). This yields an 83% F1 score, which leaves plenty of room for improvement, but is still nice for such little effort.<br />
<br />
<br />
<html>
<head>
<title>document_classification.py</title>
<style type="text/css">
.ln { color: rgb(0,0,0); font-weight: normal; font-style: normal; }
.s0 { color: rgb(0,0,128); font-weight: bold; }
.s1 { }
.s2 { color: rgb(128,128,128); font-style: italic; }
.s3 { color: rgb(0,128,0); font-weight: bold; }
.s4 { color: rgb(0,0,255); }
</style>
</head>
<body bgcolor="#ffffff">
<table bgcolor="#C0C0C0" cellpadding="5" cellspacing="0" cols="1" style="width: 100%px;">
<tr><td><center>
<span style="color: black; font-family: Arial, Helvetica;">
document_classification.py</span>
</center>
</td></tr>
</table>
<pre>
<a href="https://www.blogger.com/null" name="l1"><span class="ln">1 </span></a><span class="s0">import </span><span class="s1">sys
<a href="https://www.blogger.com/null" name="l2"><span class="ln">2 </span></a></span><span class="s0">import </span><span class="s1">nltk
<a href="https://www.blogger.com/null" name="l3"><span class="ln">3 </span></a></span><span class="s0">import </span><span class="s1">string
<a href="https://www.blogger.com/null" name="l4"><span class="ln">4 </span></a></span><span class="s0">import </span><span class="s1">codecs
<a href="https://www.blogger.com/null" name="l5"><span class="ln">5 </span></a>
<a href="https://www.blogger.com/null" name="l6"><span class="ln">6 </span></a></span><span class="s0">from </span><span class="s1">sklearn.datasets </span><span class="s0">import </span><span class="s1">fetch_20newsgroups
<a href="https://www.blogger.com/null" name="l7"><span class="ln">7 </span></a></span><span class="s0">from </span><span class="s1">sklearn.metrics </span><span class="s0">import </span><span class="s1">precision_recall_fscore_support
<a href="https://www.blogger.com/null" name="l8"><span class="ln">8 </span></a></span><span class="s0">from </span><span class="s1">sklearn.linear_model </span><span class="s0">import </span><span class="s1">LogisticRegression
<a href="https://www.blogger.com/null" name="l9"><span class="ln">9 </span></a>
<a href="https://www.blogger.com/null" name="l10"><span class="ln">10 </span></a></span><span class="s0">import </span><span class="s1">numpy </span><span class="s0">as </span><span class="s1">np
<a href="https://www.blogger.com/null" name="l11"><span class="ln">11 </span></a>
<a href="https://www.blogger.com/null" name="l12"><span class="ln">12 </span></a>
<a href="https://www.blogger.com/null" name="l13"><span class="ln">13 </span></a></span><span class="s0">def </span><span class="s1">main():
<a href="https://www.blogger.com/null" name="l14"><span class="ln">14 </span></a>
<a href="https://www.blogger.com/null" name="l15"><span class="ln">15 </span></a> </span><span class="s2"># Load the word vectors</span><span class="s1">
<a href="https://www.blogger.com/null" name="l16"><span class="ln">16 </span></a> words, wv = load_embeddings(</span><span class="s3">'glove.6B.50d.txt'</span><span class="s1">)
<a href="https://www.blogger.com/null" name="l17"><span class="ln">17 </span></a> word_to_num = { word : i </span><span class="s0">for </span><span class="s1">i, word </span><span class="s0">in </span><span class="s1">enumerate(words) }
<a href="https://www.blogger.com/null" name="l18"><span class="ln">18 </span></a>
<a href="https://www.blogger.com/null" name="l19"><span class="ln">19 </span></a> </span><span class="s2"># Load the stop words</span><span class="s1">
</span>    with codecs.open('English_stop_words.txt', 'r', 'utf-8') as f_in:
        stop_words = set([line.strip() for line in f_in])

    # Load the datasets
    topics = ['talk.politics.guns', 'soc.religion.christian',
              'comp.windows.x', 'rec.sport.baseball', 'sci.med']
    newsgroups_train = fetch_20newsgroups(subset='train',
                                          remove=('headers',
                                                  'footers',
                                                  'quotes'),
                                          categories=topics)
    y_train = list(newsgroups_train.target)
    newsgroups_test = fetch_20newsgroups(subset='test',
                                         remove=('headers',
                                                 'footers',
                                                 'quotes'),
                                         categories=topics)
    y_test = list(newsgroups_test.target)

    # Create the feature vectors
    X_train = create_doc_vectors(newsgroups_train.data,
                                 word_to_num, wv, stop_words)
    X_test = create_doc_vectors(newsgroups_test.data,
                                word_to_num, wv, stop_words)

    # Create the classifier
    classifier = LogisticRegression()

    # Train the classifier
    classifier.fit(X_train, y_train)

    # Predict the topics of the test set and compute
    # the evaluation metrics
    y_pred = classifier.predict(X_test)
    precision, recall, f1, support = \
        precision_recall_fscore_support(y_test, y_pred,
                                        average='weighted')

    print('Precision: %.02f%%, Recall: %.02f%%, F1: %.02f%%'
          % (precision * 100, recall * 100, f1 * 100))


def create_doc_vectors(data, word_to_num, wv, stop_words):
    """
    Create a matrix in which each row is a document,
    and each document is represented as CBOW
    (the average of its word vectors)
    """
    doc_vecs = []

    for doc in data:
        tokens = nltk.word_tokenize(doc.lower())
        tokens = [w for w in tokens
                  if w not in string.punctuation
                  and w not in stop_words]
        tokens = [word_to_num.get(w, -1) for w in tokens]
        doc_vector = [wv[w] for w in tokens if w > -1]
        # Average the word vectors; fall back to a zero vector
        # for documents with no in-vocabulary words
        # (50 is the embedding dimension)
        doc_vector = np.mean(np.vstack(doc_vector), axis=0) \
            if len(doc_vector) > 0 else np.zeros((1, 50))
        doc_vecs.append(doc_vector)

    instances = np.vstack(doc_vecs)
    return instances


def load_embeddings(embedding_file):
    """
    Load the pre-trained embeddings from a file
    """
    with codecs.open(embedding_file, 'r', 'utf-8') as f_in:
        words, vectors = \
            zip(*[line.strip().split(' ', 1) for line in f_in])
    vectors = np.loadtxt(vectors)

    # Normalize each row (word vector) to unit L2 length
    row_norm = np.sum(np.abs(vectors) ** 2, axis=-1) ** (1. / 2)
    vectors /= row_norm[:, np.newaxis]

    return words, vectors


if __name__ == '__main__':
    main()
</pre>
<br />
As an example, I printed one of the documents and the topic that the classifier assigned to it:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">
<textarea cols="75" name="doc" readonly="readonly" rows="9"> I really think that this is the key point. When I saw the incident on Baseball Tonight Sunday, I couldn't believe how far away from the plate Gant went. Then he casually leaned against his bat. I don't blame the umpire at all for telling the pitcher to pitch. The worst part of the whole incident was the Braves coming out onto the field. What were they going to do, attack the umpire? The only people who should've been out there were Cox and maybe the coaches, but NO players. I agree with the person who posted before that Cox should be suspended for having no control over his team.
</textarea></span>
<b>rec.sport.baseball</b><br />
<span style="text-align: left;"><br /></span>
<span style="text-align: left;">I'd say it's an easy one, since it specifically mentions "baseball". However, nothing should be taken for granted in NLP! Anything that works is a miracle :)</span><br />
<span style="text-align: left;"><br /></span>
<u><b>Unsupervised Document Classification</b></u><br />
In some cases, we don't have labeled data, so we can't learn characteristics of instances with specific labels, e.g., words that tend to occur in documents about politics. Instead, we can find common characteristics between documents about the same (unknown) topic, and group documents from the (seemingly) same topic together in one cluster. <br />
<br />
Then, when a new document is presented, we can assign it to the most suitable cluster, based, for example, on how similar it is to the documents in each cluster. This serves several purposes: first, we can let someone look at a few documents from each cluster and infer the topic the cluster represents. We can also automatically take the most common words in the cluster and use them as "tags" describing the topic. Another use is recommending a document to someone who has read other documents in the cluster (assuming they are interested in the cluster's topic).<br />
<br />
While we don't have the true labels (topics) of the training instances (documents), we assume that each document has a certain topic, which is a <i>hidden variable</i> in our model (as opposed to the documents themselves, which are <i>observed</i>). One clustering approach is through <a href="https://en.wikipedia.org/wiki/Generative_model">generative models</a>: we assume that a model generated our existing documents, and we use an algorithm that tries to reconstruct the parameters of the model, in a way that best explains our observed data.<br />
<br />
The assumption of the generative model is that our training data was generated according to probability distributions that we would like to reconstruct. A simple model, called <i>mixture of histograms,</i> assumes that each document has exactly one topic. A document was generated as follows:<br />
<br />
<ul style="text-align: left;">
<li>The document's topic <i>c</i> was drawn from the topics' distribution (probability function) P(C).<br />For example, if we have 3 topics, <i>news, sports</i> and <i>music</i>, with the probabilities 0.5, 0.3, 0.2 respectively, then in half of the cases we are expected to "generate" a document about <i>news.</i></li>
<li>Given the topic <i>c</i>, the words in the document were sampled from a distribution of words given the topic, P(w|c). For example, if the topic is <i>news</i>, the probability of each word <i>w</i> in the vocabulary under this topic is P(w|<i>news</i>). Since there are many words in the vocabulary, the probability of each is quite small (because the probabilities of all words sum up to 1). Still, words that are likely to appear in news documents, such as <i>war</i> and <i>report</i>, will have higher probabilities, e.g. P(<i>war|news</i>) > P(<i>football|news</i>). When we sample words for the generated <i>news</i> document, we will get mostly words discussing news.</li>
</ul>
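The two-step generative story above can be sketched in a few lines. The topics, words, and probabilities below are made up for illustration:

```python
import random

random.seed(0)

# Hypothetical P(C) and P(w|c) tables, for illustration only
p_topic = {'news': 0.5, 'sports': 0.3, 'music': 0.2}
p_word_given_topic = {
    'news':   {'war': 0.4, 'report': 0.4, 'football': 0.1, 'guitar': 0.1},
    'sports': {'football': 0.5, 'report': 0.2, 'war': 0.2, 'guitar': 0.1},
    'music':  {'guitar': 0.6, 'report': 0.2, 'war': 0.1, 'football': 0.1},
}

def generate_document(n_words=10):
    """Draw a topic c ~ P(C), then draw each word w ~ P(w|c)."""
    topics = list(p_topic)
    c = random.choices(topics, weights=[p_topic[t] for t in topics])[0]
    words = list(p_word_given_topic[c])
    weights = [p_word_given_topic[c][w] for w in words]
    return c, random.choices(words, weights=weights, k=n_words)

topic, doc = generate_document()
```

A document generated this way under the <i>news</i> topic will mostly contain <i>war</i> and <i>report</i>, exactly because those are the largest entries in its P(w|<i>news</i>) row.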
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEncMngV_S_2CTb3gnMotAFqhtarDq4c2_eIgfEKQ4q_4zlRJSOi6pUc_ep7tL-LSaBCTP83QKXMhlheWdrr6EGC4lMia92eTlt661NE9jpE4DQDaFzGxyr2R0LhPQYd8FMM7kBEtXT8k/s1600/clustering.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiEncMngV_S_2CTb3gnMotAFqhtarDq4c2_eIgfEKQ4q_4zlRJSOi6pUc_ep7tL-LSaBCTP83QKXMhlheWdrr6EGC4lMia92eTlt661NE9jpE4DQDaFzGxyr2R0LhPQYd8FMM7kBEtXT8k/s320/clustering.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 3: An illustration of the probability distribution of word given a topic.</td></tr>
</tbody></table>
<br />
The goal of the algorithm is to learn the probability functions P(c) and P(w|c) for each topic, given solely the documents themselves. These probabilities are estimated using an iterative algorithm called <a href="https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm">EM (expectation maximization)</a>. Since the topic of each document is unknown (hidden), we first decide on the number of topics (how fine-grained should the clustering be? should we distinguish between <i>football </i>and <i>basketball</i> or is <i>sports </i>enough?), and then start with a random guess of the probabilities. The algorithm works in iterations, improving the probabilities at each iteration. Each iteration consists of two steps:<br />
<ol style="text-align: left;">
<li>Assign a topic to each document based on the current probabilities.</li>
<li>Given the documents' topics computed at the previous step, re-estimate the probabilities using relative frequency (e.g., the probability of <i>news</i> is the ratio between the number of documents assigned this topic and the total number of documents).</li>
</ol>
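These two steps can be sketched as a "hard" EM loop, in which each document is assigned its single most likely topic. This is a simplified illustration: it uses add-one smoothing, and a deterministic round-robin initial assignment instead of the random initial guess described above:

```python
import math
from collections import Counter

def hard_em(docs, n_topics, n_iters=20):
    """docs: list of token lists. Returns one topic id per document."""
    vocab = sorted({w for d in docs for w in d})
    assign = [i % n_topics for i in range(len(docs))]  # initial guess
    for _ in range(n_iters):
        # Step 2: re-estimate P(c) and P(w|c) by relative frequency
        # (add-one smoothing keeps every probability non-zero)
        p_c = [(assign.count(c) + 1) / (len(docs) + n_topics)
               for c in range(n_topics)]
        counts = [Counter() for _ in range(n_topics)]
        for d, c in zip(docs, assign):
            counts[c].update(d)
        p_w = [{w: (counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab))
                for w in vocab} for c in range(n_topics)]
        # Step 1: assign each document its most likely topic
        assign = [max(range(n_topics),
                      key=lambda c: math.log(p_c[c])
                      + sum(math.log(p_w[c][w]) for w in d))
                  for d in docs]
    return assign

docs = [['war', 'report', 'war'], ['report', 'war'],
        ['guitar', 'song'], ['song', 'guitar', 'guitar']]
labels = hard_em(docs, n_topics=2)
```

On this toy data, the two war/report documents end up in one cluster and the two guitar/song documents in the other.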
<div>
This algorithm works nicely and it can be used to solve other problems with hidden variables. In the end, each document is assigned to a certain (meaningless) topic. As I mentioned before, the meaning of this cluster of documents can be understood by looking at the common features of several documents in the cluster (e.g. do they all discuss music?).</div>
<br />
That's it about text classification for now. Now, can you code something that automatically infers the topic of this blog post? :)<br />
<br /></div>
</div>
<div style="text-align: left;">
<div dir="ltr" style="text-align: left;">
<hr />
There is so much more that I didn't cover in this post, because I don't want to exhaust anyone. If you're interested in reading more, I recommend:</div>
</div>
</div>
<ul dir="ltr" style="text-align: left;">
<li><a href="http://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-discriminative-algorithm">The difference between generative and discriminative models</a> - it's a general machine learning topic, not specific to document classification. In this post I gave an example of a supervised discriminative model and an unsupervised generative model, but:</li>
<ul>
<li>The k-means algorithm is an unsupervised discriminative model that can be used for <a href="http://scikit-learn.org/stable/auto_examples/text/document_clustering.html">text classification</a>.</li>
<li>Naïve Bayes is a supervised generative classifier that can be used for <a href="http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html">text classification</a>.</li>
</ul>
<li>Document classification can also be used for <a href="http://jmgomezhidalgo.blogspot.co.il/2013/05/language-identification-as-text.html">language identification</a>.</li>
<li>And you can't write a post about text classification without mentioning <a href="http://blog.aylien.com/post/108652969778/text-analysis-101-a-basic-understanding-for">LDA</a>. It was a bit too complex for this post, but here, I mentioned it.</li>
</ul>
</div>
Vered Shwartz · 2016-01-03 · <u><b>Representing Words</b></u><div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
</div>
<div dir="ltr" style="text-align: left;">
We're already six posts into the topic of natural language processing, and I can't believe I haven't discussed this basic topic yet. So today I'm going to discuss words; more accurately, I will discuss how words are represented in natural language processing.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
A word is the most basic unit in semantic tasks. While lower-level tasks, such as part-of-speech tagging (detecting the part of speech of each word in a sentence) and lemmatization (finding the lemma - a word's base form), may be interested in a word's characters (mostly its affixes), semantic tasks are concerned with meaning, and a word is the most basic unit that conveys meaning.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
A word is basically stored in the computer as a string - a sequence of characters - but this representation says nothing about the word's meaning. It allows detecting similarity only in very specific cases. For example, <a href="https://en.wikipedia.org/wiki/Morphological_derivation">morphological derivations</a> (e.g. <i>sing-singer</i>) and <a href="https://en.wikipedia.org/wiki/Inflection">inflections</a> (e.g. <i>cat-cats, listen-listened</i>) modify words, creating related words (with a different part-of-speech, plurality form, etc). Such words are therefore similar both in meaning and in their string representations, since they share common characters. This similarity can be detected using <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a>.</div>
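As a quick illustration, Levenshtein distance itself takes only a few lines of dynamic programming:

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (cs != ct)))    # substitution
        prev = curr
    return prev[-1]

print(levenshtein('cat', 'cats'))      # 1
print(levenshtein('sing', 'singer'))   # 2
```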
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
Needless to say, most words that are similar in meaning do not share many common characters. Synonyms such as <i>elevator </i>and <i>lift</i>, and related words such as <i>food</i> and <i>eat</i>, are not similar at the character level.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
In addition, performing operations on strings is highly inefficient, since computers handle numbers much better than strings; better representations are also needed for faster computation. I assume that I've convinced you <i>why </i>words need better representations. Now I can move on to telling you <i>how </i>words can be better represented as vectors (arrays).</div>
<div dir="ltr" style="text-align: left;">
<span style="text-align: left;"><br /></span></div>
<div dir="ltr" style="text-align: left;">
<span style="text-align: left;"><b>One-hot vectors</b></span></div>
<div dir="ltr" style="text-align: left;">
As I mentioned, working with strings is computationally inefficient. A simple solution to this inefficiency is to convert every string to a number. First, we need to define the vocabulary: this is the set of all distinct words in the language (or at least those that we observed in a very large corpus). Unlike a dictionary, in which the entries are word basic forms, a vocabulary contains different entries for inflections and derivations of the same basic form (e.g., <i>cat </i>and <i>cats</i>).<br />
When a semantic application processes a text, it can now replace every word in the text with its index in the vocabulary, e.g. <i>cat</i> is 12424, <i>cats</i> is 12431, <i>dog</i> is 15879, etc.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
This is the most compact representation for a word - all it takes to store a word is one number. So what do vectors have to do with this? Another way to look at this representation is as a vector of dimension |V| (with V denoting the vocabulary), with zeros in all entries, and one set bit at the word's index (see figure 1).<br />
<br /></div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5710Ebf1fDdyF9vggXtkZRtXIRMxQbmVnON-iMxywyL1zOJhPqHarw5mWS7_-j0bt7NFVEA9FuJBBDiJznR_1Bgp0NEryBLPibVvTGk0cJv1eHmXwrvVJeXvOddHICj0x1ETDOX7JPfk/s1600/1-hot-vector.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5710Ebf1fDdyF9vggXtkZRtXIRMxQbmVnON-iMxywyL1zOJhPqHarw5mWS7_-j0bt7NFVEA9FuJBBDiJznR_1Bgp0NEryBLPibVvTGk0cJv1eHmXwrvVJeXvOddHICj0x1ETDOX7JPfk/s400/1-hot-vector.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="direction: ltr; text-align: center;"><span style="font-size: x-small;">Figure 1: an illustration of one-hot vectors of <i>cat</i>, <i>cats</i> and <i>dog</i>. The only non-zero entry in each vector is the index of the word in the vocabulary.</span></td></tr>
</tbody></table>
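The representation in Figure 1 can be built directly; here is a sketch with a toy three-word vocabulary standing in for the real indices:

```python
import numpy as np

vocabulary = ['cat', 'cats', 'dog']   # toy stand-in for the full vocabulary
word_to_index = {w: i for i, w in enumerate(vocabulary)}

def one_hot(word):
    """A |V|-dimensional vector of zeros with a single set bit
    at the word's index in the vocabulary."""
    vec = np.zeros(len(vocabulary))
    vec[word_to_index[word]] = 1.0
    return vec

# Distinct one-hot vectors are orthogonal, so their dot product
# (and hence cosine similarity) is always zero: the representation
# captures no similarity between words at all
print(one_hot('cat').dot(one_hot('cats')))   # 0.0
```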
<div dir="ltr" style="text-align: left;">
While one-hot vectors could be stored efficiently, their main problem is that they don't capture any information about the similarity between words. They even lose the information about string-similarity, that sometimes indicates semantic similarity (when words share the same lemma / basic form). Since we're interested in representing words for semantic tasks, a better solution is needed.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
<b>Distributional vectors</b><br />
One important characteristic of a word is the company it keeps. According to the distributional hypothesis <sup><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html#ref1" name="top-ref1">[1]</a></sup> , words that occur in similar contexts (with the same neighboring words), tend to have similar meanings (e.g. <i style="text-align: center;">elevator </i><span style="text-align: center;">and </span><i style="text-align: center;">lift </i><span style="text-align: center;">will both appear next to <i>down</i>, <i>up</i>, <i>building</i>, <i>floor</i>, and <i>stairs</i>); simply put, "tell me who your friends are and I will tell you who you are" - the words version. </span><br />
<span style="text-align: center;"><br /></span>
<span style="text-align: center;">This idea was leveraged to represent words by their contexts. Suppose that each word is again represented by a |V|-dimensional vector. Each entry in the vector corresponds to an index in the vocabulary. Now, instead of marking a word with its own index, we mark the occurrences of other words next to it. Given a large corpus, we search for all the occurrences of a certain word (e.g. <i>elevator</i>). We take a window of a pre-defined size <i>k</i> around <i>elevator</i>, and every word in this window is considered as a neighbor of <i>elevator</i>. For instance, if the corpus contains the sentence "The left elevator goes down to the second floor", with a window of size 5, we get the following neighbors: <i>the, left, goes, down</i>. A larger <i>k </i>will also include characteristic words such as <i>floor.</i></span><br />
<div style="text-align: left;">
We then update <i>elevator</i>'s vector by increasing the number of occurrences in the indices of <i style="text-align: center;">the, left, goes, </i><span style="text-align: center;">and</span><i style="text-align: center;"> down</i><span style="text-align: center;">. At the end of this process, each word vector contains the frequencies of all its neighbors. We can normalize the vector to get a probability distribution (how likely each word is to appear as a neighbor of the target word). There are some more complex metrics, but this is the main idea. These methods are also referred to as "counting methods"</span><span style="text-align: center;">.</span></div>
<div style="text-align: left;">
<span style="text-align: center;"><br /></span></div>
<div style="text-align: left;">
<span style="text-align: center;">The main advantage of distributional vectors is that they capture similarity between words: similar words => similar neighbors => similar vectors. Measuring similarity between vectors is possible, using measures such as <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a>, for example. We get a simple method of comparing similarities between words - we can expect <i>elevator</i> and <i>lift </i>to yield a higher similarity score than </span><i style="text-align: center;">elevator</i><span style="text-align: center;"> and, say, <i>cat </i>(and I wouldn't miss the chance to watch a <a href="https://www.youtube.com/watch?v=lunMj9zuB18">cat in an elevator</a> video). These simple vector similarities are highly effective in recognizing word similarity in semantic tasks (e.g., to overcome lexical variability).</span></div>
<div style="text-align: left;">
<span style="text-align: center;"><br /></span></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6vVLPvM-Z_3CLjBX02ufvXVQGlnHndX3a_7jL_kxZjI2Tp0sao87E2Gysk2ZVadWOvFk74ZPp27tCGX71FwZCQgd7YpIvDUf1fpbbZYgORoZxyG6FDvBEFNu7LAKxf57MrDBZ727g8tM/s1600/distributional.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="112" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6vVLPvM-Z_3CLjBX02ufvXVQGlnHndX3a_7jL_kxZjI2Tp0sao87E2Gysk2ZVadWOvFk74ZPp27tCGX71FwZCQgd7YpIvDUf1fpbbZYgORoZxyG6FDvBEFNu7LAKxf57MrDBZ727g8tM/s400/distributional.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;">Figure 2: an illustration of distributional vectors of <i>food</i>, <i>eat</i> and <i>laptop</i>. Notice that the vectors of the similar words <i>food</i> and <i>eat </i>are similar, while different from the vector of the dissimilar word <i>laptop.</i></span></td></tr>
</tbody></table>
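The counting scheme described above, together with cosine comparison, can be sketched as follows, using the post's elevator sentence as a one-sentence "corpus":

```python
import numpy as np
from collections import Counter

corpus = ['The left elevator goes down to the second floor']

def neighbor_counts(sentences, target, k=5):
    """Count the words inside a size-k window (k // 2 words
    on each side) around every occurrence of target."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        for i, w in enumerate(tokens):
            if w == target:
                counts.update(tokens[max(0, i - k // 2):i])   # left context
                counts.update(tokens[i + 1:i + 1 + k // 2])   # right context
    return counts

print(sorted(neighbor_counts(corpus, 'elevator')))
# ['down', 'goes', 'left', 'the']

# Stacking the counts over the vocabulary gives the distributional
# vector, and two such vectors can be compared by cosine similarity
vocab = sorted({w for s in corpus for w in s.lower().split()})

def distributional_vector(target):
    counts = neighbor_counts(corpus, target)
    return np.array([counts[w] for w in vocab], dtype=float)

def cosine(u, v):
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))
```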
<div style="text-align: left;">
<span style="text-align: center;"><br /></span></div>
<div style="text-align: left;">
<span style="text-align: center;">Yet, distributional vectors pose computational obstacles: the vocabulary size is typically very high (at least a few hundred thousand words). Storing each of these |V| words in a |V|-dimensional vector results in a |V|</span><sup>2</sup>-cell matrix - quite large, and performing operations on all the words is computationally heavy.</div>
<div style="text-align: left;">
<br /></div>
</div>
<div dir="ltr" style="text-align: left;">
<span style="text-align: left;"><b>Word embeddings</b></span><br />
Word embeddings to the rescue. The basic idea is to store the same contextual information in a low-dimensional vector; each word is now represented by a D-dimensional vector, where D is a relatively small number (typically between 50 and 1000). This approach was first presented in 2003 by Bengio et al. <sup><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html#ref2" name="top-ref2">[2]</a></sup> , but gained extreme popularity with <a href="https://code.google.com/p/word2vec/">word2vec</a> <sup><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html#ref3" name="top-ref3">[3]</a></sup> in 2013. There are also some other variants of word embeddings, like <a href="http://nlp.stanford.edu/projects/glove/">GloVe</a> <sup><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html#ref4" name="top-ref4">[4]</a></sup>.<br />
<br />
Instead of counting occurrences of neighboring words, the vectors are now <i>predicted</i> (learned). Without getting too technical, this is what these algorithms basically do: they start from a random vector for each word in the vocabulary. Then they go over a large corpus and at every step, observe a target word and its context (neighbors within a window). The target word's vector and the context words' vectors are then updated to bring them close together in the vector space (and therefore increase their similarity score). Other vectors (all of them, or a sample of them) are updated to become less close to the target word. After observing many such windows, the vectors become meaningful, yielding similar vectors to similar words.<br />
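The "bring the vectors close together, push the others away" step can be sketched as a single update in the style of skip-gram with negative sampling. This is a bare-bones illustration with made-up words and parameters, not the actual word2vec implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 50                                                    # embedding dimension
words = ['elevator', 'down', 'cat']
vecs = {w: rng.normal(scale=0.1, size=D) for w in words}  # target vectors
ctx = {w: rng.normal(scale=0.1, size=D) for w in words}   # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(target, context, negatives, lr=0.5):
    """Increase the score of (target, context); decrease the
    score of (target, negative) for each sampled negative."""
    t = vecs[target]
    t_grad = np.zeros(D)
    for c, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        g = sigmoid(t.dot(ctx[c])) - label   # gradient of the logistic loss
        t_grad += g * ctx[c]
        ctx[c] -= lr * g * t
    t -= lr * t_grad                         # updates vecs[target] in place

before = sigmoid(vecs['elevator'].dot(ctx['down']))
for _ in range(20):
    sgns_update('elevator', 'down', negatives=['cat'])
after = sigmoid(vecs['elevator'].dot(ctx['down']))
```

After these updates, the score of the (elevator, down) pair is higher than at the random start, while the (elevator, cat) pair has been pushed down.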
<br />
What are the advantages of word embeddings? First of all, despite the low dimensions, the information regarding word similarity is kept. Similar words still have similar vectors - this is what the algorithms aim to do while learning these vectors. Second, they are compact, so any operation on these vectors (e.g. computing similarities) is efficient and fast.<br />
<br />
Moreover, people have done some amazing things with these vectors. Some of these things (if not all) are possible with high-dimensional vectors as well, but are computationally difficult.<br />
<br />
<ul dir="ltr">
<li><b>Most similar words</b> - we can easily find the most similar words to a certain word, by finding the most similar vectors. For instance, the 5 most similar words to <i>July </i>are <i>June</i>, <i>January</i>, <i>October</i>, <i>November</i>, and <i>February</i>. We can also find a word which is similar to a set of other words: for example, the most similar word to <i>Israel </i>+ <i>Lebanon </i>+ <i>Syria </i>+ <i>Jordan </i>is <i>Iraq</i>, while the most similar word to <i>France </i>+ <i>Germany </i>+ <i>Italy </i>+ <i>Switzerland </i>is <i>Austria</i>! Isn't this cool? If that's not impressive enough, using techniques for projecting high-dimensional vectors to 2-dimensional space (<a href="https://lvdmaaten.github.io/tsne/">t-SNE</a>, <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a>), one can visualize word embeddings, and get some really nice insights about what the vectors capture:</li>
</ul>
</div>
<div dir="ltr" style="text-align: left;">
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJgP7lmp3JUXMjT0CxwtdwQ-CkuBjLMvdUqbGjLY3z1XxZS7pGnNVpjJUb74tL4gAdK8BR7wwZeWmqnGzcfbdELXUdKvE7W6wbtYeJ91gSePYep2nBNbhpabQEVV_ut4t7Ti2FBXh_x9Q/s1600/tsne.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="297" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgJgP7lmp3JUXMjT0CxwtdwQ-CkuBjLMvdUqbGjLY3z1XxZS7pGnNVpjJUb74tL4gAdK8BR7wwZeWmqnGzcfbdELXUdKvE7W6wbtYeJ91gSePYep2nBNbhpabQEVV_ut4t7Ti2FBXh_x9Q/s400/tsne.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small;">Figure 3: a t-SNE visualization of a tiny sample of GloVe word embeddings. It seems to have noticed that <i>bird </i>and <i>fly </i>are related, as well as <i>bird </i>and <i>cat</i>, but <i>love</i>, <i>marriage </i>and <i>wedding </i>are not, for some reason.</span></td></tr>
</tbody></table>
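Finding the most similar words, as above, is just a cosine-similarity search against every vector. Here is a small sketch over a made-up embedding table (in practice you would load pretrained word2vec or GloVe vectors instead of random ones); note that querying with a <i>set</i> of words needs no special machinery - you simply query with the vector sum:

```python
import numpy as np

# Made-up embedding table for illustration; real vectors would be
# loaded from a pretrained word2vec or GloVe model.
words = ["july", "june", "october", "cat", "dog"]
rng = np.random.default_rng(1)
E = rng.normal(size=(len(words), 4))
E /= np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalize each row

def most_similar(query_vec, topn=3):
    """Rank all words by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = E @ q                                 # one dot product per word
    order = np.argsort(-sims)
    return [(words[i], float(sims[i])) for i in order[:topn]]

# Most similar to a single word (the word itself comes out on top):
print(most_similar(E[words.index("july")]))
# Most similar to a set of words: query with the vector sum.
print(most_similar(E[words.index("june")] + E[words.index("october")]))
```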
<ul dir="ltr"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisAhW8nTZIzI20UnSuYHwPOSpXxZOT-SKIwa6-_K7lOmDIcCRU2SCSI2GSN7NTxq4epPSqoju7KgS4_MQWkTA3gyNn-96jxIsI7-0FvhvjwwisDX9Qhw1J-lq9nXiLoLMQRYeptn0uBuo/s1600/word2vec_analogies_1.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="273" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisAhW8nTZIzI20UnSuYHwPOSpXxZOT-SKIwa6-_K7lOmDIcCRU2SCSI2GSN7NTxq4epPSqoju7KgS4_MQWkTA3gyNn-96jxIsI7-0FvhvjwwisDX9Qhw1J-lq9nXiLoLMQRYeptn0uBuo/s400/word2vec_analogies_1.JPG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: x-small; text-align: left;">Figure 4: two-dimensional projection of word2vec vectors of countries and their capitals, from the <a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf">paper</a>. The lines between countries and their capitals are approximately parallel, indicating that the vectors capture this relation. </span></td></tr>
</tbody></table>
<li><b>Analogies </b>- Following the above figure, the authors of this paper presented some nice results regarding the vectors' ability to solve analogy questions: <i>a </i>is to <i>b </i>as <i>c </i>is to <i>d</i>; given <i>a, b, </i>and <i>c</i>, find the missing word <i>d.</i> It turns out that this can be solved pretty accurately by selecting a vector which is similar to <i>b</i> and <i>c</i>, but not to <i>a</i>. This is done by vector addition and subtraction, and the most famous example is "<i>king </i>- <i>man </i>+ <i>woman </i>= <i>queen</i>", demonstrated in the following figure:</li>
</ul>
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRhT0GSluAaMuTFYJGIC1Qp-pQFasNS66AOvZiZZJYZCviiDoZPJFok_e5lDRIFoV09isymi9SBL8YyIoHWkaPh87ca7f9XXgd6x8HxxnEBun34eCXPaZG_6foO0w4NW7q03HxnpBt2vs/s1600/word2vec_analogies_2.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRhT0GSluAaMuTFYJGIC1Qp-pQFasNS66AOvZiZZJYZCviiDoZPJFok_e5lDRIFoV09isymi9SBL8YyIoHWkaPh87ca7f9XXgd6x8HxxnEBun34eCXPaZG_6foO0w4NW7q03HxnpBt2vs/s1600/word2vec_analogies_2.JPG" /></a></td></tr>
<tr><td class="tr-caption" style="direction: ltr; text-align: center;"><span style="font-size: x-small;">Figure 5: an illustration of the analogical male-female relation between word pairs such as (<i>king</i>, <i>queen</i>), (<i>man</i>, <i>woman</i>) and (<i>uncle</i>, <i>aunt</i>), from the <a href="http://research.microsoft.com/pubs/189726/rvecs.pdf">paper</a>.</span></td></tr>
</tbody></table>
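The analogy trick itself is one line of vector arithmetic. The sketch below uses hand-built toy vectors in which the male-female offset is exact by construction; learned embeddings only satisfy such relations approximately:

```python
import numpy as np

# Hand-built toy vectors: the third dimension stands for "royalty",
# so king = man + royal and queen = woman + royal exactly.
vecs = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([0.0, 1.0, 1.0]),
    "apple": np.array([0.3, 0.3, 0.0]),
}

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c):
    """a is to b as c is to ?  ->  the word closest to b - a + c,
    excluding the three query words (standard practice)."""
    target = vecs[b] - vecs[a] + vecs[c]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(target, vecs[w]))

print(analogy("man", "king", "woman"))  # king - man + woman -> queen
```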
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
While these results are very impressive (I'm still impressed and I've known about them for a long time now...), it still isn't clear how useful these vectors are. Turns out they are amazingly useful. Almost any semantic task has been re-implemented recently with the help of word embeddings. Many of them benefit from a performance boost, and it's mostly just very easy to use them. In future posts, I'll give some examples of tasks that rely on word embeddings.</div>
<ul dir="ltr" style="text-align: left;"><ul>
</ul>
</ul>
<div dir="ltr" style="text-align: left;">
<hr />
<div style="direction: ltr; text-align: left;">
<b>Suggested Reading:</b><br />
<span style="font-size: x-small;"><a href="http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/">Deep Learning, NLP, and Representations</a>, from <a href="http://colah.github.io/">Christopher Olah's blog</a> (more technical and advanced).</span></div>
<hr />
<div style="direction: ltr; text-align: left;">
<b>References:</b><br />
<div style="text-align: left;">
<a href="https://www.blogger.com/null" name="ref1"><span style="font-size: x-small;">[1]</span></a><span style="font-size: x-small;"> </span><span style="font-size: x-small;">Harris, Zellig S. "Distributional structure." Word. 1954. </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html#top-ref1" style="font-size: small;">â©</a></sup></div>
<a href="https://www.blogger.com/null" name="ref2"><span style="font-size: x-small;">[2]</span></a><span style="font-size: x-small;"> </span><span style="font-size: x-small;">Yoshua Bengio, RĂ©jean Ducharme, Pascal Vincent, and Christian Janvin, "A neural probabilistic language model", The Journal of Machine Learning Research, 2003. </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html#top-ref2" style="font-size: small;">â©</a></sup><br />
<a href="https://www.blogger.com/null" name="ref3"><span style="font-size: x-small;">[3]</span></a><span style="font-size: x-small;"> </span><span style="font-size: x-small;">Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient estimation of word representations in vector space." CoRR, 2013.. </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html#top-ref3" style="font-size: small;">â©</a></sup><br />
<a href="https://www.blogger.com/null" name="ref4" style="text-align: right;"><span style="font-size: x-small;">[4]</span></a><span style="font-size: x-small; text-align: right;"> </span><span style="font-size: x-small; text-align: right;">Jeffrey Pennington, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." EMNLP 2014. </span><sup style="font-size: small; text-align: right;"><a href="http://veredshwartz.blogspot.co.il/2016/01/representing-words.html#top-ref4">â©</a></sup></div>
</div>
</div>Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com28tag:blogger.com,1999:blog-9145120678290195131.post-22732996760653865112015-11-11T17:29:00.001+02:002015-11-17T01:28:54.279+02:00Recommender Systems<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
Recommender systems suggest specific content to users according to their preferences, by predicting the users' ratings of items.<br />
<br />
If this doesn't ring a bell, let me tell you how common recommender systems are in your world.<br />
<ul style="text-align: left;">
<li>Shopping sites, such as <a href="http://www.amazon.com/">Amazon</a>, <a href="http://www.ebay.com/">eBay</a>, and <a href="http://aliexpress.com/">AliExpress</a>, recommend items to purchase based on your previous purchases.<span style="font-size: x-small;">*</span></li>
</ul>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDp52X1483VJdzc677bNE1lx3G7kb_PgopHTW8CWO7a5gWOGxdKWJlTqfIzWRRsBQaEo2DO8z1rzZzatJLSfYy9-AzHTmvsCdT0eB6OrMWPhaTVpwSgnX2tJ-7y6q8ZzBvZkdSo_oQclg/s1600/Amazon_recommended.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="272" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDp52X1483VJdzc677bNE1lx3G7kb_PgopHTW8CWO7a5gWOGxdKWJlTqfIzWRRsBQaEo2DO8z1rzZzatJLSfYy9-AzHTmvsCdT0eB6OrMWPhaTVpwSgnX2tJ-7y6q8ZzBvZkdSo_oQclg/s400/Amazon_recommended.PNG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Amazon recommended me to buy a recycling bin sticker because I had recycle bins in my shopping basket</td></tr>
</tbody></table>
</div>
<ul style="text-align: left;">
<li><a href="http://www.imdb.com/">IMDB</a> recommends movies you might like, based on your other movie ratings.</li>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgo-RHVVbrJbRo2O5xT6OeeqxFuJoGKKmnYJR9QurCGoLkiN7Y1W3v22DP6eEiSEqQliG_eJJRBOSBdCpF5xNct_mHbiwjdAknhyxFXOAmGDsWv8e8RG9lbBe1o-NA1EuApg2jApSw7AQ4/s1600/imdb+recommendation.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="201" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgo-RHVVbrJbRo2O5xT6OeeqxFuJoGKKmnYJR9QurCGoLkiN7Y1W3v22DP6eEiSEqQliG_eJJRBOSBdCpF5xNct_mHbiwjdAknhyxFXOAmGDsWv8e8RG9lbBe1o-NA1EuApg2jApSw7AQ4/s400/imdb+recommendation.PNG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">IMDB recommended my husband to watch "Pirates of Silicon Valley", based on his interest in similar movies.<br />
<div>
<br /></div>
</td></tr>
</tbody></table>
<li>Music sites, such as <a href="http://www.rdio.com/">Rdio</a>, <a href="http://www.spotify.com/">Spotify</a> and the late GrooveShark, recommend artists similar to the artists you listen to.</li>
</ul>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWZHVRP8xfi98PxDhgCQx3U509V-uEdD7x6YaCFAJj0tPggIe-FsCVWNGLqZsGXQl8DUb9Xe-B6RMGTwDiXmdM4uJhtpfeHPriBHZVTr-32tj6Qa8cUzZLpH06-m8rLbxUJwlsPmxGm4c/s1600/Rdio.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWZHVRP8xfi98PxDhgCQx3U509V-uEdD7x6YaCFAJj0tPggIe-FsCVWNGLqZsGXQl8DUb9Xe-B6RMGTwDiXmdM4uJhtpfeHPriBHZVTr-32tj6Qa8cUzZLpH06-m8rLbxUJwlsPmxGm4c/s400/Rdio.PNG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 12.8px;"><span style="font-size: x-small;">Rdio suggested artists that it considered similar to those I've been listening to</span></td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<ul style="text-align: left;">
<li>And of course, <a href="http://youtube.com/">YouTube</a> recommends videos based on your previous watches.</li>
</ul>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_0ar89QCgsx9GErRILuewubka-RXwbPc9nte6RAVGDr8qF6WLMjed_TbbaTsg7cnrDPTl5lm8GAx6302hiMBKHi4hhU7458az_FDXljNDFM3ZLriDm9mCIV1cRwNmGCJiU3M-9Dt5yo8/s1600/YouTube.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="221" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_0ar89QCgsx9GErRILuewubka-RXwbPc9nte6RAVGDr8qF6WLMjed_TbbaTsg7cnrDPTl5lm8GAx6302hiMBKHi4hhU7458az_FDXljNDFM3ZLriDm9mCIV1cRwNmGCJiU3M-9Dt5yo8/s320/YouTube.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">YouTube identified that I like budgies and Muse and that I'm a beginner electric guitar player</td></tr>
</tbody></table>
You can't overestimate the business value of these recommendations. Studies have shown that they increase sales. So how does Amazon know which products are likely to interest me? How does Rdio know to recommend I listen to Kings of Leon and would never recommend listening to Taylor Swift? In general, these systems know that I like certain products/artists/movies/videos and would like to predict other products/artists/movies/videos I might like. Many of these systems use a quite simple algorithm, called <a href="https://en.wikipedia.org/wiki/Collaborative_filtering">Collaborative Filtering</a>.<br />
<br />
<b>Collaborative Filtering</b><br />
<br />
Let's take music recommendation as an example. The system has many registered users, and many artists. For simplicity, assume that ratings apply to artists rather than albums or songs. Users can rate artists on a scale of 1-5. An average user rates only a few artists; there are still many other artists that he doesn't rate, either because he doesn't know them or because he didn't listen to them through the website.<br />
<br />
In many cases, the user doesn't actively rate the artist; instead, the system infers a rating based on the user's behavior - for example, if the user listened to a certain artist many times, it counts as a high rating. On the other hand, if the system offered the user this artist a couple of times and he always clicked "skip", he must really not like this artist. The exact rating technique doesn't really matter.<br />
<br />
User preferences are stored in a large matrix (table), in which each row represents a user, and each column represents an artist:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXaNo_cfr8-9xigsFe7WQs0QMjHYfuY1CfafkzSGiJkxRJBylrHzdGu2UTQXxniZyZEjXv5xlajtVPihzuvryFQ34iQzVh-5cfVj7nKY2aGG4xfVbm6sUaSO4hhB3ugcMfLb_e3r00TIE/s1600/matrix_original.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXaNo_cfr8-9xigsFe7WQs0QMjHYfuY1CfafkzSGiJkxRJBylrHzdGu2UTQXxniZyZEjXv5xlajtVPihzuvryFQ34iQzVh-5cfVj7nKY2aGG4xfVbm6sUaSO4hhB3ugcMfLb_e3r00TIE/s1600/matrix_original.PNG" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A sample user artist ranking matrix. with 5 users and 7 artists. The ranking is between 1 (hate) to 5 (love). If a user hasn't ranked a certain artist, the rank is unknown and is marked with a question mark. This is a toy example; much more data is needed for accurate inference.</td></tr>
</tbody></table>
The idea behind the algorithm is basically to get rid of the question marks in the matrix, and replace them with predicted ratings. The basic assumptions are:<br />
<br />
<ol style="text-align: left;">
<li>If I like a certain artist (e.g. Muse), I would like similar artists (e.g. Royal Blood).</li>
<li>If I'm user A, and I agree with user B's ratings on many artists, I might agree with user B's ratings of other artists, which I haven't rated (and maybe don't know yet). For example, if both user B and I really love Taylor Swift and Miley Cyrus, and user B also really likes Bruno Mars, which I haven't rated, then I might also like Bruno Mars.</li>
</ol>
<div>
From these two reasonable assumptions, the algorithm branches into two possible implementations:</div>
<div>
<ol style="text-align: left;">
<li><b>Based on user similarity</b><br />This implementation looks at a certain user and tries to complete missing ratings for this user. According to the second assumption, all we need to do is measure to what extent this user's preferences are similar to those of any other user, and then use the similar users' ratings to complete missing ratings.<br /><br />It is easy to see that each user is represented by a row vector. The dimension of this vector is the number of artists, and every cell in the vector represents the user's rating of a certain artist. For example, this is user 2's vector:<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSM164IW9Kau2VrOAXmUlSNsXQ55utMDhpwAra4e-C6gepAosj51J5bSJRutBybo1aNm5yqd731JQ6jTVz5_4mI0eBArC07rK7vXpaeO-W6wiaHaq2ihBsHnR5Xa3PvopbLzg3zEGi0sc/s1600/user_2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSM164IW9Kau2VrOAXmUlSNsXQ55utMDhpwAra4e-C6gepAosj51J5bSJRutBybo1aNm5yqd731JQ6jTVz5_4mI0eBArC07rK7vXpaeO-W6wiaHaq2ihBsHnR5Xa3PvopbLzg3zEGi0sc/s1600/user_2.PNG" /></a><br /><br />We can measure the correlation of user 2's ratings with another user's ratings, by looking at the subset of artists that both users rated. We treat each rating with respect to the user's average rating of artists. For example, user 2's average rating is (1 + 5 + 4 + 4) / 4 = 3.5, so his rating of Arctic Monkeys is 2.5 below the average (-2.5), while his rating for Taylor Swift is 1.5 above it (+1.5). High correlation between users occurs when they have many mutual ratings with similar distance from the average. I'll spare you the formula, but let's look at an example. It seems that users 1 and 3 share a fairly similar taste in music:<br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixFGOOdRgjb-RMBWBfLI6G7pS5D5r3Y2RPQfauLW_bFUkaVdAQRQIPSkyam_pw2c_LwP9IvTHWDFkDkdfNd7QtU6M14hQMEF4jQlzJwmSjUKkNX66f3CRRgYvNlaTyEsJrIoBuyriptMg/s1600/user_similarity.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixFGOOdRgjb-RMBWBfLI6G7pS5D5r3Y2RPQfauLW_bFUkaVdAQRQIPSkyam_pw2c_LwP9IvTHWDFkDkdfNd7QtU6M14hQMEF4jQlzJwmSjUKkNX66f3CRRgYvNlaTyEsJrIoBuyriptMg/s1600/user_similarity.PNG" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Rating vectors of user 1 (top) and 3 (bottom). The mutual ratings are similar.</td></tr>
</tbody></table>
While they both differ from user 5, who seems to hate both Arctic Monkeys and Muse:<br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijcnwE8pwDRRtPMuNP9eP_e6YqFLJETPw2r1s81gt6y1Q9Q2vIu0EHvxcOTdA3iuV-a7ugwcqqskKdnCg4hoOQGgx1D6DA4oVS7iYF7ZZN1juaLqwAzkn1DYRy6pIBGd7tgtsyrOPDwdg/s1600/different_user.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijcnwE8pwDRRtPMuNP9eP_e6YqFLJETPw2r1s81gt6y1Q9Q2vIu0EHvxcOTdA3iuV-a7ugwcqqskKdnCg4hoOQGgx1D6DA4oVS7iYF7ZZN1juaLqwAzkn1DYRy6pIBGd7tgtsyrOPDwdg/s1600/different_user.PNG" /></a><br /><br />If we would like to predict user 1's missing rating of Bruno Mars, we first need to find the <i>k</i> most similar users to user 1 who rated Bruno Mars (<i>k</i> being a parameter of the system). For each of these similar users, we compute the distance of their rating of Bruno Mars from their average rating. For example, +1, -2.5, etc. Then, we compute a weighted average of these distances, with the user similarity as the weight. The weighted average means that the more similar a certain user is to the target user (e.g. user 1), the more weight we put on his preferences.<br /><br />This weighted average is the predicted distance from the average rating of user 1. For example, setting <i>k=1</i> will take into account user 3 as the most similar user. User 3's average rating is (5 + 1 + 5 + 2 + 5 + 1) / 6 = 3.167. The distance of his rating for Bruno Mars from the average is 1 - 3.167 = -2.167. Since we chose <i>k=1</i>, this would be exactly the distance for user 1. His average rating is (4 + 5 + 3) / 3 = 4, so his predicted rating for Bruno Mars is 4 - 2.167 = 1.83. Pretty reasonable, considering that both users like rock, but user 1 shows more tolerance to pop music by liking Adele a bit more than user 3.</li>
<li><b>Based on item similarity</b><br />This variant is pretty straightforward after understanding the user-based similarity. We now look at each artist as a column vector, trying to predict what users would rate this artist. <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0YNNA-62dNHfyCBdTYEh_TgABqp8hKYeBfUFY8a1APafmLV6XOdiAZERwhwi7h49EXrVR_8xEa3yFA1zVFOEdoju9AUG9j83vYF17a88vlfL_E8q617c7BiH6I2U9NRmXgnwGEbuYTNM/s1600/similar_artists.png" imageanchor="1" style="margin-left: auto; margin-right: auto; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0YNNA-62dNHfyCBdTYEh_TgABqp8hKYeBfUFY8a1APafmLV6XOdiAZERwhwi7h49EXrVR_8xEa3yFA1zVFOEdoju9AUG9j83vYF17a88vlfL_E8q617c7BiH6I2U9NRmXgnwGEbuYTNM/s1600/similar_artists.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The column vector for Arctic Monkeys (left) and Muse (right). They are fairly similar in the given toy example (while not so similar in real life, given other rock artists).</td></tr>
</tbody></table>
<br />The <i>k</i> most similar artists of each artist are computed in a similar fashion. Again, with respect to the distance of each user's rating from his average (that's because different users have different rating scales - some are more generous than others). The predicted rating of a certain artist by a certain user is the weighted average of the ratings of this user for similar artists. For example, to predict user 4's rating of Arctic Monkeys with <i>k=1</i>, we simply take his rating of Muse.</li>
</ol>
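The user-based computation described above fits in a short function. This is an illustrative sketch over a small made-up ratings matrix (not the exact table from this post): mean-center each user's ratings, measure similarity over co-rated artists, and predict with a similarity-weighted average of the <i>k</i> nearest users' deviations:

```python
import numpy as np

# Illustrative ratings matrix (rows = users, columns = artists);
# np.nan marks an unknown rating. A made-up example for demonstration.
R = np.array([
    [4.0, 5.0, np.nan, 3.0],
    [1.0, 5.0, 4.0,    4.0],
    [5.0, 5.0, 1.0,    2.0],
])

def predict(R, user, item, k=1):
    """User-based CF: the user's mean plus a similarity-weighted average
    of the k most similar users' mean-centered ratings for the item."""
    mask = ~np.isnan(R)
    means = np.array([R[u][mask[u]].mean() for u in range(len(R))])
    sims = []
    for other in range(len(R)):
        if other == user or np.isnan(R[other, item]):
            continue
        both = mask[user] & mask[other]          # artists both users rated
        if not both.any():
            continue
        a = R[user, both] - means[user]
        b = R[other, both] - means[other]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sims.append((a @ b / denom if denom else 0.0, other))
    top = sorted(sims, reverse=True)[:k]
    num = sum(s * (R[o, item] - means[o]) for s, o in top)
    den = sum(abs(s) for s, _ in top)
    return means[user] + (num / den if den else 0.0)

# Predict user 0's missing rating for artist 2 from the single most
# similar user; user 2's mean-centered ratings align best with user 0's.
print(round(predict(R, user=0, item=2, k=1), 2))  # -> 1.75
```

With <i>k=1</i> the prediction reduces to exactly the recipe in the worked example: the target user's mean plus the nearest neighbor's deviation from his own mean.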
<div>
The next time some website miraculously predicts which artist/video/movie/product you would like, you should know it wasn't a wild guess but a rather simple heuristic.</div>
<div>
<hr />
<span style="font-size: x-small;">* No items were purchased during the writing of this post.</span></div>
</div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com2tag:blogger.com,1999:blog-9145120678290195131.post-21324705657971528602015-10-09T19:11:00.000+03:002015-10-09T19:11:05.735+03:00Fun with Google Ngrams<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
Google Ngrams is a dataset released by Google a few years back, based on Google Books.
The dataset is available for multiple languages. All Google Books in a given language were scanned, and the frequency of each <a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html">n-gram</a>, for n = 1 to 5, was counted by publication year. For example, how many times did the word "no" occur in books from 1800 to 1900? How many times did the trigram "freedom of speech" occur in the year 2000?
These counts were normalized, and the result is an approximate probability of each n-gram over time. The data is available for download, and can also be explored via the <a href="http://books.google.com/ngrams/">viewer</a>.</div>
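The counting step itself is simple. Here is a miniature sketch with a made-up two-year "corpus"; the real dataset does this over millions of scanned books per language and then normalizes each count by the year's total:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sequences in a list of tokens."""
    return zip(*(tokens[i:] for i in range(n)))

# Made-up miniature "corpus by year" for illustration; the real dataset
# aggregates all scanned Google Books for a language.
books_by_year = {
    1999: ["freedom of speech is vital".split()],
    2000: ["freedom of speech matters and freedom of speech".split()],
}

# Count every trigram per year, then normalize by the year's total
# trigram count to get the relative frequency the viewer plots.
counts = {
    year: Counter(" ".join(gram) for book in books for gram in ngrams(book, 3))
    for year, books in books_by_year.items()
}
totals = {year: sum(c.values()) for year, c in counts.items()}

print(counts[2000]["freedom of speech"])                 # raw count: 2
print(counts[2000]["freedom of speech"] / totals[2000])  # relative frequency
```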
<div dir="ltr" style="text-align: left;">
<br />
While this data is very helpful for training NLP models, it can also provide some cultural, historical and sociological insights... and hours of fun!<br />
<br />
Let's warm up with a simple example, exploring the change in the English language throughout time. I took a few synonyms of the word <i>happy</i>, and compared their usage between 1800 and 2008 (the last year available in the viewer).
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=gay%2Cmerry%2Ccheerful%2Cdelighted&year_start=1800&year_end=2015&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cgay%3B%2Cc0%3B.t1%3B%2Cmerry%3B%2Cc0%3B.t1%3B%2Ccheerful%3B%2Cc0%3B.t1%3B%2Cdelighted%3B%2Cc0" vspace="0" width="100%"></iframe>
While the curves of <i>merry</i>, <i>cheerful</i> and <i>delighted</i> look pretty similar, the <i>gay</i> curve departs from the others in the 1980s. There's a reason for that, and it can be explained by the following graph:
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=gay%2Chomosexual&year_start=1800&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cgay%3B%2Cc0%3B.t1%3B%2Chomosexual%3B%2Cc0" vspace="0" width="100%"></iframe>
The word <i>gay</i>, in the sense of <i>homosexual</i>, has been in use since the <a href="http://www.todayifoundout.com/index.php/2010/02/how-gay-came-to-mean-homosexual/">1950s</a>, boosting its frequency ever since.<br />
<br />
The frequency of a certain term sometimes correlates with historical events. For example, while the word <i>war </i>is constantly in use, it was most prominent in books during and after World War I and World War II. See the peaks in the graph:<br />
<br />
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=war&year_start=1800&year_end=2008&corpus=15&smoothing=0&share=&direct_url=t1%3B%2Cwar%3B%2Cc0" vspace="0" width="100%"></iframe><br />
<br />
Another interesting thing to notice is that people (at least authors) are actually peaceful. Whenever they talk about wars, they also talk about peace. Look how similar the <i>war</i> and <i>peace</i> curves are:<br />
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=war%2Cpeace&year_start=1800&year_end=2008&corpus=15&smoothing=0&share=&direct_url=t1%3B%2Cwar%3B%2Cc0%3B.t1%3B%2Cpeace%3B%2Cc0" vspace="0" width="100%"></iframe> <br />
The curve similarity may suggest that the same books that discuss war also mention peace, but since the <i>war</i> curve dominates the <i>peace </i>curve, I can only assume that war is the books' main topic and peace is only mentioned a couple of times. I hope that they say good things about peace.<br />
<br />
Searching for <i>World Trade Center</i> shows that it was first mentioned around its construction in 1973; then there were a few years in which it was hardly discussed, and then 9/11 came along and made it a very common topic.
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=World+Trade+Center&year_start=1950&year_end=2008&corpus=15&smoothing=0&share=&direct_url=t1%3B%2CWorld%20Trade%20Center%3B%2Cc0" vspace="0" width="100%"></iframe>
In some cases, the correlation with historical events comes through new words that describe concepts or products: they start appearing in books around the time of their invention or founding. For example:<br />
<br />
Facebook was founded in 2004.
<br />
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=Facebook&year_start=1980&year_end=2008&corpus=15&smoothing=0&share=&direct_url=t1%3B%2CFacebook%3B%2Cc0" vspace="0" width="100%"></iframe>
Google was founded in 1998.
<br />
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=Google&year_start=1980&year_end=2008&corpus=15&smoothing=0&share=&direct_url=t1%3B%2CGoogle%3B%2Cc0" vspace="0" width="100%"></iframe>
Twitter was founded in 2006. However, <i>twitter</i> is an English word that was already in use before 2006 (and, as it seems, sometimes appeared capitalized, probably at the beginning of a sentence).
<br />
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=twitter%2CTwitter&year_start=1800&year_end=2008&corpus=15&smoothing=3&share=" vspace="0" width="100%"></iframe>
What about some older inventions?
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=television%2Ctelephone&year_start=1800&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Ctelevision%3B%2Cc0%3B.t1%3B%2Ctelephone%3B%2Cc0" vspace="0" width="100%"></iframe>
<br />
The invention of the telephone, commonly attributed to Alexander Graham Bell, in fact involved other inventors such as Antonio Meucci and Thomas Watson. Work started in 1844, but Bell was granted the patent for the telephone in 1876. The television was invented in 1926. Which of them had a greater influence on the world? If there is any correlation between being mentioned in books and having influence on the world, it seems like the television did. Having said that, the telephone is commonly referred to as a phone, and in recent years the word also covers <i>cellphone</i> and <i>smartphone</i>. So putting all these together changes the picture:
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=tv%2Btelevision%2Cphone%2Btelephone%2Bsmartphone%2Bcellphone&case_insensitive=on&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2C(tv%20%2B%20television)%3B%2Cc0%3B.t1%3B%2C(phone%20%2B%20telephone%20%2B%20smartphone%20%2B%20cellphone)%3B%2Cc0" vspace="0" width="100%"></iframe>
<br />
Some words were mentioned for a period of time and then just disappeared. Take, for example, this list of diseases, each relevant in a different era:
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=Tuberculosis%2CSmallpox%2Clung+cancer%2CCholera%2CLeprosy%2CHIV&year_start=1800&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2CTuberculosis%3B%2Cc0%3B.t1%3B%2CSmallpox%3B%2Cc0%3B.t1%3B%2Clung%20cancer%3B%2Cc0%3B.t1%3B%2CCholera%3B%2Cc0%3B.t1%3B%2CLeprosy%3B%2Cc0%3B.t1%3B%2CHIV%3B%2Cc0" vspace="0" width="100%"></iframe>
Beyond historical events, you can also try to use the data to search for correlations between events or phenomena. Judge for yourself:
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=diabetes%2Cobesity&year_start=1800&year_end=2008&corpus=15&smoothing=0&share=&direct_url=t1%3B%2Cdiabetes%3B%2Cc0%3B.t1%3B%2Cobesity%3B%2Cc0" vspace="0" width="100%"></iframe>
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=one+child+policy%2CMissing+Women&year_start=1900&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cone%20child%20policy%3B%2Cc0%3B.t1%3B%2CMissing%20Women%3B%2Cc0" vspace="0" width="100%"></iframe>
Bear in mind that correlation doesn't imply a cause-and-effect relation, nor even that a third factor impacts both phenomena. Sometimes they just happen at the same time.
<br />
<br />
Just for the fun of it, can you guess which is the most important day of the week?
<iframe frameborder="0" height="220" hspace="0" marginheight="0" marginwidth="0" name="ngram_chart" scrolling="no" src="https://books.google.com/ngrams/interactive_chart?content=Sunday%2CMonday%2CTuesday%2CWednesday%2CThursday%2CFriday%2CSaturday&year_start=1800&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t1%3B%2CSunday%3B%2Cc0%3B.t1%3B%2CMonday%3B%2Cc0%3B.t1%3B%2CTuesday%3B%2Cc0%3B.t1%3B%2CWednesday%3B%2Cc0%3B.t1%3B%2CThursday%3B%2Cc0%3B.t1%3B%2CFriday%3B%2Cc0%3B.t1%3B%2CSaturday%3B%2Cc0" vspace="0" width="100%"></iframe>
It's Sunday! I expected that from English books, but I thought that Saturday would be more prominent in Hebrew books. That wasn't the case: the Hebrew graph was similar, with Sunday way ahead of the other days. Maybe this is because of translated books. Happy weekend, everyone!
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com1tag:blogger.com,1999:blog-9145120678290195131.post-38769048844304237102015-09-29T18:17:00.000+03:002015-09-30T00:51:36.238+03:00Translation Models<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
This is the last part of the <a href="http://veredshwartz.blogspot.co.il/2015/08/machine-translation-1-overview.html">machine translation overview</a>, in which I will discuss translation models. To recall, a statistical machine translation system produces a translation that is required to be both <i>adequate</i>, that is, as close as possible in its meaning to the source sentence, and <i>fluent </i>in the target language. Fluency is the responsibility of the target <a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html">language model</a>, which scores every candidate translation according to its likelihood in the target language. The translation model, which will be presented in this post, takes care of <i>adequacy</i>: it scores candidate translations with respect to the original sentence in the source language - higher scores for sentences that better preserve the meaning of the original sentence.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicBUrBca7z_8sJa-B9A8z_F-Q6e-WoStSzg5bKb_hRLiQmtbe203hG1Hhp-bnDkCalfjOzmoFI5354AWyrkJOR7puGCryXi7OLf9qAuuIApENsNYIAvVrSHj64dzNTljk89IfvETkxWd4/s1600/IMG_1382.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicBUrBca7z_8sJa-B9A8z_F-Q6e-WoStSzg5bKb_hRLiQmtbe203hG1Hhp-bnDkCalfjOzmoFI5354AWyrkJOR7puGCryXi7OLf9qAuuIApENsNYIAvVrSHj64dzNTljk89IfvETkxWd4/s320/IMG_1382.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Toilet sign at a restaurant in Mestre, Italy. Some kind of machine translation was used, translating <i>toilet</i> into Hebrew as <i>the makeup</i>. If you recognize funny translations in other languages, please comment!</td></tr>
</tbody></table>
<div dir="ltr">
As in language models, you don't need an expert to build the translation model. You don't even need to speak either the source or the target language. Using statistical methods, you can (theoretically) build a translation model from <a href="https://en.wikipedia.org/wiki/Swahili_language">Swahili</a> to <a href="https://en.wikipedia.org/wiki/Yiddish_language">Yiddish</a>. The only requirement is a <i>parallel corpus</i> - a large amount of the same text written in both languages, for example, movie subtitles or book translations. The texts are usually aligned at the sentence level, so the corpus can be regarded as a large collection of sentences in the source language and their translations to the target language. For example, the first sentence from George Orwell's novel <a href="https://en.wikipedia.org/wiki/Nineteen_Eighty-Four">1984</a>, in the original edition and in the Hebrew translation:<br />
<br />
<b>en</b>: <i>It was a bright cold day in April, and the clocks were striking thirteen.</i><br />
<b>he</b>: <i>ŚŚŚ ŚŚ€ŚšŚŚ ŚŠŚ ŚŚŠŚŚ Ś, ŚŚ©ŚąŚŚ ŚŚ ŚŚŠŚŚŠŚŚŚ Ś©ŚŚŚ©-ŚąŚ©ŚšŚ.</i><br />
<br />
can be considered mutual translations, and so can the rest of the sentence pairs, as long as the translator is not too creative.</div>
<div dir="ltr">
<br /></div>
</div>
<div dir="ltr" style="text-align: left;">
<u>History Lesson</u><br />
Here's a nice anecdote about using a <i>parallel corpus </i>for translation<i> - </i>it's actually not a modern technique at all; it has been around since the 19th century. The <a href="https://en.wikipedia.org/wiki/Rosetta_Stone">Rosetta Stone</a> is an ancient Egyptian stone inscribed with a decree issued in Egypt in 196 BC. The text on the stone is written in three scripts: ancient Egyptian hieroglyphs, Demotic script, and ancient Greek. Ancient Egyptian hieroglyphs were used until the end of the fourth century CE, after which the knowledge of how to read them was lost. For hundreds of years, scholars tried to decipher them. In 1799, the Rosetta Stone was rediscovered near the town of Rosetta in the Nile Delta, and it brought with it a major advancement in the decipherment. It was the recognition that the stone offered three versions of the same text that enabled this advancement, making it the first parallel corpus used for translation (at that time, without machines). The hieroglyphs were finally deciphered in 1822 by the French scholar Jean-François Champollion. The stone is on public display at the British Museum (and is the most interesting exhibit there, in my opinion).<br />
<div>
<br /></div>
</div>
<div dir="ltr" style="text-align: left;">
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQwOyN-Or6X9A3o0apgH6kqRRWF5L9eFKWKhAh8cnKClOAwrqy7ulHs4MDmhxo9eBrRuw-NjSASkduNxQGVS0X9fS28qqEf1E9ImfvSza3oiL8YVF1lbglX28NZ0UlizQQpfziXXOe1dU/s1600/IMG_6488.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQwOyN-Or6X9A3o0apgH6kqRRWF5L9eFKWKhAh8cnKClOAwrqy7ulHs4MDmhxo9eBrRuw-NjSASkduNxQGVS0X9fS28qqEf1E9ImfvSza3oiL8YVF1lbglX28NZ0UlizQQpfziXXOe1dU/s320/IMG_6488.jpg" width="320" /></a></td>
<td><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgg0eyoW-ydndrlTZgcRhxbQXwnirrNSJp4P2t1DuCTNxIt09yAlW66BvW-ZyT2SO4nACLTKOtZ7VUmBYYYTdcQc8pV8rqKdrKYHPhqV841glrDrkVq8OjPvSw_HQ8LXnm504HTfJcaDFo/s1600/Rosetta_Stone.JPG" imageanchor="1" style="font-size: 12.8px; margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgg0eyoW-ydndrlTZgcRhxbQXwnirrNSJp4P2t1DuCTNxIt09yAlW66BvW-ZyT2SO4nACLTKOtZ7VUmBYYYTdcQc8pV8rqKdrKYHPhqV841glrDrkVq8OjPvSw_HQ8LXnm504HTfJcaDFo/s200/Rosetta_Stone.JPG" width="170" /></a></td>
</tr>
<tr>
<td class="tr-caption" colspan="2" style="text-align: center;">The Rosetta Stone<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
</td>
</tr>
</tbody></table>
<div style="text-align: center;">
<br /></div>
<u>Learning the translation model</u><br />
Using sentence pairs from a parallel corpus as a translation table is nice, but not enough. You can always generate a sentence in the source language that didn't occur in the corpus, so it wouldn't be in the table. However, a sentence is composed of phrases (words and multi-word expressions), so instead of constructing a sentence translation table, a phrase translation table could be built, enabling a phrase-by-phrase translation. If the corpus is large enough, you can assume that it covers at least most of the common words and phrases in these languages.<br />
<br />
This is what an excerpt from a phrase table from English to Hebrew might look like:<br />
<br />
<table align="center" border="1"><tbody>
<tr><td><b>source</b></td><td><b>target</b></td><td><b>score</b></td></tr>
<tr><td>day</td><td>ŚŚŚ</td><td>1.0</td></tr>
<tr><td>April</td><td>ŚŚ€ŚšŚŚ</td><td>1.0</td></tr>
<tr><td>bright</td><td>ŚŠŚ</td><td>0.58</td></tr>
<tr><td>bright</td><td>ŚŚŚŚš</td><td>0.42</td></tr>
<tr><td>cold</td><td>Ś§Śš</td><td>0.7</td></tr>
<tr><td>cold</td><td>ŚŠŚŚ Ś</td><td>0.3</td></tr>
<tr><td>thirteen</td><td>Ś©ŚŚŚ© ŚąŚ©ŚšŚ</td><td>0.41</td></tr>
<tr><td>thirteen</td><td>Ś©ŚŚŚ©Ś ŚąŚ©Śš</td><td>0.21</td></tr>
<tr><td>thirteen</td><td>13</td><td>0.38</td></tr>
</tbody></table>
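In code, such a phrase table is essentially a mapping from a source phrase to its scored target candidates. Here is a minimal Python sketch of that structure, using transliterated target phrases and the scores from the table above (the dictionary layout and the helper function are illustrative, not a real system's format):

```python
# A hypothetical in-memory phrase table: source phrase -> list of
# (target phrase, translation probability). Targets are transliterated
# Hebrew; the scores match the example table in the post.
phrase_table = {
    "bright": [("tsakh", 0.58), ("bahir", 0.42)],
    "cold": [("kar", 0.7), ("tsonen", 0.3)],
    "thirteen": [("shlosh esre", 0.41), ("13", 0.38), ("shlosha asar", 0.21)],
}

def translations(phrase):
    """Candidate target phrases for a source phrase, most probable first."""
    return sorted(phrase_table.get(phrase, []), key=lambda pair: -pair[1])

print(translations("cold"))  # most likely candidate listed first
```

A source phrase that never occurred in the corpus simply has no entry, which is exactly why phrase-level (rather than sentence-level) tables are needed.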
<div>
Each entry contains a source language phrase, a target language phrase and the score (probability) of translating the source phrase to the target phrase. These are not trivial to compute, since the corpus is aligned at the sentence level. All we know is that "<i>ŚŚŚ ŚŚ€ŚšŚŚ ŚŠŚ ŚŚŠŚŚ Ś, ŚŚ©ŚąŚŚ ŚŚ ŚŚŠŚŚŠŚŚŚ Ś©ŚŚŚ©-ŚąŚ©ŚšŚ</i>" is a (possible) translation of "<i>It was a bright cold day in April, and the clocks were striking thirteen"</i>, but we don't know which words in English are translated to which words in Hebrew. The assumption is that each word in the source sentence is translated to 0, 1 or more words in the target language. In the simple case, it is translated to one word. In other cases, a word may disappear in translation (for example, the determiner "<i>a</i>" in English doesn't exist in Hebrew) or be translated to a multi-word phrase (e.g. the word "<i>thirteen</i>" is translated to "<i>Ś©ŚŚŚ© ŚąŚ©ŚšŚ</i>").<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiV2siT4nj9TUxYtWQmpie91dM_bfqL_c7YTJjvlrtkxtRUtYrGjKuj_he-pBNQ6C_h3mFumXCU1K2tAj19oR1J-48MKA60EZ9Dgw8LjO8afozIYmvnkMfaZ-11XcC5eDsQtcqvw0NM7II/s1600/alignment.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiV2siT4nj9TUxYtWQmpie91dM_bfqL_c7YTJjvlrtkxtRUtYrGjKuj_he-pBNQ6C_h3mFumXCU1K2tAj19oR1J-48MKA60EZ9Dgw8LjO8afozIYmvnkMfaZ-11XcC5eDsQtcqvw0NM7II/s1600/alignment.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The word-level alignment of a sentence-pair.</td></tr>
</tbody></table>
The solution is, again, to use statistical methods. In particular, aligning these sentence pairs at the word level using the corpus statistics. The most basic alignment model is <b>IBM model 1</b>. It goes over all the sentence pairs in the corpus, and counts for each source word its occurrences in the same sentence pair with target words - since every target word could be its translation. In the example sentences-pair, the Hebrew word <i>ŚŚŚ</i> is counted once with every one of the English words <i>It, was, a, bright, cold, day, in, April, and, the, clocks, were, striking, thirteen.</i> If it appears in another sentence pair, for example, "<i>ŚŚŚŚ ŚŚŚ ŚŚ€Ś</i>" and "<i>what a beautiful day</i>", the word <i>day</i> will have two occurrences with <i>ŚŚŚ. </i>Since this is the true translation, the word <i>day</i> will occur in every sentence pair in which the word <i>ŚŚŚ </i>occurs. These counts are used to estimate the probability of translating the source word to a target word. In some cases, an English word may have several possible translations, such as <i>cold</i> that could be translated both to <i>ŚŠŚŚ Ś</i> and <i>Ś§Śš. </i>In this case, the English word <i>cold</i> will appear in some cases with <i>ŚŠŚŚ Ś</i> and in others with <i>Ś§Śš. </i>The probability will be computed accordingly (and will be higher for the more common translation).<br />
<br />
This is the basic model, and there are other IBM models (2-5) that handle some of the problems that the basic model doesn't solve (e.g. considering the distance between aligned words). This phase's output is a word-to-word table, and then another algorithm is applied to create a phrase table, merging multi-word expressions into one phrase (e.g. "<i>hot dog</i>", which is translated differently from "<i>hot</i>" and "<i>dog</i>"). </div>
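The counting scheme of IBM Model 1 can be sketched as a small expectation-maximization loop. Below is a minimal Python illustration on an invented three-sentence English-German toy corpus; the corpus, the number of iterations, and the uniform initialization are all simplifications:

```python
from collections import defaultdict

# Toy parallel corpus (English -> toy German); a real system would use
# millions of aligned sentence pairs.
corpus = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]

# t[(f, e)] ~ P(target word f | source word e). Starting every pair at 1.0
# acts as a uniform initialization, since the E-step normalizes per sentence.
t = defaultdict(lambda: 1.0)

for _ in range(20):                         # EM iterations
    count = defaultdict(float)              # expected co-occurrence counts
    total = defaultdict(float)
    for es, fs in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)  # normalize over the sentence
            for e in es:
                c = t[(f, e)] / z           # expected count (E-step)
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():         # re-estimate (M-step)
        t[(f, e)] = c / total[e]

# "buch" should now align far more strongly with "book" than with "the",
# because "book" appears in every sentence pair that contains "buch".
print(round(t[("buch", "book")], 2), round(t[("buch", "the")], 2))
```

The ambiguity resolves exactly as described in the post: words that co-occur consistently (like <i>buch</i> and <i>book</i>) accumulate probability mass, while spurious co-occurrences (like <i>buch</i> and <i>the</i>) fade.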
<br />
<u>Putting it all together</u><br />
The decoder is responsible for performing the actual translation: given the source sentence, it constructs a new sentence in the target language, using the translation model to offer phrase translations and their scores, and the language model to rank the fluency of the translation.<br />
<br />
There are multiple ways to segment the source sentence to phrases (e.g., should "<i>hot dog</i>" be regarded as a phrase, or segmented to "<i>hot</i>" and "<i>dog</i>"?), and in most cases there are also multiple ways to translate each phrase in the source language to a phrase in the target language (e.g., should "<i>cold</i>" be translated to "<i>ŚŠŚŚ Ś</i>" or to "<i>Ś§Śš</i>"?). In addition, the phrases in the target language may be re-ordered to follow grammar rules in the target language (e.g. adjective before noun in English, but after noun in many languages such as Hebrew, Romanian and French). The decoder tries many of these segmentations, translations and orders and produces candidate translations.<br />
<br />
Each candidate translation is scored by three components: the language model scores the translation according to its fluency in the target language. The re-ordering model (which we haven't discussed in detail) gives a score based on the changes in the order of words between the two languages. The last score is the one given by the translation model. Each phrase-to-phrase translation score is the probability of translating one phrase to the other, so the translation model's score for the entire sentence is the product of all phrase translation scores. For example, if the source sentence is "It's not cold in April":<br />
<br />
<b>score(<i>ŚŚ Ś§Śš ŚŚŚ€ŚšŚŚ</i>)</b> = TM(<i>ŚŚ</i>, <i>not</i>) · TM(<i>Ś§Śš</i>, <i>cold</i>) · TM(<i>Ś</i>, <i>in</i>) · TM(<i>ŚŚ€ŚšŚŚ</i>, <i>April</i>) · LM(<i>ŚŚ Ś§Śš ŚŚŚ€ŚšŚŚ</i>) · RM(<i>ŚŚ Ś§Śš ŚŚŚ€ŚšŚŚ</i>, <i>It's not cold in April</i>)<br />
<br />
And eventually the decoder would select the candidate translation with the highest score it could find.<br />
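This candidate-scoring step can be sketched in a few lines of Python. Everything below is invented for illustration (the target phrases are transliterated, the probabilities are made up, and the re-ordering score is omitted); a real decoder also searches over segmentations and word orders rather than enumerating a fixed list:

```python
from math import prod

# Hypothetical phrase-translation probabilities TM[(source, target)]
# and toy language-model scores LM[target sentence]. All numbers invented.
TM = {("not", "lo"): 1.0,
      ("cold", "kar"): 0.7, ("cold", "tsonen"): 0.3,
      ("in April", "be'april"): 1.0}
LM = {"lo kar be'april": 0.010,     # fluent phrasing
      "lo tsonen be'april": 0.002}  # grammatical, but less common

def score(candidate):
    """Product of the phrase-translation scores and the LM score."""
    target = " ".join(t for _, t in candidate)
    return prod(TM[pair] for pair in candidate) * LM[target]

# Two candidate translations of "It's not cold in April",
# each a list of (source phrase, target phrase) pairs.
candidates = [
    [("not", "lo"), ("cold", "kar"), ("in April", "be'april")],
    [("not", "lo"), ("cold", "tsonen"), ("in April", "be'april")],
]
best = max(candidates, key=score)
print(" ".join(t for _, t in best))
```

The decoder keeps whichever candidate maximizes this combined score, which is exactly the selection rule described above.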
<div>
<div>
<br /></div>
<div>
<br /></div>
<div>
As always, I'll end the post by hedging: I really haven't presented the entire world of translation, just given you a taste of it. I tried to simplify the basic models I told you about, but they are a bit less simple than I described. Also, there are newer and more accurate models that involve machine learning techniques, or consider the syntax of the source and target sentences. I hope I could convey the basics clearly and interestingly enough :)</div>
</div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com0tag:blogger.com,1999:blog-9145120678290195131.post-20098040950096747992015-09-12T19:26:00.000+03:002015-09-12T20:21:43.085+03:00Language Models<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
</div>
<div dir="ltr" style="text-align: left;">
<span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;">In my previous post about </span><a href="http://veredshwartz.blogspot.co.il/2015/08/machine-translation-1-overview.html" style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">Machine Translation</a>, <span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;">I mentioned how language models are used in </span><span style="background-color: white; font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">statistical machine translation</span><span style="background-color: white; font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">. Language models are used in many NLP applications. In this post, I will explain about language models in general, and how to learn a certain kind of language models: n-gram language models.</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span>
<span style="font-family: 'Trebuchet MS', sans-serif;">A language model is for a specific language, for example, an English language model. It receives as input a sequence of words in English (sentence / phrase / word). For simplicity, let's say it receives a sentence. The language model score for a sentence </span><i style="font-family: 'Trebuchet MS', sans-serif;">s</i><span style="font-family: 'Trebuchet MS', sans-serif;">, P(s), is a score between 0 and 1, that can be interpreted as the</span><span style="background-color: white; font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;"> <a href="http://veredshwartz.blogspot.co.il/2015/08/probability_21.html">probability</a> of composing this sentence in English</span><span style="background-color: white; font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">. This score determines </span><span style="background-color: white; font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">how fluent <i>s</i> is in English; the higher the score, the more
fluent the sentence is. Language models can capture some interesting language </span><span style="background-color: white; font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;">phenomena:</span> </span><br />
<ul style="text-align: left;">
<li><span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;">Which sentence is grammatically correct? - </span><span style="background-color: white; line-height: 18.4799995422363px;">P("he eat pizza") < P("he eats pizza")</span><span style="background-color: white; line-height: 18.4799995422363px;"> </span></span></li>
</ul>
<ul style="text-align: left;">
<li><span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;">Which word order is correct? - </span><span style="background-color: white; line-height: 18.4799995422363px;">P("love I cats") < P("I love cats")</span></span></li>
</ul>
<div>
<span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;">and even some logic and world knowledge:</span></div>
<ul style="text-align: left;">
<li><span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;">What is more likely? - P("good British food") < P("good Italian food")</span></li>
</ul>
<div>
<div dir="ltr">
<span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;"><br /></span>
<span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;">It can also tell you that pdf is the fourth largest religion:</span><br />
<span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;"><br /></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQu3ij6FdLWssBT__mKR2XnCT_wArQpJSCWRlArDLcQ2sa3V2ibrgGpzoikfD7gUuq0SU2YGS2owxIk-866XI962PodJi4-TesTXgtgg2OBjsluz0oniDhPa4dNU-fcn6paInC6Asi-Fk/s1600/5flmIpp.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQu3ij6FdLWssBT__mKR2XnCT_wArQpJSCWRlArDLcQ2sa3V2ibrgGpzoikfD7gUuq0SU2YGS2owxIk-866XI962PodJi4-TesTXgtgg2OBjsluz0oniDhPa4dNU-fcn6paInC6Asi-Fk/s1600/5flmIpp.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td class="tr-caption" style="font-size: 12.8000001907349px;">Google suggests words that are likely to complete the query. From <a href="http://imgur.com/gallery/5flmIpp">here</a>.</td></tr>
</tbody></table>
</td></tr>
</tbody></table>
<br />
<u style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">Learning a language model</u><br />
<span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;">What does it take to build such a language model? Just a large English text corpus (</span><span style="line-height: 18.4799995422363px;">a large and structured set of texts). We are interested in the probability of sentences, words and phrases in the language, but we don't know the <b>real </b></span><span style="line-height: 18.4799995422363px;">distribution of words and sentences in the language. We can use a large-enough corpus to <b>estimate </b>this probability. The basic method is to use relative frequency (<a href="http://veredshwartz.blogspot.co.il/2015/08/probability_21.html">Maximum Likelihood</a>). </span><span style="line-height: 18.4799995422363px;">The probability of a certain word </span><i style="line-height: 18.4799995422363px;">w</i><span style="line-height: 18.4799995422363px;"> to occur in English, p(w) is approximated by the ratio of the occurrences of </span><i style="line-height: 18.4799995422363px;">w</i><span style="line-height: 18.4799995422363px;"> in the corpus (the number of occurrences of </span><i style="line-height: 18.4799995422363px;">w</i><span style="line-height: 18.4799995422363px;"> / the number of any word occurrence). For example: </span></span><br />
<ul style="text-align: left;">
<li><span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;">The word <i>cat</i> occurred 3853 times, out of a total of 100,000,000 words, so its estimated probability is 0.00003853.</span></span></li>
<li><span style="font-family: Trebuchet MS, sans-serif;">The word <i>no</i>, on the other hand, occurs more frequently: 226,985 times. So its probability is 0.00226985, and therefore when you compose a sentence in English, you are much more likely to say the word <i>no</i> than <i>cat</i>.</span></li>
</ul>
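This relative-frequency estimate takes only a few lines of Python. The toy corpus below is made up; real estimates, like those in the bullets above, come from corpora of hundreds of millions of words:

```python
from collections import Counter

# A tiny stand-in corpus; a real model would be estimated from a much
# larger text collection.
corpus = "i love my cat my cat loves me and no one says no to my cat".split()
counts = Counter(corpus)
total = len(corpus)

def p(word):
    """Maximum-likelihood (relative-frequency) estimate of P(word)."""
    return counts[word] / total

print(p("cat"), p("no"))  # 3/16 and 2/16 in this 16-word corpus
```

A word that never occurred gets probability zero under this estimate, a limitation that smoothing techniques address in practice.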
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;">But we are also interested in computing the probability of multi-word expressions, phrases and sentences. Since any of them is simply a sequence of words, we can use the chain rule to compute the probability:</span><sup style="line-height: 18.4799995422363px;"><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#1" name="top1">1</a> </sup></span></div>
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
<div dir="ltr" style="text-align: center;">
<span style="font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;">(1) P(A<sub>1</sub>,A<sub>2</sub>,...,A<sub>m</sub>) = P(A<sub>1</sub>) P(A<sub>2</sub>|A<sub>1</sub>) ... P(A<sub>m</sub>|A<sub>1</sub>,A<sub>2</sub>,...,A<sub>m-1</sub>)</span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br />
</span><br />
<div style="text-align: left;">
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;">where </span></span><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;">P(</span><span style="line-height: 18.4799995422363px;">A</span><sub>i</sub><span style="line-height: 18.4799995422363px;">|</span><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">1</sub><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">2,...,</sub><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">i-1</sub><span style="line-height: 18.4799995422363px;">) denotes the probability that the word </span></span></span></span></span><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;">A</span><sub>i</sub></span></span></span><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"> is the next word in the sequence </span></span></span></span><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">1</sub><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">2,...,</sub><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">i-1</sub></span></span></span><span style="line-height: 18.4799995422363px;"><span style="line-height: 
18.4799995422363px;">.</span><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"> For example, P(</span></span></span><span style="background-color: white; line-height: 18.4799995422363px;">I love my cat) = P(I) P(love|I) P(my |I love) P(cat |I love my). We can assume that the words are independent of each other, and get a much simpler formula: </span></div>
</div>
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
<div dir="ltr" style="text-align: center;">
<span style="font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;">(2) P(A<sub>1,</sub></span><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">2,...,</sub><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">m</sub><span style="line-height: 18.4799995422363px;">) = </span><span style="line-height: 18.4799995422363px;">P(A</span><sub style="line-height: 18.4799995422363px;">1</sub><span style="line-height: 18.4799995422363px;">) P(</span><span style="line-height: 18.4799995422363px;">A</span><sub style="line-height: 18.4799995422363px;">2</sub><span style="line-height: 18.4799995422363px;">) ... P(</span><span style="line-height: 18.4799995422363px;">A</span><sub>m</sub><span style="line-height: 18.4799995422363px;">)</span></span></div>
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;">So whenever you pick an extra word to continue your sentence, you choose it by its distribution in the language and regardless of the previous words. This doesn't make much sense though. The probability of the word <i>cat </i>is</span> lower than that of the word <i>no</i>. However, in the context of the incomplete sentence "I love my", the word <span style="background-color: white; line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><i>cat </i>is</span></span> much more likely to complete the sentence than the word <i>no</i>. </span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br />
<span style="background-color: white; line-height: 18.4799995422363px;">To estimate the conditional probability of a word A<sub>i</sub> (<i>cat</i>) given any number of preceding words A<sub>1</sub>,A<sub>2</sub>,...,A<sub>i-1</sub> (<i>I love my</i>), we need to count the number of occurrences of A<sub>i</sub> after A<sub>1</sub>,A<sub>2</sub>,...,A<sub>i-1</sub> (how many times the sentence "<i>I love my cat</i>" appears in the corpus) and divide it by the number of times that A<sub>1</sub>,A<sub>2</sub>,...,A<sub>i-1</sub> occurred with any following word (how many times "<i>I love my *</i>" appears in the corpus, for any word *). You would expect P(cat|I love my) to be higher than P(no|I love my).</span></span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;">You would also see that the conditional probability P(cat|I love my) is different from the prior probability P(cat). I'm not sure whether it would be higher, but I'm sure that P(cat|Persian) > P(cat): you are more likely to say "cat" if you already said "Persian" than just like that, out of the blue.</span></span></div>
</div>
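The counting scheme above can be sketched in a few lines of Python. This is a toy illustration with a three-sentence corpus and made-up function names, not the code behind the demo below:

```python
from collections import defaultdict

# Toy corpus; a real language model is estimated from a much larger one.
corpus = [
    "i love my cat".split(),
    "i love my dog".split(),
    "i love my cat".split(),
]

# Count each history (the preceding words) and each history + next-word sequence.
history_counts = defaultdict(int)
sequence_counts = defaultdict(int)
for sentence in corpus:
    for i in range(1, len(sentence)):
        history = tuple(sentence[:i])
        history_counts[history] += 1
        sequence_counts[history + (sentence[i],)] += 1

def cond_prob(word, history):
    """P(word | history): count(history followed by word) / count(history followed by anything)."""
    history = tuple(history)
    if history_counts[history] == 0:
        return 0.0
    return sequence_counts[history + (word,)] / history_counts[history]

print(cond_prob("cat", "i love my".split()))  # 0.6666666666666666 ("cat" follows in 2 of 3 cases)
print(cond_prob("no", "i love my".split()))   # 0.0
```

Note that with longer histories most counts are zero, which is exactly the sparsity problem discussed next.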
<span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;"><span style="background-color: white; line-height: 18.4799995422363px;"><br /></span></span>
</span><br />
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;">However, assuming that every word in the sentence depends on all the previous words is not necessary, and it causes a problem of <i>sparsity</i>: there is simply not enough data to estimate the probabilities. To compute the probability of the word <i>cat</i> completing the sentence <i>My friend John once had a black</i>, you would need the sequences "<i>My friend John once had a black</i>" and "<i>My friend John once had a black cat</i>" to actually appear in the corpus. The corpus is big, but it doesn't contain every sentence that anyone has ever said. </span></span></div>
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
<div dir="ltr">
<span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;">What's the solution? The <i>Markov assumption</i>: we can assume that every word depends only on the k preceding words. For example, for k=1 we get:</span></span></span></div>
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
<div dir="ltr" style="text-align: center;">
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;">(3) P(</span></span></span><span style="background-color: white; line-height: 18.4799995422363px;">I love my cat) = P(I) P(love|I) P(my|love) P(cat|my)</span></span></div>
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span></div>
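Equation (3) can be turned into code directly. A minimal sketch with a toy corpus (as in footnote 1, the prior P(w1) is approximated here by the unigram frequency, and bigram probabilities are normalized by unigram counts for simplicity):

```python
from collections import Counter

# Toy corpus; a real model would be estimated from a much larger one.
corpus = [
    "i love my cat".split(),
    "i love my dog".split(),
    "my cat is black".split(),
]

unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter(pair for s in corpus for pair in zip(s, s[1:]))
total_words = sum(unigrams.values())

def bigram_sentence_prob(sentence):
    """P(w1..wn) = P(w1) * product of P(wi | wi-1): the k=1 Markov assumption."""
    words = sentence.split()
    prob = unigrams[words[0]] / total_words
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen word: zero probability without smoothing
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

print(bigram_sentence_prob("i love my cat"))  # P(i) P(love|i) P(my|love) P(cat|my)
```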
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;">This kind of language model is called an <i>n-gram language model</i>, where an <i><span class="st">n-gram</span></i><span class="st"> is a contiguous sequence of n words.<sup style="line-height: 18.4799995422363px;"><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#2" name="top2">2</a> </sup> The model works with n-grams, so the assumption is that every word depends on the preceding (n-1) words. </span>For example, a <i>unigram</i> (n=1) language model considers the words independent of each other (P(I love my cat) = P(I) P(love) P(my) P(cat)). A <i>bigram</i> (n=2) language model assumes that every word depends on the previous word (P(I love my cat) = P(I) P(love|I) P(my|love) P(cat|my)). There are also trigram (n=3) and 4-gram language models; larger <i>n</i>s are less commonly used, to the best of my knowledge.</span></span></div>
<div dir="ltr">
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;"><br /></span>
<u><span style="line-height: 18.4799995422363px;">Smoothing</span></u><span style="line-height: 18.4799995422363px;"><br /></span><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;">While choosing a small <i>n</i> reduces the <i>sparsity</i>, it doesn't solve the problem completely. Some rare words (e.g. </span></span></span><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><i>absorbefacient, </i>yes, it's an actual <a href="http://www.thefreedictionary.com/absorbefacient">word</a>) or n-grams (e.g. <i>blue wine</i>) may never occur in the corpus, but still be valid in the language. If we use a word's relative frequency as its probability, a word that never occurs in the corpus receives zero probability. If a sentence contains such a word (e.g. <i>I went to the pub and ordered a glass of blue wine</i>), its probability will be zero. While we would probably like this sentence to have a very low probability, we wouldn't want it to be zero; we are aware of the fact that our corpus may be missing some valid English words.</span></span></span></span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><br /></span></span></span></span>
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;">Smoothing solves this problem. The simplest smoothing technique "hallucinates" <i>k</i> additional occurrences of every word in the vocabulary. For example, <i>add-1 smoothing</i> would consider the word <i>absorbefacient</i> to have occurred once (if it hasn't occurred at all in the corpus), and the word <i>cat</i> to have occurred 3854 times (when it actually occurred 3853 times). The new probability is:</span></span><br />
<span style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;"><br /></span>
<br />
<div style="text-align: center;">
<span style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">(4) P(cat) = (3853 + 1) / (100,000,000 + V)</span></div>
<span style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;"><span style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;"><br /></span>
<span style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">where V is the size of the vocabulary (the number of words we "added" to the corpus).</span><br />
The same applies to n-grams. With this new formula, the probability of unseen words (and n-grams) is small, but never zero.</span><br />
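Formula (4) in code, using the post's counts for <i>cat</i> and an assumed vocabulary size V (the 50,000 below is made up for illustration):

```python
from collections import Counter

# The post's example numbers: "cat" occurs 3853 times in a corpus of
# 100,000,000 words. The vocabulary size V below is an assumed value.
counts = Counter({"cat": 3853})
total_words = 100_000_000
V = 50_000  # illustrative vocabulary size

def add_one_prob(word):
    """Add-1 smoothed unigram probability: (count + 1) / (total + V)."""
    return (counts[word] + 1) / (total_words + V)

print(add_one_prob("cat"))             # 3854 / 100,050,000
print(add_one_prob("absorbefacient"))  # unseen, but non-zero: 1 / 100,050,000
```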
<span style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;"><br /></span>
<span style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;">And as always, there are more complex smoothing techniques (<span style="font-family: 'Trebuchet MS', sans-serif;"><a href="https://en.wikipedia.org/wiki/Katz%27s_back-off_model">Back-</a><span style="font-family: 'Trebuchet MS', sans-serif;"><a href="https://en.wikipedia.org/wiki/Katz%27s_back-off_model">off</a>, </span></span><a href="https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing">Kneser<span style="font-family: 'Trebuchet MS', sans-serif;">-</span>Ne</a><span style="font-family: 'Trebuchet MS', sans-serif;"><a href="https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing">y</a><span style="font-family: 'Trebuchet MS', sans-serif;"><span style="font-family: 'Trebuchet MS', sans-serif;">, </span></span></span>etc.), that I will not discuss in this post.</span><br />
<br />
<hr />
<span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;">Do you want to try it yourself? I implemented a simple language model for this post.<sup style="line-height: 18.4799995422363px;"><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#3" name="top3">3</a> </sup>Type a sentence, hit the button and you'll get the probability of the sentence (after a while...). Try it!</span><br />
<span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;"><br /></span><iframe height="100" src="https://u.cs.biu.ac.il/~havivv/LanguageModel/LM.html" width="100%"></iframe><span style="background-color: white; font-family: Trebuchet MS, sans-serif; line-height: 18.4799995422363px;"><br /></span><br />
<div>
<br />
<hr />
<br />
<u><span style="font-family: Trebuchet MS, sans-serif;"></span></u>
<u><span style="font-family: Trebuchet MS, sans-serif;">What can you do with language models?</span></u><br />
<span style="font-family: Trebuchet MS, sans-serif;">As the demo shows, you can compute the probability of a sentence in a certain language.</span><br />
<span style="font-family: Trebuchet MS, sans-serif;">As I explained in the <a href="http://veredshwartz.blogspot.co.il/2015/08/machine-translation-1-overview.html">previous post</a>, statistical machine translation systems use a language model of the target language to prefer translations that are more fluent in the target language.</span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br />In the other direction, language models can be used to </span><span style="font-family: Trebuchet MS, sans-serif;">generate the next word in a sequence of words, by sampling from the distribution of words (given the previous word). It can complete your search query or suggest corrections to your text messages. One of the funnest things is to generate a whole new sentence. To illustrate, I used my language model and generated the not-very-sensible sentences</span><sup style="font-family: 'Trebuchet MS', sans-serif; line-height: 18.4799995422363px;"><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#4" name="top4">4</a></sup><span style="font-family: 'Trebuchet MS', sans-serif;">:</span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<br />
<table border="1"><tbody>
<tr><td><i>Who indulge yourself.</i></td></tr>
<tr><td><i>The american opinion have attachments of miners from hard lump sum of his search far as to ? and considerable number of the cavity.</i></td></tr>
<tr><td><i>As in Massachusetts.</i></td></tr>
<tr><td><i>To start a pretext for which the orbit and rostov smiled at mrs. rucastle seemed to give the society it must be the European settlement?'' said he was all men were to himself to your monstrosity such things he had already passed in the contrary to which alone , beyond the tarsal bones is she 's eyes were drawn up because he wanted.</i></td></tr>
<tr><td><i>It was about 87,000 soldiers.</i></td></tr>
</tbody></table>
<div style="text-align: center;">
<span style="font-family: Trebuchet MS, sans-serif; font-size: x-small;">sentences generated with a bigram language model</span></div>
<span style="font-family: Trebuchet MS, sans-serif;"><br /></span>
<span style="font-family: Trebuchet MS, sans-serif;">Better sentences could be generated with larger <i>n</i> or with smarter (not n-gram) language models. Anyway, generating sentences can fill up hours of fun.</span><br />
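The generation procedure described in footnote 4 (start from the special sign and repeatedly sample the next word given the previous one, until a period is sampled) can be sketched like this, with a toy corpus and illustrative names:

```python
import random
from collections import defaultdict

# Toy corpus with an explicit end-of-sentence period; the sentences in the
# table above were generated from a much larger corpus.
corpus = [
    "i love my cat .".split(),
    "i love my dog .".split(),
    "my cat is black .".split(),
]

# next_words[w] lists every word observed right after w (with repetition,
# so uniform sampling from the list follows the bigram distribution).
next_words = defaultdict(list)
for sentence in corpus:
    for prev, cur in zip(["<S>"] + sentence, sentence):
        next_words[prev].append(cur)

def generate(max_len=20):
    """Start from <S> and sample the next word given the previous one,
    until a period is sampled (the procedure from footnote 4)."""
    words, prev = [], "<S>"
    while len(words) < max_len:
        cur = random.choice(next_words[prev])
        words.append(cur)
        if cur == ".":
            break
        prev = cur
    return " ".join(words)

print(generate())  # e.g. "i love my cat ."
```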
<br />
<span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;">Choosing the corpus from which you learn the language model greatly affects the final outcome. Needless to say, choosing the language of the corpus is crucial. If you want a French language model, you need a French corpus, etc. Furthermore, if you base your language model on Shakespeare's writings, and then try to use it to estimate the probabilities of recent posts by your Facebook friends, those posts will probably come out very unlikely. If your corpus is from the medical domain, sentences with medical terms will have a higher probability than those discussing rock bands. So you must choose your corpus carefully according to your needs. For purposes such as machine translation, the corpus should be general and contain texts from diverse domains. However, if you develop a machine translation system for a specific application, e.g. a medical application, you may want your corpus to contain relevant documents, for instance medical documents.</span></span><br />
<br />
<span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;">Having trained your language model on a very specific corpus, e.g. a corpus of recipes or a corpus of all the songs by The Smiths, you can go ahead and generate a whole new sequence of words. If your language model is good enough, you might get a brand new recipe (dare to try it?) or a song by The Smiths that Morrissey has never heard about.<sup style="line-height: 18.4799995422363px;"><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#5" name="top5">5</a> </sup> In fact, language models don't have to be trained on a text corpus at all. You can train them on musical notes and compose a new melody. <a href="http://people.csail.mit.edu/yalesong/6.863-Music.Improviser/">Here</a> are some examples of melodies generated by an n-gram language model trained on musical notes.<sup style="line-height: 18.4799995422363px;"><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#6" name="top6">6</a> </sup></span></span></div>
</div>
<hr />
<div dir="ltr" style="text-align: left;">
</div>
<div dir="ltr" style="text-align: left;">
<span style="font-family: Trebuchet MS, sans-serif;"><span style="background-color: white; line-height: 18.4799995422363px;"><span style="background-color: white; line-height: 18.4799995422363px;"></span></span>
</span></div>
<div dir="ltr" style="text-align: left;">
<span style="font-family: Trebuchet MS, sans-serif;">
<br />
</span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;">And just so you won't think that n-gram language models are the state of the art: there are other language models, some of which perform much better. Maybe I'll mention some of them in other posts.</span></span><br />
<span style="font-family: Trebuchet MS, sans-serif;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><span style="line-height: 18.4799995422363px;"><br /></span></span></span></span>
</div>
<div dir="ltr" style="text-align: left;">
<hr />
<span class="Apple-style-span" style="font-size: x-small;"><span style="font-family: Trebuchet MS, sans-serif;">
<a href="https://www.blogger.com/null" name="1"><b>1</b></a> To be more accurate, P(A<sub>1</sub>) represents the prior probability of A<sub>1</sub> (the probability of the word A<sub>1</sub> occurring in English), while we are interested in the conditional probability of A<sub>1</sub> given the beginning of a sentence. Therefore, the beginning of each sentence in the corpus is marked with a special sign <S>, and P(A<sub>1</sub>) is replaced by P(A<sub>1</sub>|<S>). This was omitted from the rest of the formulas for simplicity. <a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#top1"><sup>↩</sup></a></span></span><br />
<br />
<span class="Apple-style-span" style="font-size: x-small;"><span style="font-family: Trebuchet MS, sans-serif;"><a href="https://www.blogger.com/null" name="2"><b>2</b></a> A good source for n-gram counts is <a href="https://books.google.com/ngrams">Google Ngrams</a>, extracted from Google Books. <sup><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#top2">↩</a></sup></span></span><br />
<br />
<span class="Apple-style-span" style="font-size: x-small;"><span style="font-family: Trebuchet MS, sans-serif;"><a href="https://www.blogger.com/null" name="3"><b>3</b></a> The language model in the demo is a bigram language model with add-1 smoothing. I trained it using the corpus <a href="http://norvig.com/big.txt">big.txt</a> from Peter Norvig's website. <a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#top3"><sup>↩</sup></a></span></span><br />
<br />
<a href="https://www.blogger.com/null" name="4" style="font-family: 'Trebuchet MS', sans-serif; font-size: x-small;"><b>4</b></a><span style="font-family: 'Trebuchet MS', sans-serif; font-size: x-small;"> I started with the special sign <S> and sampled the next word from the distribution given the previous word, until a period was sampled. </span><sup style="font-family: 'Trebuchet MS', sans-serif;"><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#top4">↩</a></sup>
<br />
<br />
<span class="Apple-style-span" style="font-size: x-small;"><span style="font-family: Trebuchet MS, sans-serif;"><a href="https://www.blogger.com/null" name="5"><b>5</b></a> In fact, I tried it, but it didn't work well because the corpus was too small. The Smiths were only active for 5 years and they don't have enough songs. <sup><a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#top5">↩</a></sup></span></span><br />
<span class="Apple-style-span" style="font-size: x-small;"><span style="font-family: Trebuchet MS, sans-serif;"><br /></span></span>
<span class="Apple-style-span" style="font-size: x-small;"><span style="font-family: Trebuchet MS, sans-serif;">
<a href="https://www.blogger.com/null" name="6"><b>6</b></a> These examples are taken from <a href="http://www.mit.edu/~6.863/fall2012/projects/writeups/musicngram.pdf">Implementing A Music Improviser Using N-Gram Models</a> by Kristen Felch and Yale Song. They were not the first to implement a musical n-gram model (I found a previous <a href="http://www.academia.edu/3499500/Evolving_Musical_Sequences_with_N-Gram_Based_Trainable_Fitness_Functions">work</a>, and I'm sure there are others), but they published some sample songs that are pretty good. <a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html#top6"><sup>↩</sup></a></span></span><br />
<br /></div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com3tag:blogger.com,1999:blog-9145120678290195131.post-40064814802061875132015-08-31T15:56:00.000+03:002015-09-29T18:20:31.156+03:00Machine Translation Overview<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
<div>
Imagine you are at a restaurant in a foreign country, and by trying to avoid tourist traps, you found yourself at a nice local restaurant in a quiet neighborhood, no tourists except for you. The only problem is that the menu is in a foreign language... no English menu. What's the problem, actually? Pick your favorite machine translation system (<a href="https://translate.google.com/#">Google Translate</a>, <a href="http://www.bing.com/translator/">Bing Translator</a>, <a href="https://www.babelfish.com/">BabelFish</a>, etc.) and translate the menu to a language you understand!</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
So, there's no need to elaborate on the motivation for translation. What I would like to do is give you an overview of how this magic works, and some idea of why it doesn't always work as well as you would expect.</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIRsdM2qemRBjRK3blGgHxUVYS1lhFisZW_zvKZekCPj62n919Na0Ln9-9CfIOBYQ6GA2Ab7EZOAw-nFWYvyKmeTGJ_jt5uJbM0lXAjKcaRs-d4ECmEGO4_cC25smU1Acmt2ZNNRzdKMg/s1600/Tower+of+Babel.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIRsdM2qemRBjRK3blGgHxUVYS1lhFisZW_zvKZekCPj62n919Na0Ln9-9CfIOBYQ6GA2Ab7EZOAw-nFWYvyKmeTGJ_jt5uJbM0lXAjKcaRs-d4ECmEGO4_cC25smU1Acmt2ZNNRzdKMg/s1600/Tower+of+Babel.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><div style="direction: ltr;">
<a href="https://en.wikipedia.org/wiki/The_Tower_of_Babel">The Tower of Babel</a>, by Pieter Bruegel the Elder. Oil on board, 1563.</div>
</td></tr>
</tbody></table>
<div dir="ltr" style="text-align: left;">
I'm going to focus on statistical machine translation. <i>Translation </i>means taking a sentence in one language (the source language) and producing a sensible sentence in another language (the target language) that has the same meaning. <i>Machine </i>means that it's done by software rather than a human translator. What does <i>statistical </i>mean?</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
It means that rather than coding expert knowledge into software and creating a lexicon and grammatical rules for translation between two specific languages, these systems are based on statistics on texts in the source and target languages. This is what makes it possible to produce translation between any source and target languages without additional effort, and without having to hire someone that actually speaks these languages. The only thing you need is a large amount of text in both languages.<br />
<br /></div>
<div dir="ltr" style="text-align: left;">
<div>
<span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><u>Statistical Machine Translation</u></span></div>
<div>
What makes a translation good?</div>
<div>
<ul style="text-align: left;">
<li>It is as similar as possible in meaning to the original sentence in the source language.</li>
<li>It sounds correct in the target language, e.g., grammatically.</li>
</ul>
</div>
The first demands that the translation is <i>adequate</i> and the second that it is <i>fluent</i>.<br />
<br />
SMT systems have a component for each of these requirements. The <i>translation model</i> makes sure that the translation is <i>adequate</i> and the <i>language model</i> is responsible for the <i>fluency </i>of the translation in the target language.<br />
<br />
<div dir="ltr">
<u>Language Model</u></div>
<div dir="ltr" style="-webkit-text-stroke-width: 0px; font-style: normal; font-variant: normal; letter-spacing: normal; orphans: auto; text-align: left; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;">
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal; margin: 0px;">
<span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; font-weight: normal; line-height: 18.4799995422363px;">I mentioned language models in my <a href="http://veredshwartz.blogspot.co.il/2015/07/natural-language-processing.html">NLP overview post</a>. They are used for various NLP applications. A language model (of a specific language, say English) receives as input a sentence in English and returns the</span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; font-weight: normal; line-height: 18.4799995422363px;"> <a href="http://veredshwartz.blogspot.co.il/2015/08/probability_21.html">probability</a> of composing this sentence in the language</span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; font-weight: normal; line-height: 18.4799995422363px;">. This is a score between 0 and 1, </span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; font-weight: normal; line-height: 18.4799995422363px;">determining how fluent a sentence is in English - the higher the score, the more fluent the sentence is. 
Language models (LM) can capture grammatical rules </span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">(LM("she eat pizza") < LM("she <b>eats </b>pizza"))</span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; font-weight: normal; line-height: 18.4799995422363px;">, correct word order </span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; font-weight: normal; line-height: 18.4799995422363px;">(LM("love I cats") < LM("I love cats"))</span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; font-weight: normal; line-height: 18.4799995422363px;">, better word choice (</span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">LM("<b>powerful </b>coffee") < LM("<b>strong </b>coffee"))</span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">,</span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"> and even some logic and world knowledge (LM("good </span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><b>British </b></span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; 
font-size: 13.1999998092651px; line-height: 18.4799995422363px;">food") < LM("good </span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><b>Italian </b></span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">food")). </span><br />
<span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">These models are obtained from a large corpus (structured set of texts) in the target language (e.g. English). <strike>In the next post I will elaborate on how this is done</strike> (edit 12/09/15: you can read in the <a href="http://veredshwartz.blogspot.co.il/2015/09/language-models.html">next post</a> how this is done).</span></div>
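To make the fluency comparisons concrete, here is a toy bigram language model in Python in which the corpus attests "she eats pizza" but not "she eat pizza". The corpus and names are illustrative, and the scores are only meaningful relative to each other:

```python
from collections import Counter

# Tiny corpus in which "she eats pizza" is attested but "she eat pizza"
# is not; a real language model would be trained on a large corpus.
corpus = ["she eats pizza", "he eats pasta", "she eats pasta"]

unigrams = Counter(w for s in corpus for w in s.split())
bigrams = Counter(p for s in corpus for p in zip(s.split(), s.split()[1:]))
total = sum(unigrams.values())

def lm_score(sentence):
    """Bigram LM score between 0 and 1: higher means more fluent
    under the corpus statistics (no smoothing, for brevity)."""
    words = sentence.split()
    score = unigrams[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0
        score *= bigrams[(prev, cur)] / unigrams[prev]
    return score

print(lm_score("she eat pizza") < lm_score("she eats pizza"))  # True
```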
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; font-weight: normal; line-height: normal; margin: 0px;">
<span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></div>
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; font-weight: normal; line-height: normal; margin: 0px;">
<span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><u>Translation Model</u></span></div>
<div style="margin: 0px;">
<div style="font-weight: normal;">
<span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">A translation model (from the source language to the target language) receives as input a pair of sentences / words / phrases, one in each language, and returns the probability of translating the first to the second. Like the language model, it gives a score between 0 and 1 that measures how adequate the translation is - the higher the score, the more adequate the translation.</span></div>
<span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">These models are learned from parallel corpora ("corpora" being the plural of "corpus") - pairs of corpora that contain the same texts, one in the source language and one in the target language. <strike>I will elaborate on how this is done in another post</strike> (edit 29/09/15: you can read in <a href="http://veredshwartz.blogspot.co.il/2015/09/translation-models.html">this post</a> how it is done).</span></div>
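In code, you can picture a translation model's phrase table as a nested mapping from source phrases to scored target phrases. A minimal sketch - every phrase and score here is invented for illustration (and the targets are in English rather than Hebrew, for readability); a real table is estimated from parallel corpora and holds millions of entries:

```python
# A toy phrase table, in the spirit of what a translation model stores.
# All phrases and scores are made up for illustration.
phrase_table = {
    "machine translation": {"automatic translation": 0.7, "machine rendering": 0.3},
    "piece of cake":       {"something very easy": 0.6, "slice of cake": 0.4},
}

def tm_score(source_phrase, target_phrase):
    """Translation model score: probability of translating
    source_phrase into target_phrase (0.0 if the pair is unseen)."""
    return phrase_table.get(source_phrase, {}).get(target_phrase, 0.0)

print(tm_score("piece of cake", "slice of cake"))       # 0.4
print(tm_score("piece of cake", "piece of furniture"))  # 0.0
```

Note that for each source phrase the scores of its translations sum to 1, as befits a probability distribution.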
<div style="font-weight: normal; margin: 0px;">
<div style="color: black; font-family: 'Times New Roman'; font-size: medium; line-height: normal;">
<span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></div>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="background-color: white; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><u>Translating</u></span></span></div>
<div style="margin: 0px;">
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif; font-weight: normal;"><span style="background-color: white; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">Given these two components, the language model and the translation model, how does the translation work? </span></span><span style="background-color: white; color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">The translation model provides a table with words and phrases in the source language and their possible translations to the target language, each with a score. </span><span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif; font-weight: normal;"><span style="background-color: white; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">Given a sentence in the source language, the system uses this table to translate phrases from the source sentence to phrases in the target language. </span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif; font-weight: normal;"><span style="background-color: white; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">There are multiple possible translations for the source sentence; first, because the source sentence can be segmented into phrases in multiple ways. For example, take the sentence </span></span><i style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">Machine translation is a piece of cake. </i><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">The most intuitive thing to do would be to split it into words. This yields a very literal translation (in Hebrew: ŚȘŚšŚŚŚ ŚŚŚŚ Ś ŚŚŚ ŚŚȘŚŚŚ Ś©Ś ŚąŚŚŚ), which doesn't make much sense. The translation table probably also has an entry for the phrase <i>piece of cake</i>, translating it to a word or an idiom with the same meaning in the target language (in Hebrew: Ś§ŚŚ Ś§ŚŚŚȘ. Ask Google).</span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">Second, even for a given segmentation of the source sentence, some phrases have multiple translations in the target language. This happens both because a word in the source language can be </span></span><i>polysemous </i><span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">(i.e. have more than one meaning, e.g. <a href="https://www.google.co.il/search?q=piece+definition">piece</a>), and because one word in the source language can have many synonyms in the target language (e.g. <a href="https://translate.google.com/#en/iw/cake">cake</a>).</span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">The translation system chooses how to segment the source sentence and how to translate each of its phrases to the target language, using the scores that the two models give the translation. It multiplies the translation score for each phrase, and the language model score for the entire target sentence, f</span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">or example:</span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px; text-align: center;"><br /></span>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px; text-align: center;">P(ŚȘŚšŚŚŚ ŚŚŚŚ Ś ŚŚŚ ŚŚȘŚŚŚȘ ŚąŚŚŚ|</span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px; text-align: center;">Machine translation is a piece of cake) = TM(ŚȘŚšŚŚŚ,translation) TM(ŚŚŚŚ Ś,machine) TM(ŚŚŚ,is) TM(ŚŚȘŚŚŚȘ,piece) TM(ŚąŚŚŚ,cake) LM(ŚȘŚšŚŚŚ ŚŚŚŚ Ś ŚŚŚ ŚŚȘŚŚŚȘ ŚąŚŚŚ)</span><br />
<div style="text-align: center;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></div>
<div style="text-align: left;">
<div style="text-align: left;">
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">This score could be understood as the conditional probability of translating </span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;"><i>Machine translation is a piece of cake</i> to </span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">ŚȘŚšŚŚŚ ŚŚŚŚ Ś ŚŚŚ ŚŚȘŚŚŚȘ ŚąŚŚŚ, but </span><span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">I'll spare you the formulas. The intuition behind multiplying the scores of the different translation components is the joint probability of independent events: if we assume the phrase translations are independent of each other, the probability that they all occur together is the product of their individual probabilities.</span></span></div>
<div style="text-align: left;">
<br /></div>
</div>
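The scoring above can be sketched in a few lines of Python. The two candidate translations and all the numbers are invented for illustration; the point is only the arithmetic - multiply the phrase-level translation scores with the language model score of the whole target sentence, and keep the candidate with the highest product:

```python
from math import prod

# Each candidate: (target sentence label, per-phrase TM scores, LM score).
# All numbers are invented for illustration.
candidates = [
    # Word-by-word segmentation: decent phrase scores, but the LM
    # dislikes the disfluent literal output.
    ("literal word-by-word translation", [0.9, 0.8, 0.9, 0.7, 0.6, 0.5, 0.6], 0.05),
    # Segmentation that keeps "piece of cake" as one phrase: the
    # idiomatic output is fluent, so the LM score is much higher.
    ("idiomatic translation", [0.9, 0.8, 0.9, 0.4], 0.40),
]

def total_score(tm_scores, lm_score):
    # Product of the phrase translation scores times the LM score of
    # the whole target sentence, as in the formula above.
    return prod(tm_scores) * lm_score

best = max(candidates, key=lambda c: total_score(c[1], c[2]))
print(best[0])  # the idiomatic candidate wins despite a lower TM product
```

In this toy setting the language model is what tips the balance toward the fluent, idiomatic output - exactly its job in a real SMT system.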
<div style="text-align: left;">
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">Some things to note: the word <i>of</i> disappeared from the translation, and the words <i>machine</i> and <i>translation</i> switched places in the target sentence. These things happen and are allowed. Machine translation is a bit more complex than what I've told you. Just a bit :)</span></span></div>
<div style="text-align: left;">
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span></div>
<div style="text-align: left;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">So each possible translation receives a final score, indicating both how adequate the translation is and how fluent it is in the target language, and the system chooses the translation with the highest score. Ironically, Google gets this one wrong.</span></div>
<div style="text-align: left;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8_KC5SI47gqoId-h0e3bOIqpAj8O_AkjHSX1Ylk6WiAgzhA1lwJy0Oam9_fuZVQN1P6fa4AxdgBwM10jPtuKlUWCQOosFYEORAKtP3bBXDX6je_-4P2UNF1O4wmuf2IYW_J7viqkvhRg/s1600/piece+of+cake.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="74" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8_KC5SI47gqoId-h0e3bOIqpAj8O_AkjHSX1Ylk6WiAgzhA1lwJy0Oam9_fuZVQN1P6fa4AxdgBwM10jPtuKlUWCQOosFYEORAKtP3bBXDX6je_-4P2UNF1O4wmuf2IYW_J7viqkvhRg/s320/piece+of+cake.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Google ironically translates "Machine translation is a piece of cake" incorrectly.</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b>Why is it a really bad idea to rely on machine translation when you wish to speak / write in a language that you don't speak?</b></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Because you may say things that you don't mean. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<div style="text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/akbflkF_1zY/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/akbflkF_1zY?feature=player_embedded" width="320"></iframe></div>
<br />
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
I'll give some examples of problems in translation.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
</div>
</div>
</div>
<div>
<div dir="ltr" style="text-align: left;">
<u style="text-align: left;">Ambiguity</u><span style="text-align: left;"> </span><span style="text-align: left;">- as you probably remember, this problem keeps coming back in every NLP task. In translation, the problem is that a </span><i style="text-align: left;">polysemous </i><span style="text-align: left;">word in the source language may be translated to different words in the target language, depending on its sense. For example, <i>wood </i>can be translated into Hebrew as ŚąŚ„ (a piece of a tree) or as ŚŚąŚš (a geographical area with many trees). While a human translator can pick the correct translation according to the context, machines find this more difficult.</span></div>
<div dir="ltr" style="text-align: left;">
<br /></div>
<div dir="ltr" style="text-align: left;">
It gets even worse when you use a polysemous word in its less common meaning. A few months ago, I needed to send an email to the PC (program committee) chairs of the conference where I published my paper. I noticed something funny about my email, and had to check how Google Translate would handle it. My email started with "Dear PC chairs". I translated it to Hebrew (and back to English, for the non-Hebrew speakers in the audience):</div>
<div dir="ltr" style="text-align: left;">
<br style="text-align: left;" />
<span style="text-align: left;">Dear PC chairs => ŚŚĄŚŚŚȘ ŚŚŚ©Ś ŚŚ§ŚšŚŚ => </span><span class="short_text" id="result_box" lang="en" style="text-align: left;" tabindex="-1"><span class="hps">expensive computer chairs</span></span><br />
<span class="short_text" lang="en" style="text-align: left;" tabindex="-1"><span class="hps"><br /></span></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8bLj-i_YNyQ7-WrRWEC1AHD6F8u7r7cnokqIzDSyL0q1MlxSeD-6LLhNT1XBNMf_TqcRTU1sYGQtBPXG32BlvA90WHKOdceYOaeazYM6N3zkRVoqBp4skVAwVFc3vnPGNQsfw293BFvM/s1600/dear+pc+chairs.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="76" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8bLj-i_YNyQ7-WrRWEC1AHD6F8u7r7cnokqIzDSyL0q1MlxSeD-6LLhNT1XBNMf_TqcRTU1sYGQtBPXG32BlvA90WHKOdceYOaeazYM6N3zkRVoqBp4skVAwVFc3vnPGNQsfw293BFvM/s320/dear+pc+chairs.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Don't expect SMT systems to always understand what you mean</td></tr>
</tbody></table>
<br />
So what happened here? The word <i>chair</i> has two meanings; I meant the less common one, <i>chairman</i>, while Google translated it to the more common sense (<i>furniture</i>). Acronyms are much worse when it comes to polysemy, and PC refers, almost 100% of the time, to <i>Personal Computer</i>. On top of that, the adjective <i>dear</i> is translated in Hebrew to ŚŚ§Śš, which means both <i>dear</i> and <i>expensive</i>. Google chose the wrong sense, creating a funny translation. However, given what we know about how SMT systems work, it's understandable that selecting the more common senses of words yields better scores for both the language model and the translation model. I can't blame Google for this translation.<br />
<br />
This is just one example of a problem in machine translation. There are many other problems: different languages have different word order (e.g. the adjective comes before the noun in English, but after the noun in Hebrew, French and many other languages); in some languages nouns have gender while in others they don't; and idioms are really tough for SMT systems - sometimes they are translated literally, like the <i>piece of cake</i> example (when it was part of a sentence).<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0RMz6zyu_yRjwT7HO5_ztR662AiRkJcy_UYmYbhyDxyAHUnieDQA4VrpmzfZirmLhRsYRGYu1G9OTIOhVITItRmLPyrP2fq3T1DGLANijhTminf85KEpeWsDkftpGHSAKU9zg6-4q1bg/s1600/out+of+sight.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="91" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0RMz6zyu_yRjwT7HO5_ztR662AiRkJcy_UYmYbhyDxyAHUnieDQA4VrpmzfZirmLhRsYRGYu1G9OTIOhVITItRmLPyrP2fq3T1DGLANijhTminf85KEpeWsDkftpGHSAKU9zg6-4q1bg/s320/out+of+sight.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A good translation for an idiom.</td></tr>
</tbody></table>
</div>
<div dir="ltr" style="text-align: left;">
<span style="text-align: left;">These problems are handled by more complex machine translation systems that enable word re-ordering and translation at the syntactic level. Nevertheless, as you probably notice from time to time, this task is still far from being performed perfectly.</span></div>
<div dir="ltr" style="text-align: left;">
<hr />
<b><br /></b>
Since machine translation systems are not very accurate, it is very funny to translate a sentence to a random foreign language and back to English several times, and see how you often end up with a totally different sentence (sometimes a meaningless one) at the end of this process. This is what <a href="http://ackuna.com/badtranslator">Bad translator</a> does. I tried it several times, and it was very amusing. Their example from the Ten Commandments inspired me to try other commandments, resulting in very funny bad translations:<br />
<br />
<span style="color: #274e13;">Thou shalt not make unto thee any graven image </span>=><span style="color: #274e13;"> </span><span style="color: red;">You can move the portrait</span></div>
<div dir="ltr" style="text-align: left;">
<div>
<span style="color: #274e13;">Thou shalt not kill </span>=><span style="color: #274e13;"> </span><span style="color: red;">You must remove.</span><br />
<div>
<span style="color: #274e13;">Thou shalt not commit adultery </span>=><span style="color: #274e13;"> </span><span style="color: red;">Because you're here, try three</span></div>
<div>
<span style="color: #274e13;">Thou shalt not steal </span>=><span style="color: #274e13;"> </span><span style="color: red;">woman</span></div>
<br />
And some good ones:<br />
<br />
<span style="color: #274e13;">Remember the sabbath day, to keep it holy</span> =><span style="color: #274e13;"> </span><span style="color: #0b5394;">Don't forget to consider Saturday.</span></div>
<div>
<span style="color: #274e13;">Honour thy father and thy mother </span>=><span style="color: #274e13;"> </span><span style="color: #0b5394;">honor your father and mother</span></div>
<div>
<br /></div>
<div>
You are welcome to try it and post your funny bad translations in the comments!</div>
</div>
<div dir="ltr" style="text-align: left;">
<br /></div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com4tag:blogger.com,1999:blog-9145120678290195131.post-2131191192708466192015-08-21T15:43:00.001+03:002015-08-21T17:46:27.273+03:00Probability<div dir="rtl" style="text-align: right;" trbidi="on">
<div style="text-align: left;">
<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
<div style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">
How likely are you to read this through? If your answer is a <span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">numerical </span>value between 0 and 1, you may skip this post. You already know the material.<br />
<br /></div>
<div style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">
Why am I writing about probability? First of all, because I really LOVE probability. If you don't, I hope that by the end of this post you will like it a little bit more. Second, we use probability in everyday life: when we plan an outdoor activity, we estimate the probability of rain. When we make life decisions, we think of the probable consequences, since we can't tell the future... Unfortunately, most people use probability incorrectly, without a basic understanding of it.</div>
<div style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">
And last, I was about to write a post about Machine Translation, and realized that I can't explain anything without first introducing probability. Probability is widely used in NLP, as in many other computer science fields.</div>
<div style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">
<br /></div>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">Probability is a numerical value between 0 and 1 measuring the likeliness that an event will occur. 0 means that the event will not occur, and 1 means that the event will certainly occur. An intuitive example is tossing a coin; there are two possible <i>outcomes</i>: "heads" and "tails". If the coin is fair, the probability (chance) of each outcome is œ (50%): P(heads) = P(tails) = œ.</span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcK-_OvZeLdpdIK2dbG3TqFlLSFm1X__BJMdsWMp3PsYwPpJHWYnx4JmlMeNOlQ4WyC1RAkjPGVObh8ctjRQ5cZ92n5Ja0m7soqhwBF5GYR69n4EfF64Z0KisgYBPnPT2ZqG01IMkpB9I/s1600/coin.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcK-_OvZeLdpdIK2dbG3TqFlLSFm1X__BJMdsWMp3PsYwPpJHWYnx4JmlMeNOlQ4WyC1RAkjPGVObh8ctjRQ5cZ92n5Ja0m7soqhwBF5GYR69n4EfF64Z0KisgYBPnPT2ZqG01IMkpB9I/s1600/coin.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A fair coin and the two outcomes of its tossing.</td></tr>
</tbody></table>
<br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">In general, when conducting an experiment, such as tossing a coin, there are several possible outcomes (e.g. { heads, tails }). Every outcome's event (e.g. "the coin landed on heads") has a probability between 0 and 1. Since every experiment must have an outcome, the probability that any of the possible outcomes occurred is 1. In this example, P("heads or tails") = 1. This event represents the entire <i>"probability space"</i>.</span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">As you can see from the example above, an event can be composed of several outcomes. Think about another experiment: rolling a die. The possible outcomes are: {1, 2, 3, 4, 5, 6}. The event "the outcome is an odd number" is composed of the outcomes 1, 3, and 5. We can write it as A={1,3,5}.</span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBx3Ux59_k6Iy5KumuJ4aQ8I_NG7Yzt5fZ1lm_lA8TdJbdQV5Ok2lyezt2FtlPpkiZOu9pM0_yLw_imL2ekXaH5Kj9ZFccrRKHMg4yCEGvtmlcbAjv4gmPxN19rp6ENxKbzNV_Ld1FzYc/s1600/die.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="128" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBx3Ux59_k6Iy5KumuJ4aQ8I_NG7Yzt5fZ1lm_lA8TdJbdQV5Ok2lyezt2FtlPpkiZOu9pM0_yLw_imL2ekXaH5Kj9ZFccrRKHMg4yCEGvtmlcbAjv4gmPxN19rp6ENxKbzNV_Ld1FzYc/s200/die.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A die with 3 out of 6 possible outcomes showing.</td></tr>
</tbody></table>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">If two events have no common outcomes, they are called <i>disjoint</i>, and the probability that any one of them will occur is the sum of the probabilities that each of them will occur. For example, A={1} and B={2}. P(A or B), denoted P(A âȘ B), is the probability that either A or B occurred. We know that a die can only show one number, so A and B can't both occur, and:</span><br />
<div style="text-align: center;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">P(A âȘ B) = P(A) + P(B).</span></div>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">We already know that the probability of the entire <i>probability space</i> (all possible outcomes) is 1, so: </span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">P({1, 2, 3, 4, 5, 6}) = 1. We also know that the events are disjoint, therefore P({1, 2, 3, 4, 5, 6}) equals the sum of the probabilities of all the events. If the die is fair (it is not biased towards a certain outcome), then the probability of every outcome is equal. Therefore, P(1) = ... = P(6) = ⅙. This is called a <i>uniform distribution</i>. In most real-world examples, this is not the case, otherwise, probability would have been boring (and probability is fascinating! Really!).</span><br />
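As a small sanity check, here is the fair die in Python, using exact fractions so nothing is lost to rounding (the helper names are mine):

```python
from fractions import Fraction

# A fair die: uniform distribution over six outcomes.
p = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

# The whole probability space: the outcomes are disjoint, so the
# probability that any of them occurs is the sum, which must be 1.
assert sum(p.values()) == 1

# An event is a set of outcomes; its probability is the sum of theirs.
def prob(event):
    return sum(p[o] for o in event)

print(prob({1, 3, 5}))  # P(odd) = 1/2
```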
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">Every event has a <i>complement</i>. For example, the event A="the die shows an odd number"={1, 3, 5} has a complement AÌ„="the die shows an even number"={2, 4, 6}. The event B={1} has a complement BÌ„={2, 3, 4, 5, 6}. The complement of an event is all the other possible outcomes. Now you must notice that by definition, "A or AÌ„" is the entire probability space, and A and AÌ„ </span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.2px; line-height: 18.48px;">are disjoint. Using the two properties we've just discussed:</span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">1) the probability of the entire probability space is 1.</span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">2) the probability that any of disjoint events occurred is the sum of their probabilities.</span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.2px; line-height: 18.48px;">We can tell that P(A) + P(AÌ„) = 1. So if you know the probability that it would rain tomorrow P(R), you also know the probability that it won't rain tomorrow: 1 - P(R).</span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><u>Joint & </u></span></span><u style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">Conditional</u><u style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"> Probabilities</u><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">We can also discuss the <i>joint probability</i> of events A and B: this is the probability that both events will occur. For example, what is the probability of rolling an even number which is bigger than 2? Let's define two events. A is the event of even outcomes: A = {2, 4, 6}, and B is the event of outcomes larger than 2: B = {3, 4, 5, 6}. Then C is the intersection of A and B, denoted A ∩ B: it contains all the outcomes that are both even (in A) and larger than 2 (in B): C = {4, 6}. Since {4} and {6} are disjoint, and the probability of each outcome is ⅙, then P(C), which is also denoted as the joint probability of A and B, P(A, B) = P({4}) + P({6}) = ⅙ + ⅙ = ⅓.</span></span><br />
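To make this concrete, here is a minimal Python sketch (mine, not from the original post) that enumerates the outcomes of a fair die and verifies the joint probability by summing over the intersection:

```python
from fractions import Fraction

# Sample space of a fair die; each outcome has probability 1/6.
p = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

A = {2, 4, 6}      # even outcomes
B = {3, 4, 5, 6}   # outcomes larger than 2

# Joint probability P(A, B) = P(A ∩ B): sum the probabilities
# of the outcomes that belong to both events.
joint = sum(p[o] for o in A & B)
print(joint)  # 1/3
```

Using `Fraction` instead of floats keeps the arithmetic exact, so the result prints as 1/3 rather than 0.3333....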
<br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">Say that you know event A occurred, for example, you know that it rains today. Does it change the probability that another event B will occur, for example, that you will be late to work today? The probability that event B occurs, given that event A occurred, is the <i>conditional probability</i> P(B|A) (B given A). If A and B are <i>dependent</i>, this probability is different from the <i>prior probability</i> of B, P(B) (the probability of B, without having any knowledge about A). The conditional probability of B given A is the ratio between the probability that A and B occur together and the probability that A occurs:</span></span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<br />
<div style="text-align: center;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">(1) P(B|A) = P(A,B) / P(A)</span></div>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">For example, when rolling a die, let A = {1, 3, 5} (odd outcome) and B = {4, 5, 6} (outcome greater than 3). Then P(A,B) = P(A ∩ B) = P({5}) = ⅙. P(A) = ⅙ + ⅙ + ⅙ = ½.</span></span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">Therefore, P(B|A) = P(A,B) / P(A) = ⅙ / ½ = ⅓ &lt; P(B) = P({4, 5, 6}) = ⅙ + ⅙ + ⅙ = ½. So if you know that the outcome was odd, the probability that it was greater than 3 has decreased.</span><br />
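The same die example can be checked with a short Python sketch (illustrative, not from the original post) that computes equation (1) directly:

```python
from fractions import Fraction

# Fair die: every outcome in {1, ..., 6} has probability 1/6.
p = {o: Fraction(1, 6) for o in range(1, 7)}

A = {1, 3, 5}   # odd outcome
B = {4, 5, 6}   # outcome greater than 3

def P(event):
    """Probability of an event = sum over its outcomes."""
    return sum(p[o] for o in event)

# Equation (1): P(B|A) = P(A,B) / P(A)
p_b_given_a = P(A & B) / P(A)
print(p_b_given_a, P(B))  # 1/3 1/2 -- knowing A lowered the probability of B
```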
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">On the other hand, some events may be <i>independent</i>. For example, if B is the event of outcomes larger than 2: B = {3, 4, 5, 6}, and A remains the same, then P(A,B) = P(A ∩ B) = P({3, 5}) = ⅙ + ⅙ = ⅓. P(A) remains ½, and P(B|A) = P(A,B) / P(A) = ⅓ / ½ = ⅔ = P(B) = P({3, 4, 5, 6}) = ⅙ + ⅙ + ⅙ + ⅙ = ⅔. So knowing that the outcome was odd doesn't affect the chances that the outcome is greater than 2, and A and B are <i>independent</i>.</span><br />
<i style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></i>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">If two events A and B are <i>independent</i>, then P(B|A) = P(B), P(A|B) = P(A), and using equation (1) we get that P(A,B) = P(A)P(B). So if you know that two events are independent, and you want to know the probability that both of them will occur, you need to multiply the probabilities that each of them will occur. For example, what is the probability that a die will have an odd outcome (A) and a coin will show heads (B)? Intuitively, these two experiments are independent, so P(A) = ½, P(B) = ½, and P(A,B) = ½ * ½ = ¼. But don't trust your intuition, and always make sure that these events are really independent. Sometimes two events seem independent while they are actually not (as in the <a href="https://en.wikipedia.org/wiki/Butterfly_effect">butterfly effect</a>).</span></span><br />
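For events with small sample spaces, independence can be verified by enumeration rather than intuition. Here is a sketch of mine (not from the original post) for the die-and-coin example, enumerating all twelve equally likely (die, coin) pairs:

```python
from fractions import Fraction
from itertools import product

# Joint sample space of one die roll and one coin flip:
# all 12 (die, coin) pairs are equally likely.
outcomes = list(product(range(1, 7), ["H", "T"]))
p_each = Fraction(1, len(outcomes))

A = [(d, c) for d, c in outcomes if d % 2 == 1]  # die is odd
B = [(d, c) for d, c in outcomes if c == "H"]    # coin shows heads

def P(event):
    return p_each * len(event)

joint = P([o for o in A if o in B])
print(P(A), P(B), joint)  # 1/2 1/2 1/4 -> P(A,B) = P(A) * P(B)
```

Since the product of the marginal probabilities equals the joint probability, A and B are indeed independent.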
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPtwIeJXN8VaffqnNXGSWzXuljFn-PGbm0NUdKiOvdZb34EwwDxI17M_nYAsKXL8FIyfr6nSBD-b78V27-tC_9_uxX-GeJeq85sH9iEj413pNHQaPwflS3R-BJZrt5C5bs92zTsuuw6jc/s1600/butterfly.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPtwIeJXN8VaffqnNXGSWzXuljFn-PGbm0NUdKiOvdZb34EwwDxI17M_nYAsKXL8FIyfr6nSBD-b78V27-tC_9_uxX-GeJeq85sH9iEj413pNHQaPwflS3R-BJZrt5C5bs92zTsuuw6jc/s200/butterfly.jpg" width="188" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">If this butterfly flapped its wings yesterday, what are the chances I will be late to work next week?</td></tr>
</tbody></table>
<i style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></i><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><u>Bayes Rule</u></span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">Using equation (1) in both directions, we get that P(A,B) = P(A) * P(B|A) = P(B) * P(A|B), and dividing by P(B) gives us Bayes rule:</span><br />
<div style="text-align: left;">
<br /></div>
<div style="text-align: center;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">(2) P(</span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px; text-align: left;">A|B</span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">) = </span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px; text-align: left;">P(A) * </span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">P(B|A) / P(B)</span></div>
<br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">This can be useful in cases when you know the conditional probability in one direction and would like to compute the other. For example, let's say that there is a clinical test that should diagnose a specific illness. This test is not 100% accurate: if a person is ill, it will come out positive 98% of the time. If the person is healthy, it will come out positive 1% of the time. The ratio of ill people in the population is 2%. Say that someone took this test, and it came out positive. Does it necessarily mean that he is ill? No, but there is some probability that he is, which we can compute.</span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">A is the event that a person has this illness. B is the event that the test came out positive. We know that P(B|A)=0.98 (the probability that the test comes out positive for an ill person). We also know that P(A)=0.02 (the probability of having this illness). We would like to compute the probability P(A|B). We can use equation (2), but we need to know P(B) - the probability that the test comes out positive.</span></span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">We can use the <i>law of total probability</i> according to which</span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">:</span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<br />
<div style="text-align: center;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">(3) P(B) = P(A,B) + P(AÌ„,B) = P(B|A) P(A) + P(B|AÌ„) P(AÌ„)</span><br />
<div>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span></div>
</div>
<div style="text-align: left;">
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">In this example, what is the probability that the test came out positive? There are two cases, one in which the person is ill, and another in which he is healthy. These events are disjoint (because a person is either ill or healthy but never both). </span></span></div>
<div style="text-align: left;">
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span></div>
<div style="text-align: left;">
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">So we get that P(B) = P(B|A) P(A) + P(B|AÌ„) P(AÌ„) = 0.98*0.02 + 0.01*(1-0.02) = 0.0294, and using Bayes rule, P(A|B) = P(A) * P(B|A) / P(B) = 0.02*0.98 / 0.0294 = ⅔. Since the test is not very accurate, and the illness is so rare, if someone is tested positive for this illness, there is a probability of ⅓ that he is actually healthy!</span></span></div>
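The clinical-test computation fits in a few lines of Python; this sketch (mine, for illustration) applies the law of total probability and then Bayes rule:

```python
# Posterior probability of illness given a positive test.
p_ill = 0.02          # P(A): prior probability of the illness
p_pos_ill = 0.98      # P(B|A): test positive when ill
p_pos_healthy = 0.01  # P(B|not A): false positive when healthy

# Law of total probability, equation (3): P(B)
p_pos = p_pos_ill * p_ill + p_pos_healthy * (1 - p_ill)

# Bayes rule, equation (2): P(A|B) = P(A) * P(B|A) / P(B)
p_ill_given_pos = p_ill * p_pos_ill / p_pos

print(round(p_pos, 4), round(p_ill_given_pos, 4))  # 0.0294 0.6667
```

Changing `p_ill` is a quick way to see how strongly the prior drives the posterior: the rarer the illness, the less a positive test means.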
<div style="text-align: left;">
<br /></div>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><u>The Chain Rule</u></span>
<br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">We've seen that P(A,B) = P(A) * P(B|A). Sometimes it is useful in this direction as well. This is called the <i>chain rule</i>, and it can be extended to more than two events. In some cases, we would like to compute the probability of multiple events, rather than just one or two. For example, say that you know the probabilities of first names in a certain country. When a child is born, he has a certain probability of being named John (P(John)), and other probabilities for other names. If you know the names of his older siblings, this may affect the probability of his name; if his older sibling is called John, it reduces the probability that he will also be named John. And if his sister's name is Ablah, then he is more likely to be named Mohammad than David. If you want to compute the joint probability of the names of all children in the family, for example P(John, Jane, David), you can use the chain rule. You will need to know the <i>prior probability</i> of the name John (the probability that a kid is called John if you don't have any knowledge about his siblings, or if he is the first child). Then, you will need to know the probability of a girl being called Jane, given that her brother's name is John. Last, you will need to know the probability of a boy being called David, given that he has two siblings named John and Jane. In general, the probability that events A<sub>1</sub>, A<sub>2</sub>, ..., A<sub>n</sub> occurred is the product of the probabilities of each event given that the previous events occurred:</span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<br />
<div style="text-align: center;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">(4) P(A<sub>1</sub>, A<sub>2</sub>, ..., A<sub>n</sub>) = P(A<sub>1</sub>) P(A<sub>2</sub>|A<sub>1</sub>) ... P(A<sub>n</sub>|A<sub>1</sub>, A<sub>2</sub>, ..., A<sub>n-1</sub>)</span></div>
<br />
<br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif;"><span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">This, again, is called </span></span></span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px; text-align: center;">the </span><i style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px; text-align: center;">chain rule.</i><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif;"><span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span></span></span>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">In some cases, we can make a <i>Markov assumption</i>: that the probability of an event depends only on the preceding <i>k</i> events (for some fixed number <i>k</i>). For example, if a family has 5 children, the joint probability of their names is:</span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">P(A<sub>1</sub>,A<sub>2</sub>,A<sub>3</sub>,A<sub>4</sub>,A<sub>5</sub>) = P(A<sub>1</sub>) P(A<sub>2</sub>|A<sub>1</sub>) P(A<sub>3</sub>|A<sub>1</sub>A<sub>2</sub>) P(A<sub>4</sub>|A<sub>1</sub>A<sub>2</sub>A<sub>3</sub>) P(A<sub>5</sub>|A<sub>1</sub>A<sub>2</sub>A<sub>3</sub>A<sub>4</sub>)</span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">But if we assume that a child's name depends only on their two immediate older siblings' names (k = 2), then we get:</span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">P(A<sub>1</sub>,A<sub>2</sub>,A<sub>3</sub>,A<sub>4</sub>,A<sub>5</sub>) = P(A<sub>1</sub>) P(A<sub>2</sub>|A<sub>1</sub>) P(A<sub>3</sub>|A<sub>1</sub>A<sub>2</sub>) P(A<sub>4</sub>|A<sub>2</sub>A<sub>3</sub>) P(A<sub>5</sub>|A<sub>3</sub>A<sub>4</sub>)</span><br />
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">which is easier to compute.</span></span><br />
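To make this concrete, here is a minimal Python sketch of computing a joint probability under a k-th order Markov assumption. The names and the probability values in the toy table are made up purely for illustration.

```python
def joint_probability(names, cond_prob, k=2):
    """Joint probability of a name sequence under a k-th order Markov
    assumption: each name depends on at most the k preceding names."""
    total = 1.0
    for i, name in enumerate(names):
        context = tuple(names[max(0, i - k):i])
        total *= cond_prob(name, context)
    return total

# Toy conditional probabilities, made up for illustration.
TABLE = {
    ("John", ()): 0.1,
    ("Jane", ("John",)): 0.2,
    ("Mary", ("John", "Jane")): 0.3,
}

def cond_prob(name, context):
    return TABLE.get((name, context), 0.05)

# p = P(John) * P(Jane|John) * P(Mary|John,Jane)
p = joint_probability(["John", "Jane", "Mary"], cond_prob, k=2)
```

Note that with k = 2 the context passed to cond_prob never grows beyond two names, which is exactly why the factored form is cheaper to estimate and compute.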
<br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><u>Approximation</u></span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">Let's return to the names example. What if you don't know the probability function of names, but you do have access to the list of all names given in a certain time and country? You can <i>estimate </i>(approximate) the probability. One simple way of doing this is by counting. This method is called <i>Maximum Likelihood Estimation</i> (MLE). If you want to know the probability of a child being named John, check the ratio of people called John in the entire population, so that: P*(John) = #John/N, where N is the number of people in the list you have. Since this is not the true probability but an approximation, it is denoted by P* and not by P.</span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif;"><span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span></span></span></span>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif;"><span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">If you want to know the probability of someone being called Jane given that her </span></span></span></span><span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">immediate older sibling's name is John, count all the pairs of John followed by Jane and divide by all pairs of John followed by any name: </span></span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"> </span><span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">P*(Jane|John) = #(John,Jane)/#(John,*).</span><br />
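The counting described above takes only a few lines of Python. The family name lists below are made up for illustration; a real estimate would use a full national name registry.

```python
from collections import Counter

# Toy list of sibling name sequences, oldest first (made up).
families = [
    ["John", "Jane"],
    ["John", "Mary"],
    ["John", "Jane", "Bob"],
]

# Unigram MLE: P*(name) = #name / N
names = [n for fam in families for n in fam]
unigrams = Counter(names)
N = len(names)
p_john = unigrams["John"] / N

# Bigram MLE: P*(Jane|John) = #(John, Jane) / #(John, *)
pairs = Counter((a, b) for fam in families for a, b in zip(fam, fam[1:]))
p_jane_given_john = pairs[("John", "Jane")] / sum(
    count for (first, _), count in pairs.items() if first == "John"
)
```

Here `zip(fam, fam[1:])` produces the adjacent (older sibling, younger sibling) pairs, so the two counters are exactly the #name and #(name, name) counts in the formulas above.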
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">There are more complex methods to approximate a probability function, but I think that's enough for one post. </span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;"><br /></span>
<span style="color: #666666; font-family: Trebuchet MS, Trebuchet, Verdana, sans-serif;"><span style="font-size: 13.1999998092651px; line-height: 18.4799995422363px;">So, given that you are reading this sentence now, what is the probability that you read the entire post?</span></span><br />
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">
<span style="color: #666666; font-family: 'Trebuchet MS', Trebuchet, Verdana, sans-serif; font-size: 13.1999998092651px; line-height: 18.4799995422363px;">
</span></span></div>
</div>
</div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com0tag:blogger.com,1999:blog-9145120678290195131.post-5676976371688059812015-08-10T17:05:00.000+03:002015-08-10T17:10:02.587+03:00Supervised Learning<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: left;">
If mathematics is both the queen and servant of science, then <a href="https://en.wikipedia.org/wiki/Machine_learning">machine learning</a> must be the princess and the maid of AI (artificial intelligence). The main goal in AI is to develop software capable of intelligent behavior, for example, a <a href="https://en.wikipedia.org/wiki/Autonomous_car">self-driving car</a>. One of the definitions of intelligent behavior is the ability to <i>learn </i>from "experience" (given data) and develop a model that can understand and react to new data. This goal is achieved with <i>machine learning</i>.</div>
<div style="text-align: left;">
<br />
This post is a high-level overview of <i>supervised learning</i>, which is the simplest category of machine learning. In future posts, I can elaborate on the actual algorithms and I might also write about other categories, such as unsupervised learning and deep learning -- but I have to admit that I'll have to study them in more depth for that.<br />
<br />
I think the best way to explain supervised learning is with a motivating example: <a href="https://en.wikipedia.org/wiki/Anti-spam_techniques">spam detection</a>. Your mailbox contains a spam folder, and when spam is detected, it is stored there rather than in the inbox. How does your mail provider recognize that a certain email is spam?<br />
<br />
Suppose you had to manually decide whether a certain email is spam or not. You would probably check whether the sender is known, and whether the message contains suspicious words and phrases such as "free", "cash bonus", and other <a href="http://blog.hubspot.com/blog/tabid/6307/bid/30684/The-Ultimate-List-of-Email-SPAM-Trigger-Words.aspx">spam-triggering words</a>. Then you can define rules on top of these observations, for instance: "classify as spam if the message is sent from an unknown sender and contains at least 2 spam-triggering words, and as non-spam otherwise". In the same way, you can define these rules and let the software apply them automatically to new emails. This approach is called <i>rule-based</i>. While it can lead to accurate results, it requires the effort of defining the rules, which in some tasks must be done by experts (spam detection is a relatively easy task).<br />
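A minimal sketch of such a rule-based classifier. The trigger words and the threshold of 2 are just the illustrative rule from above, not a real filter.

```python
# Made-up trigger words for illustration; real spam filters use
# much longer, curated lists.
SPAM_TRIGGERS = {"free", "cash bonus", "winner", "urgent"}

def is_spam(sender_known, message):
    """Rule: spam if the sender is unknown and the message contains
    at least 2 spam-triggering words; non-spam otherwise."""
    text = message.lower()
    hits = sum(1 for trigger in SPAM_TRIGGERS if trigger in text)
    return (not sender_known) and hits >= 2

flagged = is_spam(False, "You are a WINNER! Claim your cash bonus now")
kept = is_spam(True, "Free lunch tomorrow?")
```

The weakness is visible even in this toy: every new spam trick means another hand-written rule, which is exactly the effort that learning from examples avoids.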
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiS2_W87wv3DF2He1u9vE7w40r0Y8l8YThyphenhyphenS65eDe9a8JL2uidK-wdgaVxTomzUktzNwUVNGYtgeSVouhghf3kWnjlO7RPqsqGQdMRibkya1xsgpWqMaK0WQy2x3HTgQ4QD2sQMm9Fucps/s1600/My+spam.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiS2_W87wv3DF2He1u9vE7w40r0Y8l8YThyphenhyphenS65eDe9a8JL2uidK-wdgaVxTomzUktzNwUVNGYtgeSVouhghf3kWnjlO7RPqsqGQdMRibkya1xsgpWqMaK0WQy2x3HTgQ4QD2sQMm9Fucps/s400/My+spam.PNG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: 12.8000001907349px;">An example for accurate spam detection</span></td></tr>
</tbody></table>
<br />
The common solution is to let the machine learn a model (a function) that receives an email as input and returns whether or not it is a spam message, based on observed features of the message, such as the sender and content.<br />
<br />
In supervised learning, the machine is provided with a set of labeled examples, called the <i>training set</i>. This is a small set of data items whose classification is known in advance (e.g. annotated by humans). Each instance describes one data item (e.g. an email message), using predefined <i>features</i> that are relevant for the task at hand (e.g. the sender address, each of the words that occur in the subject of the message, etc.). In addition, in the training set, each item has a matching <i>true label</i>, which is the item's known class (e.g. spam / non-spam). The machine performs a <i>learning</i> phase, during which it learns a function (model) that receives an unlabeled instance (e.g. a new email message) and returns its <i>predicted label</i> (e.g. spam / non-spam). <br />
<br />
The learning phase is performed once, and then the model is ready to use for <i>inference</i> as many times as you want. You can give it a new unlabeled instance (e.g. a new email message that just arrived) and it will predict its <i>class </i>(e.g. spam / non-spam) by applying the learned function.<br />
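The learning and inference phases can be sketched with a tiny Naive Bayes classifier written from scratch (one of the simplest supervised learning algorithms, and a natural fit here since it is built on the probability estimates discussed in an earlier post). The four training messages are made up; a real system would use far more data and features.

```python
from collections import Counter
import math

# Tiny made-up training set of (message, true label) pairs.
train = [
    ("free cash bonus click now", "spam"),
    ("winner claim your free prize", "spam"),
    ("meeting notes attached", "non-spam"),
    ("lunch tomorrow at noon", "non-spam"),
]

# Learning phase (done once): count word frequencies per class.
word_counts = {"spam": Counter(), "non-spam": Counter()}
class_counts = Counter()
for message, label in train:
    class_counts[label] += 1
    word_counts[label].update(message.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(message):
    """Inference: return the most probable class for an unlabeled message."""
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log P(label) + sum of log P(word|label), with add-one smoothing
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for word in message.split():
            score += math.log((word_counts[label][word] + 1) /
                              (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

prediction = predict("claim your free cash bonus")
```

Note the separation: `train` is used only once to build the counts, after which `predict` can be applied to as many new messages as you want.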
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdT3prKZOzKyXj-cB5wcDpKGAD_klBbVMucUEAKtaucYNjU3PgUo_KFUn8J_YZLP11H7a2smybxso72hGXv7WJihVuw7_UAbgIIC0bChlm-t3VM9RbjUdkYphIohnbtj7mK6GqjG7gW20/s1600/pipeline.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="183" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdT3prKZOzKyXj-cB5wcDpKGAD_klBbVMucUEAKtaucYNjU3PgUo_KFUn8J_YZLP11H7a2smybxso72hGXv7WJihVuw7_UAbgIIC0bChlm-t3VM9RbjUdkYphIohnbtj7mK6GqjG7gW20/s400/pipeline.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: 12.8000001907349px;">Supervised learning pipeline (picture taken from <a href="http://www.cse.iitk.ac.in/users/se367/10/presentation_local/Binary%20Classification.html">here</a>).</span></td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
As you may have noticed, spam detection does not perform perfectly; sometimes a spam message is missed and stays in the inbox (the model classifies a "positive" spam as "negative" non-spam - a <i>false negative</i>). At other times, a valid message unjustifiably finds its way to the spam folder (the model classifies a "negative" non-spam as "positive" spam - a <i>false positive</i>).<br />
<br />
In order for the algorithm to perform well, it needs to learn a model that best describes the training set, with the assumption that the training set is representative of the real-world instances. In order to assess how successful a learned model is (in comparison with other models or in general), an <i>evaluation</i> is performed. This requires an additional set of labeled examples, used to test the model, which is called the <i>test set</i>. This set is disjoint from the <i>training set </i>and not used during the learning phase. The model is applied to each of the instances in the test set, and the predicted label is compared with the true label (<i>gold standard</i>) given in the test set. An <i>evaluation measure</i> is then computed - for example precision<a href="http://veredshwartz.blogspot.co.il/2015/08/supervised-learning.html#1" name="top1"><sup>1</sup></a>, recall<a href="http://veredshwartz.blogspot.co.il/2015/08/supervised-learning.html#2" name="top2"><sup>2</sup></a> or F1<a href="http://veredshwartz.blogspot.co.il/2015/08/supervised-learning.html#3" name="top3"><sup>3</sup></a>.<br />
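A sketch of computing these evaluation measures from a model's predictions on a test set. The gold and predicted labels below are made up for illustration.

```python
def evaluate(true_labels, predicted_labels, positive="spam"):
    """Precision, recall and F1 for the positive class."""
    tp = sum(1 for t, p in zip(true_labels, predicted_labels)
             if t == positive and p == positive)   # true positives
    fp = sum(1 for t, p in zip(true_labels, predicted_labels)
             if t != positive and p == positive)   # false positives
    fn = sum(1 for t, p in zip(true_labels, predicted_labels)
             if t == positive and p != positive)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up gold standard and model predictions for a 5-item test set.
gold = ["spam", "spam", "non-spam", "non-spam", "spam"]
pred = ["spam", "non-spam", "non-spam", "spam", "spam"]
p, r, f = evaluate(gold, pred)
```

Here the model caught 2 of the 3 true spam messages (recall) and 2 of its 3 spam predictions were correct (precision), with F1 as their harmonic mean.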
<br />
Of course, spam detection is only one of many examples of supervised learning. Others include:<br />
<ul style="text-align: left;">
<li>Medical diagnosis - predict whether a patient suffers from a certain disease, based on their symptoms</li>
<li>Detecting fraudulent credit card transactions</li>
<li><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html">Lexical inference</a> - predict whether two terms hold a certain semantic relation, based on the relations between them in knowledge resources</li>
</ul>
In addition, there are more complex variations of supervised learning. The examples I gave were of <i>binary classification</i>, where each instance is classified to one of two classes: either positive (e.g. spam) or negative (e.g. non-spam). Other tasks require <i>multi-class classification</i>, in which every instance can be classified to one of several predefined classes; for instance, in <a href="https://en.wikipedia.org/wiki/Optical_character_recognition">optical character recognition</a> (OCR), each hand-written character should be classified as one of the possible characters or digits in the alphabet.<br />
<br />
In other tasks, any instance can be classified to multiple classes from a predefined set of classes, for instance, determining the different topics of a document, from a predefined set of topics (this post can be classified as computer science, machine learning, supervised learning, etc). This is called <i>multi-label classification</i>.<br />
<br />
More complex tasks require outputting a structure rather than a class - this is called <i>structured prediction</i>. One such task is <a href="https://en.wikipedia.org/wiki/Part-of-speech_tagging">part-of-speech (POS) tagging</a>: given a sentence, predict the part-of-speech of every word in the sentence (e.g. noun, verb, adjective). Rather than predicting the POS tag of every word separately, the sequence is predicted together, taking advantage of dependencies between preceding POS tags; e.g. if the previous word is tagged as a determiner, it is more likely that this word is a noun.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwaygg8iA_mekhQbxmMqFwsPsROVDHGpWABT8d4lxS5ZhYGokR9y8lcFLnfQMJ_soDEKD73OdglatHTauxoT9Ykp_MAesiFxZpGeuFf2iKr68wBowP_4M2Jsn7HHTdkt9-cBKCrPVhuPM/s1600/tagger.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="205" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwaygg8iA_mekhQbxmMqFwsPsROVDHGpWABT8d4lxS5ZhYGokR9y8lcFLnfQMJ_soDEKD73OdglatHTauxoT9Ykp_MAesiFxZpGeuFf2iKr68wBowP_4M2Jsn7HHTdkt9-cBKCrPVhuPM/s320/tagger.PNG" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">An example of POS tagging, from <a href="http://nlp.stanford.edu:8080/corenlp/">Stanford Parser</a></td></tr>
</tbody></table>
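Predicting the whole tag sequence jointly is typically done with dynamic programming (the Viterbi algorithm). Here is a minimal sketch over a toy hand-built model; the tag set, transition and emission probabilities are all invented for illustration:

```python
# A toy Viterbi decoder: choose the best tag sequence jointly, rather than
# tagging each word in isolation. All probabilities are made up.
tags = ["DT", "NN", "VB"]

# transition[prev][cur]: e.g. a determiner is usually followed by a noun.
transition = {
    "<s>": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
    "DT":  {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN":  {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB":  {"DT": 0.5, "NN": 0.4, "VB": 0.1},
}
# emission[tag][word]: how likely each tag is to produce each word.
emission = {
    "DT": {"the": 0.9, "dog": 0.0, "barks": 0.0},
    "NN": {"the": 0.0, "dog": 0.8, "barks": 0.2},
    "VB": {"the": 0.0, "dog": 0.1, "barks": 0.7},
}

def viterbi(words):
    # best[t] = (score of the best path ending in tag t, that path)
    best = {t: (transition["<s>"][t] * emission[t][words[0]], [t]) for t in tags}
    for w in words[1:]:
        best = {
            t: max(
                ((score * transition[prev][t] * emission[t][w], path + [t])
                 for prev, (score, path) in best.items()),
                key=lambda x: x[0],
            )
            for t in tags
        }
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["the", "dog", "barks"]))  # ['DT', 'NN', 'VB']
```

Even though "barks" could in principle be a noun, the transition probability from NN to VB pushes the decoder toward the verb reading; that is exactly the benefit of predicting the sequence together.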
No post about machine learning is complete without mentioning <i>overfitting </i>and <i>regularization. </i>During the learning phase, the machine tries to learn a model that fits the training set. However, it might <i>overfit</i> the training set, by memorizing all the instances instead of learning the main trends in the data. In this case, the evaluation results on the training set (trying to predict the labels of the instances without looking at the true label) will be very good. On a separate test set, however, the model is expected to perform worse, since the algorithm learned a very specific function which is not good at handling unseen data. For example, suppose that our training set contains the following 6 emails (training sets are usually much larger, this is for simplicity):<br />
<br />
<table border="1" style="text-align: center;"><tbody>
<tr><th>Subject</th><th>True Label</th></tr>
<tr><td>earn extra cash</td><td>spam</td></tr>
<tr><td>our meeting on Monday</td><td>non-spam</td></tr>
<tr><td>the slides you requested</td><td>non-spam</td></tr>
<tr><td>get cash today</td><td>spam</td></tr>
<tr><td>hi</td><td>non-spam</td></tr>
<tr><td>cash bonus</td><td>spam</td></tr>
</tbody></table>
<br />
A good algorithm will learn that "cash" in the mail's subject is indicative of spam. A bad algorithm will only recognize emails with the exact subjects "earn extra cash", "get cash today" and "cash bonus" as spam. Then, if it sees a new mail with the subject "get your cash immediately", it won't know it is also spam.<br />
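The difference between the two can be sketched in a few lines of Python; this is a hypothetical illustration of memorization vs. generalization, not a real spam filter:

```python
# The 6-email training set from the table above.
train = {
    "earn extra cash": "spam",
    "our meeting on Monday": "non-spam",
    "the slides you requested": "non-spam",
    "get cash today": "spam",
    "hi": "non-spam",
    "cash bonus": "spam",
}

def overfit_classify(subject):
    # Memorizes the training set: perfect on seen subjects, clueless otherwise.
    return train.get(subject, "non-spam")

def general_classify(subject):
    # Learns the trend: "cash" in the subject is indicative of spam.
    return "spam" if "cash" in subject.split() else "non-spam"

unseen = "get your cash immediately"
print(overfit_classify(unseen))  # non-spam: the memorizer misses it
print(general_classify(unseen))  # spam: the general rule catches it
```

Both classifiers score perfectly on the training set itself; only the general one handles the unseen subject correctly, which is why evaluation must use a held-out test set.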
<br />
The solution is to apply <i>regularization</i>. Without going into overly technical detail, regularization penalizes the algorithm for overfitting the training set, causing it to prefer a simpler, more general model.
<br />
<br />
This was just the tip of the iceberg of machine learning. Stay tuned for more about it!</div>
<hr dir="ltr" style="text-align: left;" />
<div dir="ltr" style="text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;"><a href="https://www.blogger.com/null" name="1"><b>1</b></a> The fraction of instances that were classified as positive (e.g. predicted to be spam) that are actually positive (e.g. actual spam messages). A numeric value between 0 and 1, 1 being the best precision, in which there are no negative instances falsely classified as positive. <a href="http://veredshwartz.blogspot.co.il/2015/08/supervised-learning.html#top1"><sup>↩</sup></a></span></div>
<div dir="ltr" style="text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;"><a href="https://www.blogger.com/null" name="2"><b>2</b></a> The fraction of positive instances (e.g. spam messages) that were also classified as positive (e.g. predicted to be spam). A numeric value between 0 and 1, 1 being the best recall, in which there are no positive instances falsely classified as negative. <a href="http://veredshwartz.blogspot.co.il/2015/08/supervised-learning.html#top2"><sup>↩</sup></a></span></div>
<div dir="ltr" style="text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;"><a href="https://www.blogger.com/null" name="3"><b>3</b></a> A measure that balances between precision and recall. A numeric value between 0 and 1, 1 being the best F1. <a href="http://veredshwartz.blogspot.co.il/2015/08/supervised-learning.html#top3"><sup>↩</sup></a>
</span></div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com6tag:blogger.com,1999:blog-9145120678290195131.post-51009602641421808922015-07-13T17:46:00.000+03:002015-07-13T18:32:34.526+03:00Lexical Inference<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="rtl" style="text-align: right;" trbidi="on">
<div dir="ltr" style="text-align: left;">
After I dedicated the <a href="http://veredshwartz.blogspot.co.il/2015/07/natural-language-processing.html">previous post</a> to the awesome field of natural language processing, in this post I will drill down and tell you about the specific task that I'm working on: recognizing <b>lexical inference</b>. Most of the work that I will describe was done by other talented people. You can see references to their papers at the bottom of the post, in case you would like to read more about a certain work.</div>
<div dir="ltr" style="text-align: left;">
<br />
<b>What?</b></div>
<div dir="ltr" style="text-align: left;">
I'll start by defining what lexical inference is. We are given two terms, <i>x</i> and <i>y</i> (a term is a word such as <i>cat</i> or a multi-word expression such as <i>United States of America</i>). We would like to know whether we can infer the meaning of <i>y</i> from <i>x </i>(denoted by <i style="text-align: center;">x </i><span style="text-align: center;">→ <i>y </i>throughout this post).</span><br />
<span style="text-align: center;"><br />For example, we can infer </span><i style="text-align: center;">animal</i><span style="text-align: center;"> from <i>cat</i>, because when we talk about a <i>cat</i> we refer to an <i>animal</i>. In general, </span><i>y</i> can be inferred from <i>x</i> if they hold a certain lexical or semantic relation; for example, if <i>x </i>is a <i>y</i> (<i>cat </i><span style="text-align: center;">→ <i>animal</i>, <i>Lady Gaga </i></span><span style="text-align: center;">→ <i>singer</i>), if <i>x</i> causes <i>y</i> (<i>flu </i></span><span style="text-align: center;">→ <i>fever</i>), if <i>x </i>is a part of <i>y </i>(<i>London </i></span><span style="text-align: center;">→ <i>England</i>), etc. </span><br />
<span style="text-align: center;"><br /><b>Why?</b></span><br />
<span style="text-align: center;">Now would be a good time to ask - why is this task important? We know that a <i>cat </i>is an <i>animal</i>. How would it help us if the computer can automatically infer that? I'll give a usage example. Let's say you use a search engine and type the query "actor Scientology" (or "actors engaged in </span><span style="text-align: center;">Scientology", if you don't search by keywords). </span><span style="text-align: center;">You expect the search engine to retrieve the following documents:</span><br />
<span style="text-align: center;"><br /></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDjCd47en1__psgzSiKHpaE_1bMxiMW84IsZxwiQQIJBfb5P-c1JD9DK4Awo5QKDsD23A4CXdD7i9EljIuxbPtdlXe_NbK3DJuZ4xYysfPebUPea5JwoBY8GRDf7WsrIKRMcdFSNwXjMY/s1600/query_expansion.PNG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="125" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDjCd47en1__psgzSiKHpaE_1bMxiMW84IsZxwiQQIJBfb5P-c1JD9DK4Awo5QKDsD23A4CXdD7i9EljIuxbPtdlXe_NbK3DJuZ4xYysfPebUPea5JwoBY8GRDf7WsrIKRMcdFSNwXjMY/s400/query_expansion.PNG" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 1: search results for the query "actor Scientology" that don't directly involve the word "actor"</td></tr>
</tbody></table>
<span style="text-align: center;"><br /></span>
<span style="text-align: center;">since they are talking about a certain actor (Tom Cruise or John Travolta) and </span><span style="text-align: center;">Scientology. However, what if these documents don't contain the word <i>actor</i>? The search engine needs to know that </span><span style="text-align: center;"><i>Tom Cruise </i></span><span style="text-align: center;"><i>→ actor</i> to retrieve the first document, and that </span><span style="text-align: center;"><i>John Travolta</i></span><span style="text-align: center;"><i> </i></span><span style="text-align: center;"><i>→ actor </i>to retrieve the second.</span><br />
There are many other applications, and in general, knowing that one term infers another helps deal with language variability (there is more than one way of saying the same thing).<br />
<br />
<b>How?</b><br />
People have been working on this task for many years. As many other NLP tasks, this one is also difficult. There are two main approaches to recognize lexical inference:<br />
<ul style="text-align: left;">
<li><b>Resource-based:</b> in this approach, the inference is based on knowledge from hand-crafted resources that specify the semantic or lexical relations between words or entities in the world. In particular, the resource which is usually used for this task is <a href="http://wordnet.princeton.edu/">WordNet</a>, a lexical database of the English language. WordNet contains words which are connected to each other via different relations, such as (<i>tail</i>, part of,<i style="text-align: center;"> cat</i><span style="text-align: center;">) and </span>(<i style="text-align: center;">cat</i>, subclass of,<i style="text-align: center;"> feline</i><span style="text-align: center;">).<sup><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#1" name="top1">1</a> </sup></span><span style="text-align: center;">See figure 2 for an illustration of WordNet.<br /><br />This approach is usually very precise (it is correct most of the time when it says that </span><i style="text-align: center;">x </i><span style="text-align: center;">→ <i>y</i>), because it relies on knowledge which is quite precise. However, its coverage (the percentage of times in which it recognizes that </span><i style="text-align: center;">x </i><span style="text-align: center;">→ <i>y</i>, out of all the times that <i>x </i>→ <i>y </i>is true) is limited, because some of the knowledge needed for the inference may be absent from the resource.</span><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnoAjV1p2Pe4rl7QMHYkUnhFFGyBBhB_95At4muNskJ7-ekigFWZkgdhyGXLugFtHPY0otR0CWdI0ZNRxf3opAxabASuU4maC9vh1sgc5jVbmzY7EvRLf_7lZ7TCEzKExS_edCuZZfXwQ/s1600/wordnet.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnoAjV1p2Pe4rl7QMHYkUnhFFGyBBhB_95At4muNskJ7-ekigFWZkgdhyGXLugFtHPY0otR0CWdI0ZNRxf3opAxabASuU4maC9vh1sgc5jVbmzY7EvRLf_7lZ7TCEzKExS_edCuZZfXwQ/s320/wordnet.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 2: an excerpt of WordNet - a lexical database of the English language</td></tr>
</tbody></table>
<br />
</li>
<li><b>Corpus-based:</b> this approach uses a huge text called a "corpus" (e.g. all the English articles in Wikipedia), which is supposed to be representative of the language. The inference is based on the statistics of occurrences of <i>x</i> and <i>y </i>in the corpus. There are several ways to use a corpus to recognize lexical inference:<br /><br />
<ul>
<li><b>pattern-based approach</b> - there are some patterns such as "<i>x </i>and other <i>y</i>" or "<i>y</i> such as <i>x</i>" that indicate that <i style="text-align: center;">x </i><span style="text-align: center;">→ <i>y</i>;</span> if you find it difficult to understand, think about "<i>animals such as cats</i>" and "<i>cat and other animals</i>" and ignore the plural/singular. If <i>x</i> and <i>y</i> frequently occur in the corpus in such patterns, this approach will recognize that <i style="text-align: center;">x </i><span style="text-align: center;">→ <i>y</i>. It is not enough to observe one or two occurrences; think about the sentence "<i>my brother and other students</i>". It may occur in the corpus, but this is not a general phenomenon: <i>student</i> is not a common attribute of <i>brother</i>. Positive examples such as <i>cat</i> and <i>animal</i> will probably occur more frequently in these patterns in the corpus. </span><br /><br />The first method defined these patterns manually <sup><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#ref1" name="top-ref1">[1]</a> </sup><span style="text-align: center;">. A later work found such patterns automatically <sup style="text-align: left;"><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#ref2" name="top-ref2">[2]</a></sup>. This work was highly referenced and used. It is quite precise and also has good coverage. However, it requires that <i>x</i> and <i>y</i> occur together in the corpus, and some words tend not to occur together, even though they are highly related; for instance, synonyms (e.g. <i>elevator </i>and <i>lift</i>).<br /></span></li>
</ul>
<ul>
<li><b>distributional approach</b> - the second approach solves exactly this. It is based on a linguistic hypothesis <sup><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#ref3" name="top-ref3">[3]</a></sup> that says that if words occur with similar neighboring words, then they tend to have similar meanings (e.g. <i style="text-align: center;">elevator </i><span style="text-align: center;">and </span><i style="text-align: center;">lift </i><span style="text-align: center;">will both appear next to <i>down</i>, <i>up</i>, <i>building</i>, <i>floor</i>, and <i>stairs</i>). There has been plenty of work in this approach: earlier methods defined some similarity measure between words which was based on the neighbors (the more common neighbors they share, the more similar they are)<span style="text-align: left;"> </span><sup style="text-align: left;"><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#ref4" name="top-ref4">[4]</a>,<a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#ref5" name="top-ref5">[5]</a></sup>. In recent years, some automatic methods (that don't require defining a similarity measure) were developed (I might elaborate on these in another post, but it requires knowledge in topics that I haven't covered yet).<br /></span></li>
</ul>
<span style="text-align: center;">Corpus-based methods, and in particular distributional ones, have a much higher coverage than resource-based methods, because they utilize huge texts. The amount of text available on the web is incredible, as opposed to structured knowledge. However, they are much less precise. The distributional <span style="text-align: left;">hypothesis says something about the <b>similarity </b>of <i>x</i> and <i>y</i>, which is a vague notion. Just because <i>x </i>and <i>y</i> are similar (what does that even mean?) it doesn't mean that we can infer <i>x</i> from <i>y</i> or vice versa; for instance, the words <i>football</i> and <i>basketball</i> are similar, and will probably share some common neighbors such as <i>ball</i>, <i>player</i>, <i>team</i>, <i>match</i>, and <i>win</i>. However, you can't infer one from the other. Moreover, </span>distributional methods may say that <i>hot</i> and <i>cold</i> are similar, because both occur with <i>weather</i>, <i>temperature</i>, <i>drink, </i><i>water</i>, etc. Now this is too much. Not only is it the case that <i>hot </i></span><span style="text-align: center;">↛ </span><span style="text-align: center;"><i>cold </i>and </span><span style="text-align: center;"><i>cold </i>↛ </span><span style="text-align: center;"><i>hot</i>, but they mean exactly the opposite!</span></li>
</ul>
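The pattern-based idea above can be sketched in a few lines of Python. This is a toy illustration over a tiny made-up corpus, not the actual method of [1] or [2]: match Hearst-style patterns, count how often each candidate pair occurs, and keep only pairs above a frequency threshold:

```python
import re
from collections import Counter

# A tiny, made-up corpus (real methods use e.g. all of Wikipedia).
corpus = [
    "animals such as cats are kept as pets",
    "cats and other animals roam the streets",
    "my brother and other students passed the exam",
    "she adores animals such as cats and dogs",
]

# Two Hearst-style patterns: "y such as x" and "x and other y".
patterns = [
    re.compile(r"(?P<y>\w+) such as (?P<x>\w+)"),
    re.compile(r"(?P<x>\w+) and other (?P<y>\w+)"),
]

# Count how often each candidate (x, y) pair occurs in these patterns.
counts = Counter()
for sentence in corpus:
    for pattern in patterns:
        for m in pattern.finditer(sentence):
            counts[(m.group("x"), m.group("y"))] += 1

# (cats, animals) appears three times; (brother, students) only once,
# so a frequency threshold filters out the one-off match.
inferred = {pair for pair, c in counts.items() if c >= 2}
print(inferred)  # {('cats', 'animals')}
```

The threshold is exactly what separates the genuine phenomenon (<i>cats</i> → <i>animals</i>) from an incidental co-occurrence like "my brother and other students".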
<div>
<b><br /></b>
<b>So what have we been doing?</b><br />
We developed a new resource-based method for recognizing lexical inference <sup><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#ref6" name="top-ref6">[6]</a></sup>. We weren't going to compromise on precision, but we still wanted to improve upon the coverage of prior methods. In particular, we found that prior methods are incapable of recognizing inferences that contain recent terminology (e.g. <i>social networks</i>) and named-entities (called <a href="https://en.wikipedia.org/wiki/Proper_name_(philosophy)">proper-names</a>, e.g. <i>Lady Gaga</i>). This simply happens because prior methods are based on WordNet, and these terms are absent from WordNet; WordNet is an "ontology of the English language", so by definition it's not supposed to contain world-knowledge about named entities. Also, it hasn't been updated in years, so it doesn't cover recent terminology.<br />
<br />
We used other structured knowledge resources that contain exactly this kind of information, are much larger than WordNet and are frequently updated. These resources contain information such as (<i>Lady Gaga</i>, occupation, <i>singer</i>) and (<i>singer</i>, subclass, <i>person</i>), that can indicate that <span style="text-align: center;"><i>Lady Gaga </i></span><span style="text-align: center;">→ <i>singer </i>and </span><span style="text-align: center;"><i>Lady Gaga </i></span><span style="text-align: center;">→ <i>person</i>. However, they may also contain information such as </span>(<i>Lady Gaga</i>, producer, <i>Giorgio Moroder</i>), but that does not indicate that <span style="text-align: center;"><i>Lady Gaga </i></span><span style="text-align: center;">→ </span><i>Giorgio Moroder</i>. As in WordNet, we needed to define which relations in the resource are relevant for lexical inference. For instance, the occupation relation is relevant, because a person infers their occupation (<span style="text-align: center;"><i>Lady Gaga </i></span><span style="text-align: center;">→ <i>singer</i>, </span><span style="text-align: center;"><i>Barack Obama </i></span><span style="text-align: center;">→ <i>president</i></span><span style="text-align: center;">).</span><br />
<br />
As opposed to WordNet-based methods, which only need to select relevant relations out of the few relations WordNet defines, it would be excruciating to do the same for the resources we used: they contain thousands of relations. So we developed a method that automatically recognizes which resource relations are indicative of lexical inference. Then, if it finds that <i>x</i> and <i>y</i> are connected to each other via a path containing only relevant relations, it predicts that <i style="text-align: center;">x </i><span style="text-align: center;">→ <i>y</i>. So in our previous example, since </span>occupation and subclass were found indicative of lexical inference, then <span style="text-align: center;"><i>Lady Gaga </i></span><span style="text-align: center;">→ <i>person. </i></span><br />
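In simplified form, once the indicative relations are known, checking whether <i>x</i> → <i>y</i> becomes a graph search over the resource. A minimal sketch with a hypothetical three-triple resource and a hand-picked relation whitelist (the real resources and the learning step in the paper are, of course, richer):

```python
from collections import deque

# Toy knowledge resource: (term, relation, term) triples, made up for illustration.
triples = [
    ("Lady Gaga", "occupation", "singer"),
    ("singer", "subclass", "person"),
    ("Lady Gaga", "producer", "Giorgio Moroder"),
]

# Relations assumed (in this sketch) to be indicative of lexical inference;
# in the actual method these are recognized automatically.
relevant = {"occupation", "subclass"}

def infers(x, y):
    # BFS from x following only relevant relations;
    # x -> y holds if y is reachable through such a path.
    frontier, seen = deque([x]), {x}
    while frontier:
        node = frontier.popleft()
        if node == y:
            return True
        for s, rel, o in triples:
            if s == node and rel in relevant and o not in seen:
                seen.add(o)
                frontier.append(o)
    return False

print(infers("Lady Gaga", "person"))           # True, via occupation + subclass
print(infers("Lady Gaga", "Giorgio Moroder"))  # False: producer is not relevant
```

The producer edge exists in the resource but is never followed, which is exactly why relation selection matters: the path, not mere connectivity, licenses the inference.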
<span style="text-align: center;"><br /></span>
<span style="text-align: center;">Similarly to the example, we've made successful inferences, and in particular inferences containing proper-names that were not captured by previous methods. We also maintained a very high precision. </span><span style="text-align: center;">This is basically the simplified version of our paper.</span><br />
<br /></div>
<div>
<b>So, is it perfect now?</b></div>
<div>
Well... not exactly. First of all, our coverage is still lower than that of the corpus-based methods (but with higher precision, usually). Second, there are still some open issues left. I'll give one of them as an example, as this post is already very long (and I challenge you to tl;dr it).<br />
<br />
Answer the following question:</div>
</div>
<div dir="ltr" style="text-align: left;">
<div class="" style="clear: both; text-align: center;">
<b>apple __ fruit?</b></div>
</div>
<div dir="ltr" style="text-align: center;">
(a) →<br />
(b) ↛<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div style="text-align: left;">
Well, I know this seems like a trivial question, but the answer is - it depends! <br />
Are we talking about <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjegW5MgHa9w5cnyZjlQLTYCDkikNg81jeoOmMsyLUf6ktuDInFaIdKvWTKKqDW9eaBznoDf3M0GY1XXgnZ-UhcW5Xh7K3CbvYnQdyX3BgD7_9lEgKcLUEVh0gFt2IT4jJT7Y3_hLSOEWw/s1600/apple.png" imageanchor="1" style="clear: left; display: inline !important; margin-bottom: 1em; margin-right: 1em; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjegW5MgHa9w5cnyZjlQLTYCDkikNg81jeoOmMsyLUf6ktuDInFaIdKvWTKKqDW9eaBznoDf3M0GY1XXgnZ-UhcW5Xh7K3CbvYnQdyX3BgD7_9lEgKcLUEVh0gFt2IT4jJT7Y3_hLSOEWw/s1600/apple.png" /></a> or about<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwiCzfmTTK06R6xDUJJkvl5QbefwFYW321pnCgpN7ciqIkYe4nxBVGj-amgtw6nQNJOcN_RbMv9gBYoWO4K4-lQtSnoQHmfzpgKLNdOHfiK8KvhPfUcWmQDe3xdGr7D-ma1GmULG2peXM/s1600/apple-logo.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwiCzfmTTK06R6xDUJJkvl5QbefwFYW321pnCgpN7ciqIkYe4nxBVGj-amgtw6nQNJOcN_RbMv9gBYoWO4K4-lQtSnoQHmfzpgKLNdOHfiK8KvhPfUcWmQDe3xdGr7D-ma1GmULG2peXM/s1600/apple-logo.png" /></a>?<br />
The problem in determining whether <i><span style="text-align: center;">apple </span><span style="text-align: center;">→ </span></i><span style="text-align: center;"><i>fruit</i>, is that the word <i>apple </i>has two senses (meanings). In one of its senses, </span><i><span style="text-align: center;">apple </span><span style="text-align: center;">→ </span></i><span style="text-align: center;"><i>fruit</i>, and in the other, </span><i><span style="text-align: center;">apple </span></i><span style="text-align: center;">↛ </span><span style="text-align: center;"><i>fruit</i>. </span><span style="text-align: center;">In order to decide correctly, we need to know which of the senses of </span><i style="text-align: center;">apple </i><span style="text-align: center;">is the one we are being asked about. </span><br />
<span style="text-align: center;"><br /></span>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEhbWPRaaBj0wflOxmToRrD_VD2hXKYj1ZZk4cvTl4mJM3qNvloNWFR1z8UTa2UgpYzbnIaGvYoU30fPCoH2JvluBBXWD_v7u0ZEi8Djl2ky00rKBIfIiv2KGNUUIcbVNr3Of4W1llZi0/s1600/I+hate+apple.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="135" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEhbWPRaaBj0wflOxmToRrD_VD2hXKYj1ZZk4cvTl4mJM3qNvloNWFR1z8UTa2UgpYzbnIaGvYoU30fPCoH2JvluBBXWD_v7u0ZEi8Djl2ky00rKBIfIiv2KGNUUIcbVNr3Of4W1llZi0/s320/I+hate+apple.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Figure 3: I've just seen this on my Facebook feed after publishing the post and I had to add it :)</td></tr>
</tbody></table>
<span style="text-align: center;"><br /></span>
<span style="text-align: center;">As I mentioned before, recognizing lexical inference is usually a component in some NLP application. In such application, <i>x</i>, <i>y </i>or both <i>x</i> and <i>y</i> are part of a text, and the application asks "does <i>x</i> infer <i>y</i>?", "what can we infer from <i>x</i>?" or "what infers <i>y</i>?". If <i>x=apple</i>, and we would like to know whether it infers <i>y=fruit</i>, the solution (for humans) would be to look at the texts. </span><br />
<br />
Say we have the sentence <i>I ate a green apple for breakfast. </i>We can easily understand that the correct sense of <i>apple </i>in this sentence is fruit. How did we know that? We noticed words like <i>ate</i>,<i> breakfast </i>and <i>green</i> that are related to <i>apple</i> in the sense of fruit (and unrelated to Apple the company). There are already automatic methods that do that (with some success, of course). So one of the next challenges is to incorporate them and apply context-sensitive lexical inference. In this case, infer that <i>I ate a fruit </i>and not that <i>I ate a company</i>. I promise to update in case I have any progress with that.</div>
</div>
<div dir="ltr" style="text-align: center;">
<br /></div>
</div>
<hr />
<div style="direction: ltr; text-align: left;">
<b>References:</b><br />
<div style="text-align: left;">
<span style="font-size: x-small;"><span style="color: black;"><a href="https://www.blogger.com/null" name="ref1">[1]</a> </span>Hearst, Marti A. "Automatic acquisition of hyponyms from large text corpora." Proceedings of the 14th conference on Computational linguistics-Volume 2. Association for Computational Linguistics, 1992. <a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#top-ref1"><sup>↩</sup></a> </span><br />
<span style="font-size: x-small;"><span style="color: black;"><a href="https://www.blogger.com/null" name="ref2">[2]</a> </span>Snow, Rion, Daniel Jurafsky, and Andrew Y. Ng. "Learning syntactic patterns for automatic hypernym discovery." Advances in Neural Information Processing Systems 17. 2004. <sup><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#top-ref2">↩</a></sup></span><br />
<a href="https://www.blogger.com/null" name="ref3"><span style="font-size: x-small;">[3]</span></a><span style="font-size: x-small;"> </span><span style="font-size: x-small;">Harris, Zellig S. "Distributional structure." Word. 1954. </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#top-ref3" style="font-size: small;">↩</a></sup><br />
<span style="font-size: x-small;"><a href="https://www.blogger.com/null" name="ref4">[4]</a> Weeds, Julie, and David Weir. "A general framework for distributional similarity." Proceedings of the 2003 conference on Empirical methods in natural language processing. Association for Computational Linguistics, 2003. </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#top-ref4">↩</a></sup><br />
<a href="https://www.blogger.com/null" name="ref5"><span style="font-size: x-small;">[5]</span></a><span style="font-size: x-small;"> </span><span style="font-size: x-small;">Kotlerman, Lili, et al. "Directional distributional similarity for lexical inference." Natural Language Engineering 16.04: 359-389. 2010. </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#top-ref5" style="font-size: small;">↩</a></sup><br />
<a href="https://www.blogger.com/null" name="ref6"><span style="font-size: x-small;">[6]</span></a><span style="font-size: x-small;"> Shwartz, Vered, Omer Levy, Ido Dagan, and Jacob Goldberger. "Learning to Exploit Structured Resources for Lexical Inference." Proceedings of the Nineteenth Conference on Computational Natural Language Learning. Association </span><span style="font-size: x-small;">for Computational Linguistics. 2015. </span><sup style="font-size: small;"><a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#top-ref6">↩</a></sup></div>
</div>
<br />
<hr />
<div style="direction: ltr; text-align: left;">
<span class="Apple-style-span" style="font-size: x-small;">
<a href="https://www.blogger.com/null" name="1"><b><span style="color: black;">1</span> </b></a>These relations actually have less friendly names: holonym/meronym and hyponym/hypernym. <a href="http://veredshwartz.blogspot.co.il/2015/07/lexical-inference.html#top1"><sup>↩</sup></a>
</span></div>
</div>
Vered Shwartzhttp://www.blogger.com/profile/17531957962535846245noreply@blogger.com3