Probably Approximately a Scientific Blog: Natural Language Processing

Saturday, July 4, 2015

Natural Language Processing

I'm afraid I'm pretty lousy at explaining to people what I do. I think my parents memorized the key words "Natural Language Processing" so that they can tell their friends about my occupation. Another relative of mine is under the illusion that my current research is about to replace Google search, just as soon as I'm done (I swear I never told her anything like that!). When I try to simplify it, I sometimes tell people that it is a subfield of Artificial Intelligence. Then again, I think that makes some people imagine me talking with a robot as my everyday routine.

In this post I would like to tell you a little bit about what Natural Language Processing is and why I find it such an interesting field of research. In the following post, I will elaborate on what I actually (try to) do in this field.

Natural Language Processing (NLP, not to be confused with the other NLP) is mainly about bridging the gap between how humans communicate (with natural languages such as English) and what computers understand (machine language). Once this task is fully solved, you will be able to communicate with your computer (or your tablet, cell phone, smart refrigerator and car) just as you do with another human being.

"Computers are incredibly fast, accurate and stupid; humans are incredibly slow, inaccurate and brilliant; together they are powerful beyond imagination." (Albert Einstein)¹

Computers are basically completely stupid, just as the quote above points out. When you are engaged in a conversation with a person, each of you understands the meaning of what the other is saying. Computers basically understand only machine language, and are programmed to understand very specific instructions on top of this language. Human language is much more complex than that. You can say one thing in multiple ways, for example "where is the nearest sushi restaurant?" and "can you please give me addresses of sushi places nearby?" -- this is called language variability. Sometimes you say something that can have several meanings, like "time flies like an arrow" -- this is called language ambiguity. A human being usually understands the correct meaning in the context of the conversation. A computer... doesn't really.

However, human knowledge is limited, while today, in the big data era, the computer has access to almost unlimited knowledge. So what if we taught computers to understand us? We could have the answers to all the questions in the universe!

Of course, some of these applications already exist. If you have an Android phone, you can say "Ok Google" and then ask a question that Google (with some success) will answer. The same goes for Siri on Apple devices. However, this is ongoing research, and none of these applications is perfect yet.

In addition to human language understanding, this field is also concerned with teaching computers to generate human language, so that they can fool you into thinking you are talking with an actual human being. I'm sure you have encountered virtual assistants:

I just want to talk. Is that a problem?

Such applications require both the understanding and generation of human language. I'm sure that with this example I've now convinced you that there's still plenty of work in this field. It is quite fun to challenge these virtual agents with complex language and topics they weren't trained to answer. I recommend it as a game :)

So how can NLP help us in our everyday life? In many ways. Here is a small subset of NLP tasks you may have encountered in applications:

Speech to text / text to speech - translation of spoken words into text and vice versa. These are the first and last steps of applications in which you speak with a device; the internal processing is done over written text. NLP is actually a very small part of this task, which is related to electrical engineering, machine learning and other fields.
Machine translation - in two words: Google Translate.
Language model - determines how likely a certain sentence is in the language. For instance, the sentence "I'm reading this post now" is more likely to be said than "This post now reading I'm", even though both sentences contain only correct English words; and the sentence "I called my mother on the" is more likely to end with "phone" than with "banana". It is used in many applications, for example, the auto-suggest in your phone. Though it sometimes has funny suggestions, it can be very helpful.

Mmm... what was that offer again?

Automatic summarization - you know you don't have the patience to read long news / entertainment articles on the web, reviews of restaurants on TripAdvisor, not to mention any texts you need to read for work or school. This application takes long texts and provides you with a concise version of them.
Information Retrieval - supports search engines and improves search results by understanding what the user really means in their query. For instance, you may have noticed the special search results you get on Google when searching for things such as time, weather, and flight details:

If you ever wondered how Google is smart enough to understand you, I think you may have some of the answer now.
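To make the language model item from the list above a bit more concrete, here is a minimal sketch of a bigram language model with add-one smoothing. It is only an illustration under my own assumptions: the tiny corpus is made up, the function names (train_bigram_model, log_probability) are hypothetical, and real systems such as the auto-suggest in your phone are trained on vastly more text and use far more sophisticated models.

```python
from collections import Counter
import math

# Tiny made-up corpus, for illustration only (an assumption, not real data).
CORPUS = [
    "i am reading this post now",
    "i am reading a book now",
    "i called my mother on the phone",
    "i called my friend on the phone",
    "she ate a banana",
]


def train_bigram_model(sentences):
    """Count single words and adjacent word pairs in the training sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        # <s> and </s> mark the start and end of a sentence.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_probability(sentence, unigrams, bigrams):
    """Log-probability of a sentence under the bigram model (add-one smoothing)."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen word pairs from getting probability zero.
        numerator = bigrams[(prev, word)] + 1
        denominator = unigrams[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


unigrams, bigrams = train_bigram_model(CORPUS)

# Same words, different order: the natural order should score higher.
natural = log_probability("i am reading this post now", unigrams, bigrams)
scrambled = log_probability("this post now reading i am", unigrams, bigrams)

# Same prefix, different last word: "phone" should score higher than "banana".
phone = log_probability("i called my mother on the phone", unigrams, bigrams)
banana = log_probability("i called my mother on the banana", unigrams, bigrams)

print(natural > scrambled, phone > banana)
```

The model only looks at pairs of neighboring words, so it knows nothing about meaning. It simply prefers word sequences that resemble what it has seen before, which is also why auto-suggest sometimes comes up with funny suggestions.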

And here is a cool glimpse into the future (some of it is already implemented, though definitely not common): when computers can generate human language, your refrigerator can tell you "hey, you're running out of milk - I added it to your grocery list". That would also require some help from other fields such as computer vision (to scan the bar codes of the milk and other products inside the fridge). I think it's a cool example, though.

So now you see that you've actually encountered applications of NLP many times before; you just couldn't name them. I hope I managed to excite you about NLP, and hopefully I will also succeed with other topics in the next posts.

Small survey question: when you search for something on Google (or any other search engine of your preference), is your query:

(1) a full question, such as "What is the height of Mount Everest?"

(2) composed of key words, such as "height Everest"

The results will be published when there are enough readers to draw a meaningful statistical conclusion (probably never).

¹ 05/07/2015: Thanks to Yuval, who doubted the authenticity of this quote. It turns out that it probably wasn't Einstein who said it, though it is not clear who did.

17 comments:

shwartz, July 4, 2015 at 11:39 PM
My answer to the survey: I'm really not sure. I think that it is composed of keywords most of the time and sometimes I do just write a question and count on Google to ignore all the unimportant words. If you really need me to choose one then I guess the 2nd option is the right one for me.
יובל, July 5, 2015 at 12:16 AM
Thanks Vered! I may refer people to this post when trying to explain NLP :-)

The pipeline of natural language human-machine interface is quite complex. Each phase deserves its own field of study, and in fact some of the phases (text-to-speech, for instance) are extremely hard.

Regarding the survey, it depends on whether I'm typing or using voice search. When typing, I'm lazy and I know exact keywords will probably get me better results. When speaking, part of the request is also a test of the speech-to-text abilities of the search engine, so I try to give it a full sentence. But mostly it's just keywords.

Oh, and about our friend Albert up there... as the famous quote goes:
"The problem with quotes on the internet is that you can never be sure they are authentic" - Karl Marx
;-)
shiri, July 5, 2015 at 7:08 AM
Who's the relative who thinks you're going to replace google? :-)
And regarding the Automatic summarization - so you will make a revolution in the world of tl;dr...
p.s. my answer to the survey is the second (key words).
shwartz, July 5, 2015 at 9:39 AM
I never thought about it, but since the use of keywords in search queries is common, a standard language model will not work well: "what is the problem with quotes on the internet" is a probable English sentence, but "problem quotes internet" is not. That makes it even harder, because many more word sequences have a high probability. It might even be as if there's almost no language model at all (maybe just use bigrams or unigrams together somehow), since you can use any word combination you want. I won't be surprised if Google takes this into consideration when the voice search feature is used or for the suggestions (maybe their corpus is 'search queries'), but this might not work well with other providers of speech-to-text and auto-suggestions.

Things that make you go "hmm" (not the Hidden Markov Model 'hmm').
Ilkim, April 8, 2017 at 5:49 AM
Answer to the survey: I usually use keywords for my Google search. Great post by the way, you explained it all in a simple way that I could easily understand. Thanks!
Xyore, September 9, 2017 at 10:44 PM
Amazing and interesting post
Unknown, October 3, 2017 at 9:15 AM
Thanks a lot for such a nice introduction to nlp. Hope will get more from u in future.