Probably Approximately a Scientific Blog: April 2018

You must have heard of, or have suspected first-handedly, the famous conspiracy theory that the Facebook app listens to your phone's microphone in order to better target ads that match your current interests. I've had the funniest experience with that myself: a friend in the cosmetics business has told me about this conspiracy, and in the same conversation she mentioned that an advertising agent has called her to offer advertising her business. Later that day, I got a Facebook ad "advertise your cosmetics business". What the heck? What are the odds of that? And I don't even have a Facebook app installed, just the Facebook messenger.

Although Mark Zuckerberg denied this conspiracy theory in his senate hearing, I doubt that people would stop believing it whenever the ads algorithm surprises them. Choosing to believe Zuckerberg that they don't listen to our microphones (yet, I suspect), I'm pretty confident that they, as well as other companies, are using our written content (emails, social media posts, search queries).

Most people are alarmed by these suspicions from the privacy aspect: what data does this company hold about me? how do they use it? who do they share it with? This post will not be about that. Instead, this post will be about the technical aspect, which is what interests me most as an NLP researcher. If we assume that our apps constantly listen to us and that our written content is monitored and analyzed, what does it say about the text understanding capabilities of these companies?

Oh, and expect no answers. This post is all about questions and conspiracy theories!

What is personalized content?
Personalized content doesn't have to come in the form of an ad. It can take the form of recommendations (products to buy based on previous purchases, songs to listen to, as in this post). It can be relevant professional content from LinkedIn, discounts on services you've previously consumed, cheap flights to your planned destinations, and so on. Some of this will be a direct result of the preferences and settings you defined in the website. For example, I've registered in several websites to get updates on concerts of my favorite bands, and I get healthy vegetarian recipes from Yummly. Some of this content will be based on inferences that the system makes, assuming that certain content is relevant for you. Here is one example:

Lately I've been getting @quora digest emails on topics related to conversations I had with people (in Hebrew!). 1/5
— Vered Shwartz (@VeredShwartz) October 17, 2017

In that case I was amazed by the accuracy of the Quora digest emails I was getting. Specifically, I had a conversation with my husband about the confidence it takes to admit you don't know something, and he mentioned he likes to say something more helpful than "I don't know" to someone who needs help. The next day, I got a personally-tailored Quora digest email that contained an answer to the question "Could you say something nice instead of 'I don't know'?". It wasn't under any of the topics that I follow (computer science related topics and parakeets).

In what follows I will exemplify most of my points using ads.

What we think these algorithms do
OK, so in my case, I have to try to put my knowledge about the limitations of this technology and my skepticism aside for a second and think like the average person. In that case, I think that:

If the ad is about a topic that I discussed in a spoken conversation, then there must be a recorder, and a speech-to-text component that converts the speech into written text.
Which language did I speak or have written when this happened? In case this happened for more than one language, it's possible that the company has different algorithms (or at least different trained models of the same algorithm) for each language.
Written content and transcribed speech are processed to match with the available content/ads.
In some cases, it seems that even simple keyword matching leads to nice results. E.g., if you mentioned a vacation in Thailand you will be matched with ads containing the words vacation and Thailand (I will let you know if I get any such ads after writing this post...). It takes no text understanding capabilities to do so, it only requires recognizing that a bunch of words said in the same sentence (or in a short period of time) also appear in some ad. If you insist, it may work with information retrieval (IR) algorithms to recognize the most important words.
In other cases, it seems that a deeper understanding of the meaning of my queries and conversations is required in order to match it to the relevant content. A good example is the Quora digest example from above. Based on IR algorithms, searching for common words like I, don't, know, helpful, nice, say, something will not get you as far as searching for more rare content words like vacation and Thailand. So it must be that the algorithm has built some meaning representation to our conversation, and compared it with the one of that Quora answer, which was phrased with slightly different words. On top of everything, our conversation was in Hebrew, so it must have a universal multi-lingual meaning representation mechanism.

Alternative explanations
Skepticism returns; I can believe that my speech is recorded and transcribed fairly accurately to text when I speak English. It's a bit harder to believe when it happens in other languages (e.g. Hebrew in my case), but I can still find it somewhat reasonable; Automatic speech recognition (ASR), although isn't perfect, still works reasonably well. It's the text understanding component I'm much, much more skeptical about. Despite the constant progress, and although popular media makes it seem like AI is solved and computers completely understand human language, I know it definitely isn't the case yet. So what other explanations can there be for the targeted content we see?

By Chance. None of this actually happens and we're just imagining. Well, OK, not none of this, but in some cases, it's really just chance.

One of the reasons that we're not easily convinced by this "by chance" argument is that we generally tend to pay attention only to the true-positive cases ("hits") in which we talked about something and immediately got an ad about it. It's much harder to notice the "misses": an ad that seems off (false positive) or all the things that we discussed and got no ads about (false negative).

In the end of the day, we're all just common people that share many common interests. Advertisers may reach us because they try to reach a large audience and we happen to fall under the very broad categories they target (e.g. age group). It could be that by chance we see ads exactly for the product or service we need now.

Other Means. Technically speaking, rather than understanding text, it's much easier to consider other parameters such as your location, your declared interests (i.e. pages you've liked on Facebook, search results you clicked on in Google), your location, your age, gender, marital status, and more. If you didn't provide one or more of these details, no worries! Your friends have, and it's likely you share some of these details with them!

Here is one good example:

I keep getting babies and pregnancy ads on Facebook. I'm a married woman in her 30s, both information items are available in my Facebook profile, and that alone is enough to assume this topic is relevant for me (personally, it is not, but the percent of women like me is too small to care about the error rate, and I totally accept that). Add to this that many of my Facebook friends are other people in my age who are members of parenting groups, have liked pages of baby-related stuff, etc. I can't ever make this stop, but I guess it will stop naturally when I'm in my late forties.

I'd like to finish with an anecdote about how non-sophisticated targeted content can sometimes be, to the point where you rub your eyes in disbelief and say "how stupid can these algorithms be?". A few days ago I've written to someone in an email "I'll be in Seattle on May 30". Minutes later, I get an email from Booking.com with the title "Vered, Seattle has some last-minute deals!". That would have been smart, unless I've already used Booking.com to book a hotel room in Seattle for exactly these dates.

I may be way off and it may be that these companies have killer AI abilities which are kept very well in secret. In that case, some of my readers who work for these companies must be giggling now. To paraphrase Joseph Heller (or whoever said it first), just because you're paranoid, doesn't mean they're not after you, but hey, there's no way their technology is good enough to do what you think it does, so some of it is just pure chance. Not as catchy as the original quote, I know.

Probably Approximately a Scientific Blog

Friday, April 13, 2018

Targeted Content