Probably Approximately a Scientific Blog: Crowdsourcing (for NLP)

Developing new methods to solve scientific tasks is cool, but they usually require data. We researchers often find ourselves collecting data rather than trying to solve new problems. I've collected data for most of my papers, but never thought of it as an interesting blog post topic. Recently, I attended Chris Biemann's excellent crowdsourcing course at ESSLLI 2016 (the 28th European Summer School in Logic, Language and Information), and was inspired to write about the topic. This blog post will be much less technical and much more high-level than the course, as my posts usually are. Nevertheless, credit for many interesting insights on the topic goes to Chris Biemann.¹

Who needs data anyway?
So let's start from the beginning: what is this data and why do we need it? Suppose that I'm working on automatic methods to recognize the semantic relation between words, e.g. I want my model to know that cat is a type of animal, and that wheel is a part of a car.

At the very basic level, if I already developed such a method, I will want to check how well it does compared to humans. Evaluation of my method requires annotated data, i.e. a set of word pairs and their corresponding true semantic relations, annotated by humans. This will be the "test set"; the human annotations are considered as "gold/true labels". My model will try to predict the semantic relation between each word-pair (without accessing the true labels). Then, I will use some evaluation metric (e.g. precision, recall, F1 or accuracy) to see how well my model predicted the human annotations. For instance, my model would have 80% accuracy if for 80% of the word-pairs it predicted the same relation as the human annotators.

Figure 1: an example of dataset entries for recognizing the semantic relation between words.

If that was the only data I needed, I would have been lucky. You don't need that many examples to test your method. Therefore, I could select some word-pairs (randomly or using some heuristics), and annotate them myself, or bribe my colleagues with cookies (as I successfully did twice). The problem starts when you need training data, i.e., when you want your model to learn to predict something based on labelled examples. That usually requires many more examples, and annotating data is a very tiring and Sisyphean work.

What should we do, then? Outsource the annotation process -- i.e., pay with real money, not cookies!

What is crowdsourcing?

The word crowdsourcing is a blend word composed of crowd (intelligence) + (out-)sourcing [1]. The idea is to take a task that can be performed by experts (e.g. translating a document from English to Spanish), and outsource it to a large crowd of non-experts (workers) that can perform it.

The requester defines the task, and the workers work on it. The requester than decides whether to accept/reject the work and pays the workers (in case of acceptance).

The benefits of using "regular" people rather than experts are:

You pay them much less than experts - typically a few cents per question (/task). (e.g., [2] found that in translation tasks, the crowd reached the same quality as the professionals, with less than 12% of the costs).
They are more easily available via crowdsourcing platforms (see below).
By letting multiple people work on the task rather than a single/few experts, the task could be completed in a shorter time.

The obvious observation is that the quality of a worker is not as good as the expert; in crowdsourcing, it is not a single worker that replaces the expert, but the crowd. Rather than trusting a single worker, you assign each task to a certain number of workers, and combine their results. A common practice is to use the majority voting. For instance, let's say that I ask 5 workers what is the semantic relation between cat and dog, giving them several options. 3 of them say that cat and dog are mutually exclusive words (e.g. one cannot be both a cat and a dog), one of them says that they are opposites, and one says that cat is a type of dog. The majority has voted in the favor of mutually exclusive, and this is what I will consider as the correct answer.²

The main crowdsourcing platforms (out of many others) are Amazon Mechanical Turk and CrowdFlower. In this blog post I will not discuss the technical details of these platforms. If you are interested in a comparison between the two, refer to these slides from the NAACL 2015 crowdsourcing tutorial.

Figure 2: An example of a question in Amazon Mechanical Turk, from my project.

What can be crowdsourced?

Not every data we need to collect can be collected via crowdsourcing; some data may require expert annotation, e.g. if we need to annotate the syntactic trees of sentences in natural language, that's probably a bad idea to ask non-experts to do so.

The rules of thumb for crowdsourcability are:

The task is easy to explain, and you as a requester indeed explain it simply. They key idea is to keep it simple. The instructions should be short, i.e. do not expect workers to read a 50 page manual. They don't get paid enough for that. The instructions should include examples.
People can easily agree on the "correct" answer, e.g. "is there a cat in this image?" is good, "what is the meaning of life?" is really bad. Everything else is borderline :) One thing to consider is the possible number of correct answers. For instance, if the worker should reply with a sentence (e.g. "describe the following image"), they can do so in so many ways. Always aim one possible answer for a question.
Each question is relatively small.
Bonus: the task is fun. Workers will do better if they enjoy the task. If you can think of a way to gamify your task, do so!

Figure 3: Is there a cat in this image?

Some tasks are borderline and may become suitable for crowdsourcing if presented in the right way to the workers. If the task at hand seems too complicated to be crowdsourced, ask yourself: can I break it into smaller tasks that can each be crowdsourced? For example, let workers write a sentence that describes an image, and accept all answers; then let other workers validate the sentences (ask them: does this sentence really describe this image?).

Some examples for (mostly language-related) data collected with crowdsourcing
(references omitted, but are available in the course slides in the link above).

Checking whether a sentence is grammatical or not.
Alignment of dictionary definitions - for instance, if a word has multiple meanings, and hence has multiple definitions in each dictionary - the task was to align the definitions corresponding to the same meaning in different dictionaries.
Translation.
Paraphrase collection - get multiple sentences with the same meaning. These were obtained by asking multiple workers to describe the same short video.
Duolingo started as a crowdsourcing project!
And so did reCAPTCHA!

How to control for the quality of data?

OK, so we collected a lot of data. How do we even know if it's good? Can I trust my workers to do well on the task? Could they be as good as experts? And what if they just want my money and are cheating on the task just to get easy money?

There are many ways to control for the quality of workers:

The crowdsourcing platforms provide some information about the workers, such as the number of tasks they completed in the past, their approval rate (% of their tasks that were approved), location, etc. You can define your requirements from the workers based on this information.
Don't trust a single worker -- define that your task should be answered by a certain number of workers (typically 5) and aggregate their answers (e.g. by majority voting).
Create control questions - a few questions for which you know the correct answer. These questions are displayed to the worker just like any other questions. If a worker fails to answer too many control questions, the worker is either not good or trying to cheat you. Don't use this worker's answers (and don't let the worker participate in the task anymore; either by rejecting their work or by blocking them).³
Create a qualification test - a few questions for which you know the correct answer. You can require that any worker who wants to work on your task must take the test and pass it. As opposed to the control questions, the test questions don't have to be identical in format to the task itself, but should predict the worker's ability to perform the task well.
Second-pass reviewing - create another task in which workers validate previous workers' answers.
Bonus the good workers - they will want to keep working for you.
Watch out for spammers! Some workers are only after your money, and they don't take your task seriously, e.g. they will click on the same answer for all questions. There is no correlation between the number of questions workers answer and their quality, however, it is worth looking at the most productive workers: some of them may be very good (and you might want to give them bonuses), while some of them may be spammers.

Ethical issues in crowdsourcing

As a requester, you need to make sure you treat your workers properly. Always remember that workers are first of all people. When you consider how much to pay or whether to reject a worker's work, think of the following:

Many workers rely on crowdsourcing as their main income.
They have no job security.
Rejection in some cases is unfair - even if the worker was bad in the task, they still spent time working (unless you are sure that they are cheating).
New workers do lower-paid work to build up their reputation, but underpaying is not fair and not ethical.
Are you sure you explained the task well? Maybe it is your fault if all the workers performed badly?

The good news is that, from my little experience, paying well pays off for the requester too. If you pay enough (but not too much!), you get good workers that want to do the task well. When you underpay, the good workers don't want to work on your task - they can get better paying tasks. The time to complete the task will be longer. And if you are like me, the thought of underpaying your workers will keep you awake at night. So pay well :)⁴

Important take-outs for successful crowdsourcing:

Work in small batches. If you have 10,000 questions, don't publish all at once. Try some, learn from your mistakes, correct them and publish another batch. Mistakes are bound to happen, and they might cost you good money!
Use worker errors to improve instructions (remember: it might be your fault).
KEEP. IT. SIMPLE.
Use quality control mechanisms.
Don't underpay!
Always expect workers to be sloppy. Repeat guidelines and questions and don't expect workers to remember them.
If your questions are automatically generated, use random order and try to balance the number of questions with each expected answer, otherwise workers will exploit this bias (e.g. if most word-pairs are unrelated, they will mark all of them as unrelated without looking twice).
Make workers' lives easier, and they will perform better. For instance, if you have multiple questions regarding the same word, group them together.
If you find a way to make your task more fun, do so!

References
[1] Howe, Jeff. The rise of crowdsourcing. Wired magazine 14.6 (2006).
[2] Omar F. Zaidan and Chris Callison-Burch Crowdsourcing translation: professional quality from non-professionals. In ACL 2011.

1 And I would also like to mention another wonderful crowdsourcing tutorial that I attended last year at NAACL 2015, which was given by Chris Callison-Burch, Lyle Ungar, and Ellie Pavlick. Unfortunately, at that time I had no personal experience with crowdsourcing, nor believed that my university will ever have budget for that, therefore made no effort to remember the technical details; I was completely wrong. A year later I published a paper about a dataset collected with crowdsourcing, on which I even got a best paper award :) ^↩
2 For more sophisticated aggregation methods that assign weights to workers based on their quality, see MACE. ^↩

3 Blocking a worker means that they can't work on your tasks anymore. Rejecting a worker means that they are not paid for the work they have already done. As far as I know, it is not recommended to reject a worker, because then they write bad things about you in Turker Nation and nobody wants to work for you anymore. In addition, you should always give workers the benefit of the doubt; maybe you didn't explain the task well enough.^↩

4 So how much should you pay? First of all, not less than 2 cents. Second, try to estimate how long a single question takes and aim an hourly pay of around 6 USDs. For example, in this paper I paid 5 cents per question, which I've been told is the higher bound for such tasks.^↩

Probably Approximately a Scientific Blog

Sunday, August 28, 2016

Crowdsourcing (for NLP)

2 comments: