Developing new methods to solve scientific tasks is cool, but these methods usually require data. We researchers often find ourselves collecting data rather than trying to solve new problems. I've collected data for most of my papers, but never thought of it as an interesting blog post topic. Recently, I attended Chris Biemann's excellent crowdsourcing course at ESSLLI 2016 (the 28th European Summer School in Logic, Language and Information), and was inspired to write about the topic. This blog post will be much less technical and much more high-level than the course, as my posts usually are. Nevertheless, credit for many interesting insights on the topic goes to Chris Biemann.1
Who needs data anyway?
So let's start from the beginning: what is this data and why do we need it? Suppose that I'm working on automatic methods to recognize the semantic relation between words, e.g. I want my model to know that cat is a type of animal, and that wheel is a part of a car.
At the most basic level, if I have already developed such a method, I will want to check how well it does compared to humans. Evaluating my method requires annotated data, i.e. a set of word pairs and their corresponding true semantic relations, annotated by humans. This will be the "test set"; the human annotations are considered the "gold/true labels". My model will try to predict the semantic relation between each word-pair (without accessing the true labels). Then, I will use some evaluation metric (e.g. precision, recall, F1 or accuracy) to see how well my model predicted the human annotations. For instance, my model would have 80% accuracy if for 80% of the word-pairs it predicted the same relation as the human annotators.
Figure 1: An example of dataset entries for recognizing the semantic relation between words.
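To make the evaluation step concrete, here is a minimal sketch in Python; the word-pairs, gold labels and model predictions are made up for illustration. Accuracy is simply the fraction of word-pairs on which the model's prediction matches the gold label:

```python
# A minimal sketch of accuracy-based evaluation; the word-pairs, gold labels
# and predictions below are made up for illustration.
gold = {
    ("cat", "animal"): "hypernymy",   # cat is a type of animal
    ("wheel", "car"): "meronymy",     # wheel is a part of a car
    ("cat", "dog"): "co-hyponymy",
    ("buy", "sell"): "antonymy",
    ("sofa", "couch"): "synonymy",
}

predicted = {
    ("cat", "animal"): "hypernymy",
    ("wheel", "car"): "meronymy",
    ("cat", "dog"): "hypernymy",      # a model error
    ("buy", "sell"): "antonymy",
    ("sofa", "couch"): "synonymy",
}

correct = sum(predicted[pair] == label for pair, label in gold.items())
accuracy = correct / len(gold)
print(f"Accuracy: {accuracy:.0%}")    # 4 out of 5 correct -> 80%
```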
What should we do, then? Outsource the annotation process -- i.e., pay with real money, not cookies!
What is crowdsourcing?
The word crowdsourcing is a blend of crowd (intelligence) and (out)sourcing [1]. The idea is to take a task that can be performed by experts (e.g. translating a document from English to Spanish) and outsource it to a large crowd of non-experts (workers) who can perform it.
The requester defines the task, and the workers work on it. The requester then decides whether to accept or reject the work, and pays the workers if it is accepted.
The benefits of using "regular" people rather than experts are:
- You pay them much less than experts - typically a few cents per question (or task). For example, [2] found that in translation tasks, the crowd reached the same quality as professionals at less than 12% of the cost.
- They are more easily available via crowdsourcing platforms (see below).
- By letting multiple people work on the task rather than a single expert or a few experts, the task can be completed in a shorter time.
The main crowdsourcing platforms (among many others) are Amazon Mechanical Turk and CrowdFlower. In this blog post I will not discuss the technical details of these platforms. If you are interested in a comparison between the two, refer to these slides from the NAACL 2015 crowdsourcing tutorial.
Figure 2: An example of a question in Amazon Mechanical Turk, from my project.
What can be crowdsourced?
Not all the data we need can be collected via crowdsourcing; some data may require expert annotation. For example, if we need to annotate the syntactic trees of sentences in natural language, it is probably a bad idea to ask non-experts to do so.
The rules of thumb for crowdsourcability are:
- The task is easy to explain, and you as a requester indeed explain it simply. The key idea is to keep it simple. The instructions should be short, i.e. do not expect workers to read a 50-page manual. They don't get paid enough for that. The instructions should include examples.
- People can easily agree on the "correct" answer, e.g. "is there a cat in this image?" is good, "what is the meaning of life?" is really bad. Everything else is borderline :) One thing to consider is the possible number of correct answers. For instance, if the worker should reply with a sentence (e.g. "describe the following image"), they can do so in many different ways. Always aim for one possible answer per question.
- Each question is relatively small.
- Bonus: the task is fun. Workers will do better if they enjoy the task. If you can think of a way to gamify your task, do so!
Figure 3: Is there a cat in this image?
Some tasks are borderline and may become suitable for crowdsourcing if presented in the right way to the workers. If the task at hand seems too complicated to be crowdsourced, ask yourself: can I break it into smaller tasks that can each be crowdsourced? For example, let workers write a sentence that describes an image, and accept all answers; then let other workers validate the sentences (ask them: does this sentence really describe this image?).
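As a rough sketch of how such a two-step design might look (the image URLs and collected descriptions below are invented), the first task collects free-form descriptions and the second turns each of them into an easy yes/no question:

```python
# A minimal sketch (with invented URLs and answers) of splitting a hard task
# into two crowdsourceable steps: (1) generation - workers describe an image;
# (2) validation - other workers judge whether a description fits the image.

images = ["http://example.com/img1.jpg", "http://example.com/img2.jpg"]

# Step 1: one open-ended question per image; accept all answers.
generation_tasks = [
    {"image": url, "question": "Describe this image in one sentence."}
    for url in images
]

# ... publish generation_tasks and collect the workers' descriptions ...
collected = {
    "http://example.com/img1.jpg": ["A cat sleeping on a sofa.",
                                    "An orange cat curled up on a couch."],
}

# Step 2: one yes/no question per (image, description) pair - this step is
# easy to aggregate by majority vote and to guard with control questions.
validation_tasks = [
    {"image": url, "description": description,
     "question": "Does this sentence describe the image?"}
    for url, descriptions in collected.items()
    for description in descriptions
]
print(validation_tasks)
```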
Some examples of (mostly language-related) data collected with crowdsourcing
(references omitted, but available in the course slides linked above).
- Checking whether a sentence is grammatical or not.
- Alignment of dictionary definitions - for instance, a word with multiple meanings has multiple definitions in each dictionary, and the task was to align the definitions corresponding to the same meaning across different dictionaries.
- Translation.
- Paraphrase collection - get multiple sentences with the same meaning. These were obtained by asking multiple workers to describe the same short video.
- Duolingo started as a crowdsourcing project!
- And so did reCAPTCHA!
OK, so we collected a lot of data. How do we even know if it's good? Can I trust my workers to do well on the task? Could they be as good as experts? And what if they don't take the task seriously and are just cheating to get easy money?
There are many ways to control for the quality of workers:
- The crowdsourcing platforms provide some information about the workers, such as the number of tasks they completed in the past, their approval rate (% of their tasks that were approved), location, etc. You can define your requirements from the workers based on this information.
- Don't trust a single worker -- define that your task should be answered by a certain number of workers (typically 5) and aggregate their answers (e.g. by majority voting; see the sketch after this list).2
- Create control questions - a few questions for which you know the correct answer. These questions are displayed to the worker just like any other question. If a worker fails too many control questions, the worker is either not good at the task or trying to cheat you. Don't use this worker's answers (and don't let the worker participate in the task anymore, either by rejecting their work or by blocking them).3
- Create a qualification test - a few questions for which you know the correct answer. You can require that any worker who wants to work on your task must take the test and pass it. As opposed to the control questions, the test questions don't have to be identical in format to the task itself, but should predict the worker's ability to perform the task well.
- Second-pass reviewing - create another task in which workers validate previous workers' answers.
- Give bonuses to good workers - they will want to keep working for you.
- Watch out for spammers! Some workers are only after your money, and they don't take your task seriously, e.g. they will click on the same answer for all questions. There is no correlation between the number of questions workers answer and their quality; however, it is worth looking at the most productive workers: some of them may be very good (and you might want to give them bonuses), while others may be spammers.
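To make the redundancy and control-question mechanisms above concrete, here is a minimal sketch in Python; the worker IDs, questions and answers are made up. Workers whose accuracy on the control questions is too low are discarded, and the remaining answers are aggregated by majority vote (footnote 2 points to MACE for a more principled weighting of workers by their estimated quality):

```python
from collections import Counter

# A minimal sketch (with made-up worker IDs and answers) of two of the
# quality-control mechanisms above: filter out workers who fail the control
# questions, then aggregate the remaining answers by majority vote.

# answers[question_id][worker_id] = that worker's answer
answers = {
    "q1": {"w1": "yes", "w2": "yes", "w3": "no", "w4": "yes", "w5": "yes"},
    "q2": {"w1": "no",  "w2": "no",  "w3": "no", "w4": "yes", "w5": "no"},
}

# Control questions with known correct answers, hidden among the others.
control = {"q2": "no"}

def control_accuracy(worker):
    """The fraction of control questions this worker answered correctly."""
    done = [q for q in control if worker in answers[q]]
    if not done:
        return 1.0  # no evidence against this worker yet
    return sum(answers[q][worker] == control[q] for q in done) / len(done)

workers = {w for ws in answers.values() for w in ws}
trusted = {w for w in workers if control_accuracy(w) >= 0.8}  # drops w4

# Majority vote over the trusted workers' answers for each question.
aggregated = {
    q: Counter(a for w, a in ws.items() if w in trusted).most_common(1)[0][0]
    for q, ws in answers.items()
}
print(aggregated)  # {'q1': 'yes', 'q2': 'no'}
```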
As a requester, you need to make sure you treat your workers properly. Always remember that workers are first of all people. When you consider how much to pay or whether to reject a worker's work, think of the following:
- Many workers rely on crowdsourcing as their main income.
- They have no job security.
- Rejection in some cases is unfair - even if the worker was bad at the task, they still spent time working (unless you are sure that they are cheating).
- New workers do lower-paid work to build up their reputation, but underpaying is not fair and not ethical.
- Are you sure you explained the task well? Maybe it is your fault if all the workers performed badly?
Important takeaways for successful crowdsourcing:
- Work in small batches. If you have 10,000 questions, don't publish all at once. Try some, learn from your mistakes, correct them and publish another batch. Mistakes are bound to happen, and they might cost you good money!
- Use worker errors to improve instructions (remember: it might be your fault).
- KEEP. IT. SIMPLE.
- Use quality control mechanisms.
- Don't underpay!4
- Always expect workers to be sloppy. Repeat guidelines and questions and don't expect workers to remember them.
- If your questions are automatically generated, use random order and try to balance the number of questions with each expected answer; otherwise, workers will exploit this bias (e.g. if most word-pairs are unrelated, they will mark all of them as unrelated without looking twice). See the sketch after this list.
- Make workers' lives easier, and they will perform better. For instance, if you have multiple questions regarding the same word, group them together.
- If you find a way to make your task more fun, do so!
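As a small illustration of the random-order and balancing advice above (the word-pairs and "expected" labels are made up), one way to prepare an automatically generated question list might be:

```python
import random

# A minimal sketch (with made-up data) of preparing automatically generated
# questions: balance the number of questions per expected answer, then
# shuffle, so workers cannot exploit a skewed answer distribution.
candidates = [
    {"pair": ("cat", "animal"),    "expected": "related"},
    {"pair": ("wheel", "car"),     "expected": "related"},
    {"pair": ("cat", "umbrella"),  "expected": "unrelated"},
    {"pair": ("sofa", "equation"), "expected": "unrelated"},
    {"pair": ("buy", "cloud"),     "expected": "unrelated"},
]

# Group the questions by their expected answer.
by_label = {}
for question in candidates:
    by_label.setdefault(question["expected"], []).append(question)

# Downsample every group to the size of the smallest one.
n = min(len(group) for group in by_label.values())
balanced = [q for group in by_label.values() for q in random.sample(group, n)]

# Present the questions in random order.
random.shuffle(balanced)
for question in balanced:
    print(question["pair"])
```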
References
[1] Jeff Howe. The rise of crowdsourcing. Wired Magazine 14.6 (2006).
[2] Omar F. Zaidan and Chris Callison-Burch. Crowdsourcing translation: professional quality from non-professionals. In ACL 2011.
1 And I would also like to mention another wonderful crowdsourcing tutorial that I attended last year at NAACL 2015, which was given by Chris Callison-Burch, Lyle Ungar, and Ellie Pavlick. Unfortunately, at that time I had no personal experience with crowdsourcing, nor did I believe that my university would ever have the budget for it, so I made no effort to remember the technical details; I was completely wrong. A year later I published a paper about a dataset collected with crowdsourcing, for which I even got a best paper award :) ↩
2 For more sophisticated aggregation methods that assign weights to workers based on their quality, see MACE. ↩
3 Blocking a worker means that they can't work on your tasks anymore. Rejecting a worker means that they are not paid for the work they have already done. As far as I know, it is not recommended to reject a worker, because then they write bad things about you in Turker Nation and nobody wants to work for you anymore. In addition, you should always give workers the benefit of the doubt; maybe you didn't explain the task well enough.↩
4 So how much should you pay? First of all, not less than 2 cents. Second, try to estimate how long a single question takes and aim for an hourly pay of around 6 USD (e.g., if a question takes about 30 seconds, that works out to roughly 5 cents per question). For example, in this paper I paid 5 cents per question, which I've been told is the upper bound for such tasks.↩