Natural Language Understanding Wiki

Verbosity (Von Ahn et al., 2006)[1] was one of the first attempts at gathering annotations with a game with a purpose (GWAP).

Snow et al. (2008)[2] described design and evaluation guidelines for five natural language micro-tasks. However, they explicitly chose a set of tasks that could be easily understood by non-expert contributors, thus leaving the recruitment and training issues open.

NLP Tasks

  • coreference resolution: Phrase Detectives (Chamberlain et al., 2008;[3] Chamberlain et al., 2009[4]), QuizBowl (Guha et al., 2015)[5]
  • textual entailment: Negri et al. (2011)[6] (multilingual)
  • semantic role labeling: Hong and Baker (2011)[7], Baker (2012)[8]

Annotation models

Most tasks are modeled in the same way as a classification problem in machine learning: annotators choose from a set of k categories that is the same across all questions.

Phrase Detectives (Chamberlain et al., 2008)[3] asks annotators to choose among three options: non-referring, discourse-new, and discourse-old. In the last case, they are asked to further specify the most recent mention belonging to the same entity. Paun et al. (2018)[9] attempt to model this setting in a pair-wise manner.

Aggregation methods

Majority vote
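
Majority vote can be illustrated with a minimal sketch, assuming each item comes with a plain list of labels from different annotators. The function name, the item ids, and the votes below are hypothetical; the label names follow the three Phrase Detectives options mentioned above.

```python
from collections import Counter


def majority_vote(labels_by_item):
    """Return the most frequent label for each item.

    Ties go to the label that was seen first in the item's list, because
    Counter.most_common orders equal counts by first occurrence.
    Items with no labels are skipped.
    """
    return {
        item: Counter(labels).most_common(1)[0][0]
        for item, labels in labels_by_item.items()
        if labels
    }


# Hypothetical votes for three mentions.
votes = {
    "m1": ["discourse-new", "discourse-new", "discourse-old"],
    "m2": ["non-referring", "discourse-old", "discourse-old"],
    "m3": ["discourse-old", "discourse-new"],  # tie
}
result = majority_vote(votes)
```

Majority vote treats every annotator as equally reliable, which is the assumption that probabilistic annotation models relax.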

From Paun et al. (2018)[9]: "Probabilistic models of annotation, in particular, make it possible to characterize the accuracy of the annotators and correct for their bias (Dawid and Skene, 1979; Passonneau and Carpenter, 2014), to account for item-level effects (e.g.: difficulty) (Whitehill et al., 2009), and to employ different pooling strategies (Carpenter, 2008)"
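
As a rough illustration of the kind of probabilistic model the quote refers to, the sketch below fits a simplified Dawid and Skene (1979) style model with EM: each annotator gets a confusion matrix, and each item gets a posterior over its true class. This is a teaching sketch under stated assumptions, not the model of Paun et al. (2018); the function name, the `smoothing` parameter, and the toy data are hypothetical, and plain probabilities are multiplied without log-space arithmetic, so it is only suitable for small inputs.

```python
def dawid_skene(annotations, classes, n_iter=50, smoothing=0.01):
    """Fit a simplified Dawid-Skene model with EM.

    annotations: list of (item, annotator, label) triples; every label
    must be one of `classes`.
    Returns (post, conf) where post[item][c] is P(true class = c) and
    conf[annotator][c][l] is P(annotator answers l | true class = c).
    """
    items = sorted({i for i, _, _ in annotations})
    annotators = sorted({a for _, a, _ in annotations})

    # Initialise posteriors with per-item vote proportions.
    post = {i: {c: 0.0 for c in classes} for i in items}
    for i, _, l in annotations:
        post[i][l] += 1.0
    for i in items:
        total = sum(post[i].values())
        for c in classes:
            post[i][c] /= total

    conf = {}
    for _ in range(n_iter):
        # M-step: class prior and smoothed per-annotator confusion matrices.
        prior = {c: sum(post[i][c] for i in items) / len(items) for c in classes}
        conf = {a: {c: {l: smoothing for l in classes} for c in classes}
                for a in annotators}
        for i, a, l in annotations:
            for c in classes:
                conf[a][c][l] += post[i][c]
        for a in annotators:
            for c in classes:
                row_total = sum(conf[a][c].values())
                for l in classes:
                    conf[a][c][l] /= row_total

        # E-step: posterior over the true class of each item.
        new_post = {i: {c: prior[c] for c in classes} for i in items}
        for i, a, l in annotations:
            for c in classes:
                new_post[i][c] *= conf[a][c][l]
        for i in items:
            total = sum(new_post[i].values())
            for c in classes:
                new_post[i][c] /= total
        post = new_post
    return post, conf


# Hypothetical toy data: annotator C disagrees with A and B on item 2.
toy = [
    (1, "A", "a"), (1, "B", "a"), (1, "C", "a"),
    (2, "A", "a"), (2, "B", "a"), (2, "C", "b"),
    (3, "A", "b"), (3, "B", "b"), (3, "C", "b"),
    (4, "A", "b"), (4, "B", "b"), (4, "C", "b"),
]
posteriors, confusion = dawid_skene(toy, classes=["a", "b"])
best = {i: max(p, key=p.get) for i, p in posteriors.items()}
```

The estimated confusion matrices are what let such a model characterize annotator accuracy and correct for bias; item-level difficulty and pooling across annotators, also mentioned in the quote, are not covered by this sketch.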


Amazon's Mechanical Turk

Usage: NAACL (2010)[10], Laws et al. (2011)[11]

Motivation of workers: Antin and Shaw (2012)[12] found that, although monetary reward is the most important motivation drawing workers to the website, more than half of the workers also come for "fun". They argue that the results obtained by Ipeirotis (2010)[13] are distorted by social desirability bias. Litman et al. (2015)[14] argue that money is the most important motivation and that "data quality is directly affected by compensation rates for India-based participants".


  • Increase intrinsic motivation: from Paolacci and Chandler (2014)[15]: "Thanking workers and explaining to them the meaning of the task they will complete can stimulate better work (D. Chandler & Kapelner, 2013)[16], as does framing a task as requested by a nonprofit organization (Rogstadius et al., 2011)[17]."


Usage: He et al. (2016)[18]


  1. Luis Von Ahn, Mihir Kedia, and Manuel Blum. 2006. Verbosity: a game for collecting common-sense facts. In Proceedings of the SIGCHI conference on Human Factors in computing systems, pages 75–78. ACM.
  2. Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y Ng. 2008. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263. Association for Computational Linguistics.
  3. Jon Chamberlain, Massimo Poesio, and Udo Kruschwitz. 2008. Phrase detectives: A web-based collaborative annotation game. Proceedings of I-Semantics, Graz.
  4. Jon Chamberlain, Udo Kruschwitz, and Massimo Poesio. 2009. Constructing an anaphorically annotated corpus with non-experts: Assessing the quality of collaborative annotations. In Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources, pages 57–62. Association for Computational Linguistics.
  5. Anupam Guha, Mohit Iyyer, Danny Bouman, and Jordan Boyd-Graber. 2015. Removing the training wheels: A coreference dataset that entertains humans and challenges computers. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  6. Matteo Negri, Luisa Bentivogli, Yashar Mehdad, Danilo Giampiccolo, and Alessandro Marchetti. 2011. Divide and conquer: crowdsourcing the creation of cross-lingual textual entailment corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, pages 670–679, Stroudsburg, PA, USA. Association for Computational Linguistics.
  7. Jisup Hong and Collin F Baker. 2011. How good is the crowd at “real” WSD? ACL HLT 2011, page 30.
  8. Collin F Baker. 2012. FrameNet, current collaborations and future goals. Language Resources and Evaluation, pages 1–18.
  9. Paun, S., Chamberlain, J., Kruschwitz, U., Yu, J., & Poesio, M. (2018). A Probabilistic Annotation Model for Crowdsourcing Coreference. EMNLP 2018, 1926–1937.
  10. NAACL HLT. (2010). Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk.
  11. Laws, F., Scheible, C., & Schütze, H. (2011). Active Learning with Amazon Mechanical Turk. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1546–1556.
  12. Antin, J., & Shaw, A. (2012). Social desirability bias and self-reports of motivation. Proceedings of the 2012 ACM Annual Conference on Human Factors in Computing Systems - CHI’12, 2925.
  13. Panos Ipeirotis. 2010. New demographics of Mechanical Turk. http://behind-the-enemy-lines. new-demographics-of-mechanical-turk. html.
  14. Litman, L., Robinson, J., & Rosenzweig, C. (2015). The relationship between motivation, monetary compensation, and data quality among US- and India-based workers on Mechanical Turk. Behavior Research Methods, 47(2), 519–528.
  15. Paolacci, G., & Chandler, J. (2014). Inside the Turk: Understanding Mechanical Turk as a Participant Pool. Current Directions in Psychological Science, 23(3), 184–188.
  16. Chandler, D., & Kapelner, A. (2013). Breaking monotony with meaning: Motivation in crowdsourcing markets. Journal of Economic Behavior & Organization, 90, 123–133.
  17. Rogstadius, J., Kostakos, V., Kittur, A., Smus, B., Laredo, J., & Vukovic, M. (2011, July). An assessment of intrinsic and extrinsic motivation on task performance in crowdsourcing markets. Paper presented at the 5th International AAAI Conference on Weblogs and Social Media, Barcelona, Spain.
  18. NAACL, H. (2010). Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk.