TODO: from Zeldes (2017)[1]: "Corpora have grown progressively more complex and multifactorial, going beyond tagged, or even syntactically annotated treebanks to encompass multiple, simultaneous levels of analysis. For example, the Switchboard corpus (henceforth SWBD, Godfrey et al. 1992) and the Wall Street Journal corpus (WSJ, see Marcus et al. 1993, both American English) have been repeatedly annotated to add information. Examples include coreference analysis or named entities (e.g. for WSJ in OntoNotes, Hovy et al. 2006, which was extended to include Mandarin Chinese and Modern Standard Arabic), phonetic and further disfluency annotation or prosody and ToBI breaks (SWBD, Calhoun et al. 2010), as well as discourse functional annotation (the RST Discourse Treebank based on WSJ, Carlson et al. 2001). For research on Continuous Speech Recognition, portions of the WSJ corpus were even read out loud and recorded (Paul and Baker 1992). Some corpora have been constructed as multilayer resources from the outset or shortly thereafter, such as the HCRC Map Task Corpus (Anderson et al. 1991, Scottish English), the ACE corpora (Mitchell et al. 2003, Mandarin, Arabic and English), the Potsdam Commentary Corpus (German, see Stede 2004; Stede and Neumann 2014) or the Manually Annotated Sub-Corpus of the Open American National Corpus (MASC, Ide et al. 2010, American English)."

TODO: WikiNews, GUM corpus

| Corpus | POS tagging | Syntax parsing | NER | NED/EL | SRL | iSRL | Entity Coref. | Event Coref. | WSD | Quote Attrib. |
|---|---|---|---|---|---|---|---|---|---|---|
| Reuters | | | CoNLL-2003[4] | AIDA-CoNLL[5] | | | NP4E[6] (partial) | NP4E[6] (partial) | | |
| WSJ | PENN Treebank[7] | Constituent: PENN Treebank[7]; Dependency: CoNLL-2008[11] | BBN[8] | | Constituent: PropBank[9], NomBank[10]; Dependency: CoNLL-2008[11] | | Pronoun: BBN[8] | | SemEval-2007 Task 17 (3,500 words)[12] | PARC[13][14] |
| FrameNet 1.5 (WSJ+AQUAINT+MASC+LUcorpus+misc.) | | | | | FrameNet | | | | FrameNet | |
| OntoNotes 4.0 (?) | | | | | | Moor et al. (2013)[15] (partial) | | | | |
| OntoNotes 5.0 (WSJ 300K+TDT4+LCD+Web) | | OntoNotes | | | OntoNotes (PropBank-style) | | OntoNotes | | OntoNotes (coarse-grained) | |
| Sherlock Holmes | | | | | SemEval-2010[16] | SemEval-2010[16] | | | | |
| Brown | Brown | Constituent: PENN Treebank[7] | | | PropBank-style: CoNLL-2005[17] | | | | | |
| SemCor (part of Brown) | Brown | | | | | | | | SemCor | |
| WSMT (13 articles) | | | | | | | | | SemEval-2013 Task 12 (BabelNet 1.1.1)[18] | |
| RSS-500 (NIF) | | | N3[19] | N3[19] | | | | | | |
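Several of the datasets above (CoNLL-2003, CoNLL-2005, CoNLL-2008) distribute their annotation layers in simple whitespace-separated column formats, one token per line with a blank line between sentences. As a rough illustration, a minimal reader for the CoNLL-2003 four-column layout (token, POS tag, chunk tag, NER tag) might look like the sketch below; the sample data is invented for illustration, not taken from the licensed Reuters corpus.

```python
# Minimal sketch of reading CoNLL-2003-style column data:
# token  POS  chunk  NER-tag, one token per line, blank line between sentences.

def parse_conll2003(lines):
    """Group 4-column lines into sentences (lists of tuples)."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # Blank lines separate sentences; -DOCSTART- lines separate documents.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, chunk, ner = line.split()
        current.append((token, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences

# Invented sample in the CoNLL-2003 column layout (not real corpus data).
sample = """\
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
"""

sentences = parse_conll2003(sample.splitlines())
print(len(sentences), len(sentences[0]))  # 1 sentence, 7 tokens
```

Formats for the richer layers differ: CoNLL-2008 adds dependency-head and semantic-role columns per predicate, while OntoNotes and the N3 datasets use their own file formats (OntoNotes releases and NIF/RDF, respectively).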

See also

From Hovy et al. (2006)[20]

"An example of the latter type is the Salsa project (Burchardt et al., 2004), which produced a German lexicon based on the FrameNet semantic frames and annotated a large German newswire corpus. A second example, the Prague Dependency Treebank (Hajic et al., 2001), has annotated a large Czech corpus with several levels of (tectogrammatical) representation, including parts of speech, syntax, and topic/focus information structure. Finally, the IL-Annotation project (Reeder et al., 2004) focused on the representations required to support a series of increasingly semantic phenomena across seven languages (Arabic, Hindi, English, Spanish, Korean, Japanese and French). In intent and in many details, OntoNotes is compatible with all these efforts, which may one day all participate in a larger multilingual corpus integration effort."

References

  1. Zeldes, A. (2017). The GUM corpus: creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3), 581–612.
  2. Kulkarni, S., Singh, A., Ramakrishnan, G., & Chakrabarti, S. (2009, June). Collective annotation of Wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 457-466). ACM.
  3. Hoffart, J., Seufert, S., Nguyen, D. B., Theobald, M., & Weikum, G. (2012). KORE: Keyphrase Overlap Relatedness for Entity Disambiguation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (pp. 545–554). ACM.
  4. Erik F. Tjong Kim Sang, Fien De Meulder: Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. CoNLL 2003
  5. Hoffart, J., Yosef, A. M., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., … Weikum, G. (2011). Robust Disambiguation of Named Entities in Text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 782–792). Association for Computational Linguistics.
  6. Hasler, L., Orasan, C., & Naumann, K. (2006). NPs for Events: Experiments in Coreference Annotation. In Proceedings of LREC 2006 (pp. 1167–1172).
  7. Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.
  8. Weischedel, R., & Brunstein, A. (2005). BBN Pronoun Coreference and Entity Type Corpus. Technical report, Linguistic Data Consortium.
  9. Palmer, M., Gildea, D., & Kingsbury, P. (2005). The Proposition Bank: An Annotated Corpus of Semantic Roles. Computational Linguistics, 31(1).
  10. Meyers, A., R. Reeves, C. Macleod, R. Szekely, V. Zielinska, B. Young, and R. Grishman. 2004. The NomBank Project: An Interim Report. In NAACL/HLT 2004 Workshop Frontiers in Corpus Annotation, Boston.
  11. Surdeanu, M., Johansson, R., Meyers, A., Màrquez, L., & Nivre, J. (2008). The CoNLL 2008 Shared Task on Joint Parsing of Syntactic and Semantic Dependencies. In CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning (pp. 159–177). Coling 2008 Organizing Committee.
  12. Pradhan, S. S., Loper, E., Dligach, D., & Palmer, M. (2007, June). SemEval-2007 task 17: English lexical sample, SRL and all words. In Proceedings of the 4th International Workshop on Semantic Evaluations (pp. 87-92). Association for Computational Linguistics.
  13. Silvia Pareti. 2011. Annotating attribution relations and their features. In Proceedings of the fourth workshop on Exploiting Semantic Annotations in Information Retrieval, pages 19–20. ACM.
  14. Silvia Pareti. 2012. A database of attribution relations. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, pages 3213–3217.
  15. Moor, T., Roth, M., & Frank, A. (2013). Predicate-specific Annotations for Implicit Role Binding: Corpus Annotation, Data Analysis and Evaluation Experiments. In Proceedings of the 10th International Conference on Computational Semantics (pp. 369–375).
  16. Ruppenhofer, J., Sporleder, C., Morante, R., Baker, C., & Palmer, M. (2010). SemEval-2010 Task 10: Linking Events and Their Participants in Discourse. In Proceedings of the 5th International Workshop on Semantic Evaluation (pp. 45–50).
  17. Carreras, Xavier, and Lluís Màrquez. "Introduction to the CoNLL-2005 shared task: Semantic role labeling." In Proceedings of the Ninth Conference on Computational Natural Language Learning, pp. 152-164. Association for Computational Linguistics, 2005.
  18. Navigli, R., Jurgens, D., & Vannella, D. (2013, June). Semeval-2013 task 12: Multilingual word sense disambiguation. In Second Joint Conference on Lexical and Computational Semantics (* SEM) (Vol. 2, pp. 222-231).
  19. Röder, M., Usbeck, R., Hellmann, S., Gerber, D., & Both, A. (2014). N³ - A Collection of Datasets for Named Entity Recognition and Disambiguation in the NLP Interchange Format. In Proceedings of the 9th Language Resources and Evaluation Conference (LREC 2014), 26–31 May, Reykjavik, Iceland.
  20. Hovy, Eduard, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. "OntoNotes: the 90% solution." In Proceedings of the human language technology conference of the NAACL, Companion Volume: Short Papers, pp. 57-60. Association for Computational Linguistics, 2006.