Summary

Language #documents #tokens Genres Layers Events Reference Download
OntoNotes English newswire, broadcast news, bible, telephone conversations, etc. Seg, tok, syntax, SRL, coref Y N
AnCora-CA Spanish+Catalan
Phrase Detectives English N?
WikiCoref N[excerpt 1] Ghaddar and Langlais (2016)[1]
GUM English 126 109,141 interviews, news, travel guide, academic writing, etc. Seg, tok, syntax, coref, discourse,


Y Zeldes (2017)[2] Github

Details

OntoNotes is the largest corpus as of 2018, covering genres such as written news, broadcast news, telephone conversations, and the Bible. It is freely available at


From Recasens et al. (2012)[3]:

"The NP4E corpus (Hasler et al., 2006) contains approximately 50,000 tokens from the Reuters corpus (Rose et al., 2002) fully annotated with NP coreference and partially annotated with event coreference. It was the result of a project whose goal was to develop a set of annotation guidelines for NP and event coreference for newswire texts in the domain of terrorism and security. The documents were selected according to five topics (i.e., Bukavu bombing, Peru hostages, Tajikistan hostages, Israel suicide bomb, China-Taiwan hijack). NP4E was analyzed using the Conexor’s parser (Tapanainen and Järvinen, 1997), but only tokenization was used for the annotation process. Markables corresponded with NPs, and were identified manually at all the levels of embedding, and including all the modifiers of an NP in the markable.
The AnCora-CA corpus (Recasens and Martí, 2010) contains 400,000 words annotated with coreference information on top of manually annotated grammatical relations, argument structures, thematic roles, semantic verb classes, named entities, and WordNet nominal senses (Taulé et al., 2008). AnCora-CA comprises newspaper and newswire articles from El Periódico newspaper, and the ACN news agency. Markables were identified according to the already existing syntactic annotations. All the NPs were considered to be markables but, unlike NP4E, markables excluded non-referring NPs such as appositive phrases, nominal predicates, negated NPs, and NPs within idioms. Also, unlike NP4E, relative pronouns were included as markables."
Tinkertoy corpus (Sasaki et al. 2002)[4]: "task-oriented dialogues elicited with an interactive game, namely the Tinkertoy matching game" (Japanese).

Phrase Detectives: available from the LDC as a free download. References: Poesio et al. (2013)[5], Chamberlain et al. (2016)[6]

WikiCoref: Ghaddar and Langlais (2016a)[1] annotated 30 documents covering different topics such as "People, Organization, Human made Object, or Occupation". Ghaddar and Langlais (2016b)[7] devised a method that identifies mentions of the "main concept" (the entity named in the article title) with a high F1 score.

Scientific papers: Schäfer et al. (2012)[8], Chaimongkol et al. (2014)[9]

Excerpts

  1. From Ghaddar and Langlais (2016): "In general, the annotation scheme in WikiCoref mainly follows the OntoNotes scheme (Pradhan et al., 2007). In particular, only noun phrases are eligible to be mentions and only non-singleton coreference sets are kept in the version distributed."
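
The "only non-singleton coreference sets are kept" convention in the excerpt above can be illustrated with a minimal Python sketch. The data layout is an assumption made for illustration only: clusters are represented as a dictionary from a hypothetical entity ID to a list of (start, end) token spans, which is not the actual WikiCoref or OntoNotes file format, and `drop_singletons` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

# Hypothetical mention representation: (start_token, end_token), inclusive.
Mention = Tuple[int, int]


def drop_singletons(clusters: Dict[str, List[Mention]]) -> Dict[str, List[Mention]]:
    """Keep only coreference sets that contain at least two mentions."""
    return {entity: mentions
            for entity, mentions in clusters.items()
            if len(mentions) >= 2}


# Toy example: "e2" has a single mention, so it is a singleton.
clusters = {
    "e1": [(0, 1), (5, 5), (9, 10)],
    "e2": [(3, 3)],
    "e3": [(12, 13), (15, 15)],
}

kept = drop_singletons(clusters)
print(sorted(kept))
```

A corpus that keeps singletons would skip this filtering step, which matters when comparing mention counts or evaluation scores across corpora.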

References

  1. Ghaddar, A., & Langlais, P. (2016a). WikiCoref: An English coreference-annotated corpus of Wikipedia articles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May 2016.
  2. Zeldes, A. (2017). The GUM corpus: creating multilayer resources in the classroom. Language Resources and Evaluation, 51(3), 581–612.
  3. Recasens, M., Martí, M. A., & Orasan, C. (2012). Annotating Near-Identity from Coreference Disagreements. Proceedings of LREC 2012, 165–172.
  4. Sasaki, F., Wegener, C., Witt, A., … Pönninghaus, J. (2002). Co-reference annotation and resources: A multilingual corpus of typologically diverse languages. Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), 1225–1230.
  5. Poesio, M., Chamberlain, J., Kruschwitz, U., Robaldo, L., & Ducceschi, L. (2013). Phrase Detectives: Utilizing Collective Intelligence for Internet-scale Language Resource Creation. ACM Trans. Interact. Intell. Syst., 3(1), 3:1–3:44.
  6. Chamberlain, J., Poesio, M., & Kruschwitz, U. (2016). Phrase Detectives Corpus 1.0 Crowdsourced Anaphoric Coreference. In LREC.
  7. Ghaddar, A., & Langlais, P. (2016b). Coreference in Wikipedia: Main Concept Resolution. In CoNLL.
  8. Schäfer, U., Spurk, C., & Steffen, J. (2012). A fully coreference-annotated corpus of scholarly papers from the ACL anthology.
  9. Chaimongkol, P., Aizawa, A., & Tateisi, Y. (2014). Corpus for Coreference Resolution on Scientific Papers. In LREC.