OntoNotes is the largest corpus as of 2018, covering genres such as written news, broadcast news, telephone conversations, and the Bible. It is freely available at https://catalog.ldc.upenn.edu/LDC2013T19.
From Recasens et al. (2012):
"The NP4E corpus (Hasler et al., 2006) contains approximately 50,000 tokens from the Reuters corpus (Rose et al., 2002) fully annotated with NP coreference and partially annotated with event coreference. It was the result of a project whose goal was to develop a set of annotation guidelines for NP and event coreference for newswire texts in the domain of terrorism and security. The documents were selected according to five topics (i.e., Bukavu bombing, Peru hostages, Tajikistan hostages, Israel suicide bomb, China-Taiwan hijack). NP4E was analyzed using the Conexor’s parser (Tapanainen and Järvinen, 1997), but only tokenization was used for the annotation process. Markables corresponded with NPs, and were identified manually at all the levels of embedding, and including all the modifiers of an NP in the markable.
The AnCora-CA corpus (Recasens and Martí, 2010) contains 400,000 words annotated with coreference information on top of manually annotated grammatical relations, argument structures, thematic roles, semantic verb classes, named entities, and WordNet nominal senses (Taulé et al., 2008). AnCora-CA comprises newspaper and newswire articles from El Periódico newspaper, and the ACN news agency. Markables were identified according to the already existing syntactic annotations. All the NPs were considered to be markables but, unlike NP4E, markables excluded non-referring NPs such as appositive phrases, nominal predicates, negated NPs, and NPs within idioms. Also, unlike NP4E, relative pronouns were included as markables."
Tinkertoy corpus (Sasaki et al., 2002): "task-oriented dialogues elicited with an interactive game, namely the Tinkertoy matching game" (Japanese).
WikiCoref: Ghaddar and Langlais (2016a) annotated 30 documents covering different topics such as "People, Organization, Human made Object, or Occupation". Ghaddar and Langlais (2016b) devised a method to annotate the "main concept" (the entity in the title) with a high F1 score.
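For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The following is a minimal sketch, assuming mentions are represented as (start, end) token offsets and only exact matches count as correct. The function name `mention_f1` and the example spans are hypothetical and are not taken from Ghaddar and Langlais (2016b), whose evaluation setup may differ.

```python
def mention_f1(predicted, gold):
    """Exact-match precision, recall, and F1 over sets of mention spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (start, end) token offsets of "main concept" mentions.
gold = [(0, 2), (10, 11), (25, 26), (40, 42)]
predicted = [(0, 2), (10, 11), (30, 31)]
precision, recall, f1 = mention_f1(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Exact span matching is the strictest choice; a more lenient evaluation might credit partial overlaps, which would need a different matching step.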
- NXT Switchboard is of particular interest because it consists of transcribed speech: https://catalog.ldc.upenn.edu/LDC2009T26
- ARRAU: https://catalog.ldc.upenn.edu/LDC2013T22
- ACE and MUC corpora
- GUM corpus: https://corpling.uis.georgetown.edu/gum/
- Recasens, M., Martí, M. A., & Orasan, C. (2012). Annotating Near-Identity from Coreference Disagreements. Proceedings of LREC 2012, 165–172.
- Sasaki, F., Wegener, C., Witt, A., … Pönninghaus, J. (2002). Co-reference annotation and resources: A multilingual corpus of typologically diverse languages. Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), 1225–1230.
- Poesio, M., Chamberlain, J., Kruschwitz, U., Robaldo, L., & Ducceschi, L. (2013). Phrase Detectives: Utilizing Collective Intelligence for Internet-scale Language Resource Creation. ACM Trans. Interact. Intell. Syst., 3(1), 3:1–3:44. http://doi.org/10.1145/2448116.2448119
- Chamberlain, J., Poesio, M., & Kruschwitz, U. (2016). Phrase Detectives Corpus 1.0: Crowdsourced Anaphoric Coreference. In LREC.
- Ghaddar, A., & Langlais, P. (2016a). WikiCoref: An English coreference-annotated corpus of Wikipedia articles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia.
- Ghaddar, A., & Langlais, P. (2016b). Coreference in Wikipedia: Main Concept Resolution. In CoNLL.
- Schäfer, U., Spurk, C., & Steffen, J. (2012). A fully coreference-annotated corpus of scholarly papers from the ACL anthology.
- Chaimongkol, P., Aizawa, A., & Tateisi, Y. (2014). Corpus for Coreference Resolution on Scientific Papers. In LREC.