Natural Language Understanding Wiki

Revision as of 11:12, 10 February 2020

Summary

Corpus            | Language        | #documents | #tokens | Events annotated? | Singleton annotated? | Reference
OntoNotes         | English         |            |         | Y                 | N                    |
AnCora-CA         | Spanish+Catalan |            |         |                   |                      |
Phrase Detectives | English         |            |         | N?                |                      |
WikiCoref         | English         |            |         |                   | N[excerpt 1]         | Ghaddar and Langlais (2016)[1]
GUM               |                 |            |         |                   |                      |
ACE               |                 |            |         |                   |                      |
MUC               |                 |            |         |                   |                      |
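The "Singleton annotated?" column can often be checked mechanically. A minimal sketch, assuming a CoNLL-2012-style coreference column (one string per token, with bracketed cluster ids such as `(3`, `3)`, or `(3)`): count mentions per cluster and report clusters with exactly one mention. The function name and toy input are illustrative, not part of any corpus's tooling.

```python
import re
from collections import Counter

def singleton_clusters(coref_column):
    """Given a CoNLL-2012-style coreference column (one cell per token,
    e.g. '(3', '3)', '(3)', or '-'), return ids of clusters that have
    exactly one mention. Every '(' opens one mention of that cluster,
    so nested cells like '(3(7' contribute to both clusters."""
    mentions = Counter()
    for cell in coref_column:
        for cluster_id in re.findall(r"\((\d+)", cell):
            mentions[cluster_id] += 1
    return [cid for cid, n in mentions.items() if n == 1]

# toy example: cluster 3 has two mentions, cluster 7 only one
column = ["(3", "3)", "-", "(7)", "-", "(3)"]
print(singleton_clusters(column))  # -> ['7']
```

For a corpus that follows the OntoNotes distribution policy (singleton sets removed), this list is empty by construction.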

Details

OntoNotes is the largest coreference-annotated corpus as of 2018, covering genres such as written news, broadcast news, telephone conversations, and the Bible. It is freely available at https://catalog.ldc.upenn.edu/LDC2013T19.

TODO: http://dces.essex.ac.uk/staff/poesio/publications/chacat.pdf

From Recasens et al. (2012)[2]:

"The NP4E corpus (Hasler et al., 2006) contains approximately 50,000 tokens from the Reuters corpus (Rose et al., 2002) fully annotated with NP coreference and partially annotated with event coreference. It was the result of a project whose goal was to develop a set of annotation guidelines for NP and event coreference for newswire texts in the domain of terrorism and security. The documents were selected according to five topics (i.e., Bukavu bombing, Peru hostages, Tajikistan hostages, Israel suicide bomb, China-Taiwan hijack). NP4E was analyzed using the Conexor’s parser (Tapanainen and Järvinen, 1997), but only tokenization was used for the annotation process. Markables corresponded with NPs, and were identified manually at all the levels of embedding, and including all the modifiers of an NP in the markable.

The AnCora-CA corpus (Recasens and Martí, 2010) contains 400,000 words annotated with coreference information on top of manually annotated grammatical relations, argument structures, thematic roles, semantic verb classes, named entities, and WordNet nominal senses (Taulé et al., 2008). AnCora-CA comprises newspaper and newswire articles from El Periódico newspaper, and the ACN news agency. Markables were identified according to the already existing syntactic annotations. All the NPs were considered to be markables but, unlike NP4E, markables excluded non-referring NPs such as appositive phrases, nominal predicates, negated NPs, and NPs within idioms. Also, unlike NP4E, relative pronouns were included as markables."

Tinkertoy corpus (Sasaki et al. 2002)[3]: "task-oriented dialogues elicited with an interactive game, namely the Tinkertoy matching game" (Japanese).

Phrase Detectives is available on LDC (it can be downloaded for free); references: Poesio et al. (2013)[4] and Chamberlain et al. (2016)[5]

WikiCoref: Ghaddar and Langlais (2016a)[1] annotated 30 documents covering different topics such as "People, Organization, Human made Object, or Occupation". Ghaddar and Langlais (2016b)[6] devised a method to annotate the "main concept" (the entity in the title) with high F1.

Scientific papers: Schäfer et al. (2012)[7], Chaimongkol et al. (2014)[8]

Excerpts

  1. From Ghaddar and Langlais (2016): "In general, the annotation scheme in WikiCoref mainly follows the OntoNotes scheme (Pradhan et al., 2007). In particular, only noun phrases are eligible to be mentions and only non-singleton coreference sets are kept in the version distributed."
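The distribution policy in this excerpt (keep only non-singleton sets) is easy to mirror when preparing one's own data. A hedged sketch, assuming clusters are represented as lists of (start, end) mention spans; the helper name is illustrative:

```python
def drop_singletons(clusters):
    """Keep only coreference sets with at least two mentions,
    mirroring the WikiCoref/OntoNotes distribution policy."""
    return [cluster for cluster in clusters if len(cluster) >= 2]

clusters = [
    [(0, 1), (5, 5)],  # two mentions -> kept
    [(9, 10)],         # singleton -> dropped
]
print(drop_singletons(clusters))  # -> [[(0, 1), (5, 5)]]
```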

References

  1. Ghaddar, A., & Langlais, P. (2016a). WikiCoref: An English Coreference-annotated Corpus of Wikipedia Articles. Proceedings of LREC 2016, Portorož, Slovenia.
  2. Recasens, M., Martí, M. A., & Orasan, C. (2012). Annotating Near-Identity from Coreference Disagreements. Proceedings of LREC 2012, 165–172.
  3. Sasaki, F., Wegener, C., Witt, A., … Pönninghaus, J. (2002). Co-reference Annotation and Resources: A Multilingual Corpus of Typologically Diverse Languages. Proceedings of LREC 2002, 1225–1230.
  4. Poesio, M., Chamberlain, J., Kruschwitz, U., Robaldo, L., & Ducceschi, L. (2013). Phrase Detectives: Utilizing Collective Intelligence for Internet-scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems, 3(1), 3:1–3:44. http://doi.org/10.1145/2448116.2448119
  5. Chamberlain, J., Poesio, M., & Kruschwitz, U. (2016). Phrase Detectives Corpus 1.0: Crowdsourced Anaphoric Coreference. Proceedings of LREC 2016.
  6. Ghaddar, A., & Langlais, P. (2016b). Coreference in Wikipedia: Main Concept Resolution. Proceedings of CoNLL 2016.
  7. Schäfer, U., Spurk, C., & Steffen, J. (2012). A Fully Coreference-annotated Corpus of Scholarly Papers from the ACL Anthology.
  8. Chaimongkol, P., Aizawa, A., & Tateisi, Y. (2014). Corpus for Coreference Resolution on Scientific Papers. Proceedings of LREC 2014.