Natural Language Understanding Wiki

Amoia et al. (2012)[1] argued that spoken language uses different coreference strategies than written language. More precisely, they observed that 51% of coreference cases in their interviews are personal pronouns, compared with 18.4% in written text. However, most of these cases might simply be "you" and "I", as the two speakers address each other and refer to themselves; such cases are uninteresting. The relative proportion between types (e.g. named entity vs. noun phrase) is not so different.

From Strube and Müller (2003)[2]:

"There are important differences between written text and spoken dialogue which have to be accounted for. The most obvious difference is that in spoken dialogue there is an abundance of (personal and demonstrative) pronouns with non-NP-antecedents or no antecedents at all. Corpus studies have shown that a significant amount of pronouns in spoken dialogue have non-NP-antecedents: Byron & Allen (1998) report that about 50% of the pronouns in the TRAINS93 corpus have non-NP-antecedents. Eckert & Strube (2000) note that only about 45% of the pronouns in a set of Switchboard dialogues have NP-antecedents. The remainder consists of 22% which have non-NP-antecedents and 33% without antecedents. These studies suggest that the performance of a pronoun resolution algorithm can be improved considerably by enabling it to resolve also pronouns with non-NP-antecedents."

Chen et al. (2011)[3]: "this task is more difficult in dialogue. First, utterances may be informal, ungrammatical or disfluent; second, people spontaneously use hand gestures, body gestures and gaze."


Rösiger and Riester (2015)[4] use prosodic information to improve coreference resolution for speech.

Strube and Müller (2003)[5] "present a set of features designed for pronoun resolution in spoken dialogue and determine the most promising features".

Tetreault and Allen (2004)[6] use discourse segmentation information.

Chen et al. (2011)[3] use multimodal features.


DIRNDL (German)

From Rösiger and Riester (2015)[4]: "DIRNDL corpus (ca. 50.000 tokens, 3221 sentences), a radio news corpus annotated with both manual coreference and manual prosody labels (Eckart et al., 2012[7]; Björkelund et al., 2014)."

ANCOR_Centre (French)

Muzerelle et al. (2014)[8]:

"This article presents ANCOR_Centre, a French coreference corpus, available under the Creative Commons Licence. With a size of around 500,000 words, the corpus is large enough to serve the needs of data-driven approaches in NLP and represents one of the largest coreference resources currently available.

It consists of four different spoken corpora that were already transcribed during previous research projects (Table 2). Two of them have been extracted from the ESLO corpus, which collects sociolinguistic interviews with a restricted interactivity (Schang et al., 2012). On the opposite, OTG and Accueil_UBS concern highly interactive Human-Human dialogues (Nicolas et al., 2002). These last two corpora differ by the media of interaction: direct conversation or phone call. All of these corpora are freely distributed under a Creative Commons license. Conversational speech only represents 7% of the total corpus because of the scarcity of such free resources in French."


One of the largest (if not the largest) corpora for English coreference resolution. The conversational parts come from earlier resources such as CallHome (telephone conversations), TDT-4 (broadcast news, with audio available) and GALE (?) (broadcast conversation).


From Strube and Müller (2003)[2]:

"The annotation consists of 16601 markables, i.e. sequences of words and attributes associated with them. On the top level, different types of markables are distinguished: NP-markables [...]. VP-markables are verb phrases, S-markables sentences. Disfluency-markables [...]. Among other (type-dependent) attributes, markables contain a member attribute with the ID of the coreference class they are part of (if any). If an expression is used to refer to an entity that is not referred to by any other expression, it is considered a singleton."

Audio data is distributed by LDC: 1, 2.
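
The markable annotation quoted above can be pictured as a small data structure. The following is only a minimal sketch: the class name `Markable`, its field names, the helper `coreference_classes`, and the toy markables are all assumptions made for illustration and do not reproduce the actual annotation format. It also assumes that a markable with no `member` value, or one that is alone in its class, counts as a singleton.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Markable:
    """One annotated span; field names are illustrative, not the corpus schema."""
    markable_id: str
    kind: str                      # e.g. "NP", "VP", "S", "Disfluency"
    text: str
    member: Optional[str] = None   # ID of the coreference class, if any


def coreference_classes(markables):
    """Group markables by their `member` attribute.

    Returns (classes, singletons): `classes` maps a class ID to the markables
    sharing it (two or more); `singletons` lists markables whose entity is
    not referred to by any other expression.
    """
    by_class = defaultdict(list)
    singletons = []
    for m in markables:
        if m.member is None:
            singletons.append(m)
        else:
            by_class[m.member].append(m)
    classes = {}
    for class_id, members in by_class.items():
        if len(members) > 1:
            classes[class_id] = members
        else:
            # Assumption: a class with a single member is also a singleton.
            singletons.extend(members)
    return classes, singletons


# Invented toy markables, not taken from the corpus.
toy = [
    Markable("m1", "NP", "my sister", member="set_1"),
    Markable("m2", "NP", "she", member="set_1"),
    Markable("m3", "VP", "moved to Texas", member="set_2"),
    Markable("m4", "NP", "that", member="set_2"),
    Markable("m5", "NP", "a dog"),
]
classes, singletons = coreference_classes(toy)
```

In this toy input, the class `set_2` links the pronoun "that" to a VP-markable, which is the kind of non-NP antecedent that the VP- and S-markables make it possible to annotate.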

Monroe domain

From Tetreault and Allen (2004)[6]:

"Our corpus consists of five transcribed task-oriented dialogs (1756 utterances total) between two humans called the Monroe domain (Stent, 2001). The participants were given a set of emergencies and told to collaborate on building a plan to allocate resources to resolve all the emergencies in a timely manner.


The third phase involved annotating the reference relationships between terms using a variation of the GNOME project scheme (Poesio, 2000). We annotated coreference relationships between noun phrases and also annotated all pronouns. We labeled each pronoun with one of the following relations: coreference, action, demonstrative, and functional."

ICSI Meeting Corpus

The audio and transcriptions are distributed by LDC; coreference annotations for the three pronouns "it", "this", and "that" were added by Müller (2008)[9].

DAD corpus (Danish+Italian)

From Navarretta (2007)[10]:

"The main goal behind the annotation of reference in the DAD project has been to provide annotated data to be used in the study and treatment of abstract anaphora in Danish and Italian written and spoken corpora (Navarretta and Olsen, 2008)."

ACE 2007?

TODO: Luo et al. (2009)[11]

"This study uses the 2007 ACE data. In the ACE program, a mention is textual reference to an object of interest while the set of mentions in a document referring to the same object is called entity."



  1. Amoia, M., Kunz, K., & Lapshinova-Koltunski, E. (2012). Coreference in Spoken vs. Written Texts: a Corpus-based Analysis. In LREC 2012 (pp. 158–164).
  2. Strube, M., & Müller, C. (2003). A Machine Learning Approach to Pronoun Resolution in Spoken Dialogue. Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, 168–175.
  3. Chen, L., Wang, A., & Di Eugenio, B. (2011). Improving pronominal and deictic co-reference resolution with multi-modal features. Proceedings of the SIGDIAL 2011, 307–311.
  4. Rösiger, I., & Riester, A. (2015). Using prosodic annotations to improve coreference resolution of spoken text. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 2, 83–88.
  5. Strube, M., & Müller, C. (2003). A machine learning approach to pronoun resolution in spoken dialogue. Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, 168–175.
  6. Tetreault, J., & Allen, J. (2004). Dialogue structure and pronoun resolution. Proceedings of the 5th Discourse Anaphora and Anaphor Resolution Colloquium, S. Miguel, Portugal.
  7. Eckart, K., Riester, A., & Schweitzer, K. (2012). A Discourse Information Radio News Database for Linguistic Analysis. Linked Data in Linguistics, 65–76.
  8. Muzerelle, J., Lefeuvre, A., Schang, E., Antoine, J., Maurel, D., Eshkol, I., & Villaneau, J. (2014). ANCOR_Centre, a Large Free Spoken French Coreference Corpus. LREC, 843–847.
  9. Müller, M.-C. (2008). Fully Automatic Resolution of “it”, “this”, and “that” in Unrestricted Multi-Party Dialog.
  10. Navarretta, C. (2007). The DAD parallel corpora and their uses. Corpus, 705–712.
  11. Luo, X., Florian, R., & Ward, T. (2009). Improving Coreference Resolution by Using Conversational Metadata. NAACL (June), 201–204.