Natural Language Understanding Wiki

Main references:

  • OntoNotes 4.0: TODO
  • OntoNotes 5.0: Weischedel et al. (2013)[1]

Download:

Genres

OntoNotes is composed of several "genres" (or rather, sources), as follows (Pradhan et al. 2013[2], Weischedel et al. 2013[3]):

  • bc: broadcast conversation
  • bn: broadcast news
  • mz: magazine genre (Sinorama magazine)
  • nw: newswire genre
  • pt: pivot text (250K English translation of the New Testament annotated with parse, proposition, name and coreference; and about 100K parses for a portion of the Old Testament)
  • tc: telephone conversation (CallHome corpus)
  • wb: web data (85K of single sentences selected to improve sense coverage)
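The two-letter genre codes also appear in annotation file paths, for example annotations/nw/wsj/03/wsj_0364.parse. The following is a minimal sketch of how a path might be mapped back to its genre. It assumes that the genre code is the path component directly after "annotations", as in the paths listed on this page; the function name genre_of and the shortened descriptions are illustrative, not part of OntoNotes.

```python
from pathlib import PurePosixPath

# Genre codes as listed above (descriptions shortened).
GENRES = {
    "bc": "broadcast conversation",
    "bn": "broadcast news",
    "mz": "magazine",
    "nw": "newswire",
    "pt": "pivot text",
    "tc": "telephone conversation",
    "wb": "web data",
}


def genre_of(path):
    """Return (code, description) for an OntoNotes annotation path.

    Assumption: the genre code is the component right after 'annotations',
    e.g. 'annotations/nw/wsj/03/wsj_0364.parse' -> 'nw'.
    Raises ValueError if 'annotations' is missing and KeyError for an
    unknown genre code.
    """
    parts = PurePosixPath(path).parts
    idx = parts.index("annotations")
    code = parts[idx + 1]
    return code, GENRES[code]


if __name__ == "__main__":
    print(genre_of("annotations/nw/wsj/03/wsj_0364.parse"))
    print(genre_of("annotations/wb/sel/28/sel_2886.parse"))
```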

Views

Word sense view

Word senses in OntoNotes are drawn from WordNet 3.0, plus a small number of new senses based on human-readable dictionaries. Some abbreviations you may find in the sense inventory files (e.g. /data/files/data/english/metadata/sense-inventories/coursework-n.xml) are:[4]

  • MAC: Macmillan dictionary
  • MW online: Merriam Webster online

Errata

Using NearDup, I found 46 pairs of exact duplicates in OntoNotes 5.0 and another 47 pairs of files that overlap by more than 90% (a small fraction of the roughly 13k files). The files in each pair are annotations of the same underlying document; the annotations themselves are likely to contain slight variations.
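How NearDup computes overlap is not described here. As a stand-in illustration only, the sketch below shows one common way to score the overlap between two texts: Jaccard similarity over word n-grams ("shingles"). The function names, the whitespace tokenization, and the shingle size of 5 are all assumptions made for this example, and its scores need not match NearDup's percentages.

```python
def shingles(text, n=5):
    """Return the set of word n-grams (shingles) in text.

    Texts shorter than n words yield a single shingle holding all words.
    """
    words = text.split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(a, b, n=5):
    """Jaccard similarity of the shingle sets of two texts, in [0, 1]."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox jumps over the lazy cat"
    print(overlap(doc1, doc1))       # identical texts
    print(overlap(doc1, doc2, n=2))  # texts differing in the last word
```

Under this measure, a pair would count as a near-duplicate when the score exceeds a chosen threshold such as 0.9.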

List of exact duplicates:

File 1 File 2
annotations/nw/wsj/03/wsj_0364.parse annotations/nw/wsj/01/wsj_0190.parse
annotations/nw/wsj/05/wsj_0511.parse annotations/nw/wsj/01/wsj_0190.parse
annotations/nw/wsj/05/wsj_0511.parse annotations/nw/wsj/03/wsj_0364.parse
annotations/nw/wsj/06/wsj_0696.parse annotations/nw/wsj/01/wsj_0190.parse
annotations/nw/wsj/06/wsj_0696.parse annotations/nw/wsj/03/wsj_0364.parse
annotations/nw/wsj/06/wsj_0696.parse annotations/nw/wsj/05/wsj_0511.parse
annotations/nw/wsj/10/wsj_1056.parse annotations/nw/wsj/01/wsj_0190.parse
annotations/nw/wsj/10/wsj_1056.parse annotations/nw/wsj/03/wsj_0364.parse
annotations/nw/wsj/10/wsj_1056.parse annotations/nw/wsj/05/wsj_0511.parse
annotations/nw/wsj/10/wsj_1056.parse annotations/nw/wsj/06/wsj_0696.parse
annotations/nw/wsj/12/wsj_1228.parse annotations/nw/wsj/01/wsj_0190.parse
annotations/nw/wsj/12/wsj_1228.parse annotations/nw/wsj/03/wsj_0364.parse
annotations/nw/wsj/12/wsj_1228.parse annotations/nw/wsj/05/wsj_0511.parse
annotations/nw/wsj/12/wsj_1228.parse annotations/nw/wsj/06/wsj_0696.parse
annotations/nw/wsj/12/wsj_1228.parse annotations/nw/wsj/10/wsj_1056.parse
annotations/nw/wsj/13/wsj_1382.parse annotations/nw/wsj/01/wsj_0190.parse
annotations/nw/wsj/13/wsj_1382.parse annotations/nw/wsj/03/wsj_0364.parse
annotations/nw/wsj/13/wsj_1382.parse annotations/nw/wsj/05/wsj_0511.parse
annotations/nw/wsj/13/wsj_1382.parse annotations/nw/wsj/06/wsj_0696.parse
annotations/nw/wsj/13/wsj_1382.parse annotations/nw/wsj/10/wsj_1056.parse
annotations/nw/wsj/13/wsj_1382.parse annotations/nw/wsj/12/wsj_1228.parse
annotations/nw/wsj/15/wsj_1558.parse annotations/nw/wsj/15/wsj_1557.parse
annotations/nw/wsj/19/wsj_1941.parse annotations/nw/wsj/01/wsj_0190.parse
annotations/nw/wsj/19/wsj_1941.parse annotations/nw/wsj/03/wsj_0364.parse
annotations/nw/wsj/19/wsj_1941.parse annotations/nw/wsj/05/wsj_0511.parse
annotations/nw/wsj/19/wsj_1941.parse annotations/nw/wsj/06/wsj_0696.parse
annotations/nw/wsj/19/wsj_1941.parse annotations/nw/wsj/10/wsj_1056.parse
annotations/nw/wsj/19/wsj_1941.parse annotations/nw/wsj/12/wsj_1228.parse
annotations/nw/wsj/19/wsj_1941.parse annotations/nw/wsj/13/wsj_1382.parse
annotations/nw/wsj/21/wsj_2114.parse annotations/nw/wsj/01/wsj_0190.parse
annotations/nw/wsj/21/wsj_2114.parse annotations/nw/wsj/03/wsj_0364.parse
annotations/nw/wsj/21/wsj_2114.parse annotations/nw/wsj/05/wsj_0511.parse
annotations/nw/wsj/21/wsj_2114.parse annotations/nw/wsj/06/wsj_0696.parse
annotations/nw/wsj/21/wsj_2114.parse annotations/nw/wsj/10/wsj_1056.parse
annotations/nw/wsj/21/wsj_2114.parse annotations/nw/wsj/12/wsj_1228.parse
annotations/nw/wsj/21/wsj_2114.parse annotations/nw/wsj/13/wsj_1382.parse
annotations/nw/wsj/21/wsj_2114.parse annotations/nw/wsj/19/wsj_1941.parse
annotations/wb/sel/28/sel_2886.parse annotations/wb/sel/28/sel_2884.parse
annotations/wb/sel/37/sel_3763.parse annotations/wb/sel/30/sel_3022.parse
annotations/wb/sel/51/sel_5169.parse annotations/wb/sel/41/sel_4156.parse
annotations/wb/sel/63/sel_6382.parse annotations/wb/sel/48/sel_4847.parse
annotations/wb/sel/74/sel_7478.parse annotations/wb/sel/60/sel_6047.parse
annotations/wb/sel/94/sel_9487.parse annotations/wb/sel/51/sel_5110.parse
annotations/wb/sel/95/sel_9585.parse annotations/wb/sel/42/sel_4283.parse
annotations/wb/sel/95/sel_9589.parse annotations/wb/sel/43/sel_4300.parse
annotations/wb/sel/97/sel_9779.parse annotations/wb/sel/77/sel_7752.parse

List of files with more than 90% overlap:

File 1 File 2
annotations/nw/wsj/14/wsj_1486.parse annotations/nw/wsj/06/wsj_0680.parse
annotations/nw/wsj/15/wsj_1557.parse annotations/nw/wsj/01/wsj_0190.parse
annotations/nw/wsj/15/wsj_1557.parse annotations/nw/wsj/03/wsj_0364.parse
annotations/nw/wsj/15/wsj_1557.parse annotations/nw/wsj/05/wsj_0511.parse
annotations/nw/wsj/15/wsj_1557.parse annotations/nw/wsj/06/wsj_0696.parse
annotations/nw/wsj/15/wsj_1557.parse annotations/nw/wsj/10/wsj_1056.parse
annotations/nw/wsj/15/wsj_1557.parse annotations/nw/wsj/12/wsj_1228.parse
annotations/nw/wsj/15/wsj_1557.parse annotations/nw/wsj/13/wsj_1382.parse
annotations/nw/wsj/15/wsj_1558.parse annotations/nw/wsj/01/wsj_0190.parse
annotations/nw/wsj/15/wsj_1558.parse annotations/nw/wsj/03/wsj_0364.parse
annotations/nw/wsj/15/wsj_1558.parse annotations/nw/wsj/05/wsj_0511.parse
annotations/nw/wsj/15/wsj_1558.parse annotations/nw/wsj/06/wsj_0696.parse
annotations/nw/wsj/15/wsj_1558.parse annotations/nw/wsj/10/wsj_1056.parse
annotations/nw/wsj/15/wsj_1558.parse annotations/nw/wsj/12/wsj_1228.parse
annotations/nw/wsj/15/wsj_1558.parse annotations/nw/wsj/13/wsj_1382.parse
annotations/nw/wsj/18/wsj_1862.parse annotations/nw/wsj/12/wsj_1259.parse
annotations/nw/wsj/19/wsj_1941.parse annotations/nw/wsj/15/wsj_1557.parse
annotations/nw/wsj/19/wsj_1941.parse annotations/nw/wsj/15/wsj_1558.parse
annotations/nw/wsj/21/wsj_2114.parse annotations/nw/wsj/15/wsj_1557.parse
annotations/nw/wsj/21/wsj_2114.parse annotations/nw/wsj/15/wsj_1558.parse
annotations/wb/sel/38/sel_3878.parse annotations/wb/sel/38/sel_3875.parse
annotations/wb/sel/43/sel_4300.parse annotations/wb/sel/42/sel_4283.parse
annotations/wb/sel/46/sel_4653.parse annotations/wb/sel/46/sel_4648.parse
annotations/wb/sel/46/sel_4660.parse annotations/wb/sel/46/sel_4648.parse
annotations/wb/sel/46/sel_4660.parse annotations/wb/sel/46/sel_4653.parse
annotations/wb/sel/48/sel_4844.parse annotations/wb/sel/48/sel_4836.parse
annotations/wb/sel/49/sel_4975.parse annotations/wb/sel/49/sel_4974.parse
annotations/wb/sel/51/sel_5196.parse annotations/wb/sel/51/sel_5185.parse
annotations/wb/sel/53/sel_5342.parse annotations/wb/sel/53/sel_5341.parse
annotations/wb/sel/60/sel_6042.parse annotations/wb/sel/60/sel_6040.parse
annotations/wb/sel/63/sel_6399.parse annotations/wb/sel/63/sel_6398.parse
annotations/wb/sel/65/sel_6591.parse annotations/wb/sel/65/sel_6585.parse
annotations/wb/sel/67/sel_6786.parse annotations/wb/sel/67/sel_6785.parse
annotations/wb/sel/76/sel_7698.parse annotations/wb/sel/76/sel_7697.parse
annotations/wb/sel/79/sel_7943.parse annotations/wb/sel/79/sel_7942.parse
annotations/wb/sel/80/sel_8018.parse annotations/wb/sel/80/sel_8011.parse
annotations/wb/sel/82/sel_8232.parse annotations/wb/sel/81/sel_8149.parse
annotations/wb/sel/82/sel_8234.parse annotations/wb/sel/81/sel_8168.parse
annotations/wb/sel/87/sel_8703.parse annotations/wb/sel/87/sel_8702.parse
annotations/wb/sel/87/sel_8771.parse annotations/wb/sel/87/sel_8756.parse
annotations/wb/sel/87/sel_8787.parse annotations/wb/sel/87/sel_8766.parse
annotations/wb/sel/90/sel_9002.parse annotations/wb/sel/89/sel_8992.parse
annotations/wb/sel/91/sel_9189.parse annotations/wb/sel/91/sel_9186.parse
annotations/wb/sel/93/sel_9372.parse annotations/wb/sel/93/sel_9369.parse
annotations/wb/sel/95/sel_9585.parse annotations/wb/sel/43/sel_4300.parse
annotations/wb/sel/95/sel_9589.parse annotations/wb/sel/42/sel_4283.parse
annotations/wb/sel/95/sel_9589.parse annotations/wb/sel/95/sel_9585.parse

Together, the documents above form 34 clusters (see the code used to create the clusters):

Cluster of duplicate or near-duplicate documents
annotations/nw/wsj/14/wsj_1486.parse, annotations/nw/wsj/06/wsj_0680.parse
annotations/nw/wsj/12/wsj_1259.parse, annotations/nw/wsj/18/wsj_1862.parse
annotations/nw/wsj/03/wsj_0364.parse, annotations/nw/wsj/12/wsj_1228.parse, annotations/nw/wsj/06/wsj_0696.parse, annotations/nw/wsj/10/wsj_1056.parse, annotations/nw/wsj/21/wsj_2114.parse, annotations/nw/wsj/05/wsj_0511.parse, annotations/nw/wsj/01/wsj_0190.parse, annotations/nw/wsj/13/wsj_1382.parse, annotations/nw/wsj/15/wsj_1557.parse, annotations/nw/wsj/19/wsj_1941.parse, annotations/nw/wsj/15/wsj_1558.parse
annotations/wb/sel/38/sel_3875.parse, annotations/wb/sel/38/sel_3878.parse
annotations/wb/sel/46/sel_4660.parse, annotations/wb/sel/46/sel_4653.parse, annotations/wb/sel/46/sel_4648.parse
annotations/wb/sel/48/sel_4844.parse, annotations/wb/sel/48/sel_4836.parse
annotations/wb/sel/49/sel_4975.parse, annotations/wb/sel/49/sel_4974.parse
annotations/wb/sel/51/sel_5185.parse, annotations/wb/sel/51/sel_5196.parse
annotations/wb/sel/53/sel_5341.parse, annotations/wb/sel/53/sel_5342.parse
annotations/wb/sel/60/sel_6040.parse, annotations/wb/sel/60/sel_6042.parse
annotations/wb/sel/63/sel_6398.parse, annotations/wb/sel/63/sel_6399.parse
annotations/wb/sel/65/sel_6585.parse, annotations/wb/sel/65/sel_6591.parse
annotations/wb/sel/67/sel_6785.parse, annotations/wb/sel/67/sel_6786.parse
annotations/wb/sel/76/sel_7698.parse, annotations/wb/sel/76/sel_7697.parse
annotations/wb/sel/79/sel_7942.parse, annotations/wb/sel/79/sel_7943.parse
annotations/wb/sel/80/sel_8011.parse, annotations/wb/sel/80/sel_8018.parse
annotations/wb/sel/81/sel_8149.parse, annotations/wb/sel/82/sel_8232.parse
annotations/wb/sel/82/sel_8234.parse, annotations/wb/sel/81/sel_8168.parse
annotations/wb/sel/87/sel_8702.parse, annotations/wb/sel/87/sel_8703.parse
annotations/wb/sel/87/sel_8771.parse, annotations/wb/sel/87/sel_8756.parse
annotations/wb/sel/87/sel_8787.parse, annotations/wb/sel/87/sel_8766.parse
annotations/wb/sel/89/sel_8992.parse, annotations/wb/sel/90/sel_9002.parse
annotations/wb/sel/91/sel_9189.parse, annotations/wb/sel/91/sel_9186.parse
annotations/wb/sel/93/sel_9372.parse, annotations/wb/sel/93/sel_9369.parse
annotations/wb/sel/95/sel_9589.parse, annotations/wb/sel/95/sel_9585.parse, annotations/wb/sel/43/sel_4300.parse, annotations/wb/sel/42/sel_4283.parse
annotations/nw/wsj/15/wsj_1557.parse, annotations/nw/wsj/15/wsj_1558.parse
annotations/wb/sel/28/sel_2886.parse, annotations/wb/sel/28/sel_2884.parse
annotations/wb/sel/37/sel_3763.parse, annotations/wb/sel/30/sel_3022.parse
annotations/wb/sel/51/sel_5169.parse, annotations/wb/sel/41/sel_4156.parse
annotations/wb/sel/63/sel_6382.parse, annotations/wb/sel/48/sel_4847.parse
annotations/wb/sel/60/sel_6047.parse, annotations/wb/sel/74/sel_7478.parse
annotations/wb/sel/94/sel_9487.parse, annotations/wb/sel/51/sel_5110.parse
annotations/wb/sel/42/sel_4283.parse, annotations/wb/sel/95/sel_9585.parse
annotations/wb/sel/77/sel_7752.parse, annotations/wb/sel/97/sel_9779.parse
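The code used to build the clusters above is not reproduced on this page. As a rough sketch, pairs like the ones in the two lists can be grouped by treating each pair as an edge and merging files transitively with a union-find structure. The function name cluster_pairs and the input format (an iterable of two-element tuples) are assumptions made for this example.

```python
def cluster_pairs(pairs):
    """Group items into clusters given (file1, file2) pairs, merging transitively."""
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # merge the two clusters

    clusters = {}
    for item in parent:
        clusters.setdefault(find(item), set()).add(item)
    return list(clusters.values())


if __name__ == "__main__":
    pairs = [
        ("annotations/nw/wsj/14/wsj_1486.parse", "annotations/nw/wsj/06/wsj_0680.parse"),
        ("annotations/wb/sel/46/sel_4653.parse", "annotations/wb/sel/46/sel_4648.parse"),
        ("annotations/wb/sel/46/sel_4660.parse", "annotations/wb/sel/46/sel_4648.parse"),
    ]
    for cluster in cluster_pairs(pairs):
        print(sorted(cluster))
```

With this kind of transitive merging, a pair whose files already belong to a larger cluster is absorbed into it. For example, wsj_1557 and wsj_1558 both appear in the large wsj cluster, and sel_4283 and sel_9585 both appear in the four-file sel cluster, so this sketch would not report those pairs as separate clusters and its count can differ from the table.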

See also

References

  1. Weischedel, R., Palmer, M., Marcus, M., Hovy, E., Pradhan, S., Ramshaw, L., … Houston, A. (2013). OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium.
  2. Pradhan, S., Moschitti, A., Xue, N., Ng, H. T., Bjorkelund, A., Uryupina, O., … Zhong, Z. (2013). Towards Robust Linguistic Analysis Using OntoNotes. Proceedings of the Seventeenth Conference on Computational Natural Language Learning, 143–152.
  3. Weischedel, R., Palmer, M., Marcus, M., Hovy, E., Pradhan, S., Ramshaw, L., … Houston, A. (2013). OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium.
  4. Discussion on Google Group "Koc University Artificial Intelligence Laboratory", 21 Dec 2013