Main references:
- OntoNotes 4.0: TODO
- OntoNotes 5.0: Weischedel et al. (2013)[1]
Download:
Genres
OntoNotes is composed of several "genres" (more precisely, sources), as follows (Pradhan et al. 2013[2], Weischedel et al. 2013[3]):
- bc: broadcast conversation
- bn: broadcast news
- mz: magazine genre (Sinorama magazine)
- nw: newswire genre
- pt: pivot text (250K English translation of the New Testament annotated with parse, proposition, name and coreference; and about 100K parses for a portion of the Old Testament)
- tc: telephone conversation (CallHome corpus)
- wb: web data (85K of single sentences selected to improve sense coverage)
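The genre code also appears as the directory directly under `annotations/` in the file paths listed in the Errata section (for example `annotations/nw/wsj/01/wsj_0190.parse`). As a minimal sketch, assuming that layout holds for every file, the genre of a file can be read from its path. The names `GENRES` and `genre_of` are hypothetical helpers introduced here for illustration, not part of any OntoNotes tooling.

```python
from pathlib import PurePosixPath

# Genre codes as listed above.
GENRES = {
    "bc": "broadcast conversation",
    "bn": "broadcast news",
    "mz": "magazine",
    "nw": "newswire",
    "pt": "pivot text",
    "tc": "telephone conversation",
    "wb": "web data",
}


def genre_of(path):
    """Return (code, description) for an OntoNotes annotation path.

    Assumes the genre code is the path component that directly
    follows 'annotations'. Raises ValueError if 'annotations' is
    missing and KeyError if the code is not a known genre.
    """
    parts = PurePosixPath(path).parts
    code = parts[parts.index("annotations") + 1]
    return code, GENRES[code]


print(genre_of("annotations/nw/wsj/01/wsj_0190.parse"))
print(genre_of("annotations/wb/sel/28/sel_2886.parse"))
```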
Views
Word sense view
Word senses in OntoNotes are built from WordNet 3.0 senses plus a small number of new senses based on human-readable dictionaries. Some abbreviations that appear in sense inventory files (e.g. /data/files/data/english/metadata/sense-inventories/coursework-n.xml) are:[4]
- MAC: Macmillan dictionary
- MW online: Merriam Webster online
Errata
Using NearDup, I found 46 pairs of exact duplicates in OntoNotes 5.0 and another 47 pairs of files that overlap by more than 90% (a small portion of the roughly 13k files). The files in each pair are annotations of the same underlying document; the annotations themselves are likely to contain slight variations.
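How NearDup measures overlap is not described here. As a rough illustration only, one common way to score how much two documents overlap is the Jaccard similarity of their word n-gram sets ("shingles"). The sketch below assumes whitespace tokenization and 3-word shingles; both choices, and the function names, are assumptions for illustration and not NearDup's method.

```python
def shingles(tokens, n=3):
    """Return the set of all n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def overlap(text1, text2, n=3):
    """Shingle-based overlap score in [0, 1] for two plain-text documents."""
    return jaccard(shingles(text1.split(), n), shingles(text2.split(), n))


doc_a = "the company said it expects higher profit this year"
doc_b = "the company said it expects lower profit this year"
print(overlap(doc_a, doc_a))  # identical texts score 1.0
print(overlap(doc_a, doc_b))  # one changed word lowers the score
```

Under such a measure, a pair would be flagged as a near-duplicate when the score exceeds a threshold such as 0.9.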
List of exact duplicates:
File 1 | File 2 |
---|---|
annotations/nw/wsj/03/wsj_0364.parse | annotations/nw/wsj/01/wsj_0190.parse |
annotations/nw/wsj/05/wsj_0511.parse | annotations/nw/wsj/01/wsj_0190.parse |
annotations/nw/wsj/05/wsj_0511.parse | annotations/nw/wsj/03/wsj_0364.parse |
annotations/nw/wsj/06/wsj_0696.parse | annotations/nw/wsj/01/wsj_0190.parse |
annotations/nw/wsj/06/wsj_0696.parse | annotations/nw/wsj/03/wsj_0364.parse |
annotations/nw/wsj/06/wsj_0696.parse | annotations/nw/wsj/05/wsj_0511.parse |
annotations/nw/wsj/10/wsj_1056.parse | annotations/nw/wsj/01/wsj_0190.parse |
annotations/nw/wsj/10/wsj_1056.parse | annotations/nw/wsj/03/wsj_0364.parse |
annotations/nw/wsj/10/wsj_1056.parse | annotations/nw/wsj/05/wsj_0511.parse |
annotations/nw/wsj/10/wsj_1056.parse | annotations/nw/wsj/06/wsj_0696.parse |
annotations/nw/wsj/12/wsj_1228.parse | annotations/nw/wsj/01/wsj_0190.parse |
annotations/nw/wsj/12/wsj_1228.parse | annotations/nw/wsj/03/wsj_0364.parse |
annotations/nw/wsj/12/wsj_1228.parse | annotations/nw/wsj/05/wsj_0511.parse |
annotations/nw/wsj/12/wsj_1228.parse | annotations/nw/wsj/06/wsj_0696.parse |
annotations/nw/wsj/12/wsj_1228.parse | annotations/nw/wsj/10/wsj_1056.parse |
annotations/nw/wsj/13/wsj_1382.parse | annotations/nw/wsj/01/wsj_0190.parse |
annotations/nw/wsj/13/wsj_1382.parse | annotations/nw/wsj/03/wsj_0364.parse |
annotations/nw/wsj/13/wsj_1382.parse | annotations/nw/wsj/05/wsj_0511.parse |
annotations/nw/wsj/13/wsj_1382.parse | annotations/nw/wsj/06/wsj_0696.parse |
annotations/nw/wsj/13/wsj_1382.parse | annotations/nw/wsj/10/wsj_1056.parse |
annotations/nw/wsj/13/wsj_1382.parse | annotations/nw/wsj/12/wsj_1228.parse |
annotations/nw/wsj/15/wsj_1558.parse | annotations/nw/wsj/15/wsj_1557.parse |
annotations/nw/wsj/19/wsj_1941.parse | annotations/nw/wsj/01/wsj_0190.parse |
annotations/nw/wsj/19/wsj_1941.parse | annotations/nw/wsj/03/wsj_0364.parse |
annotations/nw/wsj/19/wsj_1941.parse | annotations/nw/wsj/05/wsj_0511.parse |
annotations/nw/wsj/19/wsj_1941.parse | annotations/nw/wsj/06/wsj_0696.parse |
annotations/nw/wsj/19/wsj_1941.parse | annotations/nw/wsj/10/wsj_1056.parse |
annotations/nw/wsj/19/wsj_1941.parse | annotations/nw/wsj/12/wsj_1228.parse |
annotations/nw/wsj/19/wsj_1941.parse | annotations/nw/wsj/13/wsj_1382.parse |
annotations/nw/wsj/21/wsj_2114.parse | annotations/nw/wsj/01/wsj_0190.parse |
annotations/nw/wsj/21/wsj_2114.parse | annotations/nw/wsj/03/wsj_0364.parse |
annotations/nw/wsj/21/wsj_2114.parse | annotations/nw/wsj/05/wsj_0511.parse |
annotations/nw/wsj/21/wsj_2114.parse | annotations/nw/wsj/06/wsj_0696.parse |
annotations/nw/wsj/21/wsj_2114.parse | annotations/nw/wsj/10/wsj_1056.parse |
annotations/nw/wsj/21/wsj_2114.parse | annotations/nw/wsj/12/wsj_1228.parse |
annotations/nw/wsj/21/wsj_2114.parse | annotations/nw/wsj/13/wsj_1382.parse |
annotations/nw/wsj/21/wsj_2114.parse | annotations/nw/wsj/19/wsj_1941.parse |
annotations/wb/sel/28/sel_2886.parse | annotations/wb/sel/28/sel_2884.parse |
annotations/wb/sel/37/sel_3763.parse | annotations/wb/sel/30/sel_3022.parse |
annotations/wb/sel/51/sel_5169.parse | annotations/wb/sel/41/sel_4156.parse |
annotations/wb/sel/63/sel_6382.parse | annotations/wb/sel/48/sel_4847.parse |
annotations/wb/sel/74/sel_7478.parse | annotations/wb/sel/60/sel_6047.parse |
annotations/wb/sel/94/sel_9487.parse | annotations/wb/sel/51/sel_5110.parse |
annotations/wb/sel/95/sel_9585.parse | annotations/wb/sel/42/sel_4283.parse |
annotations/wb/sel/95/sel_9589.parse | annotations/wb/sel/43/sel_4300.parse |
annotations/wb/sel/97/sel_9779.parse | annotations/wb/sel/77/sel_7752.parse |
List of files with more than 90% overlap:
File 1 | File 2 |
---|---|
annotations/nw/wsj/14/wsj_1486.parse | annotations/nw/wsj/06/wsj_0680.parse |
annotations/nw/wsj/15/wsj_1557.parse | annotations/nw/wsj/01/wsj_0190.parse |
annotations/nw/wsj/15/wsj_1557.parse | annotations/nw/wsj/03/wsj_0364.parse |
annotations/nw/wsj/15/wsj_1557.parse | annotations/nw/wsj/05/wsj_0511.parse |
annotations/nw/wsj/15/wsj_1557.parse | annotations/nw/wsj/06/wsj_0696.parse |
annotations/nw/wsj/15/wsj_1557.parse | annotations/nw/wsj/10/wsj_1056.parse |
annotations/nw/wsj/15/wsj_1557.parse | annotations/nw/wsj/12/wsj_1228.parse |
annotations/nw/wsj/15/wsj_1557.parse | annotations/nw/wsj/13/wsj_1382.parse |
annotations/nw/wsj/15/wsj_1558.parse | annotations/nw/wsj/01/wsj_0190.parse |
annotations/nw/wsj/15/wsj_1558.parse | annotations/nw/wsj/03/wsj_0364.parse |
annotations/nw/wsj/15/wsj_1558.parse | annotations/nw/wsj/05/wsj_0511.parse |
annotations/nw/wsj/15/wsj_1558.parse | annotations/nw/wsj/06/wsj_0696.parse |
annotations/nw/wsj/15/wsj_1558.parse | annotations/nw/wsj/10/wsj_1056.parse |
annotations/nw/wsj/15/wsj_1558.parse | annotations/nw/wsj/12/wsj_1228.parse |
annotations/nw/wsj/15/wsj_1558.parse | annotations/nw/wsj/13/wsj_1382.parse |
annotations/nw/wsj/18/wsj_1862.parse | annotations/nw/wsj/12/wsj_1259.parse |
annotations/nw/wsj/19/wsj_1941.parse | annotations/nw/wsj/15/wsj_1557.parse |
annotations/nw/wsj/19/wsj_1941.parse | annotations/nw/wsj/15/wsj_1558.parse |
annotations/nw/wsj/21/wsj_2114.parse | annotations/nw/wsj/15/wsj_1557.parse |
annotations/nw/wsj/21/wsj_2114.parse | annotations/nw/wsj/15/wsj_1558.parse |
annotations/wb/sel/38/sel_3878.parse | annotations/wb/sel/38/sel_3875.parse |
annotations/wb/sel/43/sel_4300.parse | annotations/wb/sel/42/sel_4283.parse |
annotations/wb/sel/46/sel_4653.parse | annotations/wb/sel/46/sel_4648.parse |
annotations/wb/sel/46/sel_4660.parse | annotations/wb/sel/46/sel_4648.parse |
annotations/wb/sel/46/sel_4660.parse | annotations/wb/sel/46/sel_4653.parse |
annotations/wb/sel/48/sel_4844.parse | annotations/wb/sel/48/sel_4836.parse |
annotations/wb/sel/49/sel_4975.parse | annotations/wb/sel/49/sel_4974.parse |
annotations/wb/sel/51/sel_5196.parse | annotations/wb/sel/51/sel_5185.parse |
annotations/wb/sel/53/sel_5342.parse | annotations/wb/sel/53/sel_5341.parse |
annotations/wb/sel/60/sel_6042.parse | annotations/wb/sel/60/sel_6040.parse |
annotations/wb/sel/63/sel_6399.parse | annotations/wb/sel/63/sel_6398.parse |
annotations/wb/sel/65/sel_6591.parse | annotations/wb/sel/65/sel_6585.parse |
annotations/wb/sel/67/sel_6786.parse | annotations/wb/sel/67/sel_6785.parse |
annotations/wb/sel/76/sel_7698.parse | annotations/wb/sel/76/sel_7697.parse |
annotations/wb/sel/79/sel_7943.parse | annotations/wb/sel/79/sel_7942.parse |
annotations/wb/sel/80/sel_8018.parse | annotations/wb/sel/80/sel_8011.parse |
annotations/wb/sel/82/sel_8232.parse | annotations/wb/sel/81/sel_8149.parse |
annotations/wb/sel/82/sel_8234.parse | annotations/wb/sel/81/sel_8168.parse |
annotations/wb/sel/87/sel_8703.parse | annotations/wb/sel/87/sel_8702.parse |
annotations/wb/sel/87/sel_8771.parse | annotations/wb/sel/87/sel_8756.parse |
annotations/wb/sel/87/sel_8787.parse | annotations/wb/sel/87/sel_8766.parse |
annotations/wb/sel/90/sel_9002.parse | annotations/wb/sel/89/sel_8992.parse |
annotations/wb/sel/91/sel_9189.parse | annotations/wb/sel/91/sel_9186.parse |
annotations/wb/sel/93/sel_9372.parse | annotations/wb/sel/93/sel_9369.parse |
annotations/wb/sel/95/sel_9585.parse | annotations/wb/sel/43/sel_4300.parse |
annotations/wb/sel/95/sel_9589.parse | annotations/wb/sel/42/sel_4283.parse |
annotations/wb/sel/95/sel_9589.parse | annotations/wb/sel/95/sel_9585.parse |
All the documents above form 34 clusters (see the code to create the clusters):
Cluster of duplicate or near-duplicate documents |
---|
annotations/nw/wsj/14/wsj_1486.parse, annotations/nw/wsj/06/wsj_0680.parse |
annotations/nw/wsj/12/wsj_1259.parse, annotations/nw/wsj/18/wsj_1862.parse |
annotations/nw/wsj/03/wsj_0364.parse, annotations/nw/wsj/12/wsj_1228.parse, annotations/nw/wsj/06/wsj_0696.parse, annotations/nw/wsj/10/wsj_1056.parse, annotations/nw/wsj/21/wsj_2114.parse, annotations/nw/wsj/05/wsj_0511.parse, annotations/nw/wsj/01/wsj_0190.parse, annotations/nw/wsj/13/wsj_1382.parse, annotations/nw/wsj/15/wsj_1557.parse, annotations/nw/wsj/19/wsj_1941.parse, annotations/nw/wsj/15/wsj_1558.parse |
annotations/wb/sel/38/sel_3875.parse, annotations/wb/sel/38/sel_3878.parse |
annotations/wb/sel/46/sel_4660.parse, annotations/wb/sel/46/sel_4653.parse, annotations/wb/sel/46/sel_4648.parse |
annotations/wb/sel/48/sel_4844.parse, annotations/wb/sel/48/sel_4836.parse |
annotations/wb/sel/49/sel_4975.parse, annotations/wb/sel/49/sel_4974.parse |
annotations/wb/sel/51/sel_5185.parse, annotations/wb/sel/51/sel_5196.parse |
annotations/wb/sel/53/sel_5341.parse, annotations/wb/sel/53/sel_5342.parse |
annotations/wb/sel/60/sel_6040.parse, annotations/wb/sel/60/sel_6042.parse |
annotations/wb/sel/63/sel_6398.parse, annotations/wb/sel/63/sel_6399.parse |
annotations/wb/sel/65/sel_6585.parse, annotations/wb/sel/65/sel_6591.parse |
annotations/wb/sel/67/sel_6785.parse, annotations/wb/sel/67/sel_6786.parse |
annotations/wb/sel/76/sel_7698.parse, annotations/wb/sel/76/sel_7697.parse |
annotations/wb/sel/79/sel_7942.parse, annotations/wb/sel/79/sel_7943.parse |
annotations/wb/sel/80/sel_8011.parse, annotations/wb/sel/80/sel_8018.parse |
annotations/wb/sel/81/sel_8149.parse, annotations/wb/sel/82/sel_8232.parse |
annotations/wb/sel/82/sel_8234.parse, annotations/wb/sel/81/sel_8168.parse |
annotations/wb/sel/87/sel_8702.parse, annotations/wb/sel/87/sel_8703.parse |
annotations/wb/sel/87/sel_8771.parse, annotations/wb/sel/87/sel_8756.parse |
annotations/wb/sel/87/sel_8787.parse, annotations/wb/sel/87/sel_8766.parse |
annotations/wb/sel/89/sel_8992.parse, annotations/wb/sel/90/sel_9002.parse |
annotations/wb/sel/91/sel_9189.parse, annotations/wb/sel/91/sel_9186.parse |
annotations/wb/sel/93/sel_9372.parse, annotations/wb/sel/93/sel_9369.parse |
annotations/wb/sel/95/sel_9589.parse, annotations/wb/sel/95/sel_9585.parse, annotations/wb/sel/43/sel_4300.parse, annotations/wb/sel/42/sel_4283.parse |
annotations/nw/wsj/15/wsj_1557.parse, annotations/nw/wsj/15/wsj_1558.parse |
annotations/wb/sel/28/sel_2886.parse, annotations/wb/sel/28/sel_2884.parse |
annotations/wb/sel/37/sel_3763.parse, annotations/wb/sel/30/sel_3022.parse |
annotations/wb/sel/51/sel_5169.parse, annotations/wb/sel/41/sel_4156.parse |
annotations/wb/sel/63/sel_6382.parse, annotations/wb/sel/48/sel_4847.parse |
annotations/wb/sel/60/sel_6047.parse, annotations/wb/sel/74/sel_7478.parse |
annotations/wb/sel/94/sel_9487.parse, annotations/wb/sel/51/sel_5110.parse |
annotations/wb/sel/42/sel_4283.parse, annotations/wb/sel/95/sel_9585.parse |
annotations/wb/sel/77/sel_7752.parse, annotations/wb/sel/97/sel_9779.parse |
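The clustering code referred to above is not reproduced on this page. One plausible way to group pairwise matches into clusters is to treat each pair as an edge and take connected components with a union-find structure. The sketch below is an assumption about the approach, not the original code, and its output need not match the table above row for row (for example, it never lists the same file in two clusters). The function name `cluster_pairs` is hypothetical.

```python
from collections import defaultdict


def cluster_pairs(pairs):
    """Group items linked by (a, b) pairs into connected components."""
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk to the root,
        # shortening the path along the way (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for item in list(parent):
        groups[find(item)].add(item)
    # Sort for a deterministic result.
    return sorted(sorted(group) for group in groups.values())


# Three near-duplicate pairs taken from the list above, plus one unrelated pair.
pairs = [
    ("annotations/wb/sel/46/sel_4653.parse", "annotations/wb/sel/46/sel_4648.parse"),
    ("annotations/wb/sel/46/sel_4660.parse", "annotations/wb/sel/46/sel_4648.parse"),
    ("annotations/wb/sel/46/sel_4660.parse", "annotations/wb/sel/46/sel_4653.parse"),
    ("annotations/wb/sel/48/sel_4844.parse", "annotations/wb/sel/48/sel_4836.parse"),
]
for cluster in cluster_pairs(pairs):
    print(", ".join(cluster))
```

Feeding both the exact-duplicate pairs and the near-duplicate pairs into the same call merges any clusters that share a file.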
See also
- Jeff Kaufman's "Consistency in OntoNotes"
- Counting events in OntoNotes 5.0
References
- [1] Weischedel, R., Palmer, M., Marcus, M., Hovy, E., Pradhan, S., Ramshaw, L., … Houston, A. (2013). OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium.
- [2] Pradhan, S., Moschitti, A., Xue, N., Ng, H. T., Björkelund, A., Uryupina, O., … Zhong, Z. (2013). Towards Robust Linguistic Analysis Using OntoNotes. Proceedings of the Seventeenth Conference on Computational Natural Language Learning, 143–152.
- [3] Weischedel, R., Palmer, M., Marcus, M., Hovy, E., Pradhan, S., Ramshaw, L., … Houston, A. (2013). OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium.
- [4] Discussion on Google Group "Koc University Artificial Intelligence Laboratory", 21 Dec 2013