Natural Language Understanding Wiki

Download

See [1].

Annotation procedure

Figure: IITB corpus, browser-based annotation GUI. From Kulkarni et al. (2009).

"Documents for manual annotation were collected from the links within home pages of popular sites belonging to a handful of domains that included sports, entertainment, science and technology, and health (sources: http://news.google.com/ and http://www.espnstar.com/).

On an annotation GUI, candidate spots are highlighted to differentiate between pending and already annotated spots. Clicking on a spot drops down a list of possible disambiguations. Hovering on a specific Wikipedia label shows an excerpt from the definition paragraph of the corresponding entity.

Volunteers were told to be as exhaustive as possible and tag all possible segments, even if to mark them as NA. The number of distinct Wikipedia entities that were linked to was about 3,800. About 40% of the spots was labeled NA."

Quality

No inter-annotator agreement figure is available because only about 8% of spots (1,390 of 17,200) were tagged by more than one person.

Corpus statistics:

Number of documents                               107
Total number of spots                          17,200
Spots per 100 tokens                               30
Average ambiguity per spot                        5.3
Spots tagged by more than one person            1,390
NA among these spots                              524
Spots with disagreement                           278
Spots with disagreement involving NA              218
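
Although no agreement figure is reported, a few raw percentages can be derived from the corpus statistics above. The following is a minimal sketch, not part of the original paper: the variable names and the pct helper are illustrative assumptions, and the result is plain observed agreement on the doubly tagged spots, not a chance-corrected measure such as Cohen's kappa.

```python
# Figures copied from the corpus statistics (Kulkarni et al., 2009).
total_spots = 17_200
multi_tagged = 1_390           # spots tagged by more than one person
na_among_multi = 524           # NA labels among those spots
disagreements = 278            # multiply tagged spots with disagreement
disagreements_with_na = 218    # disagreements in which NA was involved


def pct(part, whole):
    """Hypothetical helper: part/whole as a percentage, one decimal place."""
    return round(100.0 * part / whole, 1)


multi_tagged_share = pct(multi_tagged, total_spots)              # 8.1
na_share_of_multi = pct(na_among_multi, multi_tagged)            # 37.7
raw_disagreement = pct(disagreements, multi_tagged)              # 20.0
raw_agreement = round(100.0 - raw_disagreement, 1)               # 80.0
na_share_of_disagreements = pct(disagreements_with_na, disagreements)  # 78.4

print(f"Spots tagged more than once: {multi_tagged_share}%")
print(f"Raw agreement on those spots: {raw_agreement}%")
print(f"Disagreements involving NA: {na_share_of_disagreements}%")
```

Under these assumptions, annotators agreed on roughly 80% of the spots that were tagged more than once, and about 78% of the disagreements involved an NA label, so most conflicts concern whether a spot should be linked at all.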

Citation

Kulkarni, S., Singh, A., Ramakrishnan, G., & Chakrabarti, S. (2009, June). Collective annotation of Wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 457-466). ACM.
