IITB dataset

Download

See [1].

Annotation procedure

Figure: Browser-based annotation GUI for the IITB corpus. From Kulkarni et al. (2009).

"Documents for manual annotation were collected from the links within home pages of popular sites belonging to a handful of domains that included sports, entertainment, science and technology, and health (sources: http://news.google.com/ and http://www.espnstar.com/).

On an annotation GUI, candidate spots are highlighted to differentiate between pending and already annotated spots. Clicking on a spot drops down a list of possible disambiguations. Hovering on a specific Wikipedia label shows an excerpt from the definition paragraph of the corresponding entity.

Volunteers were told to be as exhaustive as possible and tag all possible segments, even if to mark them as NA. The number of distinct Wikipedia entities that were linked to was about 3,800. About 40% of the spots was labeled NA."

Quality

No inter-annotator agreement figure is available, as only about 8% of spots (1,390 of 17,200) were tagged by more than one person.

Corpus statistics.
Number of documents	107
Total number of spots	17,200
Spots per 100 tokens	30
Average ambiguity per spot	5.3
#Spots tagged by more than one person	1,390
#NA among these spots	524
#Spots with disagreement	278
#Spots with disagreement involving NA	218
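
Although no formal agreement score was reported, the table does allow some rough ratios to be derived. The sketch below is only an illustration of that arithmetic: the variable names and the `share` helper are made up for this example, the counts are copied from the table (the total of 17,200 may itself be rounded), and the resulting percentages are raw proportions, not a chance-corrected measure such as Cohen's kappa.

```python
# Counts copied from the corpus statistics table above.
total_spots = 17_200            # total number of spots (possibly rounded)
multi_tagged = 1_390            # spots tagged by more than one person
na_among_multi = 524            # NA labels among the multiply tagged spots
disagreements = 278             # multiply tagged spots with disagreement
disagreements_with_na = 218     # disagreements in which NA was involved


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Fraction of all spots that were seen by more than one annotator.
overlap_rate = share(multi_tagged, total_spots)

# Fraction of the multiply tagged spots that were labeled NA.
na_rate_multi = share(na_among_multi, multi_tagged)

# Raw disagreement rate on the multiply tagged spots.
disagreement_rate = share(disagreements, multi_tagged)

# Fraction of disagreements where at least one label was NA.
na_share_of_disagreements = share(disagreements_with_na, disagreements)

print(overlap_rate, na_rate_multi, disagreement_rate, na_share_of_disagreements)
```

Under these assumptions, roughly 8% of spots were multiply tagged, annotators disagreed on about one in five of those, and most of the disagreements involved the NA label.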

Citation

Kulkarni, S., Singh, A., Ramakrishnan, G., & Chakrabarti, S. (2009, June). Collective annotation of Wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 457-466). ACM.
