"Documents for manual annotation were collected from the links within home pages of popular sites belonging to a handful of domains that included sports, entertainment, science and technology, and health (sources: http://news.google.com/ and http://www.espnstar.com/).
On an annotation GUI, candidate spots are highlighted to differentiate between pending and already annotated spots. Clicking on a spot drops down a list of possible disambiguations. Hovering on a specific Wikipedia label shows an excerpt from the definition paragraph of the corresponding entity.
Volunteers were told to be as exhaustive as possible and tag all possible segments, even if to mark them as NA. The number of distinct Wikipedia entities that were linked to was about 3,800. About 40% of the spots was labeled NA."
No inter-annotator agreement figure was available, as only about 8% of the spots were tagged by more than one person.
|Number of documents||107|
|Total number of spots||17,200|
|Spots per 100 tokens||30|
|Average ambiguity per spot||5.3|
|#Spots tagged by more than one person||1,390|
|#NA among these spots||524|
|#Spots with disagreement||278|
|#Spots with disagreement involving NA||218|
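The counts in the table above are enough to derive a few rough ratios for the multiply-annotated subset. The sketch below is only illustrative arithmetic: the variable names and the `pct` helper are made up for this example, and the "raw agreement" figure is a plain percentage over the doubly-tagged spots, not a chance-corrected statistic such as kappa and not a number reported by the authors.

```python
# Counts copied from the statistics table; names are illustrative only.
total_spots = 17_200
multi_tagged = 1_390           # spots tagged by more than one person
na_among_multi = 524           # NA labels among those spots
disagreements = 278            # multiply-tagged spots with disagreement
disagreements_with_na = 218    # disagreements where one label was NA


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


summary = {
    # Share of all spots that received more than one annotation.
    "multi_tagged_share": pct(multi_tagged, total_spots),
    # Share of multiply-tagged spots labeled NA.
    "na_share_among_multi": pct(na_among_multi, multi_tagged),
    # Share of multiply-tagged spots on which annotators disagreed.
    "disagreement_rate": pct(disagreements, multi_tagged),
    # Raw (not chance-corrected) agreement on the multiply-tagged subset.
    "raw_agreement": pct(multi_tagged - disagreements, multi_tagged),
    # Share of disagreements that involve an NA label.
    "na_share_of_disagreements": pct(disagreements_with_na, disagreements),
}

for name, value in summary.items():
    print(f"{name}: {value}%")
```

Under these counts, most of the observed disagreements involve an NA label, which suggests that deciding whether a spot should be linked at all was harder for annotators than choosing among candidate entities.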
Kulkarni, S., Singh, A., Ramakrishnan, G., & Chakrabarti, S. (2009, June). Collective annotation of Wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 457-466). ACM.