Natural Language Understanding Wiki

TAC is a dataset for entity linking and slot filling prepared by the Linguistic Data Consortium (LDC). It is part of the Text Analysis Conference (TAC) Knowledge Base Population (KBP) evaluations (Simpson et al., 2010).

The KBP dataset consists of a reference knowledge base (KB) and a collection of documents that contain potential mentions of, and information about, the target entities for the KBP evaluation tasks.

All datasets are prepared by LDC, which ensures they are well-formed XML and can therefore be parsed with a standard XML parser. In 2009, the LDC released the first version of this dataset, containing 1,289,649 data files collected from various genres. The following year, 63,943 documents from a new web collection and 424,296 documents from the existing GALE web collection were added, giving a 2010 dataset of 1,777,888 documents for use in linking to a knowledge base. The 2009 and 2010 document counts are otherwise identical; the only difference is the web data added in 2010.
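As a rough illustration of the point about standard XML parsers, the sketch below reads one document with Python's built-in `xml.etree.ElementTree`. The sample document and its tag names (`DOC`, `DOCID`, `BODY`, `TEXT`) are assumptions made for this example and are not taken from the LDC release, so they would need to be adjusted to the real files. The last lines simply re-add the document counts quoted above.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document. The tag names are illustrative only
# and are NOT the documented LDC schema.
SAMPLE = """<DOC>
  <DOCID>example-doc-001</DOCID>
  <BODY>
    <TEXT>Sheffield is a city in South Yorkshire.</TEXT>
  </BODY>
</DOC>"""


def read_document(xml_string):
    """Return (doc_id, text) from one well-formed XML document."""
    root = ET.fromstring(xml_string)
    # findtext returns the default when the element is missing.
    doc_id = root.findtext("DOCID", default="").strip()
    text = root.findtext("BODY/TEXT", default="").strip()
    return doc_id, text


doc_id, text = read_document(SAMPLE)

# Document counts quoted in the text: the 2009 release plus the
# two web collections added in 2010.
docs_2009 = 1_289_649
docs_2010 = docs_2009 + 63_943 + 424_296
```

Because the files are well-formed, no custom tokenizer or repair step should be needed before parsing; only the element paths depend on the actual layout of the release.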

The second part of this dataset is the knowledge base. LDC used the October 2008 snapshot of Wikipedia to construct a reference KB of 818,741 entities to support TAC-KBP. Each entity has the following items:

Entity ID
A unique identifier for each entity in the knowledge base.
Wikipedia page title
A canonical name for the Wikipedia page.
Wikipedia page name
The title for the Wikipedia page.
Entity type
The type assigned to the entity: PER (person), ORG (organization), GPE (geo-political entity), or UKN (unknown). Every entity was first given the default type UKN, and the LDC then assigned types in a later processing phase. The assignment depends on the type of the article’s Infobox, if any: each Infobox name was mapped to the entity type most likely associated with it (e.g., entity E0009430, with Infobox School, is ORG). The assigned types are not 100% accurate (e.g., entity E0009382, with Infobox Company, is UKN). A breakdown of entity types and their frequencies in the KB is presented in the table below.
Type Entities
GPE 116,498
ORG 55,813
PER 114,523
UKN 531,907
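The entity record and the Infobox-based typing described above can be sketched as follows. This is a minimal illustration, not the LDC's actual procedure: the `KBEntity` class, the `assign_type` function, and the contents of `INFOBOX_TO_TYPE` are hypothetical, and only the Infobox School to ORG pair comes from the text. The per-type counts are copied from the table.

```python
from dataclasses import dataclass


@dataclass
class KBEntity:
    """One KB entry with the four items listed above (field names assumed)."""
    entity_id: str
    wiki_title: str
    name: str
    entity_type: str = "UKN"  # every entity starts as unknown


# Hypothetical lookup table. Only "Infobox School" -> ORG is taken from
# the text; a real table would cover many more Infobox names.
INFOBOX_TO_TYPE = {"Infobox School": "ORG"}


def assign_type(infobox_name):
    """Map an Infobox name to an entity type, falling back to UKN."""
    if infobox_name is None:
        return "UKN"  # article has no Infobox
    return INFOBOX_TO_TYPE.get(infobox_name, "UKN")


# Placeholder title and name; the ID is the example from the text.
school = KBEntity("E0009430", "Example_School", "Example School")
school.entity_type = assign_type("Infobox School")

# Counts from the table; they should add up to the size of the KB.
TYPE_COUNTS = {"GPE": 116_498, "ORG": 55_813, "PER": 114_523, "UKN": 531_907}
total_entities = sum(TYPE_COUNTS.values())
ukn_share = TYPE_COUNTS["UKN"] / total_entities
```

An Infobox name missing from the table, such as Infobox Company in this sketch, falls through to UKN, which is one plausible way an entity like E0009382 ends up untyped. The four counts sum to 818,741, matching the stated KB size, and roughly 65% of entities remain UKN.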


The TAC dataset was prepared for two tasks, “Entity Linking” and “Slot Filling”, which explains the large gap between the number of documents and the number of annotated mentions. The dataset is suitable for the entity linking (EL) task and for approaches that disambiguate a single named entity at a time. However, it is not suitable for evaluating collective disambiguation approaches, because not all named entity mentions in a query document are annotated.[1]

External links


  1. Alhelbawy, A. (2014). Collective Approaches to Named Entity Disambiguation (Doctoral dissertation, University of Sheffield).