== Algorithm ==

=== Spotting ===
They use the extended set of labels in [http://wiki.dbpedia.org/Lexicalizations the lexicalization dataset] to create a lexicon for spotting. Spotting is performed with the LingPipe Exact Dictionary-Based Chunker<ref>Alias-i. LingPipe 4.0.0. <nowiki>[http://alias-i.com/lingpipe retrieved on 24.08.2010]</nowiki>, 2008.</ref>, which relies on the Aho-Corasick string matching algorithm<ref>A. V. Aho and M. J. Corasick. Efficient string matching: an aid to bibliographic search. Commun. ACM, 18:333–340, June 1975.</ref> with longest case-insensitive match.
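
For illustration, here is a minimal Python sketch of dictionary-based spotting with longest case-insensitive match. The actual system uses LingPipe's Aho-Corasick chunker in Java; this naive scanner only demonstrates the matching policy, and the lexicon entries and function names are hypothetical.
<syntaxhighlight lang="python">
# Naive longest-match, case-insensitive dictionary spotter.
# The real implementation uses LingPipe's Aho-Corasick chunker;
# the lexicon below is a hypothetical stand-in.
LEXICON = {"berlin", "berlin wall", "wall"}

def spot(text, lexicon=LEXICON):
    tokens = text.split()
    max_len = max(len(entry.split()) for entry in lexicon)
    spots, i = [], 0
    while i < len(tokens):
        match = None
        # Try the longest window first so longer surface forms win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate.lower() in lexicon:
                match = (i, candidate)
                break
        if match:
            spots.append(match)
            i += n  # skip past the matched spot
        else:
            i += 1
    return spots

print(spot("The Berlin Wall fell in 1989"))
# [(1, 'Berlin Wall')] -- the longest match wins over 'Wall'
</syntaxhighlight>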
   
 
==== Ignoring common words ====
A configuration flag can instruct the system to disregard at this stage any spots that are composed only of verbs, adjectives, adverbs and prepositions. The part-of-speech tagger used was the LingPipe implementation based on Hidden Markov Models.
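
As a sketch of this filter, the snippet below uses NLTK's part-of-speech tagger as a stand-in for LingPipe's HMM tagger (an assumption; the original system is Java-based), dropping any spot whose tokens are all verbs, adjectives, adverbs or prepositions.
<syntaxhighlight lang="python">
# Drop spots composed only of verbs (VB*), adjectives (JJ*),
# adverbs (RB*) and prepositions (IN). NLTK is used here purely
# as a stand-in for LingPipe's HMM-based tagger.
import nltk  # requires nltk.download('averaged_perceptron_tagger')

COMMON_TAGS = ("VB", "JJ", "RB", "IN")  # Penn Treebank tag prefixes

def keep_spot(spot_text):
    tags = [tag for _, tag in nltk.pos_tag(spot_text.split())]
    return not all(tag.startswith(COMMON_TAGS) for tag in tags)

print(keep_spot("said"))         # False: verb only, discarded
print(keep_spot("Berlin Wall"))  # True: contains nouns, kept
</syntaxhighlight>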
   
 
=== Candidate selection ===
This stage narrows down the space of disambiguation possibilities.
   
 
=== Modeling DBpedia resources ===
# Aggregate all paragraphs mentioning each concept in Wikipedia.
# Compute Term Frequency (TF), representing the relevance of a word for a given resource.
# Compute the Inverse Candidate Frequency (ICF) weight. The intuition behind ICF is that the discriminative power of a word is inversely proportional to the number of DBpedia resources it is associated with: <math>ICF(w_j) = \log \frac{|R_s|}{n(w_j)}</math>, where <math>R_s</math> is the set of candidate resources for a surface form <math>s</math> and <math>n(w_j)</math> is the total number of resources in <math>R_s</math> that are associated with the word <math>w_j</math>.
# Create a Vector Space Model (VSM) with TF*ICF weights (a toy computation is sketched below).
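
The following toy computation illustrates these steps under hypothetical data (the resource names, context words and <code>tf_icf_vectors</code> helper are invented for the example; the real model aggregates full Wikipedia paragraphs):
<syntaxhighlight lang="python">
# Toy TF*ICF vector space model. Context paragraphs are
# hypothetical stand-ins for aggregated Wikipedia paragraphs.
import math
from collections import Counter

contexts = {  # candidate resources for the surface form "Washington"
    "George_Washington": "president army state general",
    "Washington_(state)": "state seattle coast state",
}

def tf_icf_vectors(contexts):
    R_s = len(contexts)  # |R_s|: number of candidate resources
    tf = {r: Counter(text.split()) for r, text in contexts.items()}
    # n(w): number of candidate resources associated with word w
    n = Counter(w for counts in tf.values() for w in counts)
    # ICF(w) = log(|R_s| / n(w)); a word shared by every candidate
    # (here "state") gets weight 0 -- no discriminative power.
    return {r: {w: c * math.log(R_s / n[w]) for w, c in counts.items()}
            for r, counts in tf.items()}

for resource, vector in tf_icf_vectors(contexts).items():
    print(resource, vector)
</syntaxhighlight>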
   
 
=== Disambiguation ===
Candidate resources are ranked according to the similarity score between their context vectors and the context surrounding the surface form. Cosine was used as the similarity measure.
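
A minimal sketch of this ranking step, assuming the TF*ICF vectors from the previous sketch and a hypothetical bag-of-words vector for the text surrounding the surface form:
<syntaxhighlight lang="python">
# Rank candidates by cosine similarity between their TF*ICF
# context vectors and the vector of the surrounding text.
# All vectors below are hypothetical example data.
import math

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in set(u) & set(v))
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def rank(candidates, context_vector):
    return sorted(candidates.items(),
                  key=lambda kv: cosine(kv[1], context_vector),
                  reverse=True)

candidates = {
    "George_Washington": {"president": 1.4, "army": 0.7},
    "Washington_(state)": {"seattle": 0.7, "coast": 0.7},
}
context = {"president": 1, "army": 1}  # words around the spot
print(rank(candidates, context)[0][0])  # -> George_Washington
</syntaxhighlight>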
 
== References ==
<references />
