== Algorithm ==

=== Spotting ===
They use the extended set of labels in [http://wiki.dbpedia.org/Lexicalizations the lexicalization dataset] to create a lexicon for spotting. Spotting is performed with the LingPipe Exact Dictionary-Based Chunker<ref>Alias-i. LingPipe 4.0.0. <nowiki>[http://alias-i.com/lingpipe retrieved on 24.08.2010]</nowiki>, 2008.</ref>, which relies on the Aho-Corasick string matching algorithm<ref>A. V. Aho and M. J. Corasick. Efficient string matching: an aid to bibliographic search. Commun. ACM, 18:333–340, June 1975.</ref> with longest case-insensitive match.
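
For illustration, here is a minimal Python sketch of dictionary-based spotting with longest case-insensitive match. The actual system uses LingPipe's Aho-Corasick chunker in Java; this naive scanner only demonstrates the matching policy, and the lexicon entries and function names are hypothetical.
<syntaxhighlight lang="python">
# Naive longest-match, case-insensitive dictionary spotter.
# The real implementation uses LingPipe's Aho-Corasick chunker;
# the lexicon below is a hypothetical stand-in.
LEXICON = {"berlin", "berlin wall", "wall"}

def spot(text, lexicon=LEXICON):
    tokens = text.split()
    max_len = max(len(entry.split()) for entry in lexicon)
    spots, i = [], 0
    while i < len(tokens):
        match = None
        # Try the longest window first so longer surface forms win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate.lower() in lexicon:
                match = (i, candidate)
                break
        if match:
            spots.append(match)
            i += n  # skip past the matched spot
        else:
            i += 1
    return spots

print(spot("The Berlin Wall fell in 1989"))
# [(1, 'Berlin Wall')] -- the longest match wins over 'Wall'
</syntaxhighlight>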
   
 
==== Ignoring common words ====
A configuration flag can instruct the system to disregard at this stage any spots that are composed only of verbs, adjectives, adverbs and prepositions. The part-of-speech tagger used was the LingPipe implementation based on Hidden Markov Models.
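
As a sketch of this filter, the snippet below uses NLTK's part-of-speech tagger as a stand-in for LingPipe's HMM tagger (an assumption; the original system is Java-based), dropping any spot whose tokens are all verbs, adjectives, adverbs or prepositions.
<syntaxhighlight lang="python">
# Drop spots composed only of verbs (VB*), adjectives (JJ*),
# adverbs (RB*) and prepositions (IN). NLTK is used here purely
# as a stand-in for LingPipe's HMM-based tagger.
import nltk  # requires nltk.download('averaged_perceptron_tagger')

COMMON_TAGS = ("VB", "JJ", "RB", "IN")  # Penn Treebank tag prefixes

def keep_spot(spot_text):
    tags = [tag for _, tag in nltk.pos_tag(spot_text.split())]
    return not all(tag.startswith(COMMON_TAGS) for tag in tags)

print(keep_spot("said"))         # False: verb only, discarded
print(keep_spot("Berlin Wall"))  # True: contains nouns, kept
</syntaxhighlight>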
   
 
=== Candidate selection ===
This stage narrows down the space of disambiguation possibilities.
   
 
=== Modeling DBpedia resources ===
# Aggregate all paragraphs mentioning each concept in Wikipedia.
# Compute Term Frequency (TF), representing the relevance of a word for a given resource.
# Compute the Inverse Candidate Frequency (ICF) weight. The intuition behind ICF is that the discriminative power of a word is inversely proportional to the number of DBpedia resources it is associated with: <math>ICF(w_j) = \log \frac{|R_s|}{n(w_j)}</math>, where <math>R_s</math> is the set of candidate resources for a surface form <math>s</math> and <math>n(w_j)</math> is the total number of resources in <math>R_s</math> that are associated with the word <math>w_j</math>.
# Create a Vector Space Model (VSM) with TF*ICF weights (a toy computation is sketched below).
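
The following toy computation illustrates these steps under hypothetical data (the resource names, context words and <code>tf_icf_vectors</code> helper are invented for the example; the real model aggregates full Wikipedia paragraphs):
<syntaxhighlight lang="python">
# Toy TF*ICF vector space model. Context paragraphs are
# hypothetical stand-ins for aggregated Wikipedia paragraphs.
import math
from collections import Counter

contexts = {  # candidate resources for the surface form "Washington"
    "George_Washington": "president army state general",
    "Washington_(state)": "state seattle coast state",
}

def tf_icf_vectors(contexts):
    R_s = len(contexts)  # |R_s|: number of candidate resources
    tf = {r: Counter(text.split()) for r, text in contexts.items()}
    # n(w): number of candidate resources associated with word w
    n = Counter(w for counts in tf.values() for w in counts)
    # ICF(w) = log(|R_s| / n(w)); a word shared by every candidate
    # (here "state") gets weight 0 -- no discriminative power.
    return {r: {w: c * math.log(R_s / n[w]) for w, c in counts.items()}
            for r, counts in tf.items()}

for resource, vector in tf_icf_vectors(contexts).items():
    print(resource, vector)
</syntaxhighlight>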
   
 
=== Disambiguation ===
Candidate resources are ranked according to the similarity score between their context vectors and the context surrounding the surface form. Cosine was used as the similarity measure.
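
A minimal sketch of this ranking step, assuming the TF*ICF vectors from the previous sketch and a hypothetical bag-of-words vector for the text surrounding the surface form:
<syntaxhighlight lang="python">
# Rank candidates by cosine similarity between their TF*ICF
# context vectors and the vector of the surrounding text.
# All vectors below are hypothetical example data.
import math

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in set(u) & set(v))
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def rank(candidates, context_vector):
    return sorted(candidates.items(),
                  key=lambda kv: cosine(kv[1], context_vector),
                  reverse=True)

candidates = {
    "George_Washington": {"president": 1.4, "army": 0.7},
    "Washington_(state)": {"seattle": 0.7, "coast": 0.7},
}
context = {"president": 1, "army": 1}  # words around the spot
print(rank(candidates, context)[0][0])  # -> George_Washington
</syntaxhighlight>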
 
== References ==
<references />
