Selectional preference

From Roberts and Egg (2014)^[1]: "Selectional preferences (Katz and Fodor, 1963; Wilks, 1975^[2]; Resnik, 1993^[3]) are the tendency for a word to semantically select or constrain which other words may appear in a direct syntactic relation with it." When this selection is expressed in binary terms (allowed/not allowed), it is also called a selectional restriction (Ó Séaghdha and Korhonen, 2014)^[4]. Selectional preferences (SP) can be contrasted with verb subcategorization, "with subcategorization describing the syntactic arguments taken by a verb, and selectional preferences describing the semantic preferences verbs have for their arguments" (Van de Cruys et al., 2012)^[5].

Approaches

Classified based on granularity

From Zapirain et al. (2013)^[6]: "In one extreme, we would have a small set of coarse semantic classes. For instance, some authors have used the 26 so-called “semantic fields” used to classify all nouns in WordNet (Agirre, Baldwin, and Martinez 2008; Agirre et al. 2011). The classification could be more fine-grained, as defined by theWordNet hierarchy (Resnik 1993b; Agirre and Martinez 2001; McCarthy and Carroll 2003), and other lexical resources could be used as well. Other authors have used automatically induced hierarchical word classes, clustered according to occurrence information from corpora (Koo, Carreras, and Collins 2008; Ratinov and Roth 2009). On the other extreme, each word would be its own semantic class, as in the lexical model, but one could also model selectional preference using distributional similarity (Grefenstette 1992; Lin 1998; Pantel and Lin 2000; Erk 2007; Bergsma, Lin, and Goebel 2008)."

Cluster-based (using WordNet)

"Resnik (1996)^[7] relies on WordNet synsets in order to generate gener- alized noun clusters. The selectional preference strength of a specific verb v in a particular relation is calculated by computing the Kullback-Leibler divergence between the cluster distribution of the verb and the prior cluster distribution."

$S_{R}(v)=\sum _{c}p(c|v)\log {\frac {p(c|v)}{p(c)}}$

"where c stands for a noun cluster, and R stands for a given predicate-argument relation. The selectional association of a particular noun cluster is then the contribution of that cluster to the verb’s preference strength."

$A_{R}(v,c)={\frac {p(c|v)\log {\frac {p(c|v)}{p(c)}}}{S_{R}(v)}}$

"The model’s generalization relies entirely on WordNet, and there is no generalization among the verbs."

"Li and Abe (1998) use the principle of Minimum Descrip- tion Length in order to find a suitable generalization level within the lexical WordNet hierarchy. A same intuition is used by Clark and Weir (2001), but they use hypothesis testing instead to find the appro- priate level of generalization. A recent approach that makes use of WordNet (in combination with Bayesian modeling) is the one by O ́ Se ́aghdha and Korhonen (2012)."

Cluster-based (using corpus)

"Rooth et al. (1999)^[8] propose an Expectation-Maximization (EM) clustering algorithm for selectional preference acquisition based on a probabilistic latent variable model. The idea is that both predicate v and argument o are generated from a latent variable c, where the latent variables represent clusters of tight verb-argument interactions."

$p(v,o)=\sum _{c\in C}p(c,v,o)=\sum _{c\in C}p(c)p(v|c)p(o|c)$
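
The latent-variable factorization above can be fitted with a few lines of EM. The sketch below is a minimal pure-Python version under stated assumptions: random initialization, a fixed number of iterations, no smoothing, and a made-up `toy_counts` table. The function names are hypothetical, and it is not the implementation of Rooth et al.

```python
import math
import random


def em_cluster(counts, n_clusters, n_iter=50, seed=0):
    """Fit p(c), p(v|c), p(o|c) to {(verb, obj): frequency} with EM."""
    rng = random.Random(seed)
    verbs = sorted({v for v, _ in counts})
    objs = sorted({o for _, o in counts})
    clusters = range(n_clusters)

    def random_dist(keys):
        weights = {k: rng.random() + 0.1 for k in keys}
        z = sum(weights.values())
        return {k: w / z for k, w in weights.items()}

    p_c = [1.0 / n_clusters] * n_clusters
    p_v = [random_dist(verbs) for _ in clusters]
    p_o = [random_dist(objs) for _ in clusters]
    log_likelihoods = []

    for _ in range(n_iter):
        # E-step: expected counts under the posterior p(c | v, o).
        exp_c = [0.0] * n_clusters
        exp_v = [dict.fromkeys(verbs, 0.0) for _ in clusters]
        exp_o = [dict.fromkeys(objs, 0.0) for _ in clusters]
        ll = 0.0
        for (v, o), freq in counts.items():
            joint = [p_c[c] * p_v[c][v] * p_o[c][o] for c in clusters]
            z = sum(joint)
            if z <= 0.0:
                continue
            ll += freq * math.log(z)
            for c in clusters:
                gamma = freq * joint[c] / z
                exp_c[c] += gamma
                exp_v[c][v] += gamma
                exp_o[c][o] += gamma
        log_likelihoods.append(ll)
        # M-step: renormalize the expected counts.
        total = sum(exp_c)
        for c in clusters:
            p_c[c] = exp_c[c] / total
            if exp_c[c] > 0.0:
                p_v[c] = {v: n / exp_c[c] for v, n in exp_v[c].items()}
                p_o[c] = {o: n / exp_c[c] for o, n in exp_o[c].items()}
    return p_c, p_v, p_o, log_likelihoods


def joint_prob(p_c, p_v, p_o, v, o):
    """p(v, o) = sum over c of p(c) p(v|c) p(o|c)."""
    return sum(p_c[c] * p_v[c].get(v, 0.0) * p_o[c].get(o, 0.0)
               for c in range(len(p_c)))


# Hypothetical toy counts: a "food" block and a "text" block.
toy_counts = {
    ("eat", "apple"): 5, ("eat", "bread"): 4, ("devour", "bread"): 3,
    ("read", "book"): 5, ("read", "letter"): 4, ("write", "letter"): 3,
}
p_c, p_v, p_o, log_likelihoods = em_cluster(toy_counts, n_clusters=2)
unseen_score = joint_prob(p_c, p_v, p_o, "devour", "apple")
```

Because both the verb and the object are generated from the shared cluster, an unseen pair such as (devour, apple) can still receive probability mass. EM only finds a local optimum, so which clusters emerge depends on the initialization.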

Classified based on model structure/algorithm

TODO: From Zapirain et al. (2013): "More recent work on distributional selectional preference has explored the use of discriminative models (Bergsma, Lin, and Goebel 2008) and topical models (Ó Séaghdha 2010; Ritter, Mausam, and Etzioni 2010)."

Language modeling

"It is tempting to assume that with a large enough corpus, preference learning reduces to a simple language modelling task that can be solved by counting predicate-argument co-occurrences. Indeed, Keller and Lapata (2003) show that relatively good performance at plausibility estimation can be attained by submitting queries to a Web search engine."

"O Seghdha (2010) shows that the Web-based approach is reliably outperformed by more complex models trained on smaller corpora for less frequent predicate-argument combinations."

Analogical reasoning (exemplar-based)

"Erk (2007)^[9] and Erk et al. (2010)^[10] describe a method that uses corpus-driven distributional similarity metrics for the induction of selectional preferences. The key idea is that a predicate-argument tuple (v,o) is felicitous if the predicate v appears in the training corpus with arguments o′ that are similar to o, i.e."

$S(v,o)=\sum _{o'\in O_{v}}{\frac {wt(v,o')}{Z(v)}}\cdot sim(o,o')$
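
The score above can be sketched as a weighted average of similarities to the arguments seen with the verb. The example assumes raw frequency as the weight wt(v,o′), Z(v) as the sum of those weights, and cosine similarity over hand-made two-dimensional vectors; Erk's papers consider several weighting schemes and similarity measures, so these choices are for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_score(v, o, seen_args, vectors):
    """S(v, o): weighted average similarity of o to the arguments seen with v.

    seen_args: {verb: {argument: weight}}; vectors: {word: list of floats}.
    """
    args = seen_args.get(v, {})
    z = sum(args.values())  # Z(v), assumed to be the sum of the weights
    if z == 0 or o not in vectors:
        return 0.0
    return sum(
        (w / z) * cosine(vectors[o], vectors[o2])
        for o2, w in args.items()
        if o2 in vectors
    )


# Hypothetical toy vectors and counts, not taken from any corpus.
toy_vectors = {
    "wine": [1.0, 0.1], "water": [0.9, 0.2],
    "beer": [0.95, 0.15], "book": [0.1, 1.0],
}
toy_seen = {"drink": {"wine": 2, "water": 3}}
beer_score = similarity_score("drink", "beer", toy_seen, toy_vectors)
book_score = similarity_score("drink", "book", toy_seen, toy_vectors)
```

Here "beer" was never seen as an object of "drink", but it scores higher than "book" because it is close to the seen arguments "wine" and "water".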

Linear classification

Bergsma et al. (2008)^[11]: SVM

Neural network

Van de Cruys (2014)^[12]

Tensor factorization

Van de Cruys et al. (2012) use 12-mode tensors to encode selectional preferences (and verb subcategorization). Each mode contains a different grammatical relation. TODO: how did they encode SP which is essentially semantic then?

Evaluation

Pseudo-disambiguation task (Rooth et al., 1999)

From Van de Cruys (2014)^[12]: "The task provides an adequate test of the generalization capabilities of our models. For the two-way case, the task is to judge which object (o or o′) is more likely for a particular verb v, where (v, o) is a tuple attested in the corpus, and o′ is a direct object randomly drawn from the object vocab- ulary. The tuple is considered correct if the model prefers the attested tuple (v,o) over (v,o′). For the three-way case, the task is to judge which subject (s or s′) and direct object (o or o′) are more likely for a particular verb v, where (v, s, o) is the attested tuple, and s′ and o′ are a random subject and object drawn from their respective vocabularies. The tuple is considered correct if the model prefers the attested tuple (v,s,o) over the alternatives (v,s,o′), (v,s′,o), and (v,s′,o′). Tables 1 and 2 respectively show a number of examples from the two-way and three-way pseudo-disambiguation task."

Table 1: Example of two-way pseudo-disambiguation
v	o	o′
perform	play	geometry
buy	wine	renaissance
read	introduction	peanut

Table 2: Example of three-way pseudo-disambiguation
v	s	o	s′	o′
win	team	game	diversity	egg
publish	government	document	grid	priest
develop	company	software	breakfast	landlord
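
The evaluation protocol quoted above amounts to a simple accuracy computation. The sketch below uses the rows of Tables 1 and 2 as test items; the two "oracle" scorers are hypothetical stand-ins for a trained model, and counting ties as errors is an assumption, since the quoted description does not say how ties are handled.

```python
def two_way_accuracy(score, items):
    """items: (v, o, o_random). Correct if the attested pair scores strictly higher."""
    correct = sum(1 for v, o, o_rand in items if score(v, o) > score(v, o_rand))
    return correct / len(items)


def three_way_accuracy(score, items):
    """items: (v, s, o, s_random, o_random).

    Correct if (v, s, o) scores strictly higher than all three corrupted tuples.
    """
    correct = 0
    for v, s, o, s_rand, o_rand in items:
        attested = score(v, s, o)
        alternatives = [(s, o_rand), (s_rand, o), (s_rand, o_rand)]
        if all(attested > score(v, a, b) for a, b in alternatives):
            correct += 1
    return correct / len(items)


# Test items copied from Tables 1 and 2.
two_way_items = [
    ("perform", "play", "geometry"),
    ("buy", "wine", "renaissance"),
    ("read", "introduction", "peanut"),
]
three_way_items = [
    ("win", "team", "game", "diversity", "egg"),
    ("publish", "government", "document", "grid", "priest"),
    ("develop", "company", "software", "breakfast", "landlord"),
]

# Hypothetical oracle scorers that simply know the attested tuples.
attested_pairs = {(v, o) for v, o, _ in two_way_items}
attested_triples = {(v, s, o) for v, s, o, _, _ in three_way_items}


def oracle_pair_score(v, o):
    return 1.0 if (v, o) in attested_pairs else 0.0


def oracle_triple_score(v, s, o):
    return 1.0 if (v, s, o) in attested_triples else 0.0


two_way_result = two_way_accuracy(oracle_pair_score, two_way_items)
three_way_result = three_way_accuracy(oracle_triple_score, three_way_items)
```

A real model would replace the oracle scorers; a scorer that returns the same value for every tuple gets an accuracy of zero under the strict comparison used here.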

There doesn't seem to be a standardized dataset for this evaluation. Van de Cruys (2014) created his own dataset from ukWaC (2 billion words)^[13] and didn't publish the tuples. It should be noted that although SP should be about semantic preference, he created tuples from syntactic (dependency) relations.

Semantic role classification task

[Figure: Table 11 of Zapirain et al. (2013), results of the semantic role classification task]

Zapirain et al. (2013)^[6]: "SP models will be used in isolation, according to the classification rule in Equation (11), to predict role labels for a set of (predicate, argument-head) pairs."

TODO: can I find it online?

Evaluate against human plausibility judgment

"This is a standard approach to selectional preference evaluation (Keller and Lapata, 2003; Brockmann and Lapata, 2003; O ́ Se ́aghdha, 2010) and arguably yields a better appraisal of a model’s intrinsic semantic quality than other evaluations such as pseudo-disambiguation or held-out likelihood prediction."

However, it isn't clear what is meant by "plausible". In Keller and Lapata (2003)^[14], "the concept of plausibility was not defined, but examples of plausible and implausible bigrams were given (different examples for each stimulus set)."

Keller and Lapata (2003)

"... a set of plausibility judgements collected by Keller and Lapata (2003)^[14]. This dataset comprises 180 predicate-argument combinations for each of three syntactic relations: verb-object, noun-noun modification and adjective-noun modification. The data for each relation is divided into a “seen” portion containing 90 combinations that were observed in the British National Corpus and an “unseen” portion containing 90 combinations that do not appear (though the predicates and arguments do appear separately).

Some methods have reached the upper bound of this dataset.^[15]

Wang et al. (2018)

3,062 subject-verb-object (S-V-O) triples with plausibility judgments from Mechanical Turk workers (Wang et al., 2018)^[16]

Application

Word sense disambiguation (Resnik 1993a; Agirre and Martinez 2001; McCarthy and Carroll 2003)
Pronoun resolution (Bergsma, Lin, and Goebel 2008)
Metaphor processing (Shutova et al., 2013)
Named-entity recognition (Ratinov and Roth 2009)
Information extraction: "improve the quality of inference and information extraction rules" (Pantel et al. 2007)
Textual inference: Ritter, Mausam, and Etzioni (2010; Section 4.3)^[17]

Parsing

(Hindle 1990; Resnik 1993b; Pantel and Lin 2000; Agirre, Baldwin, and Martinez 2008; Koo, Carreras, and Collins 2008; Agirre et al. 2011), Zhou et al. (2011)^[18], Mirroshandel and Nasr (2016)^[19]

Semantic role labeling

TODO: (Gildea and Jurafsky, 2002), (Zapirain et al., 2009)^[20], (Zapirain et al., 2010)^[21]

Zapirain et al. (2013)^[6] use an SP model directly in semantic role classification (i.e. the step after argument identification), choosing for each (predicate p, argument head w) pair the role r with the highest preference score:

$ROLE(p,w)=\arg \max _{r}SP(p,r,w)$

They propose special treatment for prepositional arguments, which is very important for this task.
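
The classification rule can be sketched as an argmax over candidate roles. Everything concrete below is an assumption made for illustration: the PropBank-style role labels, the scores in `toy_sp_table`, and the lookup-table scorer that stands in for a real SP model. The special handling of prepositional arguments is not modelled.

```python
def classify_role(sp, predicate, word, roles):
    """Return the role r that maximizes SP(predicate, r, word)."""
    return max(roles, key=lambda r: sp(predicate, r, word))


# Hypothetical preference scores for (predicate, role, argument head).
toy_sp_table = {
    ("eat", "Arg0", "child"): 0.70,
    ("eat", "Arg1", "child"): 0.05,
    ("eat", "Arg0", "apple"): 0.10,
    ("eat", "Arg1", "apple"): 0.80,
}


def toy_sp(predicate, role, word):
    """Stand-in SP model: table lookup with 0.0 for unknown triples."""
    return toy_sp_table.get((predicate, role, word), 0.0)


toy_roles = ["Arg0", "Arg1"]
role_of_apple = classify_role(toy_sp, "eat", "apple", toy_roles)
role_of_child = classify_role(toy_sp, "eat", "child", toy_roles)
```

With these made-up scores "apple" is labelled Arg1 and "child" Arg0. When all scores tie, `max` returns the first role in the list, so a real system would need a more principled fallback.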

References

[1] Roberts, Will, and Markus Egg. "A Comparison of Selectional Preference Models for Automatic Verb Classification." EMNLP. 2014.
[2] Yorick Wilks. 1975. An intelligent analyzer and understander of English. Communications of the ACM, 18(5):264–274.
[3] Philip Resnik. 1993. Selection and information: A class-based approach to lexical relationships. Ph.D. thesis, University of Pennsylvania.
[4] Ó Séaghdha, Diarmuid, and Anna Korhonen. "Probabilistic distributional semantics with latent variable models." Computational Linguistics 40.3 (2014): 587-631.
[5] Van de Cruys, T., Rimell, L., Poibeau, T., & Korhonen, A. (2012). Multi-way Tensor Factorization for Unsupervised Lexical Acquisition. Proceedings of COLING, 2(December), 2703–2720.
[6] Zapirain, B., Agirre, E., Màrquez, L., & Surdeanu, M. (2013). Selectional preferences for semantic role classification. Computational Linguistics, 39(3), 631-663.
[7] Philip Resnik. 1996. Selectional constraints: An information-theoretic model and its computational realization. Cognition, 61:127–159, November.
[8] Mats Rooth, Stefan Riezler, Detlef Prescher, Glenn Carroll, and Franz Beil. 1999. Inducing a semantically annotated lexicon via EM-based clustering. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 104–111. Association for Computational Linguistics.
[9] Erk, Katrin. "A simple, similarity-based model for selectional preferences." Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. 2007.
[10] Erk, Katrin, Sebastian Padó, and Ulrike Padó. "A flexible, corpus-driven model of regular and inverse selectional preferences." Computational Linguistics 36.4 (2010): 723-763.
[11] Shane Bergsma, Dekang Lin, and Randy Goebel. 2008. Discriminative learning of selectional preference from unlabeled text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 59–68. Association for Computational Linguistics.
[12] Van de Cruys, T. (2014). A neural network approach to selectional preference acquisition. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 26-35).
[13] Tim Van de Cruys's presentation at EMNLP 2014
[14] Frank Keller and Mirella Lapata. 2003. Using the Web to obtain frequencies for unseen bigrams. Computational Linguistics, 29(3):459–484.
[15] Ó Séaghdha, D., & Korhonen, A. (2014). Probabilistic distributional semantics with latent variable models. Computational Linguistics, 40(3), 587-631.
[16] Wang, S., Durrett, G., & Erk, K. (2018). Modeling Semantic Plausibility by Injecting World Knowledge. Retrieved from http://arxiv.org/abs/1804.00619
[17] Ritter, A., Mausam, & Etzioni, O. (2010). A Latent Dirichlet Allocation method for Selectional Preferences. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 423–434.
[18] Zhou, G., Zhao, J., Liu, K., & Cai, L. (2011). Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 1556–1565.
[19] Mirroshandel, Seyed Abolghasem, and Alexis Nasr. "Integrating selectional constraints and subcategorization frames in a dependency parser." Computational Linguistics (2016).
[20] Zapirain, Beñat, Eneko Agirre, and Lluís Màrquez. 2009. Generalizing over lexical features: Selectional preferences for semantic role classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing (ACL-IJCNLP-2009), pages 73–76, Suntec.
[21] Zapirain, B., Agirre, E., Màrquez, L., & Surdeanu, M. (2010, June). Improving semantic role classification with selectional preferences. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 373-376). Association for Computational Linguistics.
