## Regularization

"The maximum-entropy (ME) principle, which prescribes choosing the model that maximizes the entropy out of all models that satisfy given feature constraints, can be seen as a built-in regularization mechanism that avoids overfitting the training data."[1]

"For ME models, the use of an l2 regularizer, corresponding to imposing a Gaussian prior on the parameter val- ues, has been proposed by Johnson et al. (1999) and Chen and Rosenfeld (1999). Feature selection for ME models has commonly used simple frequency- based cut-off, or likelihood-based feature induction as introduced by Della Pietra et al. (1997)."[1]

"Tibshirani (1996) proposed a technique based on l1 regularization that embeds feature selection into regularization such that both a precise assessment of the reliability of features and the decision about in- clusion or deletion of features can be done in the same framework."[1]

"a combined incremental feature selection and regularization method can be established for maximum entropy modeling by a natural incorporation of the regularizer into gradient-based feature selection, following Perkins et al. (2003)."[1]

## Optimization

Malouf (2002)[2] surveys parameter estimation algorithms for maximum entropy models, comparing "Generalized Iterative Scaling and Improved Iterative Scaling, as well as general purposed optimization techniques such as gradient ascent, conjugate gradient, and variable metric methods".
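
As a minimal illustration of the simplest of these methods, the sketch below fits a small conditional maximum entropy model by batch gradient ascent on the unregularized log-likelihood. The data, the feature definition, the step size, and the number of steps are made up for the example; it is not a reproduction of Malouf's experimental setup.

```python
import math

LABELS = ["walk", "drive"]
# Hypothetical toy data: (context, label) pairs.
DATA = [("sunny", "walk"), ("sunny", "walk"), ("sunny", "drive"),
        ("rainy", "drive"), ("rainy", "drive"), ("rainy", "walk")]


def features(x, y):
    """Active features for (x, y): a single indicator per pair."""
    return [(x, y)]


def probs(weights, x):
    """p(y | x) for every label, in the order of LABELS."""
    scores = [sum(weights.get(f, 0.0) for f in features(x, y))
              for y in LABELS]
    m = max(scores)                        # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def gradient(weights, data):
    """Empirical feature counts minus model-expected feature counts."""
    grad = {}
    for x, y in data:
        for f in features(x, y):
            grad[f] = grad.get(f, 0.0) + 1.0
        for y2, p in zip(LABELS, probs(weights, x)):
            for f in features(x, y2):
                grad[f] = grad.get(f, 0.0) - p
    return grad


def train(data, steps=200, lr=0.5):
    """Plain batch gradient ascent with a fixed step size."""
    weights = {}
    for _ in range(steps):
        for f, g in gradient(weights, data).items():
            weights[f] = weights.get(f, 0.0) + lr * g
    return weights


weights = train(DATA)
p_walk_sunny = probs(weights, "sunny")[0]
p_drive_rainy = probs(weights, "rainy")[1]
print(round(p_walk_sunny, 4), round(p_drive_rainy, 4))
```

With one indicator per (context, label) pair the fitted probabilities approach the relative frequencies in the data, which gives a quick sanity check on the gradient.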

### Big "vocabulary" cases

Johnson et al. (1999)[3] dealt with the probability of syntactic structures. Finding the normalization term is difficult because it entails summing over the infinite set of possible syntactic structures. Two solutions have been proposed: Monte Carlo estimation (Abney, 1997[4]), which is not efficient, and pseudo-likelihood, i.e. estimating the normalization term (Johnson et al., 1999). I call this the "big vocabulary" problem because it resembles the corresponding problem in language modeling.
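
To show what estimating a normalization term by sampling looks like in the simplest case, the sketch below compares an exact sum with a uniform-sampling Monte Carlo estimate. It is a deliberately simplified stand-in: the outcome space is a finite set of bit strings rather than an infinite set of syntactic structures, the single feature and its weight are hypothetical, and uniform sampling is an assumption of this example, not the sampler used by Abney (1997).

```python
import itertools
import math
import random

N_BITS = 12      # outcome space: all 2**12 bit strings (finite stand-in)
WEIGHT = 0.3     # hypothetical weight on the feature "number of ones"


def score(bits):
    """Unnormalized log-score of one outcome."""
    return WEIGHT * sum(bits)


def exact_z():
    """Normalization term by brute-force enumeration of every outcome."""
    return sum(math.exp(score(b))
               for b in itertools.product((0, 1), repeat=N_BITS))


def monte_carlo_z(n_samples, rng):
    """Estimate Z as |space| times the mean of exp(score) over uniform draws."""
    total = 0.0
    for _ in range(n_samples):
        bits = [rng.randint(0, 1) for _ in range(N_BITS)]
        total += math.exp(score(bits))
    return (2 ** N_BITS) * total / n_samples


rng = random.Random(0)          # fixed seed so runs are repeatable
z_exact = exact_z()
z_mc = monte_carlo_z(20000, rng)
print(z_exact, z_mc)
```

Enumeration is only possible here because the space is tiny. The sampling estimate needs many draws to get close, which is consistent with the efficiency concern raised above.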

## Applications

• Language model: Chen et al. (1999)[5]
• Sentence boundary detection: Ratnaparkhi (1998)[6]
• POS tagging: Ratnaparkhi (1998)[6]
• Syntax parsing: Ratnaparkhi (1998)[6]
• Unsupervised prepositional phrase attachment: Ratnaparkhi (1998)[6]
• Recommender system: Jin et al. (2005)[7]
• Named-entity recognition: Borthwick (1999)[8], Curran and Clark (2003)[9]
• Semantic role labeling:
  • PropBank-style: Che et al. (2009)[10] ("During the SRC stage, a Maximum entropy (Berger et al., 1996) classifier is used to predict the probabilities of a word in the sentence")
  • NomBank: Jiang and Ng (2006)[11]
  • FrameNet-style: Fleischman et al. (2003)[12], Das et al. (2010)[13]
• Coreference resolution: Culotta et al. (2006)[14]

## References

1. Stefan Riezler, and Alexander Vasserman. "Incremental Feature Selection and l1 Regularization for Relaxed Maximum-Entropy Modeling." EMNLP. 2004.
2. Malouf, Robert. "A comparison of algorithms for maximum entropy parameter estimation." Proceedings of the 6th conference on Natural language learning-Volume 20. Association for Computational Linguistics, 2002.
3. Johnson, Mark, et al. "Estimators for stochastic unification-based grammars." Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics. Association for Computational Linguistics, 1999.
4. Steven P. Abney. 1997. Stochastic Attribute-Value Grammars. Computational Linguistics, 23(4):597–617.
5. Chen, Stanley F., and Ronald Rosenfeld. "Efficient sampling and feature selection in whole sentence maximum entropy language models." Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on. Vol. 1. IEEE, 1999.
6. Adwait Ratnaparkhi. Maximum entropy models for natural language ambiguity resolution. Diss. University of Pennsylvania, 1998.
7. Jin, Xin, Yanzan Zhou, and Bamshad Mobasher. "A maximum entropy web recommendation system: combining collaborative and content features." Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005.
8. Borthwick, Andrew. A maximum entropy approach to named entity recognition. Diss. New York University, 1999.
9. Curran, James R., and Stephen Clark. "Language independent NER using a maximum entropy tagger." Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4. Association for Computational Linguistics, 2003.
10. Che, W., Li, Z., Li, Y., Guo, Y., Qin, B., & Liu, T. (2009, June). Multilingual dependency-based syntactic and semantic parsing. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task (pp. 49-54). Association for Computational Linguistics.
11. Jiang, Zheng Ping, and Hwee Tou Ng. "Semantic role labeling of NomBank: A maximum entropy approach." Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006.
12. M. Fleischman, N. Kwon, and E. Hovy. 2003. Maximum entropy models for FrameNet classification. In Proc. of EMNLP.
13. Das, D., Schneider, N., Chen, D., & Smith, N. A. (2010). Probabilistic frame-semantic parsing. HLT ’10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 3(June), 948–956. Retrieved from http://dl.acm.org/citation.cfm?id=1858136 http://dl.acm.org/citation.cfm?id=1857999.1858136
14. Culotta, Aron, et al. "First-order probabilistic models for coreference resolution." (2006).