An exponential language model, or maximum entropy language model, uses the following formula to express the conditional probability of a word $ w_i $ given its context $ h_i $:

$ P(w_i|h_i) = \frac{1}{Z(h_i)} \exp\left(\sum_j \lambda_j f_j (h_i, w_i )\right) $,

where $ \lambda_j $ are the parameters, $ f_j(h_i, w_i) $ are arbitrary functions of the pair $ (h_i, w_i) $, and $ Z(h_i) $ is a normalization factor:

$ Z(h) = \sum_{w \in V} \exp\left( \sum_j \lambda_j f_j(h, w) \right). $
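The formula above can be sketched directly in code. This is a minimal illustration, not a trained model: the feature functions, weights, and tiny vocabulary below are made-up assumptions chosen only to show how the score, the normalizer $ Z(h) $, and the resulting probability fit together.

```python
import math

# Illustrative feature functions f_j(h, w); h is the context (a list of
# preceding words) and w is the candidate next word. Both are assumptions
# for this sketch, not features from the cited papers.
def f0(h, w):
    # Unigram-style feature: fires for one particular word.
    return 1.0 if w == "cat" else 0.0

def f1(h, w):
    # Bigram-style feature: fires when the context ends in "the".
    return 1.0 if h and h[-1] == "the" else 0.0

FEATURES = [f0, f1]
LAMBDAS = [0.5, 1.2]            # parameters lambda_j (made-up values)
VOCAB = ["the", "cat", "sat"]   # toy vocabulary V

def score(h, w):
    # sum_j lambda_j * f_j(h, w)
    return sum(l * f(h, w) for l, f in zip(LAMBDAS, FEATURES))

def prob(h, w):
    # P(w|h) = exp(score(h, w)) / Z(h), with Z(h) summing over the vocabulary
    z = sum(math.exp(score(h, v)) for v in VOCAB)
    return math.exp(score(h, w)) / z
```

Because $ Z(h) $ sums over the whole vocabulary, the probabilities for any fixed context sum to one; here the "cat" feature makes $ P(\text{cat} \mid \text{the}) $ the largest.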

The parameters $ \lambda_j $ are learned from the training data according to the maximum entropy principle. The approach was first introduced into language modeling by Della Pietra et al. (1992)[1] and later systematically investigated by Rosenfeld (1996)[2].

Most neural network LMs use a softmax output layer and can therefore be viewed as exponential LMs, albeit with sophisticated, learned feature templates.

References

  1. Stephen A. Della Pietra, Vincent J. Della Pietra, Robert L. Mercer, and Salim Roukos. Adaptive language modeling using minimum discriminant estimation. In Proceedings of the workshop on Speech and Natural Language, pages 103–106, 1992.
  2. Ronald Rosenfeld. A maximum entropy approach to adaptive statistical language modeling. Computer Speech and Language, 10(3):187–228, 1996.