Natural Language Understanding Wiki

A statistical language model assigns a probability to a sequence of m words by means of a probability distribution. Having a way to estimate the relative likelihood of different phrases is useful in many natural language processing applications. Language modeling is used in speech recognition, machine translation, part-of-speech tagging, parsing, handwriting recognition, information retrieval and other applications.
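As a minimal sketch of what "assigning a probability to a sequence" can mean in practice, the following toy bigram model multiplies maximum-likelihood estimates of each word given the previous one. The corpus, the `<s>` start marker and the function names are illustrative assumptions, not part of any particular toolkit, and no smoothing is applied.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences; <s> marks the sentence start."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts

def sequence_probability(words, context_counts, bigram_counts):
    """P(w_1..w_m) as a product of unsmoothed bigram probabilities P(w_i | w_{i-1})."""
    prob = 1.0
    prev = "<s>"
    for w in words:
        if context_counts[prev] == 0:
            return 0.0  # context never seen in training: no estimate without smoothing
        prob *= bigram_counts[(prev, w)] / context_counts[prev]
        prev = w
    return prob

# Toy corpus (hypothetical data, for illustration only).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
context_counts, bigram_counts = train_bigram(corpus)

p_seen = sequence_probability(["the", "cat", "sat"], context_counts, bigram_counts)
p_unseen = sequence_probability(["sat", "the", "cat"], context_counts, bigram_counts)
```

Here the word order seen in training gets a higher probability than the scrambled one, which is the "relative likelihood of different phrases" that applications rely on. Real systems add smoothing so that unseen sequences do not receive probability zero.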

TODO: combining corpus and knowledge bases:



Based on informational content

  • Cache-based LMs[1]: use a cache window to store statistics about the recent history. The motivation is that humans tend to use language in a bursty way: a word that occurred in the recent history has a higher chance of occurring again in the near future.
  • Class-based LMs[2] and topic-based LMs[3] exploit the clustering of training data to improve language models.
  • Structured LMs[4] directly embed the syntactic structure of language into language models.
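The cache idea from the list above can be sketched as a linear interpolation between a static model and the relative frequencies in a sliding window of recent words. This is a simplified unigram illustration, not the model of Kuhn and De Mori; the class name, `cache_size` and `cache_weight` are assumptions made for the example.

```python
from collections import Counter, deque

class CacheUnigramLM:
    """Static unigram model interpolated with a cache of recently seen words."""

    def __init__(self, base_probs, cache_size=200, cache_weight=0.1):
        self.base_probs = base_probs            # static P(w), assumed to sum to 1
        self.cache = deque(maxlen=cache_size)   # sliding window over recent words
        self.cache_counts = Counter()
        self.cache_weight = cache_weight

    def prob(self, word):
        base = self.base_probs.get(word, 0.0)
        if not self.cache:
            return base                         # nothing observed yet: static model only
        cache_p = self.cache_counts[word] / len(self.cache)
        return (1 - self.cache_weight) * base + self.cache_weight * cache_p

    def observe(self, word):
        # When the window is full, the oldest word is about to fall out of it.
        if len(self.cache) == self.cache.maxlen:
            self.cache_counts[self.cache[0]] -= 1
        self.cache.append(word)
        self.cache_counts[word] += 1

# Hypothetical static probabilities, for illustration only.
base = {"the": 0.5, "language": 0.3, "model": 0.1, "cache": 0.1}
lm = CacheUnigramLM(base, cache_size=3, cache_weight=0.5)

before = lm.prob("cache")
for w in ["the", "cache", "model"]:
    lm.observe(w)
after = lm.prob("cache")
```

After "cache" has been observed once in a window of three words, its probability rises above the static estimate, which is the bursty behaviour the cache is meant to capture. Once the word drops out of the window, its probability falls back toward a scaled-down static estimate.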

More about structured LMs: "Many researchers attempt to go beyond the word-based language model and augment the translation system with syntax-based language models. Charniak, Knight, and Yamada (2003) design a CFG-based syntax language model for translation output reranking. Shen, Xu, and Weischedel (2008)[5] propose a dependency language model for the hierarchical phrase-based system (Chiang 2007). Post and Gildea (2009), Xiao, Zhu, and Zhu (2011) and Zhang, Zhai, and Zong (2013) propose a tree substitution grammar based syntax language model for the string-to-tree translation model. However, these syntax-based language models much increase the decoding time and they are very difficult to be integrated into the phrase-based translation systems which just generate translation outputs phrase by phrase."

Based on information encoding scheme



According to Shi (2014)[8]:

"To measure the quality of a language model, one method is to estimate the logarithm likelihood LP(W) of test data with n words, which are assumed to be drawn from the true data distribution.

The negative value of this quantity, i.e., −LP(W), is the cross-entropy. In information theory [91], the cross-entropy H(p, q) of p and q measures how close a probability model q comes to the true model p of some random variable X, which is formulated as:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$




According to Shi (2014)[8]:

"The most commonly used measure for language models is perplexity. The perplexity PL of a language model is calculated as the geometric average of the inverse probability of the words on the test data:

$$PL = \sqrt[n]{\prod_{i=1}^{n} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})}}$$

where n is the number of words in the test data. Perplexity is highly correlated with cross-entropy. It actually can be seen as exponential of entropy. Note that in most cases, the true model is unknown. Therefore perplexity can be viewed as an empirical estimate of the cross-entropy.

Perplexity can be the measure for both the language and models. As the measure for the language, it estimates the complexity of a language. When it is considered as the measure for models, it shows how close the model is to the “true” model represented by the test data. The lower the perplexity, the better the model is.

It is important to keep in mind that perplexity is not suitable for measuring language models using un-normalized probabilities. Also perplexity cannot be used to compare language models that were constructed on different vocabularies. In these situations, other measures should be chosen."
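The relationship between cross-entropy and perplexity described above can be sketched in a few lines. The function names are illustrative, the per-word probabilities are made-up values standing in for a model's predictions on test data, and the base-2 logarithm is an assumption (the quoted text does not fix the base).

```python
import math

def cross_entropy(word_probs):
    """Empirical cross-entropy in bits per word: the negative average log2-probability.

    Assumes every probability is strictly positive; math.log2(0) raises ValueError.
    """
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

def perplexity(word_probs):
    """Geometric average of the inverse word probabilities, i.e. 2 ** cross-entropy."""
    return 2 ** cross_entropy(word_probs)

# A model that spreads probability uniformly over 4 choices at every position.
uniform = [0.25] * 8
ppl_uniform = perplexity(uniform)

# A model that is more confident about the words that actually occur.
sharper = [0.5, 0.25, 0.5, 0.125]
ppl_sharper = perplexity(sharper)
```

The uniform model has a perplexity of 4, matching the intuition of "four equally likely choices per word", while the more confident model scores lower. Both values are only comparable because the two toy models are assumed to share a vocabulary and to be properly normalized.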

Word prediction accuracy

According to Shi (2014)[8]:

"Word prediction has applications in natural language processing, such as augmentative and alternative communication, spelling correction, word and sentence auto completion, etc. Typically word prediction provides one word or a list of words which fit the context best. This function can be realized by language models as a side product. Looking at this from the other side, word prediction accuracy provides a measure of the performance of language models. Word prediction accuracy is calculated as follows:

$$\mathrm{WPA} = \frac{C}{N}$$

where C is the number of words that are correctly predicted. N is the total number of words in the testing. Similar to WER, word prediction accuracy (WPA) is also correlated with perplexity. Intuitively, perplexity can be thought of as the average number of choices a language model has to make. The smaller the number of choices, the higher the word prediction accuracy is. Usually low perplexity co-occurs with a high WPA. However, there are also counterexamples in the literature [159]. Compared with perplexity, WPA has less constraints. It can be applied to measure unnormalized language models. It can also be applied to compare language models constructed from different vocabularies, which happens often in adaptive language models. Compared with WER, the computation of which is speech recognizer dependent, WPA is much easier to calculate. WPA does not have extra dependencies, which makes it suitable to compare language models used in different speech recognizers, i.e. at different research sites."
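A minimal sketch of the WPA computation, assuming the model is exposed as a function that returns its single best guess for the next word given the preceding words. The `predict_next` interface and the lookup-table predictor are hypothetical stand-ins for a real language model.

```python
def word_prediction_accuracy(predict_next, words):
    """Fraction C / N of test words that the model predicts correctly from their history."""
    correct = 0
    for i, target in enumerate(words):
        if predict_next(words[:i]) == target:
            correct += 1
    return correct / len(words)

# Toy predictor: looks only at the previous word (None at the start of the text).
most_likely_next = {None: "the", "the": "cat", "cat": "sat", "sat": "down"}

def toy_predictor(history):
    prev = history[-1] if history else None
    return most_likely_next.get(prev)  # None when the context is unknown

test_words = ["the", "cat", "ran", "down"]
wpa = word_prediction_accuracy(toy_predictor, test_words)
```

The predictor gets "the" and "cat" right and misses "ran" and "down", so C = 2 and N = 4. Because only the top guess is compared with the actual word, the scores never need to be normalized probabilities, which is why WPA also applies to unnormalized models.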

Word error rate

According to Shi (2014)[8]:

"In speech recognition, the performance of language models is also assessed by word error rate (WER), which is defined as

$$\mathrm{WER} = \frac{S + D + I}{N}$$

where S, D and I are the number of substitutions, deletions and insertions, respectively, when the prediction hypotheses are aligned with the ground truth according to a minimum edit distance [and N is the number of words in the ground truth].

WER is the measure that comes from speech recognition systems. In order to calculate a WER, a complete speech recognizer is needed. Compared with the calculation of perplexity, WER is more expensive. The WER results are noisy, because speech recognition performance also depends on the quality of acoustic models. Usually low perplexity implies low word error rate. However, this is not always true[9][10]. Ultimately, the quality of language models must be measured by their effect on real applications. When comparing different language models on the same well constructed speech recognition systems, the WER is an informative metric."
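Given a recognizer's output, the WER itself can be computed with a standard word-level edit distance. The sketch below assumes both strings are already tokenized into word lists and that the reference is non-empty; the function name and example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """(S + D + I) / N from a minimum edit distance alignment over words."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = minimum number of edits turning reference[:i] into hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(m + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[n][m] / n  # assumes a non-empty reference

reference = "the cat sat on the mat".split()
hypothesis = "the cat sit on mat".split()
wer = word_error_rate(reference, hypothesis)
```

In this example the hypothesis has one substitution ("sat" recognized as "sit") and one deletion (the second "the"), so the WER is 2/6. Since insertions are counted against the reference length, WER can exceed 1 for very poor hypotheses.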

External links

Wikipedia has an article on: Language model


  1. Roland Kuhn and Renato de Mori. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12 (6):570–583, 1990.
  2. Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479, 1992.
  3. Daniel Gildea and Thomas Hofmann. Topic-based language models using EM. In Proceedings of EUROSPEECH, pages 2167–2170, 1999.
  4. Ciprian Chelba. A structured language model. In Association for Computational Linguistics, pages 498–500, 1997.
  5. Shen, L., Xu, J., & Weischedel, R. M. (2008, June). A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Language Model. In ACL (pp. 577-585).
  6. Mnih, Andriy and Geoffrey Hinton (2007). “Three new graphical models for statistical language modelling”. In: Proceedings of the 24th international conference on Machine learning. ICML ’07. Corvallis, Oregon: ACM, pp. 641–648.
  7. Arisoy, E., Chen, S. F., Ramabhadran, B., & Sethy, A. (2014). Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(1), 184-192.
  8. Shi (2014). Language Models with Meta-information.
  9. Stanley Chen, Douglas Beeferman, and Ronald Rosenfeld. Evaluation Metrics for Language Models. In DARPA Broadcast News Transcription and Understanding Workshop (BNTUW), February 1998.
  10. Rukmini Iyer, Mari Ostendorf, and Marie Meteer. Analyzing and predicting language model improvements. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, pages 254–261, 1997.