A statistical **language model** assigns a probability to a sequence of *m* words $ P(w_1,\ldots,w_m) $ by means of a probability distribution. Having a way to estimate the relative likelihood of different phrases is useful in many natural language processing applications. Language modeling is used in speech recognition, machine translation, part-of-speech tagging, parsing, handwriting recognition, information retrieval and other applications.

TODO: combining corpus and knowledge bases: http://aclweb.org/anthology/N/N15/N15-1165.pdf

TODO: https://arxiv.org/pdf/1611.01628.pdf

## Types

### Based on informational content

- Cache-based LMs^{[1]} use a cache window to store statistical information about the recent history. The motivation is that humans tend to use language in bursts: a word that occurred in the recent history has a higher chance of occurring again in the near future.
- Class-based LMs^{[2]} and topic-based LMs^{[3]} exploit the clustering of training data to improve language models.
- Structured LMs^{[4]} directly embed the syntactic structure of language into language models.
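
The cache idea in the first bullet can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the model of the cited paper: the class name `CacheUnigramLM`, the unigram background model, the window size and the interpolation weight `cache_weight` are all hypothetical choices made for the example.

```python
from collections import Counter, deque


class CacheUnigramLM:
    """Toy cache LM: interpolates a static unigram model with a recency cache."""

    def __init__(self, static_probs, cache_size=100, cache_weight=0.2):
        self.static_probs = static_probs        # word -> static probability
        self.cache = deque(maxlen=cache_size)   # sliding window of recent words
        self.counts = Counter()                 # word counts inside the window
        self.cache_weight = cache_weight        # assumed interpolation weight

    def prob(self, word):
        p_static = self.static_probs.get(word, 0.0)
        if not self.cache:
            # Nothing observed yet: fall back to the static model alone.
            return p_static
        p_cache = self.counts[word] / len(self.cache)
        return (1 - self.cache_weight) * p_static + self.cache_weight * p_cache

    def observe(self, word):
        # When the window is full, the oldest word is about to be dropped.
        if len(self.cache) == self.cache.maxlen:
            self.counts[self.cache[0]] -= 1
        self.cache.append(word)
        self.counts[word] += 1


static = {"the": 0.5, "cat": 0.25, "dog": 0.25}
lm = CacheUnigramLM(static, cache_size=4, cache_weight=0.2)
before = lm.prob("cat")   # static probability only
lm.observe("cat")
after = lm.prob("cat")    # boosted because "cat" was just seen
print(before, after)
```

After a word has been observed, its probability rises above the static estimate, which is the "bursty" behaviour the bullet describes.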

More about structured LMs:
"Many researchers attempt to go beyond the word-based language model and augment the translation system with syntax-based language models. Charniak, Knight, and Yamada (2003) design a CFG-based syntax language model for translation output reranking. Shen, Xu, and Weischedel (2008)^{[5]} propose a dependency language model for the hierarchical phrase-based system (Chiang 2007). Post and Gildea (2009), Xiao, Zhu, and Zhu (2011) and Zhang, Zhai, and Zong (2013) propose a tree substitution grammar based syntax language model for the string-to-tree translation model. However, these syntax-based language models much increase the decoding time and they are very difficult to be integrated into the phrase-based translation systems which just generate translation outputs phrase by phrase."

### Based on information encoding scheme

- N-gram LMs
- Decision tree LMs
- Random forest LMs

- Dynamic Bayesian LMs
- Exponential LMs
- Neural network LMs
- Log-bilinear LMs^{[6]}
- Converted LMs: for faster decoding, a feed-forward NNLM is converted into a back-off N-gram LM^{[7]}

## Measure

### Cross-entropy

According to Shi (2014)^{[8]}:

"To measure the quality of a language model, one method is to estimate the logarithm likelihood LP(W) of test data with n words, which are assumed to be drawn from the true data distribution:

$ \text{LP}(W) = \frac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i) $

The negative value of this quantity, i.e., −LP(W), is the cross-entropy. In information theory [91], the cross-entropy H(p, q) of p and q measures how close a probability model q comes to the true model p of some random variable X, which is formulated as:

$ H(p, q) = - \sum_{x \in X} p(x) \log_2 q(x) $."
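
Both quantities in the quotation are easy to compute for toy inputs. The following is a minimal sketch, assuming the per-word model probabilities are already available as a list and that distributions are given as plain dictionaries; the function names and the example numbers are made up for illustration.

```python
import math


def average_log_likelihood(word_probs):
    """LP(W): mean of log2 P(w_i) over the n words of the test data."""
    return sum(math.log2(p) for p in word_probs) / len(word_probs)


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log2 q(x), with p and q as dicts over X."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)


# Hypothetical probabilities a model assigned to four test words.
word_probs = [0.5, 0.25, 0.25, 0.5]
lp = average_log_likelihood(word_probs)
print("LP(W) =", lp, "-> empirical cross-entropy =", -lp, "bits")

# Cross-entropy between a "true" distribution p and a model q.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
print("H(p, p) =", cross_entropy(p, p))
print("H(p, q) =", cross_entropy(p, q))
```

In this toy case H(p, q) is larger than H(p, p): the mismatched model q pays extra bits relative to the true distribution.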

### Perplexity

According to Shi (2014)^{[8]}:

"The most commonly used measure for language models is perplexity. The perplexity PL of a language model is calculated as the geometric average of the inverse probability of the words on the test data: $ \text{PL} = \left( \prod_{i=1}^{t} P(w_i | h(w_i)) \right)^{-\frac{1}{t}} $, where $ h(w_i) = w_1, w_2, \ldots, w_{i-1} $. Perplexity is highly correlated with cross-entropy. It actually can be seen as exponential of entropy. Note that in most cases, the true model is unknown. Therefore perplexity can be viewed as an empirical estimate of the cross-entropy.

Perplexity can be the measure for both the language and models. As the measure for the language, it estimates the complexity of a language. When it is considered as the measure for models, it shows how close the model is to the “true” model represented by the test data. The lower the perplexity, the better the model is.

It is important to keep in mind that perplexity is not suitable for measuring language models using un-normalized probabilities. Also perplexity can not be used to compare language models that were constructed on different vocabularies. In these situations, other measures should be chosen."
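
The perplexity formula and its link to cross-entropy can be checked numerically. This is a small sketch, assuming t is the number of test words and that the conditional probabilities P(w_i | h(w_i)) have already been produced by some model; the function name and the inputs are hypothetical.

```python
import math


def perplexity(cond_probs):
    """PL = (prod_i P(w_i | h(w_i))) ** (-1/t), computed in log space."""
    t = len(cond_probs)
    log2_sum = sum(math.log2(p) for p in cond_probs)
    # 2 ** (-(1/t) * sum of log2 probabilities) equals the product form,
    # but avoids underflow on long test sets.
    return 2 ** (-log2_sum / t)


# A model that is always uniform over 8 words has perplexity 8.
uniform_ppl = perplexity([0.125] * 5)

# Hypothetical probabilities assigned to four test words.
cond_probs = [0.5, 0.25, 0.25, 0.5]
ppl = perplexity(cond_probs)
bits = -sum(math.log2(p) for p in cond_probs) / len(cond_probs)
print(uniform_ppl, ppl, 2 ** bits)
```

The last two printed values agree, which illustrates the statement that perplexity is the exponential (base 2 here) of the empirical cross-entropy.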

### Word prediction accuracy

According to Shi (2014)^{[8]}:

"Word prediction has applications in natural language processing, such as augmentative and alternative communication, spelling correction, word and sentence auto completion, etc. Typically word prediction provides one word or a list of words which fit the context best. This function can be realized by language models as a side product. Looking at this from the other side, word prediction accuracy provides a measure of the performance of language models. Word prediction accuracy is calculated as follows: $ WPA = \frac{C}{N} $, where C is the number of words that are correctly predicted and N is the total number of words in the testing. Similar to WER, word prediction accuracy (WPA) is also correlated with perplexity. Intuitively, perplexity can be thought of as the average number of choices a language model has to make. The smaller the number of choices, the higher the word prediction accuracy is. Usually low perplexity co-occurs with a high WPA. However, there are also counterexamples in the literature [159]. Compared with perplexity, WPA has less constraints. It can be applied to measure unnormalized language models. It can also be applied to compare language models constructed from different vocabularies, which happens often in adaptive language models. Compared with WER, the computation of which is speech recognizer dependent, WPA is much easier to calculate. WPA does not have extra dependencies, which makes it suitable to compare language models used in different speech recognizers, i.e. at different research sites."

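The WPA formula above amounts to counting top-1 hits. A minimal sketch follows, assuming the language model is wrapped in a function that maps a word history to its single best guess; the `predict` function and its lookup table are toy stand-ins, not a real model.

```python
def word_prediction_accuracy(predict, words):
    """WPA = C / N, where predict(history) returns the model's top-1 guess."""
    correct = 0
    for i, target in enumerate(words):
        if predict(words[:i]) == target:
            correct += 1
    return correct / len(words)


# Toy predictor: a fixed "most likely next word" table with a default guess.
follow = {"the": "cat", "cat": "sat"}


def predict(history):
    if not history:
        return "the"
    return follow.get(history[-1], "the")


test_words = ["the", "cat", "sat", "on", "the", "mat"]
wpa = word_prediction_accuracy(predict, test_words)
print(wpa)  # 4 of the 6 words are predicted correctly
```

Because only the top-ranked word matters, the scores behind `predict` do not need to be normalized probabilities, which is the property the quotation points out.
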
### Word error rate

According to Shi (2014)^{[8]}:

"In speech recognition, the performance of language models is also assessed by word error rate (WER), which is defined as $ WER = \frac{S + D + I}{N} $, where S, D and I are the number of substitutions, deletions and insertions, respectively, when the prediction hypotheses are aligned with the ground truth according to a minimum edit distance.

WER is the measure that comes from speech recognition systems. In order to calculate a WER, a complete speech recognizer is needed. Compared with the calculation of perplexity, WER is more expensive. The WER results are noisy, because speech recognition performance also depends on the quality of acoustic models. Usually low perplexity implies low word error rate. However, this is not always true^{[9]}^{[10]}. Ultimately, the quality of language models must be measured by their effect on real applications. When comparing different language models on the same well constructed speech recognition systems, the WER is an informative metric."
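
Given a reference transcript and a recognizer hypothesis, the WER definition reduces to a word-level edit distance. The sketch below assumes both are already tokenized into word lists and takes N to be the number of reference words; the function name and the example sentences are illustrative only, and a real evaluation would still need a complete recognizer to produce the hypothesis.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via a minimum edit distance over words."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j]: fewest edits turning reference[:i] into hypothesis[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(1, m + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = reference[i - 1] != hypothesis[j - 1]
            substitution = dist[i - 1][j - 1] + mismatch  # or a free match
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[n][m] / n


reference = "the cat sat on the mat".split()
hypothesis = "the cat sit on mat".split()
wer = word_error_rate(reference, hypothesis)
print(wer)  # one substitution and one deletion over six reference words
```

The dynamic program only returns the total S + D + I; recovering the individual counts would require backtracking through the table.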

## External links

## References

1. Roland Kuhn and Renato de Mori. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(6):570–583, 1990.
2. Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479, 1992.
3. Daniel Gildea and Thomas Hofmann. Topic-based language models using EM. In Proceedings of EUROSPEECH, pages 2167–2170, 1999.
4. Ciprian Chelba. A structured language model. In Association for Computational Linguistics, pages 498–500, 1997.
5. Shen, L., Xu, J., & Weischedel, R. M. (2008, June). A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Language Model. In ACL (pp. 577-585).
6. Mnih, Andriy and Geoffrey Hinton (2007). "Three new graphical models for statistical language modelling". In: Proceedings of the 24th international conference on Machine learning. ICML '07. Corvallis, Oregon: ACM, pp. 641–648.
7. Arisoy, E., Chen, S. F., Ramabhadran, B., & Sethy, A. (2014). Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(1), 184-192.
8. Shi (2014). Language Models with Meta-information.
9. Stanley Chen, Douglas Beeferman, and Ronald Rosenfeld. Evaluation Metrics for Language Models. In DARPA Broadcast News Transcription and Understanding Workshop (BNTUW), February 1998.
10. Rukmini Iyer, Mari Ostendorf, and Marie Meteer. Analyzing and predicting language model improvements. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, pages 254–261, 1997.