A statistical language model assigns a probability to a sequence of m words $P(w_1,\ldots,w_m)$ by means of a probability distribution. Having a way to estimate the relative likelihood of different phrases is useful in many natural language processing applications. Language modeling is used in speech recognition, machine translation, part-of-speech tagging, parsing, handwriting recognition, information retrieval and other applications.
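As a rough illustration of assigning a probability to a word sequence, here is a minimal sketch of a bigram model with add-one smoothing. The choice of a bigram model, the smoothing scheme, the boundary markers `<s>` and `</s>`, and all function names are assumptions made for this example; they are not taken from the text above.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences (lists of words)."""
    context_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        context_counts.update(tokens[:-1])          # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))
    # Tokens the model may predict: all training words plus the end marker.
    vocab = {w for sent in sentences for w in sent} | {"</s>"}
    return context_counts, bigram_counts, vocab


def bigram_prob(prev, word, context_counts, bigram_counts, vocab):
    """Add-one smoothed estimate of P(word | prev), for words in vocab."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))


def sentence_log2_prob(sent, context_counts, bigram_counts, vocab):
    """log2 P(w_1, ..., w_m) under the bigram approximation."""
    tokens = ["<s>"] + sent + ["</s>"]
    return sum(
        math.log2(bigram_prob(prev, word, context_counts, bigram_counts, vocab))
        for prev, word in zip(tokens, tokens[1:])
    )
```

Trained on a toy corpus, such a model gives a word order it has seen a higher score than a scrambled one, which is the "relative likelihood of different phrases" the applications above rely on. Words outside the training vocabulary are not handled properly in this sketch.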

TODO: combining corpus and knowledge bases: http://aclweb.org/anthology/N/N15/N15-1165.pdf

## Types

### Based on informational content

• Cache-based LMs[1]: use a cache window to store statistical temporal information. The motivation is that humans tend to use language in a bursty way: a word that occurred in the recent history has a higher chance of occurring again in the near future.
• Class-based LMs[2] and topic-based LMs[3] exploit the clustering of training data to improve language models.
• Structured LMs[4] directly embed the syntactic structure of language into language models.
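The cache idea from the first bullet can be sketched as a static unigram model interpolated with relative frequencies from a window of recent words. This is only one possible formulation; the class name, the linear interpolation, and the parameters `cache_size` and `cache_weight` are assumptions for illustration and are not taken from the cited work.

```python
from collections import Counter, deque


class CacheUnigramLM:
    """Static unigram model interpolated with a cache of recently seen words."""

    def __init__(self, static_probs, cache_size=100, cache_weight=0.2):
        # static_probs: dict mapping word -> probability (assumed to sum to 1).
        # cache_size is assumed to be at least 1.
        self.static_probs = static_probs
        self.cache = deque(maxlen=cache_size)
        self.cache_counts = Counter()
        self.cache_weight = cache_weight

    def prob(self, word):
        p_static = self.static_probs.get(word, 0.0)
        if not self.cache:
            return p_static                      # nothing observed yet
        p_cache = self.cache_counts[word] / len(self.cache)
        return (1 - self.cache_weight) * p_static + self.cache_weight * p_cache

    def observe(self, word):
        """Add a word to the cache, forgetting the oldest one when full."""
        if len(self.cache) == self.cache.maxlen:
            self.cache_counts[self.cache[0]] -= 1
        self.cache.append(word)                  # deque drops the oldest entry itself
        self.cache_counts[word] += 1
```

After `observe("a")`, the probability of `"a"` rises and the probabilities of other words fall, which mirrors the burstiness assumption. The interpolated values still sum to one because both components are proper distributions.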

More about structured LMs: "Many researchers attempt to go beyond the word-based language model and augment the translation system with syntax-based language models. Charniak, Knight, and Yamada (2003) design a CFG-based syntax language model for translation output reranking. Shen, Xu, and Weischedel (2008)[5] propose a dependency language model for the hierarchical phrase-based system (Chiang 2007). Post and Gildea (2009), Xiao, Zhu, and Zhu (2011) and Zhang, Zhai, and Zong (2013) propose a tree substitution grammar based syntax language model for the string-to-tree translation model. However, these syntax-based language models much increase the decoding time and they are very difficult to be integrated into the phrase-based translation systems which just generate translation outputs phrase by phrase."

## Measure

### Cross-entropy

According to Shi (2014)[8]:

"To measure the quality of a language model, one method is to estimate the logarithm likelihood LP(W) of test data with n words, which are assumed to be drawn from the true data distribution.

$\text{LP}(W) = \frac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i)$

The negative value of this quantity, i.e., −LP(W), is the cross-entropy. In information theory [91], the cross-entropy H(p, q) of p and q measures how close a probability model q comes to the true model p of some random variable X, which is formulated as:

$H(p, q) = - \sum_{x \in X} p(x) \log_2 q(x)$.
"

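The two formulas above can be sketched in a few lines. The function names and the input conventions (a list of per-word model probabilities for the empirical estimate, and dictionaries mapping outcomes to probabilities for H(p, q)) are assumptions for this example.

```python
import math


def log_likelihood(word_probs):
    """LP(W): average log2 probability the model assigns to the n test words."""
    n = len(word_probs)
    return sum(math.log2(p) for p in word_probs) / n


def empirical_cross_entropy(word_probs):
    """-LP(W), in bits per word."""
    return -log_likelihood(word_probs)


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x) for distributions given as dicts."""
    return -sum(p_x * math.log2(q[x]) for x, p_x in p.items() if p_x > 0)
```

For a model that assigns probabilities 0.5, 0.25, 0.125 and 0.125 to four test words, `empirical_cross_entropy` returns 2.25 bits per word. `cross_entropy(p, q)` is smallest when q equals p, in which case it reduces to the entropy of p.
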
### Perplexity

According to Shi (2014)[8]:

"The most commonly used measure for language models is perplexity. The perplexity PL of a language model is calculated as the geometric average of the inverse probability of the words on the test data:

$\text{PL} = \left( \prod_{i=1}^{t} P(w_i | h(w_i)) \right)^{-\frac{1}{t}}$,

where $h(w_i) = w_1, w_2, \ldots, w_{i-1}$ [and t is the number of words in the test data]. Perplexity is highly correlated with cross-entropy. It actually can be seen as exponential of entropy. Note that in most cases, the true model is unknown. Therefore perplexity can be viewed as an empirical estimate of the cross-entropy.

Perplexity can be the measure for both the language and models. As the measure for the language, it estimates the complexity of a language. When it is considered as the measure for models, it shows how close the model is to the “true” model represented by the test data. The lower the perplexity, the better the model is.

It is important to keep in mind that perplexity is not suitable for measuring language models using un-normalized probabilities. Also perplexity can not be used to compare language models that were constructed on different vocabularies. In these situations, other measures should be chosen."
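A minimal sketch of the perplexity computation, assuming the conditional probabilities $P(w_i | h(w_i))$ for the test words have already been obtained from some model (the function name and input format are illustrative). Working in log space avoids underflow in the product and makes the link to cross-entropy explicit: the perplexity is 2 raised to the average negative log2 probability.

```python
import math


def perplexity(cond_probs):
    """Geometric average of the inverse probabilities of the t test words."""
    t = len(cond_probs)
    avg_neg_log2 = -sum(math.log2(p) for p in cond_probs) / t  # bits per word
    return 2 ** avg_neg_log2
```

A model that assigns probability 1/4 to every test word has perplexity 4, which matches the intuition of choosing uniformly among four words at each step.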

### Word prediction accuracy

According to Shi (2014)[8]:

"Word prediction has applications in natural language processing, such as augmentative and alternative communication, spelling correction, word and sentence auto completion, etc. Typically word prediction provides one word or a list of words which fit the context best. This function can be realized by language models as a side product. Looking at this from the other side, word prediction accuracy provides a measure of the performance of language models. Word prediction accuracy is calculated as follows:

$WPA = \frac{C}{N}$

where C is the number of words that are correctly predicted and N is the total number of words in the test data. Similar to WER, word prediction accuracy (WPA) is also correlated with perplexity. Intuitively, perplexity can be thought of as the average number of choices a language model has to make. The smaller the number of choices, the higher the word prediction accuracy is. Usually low perplexity co-occurs with a high WPA. However, there are also counterexamples in the literature [159]. Compared with perplexity, WPA has fewer constraints. It can be applied to measure unnormalized language models. It can also be applied to compare language models constructed from different vocabularies, which happens often in adaptive language models. Compared with WER, whose computation is speech recognizer dependent, WPA is much easier to calculate. WPA does not have extra dependencies, which makes it suitable to compare language models used in different speech recognizers, i.e. at different research sites."
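The WPA formula can be sketched as follows. The `predict` argument is a hypothetical callable that maps the history (a tuple of preceding words) to the single most likely next word; how a given language model produces that prediction is left open here.

```python
def word_prediction_accuracy(predict, words):
    """WPA = C / N over a non-empty list of test words."""
    correct = 0
    for i, word in enumerate(words):
        history = tuple(words[:i])       # w_1 ... w_{i-1}
        if predict(history) == word:
            correct += 1                 # counts towards C
    return correct / len(words)          # N = len(words)
```

Because only the top prediction is compared with the actual word, the scores behind the prediction do not need to be normalized probabilities, which is why WPA applies to unnormalized models.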

### Word error rate

According to Shi (2014)[8]:

"In speech recognition, the performance of language models is also assessed by word error rate (WER), which is defined as

$WER = \frac{S + D + I}{N}$

where S, D and I are the number of substitutions, deletions and insertions, respectively, when the prediction hypotheses are aligned with the ground truth according to a minimum edit distance.

WER is the measure that comes from speech recognition systems. In order to calculate a WER, a complete speech recognizer is needed. Compared with the calculation of perplexity, WER is more expensive. The WER results are noisy, because speech recognition performance also depends on the quality of acoustic models. Usually low perplexity implies low word error rate. However, this is not always true[9][10]. Ultimately, the quality of language models must be measured by their effect on real applications. When comparing different language models on the same well-constructed speech recognition systems, the WER is an informative metric."
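Given a reference transcript and a recognizer hypothesis, the WER formula can be sketched with a word-level minimum edit distance. The example assumes whitespace-tokenized strings, a non-empty reference, and that N is the number of words in the reference; the function name is illustrative. The distance equals S + D + I for an optimal alignment, so the individual counts are not tracked separately.

```python
def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N via dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]                                   # delete all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1                   # reference word missing in hypothesis
            insertion = curr[j - 1] + 1              # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat", the alignment needs one substitution and one deletion, giving a WER of 2/6. Because insertions are counted, WER can exceed 1 when the hypothesis is much longer than the reference.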
