
A statistical language model assigns a probability to a sequence of m words by means of a probability distribution. Having a way to estimate the relative likelihood of different phrases is useful in many natural language processing applications. Language modeling is used in speech recognition, machine translation, part-of-speech tagging, parsing, handwriting recognition, information retrieval and other applications.
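
As a concrete, simplified illustration, a language model factors the sequence probability with the chain rule, P(w_1, ..., w_m) = ∏_i P(w_i | w_1, ..., w_{i-1}). The Python sketch below scores a sequence with a hypothetical hand-filled bigram table; the table entries, the "<s>" start symbol and the floor value for unseen bigrams are illustrative choices, not part of any particular toolkit.

    import math

    # Toy bigram probabilities P(w_i | w_{i-1}); a hypothetical hand-filled table
    # used only for illustration. "<s>" marks the start of the sequence.
    BIGRAM_PROB = {
        ("<s>", "the"): 0.5,
        ("the", "cat"): 0.2,
        ("cat", "sat"): 0.3,
    }

    def sequence_log_prob(words):
        """Log-probability of a word sequence under the toy bigram model."""
        logp = 0.0
        prev = "<s>"
        for w in words:
            # Crude floor for unseen bigrams; real models use smoothing instead.
            p = BIGRAM_PROB.get((prev, w), 1e-6)
            logp += math.log(p)
            prev = w
        return logp

    print(sequence_log_prob(["the", "cat", "sat"]))  # log(0.5 * 0.2 * 0.3)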

Types

Based on information content

  • Cache-based LMs[1]: use a cache window to store statistical temporal information. The motivation for cache-based language models is that humans tend to use language in a bursty way: a word that has occurred in the recent history has a higher chance of occurring again in the near future (see the sketch after this list).
  • Class-based LMs[2] and topic-based LMs[3] exploit the clustering of training data to improve language models.
  • Structured LMs[4] directly embed the syntactic structure of language into language models.
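
The following is a minimal sketch of the cache idea, assuming a simple design in which a static background unigram model is linearly interpolated with a unigram model estimated from the most recent words. The cache size and interpolation weight are illustrative choices, not the specific scheme of [1].

    from collections import Counter, deque

    class CacheUnigramLM:
        """Sketch: interpolate a static background unigram model with a unigram
        model estimated from the last `cache_size` words (the cache)."""

        def __init__(self, background_probs, cache_size=200, lam=0.2):
            self.background = background_probs     # word -> background probability
            self.cache = deque(maxlen=cache_size)  # recent-history window
            self.lam = lam                         # weight of the cache component

        def prob(self, word):
            p_bg = self.background.get(word, 1e-6)
            counts = Counter(self.cache)
            p_cache = counts[word] / len(self.cache) if self.cache else 0.0
            return (1 - self.lam) * p_bg + self.lam * p_cache

        def observe(self, word):
            self.cache.append(word)  # update the cache as text is processed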

Based on information encoding scheme

  • N-gram LMs
  • Decision tree LMs
    • Random forest LMs
  • Dynamic Bayesian LMs
  • Exponential LMs
  • Neural network LMs
  • Converted LMs: for faster decoding, a feed-forward NNLM is converted into a back-off N-gram LM[5]

Measure

Cross-entropy

To measure the quality of a language model, one method is to estimate the log-likelihood LP(W) of test data W = w_1 w_2 ... w_n with n words, which are assumed to be drawn from the true data distribution:

    LP(W) = \frac{1}{n} \sum_{i=1}^{n} \log P(w_i \mid w_1, \ldots, w_{i-1}).

The negative value of this quantity, i.e. −LP(W), is the cross-entropy. In information theory [91], the cross-entropy H(p, q) of p and q measures how close a probability model q comes to the true model p of some random variable X, which is formulated as:

    H(p, q) = -\sum_{x} p(x) \log q(x).
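
Under the assumption that the model exposes the conditional probability it assigned to each test word, the empirical cross-entropy (in bits per word) can be computed as in the sketch below; the probability list is illustrative.

    import math

    def cross_entropy(word_probs):
        """Empirical cross-entropy in bits per word:
        -(1/n) * sum(log2 P(w_i | history))."""
        n = len(word_probs)
        return -sum(math.log2(p) for p in word_probs) / n

    # Hypothetical conditional probabilities the model assigned to the test words.
    probs = [0.1, 0.25, 0.05, 0.2]
    print(cross_entropy(probs))  # bits per word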

Perplexity

The most commonly used measure for language models is perplexity. The perplexity PL of a language model is calculated as the geometric average of the inverse probability of the words on the test data:

    PL(W) = \sqrt[n]{\prod_{i=1}^{n} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})}},

where W = w_1 w_2 ... w_n is the test data and P(w_i | w_1, ..., w_{i-1}) is the probability the model assigns to word w_i given its history. Perplexity is highly correlated with cross-entropy: it can be seen as the exponential of the cross-entropy, PL(W) = 2^{−LP(W)} when logarithms are taken to base 2. Note that in most cases the true model is unknown; the quantity −LP(W) computed on test data is therefore an empirical estimate of the cross-entropy, and perplexity is its exponential.
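
The following is a small sketch of this computation, using the equivalence between perplexity and the exponential of the empirical cross-entropy; the per-word probabilities are illustrative.

    import math

    def perplexity(word_probs):
        """Perplexity as the geometric average of the inverse word probabilities,
        equivalently 2 ** (empirical cross-entropy in bits)."""
        n = len(word_probs)
        return 2 ** (-sum(math.log2(p) for p in word_probs) / n)

    probs = [0.1, 0.25, 0.05, 0.2]  # hypothetical per-word model probabilities
    print(perplexity(probs))        # same value as (product of 1/p) ** (1/n)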

Perplexity can serve as a measure of both the language and the model. As a measure of the language, it estimates the complexity of a language [23]. When it is considered as a measure of models, it shows how close the model is to the “true” model represented by the test data. The lower the perplexity, the better the model.

It is important to keep in mind that perplexity is not suitable for measuring language models that use unnormalized probabilities. Perplexity also cannot be used to compare language models that were constructed on different vocabularies. In these situations, other measures should be chosen.

Word prediction accuracy

Word prediction has applications in natural language processing, such as augmentative and alternative communication [175], spelling correction [34], and word and sentence auto-completion. Typically, word prediction provides one word or a list of words that fit the context best. This function can be realized by language models as a by-product. Looking at this from the other side, word prediction accuracy provides a measure of the performance of language models [159]. Word prediction accuracy is calculated as follows:

    WPA = \frac{C}{N},

where C is the number of words that are correctly predicted and N is the total number of words in the test data. Similar to WER, word prediction accuracy (WPA) is correlated with perplexity. Intuitively, perplexity can be thought of as the average number of choices a language model has to make: the smaller the number of choices, the higher the word prediction accuracy. Usually low perplexity co-occurs with a high WPA, although there are counterexamples in the literature [159]. Compared with perplexity, WPA has fewer constraints: it can be applied to measure unnormalized language models, and it can be used to compare language models constructed from different vocabularies, which happens often in adaptive language modeling. Compared with WER, whose computation depends on a speech recognizer, WPA is much easier to calculate; it does not have such extra dependencies, which makes it suitable for comparing language models used with different speech recognizers, i.e. at different research sites.
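
A minimal sketch of the WPA computation, assuming the model's top-1 prediction for every test position has already been collected; the word lists are illustrative.

    def word_prediction_accuracy(predictions, reference):
        """WPA = C / N: fraction of test words whose top-1 model prediction
        matches the actual word."""
        correct = sum(1 for p, r in zip(predictions, reference) if p == r)
        return correct / len(reference)

    # Hypothetical top-1 guesses of the model vs. the actual test words.
    print(word_prediction_accuracy(["the", "cat", "sat"], ["the", "dog", "sat"]))  # 2/3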

Word error rate

In speech recognition, the performance of language models is also assessed by the word error rate (WER), which is defined as

    WER = \frac{S + D + I}{N},

where S, D and I are the numbers of substitutions, deletions and insertions, respectively, when the prediction hypotheses are aligned with the ground truth according to a minimum edit distance, and N is the total number of words in the ground truth.
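
The following is a minimal sketch of the WER computation via dynamic-programming edit distance; it returns the total number of edits (S + D + I) divided by the reference length N, without breaking the edits down by type. The example sentences are illustrative.

    def word_error_rate(hypothesis, reference):
        """WER = (S + D + I) / N via a minimum-edit-distance alignment of the
        hypothesis against the reference (ground truth)."""
        h, r = hypothesis, reference
        # dp[i][j] = minimum number of edits to turn r[:i] into h[:j]
        dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            dp[i][0] = i                      # delete all reference words
        for j in range(len(h) + 1):
            dp[0][j] = j                      # insert all hypothesis words
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
                dp[i][j] = min(sub,               # substitution (or match)
                               dp[i - 1][j] + 1,  # deletion
                               dp[i][j - 1] + 1)  # insertion
        return dp[len(r)][len(h)] / len(r)

    print(word_error_rate("the cat sat down".split(), "the cat sat".split()))  # 1/3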

WER is a measure that comes from speech recognition systems: in order to calculate WER, a complete speech recognizer is needed, so compared with the calculation of perplexity, obtaining WER is more expensive. WER results are also noisy, because speech recognition performance depends on the quality of the acoustic models as well. Usually low perplexity implies a low word error rate, but this is not always true [29, 64]. Ultimately, the quality of language models must be measured by their effect on real applications. When comparing different language models on the same well-constructed speech recognition system, WER is an informative metric.

External links

Wikipedia has an article on: Language model

References

  1. Roland Kuhn and Renato de Mori. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12 (6):570–583, 1990.
  2. Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479, 1992.
  3. Daniel Gildea and Thomas Hofmann. Topic-based language models using EM. In Proceedings of EUROSPEECH, pages 2167–2170, 1999.
  4. Ciprian Chelba. A structured language model. In Association for Computational Linguistics, pages 498–500, 1997.
  5. E. Arisoy, S. F. Chen, B. Ramabhadran, and A. Sethy. Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing, 22(1):184–192, 2014.