Types

Difficulties

TODO: Chen et al. (2015) survey existing techniques and "further extend self normalization to be a proper estimator of likelihood and introduce an efficient variant of softmax".

TODO: a totally different approach: pointing/copying: Gulcehre et al. (2016)[1]

TODO: what about score matching?

From Grave et al. (2016)[2]: "(i) the methods that consider the original distribution and aim at providing approximations of the probabilities, or of a subset of them (Bengio et al., 2003b; Ji et al., 2015), from (ii) the approaches that compute exact probabilities for an approximate model yielding a lower computational cost, such as the popular hierarchical softmax (Goodman, 2001a; Mnih & Hinton, 2009; Morin & Bengio, 2005)."

Grave et al. (2016)[2]: adaptive softmax

Various solutions have been proposed to circumvent the issue. Some of them are variants of the original class decomposition idea (Bengio, 2002):

  1. importance sampling (Bengio et al., 2003);
  2. uniform sampling with a ranking criterion (Collobert et al., 2011);
  3. hierarchical softmax (Morin & Bengio, 2005);
  4. hierarchical log-bilinear model (Mnih & Hinton, 2009);
  5. structured output layer (Le et al., 2011).

Others avoid computing the full probability distribution: noise-contrastive estimation (Mnih & Teh, 2012)[9] and negative sampling; still others use an alternative loss (Vincent et al., 2015).
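A minimal sketch of the negative-sampling idea mentioned above (all names are illustrative, not taken from any of the cited papers): the observed word is scored against a few sampled noise words with a logistic loss, instead of normalizing over the whole vocabulary.

```python
import math

def negative_sampling_loss(score_target, scores_negatives):
    """Sketch of a negative-sampling objective: push the score of the
    observed word up and the scores of a handful of sampled noise
    words down, with no normalization over the full vocabulary."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    loss = -math.log(sigmoid(score_target))        # observed word
    for s in scores_negatives:                     # sampled noise words
        loss -= math.log(sigmoid(-s))
    return loss
```

The per-example cost is proportional to the number of noise samples rather than to |V|.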

Slow training time

TODO: Bloom embeddings (Serrà and Karatzoglou, 2017[3])

Solution: Approximate Softmax (AKA sampled softmax)

Jean et al. (2014)[4]: "In practice, hence, we partition the training corpus and define a subset V′ of the target vocabulary for each partition prior to training. Before training begins, we sequentially examine each target sentence in the training corpus and accumulate unique target words until the number of unique target words reaches the predefined threshold τ. The accumulated vocabulary will be used for this partition of the corpus during training."
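The partitioning scheme quoted above can be sketched as follows (a toy version; all names are illustrative, not from Jean et al.'s code, and τ is the unique-word threshold):

```python
def partition_vocab(target_sentences, tau):
    """Greedily split a corpus into partitions, each with its own
    sub-vocabulary V' of roughly at most tau unique target words,
    as described by Jean et al. (2014)."""
    partitions = []                      # list of (sentences, vocabulary)
    current_sents, current_vocab = [], set()
    for sent in target_sentences:
        new_words = set(sent) - current_vocab
        # Start a new partition once this sentence would push the
        # unique-word count past the threshold tau.
        if current_sents and len(current_vocab) + len(new_words) > tau:
            partitions.append((current_sents, current_vocab))
            current_sents, current_vocab = [], set()
            new_words = set(sent)
        current_sents.append(sent)
        current_vocab |= new_words
    if current_sents:
        partitions.append((current_sents, current_vocab))
    return partitions
```

Each sentence is then trained against a softmax over its partition's V′ only.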

Solution: Short-list

An output layer of tens to hundreds of thousands of units has proved too computationally expensive. A simple solution is to use a short-list of only a few thousand words.

Originally, Bengio et al. (2003)[5] merged all infrequent words into a special class UNK (shorthand for "unknown") that represents the total probability of all rare words. The probabilities of rare words within the class are estimated from their unigram frequencies. This approach was later improved by Schwenk and Gauvain (2005)[6], who redistributed the probability mass of rare words using an n-gram model.
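A minimal sketch of the short-list idea, assuming a single UNK output whose probability mass is redistributed over rare words by unigram frequency (function and variable names are illustrative):

```python
import math

def shortlist_probs(logits, rare_unigram_counts):
    """Short-list softmax sketch: the network outputs scores only for
    short-list words plus one "UNK" class; the UNK probability mass is
    then spread over rare words in proportion to their unigram counts."""
    z = sum(math.exp(s) for s in logits.values())
    probs = {w: math.exp(s) / z for w, s in logits.items()}
    p_unk = probs.pop("UNK")
    total = sum(rare_unigram_counts.values())
    for w, count in rare_unigram_counts.items():
        # P(w|h) = P(UNK|h) * P_unigram(w | w is rare)
        probs[w] = p_unk * count / total
    return probs
```

The softmax itself only ever runs over the short-list plus one extra output, which is where the speed-up comes from.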

Note that vocabulary truncation can provide very significant speed-ups, but at a noticeable cost in accuracy. Schwenk in some cases used as few as 2K output units in the neural network, and even though these correspond to the most frequent words, the performance degradation was significant, as later shown in Le et al. (2011)[7].

The influence of the vocabulary size on the performance of RNNs seems to be an as yet untouched research topic.

Solution: Hierarchical architecture

Solution: SOUL architecture

The vocabulary is divided into two parts:

  1. Short-listed words, whose probabilities are computed directly;
  2. Other words, which are further divided into classes and subclasses; their probability is computed by the following formula:
    $ P(w_i|h)=P(c_1(w_i)|h) \prod_{d=2}^{D} P(c_d(w_i)|h, c_{1:d-1}) $
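A sketch of evaluating this product, assuming the model produces raw scores over the candidate (sub)classes at each depth (names are illustrative):

```python
import math

def soul_prob(scores_per_level, class_path):
    """Sketch of the SOUL class product: at each depth d the model
    scores the (sub)classes reachable given the earlier choices
    c_{1:d-1}; a softmax per level gives P(c_d | h, c_{1:d-1}) and
    the word probability is the product over all levels."""
    p = 1.0
    for scores, chosen in zip(scores_per_level, class_path):
        z = sum(math.exp(s) for s in scores.values())
        p *= math.exp(scores[chosen]) / z
    return p
```

With a single level this reduces to an ordinary softmax over the top-level classes; each softmax runs over one small class set rather than the whole vocabulary.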

Classes are induced by repeatedly clustering words. More specifically, Le et al. (2011)[7] used a three-step procedure:

  1. Train a standard NNLM model with the short-list as an output, following the one vector initialization scheme (Le et al., 2010)[8].
  2. Reduce the dimension of the context space using a principal component analysis (10 dimensions).
  3. Perform the recursive K-means word clustering based on the distributed representation induced by the context space (except for words in the short-list).
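Steps 2 and 3 above can be sketched as follows (a toy PCA plus recursive k-means in NumPy; parameters and names are illustrative, not from Le et al.'s implementation):

```python
import numpy as np

def pca(X, d=10):
    """Step 2 (sketch): reduce the context space to d dimensions via SVD."""
    Xc = X - X.mean(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means (sketch); returns a cluster index per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)
    return assign

def recursive_cluster(X, words, k=2, max_leaf=2, path=()):
    """Step 3 (sketch): recursively k-means the projected context
    vectors until clusters are small; each word ends up with a
    class path c_1, ..., c_D."""
    if len(words) <= max_leaf:
        return {w: path for w in words}
    assign = kmeans(X, k)
    if len(np.unique(assign)) < 2:   # degenerate split: stop here
        return {w: path for w in words}
    out = {}
    for j in np.unique(assign):
        idx = np.flatnonzero(assign == j)
        out.update(recursive_cluster(X[idx], [words[i] for i in idx],
                                     k, max_leaf, path + (int(j),)))
    return out
```

The resulting class paths are exactly the c_1, ..., c_D used in the SOUL probability formula above.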

Solution: Noise-contrastive estimation

Main article: Noise-contrastive estimation


Pre-compute hidden layer

Devlin et al. (2014)[12] (equivalent to a sparse operation?)

Slow decoding time

TODO good results:

TODO: filter output vocabulary based on co-occurrence with source words (EACL tutorial on neural MT).

The solutions to slow training time listed above also help reduce decoding time. Besides these, some other techniques are available in the literature.

Solution: convert into an N-gram LM

Arisoy et al. (2014)[13] convert a neural network language model into a back-off n-gram language model for efficient decoding in automatic speech recognition.


Solution: Self-normalizing NNLM

If we can make the raw probabilities approximately sum to one all the time, we can avoid normalizing them repeatedly. Devlin et al. (2014)[12] introduced a regularization term into the loss function to encourage the network to self-normalize.
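A sketch of the resulting per-example loss, with a squared log-partition penalty in the spirit of Devlin et al. (α is a hyperparameter; names are illustrative):

```python
import math

def self_normalizing_loss(logits, target, alpha=0.1):
    """Cross-entropy plus alpha * (log Z)^2, which pushes the
    partition function Z toward 1 so that raw scores can be used
    without normalization at decode time. (Sketch; alpha is a
    regularization weight chosen on held-out data.)"""
    log_z = math.log(sum(math.exp(s) for s in logits))
    ce = log_z - logits[target]          # -log softmax(logits)[target]
    return ce + alpha * log_z ** 2
```

With α = 0 this is the ordinary cross-entropy; larger α trades a little likelihood for scores that are closer to being already normalized.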

See also: When and why are log-linear models self-normalizing?

TODO: another solution: Pang et al. (2020)[14]

Hyper-parameters

Vocabulary size

A typical choice for |V| is between 50,000 and 300,000 (Mikolov, 2012), although Collobert and Weston (2008) chose |V| = 30,000.

References

  1. Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou and Yoshua Bengio. 2016. Pointing the Unknown Words. ACL.
  2. Grave, É., Joulin, A., Cissé, M., Grangier, D., & Jégou, H. (2016). Efficient Softmax Approximation for GPUs.
  3. Serrà, J., & Karatzoglou, A. (2017). Getting deep recommenders fit: Bloom embeddings for sparse binary input/output networks.
  4. Jean, S., Cho, K., Memisevic, R., & Bengio, Y. (2014). On Using Very Large Target Vocabulary for Neural Machine Translation.
  5. Y. Bengio, R. Ducharme, P. Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.
  6. H. Schwenk, J. Gauvain. Training Neural Network Language Models On Very Large Corpora. In Proceedings of Joint Conference HLT/EMNLP, 2005.
  7. H.-S. Le, I. Oparin, A. Allauzen, J.-L. Gauvain, F. Yvon. Structured Output Layer Neural Network Language Model. In Proc. of ICASSP, 2011.
  8. H.S. Le, A. Allauzen, G. Wisniewski, and F. Yvon, “Training continuous space language models: Some practical issues,” in Proc. of EMNLP’10, Cambridge, MA, 2010, pp. 778–788.
  9. Mnih, A., & Teh, Y. W. (2012). A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426.
  10. Vaswani, A., Zhao, Y., Fossum, V., & Chiang, D. (2013). Decoding with Large-Scale Neural Language Models Improves Translation. In EMNLP (pp. 1387-1392).
  11. Mnih, A., & Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems (pp. 2265-2273).
  12. Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., & Makhoul, J. (2014). Fast and robust neural network joint models for statistical machine translation. Proceedings of the ACL, Association for Computational Linguistics, Baltimore.
  13. Arisoy, E., Chen, S. F., Ramabhadran, B., & Sethy, A. (2014). Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(1), 184-192.
  14. Pang, T., Xu, K., Dong, Y., Du, C., Chen, N., & Zhu, J. (2020). Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness. In International Conference on Learning Representations.