AdaDelta[1] is an adaptive stochastic gradient descent method with several attributes that make it appealing for NLP tasks:

  • No need to specify a global learning rate
  • An individual effective learning rate for each dimension (each model weight, each dimension of every embedding vector)
  • The effective learning rate does not decay to zero as it does for AdaGrad -- this is good for non-stationary problems
  • Smaller effective learning rates for frequently updated dimensions (think: frequent words) and larger ones for infrequently updated dimensions (hence infrequent words)
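The per-dimension behavior above can be sketched in a few lines of NumPy, following the update rule from Zeiler (2012): accumulate a decaying average of squared gradients, scale each step by the ratio of the RMS of past updates to the RMS of past gradients, then accumulate the squared update. The class name and hyperparameter defaults (`rho=0.95`, `eps=1e-6`) are illustrative choices, not a reference implementation:

```python
import numpy as np

class AdaDelta:
    """Minimal sketch of per-dimension AdaDelta updates (Zeiler, 2012)."""

    def __init__(self, dim, rho=0.95, eps=1e-6):
        self.rho = rho    # decay rate of the running averages
        self.eps = eps    # small constant for numerical stability
        self.acc_grad = np.zeros(dim)   # running average of squared gradients
        self.acc_delta = np.zeros(dim)  # running average of squared updates

    def update(self, params, grad):
        # E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2
        self.acc_grad = self.rho * self.acc_grad + (1 - self.rho) * grad ** 2
        # Per-dimension step: -RMS(previous updates) / RMS(gradients) * g_t.
        # Note there is no global learning rate anywhere in this expression.
        delta = -np.sqrt(self.acc_delta + self.eps) / np.sqrt(self.acc_grad + self.eps) * grad
        # E[dx^2]_t = rho * E[dx^2]_{t-1} + (1 - rho) * dx_t^2
        self.acc_delta = self.rho * self.acc_delta + (1 - self.rho) * delta ** 2
        return params + delta
```

Because the denominator is a decaying average rather than AdaGrad's ever-growing sum of squared gradients, the effective learning rate can recover after large gradients instead of shrinking toward zero; dimensions that receive gradients often (frequent words) accumulate a larger `acc_grad` and therefore take smaller steps.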

Usage

Bowman et al. (2015)[2] applied AdaDelta to neural networks for the natural language inference task. Peng et al. (2015)[3] attacked a similar problem using a neural reasoner architecture.

Dong et al. (2015)[4] applied AdaDelta to machine translation.

References

  1. Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701.
  2. Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
  3. Peng, B., Lu, Z., Li, H., & Wong, K. F. (2015). Towards Neural Network-based Reasoning. arXiv preprint arXiv:1508.05508.
  4. Dong, D., Wu, H., He, W., Yu, D., & Wang, H. (2015). Multi-task learning for multiple language translation. ACL.