AdaDelta is an adaptive variant of stochastic gradient descent with several properties that are appealing for NLP tasks:
- No need to specify a global learning rate
- Individual effective learning rate for each dimension (model weight, dimension of each embedding)
- Effective learning rate does not decay to zero, as it does in AdaGrad; this is good for non-stationary problems
- Smaller learning rate for frequently updated dimensions (think: frequent words) and larger learning rate for rarely updated dimensions (hence infrequent words)
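
The properties listed above follow from the per-dimension update rule in Zeiler (2012). A minimal sketch of one update step is given here in plain Python. The function name `adadelta_step`, the list-based state, and the toy quadratic loss are illustrative assumptions, and `rho=0.95` and `eps=1e-6` are commonly used settings rather than required values.

```python
import math

def adadelta_step(x, grad, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """Apply one AdaDelta update in place (illustrative sketch).

    x, grad, avg_sq_grad and avg_sq_delta are lists of equal length;
    the two avg_* lists hold the running state and start at zero.
    """
    for i, g in enumerate(grad):
        # Decaying average of squared gradients, kept per dimension.
        avg_sq_grad[i] = rho * avg_sq_grad[i] + (1.0 - rho) * g * g
        # Step size is the ratio of two RMS values, so no global
        # learning rate appears anywhere in the update.
        delta = -math.sqrt(avg_sq_delta[i] + eps) / math.sqrt(avg_sq_grad[i] + eps) * g
        # Decaying average of squared updates, used by the next step.
        avg_sq_delta[i] = rho * avg_sq_delta[i] + (1.0 - rho) * delta * delta
        x[i] += delta
    return x

# Toy usage (assumed example): minimise f(x) = x_0^2 + x_1^2.
params = [1.0, -2.0]
sq_grad = [0.0, 0.0]
sq_delta = [0.0, 0.0]
for _ in range(50):
    gradient = [2.0 * p for p in params]
    adadelta_step(params, gradient, sq_grad, sq_delta)
print(params)
```

Because both accumulators are decaying averages rather than running sums, the denominator does not grow without bound, which is why the effective step size does not shrink to zero the way it does in AdaGrad.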
Dong et al. (2015) applied AdaDelta in machine translation.
- Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701. Retrieved from http://arxiv.org/abs/1212.5701
- Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
- Peng, B., Lu, Z., Li, H., & Wong, K. F. (2015). Towards Neural Network-based Reasoning. arXiv preprint arXiv:1508.05508.
- Dong, D., Wu, H., He, W., Yu, D., & Wang, H. (2015). Multi-task learning for multiple language translation. ACL.