Natural Language Understanding Wiki

AdaDelta[1] is an adaptive stochastic gradient descent method with some appealing attributes for NLP tasks (see the sketch after this list):

  • No need to specify a global learning rate
  • An individual effective learning rate for each dimension (each model weight, each coordinate of each embedding)
  • The effective learning rate doesn't decay to zero as it does for AdaGrad -- this is good for non-stationary problems
  • Smaller learning rates for frequent dimensions (think: frequent words) and larger learning rates for infrequent dimensions (think: infrequent words)
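
The method keeps two exponentially decaying accumulators per dimension, one of squared gradients and one of squared updates, and their ratio sets that dimension's step size. Below is a minimal NumPy sketch of one update step following Zeiler (2012)[1]; the function name is illustrative, and rho = 0.95, eps = 1e-6 are the values used in the paper's experiments:

  import numpy as np

  def adadelta_step(x, grad, acc_grad, acc_delta, rho=0.95, eps=1e-6):
      """One AdaDelta update (Zeiler, 2012). All arrays share x's shape."""
      # Decaying average of squared gradients: E[g^2] <- rho*E[g^2] + (1-rho)*g^2
      acc_grad = rho * acc_grad + (1.0 - rho) * grad ** 2
      # Per-dimension step: -RMS(past updates) / RMS(gradients) * grad.
      # Note there is no global learning rate to tune.
      delta = -np.sqrt(acc_delta + eps) / np.sqrt(acc_grad + eps) * grad
      # Decaying average of squared updates, used in the next step's numerator.
      acc_delta = rho * acc_delta + (1.0 - rho) * delta ** 2
      return x + delta, acc_grad, acc_delta

  # Toy usage: a few steps on f(x) = x^2, whose gradient is 2x.
  x, g2, d2 = np.array([5.0]), np.zeros(1), np.zeros(1)
  for _ in range(100):
      x, g2, d2 = adadelta_step(x, 2.0 * x, g2, d2)

Keeping the accumulator of past squared updates in the numerator is what stops the effective learning rate from decaying to zero, the AdaGrad behaviour noted above; it also gives the update the same units as the parameters.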

Usage

Bowman et al. (2015)[2] used AdaDelta to train neural networks for the natural language inference task. Peng et al. (2015)[3] attacked a similar problem using a neural reasoner architecture.

Dong et al. (2015)[4] applied AdaDelta to multi-task machine translation.

References

  1. Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701.
  2. Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.
  3. Peng, B., Lu, Z., Li, H., & Wong, K. F. (2015). Towards Neural Network-based Reasoning. arXiv preprint arXiv:1508.05508.
  4. Dong, D., Wu, H., He, W., Yu, D., & Wang, H. (2015). Multi-task learning for multiple language translation. ACL.