Natural Language Understanding Wiki

The attention mechanism was originally invented for machine translation but quickly found applications in many other tasks. It applies whenever one needs to "translate" from one structure (images, sequences, trees) to another. Ilya Sutskever, Research Director at OpenAI (as of 2015), said in an interview: "[attention models] are here to stay, and that they will play a very important role in the future of deep learning."

The basic idea is to read the input structure twice: once to encode the gist and another time (at each step while decoding) to "pay attention" to certain details.
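The two reads can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular paper's model: the dot-product scoring, the mean used as the "gist", and the names `attend`, `encoder_states`, and `decoder_state` are all assumptions made for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, encoder_states):
    """Second read: weight each encoder state by its relevance to the query.

    Dot-product scoring is an assumption of this sketch.
    """
    weights = softmax([dot(query, h) for h in encoder_states])
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Hypothetical encoder outputs, one vector per input position.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# First read: one summary ("gist") of the whole input, here a plain mean.
gist = [sum(h[i] for h in encoder_states) / len(encoder_states)
        for i in range(2)]

# Hypothetical decoder state at one decoding step, used as the query.
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
```

With this query, the first and third positions receive equal, larger weights than the second, so the context vector is dominated by them. A real decoder would recompute `weights` and `context` at every output step with its current state as the query.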

However, Press and Smith (2018)[1] show that similar performance in machine translation can be achieved using an eager model without attention.


Machine translation

TODO: Luong et al. (2015)[2]

Text processing/understanding

Natural language inference: Parikh et al. (2016)[3]

Abstractive summarization: Chopra et al. (2016)[4]: "The conditioning is provided by a novel convolutional attention-based encoder which ensures that the decoder focuses on the appropriate input words at each step of generation."

Question answering: TODO Dhingra et al. (2016)[5]


Mnih, V., Heess, N., Graves, A., & Kavukcuoglu, K. (2014). Recurrent Models of Visual Attention.

Ba, J., Mnih, V., & Kavukcuoglu, K. (2014). Multiple Object Recognition with Visual Attention. arXiv Preprint arXiv:1412.7755.


Attention as explanation

Many papers have assumed that attention coefficients explain a model's decisions. However, Jain and Wallace (2019)[6] argued that they do not, at least for a class of models (see also Matt Gardner's comments).
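One way to probe the claim is to shuffle the attention weights and check how much the prediction moves, loosely in the spirit of the permutation experiments in Jain and Wallace (2019). The sketch below is a toy, not their setup: the scalar `values`, the `weights`, the output weight `w_out`, and the `predict` function are hypothetical and chosen only to show how very different attention distributions can give nearly the same output.

```python
import itertools
import math

def predict(weights, values, w_out=1.0):
    """Toy classifier: attention-weighted sum of scalar values, then a sigmoid."""
    context = sum(a * v for a, v in zip(weights, values))
    return 1.0 / (1.0 + math.exp(-w_out * context))

# Hypothetical per-token values and attention weights.
values = [0.9, 1.0, 1.1]
weights = [0.7, 0.2, 0.1]

original = predict(weights, values)

# Largest change in the output over all reorderings of the attention weights.
max_change = max(
    abs(predict(list(perm), values) - original)
    for perm in itertools.permutations(weights)
)
```

Here the token values are close to one another, so moving most of the attention mass from one token to another barely changes the prediction. In that situation the weights alone say little about why the model produced its output.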


  1. Press, O., & Smith, N. A. (2018). You May Not Need Attention. EMNLP.
  2. Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. EMNLP.
  3. Parikh, A. P., Täckström, O., Das, D., & Uszkoreit, J. (2016). A Decomposable Attention Model for Natural Language Inference.
  4. Chopra, S., Auli, M., & Rush, A. M. (2016). Abstractive Sentence Summarization with Attentive Recurrent Neural Networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 93–98). San Diego, California: Association for Computational Linguistics.
  5. Dhingra, B., Liu, H., Cohen, W. W., & Salakhutdinov, R. (2016). Gated-Attention Readers for Text Comprehension.
  6. Jain, S., & Wallace, B. C. (2019). Attention is not Explanation. CoRR, abs/1902.10186.