Tang et al. (2018)[1]:

  • Self-attention is not better than recurrent architectures at capturing long-range dependencies
  • Self-attention performs better at word sense disambiguation (WSD)
  • The finding of Tran et al. (2018)[2] that the Transformer performs worse than RNNs on long-range dependencies is attributed to hyperparameter choices

References

  1. Tang, G., Müller, M., Rios, A., & Sennrich, R. (2018). Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures. In Proceedings of EMNLP 2018, 4263–4272.
  2. Tran, K., Bisazza, A., & Monz, C. (2018). The Importance of Being Recurrent for Modeling Hierarchical Structure. arXiv preprint arXiv:1803.03585.