Tang et al. (2018)[1]:
- Self-attention does not outperform recurrent architectures at capturing long-range dependencies
- Self-attention performs better at word sense disambiguation (WSD)
- The finding of Tran et al. (2018)[2] that the Transformer performs worse than RNNs on long-range dependencies is attributed to hyperparameter choices
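Both evaluations target the self-attention operation, whose defining property is that every position attends to every other in a single step (constant path length between distant tokens, versus O(n) steps in an RNN). A minimal NumPy sketch, not taken from either paper, of scaled dot-product self-attention with identity projections for brevity:

```python
import numpy as np

def self_attention(X):
    # Scaled dot-product self-attention over a sequence X of shape (n, d).
    # Every position attends to every other in one step, so the path length
    # between any two tokens is O(1) -- the property probed in the
    # long-range-dependency evaluations above.
    n, d = X.shape
    # Identity projections for brevity; real models learn W_q, W_k, W_v.
    Q, K, V = X, X, X
    scores = Q @ K.T / np.sqrt(d)                   # (n, n) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d) contextualized outputs

X = np.random.randn(6, 4)
out = self_attention(X)
print(out.shape)  # (6, 4)
```

Whether this constant path length translates into better long-range modeling in practice is exactly what Tang et al. test empirically.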
References
- ↑ Tang, G., Müller, M., Rios, A., & Sennrich, R. (2018). Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures. In Proceedings of EMNLP 2018, 4263–4272.
- ↑ Tran, K., Bisazza, A., & Monz, C. (2018). The Importance of Being Recurrent for Modeling Hierarchical Structure. arXiv preprint arXiv:1803.03585.