Tang et al. (2018)[1]:
- Self-attention does not outperform recurrent architectures at capturing long-range dependencies
- Self-attention performs better at word sense disambiguation (WSD)
- The finding of Tran et al. (2018)[2] that the Transformer performs worse than RNNs on long-range dependencies is attributed to hyperparameter choices
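Both evaluations target the self-attention operation, whose defining property is that every position attends to every other in a single step (constant path length between distant tokens, versus O(n) steps in an RNN). A minimal NumPy sketch, not taken from either paper, of scaled dot-product self-attention with identity projections for brevity:

```python
import numpy as np

def self_attention(X):
    # Scaled dot-product self-attention over a sequence X of shape (n, d).
    # Every position attends to every other in one step, so the path length
    # between any two tokens is O(1) -- the property probed in the
    # long-range-dependency evaluations above.
    n, d = X.shape
    # Identity projections for brevity; real models learn W_q, W_k, W_v.
    Q, K, V = X, X, X
    scores = Q @ K.T / np.sqrt(d)                   # (n, n) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d) contextualized outputs

X = np.random.randn(6, 4)
out = self_attention(X)
print(out.shape)  # (6, 4)
```

Whether this constant path length translates into better long-range modeling in practice is exactly what Tang et al. test empirically.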
References
- ↑ Tang, G., Müller, M., Rios, A., & Sennrich, R. (2018). Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures. In Proceedings of EMNLP 2018, 4263–4272.
- ↑ Tran, K., Bisazza, A., & Monz, C. (2018). The Importance of Being Recurrent for Modeling Hierarchical Structure. arXiv preprint arXiv:1803.03585.