Natural Language Understanding Wiki

Tang et al. (2018)[1] find that:

  • Self-attention is not better than RNNs at modeling long-range dependencies (a sketch of why path length matters follows this list)
  • Self-attention is better at word sense disambiguation (WSD)
  • The results of Tran et al. (2018)[2], where the Transformer performs worse than RNNs on long-range dependencies, are attributed to hyperparameter choices
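
The long-range-dependency discussion comes down to path length: in self-attention, any two positions are connected by a single attention step, whereas an RNN must propagate information through every intermediate state. Below is a minimal NumPy sketch of single-head scaled dot-product self-attention that illustrates this property; all names, shapes, and random weights are illustrative assumptions, not taken from Tang et al. (2018).

  # Minimal single-head scaled dot-product self-attention (NumPy only).
  import numpy as np

  def softmax(x, axis=-1):
      x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
      e = np.exp(x)
      return e / e.sum(axis=axis, keepdims=True)

  def self_attention(x, wq, wk, wv):
      """x: (seq_len, d_model); wq, wk, wv: (d_model, d_head)."""
      q, k, v = x @ wq, x @ wk, x @ wv
      scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq_len, seq_len) pairwise scores
      weights = softmax(scores, axis=-1)        # each position attends to all positions
      return weights @ v, weights

  rng = np.random.default_rng(0)
  seq_len, d_model, d_head = 12, 16, 8
  x = rng.normal(size=(seq_len, d_model))
  wq, wk, wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
  out, weights = self_attention(x, wq, wk, wv)
  # weights[0, -1] is the direct attention weight between the first and last token:
  # one step regardless of distance, versus seq_len state updates in an RNN.
  print(out.shape, weights[0, -1])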

Hybrid architectures that pair convolution with self-attention (a toy sketch of the pattern follows this list):

  • Conformer (Gulati et al. 2020)[3]
  • HaloNet - Local attention (Vaswani et al. 2021)[4]
  • BoTNet (Srinivas et al. 2021)[5]
  • CoAtNet (Dai et al. 2021)[6]
  • and more (ConViT, LeViT, CMT,...)
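
The common thread in these models is combining convolution (a cheap, local inductive bias) with self-attention (global mixing) inside one block. The toy NumPy sketch below shows that conv-plus-attention pattern; the residual wiring, kernel size, and layer shapes are illustrative assumptions and do not reproduce any of the cited architectures.

  # Toy hybrid block: local depthwise 1-D convolution followed by global self-attention.
  import numpy as np

  def softmax(x, axis=-1):
      x = x - x.max(axis=axis, keepdims=True)
      e = np.exp(x)
      return e / e.sum(axis=axis, keepdims=True)

  def depthwise_conv1d(x, kernels):
      """x: (seq_len, channels); kernels: (kernel_size, channels); 'same' padding."""
      k = kernels.shape[0]
      pad = k // 2
      xp = np.pad(x, ((pad, pad), (0, 0)))
      return np.stack([(xp[i:i + k] * kernels).sum(axis=0) for i in range(x.shape[0])])

  def self_attention(x, wq, wk, wv):
      q, k, v = x @ wq, x @ wk, x @ wv
      w = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
      return w @ v

  def hybrid_block(x, kernels, wq, wk, wv):
      """Convolution captures local patterns; attention mixes information globally."""
      x = x + depthwise_conv1d(x, kernels)   # local branch (residual)
      x = x + self_attention(x, wq, wk, wv)  # global branch (residual)
      return x

  rng = np.random.default_rng(0)
  seq_len, d = 20, 16
  x = rng.normal(size=(seq_len, d))
  kernels = rng.normal(size=(3, d)) * 0.1
  wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
  print(hybrid_block(x, kernels, wq, wk, wv).shape)  # (20, 16)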

References

  1. Tang, G., Müller, M., Rios, A., & Sennrich, R. (2018). Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures. In Proceedings of EMNLP 2018, pp. 4263–4272.
  2. Tran, K., Bisazza, A., & Monz, C. (2018). The Importance of Being Recurrent for Modeling Hierarchical Structure. arXiv:1803.03585.
  3. Gulati, A., et al. (2020). Conformer: Convolution-augmented Transformer for Speech Recognition. https://arxiv.org/pdf/2005.08100.pdf
  4. Vaswani, A., et al. (2021). Scaling Local Self-Attention for Parameter Efficient Visual Backbones. https://arxiv.org/pdf/2103.12731.pdf
  5. Srinivas, A., et al. (2021). Bottleneck Transformers for Visual Recognition. https://arxiv.org/abs/2101.11605
  6. Dai, Z., et al. (2021). CoAtNet: Marrying Convolution and Attention for All Data Sizes. https://arxiv.org/abs/2106.04803