Natural Language Understanding Wiki

Tang et al. (2018)[1]:

  • Self-attention is not better than recurrent architectures at capturing long-range dependencies
  • Self-attention performs better on word sense disambiguation (WSD)
  • The finding of Tran et al. (2018)[2] that the Transformer performs worse than RNNs on long-range dependencies is attributed to hyperparameter choices

Hybrid architectures:

  • Conformer (Gulati et al. 2020)[3]
  • HaloNet - Local attention (Vaswani et al. 2021)[4]
  • BoTNet (Srinivas et al. 2021)[5]
  • CoAtNet (Dai et al. 2021)[6]
  • and more (ConViT, LeViT, CMT, ...)
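The models above all interleave convolution (local patterns) with self-attention (global mixing) in some form. A minimal NumPy sketch of that shared idea is below: a depthwise 1-D convolution followed by single-head self-attention, each with a residual connection. The shapes, the single head, and the omission of normalization and feed-forward layers are illustrative simplifications, not any specific paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def depthwise_conv1d(x, kernel):
    # x: (T, d), kernel: (k, d); 'same' padding, one filter per channel.
    # Each output position sees only a local window of k timesteps.
    k, d = kernel.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (xp[t:t + k] * kernel).sum(axis=0)
    return out

def self_attention(x, Wq, Wk, Wv):
    # Every position attends to every other: global receptive field.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def hybrid_block(x, kernel, Wq, Wk, Wv):
    # Conformer/CoAtNet-style combination: local conv, then global
    # attention, with residual connections around each sub-module.
    x = x + depthwise_conv1d(x, kernel)
    x = x + self_attention(x, Wq, Wk, Wv)
    return x
```

The point of the combination is complementary receptive fields: the convolution branch is cheap and strongly biased toward local structure, while the attention branch supplies content-based global interactions.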

Further reading

References

  1. Tang, G., Müller, M., Rios, A., & Sennrich, R. (2018). Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures. In Proceedings of EMNLP 2018, 4263–4272.
  2. Tran, K., Bisazza, A., & Monz, C. (2018). The Importance of Being Recurrent for Modeling Hierarchical Structure. arXiv:1803.03585.
  3. Gulati, A., et al. (2020). Conformer: Convolution-augmented Transformer for Speech Recognition. https://arxiv.org/pdf/2005.08100.pdf
  4. Vaswani, A., et al. (2021). Scaling Local Self-Attention for Parameter Efficient Visual Backbones. https://arxiv.org/pdf/2103.12731.pdf
  5. Srinivas, A., et al. (2021). Bottleneck Transformers for Visual Recognition. https://arxiv.org/abs/2101.11605
  6. Dai, Z., et al. (2021). CoAtNet: Marrying Convolution and Attention for All Data Sizes. https://arxiv.org/abs/2106.04803