Tang et al. (2018)[1]:
- Self-attention is not better than RNNs at modelling long-range dependencies (e.g., long-distance subject-verb agreement)
- Self-attention is better at word sense disambiguation (WSD)
- The finding of Tran et al. (2018)[2] that the transformer performs worse than RNNs on long-range dependencies is attributed to hyperparameter choices
Hybrid (micro) architectures, which combine convolution and self-attention within a single block (a minimal sketch follows this list):
- Conformer (Gulati et al. 2020)[3]
- HaloNet - Local attention (Vaswani et al. 2021)[4]
- BoTNet (Srinivas et al. 2021)[5]
- CoAtNet (Dai et al. 2021)[6]
- and more (ConViT, LeViT, CMT,...)
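The common pattern behind these hybrids is to let a convolution handle local feature extraction and self-attention handle global context within the same block. Below is a minimal PyTorch sketch of that conv-then-attention idea; it is an illustration only, not the exact block from any of the papers above, and the module name `HybridBlock` and its dimensions are made up for the example.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Toy conv + self-attention block (illustrative only)."""
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Depthwise convolution over the sequence axis: cheap local mixing.
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Multi-head self-attention: content-based global mixing.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        # Conv1d expects (batch, dim, seq_len), hence the transposes.
        x = x + self.conv(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Global mixing with self-attention, also with a residual connection.
        h = self.norm2(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out

x = torch.randn(2, 100, 256)      # (batch, sequence length, channels)
print(HybridBlock()(x).shape)     # torch.Size([2, 100, 256])
```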
(Macro) architectures (see the masking sketch after this list):
- encoder-only (less common): classification and other understanding tasks; example: BERT
- encoder-decoder: sequence-to-sequence tasks such as translation and general text generation; example: BART
- decoder-only (most commonly used): most tasks; examples: GPT, BLOOM, LLaMA
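In practice the main difference between these macro layouts is how attention is masked and what it attends to. The sketch below (names and tensor shapes are chosen for illustration, not taken from any specific library) shows the three patterns: full bidirectional attention for encoder-only models, a causal mask for decoder-only models, and a causal decoder plus unmasked cross-attention to encoder outputs for encoder-decoder models.

```python
import torch

tgt_len, src_len = 5, 7  # illustrative sequence lengths

# Encoder-only (BERT-style): every position attends to every other position.
encoder_self_mask = torch.ones(tgt_len, tgt_len, dtype=torch.bool)

# Decoder-only (GPT/BLOOM/LLaMA-style): position i attends only to positions <= i.
causal_mask = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool))

# Encoder-decoder (BART-style): decoder self-attention is causal (as above),
# while its cross-attention sees all encoder outputs, so no mask is needed there.
cross_attention_mask = torch.ones(tgt_len, src_len, dtype=torch.bool)

print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])
```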
Further reading
References
1. Tang, G., Müller, M., Rios, A., & Sennrich, R. (2018). Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 4263–4272.
2. Tran, K., Bisazza, A., & Monz, C. (2018). The Importance of Being Recurrent for Modeling Hierarchical Structure. arXiv:1803.03585.
3. Gulati, A., et al. (2020). Conformer: Convolution-augmented Transformer for Speech Recognition. https://arxiv.org/pdf/2005.08100.pdf
4. Vaswani, A., et al. (2021). Scaling Local Self-Attention for Parameter Efficient Visual Backbones. https://arxiv.org/pdf/2103.12731.pdf
5. Srinivas, A., et al. (2021). Bottleneck Transformers for Visual Recognition. https://arxiv.org/abs/2101.11605
6. Dai, Z., et al. (2021). CoAtNet: Marrying Convolution and Attention for All Data Sizes. https://arxiv.org/abs/2106.04803