I am learning about Transformers and want to understand the roles of self-attention and multi-head attention, but first I would like to understand how "attention" is used and how it applies to:

1. plain RNN models (or LSTMs, if those are a better starting point),
2. CNNs used to build sequence models, and
3. Transformers.

I understand that one distinction between these is that RNNs, LSTMs, and CNNs applied to sequence modelling in the pre-Transformer setting process the input sequentially, whereas a Transformer processes the input sequence "all at once", similar to how we process images (all pixels are fed in at once so the conv layers can encode the whole image). Any help, or pointers to previous posts, journal articles, or good blogs, would be appreciated.
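To make the distinction concrete (and so someone can correct me if my mental model is off), here is a minimal sketch of what I mean. This is just a toy example I wrote myself, not code from any paper or library; the weight names (`W_xh`, `W_q`, etc.) and shapes are my own assumptions:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_model = 5, 8
x = torch.randn(seq_len, d_model)      # toy input: 5 tokens, 8 features each

# (1) RNN-style processing: one time step after another, where each step
#     depends on the hidden state produced by the previous step.
W_xh = torch.randn(d_model, d_model)
W_hh = torch.randn(d_model, d_model)
h = torch.zeros(d_model)
rnn_outputs = []
for t in range(seq_len):               # inherently sequential loop over time
    h = torch.tanh(x[t] @ W_xh + h @ W_hh)
    rnn_outputs.append(h)

# (3) Transformer-style (scaled dot-product) self-attention: every token
#     attends to every other token via batched matrix operations, with no
#     loop over time steps -- the whole sequence is processed "all at once".
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / d_model ** 0.5      # (seq_len, seq_len) attention scores
attn = F.softmax(scores, dim=-1)       # each row: how much a token attends to the others
self_attn_output = attn @ V            # (seq_len, d_model)

print(torch.stack(rnn_outputs).shape, self_attn_output.shape)
```

My question is essentially how the notion of "attention" fits into the RNN/LSTM and CNN settings of (1) and (2), compared to the self-attention computed in the last few lines above.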
