Hey everyone! Not sure if this is the right place to post, but recently in my free time I've been reviewing Transformers and the maths / guts behind them. I re-skimmed Attention Is All You Need [1706.03762], Uni of Waterloo's CS480/680 Lecture 19: Attention and Transformer Networks - YouTube, Ali Ghodsi, Lect 13 (Fall 2020): Deep learning, Transformer, BERT, GPT - YouTube, and Attention Is All You Need - YouTube.
I know GPTx is just the Decoder with masked multi-head self-attention over the learnt word embeddings X, with a final softmax layer predicting the next token.
I left out the layer normalization and residual connections for simplicity.
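So the simplified picture in my head is roughly this (my own shorthand; W_out is just my name for the final linear layer mapping to vocabulary size):

$$
P(\text{next token}) = \sigma\big(\mathrm{MaskedMultiHead}(X)\, W_{\text{out}}\big)
$$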
Normal attention's equations, where the W matrices are learnt parameters, X is the learnt embeddings, and sigma (σ) is the softmax function:
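Writing them out explicitly (this should just be the standard scaled dot-product form from the paper, with σ as the softmax):

$$
Q = X W^Q, \qquad K = X W^K, \qquad V = X W^V
$$

$$
\mathrm{Attention}(Q, K, V) = \sigma\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$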
However, with masking, since GPTx attends ONLY to tokens on its left, we add a masking matrix M to the attention scores before the softmax. M is 0 on the diagonal and lower triangle, and -inf on the upper triangle. This works because exp(-inf) = 0.
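A quick numpy sketch of the mask idea, just to convince myself that exp(-inf) = 0 really zeroes out the future positions (the shapes and variable names here are my own, not from the paper):

```python
import numpy as np

def causal_mask(n):
    # 0 on the diagonal and below, -inf strictly above the diagonal
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(X.shape[0])
    # row-wise softmax; exp(-inf) = 0, so each position only attends
    # to itself and the positions to its left
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# tiny check: 4 tokens, d_model = d_k = d_v = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = masked_attention(X, W_q, W_k, W_v)
print(out.shape)                    # (4, 8)
print(np.triu(weights, k=1).sum())  # 0.0, i.e. nothing attends to the future
```

The upper triangle of the attention weights summing to exactly 0 is the whole point: no position ever attends to anything on its right.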
Using matrix diagrams, we can also sanity-check the shapes of the multiplications above.
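Spelling the shapes out (assuming n tokens and the paper's d_model, d_k, d_v):

$$
X \in \mathbb{R}^{n \times d_{\text{model}}}, \qquad W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, \qquad W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}
$$

$$
Q K^\top \in \mathbb{R}^{n \times n}, \qquad \sigma\!\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V \in \mathbb{R}^{n \times d_v}
$$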
After considering multi-head attention, we concatenate every head's output and apply one large linear projection W^O.
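In the paper's notation, with h heads each having their own projections:

$$
\mathrm{head}_i = \mathrm{Attention}(X W_i^Q,\; X W_i^K,\; X W_i^V)
$$

$$
\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}
$$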
Likewise, the positional encoding is just sines and cosines alternating over the embedding dimensions.
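From the paper, for position pos and dimension index i:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

These get added to the word embeddings X before the attention layers.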
I’m working on backpropagation derivatives still!
Is my formulation / summary correct? Did I get something wrong in trying to understand Transformers?