Hello,
I am trying to understand the attention mechanism and have some basic questions.
- Is the output of a single attention head a vector or a scalar? (Or even a matrix?)
- The attention head applies a softmax function to a matrix. Is the output of this softmax a vector or a scalar? If it is a vector, over which dimension is the softmax applied?
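To make the questions concrete, here is a minimal NumPy sketch of a single scaled dot-product attention head; the shapes and random values are just a toy example I made up, not from any particular implementation:

```python
import numpy as np

# Toy shapes (my own assumption): sequence length 4, key/value dimension 8.
seq_len, d_k = 4, 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((seq_len, d_k))  # queries
K = rng.standard_normal((seq_len, d_k))  # keys
V = rng.standard_normal((seq_len, d_k))  # values

# Attention scores: one score per (query, key) pair -> a (seq_len, seq_len) matrix.
scores = Q @ K.T / np.sqrt(d_k)

# Softmax applied row-wise (over the last axis, i.e. over the keys),
# so each row of weights sums to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Head output: a weighted sum of the value vectors -> a (seq_len, d_k) matrix,
# i.e. one output vector per input position.
out = weights @ V

print(weights.shape)  # (4, 4)
print(out.shape)      # (4, 8)
```

This is exactly the part my questions are about: whether `weights` and `out` should be thought of as scalars, vectors, or matrices, and along which axis the softmax runs.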