Hey everyone! New to the chat! On the topic of Transformers being able to write a cohesive “long term” novel: I think Transformers would need some sort of differentiable memory attached.
Another issue is recurrence slowing things down. I.e. when you train GPTm (m for memory, I made it up!), you don’t want to feed the output of the old prediction into the next training step, i.e. you don’t want training step N to depend on the output of step 1. GPTx and its variants just assume the previous output is right and feed in the ground-truth tokens (teacher forcing); BERT’s masked LM training likewise has no step-to-step dependency.
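To make the teacher forcing point concrete, here is a tiny sketch (the token ids are made up, not from any real tokenizer). The input at each position is the true previous token rather than the model's own earlier prediction, so every position in a sequence can be trained in one parallel pass.

```python
import numpy as np

# Hypothetical token ids for one training sequence.
tokens = np.array([5, 17, 42, 8, 99])

# Teacher forcing: inputs are the ground-truth tokens, targets are the same
# tokens shifted left by one. No model prediction is fed back in.
inputs = tokens[:-1]
targets = tokens[1:]

# Causal mask: position i may only attend to positions j <= i, which is what
# lets all positions be computed in parallel without leaking the future.
n = len(inputs)
causal_mask = np.tril(np.ones((n, n), dtype=bool))

print(inputs, targets)
```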
So the issues are:
- How to add differentiable memory to GPT?
- How to do this without a recurrence relation?
I first thought of something boring (haven’t tested). You make a matrix M (called memory) of size (m, f). f is the original embedding dimension (like 768). m can be massive, say 20,000. For every batch of text, you pass it through the MH attention layers and dense layers all the way to the final softmax layer, then somehow copy BERT’s CLS approach and “extract” the 768-dim CLS vector and compute v = M*(CLS), which gets you a tall skinny vector (20,000 by 1).
Then apply a long tailed sigmoid, i.e. 1/(1+exp(-0.5v)), elementwise to v. Then take the outer product of the sigmoid output (20,000 by 1) with CLS^T (1 by 768). You’ll get a (20,000 by 768) matrix the size of M.
Then M(t+1) = M(t) + 1/(1+exp(-0.5v)) * CLS^T. Then append M onto X (which can be very problematic, since that adds 20,000 rows), or somehow “summarise” M, i.e. say via a Clarkson Woodruff Transform shrinking M (20000,768) to say (500,768). You can even train the summarisation weight matrix S (500 by 20,000) so we get:
M(t+1) = M(t) + 1/(1+exp(-0.5v)) * CLS^T
X(t+1) = concat[ X(t+1) , S * M(t+1) ]
The CW Transform will just “freeze” S as a hash table.
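A rough NumPy sketch of the update and the summarisation step, just to check the shapes (untested as an actual training setup). Assumptions that are mine: the function names are made up, the sizes are shrunk so it runs instantly, the sigmoid gate is combined with the CLS vector as an outer product so the result matches M's shape, and the frozen S is a count-sketch style hash (one random bucket and one random sign per memory row), which is the usual way a Clarkson Woodruff transform is built.

```python
import numpy as np

def long_tailed_sigmoid(v):
    # 1 / (1 + exp(-0.5 v)): same range as a normal sigmoid, gentler slope.
    return 1.0 / (1.0 + np.exp(-0.5 * v))

def update_memory(M, cls):
    # M: (m, f) memory, cls: (f,) pooled CLS vector.
    v = M @ cls                        # (m,) similarity of each memory row to CLS
    gate = long_tailed_sigmoid(v)      # (m,) values in (0, 1)
    return M + np.outer(gate, cls)     # (m, f), same shape as M

def make_cw_sketch(m, k, rng):
    # Frozen S as a hash table: each of the m rows gets one bucket and one sign.
    buckets = rng.integers(0, k, size=m)
    signs = rng.choice(np.array([-1.0, 1.0]), size=m)
    return buckets, signs

def summarise(M, buckets, signs, k):
    # Equivalent to S @ M for the sparse (k, m) matrix S, without building S.
    out = np.zeros((k, M.shape[1]))
    np.add.at(out, buckets, signs[:, None] * M)
    return out

rng = np.random.default_rng(0)
m, f, k, seq_len = 2000, 64, 50, 128   # stand-ins for 20,000 / 768 / 500
M = 0.01 * rng.standard_normal((m, f))
cls = rng.standard_normal(f)
X = rng.standard_normal((seq_len, f))  # stand-in for the token embeddings

M_next = update_memory(M, cls)
buckets, signs = make_cw_sketch(m, k, rng)
summary = summarise(M_next, buckets, signs, k)
X_aug = np.concatenate([X, summary], axis=0)  # concat[X, S * M(t+1)]

print(M_next.shape, summary.shape, X_aug.shape)  # (2000, 64) (50, 64) (178, 64)
```

With a trained S you would swap `summarise` for a plain dense matmul `S @ M_next` and let gradients flow into S.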
This has 2 benefits:
- Incorporates long term attention. I.e. the dot product reinforces memories similar to the current CLS vector, so they get recalled more often, and discounts less important memories.
- Should help with catastrophic forgetting. The long tailed sigmoid lets long term memories stay in place and not vanish.
However there is a clear issue with this approach:
- Recurrence comes back! Batches must now be sequential… I.e. previously you could take 1,000,000 books, scramble every page, and GPT would be fine. Now GPTm needs to train ONLY on book 10, then 103, then 12039, with page orderings intact.
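To illustrate the ordering constraint, a small sketch of the two batching schemes (the book ids and page counts are made up, and the function names are mine). The first is what GPT can get away with; the second is what a memory that carries over from page to page would seem to need.

```python
import random

def shuffled_pages(books, seed=0):
    # GPT-style: every page is independent, so all pages can be scrambled.
    pages = [(book, page) for book, n_pages in books.items() for page in range(n_pages)]
    random.Random(seed).shuffle(pages)
    return pages

def sequential_pages(books, seed=0):
    # GPTm-style: the order of books can be random, but pages inside a book
    # must stay in reading order because the memory depends on earlier pages.
    order = list(books)
    random.Random(seed).shuffle(order)
    return [(book, page) for book in order for page in range(books[book])]

# Hypothetical book id -> number of pages.
books = {10: 3, 103: 2, 12039: 4}

print(shuffled_pages(books))
print(sequential_pages(books))
```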