Today I discussed aspects of ChatGPT's limited context memory and how to overcome this limitation. Although I am a developer, this is just an idea that popped into my head, and according to ChatGPT some aspects of it might be new. Here is what I said:
Without going into implementation details, the idea is this: I actually know a lot, but I can't access all of it at once. When I concentrate on one thing, certain areas of knowledge open up and I can work with them. When I take a break and relax, I explicitly switch that material off and change context. Such switches are a normal process that happens several times a day. I could imagine something similar for ChatGPT: it would save a lot of memory if contexts were switched dynamically depending on the topic, and it would also make ChatGPT more human. We are, so to speak, running into the same kind of limit.
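To make the idea tangible, here is a toy sketch of topic-scoped context switching. Everything in it is hypothetical: `classify_topic` is a stub, and a real system would need embeddings, retrieval, or summarization rather than a keyword check.

```python
from collections import defaultdict

# Toy sketch: one buffer per topic; only the active buffer would be
# placed in the model's context window, instead of the full history.
contexts = defaultdict(list)   # topic -> that topic's conversation buffer
active_topic = None

def classify_topic(message: str) -> str:
    """Stub classifier. A real system might use embeddings or a small model."""
    return "cooking" if "recipe" in message.lower() else "general"

def handle(message: str) -> list[str]:
    """Route the message to its topic buffer and return what the model would see."""
    global active_topic
    topic = classify_topic(message)
    if topic != active_topic:
        active_topic = topic   # "context switch": other buffers stay parked
    contexts[topic].append(message)
    return contexts[topic]

print(handle("Any good pasta recipe?"))   # -> buffer for "cooking"
print(handle("How do I fix this bug?"))   # -> switches to "general"
```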
ChatGPT doesn’t eat, doesn’t sleep, doesn’t feel regret or remorse. It is an unstoppable algorithm with a server rack full of knowledge, always being used to speak, to dream…
(was John Connor just a prompt engineer?)
In a way, though, what you hint at is already in use: in a mixture-of-experts model, most of the parameters take a break from producing the token stream, since only a few routed experts are active for each token. (A minimal routing sketch follows the excerpt below.)
Non-OpenAI (excerpt from a published model spec):
Total Parameters: 456B
Activated Parameters per Token: 45.9B
Number of Layers: 80
Hybrid Attention: a softmax attention layer is positioned after every 7 lightning attention layers
Number of attention heads: 64
Attention head dimension: 128
Mixture of Experts:
  Number of experts: 32
  Expert hidden dimension: 9216
  Top-2 routing strategy
Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension, with a base frequency of 10,000,000
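To make the "parameter break" concrete, here is a minimal top-2 routing sketch in PyTorch. The sizes (`d_model=512`, `d_hidden=1024`) are placeholders, not the numbers from the excerpt, and the loop-over-experts dispatch is the naive form; real implementations batch and parallelize this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Minimal top-2 mixture-of-experts layer (naive dispatch)."""
    def __init__(self, d_model=512, d_hidden=1024, n_experts=32, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):  # x: (n_tokens, d_model)
        gate_logits = self.router(x)                 # (n_tokens, n_experts)
        top_vals, top_idx = gate_logits.topk(self.k, dim=-1)
        top_w = F.softmax(top_vals, dim=-1)          # weights over the chosen 2
        out = torch.zeros_like(x)
        # Only the 2 selected experts run per token; the other 30 sit
        # idle. That is the "break" for most of the parameters.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = Top2MoE()
tokens = torch.randn(8, 512)
print(moe(tokens).shape)  # torch.Size([8, 512])
```

With 32 experts and top-2 routing, roughly a tenth of the expert parameters do work on any given token, which is how a 456B-parameter model can activate only 45.9B per token.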
That’s right, I was probably comparing apples with oranges. Concentrating on certain topics is necessary for humans because of our internal limitations, whereas ChatGPT has hardly any such limitations to begin with. They do come into play again, though, because the entire current context has to be transferred on every request, since keeping stateful sessions on the server is undesirable.
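That transfer cost is easy to see in code: with a stateless chat API, the client re-sends the whole history on every turn. `call_model` below is a hypothetical stand-in for whatever completion endpoint is actually used.

```python
def call_model(messages: list[dict]) -> str:
    # Hypothetical stand-in for a real chat-completion call.
    return f"(reply after reading {len(messages)} messages)"

history: list[dict] = []

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    # The entire history crosses the wire on every turn, so the
    # context window is re-filled (and re-processed) each time.
    reply = call_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("Hello"))
print(chat("Tell me more"))
```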
Thanks for the technical excerpt. I’ll look into the technology more closely instead of ranting with the robot.