Do 'max tokens' include the follow-up prompts and completions in a single chat session?

Within the ChatGPT app, the max_tokens response length has been discovered by experimentation and bugs: it is 1536 tokens, or 150% of 1024.

After hitting the limit, it has a clever trick: the “continue” button. GPT-4 especially has now been tuned to produce answers of limited size, though, so the button’s appearance is rarer.

advanced topic warning

In ChatGPT, there can’t be an active or live display of which parts of the past conversation will actually be used. ChatGPT’s backend currently uses a similarity-matching technique on the submitted input to retrieve just a handful of past conversation turns from the database. These can sometimes include more user messages than AI replies, and they can be a salt-and-pepper smattering of snippets from what was discussed at length, with some exchanges summarized. Learning how it works runs into an uncertainty principle: you can’t probe the conversation management without your jailbreak affecting it.

ChatGPT’s technique would require processing the sent user input against an embeddings vector database match, which is part of why we now see a delay before token generation (along with undisclosed output monitoring that examines for content violations beyond what the AI is trained to deny).
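A minimal sketch of what such retrieval could look like, assuming the OpenAI embeddings endpoint and cosine similarity; the model name, top-k cutoff, and selection logic here are illustrative assumptions, not ChatGPT’s actual backend:

```python
# Hypothetical sketch of similarity-based history retrieval.
# Assumes the openai python package (>=1.0) and numpy; the scoring
# and selection logic is illustrative, not ChatGPT's real pipeline.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def retrieve_relevant_turns(user_input: str, history: list[dict], top_k: int = 4) -> list[dict]:
    """Return the top_k past turns most similar to the new user input."""
    query = embed(user_input)
    scored = []
    for turn in history:  # turn = {"role": ..., "content": ...}
        vec = embed(turn["content"])  # in practice these embeddings would be cached
        sim = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        scored.append((sim, turn))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [turn for _, turn in scored[:top_k]]
```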

More unseen input tokens would mean more processing load, which is what you pay for on the API. Making the chatbot software more forgetful saves computation resources.

Your own app (or mine) can do live token counting and parameter consideration to adapt and show what chat would be sent, graying out older messages. Token counting requires a 2 MB dictionary download and a processor-intensive library. Also, an accurate live display as you type would preclude embeddings or lookups that use the submitted input to retrieve relevant conversation that is even older (although you could also click to manually disable or force a history message, a GUI only for an advanced user).
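As a rough sketch of that budgeting logic, assuming the tiktoken library (the ~2 MB dictionary download mentioned above), a 4096-token context, and 1536 tokens reserved for the reply; the message format and per-message overhead are assumptions:

```python
# Sketch of live token budgeting: walk the chat backwards and mark which
# messages still fit in an assumed context budget. Uses tiktoken, which
# downloads its BPE dictionary (~2 MB) on first use.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer for gpt-3.5/gpt-4 era models

def partition_history(messages: list[dict], budget: int = 4096, reserve: int = 1536):
    """Return (sent, grayed_out): the newest messages that fit under the budget
    after reserving room for the reply, and the older ones that would be dropped."""
    remaining = budget - reserve
    sent, grayed_out = [], []
    for msg in reversed(messages):                       # newest first
        cost = len(enc.encode(msg["content"])) + 4       # +4: assumed per-message overhead
        if cost <= remaining:
            remaining -= cost
            sent.append(msg)
        else:
            grayed_out.append(msg)
    sent.reverse()
    return sent, grayed_out
```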


The complexity of the memory (attention) part of the model grows with the square of the context size.
Also, as you increase the context, you also need to increase the number of attention heads to make good use of that context, which in turn increases the size of all the inner layers.
So, doubling the size of the context means quadrupling the input stage, and likely doubling the rest of the model.
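To make the quadratic part concrete, here is a back-of-the-envelope count of attention score matrix entries only; real models differ in many details, so treat the numbers as illustrative:

```python
# Illustrative only: the attention score matrix is n_tokens x n_tokens per head,
# so doubling the context quadruples that part of the compute and memory.
def attention_score_entries(context_len: int, n_heads: int) -> int:
    return n_heads * context_len * context_len

for n in (2048, 4096, 8192):
    print(n, attention_score_entries(n, n_heads=32))
# 2048 -> 134,217,728 entries; 4096 -> 536,870,912 (4x); 8192 -> 2,147,483,648 (16x)
```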

Technically, this is just a few numbers in a yaml file or something. No big deal!
Economically, and training-wise, well, it’s a whole different kettle of fish.

There are some other methods, such as ALiBi, which allows you to extend the memory without re-training, but those approaches tend to miss a lot of stuff in the context when you push them. See for example: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation | OpenReview
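As a rough illustration of the ALiBi idea from that paper: instead of trained position embeddings, a distance-proportional penalty is added to each head’s attention scores before the softmax. The numpy sketch below is a simplified reconstruction, not the reference implementation:

```python
# Minimal numpy sketch of ALiBi's linearly-biased attention scores.
# Each head h gets a slope m_h, and the score for (query i, key j) is
# penalized by m_h * (i - j), i.e. proportionally to how far back token j is.
import numpy as np

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    """Return a (n_heads, seq_len, seq_len) additive bias for causal attention."""
    # Geometric slopes as in the paper (simplest when n_heads is a power of two).
    slopes = np.array([2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]          # i - j: how far back key j is
    distance = np.where(distance < 0, 0, distance)  # future positions get masked elsewhere
    return -slopes[:, None, None] * distance[None, :, :]

# This bias is simply added to the query-key scores before softmax; no position
# embeddings are trained, which is what lets the model extrapolate to longer inputs.
```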


Let me just add some random link.

But there could also be a knowledge graph that gives you a context overview, where you ask the model multiple times in the background until all necessary context has been prioritized, then take from the graph what’s needed and formulate the answer…
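A speculative sketch of that loop, where every name (the graph interface, the ask_model callable, the prompts) is hypothetical:

```python
# Speculative sketch of the multi-pass idea above: repeatedly ask a model
# which parts of a knowledge graph are still missing for the question,
# then answer once the collected context stops growing.
def answer_with_graph(question: str, graph, ask_model, max_rounds: int = 3) -> str:
    context: set[str] = set()
    for _ in range(max_rounds):
        # Ask the model which graph nodes it still needs, given what it already has.
        wanted = ask_model(
            f"Question: {question}\nKnown facts: {sorted(context)}\n"
            "List the entities you still need information about, one per line."
        ).splitlines()
        new_facts = {fact for node in wanted for fact in graph.facts_about(node.strip())}
        if new_facts <= context:          # nothing new was prioritized; stop iterating
            break
        context |= new_facts
    return ask_model(f"Answer using only these facts: {sorted(context)}\nQuestion: {question}")
```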

But that’s more engineering than science.
I still respect that you think the solution will one day live inside the model itself.