Does 'max tokens' include the follow-up prompts and completions in a single chat session?

Does ‘max tokens’ include the follow-up prompts and completions in a single chat session, or does it apply to just a single prompt and its completion, while a follow-up prompt and completion gets a new set of maximum tokens?

1 Like

Welcome to the forum.

It’s stateless, so the max tokens limit resets with each call to the API.

4 Likes

Thank you Paul. Much appreciated!
To clarify, does stateless mean that the max tokens (e.g. 4097 tokens) does not apply to one particular format?
Meaning, can the 4097 tokens be either (a single prompt and a single completion) or (a prompt, a completion, and its follow-up prompts and completions)?

Are you 100% sure? I think I read somewhere in their docs that the max tokens, e.g. 4097, applies to a prompt and a completion.

It’s the prompt+completion. By stateless, I mean that it doesn’t “save” any information and you must send what you want it to have each time you call the API.

Please, what do you mean by this? I apologize if this sounds ignorant, as I’m not a coder 😅

Each call has its own max_tokens parameter…

2 Likes

There are two things that might be conflated here:

A model’s context length is the total amount of tokens it can handle at once, a combined count of both the total input that you send and the response you get back.

An API call’s max_tokens parameter reserves a specific amount of the context length for forming an answer, and sets the size of the maximum response that you will receive back from the AI.
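Here is a rough sketch of how the two limits interact, assuming the openai Python package (v1-style client), an API key in the environment, and an illustrative 4,097-token context window; the model name and numbers are placeholders, not a recommendation:

```python
# Sketch only: illustrates how the input and max_tokens share one context window.
# Assumes the openai Python package (v1-style client) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

CONTEXT_LENGTH = 4097   # illustrative context window for the model
MAX_TOKENS = 1000       # tokens reserved for forming the answer

messages = [
    {"role": "system", "content": "You are an AI assistant."},
    {"role": "user", "content": "Summarize the rules of baseball in three sentences."},
]

# Everything you send (the messages) plus the reserved max_tokens must fit
# inside the context length, otherwise the API returns an error.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
    max_tokens=MAX_TOKENS,
)

print(response.choices[0].message.content)
print("input tokens:", response.usage.prompt_tokens)
print("output tokens:", response.usage.completion_tokens)
```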


“Follow-up responses” to me means that you have a chatbot application (instead of just making individual requests for single data-processing tasks).

In a chatbot scenario, the software you use or write should also include some of the past conversation as role messages before the most recent user query, so that the API calls (which are unconnected and have no history) can make the AI understand what you were talking about.

This growing conversation means that you keep sending more input to the AI model with each question, until ultimately you must manage and truncate the past conversation. If you send too much input, besides paying a lot, you can hit a limit where the input plus the max_tokens space you reserved for an answer is larger than the context length, and you instead get an error.
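One simple way to handle that growth is to drop the oldest turns until the remaining messages plus the reserved max_tokens space fit under the context length. A minimal sketch, assuming the tiktoken package for counting; the per-message overhead is approximated, not exact:

```python
# Sketch only: trims the oldest non-system messages so that
# (estimated input tokens + reserved max_tokens) stays under the context length.
# Assumes the tiktoken package; token counts per message are approximate.
import tiktoken

CONTEXT_LENGTH = 4097   # illustrative context window
MAX_TOKENS = 1000       # space reserved for the answer

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(messages):
    # Rough estimate: content tokens plus a few tokens of per-message overhead.
    return sum(len(enc.encode(m["content"])) + 4 for m in messages)

def truncate(messages):
    # Keep the system message and the most recent turns that still fit.
    trimmed = list(messages)
    while count_tokens(trimmed) + MAX_TOKENS > CONTEXT_LENGTH and len(trimmed) > 2:
        del trimmed[1]  # drop the oldest message after the system prompt
    return trimmed
```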

2 Likes

Thank you! The difference between context length and max tokens is clearer.

Is a single API call a single prompt and a single completion?
Does each API call have its own max tokens of 4097?
Do the role messages you explained count as tokens in the next API call?
If someone asks a follow-up question, since the previous messages are ‘remembered’, do the previous messages count toward the max tokens until that conversation is ended?

To expand on what @PaulBellow wrote…

As far as the model is concerned, there is no such thing as a “chat session.” You can understand “stateless” this way:

The entire “state” is whatever you send in your request; that’s it. To have a “conversation” you need to construct a new request, appending the last response and a new user prompt, then submit it again.

You could change anything you wanted to in any of the previous messages and the model would never know the difference; it’s basically just a big function: input in, output out.

So, just as if I had a function f(x), you could send in x = 3 and get some output, say 7.

Then you might say x = (3, 7, 2) and get back 19, so now you’ve got the sequence (3, 7, 2, 19). Anyone with this sequence and access to f can continue generating values in this sequence (this is why we can share chats) because the entire state is whatever we say it is at any time.
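In code, “the entire state is whatever we say it is” just means keeping the message list yourself and re-sending all of it on every call. A minimal chat-loop sketch, assuming the openai Python package (v1-style client) and an API key in the environment:

```python
# Sketch only: a bare-bones chat loop. The API has no memory, so the full
# `history` list is sent again with every request.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are an AI assistant."}]

while True:
    user_input = input("You: ")
    history.append({"role": "user", "content": user_input})

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # placeholder model name
        messages=history,        # the whole conversation so far, every time
        max_tokens=1000,
    )

    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print("Assistant:", answer)
```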

3 Likes

Let me explain it in readable pseudocode:

you prompt this:

block 1

User: Hey bot, what is an apple? Answer in one word
Assistant: fruit

Then you send another request and it will consist of this:

block 2

User: Hey bot, what is an apple? Answer in one word
Assistant: fruit
User: Can I eat it?
Assistant: Depends on if you are allergic or not

On your first request you will be charged for block 1 in tokens, and on your second request you will be charged for block 2, with each block converted to tokens individually.
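Written out as message lists (a sketch; the role/content fields follow the chat-style request format):

```python
# Sketch only: the two requests from the example above as message lists.
# Request 1 is billed for block 1; request 2 is billed for all of block 2.
block_1 = [
    {"role": "user", "content": "Hey bot, what is an apple? Answer in one word"},
]
# ...the API returns "fruit", which you append before asking the next question.

block_2 = [
    {"role": "user", "content": "Hey bot, what is an apple? Answer in one word"},
    {"role": "assistant", "content": "fruit"},
    {"role": "user", "content": "Can I eat it?"},
]
# ...the API returns "Depends on if you are allergic or not".
```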

3 Likes

Each API call is seen as an individual entity. The AI doesn’t have a natural built-in memory like you perceive in ChatGPT. It only gives the illusion of understanding your topic when you also re-send past messages.

The middle of a conversation where you send a new question can look like this:

system: You are an AI assistant.
user: what is your name?
assistant: I don’t have a name yet, would you like to give me one?
user: yes, you are bob the ai
assistant: Hi! As Bob, I’m glad to assist you with any questions or tasks.
user: How many bases are there in baseball?
assistant: Baseball has four bases.
user: the distance between them?
assistant: The distance between baseball bases is 90 feet.
user: What’s your name?
assistant: You named me “Bob”.
user: How far is the fence in most professional stadiums?

You can see I asked several questions that need the prior answers for contextual understanding. This whole conversation needs to be sent again so the AI can answer “Bob” or know that my most recent question is again about baseball.

If I talked for days, the conversation could get very long. Software can use many management techniques to send the AI just what is recent and relevant.

The max_tokens parameter only refers to the size of the reply. It is usually set to the same value, like 1000 tokens, so that answers big and small can be received. The AI decides how long an answer should be, but it doesn’t know the maximum setting at which its answer will be cut off.

1 Like

Really? I think that information is wrong.

max_tokens usually refers to prompt and response.

2 Likes

As an API parameter, what I describe is correct. Set max_tokens to 5, and you only get a few words with a cut-off answer. You can still send a big question though.
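You can see this directly in a request; a minimal sketch, assuming the openai Python package (v1-style client), where finish_reason reports "length" when the cap is hit:

```python
# Sketch only: max_tokens caps the reply, not the prompt.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",   # placeholder model name
    messages=[{"role": "user", "content": "Explain the infield fly rule in detail."}],
    max_tokens=5,            # the answer is cut off after about five tokens
)

print(response.choices[0].message.content)   # just a few words, then truncated
print(response.choices[0].finish_reason)     # "length" when the cap was reached
```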

There are places in OpenAI model documentation listing model capabilities, where the context length of a model is also called “max tokens”. This can lead to confusion.

1 Like

Ah yeah, as a parameter. Yes, looks like I got it wrong.

I think you can safely drop the qualifier “can” from that sentence.

Some people absolutely expect the model to be able to remember the previous 8k tokens, accept an 8k-token prompt, and also return an 8k-token response.

I mean, most of those people aren’t even trying to understand, but the documentation certainly could be made easier to grok for regular citizens.

Incidentally, it would be fabulous if, for ChatGPT, OpenAI would make it visually apparent what was in context and what wasn’t, along with how many tokens they are reserving for fresh inputs and responses.

1 Like

This sounds confusing to me.

In request 2, you will be charged for all the tokens – the initial question, the initial answer (that you have to echo back in the second request), the second question, and the second answer.

There is a difference, in that “generated” tokens cost more than “context” tokens. In request 1, the question is “context” and the generated answer is “generated;” in request 2, the first question, the first answer, and the second question, are all “context,” and the second answer is “generated.”
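A quick back-of-the-envelope illustration, using made-up token counts and made-up per-token prices (not real pricing):

```python
# Sketch only: hypothetical token counts and prices, purely for illustration.
# "context" = everything you send in; "generated" = the new answer you get back.
PRICE_CONTEXT = 0.0015 / 1000    # hypothetical $ per input token
PRICE_GENERATED = 0.0020 / 1000  # hypothetical $ per output token

# Request 1: question (20 tokens) -> answer (30 tokens)
cost_1 = 20 * PRICE_CONTEXT + 30 * PRICE_GENERATED

# Request 2: question 1 + answer 1 + question 2 = 65 context tokens -> 40 generated tokens
cost_2 = 65 * PRICE_CONTEXT + 40 * PRICE_GENERATED

print(f"request 1: ${cost_1:.6f}, request 2: ${cost_2:.6f}")
```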

You pay for both user prompts and assistant responses (that’s why I named it a block; a block consists of the whole communication).

And to keep the context in the block 2 communication, you have to send the model everything from block 1 plus the follow-up prompt, and then you get the response.

The model does not remember/save anything from previous requests.

You have to tell the model what was previously talked about.

Really good! I understand much better now how it’s not actually a “memory” but rather feeds back the previous conversation as “context” for the new request, and that management settings in the software may prioritize which parts of the past conversation to take as context for the new response.

As @jochenschultz explained,
“To keep the context in the block 2 communication, you have to send the model everything from block 1 plus the follow-up prompt, and then you get the response.

The model does not remember/save anything from previous requests.”

And as @elmstedt said, it would be fabulous if, for ChatGPT, OpenAI would make it visually apparent what was in context and what wasn’t, along with how many tokens they are reserving for fresh inputs and responses.

ChatGPT and the GPT models are two different things.

While the GPT models are usable by developers and have documentation, ChatGPT, the chat product from OpenAI, does not allow you to try to find out the kind of thing you just asked about.
Trying to do so might lead to the loss of your account.
I haven’t read about that happening, but those are the rules.

If you want to build your own chat, OpenAI won’t help you with that.

I would prefer them to extend the memory.

I mean, can it really be that hard? :sweat_smile: