Hi, this is probably a stupid question, but I have a doubt: each model has a maximum number of tokens (for example, gpt-3.5-turbo has a maximum of 4,096 tokens), but does that limit apply to each prompt (prompt + answer) or to everything my code sends in total?
Let me explain: are the 4,096 tokens the maximum for a single prompt, or, if I write a program that sends multiple prompts, do the 4,096 tokens apply to the whole program (so across all the prompts)?
I don’t know if I managed to explain myself.
Not a stupid question at all; it confuses many. 4096 tokens is your entire world: you must do everything within that limit. You have to ask your question, pass in any history of past questions and answers, and leave space for the latest answer, all within that same limit. Hope that helps.
A morbid way to put it, but yes, that statement is on point.
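To make that concrete, here is a minimal sketch of a single chat-completion call (assuming the official `openai` Python package with the v1-style client; the model and `max_tokens` values are only examples). Everything in `messages`, plus the room you reserve for the reply via `max_tokens`, has to fit inside the model's 4096-token window:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The whole conversation you send counts against the 4096-token window,
# together with the space you reserve for the reply via max_tokens.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a token?"},                  # earlier turn
    {"role": "assistant", "content": "A token is a chunk of text."},  # earlier reply
    {"role": "user", "content": "And how many fit in one request?"},  # new question
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
    max_tokens=500,  # tokens reserved for the answer, inside the same 4096 budget
)
print(response.choices[0].message.content)
```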
Ok, thanks, that’s a little clearer.
So, in summary, the max tokens include prompt and response (and any “history” in the case of chat completion).
But if I send two prompts (therefore two requests), do I have 4096 tokens for the first prompt and 4096 tokens for the second, or 4096 tokens for both combined? That's my doubt.
4096 for each: the 4096-token limit applies to a single API call/request.
There is also a total limit, the rate limit, which is set at the organisation level and caps the total number of tokens you can send per minute.
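As a rough illustration of how the two limits differ (a sketch only; the retry count and back-off times are arbitrary assumptions): the context window is enforced per request, while the TPM rate limit only throttles how many tokens your organisation sends per minute, so hitting it is usually handled by backing off and retrying:

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def ask(messages, max_tokens=500, retries=3):
    """Each call gets its own 4096-token context budget; the org-level
    TPM rate limit only caps how many tokens you send per minute."""
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=messages,
                max_tokens=max_tokens,
            )
        except RateLimitError:
            time.sleep(2 ** attempt)  # hit the per-minute limit: back off and retry
    raise RuntimeError("still rate limited after retries")
```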
For prompt and reply A to influence prompt and reply B in some way, at least some tokens from prompt and reply A must make their way into prompt B. So it is always 4096 tokens no matter how many questions it spans, as long as you want to retain context and relevance; if you don't care, then you get a fresh 4096 to play with each time.
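In code, that choice might look roughly like the sketch below (the names and the character-based token estimate are made up for illustration and deliberately crude): either you keep appending past turns and trim the oldest ones so everything still fits in 4096, or you start each request with a fresh `messages` list and get the full budget every time.

```python
from openai import OpenAI

client = OpenAI()

MODEL_LIMIT = 4096        # gpt-3.5-turbo context window
RESERVED_FOR_REPLY = 500  # what we pass as max_tokens

def rough_token_count(messages):
    # Crude estimate (~4 characters per token); use tiktoken for real counts.
    return sum(len(m["content"]) for m in messages) // 4

history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask_with_history(question):
    history.append({"role": "user", "content": question})
    # Drop the oldest non-system turns until prompt + reply fit in the window.
    while rough_token_count(history) + RESERVED_FOR_REPLY > MODEL_LIMIT and len(history) > 2:
        history.pop(1)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=history,
        max_tokens=RESERVED_FOR_REPLY,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})  # keep context for next turn
    return reply
```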
Ok, I know about the TPM rate limits for the API, but isn't the TPM usually higher than the max tokens allowed by the model? That's why I thought they were separate things.
I’m sorry but this token thing really confuses me
Perfect, thank you very much! It's exactly what I needed.
@deb23 Not at all a silly question.
It has confused many people.
MAX_TOKENS is the length of the generated response.
But in total, the prompt + generated response length shouldn't exceed the limit of the model you are using.
It's up to you what MAX_TOKENS you set; whatever is left is for your prompt (since you may be sending the complete chat history).
For example:
gpt-3.5-turbo has a limit of 4096 tokens, so if I set MAX_TOKENS = 1000, the generated response will stay within 1000 tokens. That leaves 3096 tokens for your prompt, and you can also count the prompt tokens before sending them to GPT.
This is how it goes.
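If it helps, here is a small sketch of that bookkeeping with the tiktoken library (the 1000-token figure just mirrors the example above, and chat formatting adds a few extra tokens per message that this rough count ignores):

```python
import tiktoken

MODEL_LIMIT = 4096
MAX_TOKENS = 1000  # reserved for the generated response, as in the example above

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

prompt = "Summarise the following text: ..."  # whatever you plan to send
prompt_tokens = len(enc.encode(prompt))

# Chat messages add a few tokens of formatting overhead per message,
# so leave a small safety margin on top of this count.
budget_for_prompt = MODEL_LIMIT - MAX_TOKENS  # 3096 tokens in this example
print(f"prompt uses {prompt_tokens} of the {budget_for_prompt} tokens available")

if prompt_tokens > budget_for_prompt:
    print("prompt too long: shorten it or lower MAX_TOKENS")
```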