I am using the OpenAI API with Python, and the model is gpt-4o-mini.
My question is about “max_tokens”.
When setting the max_tokens parameter, do I only need to account for completion tokens? Or should I also add the prompt tokens?
One request returned (Completion Tokens: 1098, Prompt Tokens: 6852, Total Tokens: 7950). Since I didn’t know the exact answer, I tested with 8000, 10000, 13000, and 16384, but the AI’s answer changed each time. Is that affected by max_tokens?
The existing parameter max_tokens can still be used with all models except o1-preview and o1-mini, but it also has a new, long-overdue name that describes its purpose better:
max_completion_tokens
It is the maximum number of generated language tokens you are willing to pay for before the output is cut off.
The distinction in “willing to pay” arises because, with the new o1 models, you are also charged for tokens processed internally, even if they don’t appear in the output. Previously, and with all other models, you were only charged for the output actually generated before language generation was terminated.
The AI’s response varies with each new inference due to built-in statistical randomness. max_completion_tokens does not affect language quality, except that the output may be truncated if the AI hits the token limit before completing its response.
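As an illustration, here is a minimal sketch using the usage numbers from the question (the cap value of 3000 is just an example):

```python
# Usage counts taken from the question above.
usage = {"prompt_tokens": 6852, "completion_tokens": 1098, "total_tokens": 7950}

# max_tokens / max_completion_tokens caps ONLY the completion side,
# so a cap of 3000 would not have truncated this particular answer:
max_completion_tokens = 3000
was_truncated = usage["completion_tokens"] >= max_completion_tokens
print(was_truncated)  # False: 1098 generated tokens < the 3000 cap

# Prompt tokens are counted and billed, but not limited by this parameter:
assert usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
```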
There is a separate per-request calculation to keep in mind: this parameter also acts as a “reservation” of output space within the model’s context length. If your input tokens plus max_tokens add up to more than the model’s context window (the memory where all token operations happen), you will get an error.
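A sketch of that reservation arithmetic, assuming gpt-4o-mini’s 128,000-token context window (check the current model documentation for the exact figure):

```python
# Assumed context window for gpt-4o-mini; verify against OpenAI's model docs.
CONTEXT_WINDOW = 128_000

def fits_in_context(prompt_tokens: int, max_tokens: int) -> bool:
    """max_tokens reserves output space: the prompt plus the
    reservation must fit inside the model's context window."""
    return prompt_tokens + max_tokens <= CONTEXT_WINDOW

# The question's 6852-token prompt plus a 16384-token reservation fits:
print(fits_in_context(6852, 16_384))    # True
# But a prompt that nearly fills the window leaves no room to reserve:
print(fits_in_context(127_000, 4_096))  # False
```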
So:
It simply sets the maximum output size.
It doesn’t affect the answer.
It should be set high enough to always get complete answers (3000 is good)
It should be set MUCH higher on ‘o1’ models, as the potential cost - and the cost of not getting output - is higher (25000 is good, or just omit)
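Put together as illustrative request parameters (the values mirror the recommendations above; treat them as a sketch, not official defaults):

```python
# Hypothetical parameter dicts following the advice above.
params_gpt_4o_mini = {
    "model": "gpt-4o-mini",
    "max_completion_tokens": 3000,    # high enough for complete answers
}
params_o1 = {
    "model": "o1-preview",
    "max_completion_tokens": 25_000,  # or omit the key entirely
}
print(params_o1["max_completion_tokens"] > params_gpt_4o_mini["max_completion_tokens"])
```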