If we set max_completion_tokens to a value greater than the model's maximum context length minus the prompt length, we get:
message: "This model's maximum context length is 4097 tokens, however you requested 5360 tokens (1360 in your prompt; 4000 for the completion). Please reduce your prompt; or completion length.",
It would be really helpful if the API could just set the maximum completion tokens to min(max_completion_tokens, model’s max context length - prompt length) rather than return an error. The parameter max_completion_tokens is only supposed to be a maximum, not the actual number of tokens requested.
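In the meantime, that min() can be applied client-side before the request is sent. Here is a minimal sketch, assuming the tiktoken package for counting prompt tokens, the current Python SDK, and the 4097-token model from the error above; the per-message overhead in the count is a rough approximation, not an exact accounting.

```python
# Client-side sketch of the requested behavior: clamp the completion budget
# to whatever context space the prompt leaves, instead of letting the API error.
import tiktoken
from openai import OpenAI

MODEL = "gpt-3.5-turbo"   # assumed model name for illustration
CONTEXT_WINDOW = 4097     # from the error message above
DESIRED_MAX = 4000        # the completion budget we would like to request

client = OpenAI()
encoding = tiktoken.encoding_for_model(MODEL)

messages = [{"role": "user", "content": "Summarize the following document ..."}]

# Rough prompt length: tokens in the message contents plus a small per-message overhead.
prompt_tokens = sum(len(encoding.encode(m["content"])) + 4 for m in messages) + 3

# The clamp the API is being asked to perform on its own.
max_completion = min(DESIRED_MAX, CONTEXT_WINDOW - prompt_tokens)

response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    max_completion_tokens=max_completion,
)
print(response.choices[0].message.content)
```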
On non-reasoning models, max_completion_tokens (and the earlier max_tokens parameter) also acts as a "reservation" of output space, because of the way the rate limiter works.
When it is specified, the requested maximum is counted against your API rate limit, and it is also checked against the space that remains in the model's context window after your input.
These two functions could certainly be separated for more utility (and more confusion) by setting a "minimum output reservation" along with a "maximum output cutoff", but that wasn't done even with the new max_completion_tokens. Your own API code can apply the same logic, though.
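For example, a small helper along these lines could split the two roles yourself: refuse (or trim) when the prompt leaves less than a minimum reservation of output space, and otherwise send a cutoff clamped to the space that actually remains. The MIN_RESERVATION and MAX_CUTOFF values are hypothetical, chosen only for illustration, and the 4097-token window matches the error above.

```python
MIN_RESERVATION = 500   # hypothetical: least output space we consider useful
MAX_CUTOFF = 2000       # hypothetical: hard cap on spend per response

def completion_budget(prompt_tokens: int, context_window: int = 4097) -> int:
    """Return a max_completion_tokens value, or raise if the prompt leaves too little room."""
    remaining = context_window - prompt_tokens
    if remaining < MIN_RESERVATION:
        raise ValueError(
            f"Prompt leaves only {remaining} tokens; "
            f"at least {MIN_RESERVATION} are needed for a useful response."
        )
    return min(MAX_CUTOFF, remaining)

# completion_budget(1360) -> 2000; completion_budget(3900) raises, since only 197 tokens remain.
```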
This does have utility, though: if you reserve 2000 tokens of response, you can't send 4000 tokens of input to a model that supports only a 4097-token context window. Without that reservation, you'd be left with just 97 tokens of unsatisfying response before the output is cut off.
Use gpt-4o-2024-11-20 as your AI model, and your context window is bumped to 125k, pretty much solving the concern.
For the current 4k model you show, setting a maximum near the model's limit has no cost-safety utility anyway: just omit the max_completion_tokens parameter, and you get the arbitrary-length output you desire.