The new o1 series of models deprecates the max_tokens parameter in favor of a new max_completion_tokens parameter, and I'd like to understand the rationale for this change, as it's likely to have wide-reaching impacts for what appears to be a simple wording change.
Most of the breaking changes up to this point have made sense, but I can't understand the rationale for this one; it looks like you just wanted to use a different word for the same exact parameter value. My concern is that those of us who build SDKs on top of OpenAI have to release a patch so that customers can leverage the new o1 models (I'm doing that now), which is fine. But for those of us who also need to support Azure OpenAI, how do you expect us to even know that an o1 series model is being used? That's completely hidden from us with Azure OpenAI.
If this change were serving some broader goal I'd get it, but if that's the case, it's not obvious to me what that broader goal is.
I think it has to do with the difference between input and output tokens. The input limit for o1 is 128k tokens, while the output (completion) limit is around 32k tokens; o1-mini has a max_completion_tokens limit of around 64k.
I think the reason behind it is to make it clear what type of tokens it's talking about, but I could be wrong and there could be another reason behind it.
So I’ve always felt that the name sucks… I use max_input_tokens and max_output_tokens in my code and then map max_output_tokens to max_tokens.
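Roughly, the mapping looks like this in my code (the max_input_tokens / max_output_tokens names are my own SDK's, not OpenAI's):

def to_openai_params(options: dict) -> dict:
    # max_input_tokens is enforced client-side when building the prompt;
    # only the output limit maps to an OpenAI request parameter.
    params = {}
    if "max_output_tokens" in options:
        params["max_tokens"] = options["max_output_tokens"]
    return params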
Is max_completion_tokens a better name? Absolutely! Should they change it and in the process break thousands of apps? Absolutely not! It's just a word. You picked one that sucks, but you have to live with those choices, or at least have a better sense of the impact these changes are going to have.
Hi, Atty from OpenAI here — max_tokens continues to be supported in all existing models, but the o1 series only supports max_completion_tokens.
We are doing this because max_tokens previously meant both the number of tokens we generated (and billed you for) and the number of tokens you got back in your response. With the o1 models, this is no longer true — we generate more tokens than we return, as reasoning tokens are not visible. Some clients may have depended on the previous behavior and written code that assumes that max_tokens equals usage.completion_tokens or the number of tokens they received. To avoid breaking these clients, we are requiring you opt-in to the new behavior by using a new parameter.
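A minimal sketch of the opt-in from the Python SDK (the exact client surface may differ depending on your SDK version):

from openai import OpenAI

client = OpenAI()

# o1 models require the new parameter; for them, max_tokens is not accepted.
resp = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "Summarize this change in one sentence."}],
    max_completion_tokens=1000,  # caps all generated tokens, including hidden reasoning tokens
)

# Billed completion tokens can now exceed the visible text you get back,
# because reasoning tokens are generated but not returned in the message.
print(resp.usage.completion_tokens)
print(resp.choices[0].message.content)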
Ah, I see… OK, that's a fair explanation. The issue is that we're building an SDK on top of your service (the Microsoft Teams AI Library), and we can't really pass this change on to our customers, because we would get massive pushback if we told them they basically need to make changes everywhere in their code just to leverage the latest model.
The other issue is going to be Azure OpenAI. The developer can name their deployment anything they want, so there's no way to know that they're using o1 and not gpt-4o. The only thing we know is chat vs. text completions.
Because of these two issues, we're going to have no choice but to simply map max_tokens to max_completion_tokens internally for every model, including gpt-4o requests. I suspect that LangChain, LlamaIndex, and everyone else will be forced to do the same thing, and I suspect that 95% of your other customers will just do a search and replace.
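Concretely, that internal mapping will look something like this (a sketch of what our wrapper will have to do, not actual library code):

def normalize_completion_params(params: dict) -> dict:
    # Blanket-map max_tokens -> max_completion_tokens for every model, because
    # with Azure OpenAI deployments we can't reliably tell whether the
    # deployment behind a given name is o1 or gpt-4o.
    params = dict(params)
    if "max_tokens" in params and "max_completion_tokens" not in params:
        params["max_completion_tokens"] = params.pop("max_tokens")
    return params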
I personally don't see what this is buying you. Developers already have a way of knowing that they're using a model with different cost semantics: they have to pass in the name of the model they want to use. Making them "opt in" to new policies by having to modify their code isn't the way to approach this.
@stevenic I'm not so sure it works like that: "reasoning tokens" are a CoT prompt that is generated based on RL optimization, and the reward function probably doesn't take a variable number of tokens into account (e.g. your max_reasoning_tokens), only some static upper bound as a constraint (and this is what they provide in their guide). From my playing with RL (many years ago), getting the reward function just right is tricky, and you don't want to impose too many constraints, otherwise it becomes exceedingly expensive or simply won't work well.
I'm assuming that they're not doing RL at inference time because that would be way too expensive compute-wise. They're most likely (with 95% certainty) sitting in a loop, predicting assistant messages to append to the prompt and then re-prompting to predict the next message to append until they reach some form of stop state. That loop consumes inference tokens which they have to bill for, which means they could easily accept a parameter giving them a policy for when to abort due to cost.
I say 95% certainty because this is what my reasoning engine does, and I have both max-token and max-time policies:
if len(curr_reasoning_tokens) > max_reasoning_tokens:
    curr_reasoning_tokens = prev_reasoning_tokens or DEFAULT_REASONING_TOKENS
    break
So here they would keep the previous "reasoning tokens": if the newly generated reasoning tokens are greater in number than the max parameter, fall back to the previous ones, since those must have met the policy; otherwise just use some default system prompt.
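To make the shape of that loop concrete, here is a rough sketch of the kind of reasoning loop being described (purely speculative; generate_step and the budget parameters are illustrative, not part of any real API):

import time

def run_reasoning_loop(generate_step, max_reasoning_tokens, max_time_secs):
    # generate_step(reasoning) is a stand-in for whatever call produces the
    # next intermediate message; it returns (tokens, is_final).
    start = time.monotonic()
    reasoning = []
    while True:
        tokens, is_final = generate_step(reasoning)
        reasoning.extend(tokens)
        if is_final:
            break  # model reached its stop state
        if len(reasoning) > max_reasoning_tokens:
            break  # abort: token budget (and therefore cost) exceeded
        if time.monotonic() - start > max_time_secs:
            break  # abort: time budget exceeded
    return reasoning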
Yes… They chose to rename max_tokens to max_completion_tokens as some sort of "opt-in" to the fact that you may get charged for more tokens than you get back. By deprecating max_tokens, you're basically saying that all developers need to "opt in" to these additional charges. The comment was that they intend to keep supporting max_tokens for existing models but not for new ones. So why deprecate it then? It's because they don't plan to support the current models much longer, hence the requirement to "opt in" to the new charging policies.
I'll cut to the chase and say what I think they should have done. They should have left max_tokens as-is, because nothing has really changed behavior-wise with the new parameter: it still controls how many tokens you get back in your response, so it's just a name change. They should then have added a new max_reasoning_tokens parameter that gives developers a mechanism for controlling costs by limiting how long o1 is allowed to think.
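Something like this, where max_reasoning_tokens is a hypothetical parameter that does not exist in the actual API:

# Hypothetical request shape for the alternative I'm proposing; the
# max_reasoning_tokens field is made up and would be rejected by the real API.
proposed_request = {
    "model": "o1-preview",
    "messages": [{"role": "user", "content": "Plan a 3-step rollout."}],
    "max_tokens": 800,             # unchanged meaning: cap on visible output tokens
    "max_reasoning_tokens": 4000,  # new knob: cap on hidden "thinking" tokens
}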
stevenic is correct. If you need a new API endpoint, make one; do not complicate the existing one. This is not code reuse, this is madness, like yeah… Spartaaa