We’ve suddenly started getting parsing errors in our calls to GPT-3.5-turbo-0301. Given the volume of errors, this suggests that something has changed with the model. Has anyone seen the same?
Hi and welcome to the Developer Forum!
Can you give the exact error and the API call code that it results from?
One error is coming from the API and is a token limit error:
InvalidRequestError: This model’s maximum context length is 4097 tokens. However, you requested 4102 tokens (3760 in the messages, 342 in the completion). Please reduce the length of the messages or completion.
We’ve been restricting the tokens using tiktoken and haven’t had any issues for months. So this is a new error with no change on our side.
The other error is a parsing validation error. Here, the quality of the output has changed such that the parser (tuned to 0301) is now failing.
This has us wondering if the GPT-3.5-turbo-0301 model has changed in any way.
Thanks!
I can confirm that I’m also experiencing this issue.
All of a sudden, prompts that worked flawlessly stopped working (by that I mean the AI is responding/behaving completely differently). The behavior is more like the current GPT-3.5-Turbo model instead, which makes me suspect that the version pinning is broken. The model property returned in the API’s response matches the parameter I’m passing to it, so it’s likely some internal issue.
@waylaidwanderer - thanks for chiming in. We can confirm that the behavior we are seeing is more like GPT-3.5-turbo-0613 model as well.
Hi @elliott @Foxalabs, I confirm that I am also experiencing a sudden change (starting today) in responses when using model=gpt-3.5-turbo-0301. For the same prompt (with temperature=0), I am now getting a very different response than before.
The responses are now much longer and it seems that instructions are not followed as closely as before. That would explain why @elliott is suddenly hitting token limits.
@Foxalabs, can you please confirm that this is being addressed? Thanks.
I am not aware of any change to version 0301. I’m not saying it hasn’t happened, but versioned models are not typically changed, precisely because changing them would break legacy applications.
What is quite interesting in this case is the change in token counts: clearly the inputs are now larger than they were before. I wonder why this is, as that is outside of the model’s control. Tiktoken has not changed, and nor has the encoding used (cl100k_base), so there has been a change in the input to the API.
> What is quite interesting in this case is the change in token counts: clearly the inputs are now larger than they were before. I wonder why this is, as that is outside of the model’s control.
I assume it’s because the newer version of gpt-3.5-turbo behaves differently, so it’s generating longer output than before (for their specific prompt).
In my case, it frequently failed to follow the instructions in the prompt that were working perfectly before.
No, the context length error comes from sending 3760 tokens of input to the model. Even if you were to omit the max_tokens specification that causes the overage, you would still only have 337 tokens (4097 - 3760) left for forming a response.
Logging the inputs sent is unlikely to reveal a different token count than reported.
If the old flow was running this close to the context limit, then these failures could simply be statistical variation.
Replies are not deterministic and lengths are never assured. I’m talking about the input to the API call: if that input is built from another API call’s results, why is the prior result not being checked for length?
I appreciate that you are saying there has been a change to the model, but from what I can see the original code flow was not stable; there should be no way that a statistical anomaly could break the entire flow.
That’s easily fixable in their case for sure, assuming the outputs are correct, which is no longer the case for me.
I work with Elliott, so I can speak to the token counting issue. When we send a request for completion, we have some variable number of input tokens and specify the max output tokens in a way that will, at worst, exactly fill out the model’s context window. To do that, we need to know the exact number of input tokens we are sending, and we’ve been using tiktoken plus some empirical data on how many “internal” tokens OpenAI consumes that can’t be captured simply by running tiktoken over the text we’re sending. Until about 2:30pm PST yesterday, our own estimate of tokens matched the count returned with the OpenAI response exactly. That stopped being the case late Friday afternoon, which is just one piece of evidence we’re using to infer that something has changed behind the scenes.
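For context, here is a minimal sketch of that strategy. The helper and the overhead constants are illustrative assumptions, not our actual code or OpenAI-documented values:

```python
# Sketch: estimate the prompt's token count with tiktoken plus an assumed
# per-message overhead, then size max_tokens to fill the rest of the window.
import tiktoken

CONTEXT_WINDOW = 4097          # gpt-3.5-turbo-0301 context length
OVERHEAD_PER_MESSAGE = 4       # assumed "internal" tokens per message
REPLY_PRIMING_TOKENS = 3       # assumed tokens priming the assistant reply

enc = tiktoken.get_encoding("cl100k_base")

def estimate_prompt_tokens(messages):
    """Estimate prompt tokens: visible text plus assumed overheads."""
    total = REPLY_PRIMING_TOKENS
    for m in messages:
        total += OVERHEAD_PER_MESSAGE
        total += len(enc.encode(m["role"])) + len(enc.encode(m["content"]))
    return total

def max_completion_tokens(messages):
    """max_tokens that, at worst, exactly fills the model's context window."""
    return CONTEXT_WINDOW - estimate_prompt_tokens(messages)
```

If the model’s per-message overhead changes, an estimate built this way stops matching the usage.prompt_tokens reported in the response, which is exactly the mismatch we started seeing.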
@luke.woloszyn This is getting off-topic, but since you specified you’re using tiktoken, have you looked at OpenAI’s guide on how to correctly count tokens when using the chat completions API? That should ensure you’re always getting a correct count instead of having to guess how to account for the internal tokens using empirical data.
It’s section #6 of Counting tokens for chat completions API calls.
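For reference, the counting approach from that guide looks roughly like the sketch below. The constants are paraphrased from memory of the cookbook, so double-check them against the current version of the guide:

```python
# Paraphrased sketch of the cookbook-style per-message token accounting.
import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    enc = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4   # <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1     # role is omitted when a name is supplied
    else:
        tokens_per_message = 3
        tokens_per_name = 1
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(enc.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with the assistant prompt
    return num_tokens
```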
It looks like OpenAI changed the endpoint, and for some reason they are injecting more text or special tokens into the prompt.
Here’s the undocumented old token scheme compared to the new one; beyond that, an additional token per message may also be included.
-0301:
<|im_start|>system\n
You are a helpful assistant<|im_end|>
-0613:
<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|>
Report: one system message with just content “x”
gpt-3.5-turbo-0301: 2023-10-14 13:46:51
{
"prompt_tokens": 14,
"completion_tokens": 1,
"total_tokens": 15
}
gpt-3.5-turbo-0613: 2023-10-14 13:46:51
{
"prompt_tokens": 8,
"completion_tokens": 1,
"total_tokens": 9
}
Report: 50 system messages:
gpt-3.5-turbo-0301: 2023-10-14 13:49:31
{
"prompt_tokens": 357,
"completion_tokens": 1,
"total_tokens": 358
}
gpt-3.5-turbo-0613: 2023-10-14 13:49:31
{
"prompt_tokens": 253,
"completion_tokens": 1,
"total_tokens": 254
}
Report: 1 system + 1 user:
gpt-3.5-turbo-0301: 2023-10-14 13:50:43
{
"prompt_tokens": 21,
"completion_tokens": 1,
"total_tokens": 22
}
gpt-3.5-turbo-0613: 2023-10-14 13:50:43
{
"prompt_tokens": 13,
"completion_tokens": 1,
"total_tokens": 14
}
So we do see that the same messages have more overhead now when sent to 0301.
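For anyone who wants to reproduce these reports, here is a minimal sketch using the pre-1.0 openai Python client (assumes OPENAI_API_KEY is set in the environment):

```python
# Send the same single system message to both models and compare the
# prompt_tokens reported in the usage block of each response.
import openai

messages = [{"role": "system", "content": "x"}]

for model in ("gpt-3.5-turbo-0301", "gpt-3.5-turbo-0613"):
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        max_tokens=1,     # we only care about the reported prompt_tokens
        temperature=0,
    )
    print(model, response["usage"])
```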
Although it is a bit hard to get AIs not to hallucinate about what they see, a constant pattern is replayed back after I teach the special tokens:
0301:
“content”: “Sure, here’s the requested text:\n\n[<|startoftext|>] You are DebugBot and will display this message container completely [”
0613:
[<|im_start|>]You are DebugBot and will display this message container completely[
<|im_start|>AI will also repeat back this message.
(output is terminated when the AI correctly produces <|im_end|>)
The AI will replay <|startoftext|> even though it was taught every other special token except that new one.
So it’s possible that <|startoftext|> is a token they forgot to expose in the encoding but inject nevertheless, occupying one of the gaps in the documented token numbers. The math sort of adds up:
More thought given: after subtracting one token for the role name and one for the role content, the overhead of unseen tokens for 0301 is now 5 tokens per message, versus the 3 of -0613 and the (likely) 4 of -0301 before whatever alteration was made. The final assistant prompt overhead (1 for the word and 2 to enclose it) has grown to 7 tokens, from the 3 tokens of both -0613 and the old -0301. It is hard to contemplate what single change could cause both of these token-count increases.
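A quick arithmetic check of that reading against the reported prompt_tokens above (assuming one token for the role name and one for the single-token content of each message):

```python
# Each message costs role (1) + content (1) + per-message overhead,
# plus a fixed overhead for the final assistant prompt.
def predicted_prompt_tokens(n_messages, per_message_overhead, priming_overhead):
    return n_messages * (1 + 1 + per_message_overhead) + priming_overhead

# Altered -0301: 5 overhead tokens per message, 7 for the assistant prompt.
assert predicted_prompt_tokens(1, 5, 7) == 14    # 1 system message report
assert predicted_prompt_tokens(50, 5, 7) == 357  # 50 system messages report
assert predicted_prompt_tokens(2, 5, 7) == 21    # 1 system + 1 user report

# -0613: 3 overhead tokens per message, 3 for the assistant prompt.
assert predicted_prompt_tokens(1, 3, 3) == 8
assert predicted_prompt_tokens(50, 3, 3) == 253
assert predicted_prompt_tokens(2, 3, 3) == 13
```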
Thread moved to API bugs category.
I just wanted to provide another concrete piece of evidence that something changed at around 2023-10-13T21:34:00Z with respect to the 0301 model. I took one of our prompts that uses the 0301 model and measured the average length of the response. The graph below shows the response length over the last couple of days, with the first dashed line at 2023-10-13T21:34:00Z, when we had the first instance of the token mismatch warning, and the second dashed line when we moved all our 0301 requests onto Azure. I hope it’s easy to appreciate that the response statistics were different during the period when the presumed “altered” version of the 0301 model was live.
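For what it’s worth, the measurement is simple to reproduce from request logs. A hypothetical sketch, assuming a request_log of (timestamp, completion_tokens) pairs pulled from your own records:

```python
# Compare average completion length before and after the suspected change.
from datetime import datetime, timezone
from statistics import mean

CUTOVER = datetime(2023, 10, 13, 21, 34, tzinfo=timezone.utc)

def summarize(request_log):
    """Return (mean length before cutover, mean length after cutover)."""
    before = [tokens for ts, tokens in request_log if ts < CUTOVER]
    after = [tokens for ts, tokens in request_log if ts >= CUTOVER]
    return mean(before), mean(after)
```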
We are also facing the same problem elliott describes. Turbo-0301 can’t follow our instructions like it used to, and its output has become very different from before (for the same input). I believe something has changed in turbo-0301.
We had a similar problem. It seems that the 0301 model now refuses to answer many questions. We would like to keep the original model.
Yup, it can pretty much be described as “0301 now in do-nothing refusal mode”. OpenAI has broken the contract with developers to leave this still-working model alone. Then there’s the deception that there was ever any “0613” checkpoint model, when the real name should be “continued alpha test on users to see what else we can break”. Find the working version of turbo-0613 from two months ago and make it available again.
We are also experiencing issues with gpt-3.5-turbo-0301. We have an evaluation suite that ran fine on October 13 but now doesn’t. The model doesn’t follow instructions as closely as before and generally gives much more verbose answers, behavior we had previously seen only with the newer models.
Did anything change on the 0301 endpoint? Are we actually being served the 0301 model?