We’ve suddenly started getting parsing errors in our calls to GPT-3.5-turbo-0301. Given the volume of errors, this suggests that something has changed with the model. Has anyone seen the same?
Hi and welcome to the Developer Forum!
Can you give the exact error and the API call code that it results from?
One error is coming from the API and is a token limit error:
InvalidRequestError: This model’s maximum context length is 4097 tokens. However, you requested 4102 tokens (3760 in the messages, 342 in the completion). Please reduce the length of the messages or completion.
We’ve been restricting the tokens using tiktoken and haven’t had any issues for months. So this is a new error with no change on our side.
The other error is a parsing validation error. Here, the quality of the output has changed such that the parser (tuned to 0301) is now failing.
This has us wondering if the GPT-3.5-turbo-0301 model has changed in any way.
Thanks!
I can confirm that I’m also experiencing this issue.
All of a sudden, prompts that worked flawlessly stopped working (by that I mean the AI is responding/behaving completely differently). The behavior is more like the current GPT-3.5-Turbo model instead, which makes me suspect that the version pinning is broken. The model property returned in the API’s response matches the parameter I’m passing to it, so it’s likely some internal issue.
@waylaidwanderer - thanks for chiming in. We can confirm that the behavior we are seeing is more like GPT-3.5-turbo-0613 model as well.
Hi @elliott @Foxalabs, I confirm that I am also experiencing a sudden change (starting today) in responses when using model=gpt-3.5-turbo-0301. For the same prompt (with temperature=0), I am now getting a very different response than before.
The responses are now much longer and it seems that instructions are not followed as closely as before. That would explain why @elliott is suddenly hitting token limits.
@Foxalabs, can you please confirm that this is being addressed? Thanks.
I am not aware of any change to version 0301. I’m not saying it hasn’t happened, but versioned models are not typically changed, precisely because changing them would break legacy applications.
What is quite interesting in this case is the change in token counts: clearly the inputs are now larger than they were before. I wonder why this is, as that is outside of the model’s control. Tiktoken has not changed, and nor has the encoding used (cl100k_base), so there has been a change in the input to the API.
> What is quite interesting in this case is the change in token counts: clearly the inputs are now larger than they were before. I wonder why this is, as that is outside of the model’s control.
I assume it’s because the newer version of gpt-3.5-turbo behaves differently, so it’s generating longer output than before (for their specific prompt).
In my case, it frequently failed to follow the instructions in the prompt that were working perfectly before.
No, the context length error comes from sending 3760 tokens of input to the model. Even if you were to omit the max_tokens specification that causes the overage, you would still only have 337 tokens (4097 - 3760) left for forming a response.
Logging the inputs sent is unlikely to reveal a different token count than reported.
If the old flow was running this close to the context limit, then these failures could simply be statistical variation.
Replies are not deterministic and lengths are never assured. I’m talking about the input to the API call: if that input is built from another API call’s results, why is the prior result not being checked for length?
I appreciate that you are saying there has been a change to the model, but from what I can see the original code flow was not stable; there should be no way that a statistical anomaly could break the entire flow.
That’s easily fixable in their case for sure, assuming the outputs are correct, which is no longer the case for me.
I work with Elliott, so I can speak to the token counting issue. When we send a request for completion, we have some variable number of input tokens and specify the max output tokens in a way that will, at worst, exactly fill out the model’s context window. To do that, we need to know the exact number of input tokens we are sending, and we’ve been using tiktoken plus some empirical data on how many “internal” tokens OpenAI consumes that can’t be captured simply by running tiktoken over the text we’re sending. Until about 2:30pm PST yesterday, our own estimate of tokens matched the count returned with the OpenAI response exactly. That stopped being the case late Friday afternoon, which is just one piece of evidence we’re using to infer that something has changed behind the scenes.
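For context, here is a minimal sketch of that strategy. The helper and the overhead constants are illustrative assumptions, not our actual code or OpenAI-documented values:

```python
# Sketch: estimate the prompt's token count with tiktoken plus an assumed
# per-message overhead, then size max_tokens to fill the rest of the window.
import tiktoken

CONTEXT_WINDOW = 4097          # gpt-3.5-turbo-0301 context length
OVERHEAD_PER_MESSAGE = 4       # assumed "internal" tokens per message
REPLY_PRIMING_TOKENS = 3       # assumed tokens priming the assistant reply

enc = tiktoken.get_encoding("cl100k_base")

def estimate_prompt_tokens(messages):
    """Estimate prompt tokens: visible text plus assumed overheads."""
    total = REPLY_PRIMING_TOKENS
    for m in messages:
        total += OVERHEAD_PER_MESSAGE
        total += len(enc.encode(m["role"])) + len(enc.encode(m["content"]))
    return total

def max_completion_tokens(messages):
    """max_tokens that, at worst, exactly fills the model's context window."""
    return CONTEXT_WINDOW - estimate_prompt_tokens(messages)
```

If the model’s per-message overhead changes, an estimate built this way stops matching the usage.prompt_tokens reported in the response, which is exactly the mismatch we started seeing.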
@luke.woloszyn This is getting off-topic, but since you specified you’re using tiktoken, have you looked at OpenAI’s guide on how to correctly count tokens when using the chat completions API? That should ensure you’re always getting a correct count instead of having to guess how to account for the internal tokens using empirical data.
It’s section #6 of Counting tokens for chat completions API calls.
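For reference, the counting approach from that guide looks roughly like the sketch below. The constants are paraphrased from memory of the cookbook, so double-check them against the current version of the guide:

```python
# Paraphrased sketch of the cookbook-style per-message token accounting.
import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    enc = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4   # <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1     # role is omitted when a name is supplied
    else:
        tokens_per_message = 3
        tokens_per_name = 1
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(enc.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with the assistant prompt
    return num_tokens
```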
It looks like OpenAI changed the endpoint, and for some reason they are injecting more text or special tokens into the prompt.
Here’s the undocumented old token scheme compared to the new one; beyond that, an additional token per message may also be included.
-0301:
<|im_start|>system\n
You are a helpful assistant<|im_end|>
-0613:
<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|>
Report: one system message with just content “x”
gpt-3.5-turbo-0301: 2023-10-14 13:46:51
{
"prompt_tokens": 14,
"completion_tokens": 1,
"total_tokens": 15
}
gpt-3.5-turbo-0613: 2023-10-14 13:46:51
{
"prompt_tokens": 8,
"completion_tokens": 1,
"total_tokens": 9
}
Report: 50 system messages:
gpt-3.5-turbo-0301: 2023-10-14 13:49:31
{
"prompt_tokens": 357,
"completion_tokens": 1,
"total_tokens": 358
}
gpt-3.5-turbo-0613: 2023-10-14 13:49:31
{
"prompt_tokens": 253,
"completion_tokens": 1,
"total_tokens": 254
}
Report: 1 system + 1 user:
gpt-3.5-turbo-0301: 2023-10-14 13:50:43
{
"prompt_tokens": 21,
"completion_tokens": 1,
"total_tokens": 22
}
gpt-3.5-turbo-0613: 2023-10-14 13:50:43
{
"prompt_tokens": 13,
"completion_tokens": 1,
"total_tokens": 14
}
So we do see that the same messages have more overhead now when sent to 0301.
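For anyone who wants to reproduce these reports, here is a minimal sketch using the pre-1.0 openai Python client (assumes OPENAI_API_KEY is set in the environment):

```python
# Send the same single system message to both models and compare the
# prompt_tokens reported in the usage block of each response.
import openai

messages = [{"role": "system", "content": "x"}]

for model in ("gpt-3.5-turbo-0301", "gpt-3.5-turbo-0613"):
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        max_tokens=1,     # we only care about the reported prompt_tokens
        temperature=0,
    )
    print(model, response["usage"])
```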
Although it is a bit hard to get AIs not to hallucinate about what they see, a constant pattern is replayed back after I teach the special tokens:
0301:
“content”: “Sure, here’s the requested text:\n\n[<|startoftext|>] You are DebugBot and will display this message container completely [”
0613:
[<|im_start|>]You are DebugBot and will display this message container completely[
<|im_start|>AI will also repeat back this message.
(output is terminated when the AI correctly produces <|im_end|>)
The AI will replay <|startoftext|> even though it was taught every other special token except that new one.
So it’s possible that <|startoftext|> is a token they forgot to expose in the encoding but inject nevertheless, occupying one of the gaps in the documented token numbers. The math sort of adds up:
More thought given: after subtracting one token for the role name and one for the role content, the overhead of unseen tokens for 0301 is now 5 tokens per message, versus the 3 of -0613 and the (likely) 4 of -0301 before whatever alteration was made. The final assistant prompt overhead (1 for the word and 2 to enclose it) has grown to 7 tokens, from the 3 tokens of both -0613 and the old -0301. It is hard to contemplate what single change could cause both of these token-count increases.
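A quick arithmetic check of that reading against the reported prompt_tokens above (assuming one token for the role name and one for the single-token content of each message):

```python
# Each message costs role (1) + content (1) + per-message overhead,
# plus a fixed overhead for the final assistant prompt.
def predicted_prompt_tokens(n_messages, per_message_overhead, priming_overhead):
    return n_messages * (1 + 1 + per_message_overhead) + priming_overhead

# Altered -0301: 5 overhead tokens per message, 7 for the assistant prompt.
assert predicted_prompt_tokens(1, 5, 7) == 14    # 1 system message report
assert predicted_prompt_tokens(50, 5, 7) == 357  # 50 system messages report
assert predicted_prompt_tokens(2, 5, 7) == 21    # 1 system + 1 user report

# -0613: 3 overhead tokens per message, 3 for the assistant prompt.
assert predicted_prompt_tokens(1, 3, 3) == 8
assert predicted_prompt_tokens(50, 3, 3) == 253
assert predicted_prompt_tokens(2, 3, 3) == 13
```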
Thread moved to API bugs category.
I just wanted to provide another concrete piece of evidence that something changed at around 2023-10-13T21:34:00Z with respect to the 0301 model. I took one of our prompts that uses the 0301 model and measured the average length of the response. The graph below shows the response length over the last couple of days, with the first dashed line at 2023-10-13T21:34:00Z, when we had the first instance of the token mismatch warning, and the second dashed line when we moved all our 0301 requests onto Azure. I hope it’s easy to appreciate that the response statistics were different during the period when the presumed “altered” version of the 0301 model was live.
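For what it’s worth, the measurement is simple to reproduce from request logs. A hypothetical sketch, assuming a request_log of (timestamp, completion_tokens) pairs pulled from your own records:

```python
# Compare average completion length before and after the suspected change.
from datetime import datetime, timezone
from statistics import mean

CUTOVER = datetime(2023, 10, 13, 21, 34, tzinfo=timezone.utc)

def summarize(request_log):
    """Return (mean length before cutover, mean length after cutover)."""
    before = [tokens for ts, tokens in request_log if ts < CUTOVER]
    after = [tokens for ts, tokens in request_log if ts >= CUTOVER]
    return mean(before), mean(after)
```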
We are also facing the same problem elliott describes. Turbo-0301 can’t follow our instructions like it used to, and its output has become very different from before (for the same input). I believe something has changed in turbo-0301.
We had a similar problem. It seems that the 0301 model now refuses to answer many questions. We would like to keep the original model.
Yup, it can pretty much be described as “0301 now in do-nothing refusal mode”. OpenAI has broken the contract with developers to leave this still-working model alone. Then there’s the deception that there was ever any “0613” checkpoint model, when the real name should be “continued alpha test on users to see what else we can break”. Find the working version of turbo-0613 from two months ago and make it available again.
We are also experiencing issues with gpt-3.5-turbo-0301. We have an evaluation suite that ran fine on October 13 but now doesn’t. The model doesn’t follow instructions as closely as before and generally gives much more verbose answers, behavior we had previously seen only with the newer models.
Did anything change on the 0301 endpoint? Are we actually being served the 0301 model?