Has anyone noticed that "gpt-3.5-turbo-16k" is behaving poorly in the last week?

I’ve been experimenting with some previous work using “gpt-3.5-turbo-16k”, and the output has been behaving strangely, oftentimes regurgitating parts of the prompt itself.

However, when I switch to the legacy gpt-3.5 model, “gpt-3.5-turbo-16k-0613”, the problem seems to go away completely.

Was there an update in “gpt-3.5-turbo-16k” that has changed something?

Welcome to the Forum!

I’ve swapped my 4k 3.5 over to 16k over the past few weeks and I’ve not noticed anything like that; it could be a use-case-specific issue, I guess. Have you got any examples you are willing to share?

The problem only seems to happen when the prompt is larger. We’ve used the tiktoken package to count the number of tokens in our prompt and we’re still well within the 16k token count.
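For reference, this is roughly how we count them (a minimal sketch, assuming cl100k_base is the right encoding for the gpt-3.5-turbo family; it ignores the few extra tokens the chat format adds per message):

import tiktoken

# Approximate prompt size in tokens; treat this as a lower bound since
# the chat format adds a small per-message overhead on top.
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(encoding.encode(text))

print(count_tokens(prompt))  # well under the 16k context window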

The same larger prompt does not have issues when running the legacy model “gpt-3.5-turbo-16k-0613”.

Have you done any prompt dumps? I.e., logged the prompt text and gone over a random selection of them to make sure they are 100% built up as expected? Sometimes, especially with vector retrievals, you can end up with low-quality stuff as context because the larger window allows more retrieved entries in, and that throws the performance off. Not saying that’s the case here, but it’s a possibility.

Log everything from prompt to reply in raw format and dump it out; you might find some oddities.
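Something as simple as this does the job (a rough sketch; the file name and fields are just placeholders to adapt):

import json, time

# Append each raw prompt/reply pair to a JSONL file so individual calls
# can be inspected later for truncated context or template fragments.
def log_exchange(model, prompt, reply, path="prompt_dump.jsonl"):
    entry = {"ts": time.time(), "model": model, "prompt": prompt, "reply": reply}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")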

Thanks for the suggestion @Foxalabs.

Yea, we double-checked the logs. The prompt is exactly the same and the results are massively different: “gpt-3.5-turbo-16k” is returning a result that regurgitates part of the prompt template, while the “gpt-3.5-turbo-16k-0613” legacy model is fine.

We started observing this yesterday.

There is no legacy version of gpt-3.5-turbo-16k. Are you sure those are the models you’re switching between?

gpt-3.5-turbo-16k should be returning results from gpt-3.5-turbo-16k-0613 since it is the latest version. You can verify this by checking the model property in the response.
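For example (a quick sketch with the openai Python package, same style as the snippet below):

import openai

# The "model" field of the response names the exact snapshot that served
# the request, so the alias can be checked against the pinned version.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=[{"role": "user", "content": "ping"}],
)
print(response.model)  # should report gpt-3.5-turbo-16k-0613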

Yea, I’m using the Python API:

import openai

# Send the prompt and read back just the reply text.
response_str = openai.ChatCompletion.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

where model is gpt-3.5-turbo-16k or gpt-3.5-turbo-16k-0613.

*shrug

That is mighty strange. It totally could be some bug on the backend, but… it does add some credence to it maybe being some obscure code issue on your end. Now, I know it’s easy to say that, and it may well not be, but it sure seems odd with them both being 0613… have you got any code segments you can post? I don’t mind signing an NDA if this is corporate stuff.

Yea, it is for corporate work. I’ll have to communicate with the higher-ups to see if releasing our code is ok.

The issue has now reproduced 10 times in a row and the difference is pretty stark :frowning: I’ll update this thread as I collect more data.
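In the meantime, this is roughly the loop we’re using to reproduce it (a sketch; prompt is the same large prompt in every call):

import openai

models = ["gpt-3.5-turbo-16k", "gpt-3.5-turbo-16k-0613"]

# Send the identical prompt to both model names repeatedly and print the
# replies side by side so the behaviour can be compared run by run.
for run in range(10):
    for model in models:
        reply = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        print(f"run {run} | {model}\n{reply}\n" + "-" * 40)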

It’s been behaving badly for some time now.

I am told that it’s the prompt, not the model. But gpt-4 seems to answer every question gpt-3.5-turbo-16k can’t. Consistently. With the same context. So, I don’t know.

Please could you help me understand what “poorly” means here?

For my money, this is behaving “poorly”: Gpt-3.5-turbo-16k api not reading context documents