Has anyone noticed that "gpt-3.5-turbo-16k" is behaving poorly in the last week?

I’ve been experimenting with some previous work using “gpt-3.5-turbo-16k”, and the output has been behaving strangely, oftentimes regurgitating parts of the prompt itself.

However, when I switch to the legacy gpt-3.5 model, “gpt-3.5-turbo-16k-0613”, the problem seems to go away completely.

Was there an update in “gpt-3.5-turbo-16k” that has changed something?

Welcome to the Forum!

I’ve swapped my 4k 3.5 over to 16k over the past few weeks and I’ve not noticed anything like that; it could be a use-case-specific issue, I guess. Have you got any examples you are willing to share?

The problem only seems to happen when the prompt is larger. We’ve used the tiktoken package to count the number of tokens in our prompt and we’re still well within the 16k token count.
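For reference, this is roughly how we count them (a minimal sketch, assuming cl100k_base is the right encoding for the gpt-3.5-turbo family; it ignores the few extra tokens the chat format adds per message):

import tiktoken

# Approximate prompt size in tokens; treat this as a lower bound since
# the chat format adds a small per-message overhead on top.
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(encoding.encode(text))

print(count_tokens(prompt))  # well under the 16k context window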

The same larger prompt does not have issues when running the legacy model “gpt-3.5-turbo-16k-0613”.

Have you done any prompt dumps? I.e., logged the prompt text and gone over a random selection of them to make sure they are 100% built up as expected? Sometimes, especially with vector retrievals, you can end up with low-quality stuff as context because the larger window allows more retrieved entries in, and that throws the performance off. Not saying that’s the case here, but it’s a possibility.

Log everything from prompt to reply in raw format and dump it out; you might find some oddities.
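Something as simple as this does the job (a rough sketch; the file name and fields are just placeholders to adapt):

import json, time

# Append each raw prompt/reply pair to a JSONL file so individual calls
# can be inspected later for truncated context or template fragments.
def log_exchange(model, prompt, reply, path="prompt_dump.jsonl"):
    entry = {"ts": time.time(), "model": model, "prompt": prompt, "reply": reply}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")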

Thanks for the suggestion @Foxalabs.

Yea, we double-checked the logs. The prompt is exactly the same and the results are massively different: “gpt-3.5-turbo-16k” is returning a result that regurgitates part of the prompt template, while the “gpt-3.5-turbo-16k-0613” legacy model is fine.

We started observing this yesterday.

There is no legacy version of gpt-3.5-turbo-16k. Are you sure those are the models you’re switching between?

gpt-3.5-turbo-16k should be returning results from gpt-3.5-turbo-16k-0613 since it is the latest version. You can verify this by checking the model property in the response.
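For example (a quick sketch with the openai Python package, same style as the snippet below):

import openai

# The "model" field of the response names the exact snapshot that served
# the request, so the alias can be checked against the pinned version.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-16k",
    messages=[{"role": "user", "content": "ping"}],
)
print(response.model)  # should report gpt-3.5-turbo-16k-0613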

Yea, I’m using the Python API:

import openai

# Send the prompt and read back just the reply text.
response_str = openai.ChatCompletion.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

where model is gpt-3.5-turbo-16k or gpt-3.5-turbo-16k-0613.

*shrug

That is mighty strange. It totally could be some bug on the backend, but… it does add some credence to it maybe being some obscure code issue on your end. Now, I know it’s easy to say that, and it may well not be, but it sure seems odd with them both being 0613… have you got any code segments you can post? I don’t mind signing an NDA if this is corporate stuff.

Yea, it is for corporate work. I’ll have to communicate with the higher-ups to see if releasing our code is ok.

The issue has now reproduced 10 times in a row and the difference is pretty stark :frowning: I’ll update this thread as I collect more data.
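In the meantime, this is roughly the loop we’re using to reproduce it (a sketch; prompt is the same large prompt in every call):

import openai

models = ["gpt-3.5-turbo-16k", "gpt-3.5-turbo-16k-0613"]

# Send the identical prompt to both model names repeatedly and print the
# replies side by side so the behaviour can be compared run by run.
for run in range(10):
    for model in models:
        reply = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        print(f"run {run} | {model}\n{reply}\n" + "-" * 40)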

It’s been behaving badly for some time now.

I am told that it’s the prompt, not the model. But gpt-4 seems to answer every question gpt-3.5-turbo-16k can’t. Consistently. With the same context. So, I don’t know.

Please could you help me understand what “poorly” means here?

For my money, this is behaving “poorly”: Gpt-3.5-turbo-16k api not reading context documents