Has anyone noticed that “gpt-3.5-turbo-16k” is behaving poorly in the last week?
Been experimenting with some previous work using “gpt-3.5-turbo-16k”, and the output has been strange, oftentimes regurgitating parts of the prompt itself.
However, when I switch to the legacy gpt-3.5 model, “gpt-3.5-turbo-16k-0613”, the problem seems to go away completely.
Was there an update in “gpt-3.5-turbo-16k” that has changed something?
I’ve swapped my 4k 3.5 over to 16k in the past few weeks and I’ve not noticed anything; it could be a use-case-specific issue, I guess. Got any examples you’re willing to share?
The problem only seems to happen when the prompt is larger. We’ve used the tiktoken package to count the number of tokens in our prompt, and we’re still well within the 16k token limit.
The same larger prompt has no issues when running the legacy model “gpt-3.5-turbo-16k-0613”.
Have you done any prompt dumps? I.e., logged the prompt text and gone over a random selection of them to make sure they are 100% built up as expected? Sometimes, especially with vector retrievals, you can get low-quality stuff as context, because allowing more top-k entries in the larger context window throws the performance off. Not saying that’s the case here, but it’s a possibility.
Log everything from prompt to reply in raw format and dump it out, might find some oddities.
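Something like this minimal sketch is what I mean (the file path and record shape are just placeholders):

```python
import json
import time

LOG_PATH = "llm_io.jsonl"  # hypothetical dump file, one JSON record per call

def log_exchange(model: str, messages: list, reply: str) -> None:
    """Append the raw prompt and reply so they can be eyeballed later."""
    record = {
        "ts": time.time(),
        "model": model,
        "messages": messages,  # exactly what was sent, unmodified
        "reply": reply,        # exactly what came back
    }
    with open(LOG_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Then grep or sample the JSONL file and diff the prompts between the two models.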
Yea, we double-checked the logs. The prompt is exactly the same and the results are massively different: “gpt-3.5-turbo-16k” returns a result regurgitating part of the prompt template, while the “gpt-3.5-turbo-16k-0613” legacy model is fine.
There is no legacy version of gpt-3.5-turbo-16k. Are you sure those are the models you’re switching between?
gpt-3.5-turbo-16k should be returning results from gpt-3.5-turbo-16k-0613, since that is the latest version. You can verify this via the model property in the response.
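Quick way to check: the chat completions response body includes a model field with the resolved snapshot name. A minimal sketch against the raw JSON (the payload below is abridged and the values are hypothetical):

```python
import json

# Abridged example of a chat completions response body (hypothetical values)
raw = """
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "gpt-3.5-turbo-16k-0613",
  "choices": [{"message": {"role": "assistant", "content": "..."}}]
}
"""

resp = json.loads(raw)
# If the 16k alias really resolves to the 0613 snapshot, this is what you'd see
print(resp["model"])
```

If your logs show two different model strings for the two code paths, that would explain the difference right there.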
That is mighty strange. It totally could be some bug on the backend, but… it does add some credence to it maybe being some obscure code issue on your end. Now, I know it’s easy to say that, and it absolutely could not be the case, but it sure seems odd with them both being 0613… got any code segments you can post? I don’t mind signing an NDA if this is corporate stuff.
I am told that it’s the prompt, not the model. But gpt-4 seems to answer every question gpt-3.5-turbo-16k can’t. Consistently. With the same context. So, I don’t know.