Have you done any prompt dumps? I.e., logged the prompt text and gone over a random selection to make sure the prompts are 100% built up as expected? Sometimes, especially with vector retrieval, low-quality chunks can end up in the context because the larger context window lets you pull in more top-k entries, and that throws the performance off. Not saying that’s the case here, but it’s a possibility.
Log everything from prompt to reply in raw format and dump it out; you might find some oddities.
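For what it's worth, here's a minimal sketch of what I mean, assuming a chat-completions-style call where you have the `messages` payload and the reply text in hand (the `prompt_dump.jsonl` path and the function names are just placeholders):

```python
import json
import random
import time

LOG_PATH = "prompt_dump.jsonl"  # hypothetical path; adjust to your setup

def log_exchange(model, messages, response_text):
    """Append the raw prompt and reply to a JSONL file for later inspection."""
    with open(LOG_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "model": model,
            "messages": messages,       # the exact payload you sent, untouched
            "response": response_text,  # the raw reply, untrimmed
        }, ensure_ascii=False) + "\n")

def sample_exchanges(n=10):
    """Pull a random selection of logged exchanges for manual review."""
    with open(LOG_PATH, encoding="utf-8") as f:
        lines = f.readlines()
    return [json.loads(line) for line in random.sample(lines, min(n, len(lines)))]
```

Call `log_exchange` right after each API call, passing the exact `messages` list you sent, then eyeball a handful of records from `sample_exchanges()` to confirm the prompt is assembled exactly as you expect.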
Yeah, we double-checked the logs. The prompt is exactly the same and the results are massively different: “gpt-3.5-turbo-16k” is returning a result that regurgitates part of the prompt template, while the legacy “gpt-3.5-turbo-16k-0613” model is fine.
That is mighty strange. It totally could be some bug on the backend, but… it does add some credence to it maybe being an obscure code issue on your end. Now, I know that’s easy to say, and it absolutely might not be the case, but it sure seems odd with them both being 0613… got any code segments you can post? I don’t mind signing an NDA if this is corporate stuff.