Possible decline in the format-following and reasoning capabilities of the GPT-3.5 series APIs

The GPT-3.5 series APIs include gpt-3.5-turbo, gpt-3.5-turbo-1106, and gpt-3.5-turbo-0125. Their ability to follow format requirements and to reason appears to have declined. Running the same benchmark, I achieved an accuracy of 0.87 on April 10th and earlier, but today, April 12th, it was only about 0.82. Is this an adjustment made by OpenAI?


The system_fingerprint in an API response is supposed to change when OpenAI makes backend changes to newer models that might affect determinism.

Have you recorded that from prior inputs?
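For anyone who wants to start tracking this, here is a minimal sketch of logging system_fingerprint alongside a benchmark run, using the official openai Python SDK (v1.x). The model, prompt, and seed value are illustrative:

```python
# Minimal sketch: record system_fingerprint with each benchmark run
# so backend changes can be detected. Requires the openai v1.x SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
    seed=42,  # seed + system_fingerprint together support reproducibility checks
)

# If this value differs between runs, the backend configuration has changed
# and accuracy numbers may no longer be directly comparable.
print(resp.system_fingerprint)
```

Storing that value next to each accuracy score would at least tell you whether a drop coincides with a backend change.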

Stealth modification of AI models is a modus operandi.

OpenAI should find and put back gpt-3.5-turbo-0613 exactly as it was in June – before they did multiple rounds of training and platform modification that broke system instruction following again and again. A “pro developer” edition that is not a “sorry, you only get ChatGPT now” AI.

I am a developer at the beginning stage. I just wrote a chain using LangChain and used LangChain's output parser to generate the format instructions. I have no idea about system_fingerprint and have never recorded it; I just intuitively feel that the performance of these APIs has degraded. I hope that OpenAI can keep a model's performance stable, because reproducible results are needed in many places.
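For context, a minimal sketch of the kind of chain described above, assuming the langchain, langchain-core, and langchain-openai packages; the response schema and question are illustrative, not the poster's actual code:

```python
# Sketch of a LangChain chain whose format instructions come from an
# output parser, as described above. Schema and prompt are examples only.
from langchain.output_parsers import ResponseSchema, StructuredOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

schemas = [ResponseSchema(name="answer", description="the final answer")]
parser = StructuredOutputParser.from_response_schemas(schemas)

prompt = PromptTemplate(
    template="Answer the question.\n{format_instructions}\nQuestion: {question}",
    input_variables=["question"],
    # The parser generates the format instructions injected into the prompt.
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
chain = prompt | llm | parser  # LCEL pipeline: prompt -> model -> parser

result = chain.invoke({"question": "What is 2 + 2?"})
print(result)  # e.g. {"answer": "4"}
```

A chain like this fails loudly when the model stops honoring the format instructions, which is exactly the kind of regression being reported here.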

I found claude-3-haiku to be better than gpt-3.5-turbo-0125 in format adherence and reasoning capability, so I will switch to Claude 3 for now.

I’m seeing a dramatic degradation in reasoning ability too. It seems to have started around the time of the Omni (GPT-4o) launch. This was working last week.