Last week, I ran gpt-4-turbo on my task (>1000 inferences) and it worked great. Today, I am trying the exact same prompts and it’s performing much worse. For example, it’s now consistently failing to write output in the expected format, even though I include 5 exemplars of that format in the prompt.
Has something changed with the gpt-4-turbo API recently? FWIW, I am using gpt-4-turbo instead of gpt-4o because gpt-4-turbo worked better for my task when I compared the two last week.
We are experiencing the same situation. We migrated from GPT-4-turbo to GPT-4o and then back after a few days. We’re using the Chat Completions API to generate JSON, and the number of JSON formatting errors has increased dramatically (not to mention issues with the content itself). The change is evident in our logs; something has definitely worsened.
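In case it helps anyone hitting the same formatting errors: a minimal sketch of what we mean by generating JSON through the Chat Completions API, assuming the openai Python SDK v1.x and a model that supports JSON mode (`response_format={"type": "json_object"}`). The keys and prompt text are hypothetical placeholders, not our actual prompt.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4-turbo",
    # JSON mode: the model is constrained to emit a syntactically valid JSON object.
    # Note: the word "JSON" must appear somewhere in the messages for this to be accepted.
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Reply with a JSON object with keys 'title' and 'tags'."},
        {"role": "user", "content": "Summarize this ticket: printer is on fire."},
    ],
)

try:
    data = json.loads(resp.choices[0].message.content)
except json.JSONDecodeError:
    # Even with JSON mode we log and retry rather than trusting the output blindly.
    data = None
```

Even with this, we still validate the parsed object against our expected schema before using it, since JSON mode only guarantees syntax, not the keys or content.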
I have found that, of all output formats, a markdown table is the most consistent across the new models. So if possible, I would suggest adapting your code to consume markdown tables; a rough parsing sketch is below.
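A rough sketch of what that adaptation could look like, assuming the model returns a pipe-delimited markdown table; the column names in the usage example are hypothetical.

```python
def parse_markdown_table(text: str) -> list[dict[str, str]]:
    """Parse the first pipe-delimited markdown table in `text` into a list of row dicts."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []

    def split_row(line: str) -> list[str]:
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    rows = []
    for line in lines[2:]:  # lines[1] is the |---|---| separator row
        cells = split_row(line)
        if len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows


reply = "| name | score |\n|---|---|\n| foo | 0.9 |"
print(parse_markdown_table(reply))  # [{'name': 'foo', 'score': '0.9'}]
```

The appeal is that even when the model drifts a bit (extra whitespace, bold headers), the row/column structure tends to survive, whereas JSON fails hard on a single missing brace.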
It’s a simple instruction: just wrap the response in ‘finalAnswer’ tags. But now I’m getting inconsistent tags like ‘finalFoo’, ‘finalBar’, etc. This wasn’t a problem last week when I ran 1000s of inferences.
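As a stopgap for the drifting tag names, here is a minimal extraction sketch I’ve been using, assuming XML-style tags like `<finalAnswer>...</finalAnswer>` (adjust the patterns if your delimiter differs). The lenient fallback that accepts any `final*` tag is my own workaround, not anything the API guarantees.

```python
import re


def extract_final_answer(reply: str) -> str | None:
    """Pull the content out of <finalAnswer>...</finalAnswer> tags.

    Falls back to any tag starting with 'final' (e.g. <finalFoo>) so that
    drifted tag names are still recoverable instead of failing the whole run.
    """
    strict = re.search(r"<finalAnswer>(.*?)</finalAnswer>", reply, re.DOTALL)
    if strict:
        return strict.group(1).strip()
    lenient = re.search(r"<(final\w*)>(.*?)</\1>", reply, re.DOTALL | re.IGNORECASE)
    return lenient.group(2).strip() if lenient else None
```

It obviously doesn’t fix the underlying regression, but it keeps the pipeline from silently dropping answers while we figure out what changed.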