gpt-3.5-turbo-0125: differences in performance on different days

Models such as gpt-3.5-turbo-0125 are said to be static, but I have experienced differences in performance with the same prompt and context.

I usually check the consistency of a prompt's performance by running it 4-5 times in a row. I ran this test on different days and the performance was very different: it went from 3-4 errors out of 42 executions to 9-10 errors out of 42 executions. The only difference is the day.
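To judge whether a gap like this is larger than run-to-run noise, the two error counts can be compared with an exact test. The following is a minimal sketch, assuming each execution is an independent pass/fail trial; the function name `fisher_exact_two_sided` is hypothetical and the implementation uses only the Python standard library.

```python
from math import comb


def fisher_exact_two_sided(err_a, n_a, err_b, n_b):
    """Two-sided Fisher exact p-value for two batches of pass/fail runs.

    err_a, err_b: number of errors in each batch.
    n_a, n_b: number of executions in each batch.
    """
    total_err = err_a + err_b
    denom = comb(n_a + n_b, total_err)

    def prob(k):
        # Hypergeometric probability that batch A holds k of all the errors,
        # assuming both batches share the same underlying error rate.
        return comb(n_a, k) * comb(n_b, total_err - k) / denom

    observed = prob(err_a)
    lowest = max(0, total_err - n_b)
    highest = min(n_a, total_err)
    # Add up every outcome that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p
        for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Example with the counts from this post: 4 errors vs 10 errors, 42 runs each.
p_value = fisher_exact_two_sided(4, 42, 10, 42)
print(f"p-value: {p_value:.3f}")
```

With 4 versus 10 errors out of 42, the p-value from this sketch comes out above the conventional 0.05 threshold, so two batches of this size may not be enough on their own to separate a real change from sampling variance. More executions per day would give a clearer answer.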

I have seen other posts talking about this issue, but they are about other models. So some questions arise:

  • Is it confirmed that the static versions aren’t really static?
  • Could the problem be that the date is actually inserted in the API call, somehow affecting the performance?
  • Does anyone know what else could be the issue?

Thanks for taking the time to read and answer.

Carlos

Hi there!

In practice, model performance is not exactly static. Some volatility in performance, including transient degradation, is not unusual and should not necessarily be interpreted as a sign of a more systemic issue. You can implement controls to test for performance degradation and then respond with mitigation actions, e.g. a temporary switch to an alternative model.
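One way such a control could look is a rolling error-rate check that picks the model name to use for the next call. This is only a sketch under stated assumptions: the class `ModelMonitor`, its parameters, the thresholds, and the fallback model name are all hypothetical, and it assumes you already have your own way of deciding whether a given response counts as an error.

```python
from collections import deque


class ModelMonitor:
    """Tracks recent pass/fail results and suggests which model to call next."""

    def __init__(self, primary, fallback, window=42,
                 max_error_rate=0.2, min_samples=10):
        self.primary = primary
        self.fallback = fallback
        # Only the most recent `window` results are kept.
        self.results = deque(maxlen=window)
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples

    def record(self, is_error):
        """Store the outcome of one execution (True means it failed)."""
        self.results.append(bool(is_error))

    def error_rate(self):
        """Share of failed executions in the current window."""
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)

    def current_model(self):
        """Return the fallback while the recent error rate is too high."""
        enough_data = len(self.results) >= self.min_samples
        degraded = enough_data and self.error_rate() > self.max_error_rate
        return self.fallback if degraded else self.primary


# Hypothetical usage: "my-fallback-model" is a placeholder, not a real model name.
monitor = ModelMonitor("gpt-3.5-turbo-0125", "my-fallback-model")
for failed in [False, False, True, False]:
    monitor.record(failed)
print(monitor.current_model(), monitor.error_rate())
```

Because the window is bounded, the monitor switches back to the primary model on its own once recent results recover. The threshold should sit above the error rate you consider normal for the prompt, otherwise ordinary variance will trigger the switch.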

There is, in my mind, no reason to believe that performance variations are linked to any specific day. In any case, only you control what is sent as part of the API call, so the date is not automatically inserted into it.

Just my two cents.

Hi!

Thanks for answering.

I do not expect the performance to be static (or deterministic), since the architecture of these models cannot fulfill that objective, but I do expect the model to be static.
If the model is exactly the same, with the same weights and everything, why should it suffer from these degradations? There is always going to be some variance in the responses, but if the model is static, as OpenAI claims, the variance should be maintained across time, no?

I might be missing some variable in my analysis that you (or others) could help me with.