I had a prompt that was working very well for more than a month, but today the responses are no longer stable. Mainly I’m referring to the function tools I defined: the LLM is not using them correctly.
OpenAI continues to call model names like gpt-4.1-mini-2025-04-14 “snapshots”, denying that their performance is altered…
So here is a specific example, one of many over and over, of applications breaking because OpenAI changes the models in production.
If the performance seems like a random lottery of whether functions are employed correctly, you can reduce the top_p parameter to below 0.5, which should give you function calling that is based on the best prediction rather than a lottery.
If the AI generation is already inside a function call, and the wrong function is utilized or parameter values are not being filled in properly, about all you can do is overwhelm the inattentive AI with unmistakable description fields for each property, and use enums in the schema for any values that have a particular set of allowed options.
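A minimal sketch of what I mean, assuming the standard OpenAI Python SDK and a made-up get_weather tool; the relevant parts are the lowered top_p, the verbose description fields, and the enum constraint:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical tool: the verbose descriptions and the enum are what keep
# the model from improvising parameter values.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": (
                "Look up current weather. Call this whenever the user asks "
                "about weather conditions in a named city."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name exactly as the user wrote it, e.g. 'Paris'.",
                    },
                    "units": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature units; only these two values are allowed.",
                    },
                },
                "required": ["city", "units"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    tools=tools,
    top_p=0.3,  # below 0.5, so the tool call comes from the top of the distribution
)

print(response.choices[0].message.tool_calls)
```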
Are you using o3? They made changes to it and there is suspicion that it performs worse.
As general guidance, I would make sure the model is given clear information on what to do, when to do it, and how its tools should be used. The models are known to avoid calling functions if there are any issues or contradictions in the instructions.
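For illustration, a hedged sketch with a hypothetical get_order_status tool; the system message spells out exactly when the tool must and must not be called, so it cannot contradict the tool’s own description:

```python
# Hypothetical instructions for a single order-lookup tool: state when to call it,
# when not to, and leave no room for contradictions.
system_prompt = (
    "You are a support assistant for an online shop.\n"
    "- When the user asks about an order's status AND provides an order ID, "
    "call get_order_status with that ID.\n"
    "- If no order ID is given, ask for it first; never call the tool with a guessed ID.\n"
    "- For any other question, answer directly without calling tools."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Where is my order #A1234?"},
]
```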
The quote is indeed confusing. If model behavior isn’t changed, then why make any snapshots at all?
I think the suggestion that the underlying model is the same could be taken to mean that they are using LoRA fine-tuning to make their snapshots. It would also explain why new snapshots sometimes feel more overfitted and dumber than the ones before them. That’s a side effect of SFT.
I feel like reducing top_p and temperature tends to induce lazier responses. I think I could justify doing this in something simpler like a classifier, though.
No, I’m using gpt-4.1-mini.
The model behavior for a given snapshot doesn’t change, but it can change between snapshots. That is why we say that if you want to make sure behavior stays the same, you should use specific snapshot names instead of aliases, where the underlying snapshot can change.
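As an illustration (a minimal sketch with the Python SDK), pinning the dated snapshot rather than the alias looks like this:

```python
from openai import OpenAI

client = OpenAI()

# Alias: "gpt-4.1-mini" may be repointed to a newer snapshot over time.
# Pinned: "gpt-4.1-mini-2025-04-14" keeps resolving to the same snapshot.
response = client.chat.completions.create(
    model="gpt-4.1-mini-2025-04-14",
    messages=[{"role": "user", "content": "Hello"}],
)
```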
We don’t have a new snapshot for 4.1-mini, so if the behavior changed, it’s an unexpected issue. Could you please share example prompts you were using and example outputs illustrating the change in behavior?
Thank you!
FWIW, I find 4.1 is particularly intolerant to quantisation with respect to function discrimination, so consider using the full 4.1 model instead, or go back to 4o-mini.