This may be a stupid question: Why have snapshots on these models? Why not just upgrade gpt-4o-mini-transcribe and gpt-4o-mini-tts and gpt-realtime-mini to newer versions without snapshot naming conventions?
So now you have:
gpt-4o-mini-transcribe-2025-12-15 and gpt-4o-mini-transcribe
gpt-4o-mini-tts-2025-12-15 and gpt-4o-mini-tts
gpt-realtime-mini-2025-12-15 and gpt-realtime-mini
We would love to roll out new versions of gpt-4o-mini-transcribe and gpt-4o-mini-tts to our customers, but not under snapshot naming conventions. Why? Because it feels so temporary.
I may be missing some context here, so my first question is whether this strategy of always upgrading to the latest snapshot, for example by using a generic slug like gpt-4o-mini-tts, is the same approach you would prefer for other model types as well.
Regardless of benchmark improvements, we typically evaluate new models against real production use cases before rolling them out to users and customers.
I’m most likely missing something. Feel free to point me in the right direction.
After a model has been rolled out, vetted, and become “recommended”, the convenience alias may be pointed to that newer model. Or it may never be, as with gpt-4o, which was never pointed to gpt-4o-2024-11-20 (newer than gpt-4o-2024-08-06).
If you want stability in your applications, use the versioned model when one is offered. Then test a new release like this one yourself, make any needed adjustments to prompting and parameters, and move to the latest version when you choose.
Do not let OpenAI decide when to switch the behavior of your application out from under you by employing the alias. Got it?
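A minimal sketch of what that pinning looks like in practice (Python; the constant and function names are my own, only the model IDs come from this thread):

```python
# Pin the exact snapshot rather than the moving alias, so behavior only
# changes when you deliberately bump this constant after your own testing.
PINNED_TTS_MODEL = "gpt-4o-mini-tts-2025-03-20"  # versioned snapshot: stable behavior
ALIAS_TTS_MODEL = "gpt-4o-mini-tts"              # moving alias: OpenAI may repoint it


def tts_model(pin: bool = True) -> str:
    """Return the model ID to pass to the speech endpoint; pinned by default."""
    return PINNED_TTS_MODEL if pin else ALIAS_TTS_MODEL
```

The point is simply that every call site reads the pinned constant, so upgrading is a one-line, deliberate change.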
I’m always happy to see new audio models, particularly TTS.
Apparently the new snapshot has more trouble following instructions than the previous one (such as “speak slower” or “[laughs]”), so it seems this will require more extensive testing to see what pros and cons it brings.
There is a huge disconnect between what OpenAI claimed and how this new smaller model is performing. It follows instructions or calls tools only about 50% of the time.
We have recently upgraded the tts / speech models in our app (Onsen - AI for Mental Health) to the new voice models.
We’ve received some complaints from users that the new speech model is not as emotive as the previous snapshot from March. I’ve done some testing, and indeed the new speech model seems to completely ignore the “instructions” parameter, which we use to provide a custom “voice personality” for our AI guides.
Can you comment if this is a bug, a regression or an intentional reduction of feature for the new speech model? It will be great if this is documented as I could not find any information in the official docs or the official blog post.
To me, this sounds like either a case where prompt tuning is needed because your users are suddenly experiencing a loss of emotion, or a situation where users have grown accustomed to the previous voice or model and simply do not like the new version.
I hope this helps!
Feel free to start a new topic in the Prompting category, mainly for learning purposes rather than to share any secrets.
Hi @vb, could you share the code or API request you used to generate these examples?
I am not able to replicate this on my end. I’ve recorded a short video where I contrast the older and new models using the OpenAI playground and it is very evident that the new model does not seem to follow the instructions at all.
I thought I might be using the API wrong, but the issue seems to affect the OpenAI playground too based on the video.
Hope you can take a look and let me know your thoughts.
The 2025-03 snapshot is pretty good at following instructions, but there were several complaints about quality and unstable results, with different outputs sounding like totally different voices. That was acceptable to me personally, but it seemed to bother a lot of people.
The 2025-12 snapshot seems to improve voice quality and stability, at the cost of losing a lot of steerability from instructions. The fact that the new model was not made the default for the alias gpt-4o-mini-tts also suggests OpenAI knows this and took a cautious step to prevent apps from breaking, which is appreciated.
So, each version has its own pros and cons. Considering audio models haven’t been the highlight of all the AI hype at the end of 2025, I still consider it a win that they actually released something.
Worst case, we lost nothing: the old model is still there, and the new one does help people who just want higher quality with default settings.
In due time I hope they will manage to put things together in the next release, so that we can have both quality and good instruction following.
We ended up rolling out a new version of the app that adds support for both text-to-speech models.
For now we use the March snapshot for “Expressive” voices and the December snapshot for “Standard” voices - and we give the choice to the user to decide. See below:
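A rough sketch of that dual-model setup (the dict and function names here are hypothetical; the tier-to-snapshot mapping is the one described above):

```python
# Map a user-facing voice tier to a TTS snapshot, mirroring the
# "Expressive"/"Standard" split: the March snapshot follows style
# instructions better, the December snapshot trades that for quality.
SNAPSHOT_BY_TIER = {
    "Expressive": "gpt-4o-mini-tts-2025-03-20",
    "Standard": "gpt-4o-mini-tts-2025-12-15",
}


def model_for_tier(tier: str) -> str:
    """Resolve the snapshot for a voice tier; fail loudly on unknown tiers."""
    if tier not in SNAPSHOT_BY_TIER:
        raise ValueError(f"unknown voice tier: {tier!r}")
    return SNAPSHOT_BY_TIER[tier]
```

Letting the user pick the tier keeps both behaviors available instead of forcing one trade-off on everyone.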
I’ve reached out to the team to see if they can share any tips or best practices for getting gpt-4o-mini-tts-2025-12-15 to follow instructions more reliably and consistently.
The new gpt-4o-mini-tts-2025-12-15 snapshot responds to guidance differently from the previous gpt-4o-mini-tts-2025-03-20 version.
Goal
Control the style and tone of text-to-speech output, for example a whispering voice.
Challenge
With gpt-4o-mini-tts-2025-03-20, a simple prompt like:
You are always whispering
worked reliably in most cases. With the new snapshot, the same instruction is followed far less consistently, closer to three out of ten attempts. To benefit from the improved, lower word error rate of the new snapshot, the prompting approach needs to change.
Solution
The team shared the Realtime Prompting Guide from cookbook.openai.com. The key takeaway is that the model needs to be guided similarly to realtime models when enforcing style and tone constraints. Here is an example prompt as baseline guidance, and this optimizer prompt can be used to remove ambiguity from the wording.
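For illustration, here is a hedged sketch of what “guiding like a realtime model” can look like: an explicit, sectioned style block with the constraint restated at the end, instead of a one-line instruction. The section names are my own, not an official schema from the guide:

```python
def style_prompt(voice: str, pacing: str, tone: str) -> str:
    """Build a sectioned instructions string, one constraint per line,
    ending with an explicit restatement of the style constraint."""
    return (
        f"Voice: {voice}\n"
        f"Pacing: {pacing}\n"
        f"Tone: {tone}\n"
        "Stay in this style for the entire output, even if the text suggests otherwise."
    )
```

So instead of “You are always whispering”, the instructions might read `style_prompt("soft whisper", "slow", "calm")`. Treat this as a starting point, not a guarantee of compliance.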
My experience
I struggle with this and get proper instruction following only around 50% of the time. For now I second @Dobo’s approach of using the older snapshot when precise style and tone control is needed. Maybe we should create a topic in the Prompting category where we can learn how far others can take this model with their prompting skills.
The new model ignores TTS instructions and sounds very plain as a result. I need it to be expressive like it used to in gpt-4o-mini-tts-2025-03-20. I will keep using the old model until the new one is fixed.
Just be careful. The latest version does not always mean the greatest. Test the latest (gpt-4o-mini-tts) before deploying.
It also appears that gpt-4o-mini-tts-2025-12-15 is being actively modified. The speech generated today on 2/10 is noticeably worse than output from 1/13, even though I used the exact same snapshot. The audio sounds much darker, almost as if an aggressive low-pass filter was applied, resulting in reduced clarity and degraded vocal tone.
For example, Shimmer and Nova now sound very similar, whereas they were clearly distinct in earlier versions.
I reverted to gpt-4o-mini-tts-2025-03-20.
gpt-4o-mini-tts-2025-12-15 is awful compared to the previous gpt-4o-mini-tts-2025-03-20.
It produces some of the most robotic and monotone TTS I have ever heard; gpt-4o-mini-tts-2025-03-20 was actually great.
All of the voices are completely changed and have lost most of their tone and emotion.
It just sounds awful and not natural at all, whereas the previous version did.
I’m talking about using it over the API, with instructions to set the tone and many other things
(a text model analyzes the message to set the instructions for Accent, Emotional range, Intonation, Impressions, Speed of speech, Tone, Style, and Whispering, and that’s sent to the TTS).
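That pipeline can be sketched roughly like this (a hypothetical helper; only the attribute names come from the description above):

```python
def build_instructions(attrs: dict[str, str]) -> str:
    """Flatten style attributes chosen by a text model into the single
    'instructions' string sent along with the TTS request."""
    return " ".join(f"{key}: {value}." for key, value in attrs.items())
```

For example, `build_instructions({"Tone": "warm", "Speed of speech": "slow"})` yields one compact instructions string covering both attributes.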
Maybe it’s indeed just ignoring instructions; I didn’t think of that… but the voice just sounds completely different either way, and it’s not better (it might sound clearer, but it has lost its color and “naturalness”).
I can’t get myself to use it; it will be a sad day if the old one is retired and this is not fixed or improved.