This may be a stupid question: Why have snapshots on these models? Why not just upgrade gpt-4o-mini-transcribe and gpt-4o-mini-tts and gpt-realtime-mini to newer versions without snapshot naming conventions?
So now you have:
gpt-4o-mini-transcribe-2025-12-15 and gpt-4o-mini-transcribe
gpt-4o-mini-tts-2025-12-15 and gpt-4o-mini-tts
gpt-realtime-mini-2025-12-15 and gpt-realtime-mini
We would love to roll out new versions of gpt-4o-mini-transcribe and gpt-4o-mini-tts to our customers, but not under snapshot naming conventions. Why? Because it feels so temporary.
I may be missing some context here, so my first question is whether this strategy of always upgrading to the latest snapshot, for example by using a generic slug like gpt-4o-mini-tts, is the same approach you would prefer for other model types as well.
Regardless of benchmark improvements, we typically evaluate new models against real production use cases before rolling them out to users and customers.
I’m most likely missing something. Feel free to point me in the right direction.
After a model has been rolled out, vetted, and become “recommended”, the convenience alias may be pointed to that newer model. Or it may never be, as with gpt-4o, which was never pointed to gpt-4o-2024-11-20 (newer than gpt-4o-2024-08-06).
If you want stability in your applications, use the versioned model when one is offered. Then test a new release like this one yourself, make any needed adjustments to prompting and parameters, and move to the latest version when you choose.
Do not let OpenAI decide when to switch the behavior of your application out from under you by employing the alias. Got it?
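A minimal sketch of what that pinning looks like in practice (Python; the constant and function names are my own, only the model IDs come from this thread):

```python
# Pin the exact snapshot rather than the moving alias, so behavior only
# changes when you deliberately bump this constant after your own testing.
PINNED_TTS_MODEL = "gpt-4o-mini-tts-2025-03-20"  # versioned snapshot: stable behavior
ALIAS_TTS_MODEL = "gpt-4o-mini-tts"              # moving alias: OpenAI may repoint it


def tts_model(pin: bool = True) -> str:
    """Return the model ID to pass to the speech endpoint; pinned by default."""
    return PINNED_TTS_MODEL if pin else ALIAS_TTS_MODEL
```

The point is simply that every call site reads the pinned constant, so upgrading is a one-line, deliberate change.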
I’m always happy to see new audio models, particularly TTS.
Apparently the new snapshot has more trouble following instructions than the previous one (such as “speak slower” or “[laughs]”), so it seems this will require more extensive testing to see what pros and cons it brings.
There is a huge disconnect between what OpenAI claimed and how this new smaller model is performing. It follows instructions or calls tools only about 50% of the time.
We have recently upgraded the tts / speech models in our app (Onsen - AI for Mental Health) to the new voice models.
We’ve received some complaints from users that the new speech model is not as emotive as the previous snapshot from March. I’ve done some testing, and indeed the new speech model seems to completely ignore the “instructions” parameter, which we use to provide a custom “voice personality” for our AI guides.
Can you comment if this is a bug, a regression or an intentional reduction of feature for the new speech model? It will be great if this is documented as I could not find any information in the official docs or the official blog post.
To me, this sounds like either a case where prompt tuning is needed because your users are suddenly experiencing a loss of emotion, or a situation where users have grown accustomed to the previous voice or model and simply do not like the new version.
I hope this helps!
Feel free to start a new topic in the Prompting category, mainly for learning purposes rather than to share any secrets.
Hi @vb, could you share the code or API request you used to generate these examples?
I am not able to replicate this on my end. I’ve recorded a short video where I contrast the older and new models using the OpenAI playground and it is very evident that the new model does not seem to follow the instructions at all.
I thought I might be using the API wrong, but the issue seems to affect the OpenAI playground too based on the video.
Hope you can take a look and let me know your thoughts.
The 2025-03 snapshot is pretty good at following instructions, but there were several complaints about quality and unstable results, with different outputs sounding like totally different voices. That was acceptable to me personally, but it seemed to bother a lot of people.
The 2025-12 snapshot seems to improve voice quality and stability, at the cost of losing a lot of steerability from instructions. The fact that the new model was not made the default for the alias gpt-4o-mini-tts also suggests OpenAI knows this and took a cautious step to prevent apps from breaking, which is appreciated.
So, each version has its own pros and cons. Considering audio models haven’t been the highlight of all the AI hype at the end of 2025, I still consider it a win that they actually released something.
Worst case, we lost nothing: the old model is still there, and the new one does help people who just want higher quality with default settings.
In due time I hope they will manage to put things together in the next release, so that we can have both quality and good instruction following.
We ended up rolling out a new version of the app that adds support for both text-to-speech models.
For now we use the March snapshot for “Expressive” voices and the December snapshot for “Standard” voices - and we give the choice to the user to decide. See below:
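A rough sketch of that dual-model setup (the dict and function names here are hypothetical; the tier-to-snapshot mapping is the one described above):

```python
# Map a user-facing voice tier to a TTS snapshot, mirroring the
# "Expressive"/"Standard" split: the March snapshot follows style
# instructions better, the December snapshot trades that for quality.
SNAPSHOT_BY_TIER = {
    "Expressive": "gpt-4o-mini-tts-2025-03-20",
    "Standard": "gpt-4o-mini-tts-2025-12-15",
}


def model_for_tier(tier: str) -> str:
    """Resolve the snapshot for a voice tier; fail loudly on unknown tiers."""
    if tier not in SNAPSHOT_BY_TIER:
        raise ValueError(f"unknown voice tier: {tier!r}")
    return SNAPSHOT_BY_TIER[tier]
```

Letting the user pick the tier keeps both behaviors available instead of forcing one trade-off on everyone.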
I’ve reached out to the team to see if they can share any tips or best practices for getting gpt-4o-mini-tts-2025-12-15 to follow instructions more reliably and consistently.
The new gpt-4o-mini-tts-2025-12-15 snapshot responds to guidance differently from the previous gpt-4o-mini-tts-2025-03-20 version.
Goal
Control the style and tone of text-to-speech output, for example a whispering voice.
Challenge
With gpt-4o-mini-tts-2025-03-20, a simple prompt like:
You are always whispering
worked reliably in most cases. With the new snapshot, the same instruction is followed far less consistently, closer to three out of ten attempts. To benefit from the improved, lower word error rate of the new snapshot, the prompting approach needs to change.
Solution
The team shared the Realtime Prompting Guide from cookbook.openai.com. The key takeaway is that the model needs to be guided similarly to realtime models when enforcing style and tone constraints. Here is an example prompt as baseline guidance, and this optimizer prompt can be used to remove ambiguity from the wording.
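For illustration, here is a hedged sketch of what “guiding like a realtime model” can look like: an explicit, sectioned style block with the constraint restated at the end, instead of a one-line instruction. The section names are my own, not an official schema from the guide:

```python
def style_prompt(voice: str, pacing: str, tone: str) -> str:
    """Build a sectioned instructions string, one constraint per line,
    ending with an explicit restatement of the style constraint."""
    return (
        f"Voice: {voice}\n"
        f"Pacing: {pacing}\n"
        f"Tone: {tone}\n"
        "Stay in this style for the entire output, even if the text suggests otherwise."
    )
```

So instead of “You are always whispering”, the instructions might read `style_prompt("soft whisper", "slow", "calm")`. Treat this as a starting point, not a guarantee of compliance.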
My experience
I struggle with this and get proper instruction following only around 50% of the time. For now I second @Dobo’s approach of using the older snapshot when precise style and tone control is needed. Maybe we should create a topic in the Prompting category where we can learn how far others can take this model with their prompting skills.
The new model ignores TTS instructions and sounds very plain as a result. I need it to be expressive like it used to in gpt-4o-mini-tts-2025-03-20. I will keep using the old model until the new one is fixed.
Just be careful. The latest version does not always mean the greatest. Test the latest (gpt-4o-mini-tts) before deploying.
It also appears that gpt-4o-mini-tts-2025-12-15 is being actively modified. The speech generated today on 2/10 is noticeably worse than output from 1/13, even though I used the exact same snapshot. The audio sounds much darker, almost as if an aggressive low-pass filter was applied, resulting in reduced clarity and degraded vocal tone.
For example, Shimmer and Nova now sound very similar, whereas they were clearly distinct in earlier versions.
I reverted to gpt-4o-mini-tts-2025-03-20.
gpt-4o-mini-tts-2025-12-15 is awful compared to the previous gpt-4o-mini-tts-2025-03-20.
It produces some of the most robotic and monotone TTS I have ever heard; gpt-4o-mini-tts-2025-03-20 was actually great.
All of the voices are completely changed and have lost most of their tone and emotion.
It just sounds awful and not natural at all, whereas the previous version did.
I’m talking about using it over the API, with instructions to set the tone and many other things
(a text model analyzes the message to set the instructions for Accent, Emotional range, Intonation, Impressions, Speed of speech, Tone, Style, and Whispering, and that’s sent to the TTS).
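That pipeline can be sketched roughly like this (a hypothetical helper; only the attribute names come from the description above):

```python
def build_instructions(attrs: dict[str, str]) -> str:
    """Flatten style attributes chosen by a text model into the single
    'instructions' string sent along with the TTS request."""
    return " ".join(f"{key}: {value}." for key, value in attrs.items())
```

For example, `build_instructions({"Tone": "warm", "Speed of speech": "slow"})` yields one compact instructions string covering both attributes.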
Maybe it’s indeed just ignoring instructions; I didn’t think of that… but the voice just sounds completely different either way, and it’s not better (it might sound clearer, but it has lost its color and “naturalness”).
I can’t get myself to use it; it will be a sad day if the old one is retired and this is not fixed or improved.