New Audio Model Snapshots in the Realtime-API

The new gpt-4o-mini-tts-2025-12-15 snapshot responds to style and tone instructions differently from the previous gpt-4o-mini-tts-2025-03-20 snapshot.

Goal
Control the style and tone of text-to-speech output, for example a whispering voice.

Challenge
With gpt-4o-mini-tts-2025-03-20, a simple prompt like:

You are always whispering

worked reliably in most cases. With the new snapshot, the same instruction is followed far less consistently, roughly three out of ten attempts. To benefit from the lower word error rate of the new snapshot, the prompting approach needs to change.

Solution
The team shared the Realtime Prompting Guide from cookbook.openai.com. The key takeaway is that the model needs to be guided similarly to realtime models when enforcing style and tone constraints. Here is an example prompt as baseline guidance, and this optimizer prompt can be used to remove ambiguity from the wording.
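As a rough illustration of what "guiding it like a realtime model" could look like, here is a minimal sketch that builds the instructions from labeled sections with short bullets instead of a single sentence. The section titles, the bullet wording, the `coral` voice, and the helper names `build_instructions` and `synthesize` are my own assumptions for illustration, not something taken from the guide, and I have not measured how much this improves compliance.

```python
from typing import Dict, List


def build_instructions(sections: Dict[str, List[str]]) -> str:
    """Render labeled sections with short bullets into one instructions string."""
    blocks = []
    for title, bullets in sections.items():
        lines = [f"# {title}"] + [f"- {bullet}" for bullet in bullets]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# Hypothetical section titles and wording; adjust to your own use case.
WHISPER_SECTIONS = {
    "Role": ["You are a narrator who speaks only in a whisper."],
    "Delivery": [
        "Whisper every word, from the first word to the last.",
        "Keep the volume low and the voice breathy.",
        "Never switch to a normal speaking voice.",
    ],
}


def synthesize(text: str, out_path: str,
               model: str = "gpt-4o-mini-tts-2025-12-15",
               voice: str = "coral") -> None:
    """Send the text plus structured instructions to the speech endpoint.

    Needs the `openai` package and an OPENAI_API_KEY in the environment.
    """
    from openai import OpenAI  # imported lazily so the helper above works offline

    client = OpenAI()
    with client.audio.speech.with_streaming_response.create(
        model=model,
        voice=voice,
        input=text,
        instructions=build_instructions(WHISPER_SECTIONS),
    ) as response:
        response.stream_to_file(out_path)
```

Calling `synthesize("Meet me by the old oak tree at midnight.", "whisper.mp3")` would write the audio to a file. Running the same input several times and listening to each result is probably the only way to tell whether a given wording actually sticks.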

My experience
I still struggle with this and get proper instruction following only about 50% of the time. For now I second @Dobo's approach of using the older snapshot when precise style and tone control is needed. Maybe we should create a topic in the prompting category to see how far others can take this model with their prompting skills.

@aprendendo.next: FYI: The default snapshot for this model has been updated to the newer version, in case others are wondering why their output suddenly changed.
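Since the default now points at the newer snapshot, one way to keep the old behavior is to pass the dated snapshot name explicitly. A minimal sketch, assuming the older snapshot stays available under its dated name; the helper names `pick_tts_model` and `speech_request` and the `coral` voice are made up for illustration:

```python
OLD_SNAPSHOT = "gpt-4o-mini-tts-2025-03-20"  # followed simple style prompts more reliably
NEW_SNAPSHOT = "gpt-4o-mini-tts-2025-12-15"  # lower word error rate


def pick_tts_model(needs_strict_style: bool) -> str:
    """Pin a dated snapshot instead of relying on whatever the default points to."""
    return OLD_SNAPSHOT if needs_strict_style else NEW_SNAPSHOT


def speech_request(text: str, instructions: str,
                   needs_strict_style: bool, voice: str = "coral") -> dict:
    """Build keyword arguments for the speech endpoint with an explicit model."""
    return {
        "model": pick_tts_model(needs_strict_style),
        "voice": voice,
        "input": text,
        "instructions": instructions,
    }
```

With the official Python SDK, the resulting dict can be passed as `client.audio.speech.create(**speech_request("Hello", "You are always whispering", needs_strict_style=True))`.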
