Since about a month ago, the text-to-speech API is no longer following the instructions parameter. No matter what instructions I enter, the speech is always the same. It happens both when accessing the API via the Node library ("openai": "^6.16.0") and via the openai.fm demo, which posts an HTTP request directly.
For example, the following two instruction strings generate the same speech:
instructions: 'Voice: Deep, hushed, and enigmatic, with a slow, deliberate cadence that draws the listener in. Phrasing: Sentences are short and rhythmic, building tension with pauses and carefully placed suspense. Punctuation: Dramatic pauses, ellipses, and abrupt stops enhance the feeling of unease and anticipation. Tone: Dark, ominous, and foreboding, evoking a sense of mystery and the unknown.'
instructions: 'Voice: Very happy.\r\n\r\nSpeed: Extremely fast.'
Tried with coral, alloy, and shimmer, with the input in Spanish or English (the instructions always in English). Even tried the mp3 and wav formats. Also tried with models gpt-4o-mini-tts-2025-12-15 and gpt-4o-mini-tts.
Even with the docs' example request, changing the instructions to something like "Speak in a dark, gloomy and slow tone" gives the same result:
curl -X POST https://api.openai.com/v1/audio/speech \
  -H "Authorization: Bearer xxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "alloy",
    "instructions": "Speak in a cheerful and positive tone"
  }' \
  -o speech.mp3
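For anyone hitting this from Node rather than curl, here is a minimal sketch of the same request. The buildSpeechRequest helper is mine, purely for illustration; the payload fields mirror the curl example, and actually sending it needs a real API key (Node 18+ for built-in fetch), so the network part is shown commented out.

```javascript
// Build the JSON body for POST https://api.openai.com/v1/audio/speech.
// buildSpeechRequest is an illustrative helper, not part of any SDK.
function buildSpeechRequest(instructions) {
  return {
    model: "gpt-4o-mini-tts",
    input: "The quick brown fox jumped over the lazy dog.",
    voice: "alloy",
    instructions,
  };
}

// Sending it requires a real key, so it is commented out here:
// const res = await fetch("https://api.openai.com/v1/audio/speech", {
//   method: "POST",
//   headers: {
//     Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
//     "Content-Type": "application/json",
//   },
//   body: JSON.stringify(
//     buildSpeechRequest("Speak in a cheerful and positive tone")
//   ),
// });
// require("node:fs").writeFileSync(
//   "speech.mp3", Buffer.from(await res.arrayBuffer())
// );
```

Whatever string you pass as instructions ends up verbatim in the request body, so the monotone output is not a client-side serialization issue.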
YES! I noticed the same. I do a live stream where I use several distinct AI characters on screen, and the TTS instructions are very important!
I used to get very expressive delivery, but now everything sounds monotone, like the AI is bored. Last Saturday this was working as expected, but last night (Wednesday) the voice was different.
It seems to be an actual update to the TTS model. But I found a fix, I think.
It seems that gpt-4o-mini-tts and gpt-4o-mini-tts-2025-12-15 are broken, but gpt-4o-mini-tts-2025-03-20 still works. I need to test it more, but that must be what we were using before things got switched up.
Instructions: Voice: Deep, hushed, and enigmatic, with a slow, deliberate cadence that draws the listener in. Phrasing: Sentences are short and rhythmic, building tension with pauses and carefully placed suspense. Punctuation: Dramatic pauses, ellipses, and abrupt stops enhance the feeling of unease and anticipation. Tone: Dark, ominous, and foreboding, evoking a sense of mystery and the unknown.
You are right. The instructions were not properly followed, resulting in a rather bland mood.
EDIT: Just tested with a completely different voice instruction and the result was not much different from the first.
This is very disappointing. I don’t use snapshot models.
I have a daily automation that sends me an audio message using the gpt-4o-mini-tts model along with a fixed set of instructions for emotion and tone.
The audio I received on Tuesday (1/13/25) sounded great and matched the expected tone. However, the one from Wednesday (1/14/25) was awful!! Completely monotone, no expression at all. Just flat and boring.
I re-ran the automation manually, in case it was a one-off execution issue, but that didn’t help! Then on Thursday (1/16/25), I got the same strange result again.
I tried the recommendation mentioned in this thread and switched to the gpt-4o-mini-tts-2025-03-20 model. After a few manual runs, the audio now seems to have the correct tone and emotion based on the instructions, and the results are consistent with what I was getting before.
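In case it helps anyone else automating this, the switch is just a matter of which model string you send; only the two model names below come from this thread, and the fallback helper is my own sketch.

```javascript
// The bare name is a moving alias that follows the latest snapshot;
// the dated name pins a fixed snapshot (the one that, per this thread,
// still follows style instructions).
const ALIAS_MODEL = "gpt-4o-mini-tts";             // moving alias
const PINNED_MODEL = "gpt-4o-mini-tts-2025-03-20"; // fixed, dated snapshot

// Illustrative helper: prefer the pinned snapshot, fall back to the
// alias if no pin is provided.
function resolveModel(pin = PINNED_MODEL) {
  return pin || ALIAS_MODEL;
}
```

Pinning the dated snapshot means your output stays stable even when the alias is repointed at a new snapshot, which appears to be exactly what happened here.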
The default snapshot for this model has been updated very recently and the newer model snapshot behaves decidedly differently from the old one.
I have created a little write-up here but agree that the older version is a lot more consistent at following style and tone instructions.
This is just to explain the root cause for some of the challenges reported in this topic.
Hope this helps!
I just tried 2025-03-20, and it seems to follow my instructions better. However, the audio quality with the latest model is much clearer, with fewer audio artifacts. Maybe it uses a better audio tokenizer or vocoder to reconstruct the speech.
Haha, that's funny. I wasted time trying to figure out what happened and thought one of my AI engineering systems had changed something. Good to know; now I can sleep better at night.
Wonder if they are going to fix/retrain or if they know what the cause was.
gpt-4o-mini-tts-2025-12-15 is so awful compared to the previous gpt-4o-mini-tts-2025-03-20.
It's some of the most robotic and monotone TTS I have ever heard; gpt-4o-mini-tts-2025-03-20 was actually great.
All of the voices are completely changed and have lost most of their tone and emotion.
It just sounds awful, not natural at all, whereas the previous version did.
I'm talking about using it over the API, with instructions to set the tone and many other things
(a text model analyzes the message to set the instructions for Accent, Emotional range, Intonation, Impressions, Speed of speech, Tone, Style, and Whispering, and that's sent to the TTS).
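To make that two-stage setup concrete, here is a rough sketch of how the structured analysis can be flattened into the free-form instructions string the TTS endpoint expects. The field names and the formatInstructions helper are illustrative only, not an official schema.

```javascript
// Collapse a structured style analysis (produced by a text model in the
// setup described above) into one free-form instructions string for TTS.
function formatInstructions(style) {
  return Object.entries(style)
    .map(([key, value]) => `${key}: ${value}`)
    .join("\n");
}

// Example analysis for one on-screen character:
const style = {
  Accent: "neutral",
  "Emotional range": "wide and expressive",
  Intonation: "rising on questions",
  "Speed of speech": "slightly fast",
  Tone: "playful",
};

const instructions = formatInstructions(style);
// `instructions` is then sent as the instructions field of the TTS request.
```

With the new snapshot, the same generated instructions string produces a flat delivery, so the regression is on the model side, not in how the pipeline assembles the instructions.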