I’m building an AI voice training tool for hotel reception staff. GPT-realtime roleplays as virtual hotel guests with distinct accents and personalities, so trainees can practice realistic scenarios. With gpt-realtime v1, I had reached a point where the voices conveyed strong emotion and convincing accents. It felt genuinely realistic.
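For context, the setup is straightforward. Here’s a minimal sketch of how a guest persona is configured over the Realtime API; the persona text and voice choice are illustrative, not our production prompts, and the exact session fields may differ between the beta and GA shapes of the API, so check the current reference:

```python
import asyncio
import json
import os

import websockets  # pip install websockets


async def start_guest_session() -> None:
    # v1 model; see the note on pinning a dated snapshot at the end.
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
    async with websockets.connect(
        url,
        additional_headers={  # "extra_headers" on websockets < 13
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
    ) as ws:
        # Accent, mood, and scenario all travel in the session
        # instructions; this is the kind of persona v1 voiced with a
        # convincing accent and real emotional range.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "type": "realtime",
                "output_modalities": ["audio"],
                "audio": {"output": {"voice": "marin"}},  # illustrative voice
                "instructions": (
                    "Roleplay as a tired business traveller from Glasgow "
                    "checking in after a delayed flight. Keep a natural "
                    "Glaswegian accent and let mild frustration show."
                ),
            },
        }))
        # ... from here we stream trainee microphone audio up and play
        # the guest's audio responses back ...


if __name__ == "__main__":
    asyncio.run(start_guest_session())
```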
Tested gpt-realtime-1.5 and it’s a clear downgrade:
- Accents are almost entirely gone. In v1, the model could deliver convincing regional accents that made conversations feel authentic; in 1.5 they are essentially stripped out.
- The voice sounds noticeably more robotic. Whatever made v1 feel natural and human-like is gone; the 1.5 output feels flat and synthetic in comparison.
- Emotion is barely there. v1 had genuine warmth, inflection, and emotional range in its delivery. 1.5 sounds like it’s reading a script with no feeling behind it.
I’ve seen the benchmarks OpenAI published: +5% audio reasoning, +10% alphanumeric transcription, +7% instruction following. None of that matters for our use case if the voice itself sounds worse. Lately OpenAI seems to be optimizing for benchmarks above all else, even though benchmark gains on their own don’t drive adoption or real-world use.
To be clear, the instruction-following and tool-calling improvements are great for enterprise agent workflows, but they shouldn’t come at the cost of the qualities that made realtime voice compelling in the first place.
For now, we are staying on gpt-realtime (v1) and will continue to monitor updates.
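Concretely, we pin the dated snapshot rather than the moving "gpt-realtime" alias, so any model change is an explicit decision on our side. A minimal sketch using the openai Python SDK; the snapshot name below is assumed from the GA release date, so verify it against the models list for your account:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Dated v1 snapshot (assumed from the GA release date; confirm against
# the models list for your account) instead of the moving alias.
PINNED_MODEL = "gpt-realtime-2025-08-28"

# Fail fast at startup if the pinned snapshot is no longer served, so an
# upgrade is always a deliberate change rather than a silent swap.
available = {model.id for model in client.models.list()}
if PINNED_MODEL not in available:
    raise RuntimeError(f"{PINNED_MODEL} is gone; review before upgrading")
```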
Is anyone else seeing the same thing? I’d like to know whether this is being tracked internally, and whether there are plans to bring v1’s expressiveness back in 1.5.