I’ve been building a voice AI product using the Realtime API and noticed something counterintuitive.
According to OpenAI’s own specs, gpt-realtime has a higher performance rating (5/5) compared to GPT-4o Realtime (3/5). However, in my real-world testing, the older model consistently delivers better results.
Models tested:
- gpt-4o-realtime-preview-2025-06-03 (older): better
- gpt-realtime-2025-08-28 (newer): worse
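For context on the setup: I'm connecting over the Realtime API's WebSocket endpoint, and the only thing that changes between runs is the `model` query parameter. A minimal sketch of how I build the connection (the helper name is mine; endpoint and headers follow OpenAI's documented Realtime WebSocket interface):

```python
import os

REALTIME_URL = "wss://api.openai.com/v1/realtime"

# The two models under comparison; only this string differs between runs.
MODELS = {
    "older": "gpt-4o-realtime-preview-2025-06-03",
    "newer": "gpt-realtime-2025-08-28",
}

def connection_params(model: str) -> tuple[str, dict]:
    """Build the WebSocket URL and headers for a Realtime session."""
    url = f"{REALTIME_URL}?model={model}"
    headers = {
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
        "OpenAI-Beta": "realtime=v1",
    }
    return url, headers
```

Everything downstream of the connection (audio format, turn detection, prompts) is held constant, so the differences below should be attributable to the model itself.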
What I’m seeing:
- Voice quality: gpt-4o-realtime sounds significantly more natural
- Speech structure: Responses are more coherent and conversational
- Overall feel: gpt-realtime feels “dumber” and less fluid despite the higher performance rating
Important context: I’m using these models primarily in Spanish, so this might be language-specific.
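In case it helps anyone reproduce this: a minimal sketch of the `session.update` event I send right after connecting, pinning identical settings for both models so the comparison isolates the model (field values here are illustrative of my Spanish test setup, not a recommendation):

```python
import json

# Sent once over the WebSocket after the session opens; kept identical for
# both models so any difference in output comes from the model itself.
session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["audio", "text"],
        "voice": "alloy",  # same voice for both runs
        "instructions": (
            "Responde siempre en español, con un tono natural y conversacional."
        ),
    },
}

payload = json.dumps(session_update)
```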
Has anyone else experienced this? Is the “Performance” metric perhaps measuring something other than voice naturalness? Would appreciate any insights from the community or the team.