GPT-4o Realtime outperforms newer gpt-realtime in voice quality despite lower "Performance" rating (Spanish)

I’ve been building a voice AI product using the Realtime API and noticed something counterintuitive.

According to OpenAI’s own specs, gpt-realtime has a higher performance rating (5/5) compared to GPT-4o Realtime (3/5). However, in my real-world testing, the older model consistently delivers better results.

Models tested:

  • gpt-4o-realtime-preview-2025-06-03 (older) ✅ Better
  • gpt-realtime-2025-08-28 (newer) ❌ Worse

What I’m seeing:

  • Voice quality: gpt-4o-realtime sounds significantly more natural
  • Speech structure: Responses are more coherent and conversational
  • Overall feel: gpt-realtime feels “dumber” and less fluid despite the higher performance rating

Important context: I’m using these models primarily in Spanish, so this might be language-specific.

Has anyone else experienced this? Is the “Performance” metric perhaps measuring something other than voice naturalness? Would appreciate any insights from the community or team.