GPT-4o Realtime outperforms newer gpt-realtime in voice quality despite lower "Performance" rating (Spanish)

I’ve been building a voice AI product using the Realtime API and noticed something counterintuitive.

According to OpenAI’s own specs, gpt-realtime has a higher performance rating (5/5) compared to GPT-4o Realtime (3/5). However, in my real-world testing, the older model consistently delivers better results.

Models tested:

  • gpt-4o-realtime-preview-2025-06-03 (older) ✅ Better
  • gpt-realtime-2025-08-28 (newer) ❌ Worse

What I’m seeing:

  • Voice quality: gpt-4o-realtime sounds significantly more natural
  • Speech structure: Responses are more coherent and conversational
  • Overall feel: gpt-realtime feels “dumber” and less fluid despite the higher performance rating

Important context: I’m using these models primarily in Spanish, so this might be language-specific.

Has anyone else experienced this? Is the “Performance” metric perhaps measuring something other than voice naturalness? Would appreciate any insights from the community or team.