Yes, voice quality and accent quality definitely matter in demos, especially for client perception.
In our case, though, the bigger issue is not raw intelligence or even voice expressiveness. The older gpt-realtime-mini-2025-10-06 snapshot was already good enough for Romanian voice quality and much more reliable for our production constraints.
The serious regression for us is faithfulness to supplied business data.
For example, in one clinic test case, the newer gpt-realtime-mini told a patient that the clinic had a neurology department, even though no such department existed in the database/context it had access to. That kind of hallucination did not happen with the dated snapshot under the same type of flow. The older model stayed much more faithful to the information we provided.
Mispronunciations or less natural phrasing would be more forgivable: annoying, but manageable. Inventing nonexistent departments, services, or operational details in a healthcare/clinic setting is a completely different level of risk. That can directly damage client trust and create operational problems for the business using the agent.
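To make the failure mode concrete, here is a minimal sketch of the kind of check that would flag the neurology case. It is an illustration only, not production code: the function name, the vocabulary list, and the sample reply are all hypothetical, and it assumes the agent's reply is available as text.

```python
import re


def ungrounded_mentions(reply: str, supplied: set[str], vocabulary: set[str]) -> set[str]:
    """Return vocabulary terms the reply mentions that are absent from the supplied data.

    reply      -- text of the agent's reply (hypothetical input)
    supplied   -- department names actually present in the business data
    vocabulary -- department names we want to watch for (assumed, hand-maintained)
    """
    text = reply.lower()
    # Whole-word, case-insensitive match for each watched term.
    mentioned = {
        term.lower()
        for term in vocabulary
        if re.search(rf"\b{re.escape(term.lower())}\b", text)
    }
    # Anything mentioned but not supplied is a candidate confabulation.
    return mentioned - {name.lower() for name in supplied}


# Hypothetical data mirroring the clinic test case described above.
supplied_departments = {"Cardiology", "Dermatology"}
watched_terms = {"cardiology", "dermatology", "neurology", "pediatrics"}
sample_reply = "Yes, the clinic has a neurology department and a cardiology department."

flagged = ungrounded_mentions(sample_reply, supplied_departments, watched_terms)
print(flagged)
```

Plain word matching like this would miss inflected or paraphrased forms, which matters for Romanian, so treat it as a rough regression signal rather than a reliable detector.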
So for us the issue is not simply:
“the newer model sounds worse”
but rather:
“the newer model is less reliable as a production voice agent because it confabulates facts that were not in the business data.”
I also can’t really comment on whether the larger non-mini Realtime models solve this, because our unit economics are razor-thin by design. We are trying to offer some of the lowest possible AI voice pricing per minute in our local market, while competing with local companies that already offer very cheap per-minute voice solutions. Even if a larger gpt-realtime model performed better, switching from gpt-realtime-mini to the more expensive Realtime tier is not a viable migration path for our business model at scale.
That is why gpt-realtime-mini-2025-10-06 matters so much to us. It gave us the right balance of cost, Romanian performance, and faithfulness to supplied data. The current replacement does not appear behaviorally equivalent for our use case.