Realtime API - Parameters affecting the model's comprehension

Hi,

I am using the Realtime API to build a voice bot with WebRTC. I have the subjective impression that the model's comprehension varies depending on which voice I configure: ‘verse’, for example, seems to understand me worse than ‘alloy’. Am I wrong about this, or has anyone else noticed it?

I have also noticed that when I use the ChatGPT app, the conversation is much more natural and the model seems to understand what I am saying better. Are there parameters I can set to improve performance in my application?

I currently use the model “gpt-4o-realtime-preview-2024-12-17”.

Thanks in advance for any inputs!

Hmm, this is most likely caused by session configuration, not the voice itself.

  • The voice setting (e.g. verse, alloy) only affects the output audio, not how the model understands you. The model processes your input audio directly and only applies the voice when synthesizing its reply, so the choice of voice shouldn’t impact comprehension.

  • What does affect comprehension is the quality of the input audio, the turn detection settings, and session-level parameters like temperature and instructions (see the sketches below).
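On the input-audio point: in a WebRTC setup it's worth checking what the browser actually captures. Here's a minimal sketch of how I'd request a cleaner microphone signal before attaching it to the peer connection (the helper name is made up, and I'm assuming you already have an RTCPeerConnection; browsers may ignore constraints they don't support):

```typescript
// Minimal sketch: capture mic audio with browser-side cleanup enabled,
// then attach it to the RTCPeerConnection used for the Realtime session.
async function attachMicrophone(pc: RTCPeerConnection): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true, // keeps the model from hearing its own replies
      noiseSuppression: true,
      autoGainControl: true,
    },
  });
  pc.addTrack(stream.getAudioTracks()[0], stream);
}
```

Echo cancellation in particular matters for a voice bot: without it, the model's own speech can bleed back into the input and confuse turn detection.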

Some tuning points that made a noticeable difference for me:

"instructions": "Be conversational and natural. Keep answers clear and brief unless more detail is requested."
//my actual instructions block is like 20+ lines designed towards my use-case 
"temperature": 0.6 // play around with this?

(Lower temperature = more deterministic replies. Default is 0.8.)
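For context, here's roughly how those fields fit into a session.update event sent over the events data channel (created as "oai-events" in the WebRTC examples). The helper name is mine, and the turn_detection numbers are starting points to tune, not recommended values:

```typescript
// Minimal sketch: send the session config once the data channel is open.
function configureSession(dc: RTCDataChannel): void {
  dc.send(
    JSON.stringify({
      type: "session.update",
      session: {
        instructions:
          "Be conversational and natural. Keep answers clear and brief unless more detail is requested.",
        temperature: 0.6,
        turn_detection: {
          type: "server_vad",
          threshold: 0.5,           // raise if background noise triggers turns
          prefix_padding_ms: 300,
          silence_duration_ms: 500, // lower for snappier turns, raise if it cuts you off
        },
        // Optional, but handy for debugging what the model actually heard.
        input_audio_transcription: { model: "whisper-1" },
      },
    })
  );
}
```

The input transcription part isn't required, but the transcription events it produces make it much easier to tell whether a "comprehension" problem is really the model mis-hearing you.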

I was experimenting with gpt-4o-mini-realtime-preview-2024-12-17 and honestly found it pretty solid. That said, it’s tough to match the experience you get in the ChatGPT app :sweat_smile: I’m guessing they’ve tuned a lot under the hood that we don’t have direct access to. So if it feels different, you’re not imagining it.

Still, with the right config tweaks, you can get surprisingly close.