Realtime API - Parameters affecting model comprehension

Hmm, this is most likely caused by session configuration, not the voice itself.

  • The voice setting (e.g. verse, alloy) only affects output audio, not how the model understands you. The model listens to raw audio directly (not TTS), so the choice of voice shouldn’t impact comprehension.

  • What does affect understanding is the input audio (quality and format), turn detection, and session-level parameters like temperature and instructions. A quick way to rule out the input side is shown in the sketch below.
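
To separate input problems from instruction problems, enable input transcription in your session and log what the model actually heard. A minimal sketch of an event handler, assuming you already have a Realtime WebSocket connection delivering events as JSON strings:

```python
import json

def handle_event(raw: str) -> None:
    """Log the model's transcription of your input audio.

    Requires "input_audio_transcription" to be enabled in the session
    config (see the session.update sketch further down).
    """
    event = json.loads(raw)
    if event.get("type") == "conversation.item.input_audio_transcription.completed":
        # If this transcript is garbled, the problem is the input audio
        # (format, sample rate, mic quality), not the model or the voice.
        print("Model heard:", event.get("transcript"))
```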

Some tuning points that made a noticeable difference for me:

"instructions": "Be conversational and natural. Keep answers clear and brief unless more detail is requested."
//my actual instructions block is like 20+ lines designed towards my use-case 
"temperature": 0.6 // play around with this?

(Lower temperature = more deterministic replies. Default is 0.8, and iirc the Realtime API only accepts values between 0.6 and 1.2.)
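
For reference, here's roughly where those fields live in a `session.update` event. A minimal sketch; the `turn_detection` numbers are illustrative starting points for my setup, not recommendations:

```python
import json

# Sketch of a session.update payload; tune the values for your own setup.
session_update = {
    "type": "session.update",
    "session": {
        "instructions": "Be conversational and natural. "
                        "Keep answers clear and brief unless more detail is requested.",
        "temperature": 0.6,  # default 0.8; lower = more deterministic
        "input_audio_format": "pcm16",  # must match the audio you actually stream
        "input_audio_transcription": {"model": "whisper-1"},  # enables the transcript events above
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.5,            # raise in noisy environments
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500,  # raise if it cuts you off mid-sentence
        },
    },
}

# Send over your existing Realtime WebSocket connection, e.g.:
# await ws.send(json.dumps(session_update))
```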

I was experimenting with gpt-4o-mini-realtime-preview-2024-12-17 and honestly found it pretty solid. That said, it's tough to match the experience you get in the ChatGPT app :sweat_smile:. My guess is they've tuned a lot under the hood that we don't have direct access to, so if it feels different, you're not imagining it.

Still, with the right config tweaks, you can get surprisingly close.