Realtime API interrupts too aggressively on filler words (hmm, ok, etc.)

Hi everyone,

I’m currently building a real-time voice AI agent using the OpenAI Realtime API (gpt-realtime), and I’m facing a major issue with interruptions.

In natural conversations, users often say filler words like “hmm”, “uh-huh”, “ok”, or “right” just to acknowledge they are listening. However, the Realtime API seems to treat these as actual interruptions.

As a result:

  • The assistant stops speaking immediately

  • It assumes the user wants to take over the conversation

  • This breaks the flow and creates a very unnatural experience

In many cases, the user is not trying to interrupt at all; they are just passively acknowledging.

I’ve already experimented with multiple parameters such as:

  • vad_threshold

  • prefix_padding

  • silence_duration

But none of these seem to reliably solve the issue.
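For reference, here is a rough sketch of the kind of `session.update` payload I have been tuning. The field names (`threshold`, `prefix_padding_ms`, `silence_duration_ms` under a `server_vad` turn detector) are my understanding of how the three parameters above map onto the Realtime API, and the nesting under `audio.input` is an assumption based on the newer session shape, so please check both against the current docs. The helper name `build_session_update` is just for illustration.

```python
import json


def build_session_update(threshold: float, prefix_padding_ms: int, silence_duration_ms: int) -> dict:
    """Build a session.update event that tunes server-side VAD.

    Assumption: turn detection lives under session.audio.input; older beta
    sessions put "turn_detection" directly under "session" instead.
    """
    return {
        "type": "session.update",
        "session": {
            "type": "realtime",
            "audio": {
                "input": {
                    "turn_detection": {
                        "type": "server_vad",
                        # Higher threshold = louder audio needed to count as speech.
                        "threshold": threshold,
                        # Audio kept from before speech was detected.
                        "prefix_padding_ms": prefix_padding_ms,
                        # Silence required before the turn is considered finished.
                        "silence_duration_ms": silence_duration_ms,
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    # Example values only; these are not recommended settings.
    print(json.dumps(build_session_update(0.7, 300, 800), indent=2))
```

Raising `threshold` filters out quiet noise, but a clearly spoken "ok" is still speech, which is presumably why tuning these values alone has not helped.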

Another related problem is that when interruptions happen, the transcript/context handling is not always consistent. The assistant sometimes behaves as if the user heard the full message, even though it was cut off midway, which leads to confusing or disjointed responses.
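As far as I understand, the client is expected to tell the server how much audio was actually played by sending a `conversation.item.truncate` event. Below is a minimal sketch of how I would build that event, assuming 24 kHz mono PCM16 output and that the client tracks how many bytes it has played. The helper names and the example item ID are hypothetical.

```python
SAMPLE_RATE_HZ = 24_000   # assumption: 24 kHz output audio
BYTES_PER_SAMPLE = 2      # assumption: 16-bit mono PCM


def played_ms(bytes_played: int) -> int:
    """Convert the number of PCM bytes actually played into milliseconds."""
    samples = bytes_played // BYTES_PER_SAMPLE
    return samples * 1000 // SAMPLE_RATE_HZ


def build_truncate_event(item_id: str, bytes_played: int, content_index: int = 0) -> dict:
    """Build a conversation.item.truncate event for an interrupted assistant item."""
    return {
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": content_index,
        # Everything after this point is treated as never heard by the user.
        "audio_end_ms": played_ms(bytes_played),
    }


if __name__ == "__main__":
    # "item_123" is a placeholder ID, not a real item.
    print(build_truncate_event("item_123", bytes_played=48_000))
```

If the truncate event is missing or carries the wrong offset, the server-side context would still contain the full assistant message, which would match the behavior I am seeing.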

What I’m looking for:

  • A way to distinguish between acknowledgment sounds and real interruptions

  • Better control over interruption sensitivity

  • Best practices for tuning turn detection in real-world conversations
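To make the first point concrete, this is roughly the kind of client-side check I have in mind. It is only a sketch: it assumes automatic interruption is turned off (I believe via `interrupt_response: false` in the turn detection settings) and that a transcript of the user's overlapping speech is available, so the client can decide whether to cancel the response itself. The word list, the word limit, and the function names are all made up for illustration.

```python
import re

# Hypothetical list of acknowledgment ("backchannel") words.
BACKCHANNELS = {
    "hmm", "hm", "mm", "mhm", "uh-huh", "uh", "huh",
    "ok", "okay", "right", "yeah", "yes", "yep", "sure",
}
MAX_WORDS = 3  # longer utterances are treated as real speech


def is_backchannel(transcript: str) -> bool:
    """Return True if the utterance consists only of a few acknowledgment words."""
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", transcript.lower())
    if not words or len(words) > MAX_WORDS:
        return False
    return all(word in BACKCHANNELS for word in words)


def should_interrupt(transcript: str) -> bool:
    """Interrupt the assistant only when the user said something substantive."""
    return not is_backchannel(transcript)


if __name__ == "__main__":
    for text in ["Hmm", "Uh-huh.", "ok right", "wait, stop", "ok but what about pricing"]:
        print(repr(text), "->", "interrupt" if should_interrupt(text) else "keep talking")
```

The obvious downside is that waiting for a transcript adds latency before a real interruption takes effect, so I am not sure this is practical.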

Has anyone found a reliable workaround for this?

Or is there an official recommendation for handling this kind of scenario?

Thanks in advance 🙏