I am using the Realtime API to navigate IVR systems. Generally it performs pretty well, but there is one specific case that produces odd results.
I am using the following turn-detection settings:
```python
'turn_detection': {
    'type': 'server_vad',
    'silence_duration_ms': 1000,  # default is 500 ms
},
```
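For context, this is roughly how that setting gets applied via a `session.update` event (a minimal sketch with the websocket plumbing omitted; only the `turn_detection` portion matters for this issue):

```python
import json

def build_session_update(silence_ms: int = 1000) -> str:
    """Build the session.update event that carries the server VAD settings."""
    event = {
        'type': 'session.update',
        'session': {
            'turn_detection': {
                'type': 'server_vad',
                'silence_duration_ms': silence_ms,  # default is 500 ms
            },
        },
    }
    return json.dumps(event)

# ws.send(build_session_update())  # sent over the Realtime API websocket
```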
Originally, with the default silence_duration_ms of 500 ms, the Realtime API performed poorly: it tended to create conversation items that captured only a small portion of a complete thought (e.g., "What is your?"), and it would often jump in with an incorrect response. Bumping this up to 1000 ms has reduced the problem.
However, I still see issues when a pause comes close to the 1000 ms threshold. For example:
IVR: Now tell me the member's date of birth including the year (~500 ms pause ... about to say "For example")
OpenAI: Jan 1, 2025
IVR: For example, (interrupted)
OpenAI: Jan 1, 2025
IVR: That's January (interrupted)
OpenAI: Yes
When this is triggered, it seems like silence_duration_ms is no longer respected and the Realtime API is quick to interrupt. This is also where I see the most hallucinations; sometimes they do not even appear in the transcript (similar to Creepy bug of Realtime API + Function Calling: Extra Audio Not in Transcription - #8 by tsar).
Increasing silence_duration_ms beyond 1000 ms triggers its own errors when navigating IVRs, because the response latency becomes too slow. However, as mentioned above, silence_duration_ms appears to be ignored once this scenario begins.
I’ve simulated this with vanilla gpt-4o chat completions and cannot reproduce the behavior, so it feels like something related to the turn detection.
I’m curious whether others have come up with ways to prevent the Realtime API from responding so aggressively, and/or whether you have run into this as well.
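One workaround I've been considering (not yet validated) is disabling server VAD entirely (`'turn_detection': None` in session.update) and doing end-of-turn detection client-side, so the 1000 ms threshold is enforced by my own code rather than the server. A rough sketch, assuming a naive energy-based silence detector (the frame size and energy threshold below are illustrative values, not tuned ones):

```python
# Client-side end-of-turn detection: with server VAD disabled, the client
# decides when to send input_audio_buffer.commit / response.create.

FRAME_MS = 20             # duration of each audio frame fed to the detector
ENERGY_THRESHOLD = 500.0  # frames below this energy count as silence (illustrative)

class TurnDetector:
    def __init__(self, silence_duration_ms: int = 1000):
        self.silence_duration_ms = silence_duration_ms
        self.heard_speech = False
        self.silent_ms = 0

    def feed(self, frame_energy: float) -> bool:
        """Feed one frame's energy; return True when the caller's turn has ended."""
        if frame_energy >= ENERGY_THRESHOLD:
            self.heard_speech = True
            self.silent_ms = 0
        elif self.heard_speech:
            self.silent_ms += FRAME_MS
            if self.silent_ms >= self.silence_duration_ms:
                # Here the client would send input_audio_buffer.commit followed
                # by response.create, then reset for the next turn.
                self.heard_speech = False
                self.silent_ms = 0
                return True
        return False
```

The appeal is that a mid-sentence 500 ms pause can never trigger a response, and the threshold is applied consistently no matter what the server's VAD decides; the cost is doing your own audio buffering and losing the server's smarter speech detection.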