Hi everyone,
I’m implementing a voice bot using the Realtime API over WebSocket (not WebRTC/SIP), and I’m running into an issue with idle_timeout_ms.
The Problem
When I configure idle_timeout_ms in my session:
    {
      "type": "session.update",
      "session": {
        "turn_detection": {
          "type": "server_vad",
          "idle_timeout_ms": 6000
        }
      }
    }
Both of these scenarios trigger the same input_audio_buffer.speech_stopped event:
- User finishes speaking → speech_stopped
- Idle timeout triggers → speech_stopped (also)
According to the official docs:
“When the timeout is triggered, the server sends an input_audio_buffer.timeout_triggered event”
However, this event only exists for WebRTC/SIP connections, not WebSocket.
The Question
Is there any way to distinguish between these two cases in WebSocket?
I need to know when the model is responding due to idle_timeout_ms vs. when it’s responding to actual user speech, so I can handle them differently in my application logic.
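To make the goal concrete, here is a rough Python sketch of the handling I would like to be able to write. It assumes the input_audio_buffer.timeout_triggered event is actually delivered, which is not what I see over WebSocket. It also assumes that a response.created event follows whichever trigger caused the response. The class name and the cause labels are made up for illustration, and letting a timeout take precedence over speech_stopped is my own guess, not documented behavior.

```python
from typing import Optional

SPEECH_STOPPED = "input_audio_buffer.speech_stopped"
TIMEOUT_TRIGGERED = "input_audio_buffer.timeout_triggered"
RESPONSE_CREATED = "response.created"


class ResponseCauseTracker:
    """Hypothetical helper: records which trigger events preceded a response."""

    def __init__(self) -> None:
        self._timeout_seen = False
        self._speech_seen = False

    def handle(self, event: dict) -> Optional[str]:
        """Feed one decoded server event.

        Returns a cause label when the event is response.created,
        otherwise None.
        """
        event_type = event.get("type")
        if event_type == TIMEOUT_TRIGGERED:
            self._timeout_seen = True
            return None
        if event_type == SPEECH_STOPPED:
            self._speech_seen = True
            return None
        if event_type == RESPONSE_CREATED:
            # Assumption: a timeout wins if both events were seen,
            # because the timeout also produces speech_stopped.
            if self._timeout_seen:
                cause = "idle_timeout"
            elif self._speech_seen:
                cause = "user_speech"
            else:
                cause = "unknown"
            # Reset so the next response is classified independently.
            self._timeout_seen = False
            self._speech_seen = False
            return cause
        return None
```

Each JSON message from the socket would be decoded and passed to handle(), and the returned label would select the application branch. Without the timeout event, every response ends up labeled "user_speech", which is exactly my problem.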
What I’ve Tried
- Checking audio_start_ms and audio_end_ms duration (unreliable)
- Looking for additional fields in the event (none found)
- Reviewing all server events documentation
Workaround
Currently considering implementing the idle timeout logic manually in my application code instead of relying on idle_timeout_ms, but it would be much cleaner to have a proper way to detect this from the API events.
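A minimal sketch of what that manual timer could look like in Python, with idle_timeout_ms left out of the session config. The class and method names are hypothetical. When to arm and cancel it is my own assumption: arm when the bot's audio finishes playing, cancel on input_audio_buffer.speech_started. Timestamps are passed in explicitly so the logic does not depend on a particular event loop.

```python
from typing import Optional


class ManualIdleTimer:
    """Hypothetical application-side replacement for idle_timeout_ms."""

    def __init__(self, timeout_ms: int) -> None:
        self.timeout_s = timeout_ms / 1000.0
        self._deadline: Optional[float] = None

    def arm(self, now: float) -> None:
        """Start or restart the countdown (e.g. when bot audio ends)."""
        self._deadline = now + self.timeout_s

    def cancel(self) -> None:
        """Stop the countdown (e.g. when the user starts speaking)."""
        self._deadline = None

    def poll(self, now: float) -> bool:
        """Return True once when the deadline has passed, then disarm."""
        if self._deadline is not None and now >= self._deadline:
            self._deadline = None
            return True
        return False
```

In practice I would call poll(time.monotonic()) from the receive loop or a periodic task. When it returns True, the application knows the turn is an idle timeout and can send its own response.create, so the cause never has to be inferred from speech_stopped.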
Has anyone found a solution for this, or is this a feature that should be added to the WebSocket interface?
Thanks!