WebSocket: Can't distinguish idle_timeout from regular speech_stopped - is this expected?

Hi everyone,

I’m implementing a voice bot using the Realtime API over WebSocket (not WebRTC/SIP), and I’m running into an issue with idle_timeout_ms.

The Problem

When I configure idle_timeout_ms in my session:

```json
{
  "type": "session.update",
  "session": {
    "turn_detection": {
      "type": "server_vad",
      "idle_timeout_ms": 6000
    }
  }
}
```
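For context, this is roughly how I apply that config over the WebSocket (Python sketch; the helper name build_session_update is mine, not part of any SDK):

```python
import json

def build_session_update(idle_ms: int) -> str:
    """Serialize the session.update payload shown above (helper name is mine)."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "idle_timeout_ms": idle_ms,
            }
        },
    })

# Then, on an open connection: await ws.send(build_session_update(6000))
```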

Both of these scenarios trigger the same input_audio_buffer.speech_stopped event:

  1. User finishes speaking → speech_stopped

  2. Idle timeout triggers → speech_stopped (also)
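To illustrate, my event handler ends up looking something like this (Python sketch; the handler name is mine), and both cases fall into the same branch:

```python
import json

def handle_event(raw: str) -> str:
    """Dispatch one server event received over the Realtime WebSocket."""
    event = json.loads(raw)
    if event["type"] == "input_audio_buffer.speech_stopped":
        # Reached both when the user actually stopped talking AND when
        # idle_timeout_ms fired -- the payload carries no flag saying which.
        return "speech_stopped (ambiguous)"
    return event["type"]
```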

According to the official docs:

“When the timeout is triggered, the server sends an input_audio_buffer.timeout_triggered event”

However, this event only exists for WebRTC/SIP connections, not WebSocket.

The Question

Is there any way to distinguish between these two cases in WebSocket?

I need to know when the model is responding due to idle_timeout_ms vs. when it’s responding to actual user speech, so I can handle them differently in my application logic.

What I’ve Tried

  • Checking audio_start_ms and audio_end_ms duration (unreliable)

  • Looking for additional fields in the event (none found)

  • Reviewing all server events documentation

Workaround

I’m currently considering implementing the idle-timeout logic manually in my application code instead of relying on idle_timeout_ms, but it would be much cleaner to have a proper way to detect this from the API events.
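As a sketch of what I mean (asyncio; IdleTimer is my own name, not part of the API): leave idle_timeout_ms unset server-side, reset an application timer on every speech_stopped, and if the timer fires, send response.create myself. That way I know unambiguously which responses are idle-driven:

```python
import asyncio

class IdleTimer:
    """Restartable timer: call reset() on every speech_stopped event;
    if no reset arrives within timeout_s, on_idle() fires once."""

    def __init__(self, timeout_s: float, on_idle):
        self._timeout_s = timeout_s
        self._on_idle = on_idle
        self._task = None

    def reset(self):
        # Cancel any pending countdown and start a fresh one.
        if self._task is not None:
            self._task.cancel()
        self._task = asyncio.create_task(self._wait())

    async def _wait(self):
        try:
            await asyncio.sleep(self._timeout_s)
            self._on_idle()  # app now KNOWS the next response is idle-driven
        except asyncio.CancelledError:
            pass  # a reset arrived in time; do nothing

    def cancel(self):
        if self._task is not None:
            self._task.cancel()
            self._task = None
```

In on_idle I would then send a response.create over the WebSocket and tag that response as idle-driven in my own state.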

Has anyone found a solution for this, or is this a feature that should be added to the WebSocket interface?

Thanks!