Issue: Sideband WebSocket connection cannot access video/video-derived context when the client uses WebRTC AV streams
## Summary
When using the Realtime API with browser-side WebRTC (audio/video/datachannel) and a server-side sideband WebSocket, I found that:
- The browser WebRTC client can send video frames, and the model correctly analyzes the video content.
- The sideband WebSocket connection, however, cannot access or leverage any video context.
- If I send messages from the sideband channel (for the same `callId`), the Realtime model responds as if it cannot see the video stream at all.

This makes the sideband connection effectively “blind” to the AV context of the session.
## Detailed Behavior
### ✔ What works
- Browser establishes a WebRTC connection including:
  - video track
  - audio track
  - datachannel
- The model correctly analyzes video frames sent via WebRTC.
- When the browser sends text through the datachannel (e.g., “What is happening in the video?”), Realtime provides correct, video-aware answers.
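For reference, the working browser-side path boils down to two client events sent over the datachannel. This is only a sketch of how I send them; the event shapes follow my understanding of the Realtime API's `conversation.item.create` and `response.create` client events, and `buildTextQuestion` is a helper name of my own:

```typescript
// Build the two Realtime client events that ask a text question.
// Event shapes assumed from the Realtime API docs (conversation.item.create
// followed by response.create); verify against the current reference.
function buildTextQuestion(text: string): string[] {
  return [
    JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "message",
        role: "user",
        content: [{ type: "input_text", text }],
      },
    }),
    JSON.stringify({ type: "response.create" }),
  ];
}

// Browser usage (assumes `pc` is the established RTCPeerConnection):
// const dc = pc.createDataChannel("oai-events");
// for (const msg of buildTextQuestion("What is happening in the video?")) {
//   dc.send(msg);
// }
```

Sent this way, the model's answer reflects the live video, as described above.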
### ✘ What does not work
When my server connects to the same session via the sideband WebSocket:

    POST /v1/realtime/sessions/{callId}

and sends text from the sideband:

    {
      "type": "response.create",
      "response": {
        "instructions": "Question from server: what is happening in the video?"
      }
    }

→ the model’s reply shows no video understanding. It behaves as if the video stream does not exist.
This means:
- Browser → Model: video-aware
- Sideband → Model: video-unaware ✘
- Yet both are connected to the same Realtime session with the same `callId`.
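For completeness, here is roughly how my server attaches and sends the failing event. The URL shape and auth header are assumptions based on the sideband endpoint quoted above (the real connection code uses the `ws` npm package, shown here only in comments so the snippet stays dependency-free):

```typescript
// Assumed sideband attach URL derived from the call endpoint; check this
// against the current Realtime sideband docs before relying on it.
function sidebandUrl(callId: string): string {
  return `wss://api.openai.com/v1/realtime?call_id=${encodeURIComponent(callId)}`;
}

// Build the server-side question event (same payload as shown above).
function buildServerQuestion(question: string): string {
  return JSON.stringify({
    type: "response.create",
    response: { instructions: `Question from server: ${question}` },
  });
}

// Server usage with the `ws` package (sketch):
// import WebSocket from "ws";
// const ws = new WebSocket(sidebandUrl(callId), {
//   headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
// });
// ws.on("open", () =>
//   ws.send(buildServerQuestion("what is happening in the video?")));
```

With this path, the response arrives normally but contains no video understanding.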
## Expected Behavior
Since the WebRTC client and the sideband WebSocket share the same session, with one shared conversation state, I expect:
- the sideband connection to have access to the same multimodal context (audio/video) that the browser client provides;
- the model to answer video-based questions regardless of which connection provides the input.
## Key Question
Is the sideband WebSocket supposed to have full access to the session’s AV context?
If yes, is this a bug?
If not, what is the expected architecture for server-side logic that depends on video content?
In particular:
- Does the sideband connection have access to video frames sent from the browser?
- Should tool calls triggered by sideband input still be able to consider video content?
- Is there a recommended way to allow server-side logic to use the same multimodal context as the WebRTC client?
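On the last point, one workaround I am considering is to have the browser capture a frame, forward it to my server, and have the server inject it into the conversation over the sideband before asking. This assumes Realtime accepts image content parts (`input_image` with a data URL) in conversation items, which I have not been able to confirm for sideband connections; `buildFrameQuestion` is a hypothetical helper of mine:

```typescript
// Workaround sketch: inject a captured frame plus a text question as one
// user message, then request a response. The "input_image" content part is
// an ASSUMPTION about the Realtime item schema, not a confirmed API shape.
function buildFrameQuestion(frameBase64Png: string, question: string): string[] {
  return [
    JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "message",
        role: "user",
        content: [
          {
            type: "input_image",
            image_url: `data:image/png;base64,${frameBase64Png}`,
          },
          { type: "input_text", text: question },
        ],
      },
    }),
    JSON.stringify({ type: "response.create" }),
  ];
}
```

Even if this works, it duplicates frames the session already has, which is why I would prefer the sideband to see the existing AV context directly.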
## Additional Notes
- Tool calls work correctly on both the browser and sideband connections.
- Both channels receive the same `response.*` events.
- Only video-derived reasoning is missing on sideband input.
- I am using:
  - WebRTC AV for browser → Realtime
  - WS sideband connecting via `POST /v1/realtime/calls/{callId}`
- No SDK issues; I see the same behavior using the direct HTTP + WebSocket wire protocol.
## Request
Please clarify:
- Is this a limitation in the current Realtime sideband implementation?
- If so, is there a roadmap to allow sideband connections to access the multimodal context?
- Is there a recommended alternative approach for server-side logic that needs video understanding?