Sideband WebSocket connection cannot access video/video-derived context when client uses WebRTC AV streams


Summary

When using the Realtime API with browser-side WebRTC (audio/video/datachannel) and a server-side sideband WebSocket, I found that:

  • The browser WebRTC client can send video frames and the model can correctly analyze video content.
  • But the sideband WebSocket connection cannot access or leverage any video context.
  • If I send messages from the sideband channel (with the same callId), the Realtime model responds as if it cannot see the video stream at all.

This makes the sideband connection effectively “blind” to the AV context of the session.


🔍 Detailed Behavior

✔️ What works

  1. Browser establishes a WebRTC connection including:

    • video track
    • audio track
    • datachannel
  2. Model can correctly analyze video frames sent via WebRTC.

  3. When the browser sends text through the datachannel (e.g., “What is happening in the video?”), Realtime provides correct video-aware answers.
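For reference, the text question in step 3 can be sketched like this. The event shape follows the Realtime wire protocol; the `dataChannel` handle and the helper name `buildUserQuestion` are our own, assumed to come from your existing `RTCPeerConnection` setup:

```javascript
// Build a Realtime `response.create` event carrying a text question.
// Sent from the browser over the negotiated WebRTC datachannel, the model
// answers using the session's current multimodal context (video-aware).
function buildUserQuestion(text) {
  return {
    type: "response.create",
    response: { instructions: text },
  };
}

// In the browser:
//   dataChannel.send(JSON.stringify(buildUserQuestion("What is happening in the video?")));
```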

✘ What does not work

When my server connects to the same session via sideband WebSocket:

POST /v1/realtime/calls/{callId}

Sending text from sideband:

{
  "type": "response.create",
  "response": {
    "instructions": "Question from server: what is happening in the video?"
  }
}

→ The model’s reply does not include video understanding.
It behaves as if the video stream does not exist.
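A minimal sketch of the server side of this repro, assuming the `ws` npm package and the sideband WebSocket URL as I understand it from the docs (adjust the URL and headers if your deployment differs):

```javascript
// Build the exact event shown above, as sent over the sideband socket.
function buildSidebandQuestion(text) {
  return {
    type: "response.create",
    response: { instructions: `Question from server: ${text}` },
  };
}

// Guarded so the module can be loaded without opening a connection.
if (process.env.CALL_ID && process.env.OPENAI_API_KEY) {
  const WebSocket = require("ws"); // npm package, not the browser global

  // Attach to the same Realtime session the browser created via WebRTC.
  const ws = new WebSocket(
    `wss://api.openai.com/v1/realtime?call_id=${process.env.CALL_ID}`,
    { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
  );

  ws.on("open", () => {
    ws.send(JSON.stringify(buildSidebandQuestion("what is happening in the video?")));
  });

  ws.on("message", (data) => {
    const event = JSON.parse(data.toString());
    // Observed behavior: the resulting output contains no video understanding.
    if (event.type === "response.done") {
      console.log(JSON.stringify(event.response.output));
    }
  });
}
```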

❗ This means:

  • Browser → Model (video-aware) ✔️
  • Sideband → Model (video-unaware) ✘
  • But both are connected to the same Realtime session with the same callId.

📌 Expected Behavior

Since the WebRTC client and the sideband WebSocket share the same session, with one shared conversation state, I expect:

  • The sideband connection should have access to the same multimodal context (audio/video) that the browser client provides.
  • The model should answer video-based questions no matter which connection provides the input.

🤔 Key Question

Is the sideband WebSocket supposed to have full access to the session’s AV context?
If yes, is this a bug?
If not, what is the expected architecture for server-side logic that depends on video content?

In particular:

  1. Does the sideband connection have access to video frames sent from the browser?
  2. Should tool calls triggered by sideband input still be able to consider video content?
  3. Is there a recommended way to allow server-side logic to use the same multimodal context as the WebRTC client?

🧪 Additional Notes

  • Tool calls work correctly on both browser and sideband.

  • Both channels receive the same response.* events.

  • Only video-derived reasoning is missing on sideband input.

  • I am using:

    • WebRTC AV for browser → Realtime
    • WS sideband connecting via POST /v1/realtime/calls/{callId}
  • No SDK issues; same behavior using direct HTTP + WebSocket wire protocol.


🙏 Request

Please clarify:

  1. Is this a limitation in the current Realtime sideband implementation?
  2. If so, is there a roadmap to allow sideband connections to access the multimodal context?
  3. Is there a recommended alternative approach for server-side logic that needs video understanding?


Welcome to the developer community, @L_J.W!

The realtime API can be accessed via WebSockets, SIP, and WebRTC.

Out of these three, the following enable a client to connect directly with the API servers:

  1. WebRTC
  2. SIP

Because the client talks directly to the API servers in those cases, tool use and other business logic must reside on the application server to keep it private and client-agnostic.

Hence, the realtime API now has sideband options for both SIP and WebRTC connections to keep tool use, business logic, and other details secure on the server side.

Quoting from the docs:

A sideband connection means there are two active connections to the same Realtime session: one from the user’s client and one from your application server. The server connection can be used to monitor the session, update instructions, and respond to tool calls.


Thanks for the explanation.

However, I believe there is still an important issue:

Since the server-side connection is a sideband connection that shares the same callId and the same Realtime session, logically the server–Realtime interaction should have access to the full context, including the video stream provided by the client’s WebRTC channel.

But in practice, this does not happen.

When the server sends messages through the sideband WebSocket, the Realtime model responds as if it cannot see the client’s video input at all, even though the browser client and the server share one session.

Please take this issue into consideration — the expectation is that sideband should have equal access to the multimodal context of the session.


Yes, via conversation.item.retrieve after a video frame has been added to the context.
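A sketch of that client event, assuming an `item_id` captured from an earlier `conversation.item.created` server event (the id value below is made up for illustration):

```javascript
// Ask the server to send back a full conversation item, including input
// that was added on the server side (e.g. a video-frame item). The reply
// arrives as a `conversation.item.retrieved` server event.
function buildItemRetrieve(itemId) {
  return { type: "conversation.item.retrieve", item_id: itemId };
}

// Example, over an already-open sideband WebSocket:
//   ws.send(JSON.stringify(buildItemRetrieve("item_abc123")));
```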

Yes, although some refinement of the API may be needed to make this work as you expect. Can you say more about your use case?


We want to implement proactive monitoring of the video stream, so that when an abnormal situation appears in the video the Realtime API can speak up on its own. Since the Realtime API currently cannot monitor video or initiate speech by itself, our idea is to periodically send a text-type message via the sideband connection, asking whether it has detected any abnormality in the recent video frames, and use that to trigger the model to speak.
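The polling idea could be sketched roughly as follows. The event shape follows the wire protocol, but the interval, prompt wording, and helper names are our own choices, not anything the API prescribes, and this only works if the sideband response actually sees the video context:

```javascript
// Build a probe asking the model to check the latest video frames.
function buildAnomalyProbe() {
  return {
    type: "response.create",
    response: {
      instructions:
        "Check the most recent video frames. If anything abnormal is visible, describe it aloud; otherwise stay silent.",
    },
  };
}

// Periodically send the probe over an already-open sideband WebSocket.
// Returns the timer so the caller can stop monitoring with clearInterval.
function startMonitor(ws, intervalMs = 10000) {
  return setInterval(() => {
    ws.send(JSON.stringify(buildAnomalyProbe()));
  }, intervalMs);
}
```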

However, in our tests, when sending messages through the sideband connection, the API’s responses do not have access to the video context. Not only the sideband connection — even when using WebRTC in the browser and sending messages through the dataChannel, the API still cannot see the video context when generating responses.