WebRTC DataChannel requests do not include video context, but audio requests do

I am integrating the Realtime API over WebRTC from a browser client.
During testing, I found an unexpected inconsistency:


:red_exclamation_mark: Problem Summary

When I send user queries via audio (microphone), the model correctly receives both audio + video context from the peer connection.
The model can analyze the live camera feed and respond based on video content.

However, when I send the exact same query through the WebRTC dataChannel:

dataChannel.send(JSON.stringify({
  type: "response.create"
}));

the model does NOT have access to any video context.
It behaves as if no video stream is attached to the session at all.
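For completeness, here is roughly the text-query sequence my client sends over the dataChannel (a minimal sketch; the helper functions are mine, and the event shapes follow the Realtime API client-event reference for `conversation.item.create` and `response.create`):

```javascript
// Build the client events I send over the dataChannel.
// Event shapes assumed from the Realtime API client-event docs.
function buildTextItemEvent(text) {
  return {
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [{ type: "input_text", text }],
    },
  };
}

function buildResponseCreateEvent() {
  return { type: "response.create" };
}

// Usage in the browser, with an open dataChannel:
// dataChannel.send(JSON.stringify(buildTextItemEvent("What do you see right now?")));
// dataChannel.send(JSON.stringify(buildResponseCreateEvent()));
```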

This is surprising because:

  • Both audio and video tracks are already flowing to the Realtime API

  • DataChannel and media tracks share the same RTCPeerConnection

  • They share the same callId and session

I expected that any request within the same WebRTC session would have access to the same multimodal context.


:check_mark: What I expected

Since the DataChannel and media tracks share the same WebRTC session and callId, I expected:

  • Text requests (DataChannel)

  • Voice requests (AudioTrack)

…to be processed with the same multimodal context, including video.

Especially because the documentation describes the session as unified.


:cross_mark: What actually happens

| Input type | Does the model see video? | Expected? |
| --- | --- | --- |
| :microphone: Audio input (speech) | :check_mark: Yes — video included | :check_mark: Yes |
| :speech_balloon: DataChannel input (text) | :cross_mark: No — video missing | :cross_mark: No |

So the model only has vision capabilities when the prompt is delivered via audio.


:magnifying_glass_tilted_left: Implementation Notes

  • WebRTC connection created with:
navigator.mediaDevices.getUserMedia({ video: true, audio: true })

  • Video track is attached to the RTCPeerConnection and visible in the Realtime playground

  • Audio and video are continuously streamed to OpenAI

  • The only difference is the modality of the user query (audio vs DataChannel text)
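Concretely, the setup looks roughly like this (a sketch of my client code; the function and variable names are mine, not from any library):

```javascript
// Hedged sketch: media tracks and the dataChannel share one RTCPeerConnection.
const MEDIA_CONSTRAINTS = { video: true, audio: true };

async function connect(pc) {
  // pc: an RTCPeerConnection already configured for the Realtime API session
  const stream = await navigator.mediaDevices.getUserMedia(MEDIA_CONSTRAINTS);
  for (const track of stream.getTracks()) {
    pc.addTrack(track, stream); // both the audio and video track flow to OpenAI
  }
  // Same peer connection, same callId/session as the media tracks
  const dataChannel = pc.createDataChannel("oai-events");
  return { stream, dataChannel };
}
```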


:thinking: Question

Is this expected behavior?

Should DataChannel-based requests not inherit video context by design?

Or is this a missing feature / bug in multimodal input handling for text requests inside a WebRTC session?

If intentional, could the team clarify:

  • How should we trigger multimodal inference via DataChannel?

  • Should the client explicitly send a video frame (e.g., captured via canvas from the video track)?

  • Is this a known limitation of the current Realtime model?
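If explicit frame-sending is the intended path, this is the workaround I would try (hedged: it assumes the Realtime API accepts `input_image` content on `conversation.item.create`, as described in the image-input docs; the helper names are mine):

```javascript
// Build a user message that pairs a text prompt with a camera frame.
// Assumes "input_image" content with a base64 data URL is accepted.
function buildImageItemEvent(dataUrl, prompt) {
  return {
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [
        { type: "input_text", text: prompt },
        { type: "input_image", image_url: dataUrl },
      ],
    },
  };
}

// Browser-side: grab the current frame from a <video> element showing the camera.
function captureFrame(videoEl) {
  const canvas = document.createElement("canvas");
  canvas.width = videoEl.videoWidth;
  canvas.height = videoEl.videoHeight;
  canvas.getContext("2d").drawImage(videoEl, 0, 0);
  return canvas.toDataURL("image/jpeg", 0.8); // base64 data URL
}

// dataChannel.send(JSON.stringify(buildImageItemEvent(captureFrame(videoEl), "What do you see?")));
// dataChannel.send(JSON.stringify({ type: "response.create" }));
```

But doing this for a continuous monitoring loop feels like re-implementing the video stream that the session already has, which is why I'd like clarification first.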


:folded_hands: Why this is important

We are building a continuous video monitoring use case (e.g., “tell me when something appears in the camera”), and need the server to periodically send DataChannel text requests to the model while the camera feed is active.

The current behavior forces all requests to go through audio, which is not always appropriate.


Thanks for your help!
I’d appreciate guidance on whether this is:

  • expected behavior,

  • a known limitation,

  • or a bug that needs a fix.