I’m currently building a Realtime WebRTC agent using openai-agents-js.
Everything works great: audio in/out, video track streaming, the model understanding the video content, tool calls, etc.
However, I've run into one behavior that I can't resolve, and I'd like clarification on whether it is expected or a missing feature.
Problem
Although the Realtime model can see the video stream (confirmed — I tested object recognition, environment understanding, etc.), it never proactively speaks based on what it sees in the video.
It only responds after I speak (i.e., after VAD detects my audio and triggers a user turn).
In other words:
- The model correctly receives & understands the video track
- But visual events do NOT trigger model output
- The model only speaks after my audio input, not when the video changes
This happens even if I update the session with instructions like:
"You are allowed to speak at any time based on video input.
If you observe something important in the video, please proactively speak."
The instructions are accepted, but the model still does not speak unless an audio turn is triggered first.
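The update itself is just a plain `session.update` event over the data channel. Roughly (the channel interface and helper names below are illustrative, not openai-agents-js API; only the event shape comes from the Realtime API):

```typescript
// Sketch: pushing new instructions to the Realtime session.
// `SendableChannel` stands in for the WebRTC data channel (or WebSocket).

interface SendableChannel {
  send(data: string): void;
}

// Build a `session.update` event carrying the new instructions.
function buildSessionUpdate(instructions: string) {
  return {
    type: "session.update",
    session: { instructions },
  };
}

// Serialize and send it over the open channel.
function updateInstructions(channel: SendableChannel, instructions: string) {
  channel.send(JSON.stringify(buildSessionUpdate(instructions)));
}
```

The server acknowledges the update (I see the updated instructions reflected in `session.updated`), so the prompt is definitely being applied; it just doesn't change the turn-taking behavior.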
What I expected
When the model sees something meaningful in the video (e.g., person enters the frame, an object appears), I expect:
- the model to create its own turn, and
- proactively generate audio output,
- even if I haven't spoken anything
Basically, I expected functionality similar to the ChatGPT app’s live video mode, where the assistant can speak proactively based on visual input.
What actually happens
- Video stream is successfully sent via WebRTC (`facingMode: "environment"`, 720p, etc.)
- Model understands the video when asked, after I speak
- But visual input never triggers the model to speak on its own
- The model only seems to generate audio after a voice turn is initiated
It looks like visual input is currently passive, not an event that can trigger a model response.
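For reference, the video capture itself is standard WebRTC; nothing exotic on my side (sketch, with the browser-only calls commented out):

```typescript
// Capture constraints: rear camera, 720p. These are standard
// getUserMedia options, nothing library-specific.
const videoConstraints = {
  video: {
    facingMode: { ideal: "environment" },
    width: { ideal: 1280 },
    height: { ideal: 720 },
  },
  audio: true,
};

// In the browser, the stream is attached to the peer connection:
// const stream = await navigator.mediaDevices.getUserMedia(videoConstraints);
// stream.getTracks().forEach((t) => pc.addTrack(t, stream));
```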
Question
Is this behavior expected in the current Realtime API?
Specifically:
Does the Realtime model currently support proactive generation triggered solely by video events, without a user audio turn?
If not:
- Will proactive video-triggered turns be supported in a future update?
- Is there a recommended workaround (e.g., forcing turns from the client/server, using tools, or sending a manual trigger)?
- Does OpenAI plan to expose video-frame events or auto-vision-turn creation?
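For concreteness, the kind of manual trigger I have in mind looks like this: detect a visual change on the client and force a model turn with `response.create`. The naive frame-difference check and helper names are mine; only the `response.create` event shape comes from the Realtime API:

```typescript
// Average per-pixel difference between two RGBA frames (red channel only,
// for speed). `prev`/`next` are ImageData-style pixel buffers.
function frameDelta(prev: Uint8ClampedArray, next: Uint8ClampedArray): number {
  let delta = 0;
  for (let i = 0; i < prev.length; i += 4) {
    delta += Math.abs(prev[i] - next[i]);
  }
  return delta / (prev.length / 4);
}

// When the delta crosses a threshold, the client would send this event
// to force a model turn, hinting at why the turn was triggered.
function buildVideoTrigger(note: string) {
  return {
    type: "response.create",
    response: {
      instructions: `A visual change was detected: ${note}. Describe it aloud.`,
    },
  };
}
```

This works, but it means I'm reimplementing "visual VAD" on the client, which is why I'm asking whether something built-in exists or is planned.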
Additional context
- Using a WebRTC client → openai-agents-js
- Sideband connection on the server for tool execution
- Video track confirmed working (the model can describe objects/people)
- Audio VAD works correctly
- The issue only affects proactive output based on video