I’m currently building a Realtime WebRTC agent using openai-agents-js.
Everything works great: audio in/out, video track streaming, the model understanding the video content, tool calls, etc.
However, I've run into one behavior that I can't resolve, and I'd like clarification on whether it is expected or a missing feature.
Problem
Although the Realtime model can see the video stream (confirmed — I tested object recognition, environment understanding, etc.), it never proactively speaks based on what it sees in the video.
It only responds after I speak (i.e., after VAD detects my audio and triggers a user turn).
In other words:
- The model correctly receives & understands the video track
- But visual events do NOT trigger model output
- The model only speaks after my audio input, not when the video changes
This happens even if I update the session with instructions like:
"You are allowed to speak at any time based on video input.
If you observe something important in the video, please proactively speak."
The instructions are accepted, but the model still does not speak unless an audio turn is triggered first.
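The update itself is just a plain `session.update` event over the data channel. Roughly (the channel interface and helper names below are illustrative, not openai-agents-js API; only the event shape comes from the Realtime API):

```typescript
// Sketch: pushing new instructions to the Realtime session.
// `SendableChannel` stands in for the WebRTC data channel (or WebSocket).

interface SendableChannel {
  send(data: string): void;
}

// Build a `session.update` event carrying the new instructions.
function buildSessionUpdate(instructions: string) {
  return {
    type: "session.update",
    session: { instructions },
  };
}

// Serialize and send it over the open channel.
function updateInstructions(channel: SendableChannel, instructions: string) {
  channel.send(JSON.stringify(buildSessionUpdate(instructions)));
}
```

The server acknowledges the update (I see the updated instructions reflected in `session.updated`), so the prompt is definitely being applied; it just doesn't change the turn-taking behavior.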
What I expected
When the model sees something meaningful in the video (e.g., person enters the frame, an object appears), I expect:
- the model to create its own turn, and
- proactively generate audio output,
- even if I haven't spoken anything
Basically, I expected functionality similar to the ChatGPT app’s live video mode, where the assistant can speak proactively based on visual input.
What actually happens
- Video stream is successfully sent via WebRTC (`facingMode: "environment"`, 720p, etc.)
- Model understands the video when asked, after I speak
- But visual input never triggers the model to speak on its own
- The model only seems to generate audio after a voice turn is initiated
It looks like visual input is currently passive, not an event that can trigger a model response.
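For reference, the video capture itself is standard WebRTC; nothing exotic on my side (sketch, with the browser-only calls commented out):

```typescript
// Capture constraints: rear camera, 720p. These are standard
// getUserMedia options, nothing library-specific.
const videoConstraints = {
  video: {
    facingMode: { ideal: "environment" },
    width: { ideal: 1280 },
    height: { ideal: 720 },
  },
  audio: true,
};

// In the browser, the stream is attached to the peer connection:
// const stream = await navigator.mediaDevices.getUserMedia(videoConstraints);
// stream.getTracks().forEach((t) => pc.addTrack(t, stream));
```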
Question
Is this behavior expected in the current Realtime API?
Specifically:
Does the Realtime model currently support proactive generation triggered solely by video events, without a user audio turn?
If not:
- Will proactive video-triggered turns be supported in a future update?
- Is there a recommended workaround (e.g., forcing turns from the client/server, using tools, or sending a manual trigger)?
- Does OpenAI plan to expose video-frame events or auto-vision-turn creation?
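For concreteness, the kind of manual trigger I have in mind looks like this: detect a visual change on the client and force a model turn with `response.create`. The naive frame-difference check and helper names are mine; only the `response.create` event shape comes from the Realtime API:

```typescript
// Average per-pixel difference between two RGBA frames (red channel only,
// for speed). `prev`/`next` are ImageData-style pixel buffers.
function frameDelta(prev: Uint8ClampedArray, next: Uint8ClampedArray): number {
  let delta = 0;
  for (let i = 0; i < prev.length; i += 4) {
    delta += Math.abs(prev[i] - next[i]);
  }
  return delta / (prev.length / 4);
}

// When the delta crosses a threshold, the client would send this event
// to force a model turn, hinting at why the turn was triggered.
function buildVideoTrigger(note: string) {
  return {
    type: "response.create",
    response: {
      instructions: `A visual change was detected: ${note}. Describe it aloud.`,
    },
  };
}
```

This works, but it means I'm reimplementing "visual VAD" on the client, which is why I'm asking whether something built-in exists or is planned.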
Additional context
- Using a WebRTC client → openai-agents-js
- Sideband connection on the server for tool execution
- Video track confirmed working (the model can describe objects/people)
- Audio VAD works correctly
- The issue only affects proactive output based on video