I’m using the Realtime API with modalities: ["text", "audio"] and sending a session.update immediately after the data channel opens to confirm the modalities.
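For context, here's roughly what that looks like on my side (simplified sketch; the SDP exchange is omitted and the variable names are just my own):

```typescript
// Simplified sketch of my setup. The real code performs the SDP offer/answer
// exchange with the Realtime API endpoint; that part is omitted here.
const pc = new RTCPeerConnection();
const dc = pc.createDataChannel("oai-events");

dc.addEventListener("open", () => {
  // Sent immediately after the data channel opens, to confirm both modalities.
  dc.send(
    JSON.stringify({
      type: "session.update",
      session: {
        modalities: ["text", "audio"],
      },
    })
  );
});
```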
The session is created successfully, with both text and audio modalities confirmed in the payload, but during the session:
- I only receive audio events (response.audio.done, etc.).
- I never receive any response.text.delta, response.text.done, or response.output_item.added events containing assistant text.
- This happens even when the AI speaks full sentences, not just tiny utterances.
- There are no response.content_part.added or text delta events either.
I’ve checked everything on my end: the connection is healthy and stays open, and the session.update is acknowledged successfully.
Model used: gpt-4o-realtime-preview-2024-12-17. Prompts are simple and clean. This happens consistently across dozens of sessions.
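For reference, here's a simplified version of how I'm listening for events on the data channel (the helper name is just mine; error handling is omitted), which is how I can tell the text events never arrive:

```typescript
// Attach logging to the same data channel as in the sketch above.
function attachEventLogging(dc: RTCDataChannel) {
  dc.addEventListener("message", (e: MessageEvent) => {
    const event = JSON.parse(e.data);
    switch (event.type) {
      case "response.text.delta":
      case "response.text.done":
        console.log("text event:", event.type); // never fires in my sessions
        break;
      case "response.audio.delta":
      case "response.audio.done":
        console.log("audio event:", event.type); // fires as expected
        break;
      default:
        console.log("other event:", event.type);
    }
  });
}
```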
Questions:
- Is this a known issue?
- Are there any specific conditions under which the Realtime API would suppress text output entirely while streaming audio? For example, does function calling block assistant transcripts from coming in?