Missing response.text.done and response.text.delta events, receiving only audio responses

I’m using the Realtime API with modalities: ["text", "audio"] and sending a session.update immediately after the data channel opens to confirm the modalities.
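For reference, this is roughly the shape of the session.update I send. It is a minimal sketch, not my exact client code: the helper name build_session_update is made up for illustration, and the transport (the data channel send call) is left out.

```python
import json


def build_session_update(modalities):
    """Return the JSON string for a session.update client event.

    Hypothetical helper: only the event type and the modalities field
    come from the setup described in this post.
    """
    event = {
        "type": "session.update",
        "session": {"modalities": list(modalities)},
    }
    return json.dumps(event)


# This string is what gets sent over the data channel once it opens.
payload = build_session_update(["text", "audio"])
print(payload)
```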

The session is created successfully, with both text and audio modalities confirmed in the payload, but during the session:

  • I only receive audio events (response.audio.done, etc.).

  • I never receive any response.text.delta, response.text.done or response.output_item.added events containing assistant text.

  • This happens even when the assistant speaks full sentences, not just short utterances.

  • I don't receive any response.content_part.added events either.
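The check behind the list above can be sketched as a tally of the type field on every incoming message. This is an assumed diagnostic, not an official tool: tally_event_types and missing_events are hypothetical helpers, and the sample messages are invented stand-ins for what arrives on the data channel, not a real capture.

```python
import json
from collections import Counter

# Event types from this post that never show up in my sessions.
EXPECTED_TEXT_EVENTS = [
    "response.text.delta",
    "response.text.done",
    "response.content_part.added",
]


def tally_event_types(raw_messages):
    """Count how many times each event type appears in the raw JSON messages."""
    counts = Counter()
    for raw in raw_messages:
        event = json.loads(raw)
        counts[event.get("type", "<missing type>")] += 1
    return counts


def missing_events(counts, expected):
    """Return the expected event types that were never seen, in order."""
    return [name for name in expected if counts[name] == 0]


# Illustrative messages only; a real session would feed in every
# message received on the data channel.
sample_messages = [
    '{"type": "session.updated"}',
    '{"type": "response.audio.delta"}',
    '{"type": "response.audio.delta"}',
    '{"type": "response.audio.done"}',
]

counts = tally_event_types(sample_messages)
for event_type, count in sorted(counts.items()):
    print(event_type, count)
print("never seen:", missing_events(counts, EXPECTED_TEXT_EVENTS))
```

Logging every type rather than filtering for the ones I expect also shows any events that do arrive but that my handler is not looking for.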

I’ve checked everything on my end: the connection is healthy and stays open.

session.update is acknowledged successfully.

Model used: gpt-4o-realtime-preview-2024-12-17. Prompts are simple and clean. This happens consistently across dozens of sessions.

Questions:

  1. Is this a known issue?

  2. Are there any specific conditions under which the Realtime API would suppress text output entirely while streaming audio? For example, does function calling block assistant transcripts from arriving?