Realtime API: Unexpected `response.text.delta` instead of audio events

Description

I’m using @openai/realtime-api-beta and intermittently get text-only responses even though every request sets modalities: ["audio", "text"]. When this happens, the event stream switches from audio events to response.text.delta events only.

Steps to Reproduce

  1. Create a realtime session
const session = new RealtimeSession(clientId, userId, rtc);
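// Register the listener before connecting so the event cannot be missed,
// and so a failed connect() rejects here instead of being swallowed.
const sessionCreated = new Promise((resolve) => {
  session.onceRaw("session.created", (data) => {
    console.log("session created data", data);
    resolve(null);
  });
});
await rtc.connect();
await sessionCreated;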
  2. Update the session with audio settings
await session.updateSession({
  model: "gpt-4o-realtime-preview",
  input_audio_transcription: {
    model: "whisper-1" as never,
  },
  modalities: ["audio", "text"],
  input_audio_format: "pcm16",
  output_audio_format: "pcm16",
  turn_detection: {
    type: "server_vad",
    threshold: 0.5,
    prefix_padding_ms: 300,
    silence_duration_ms: 500,
    create_response: false,
    interrupt_response: true,
  } as never,
});
  3. Update again with tool choice + voice
await session.updateSession({
  tool_choice: "auto",
  modalities: ["audio", "text"],
  output_audio_format: "pcm16",
  voice: ensureValidVoice(character.voice),
  tools: [
    {
      type: "function",
      name: "parseTags",
      parameters: { type: "object", properties: {} },
    },
    {
      type: "function",
      name: "notifyConversationEnd",
      description: "Signal that the conversation has concluded.",
      parameters: {
        type: "object",
        properties: {
          END_OF_CONVO: {
            type: "boolean",
            description: "Set to true if the conversation has concluded.",
          },
        },
        required: ["END_OF_CONVO"],
      },
    },
  ] as never,
});
  4. On speechStopped, request a response
session.on("speechStopped", () => {
  session.rtc.realtime.send("response.create", {
    response: {
      tool_choice: "auto",
      instructions:
        "Always speak your answer out loud and also return text\n" +
        "Always use audio output modality\n" +
        character.context +
        "\n" +
        (character.ttsInstructions ?? ""),
      modalities: ["audio", "text"],
      output_audio_format: "pcm16",
    },
  });
});

Expected Behavior

Consistently receive:

  • response.audio.delta
  • response.audio_transcript.delta

Actual Behavior

  • Sometimes only response.text.delta is received (no audio events).
  • Appears to silently fall back to text-only mode.
  • I must then synthesize audio via TTS as a fallback, which increases latency and degrades UX.
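
To make the fallback easier to pin down, a small tracker along these lines could count delta events per response and flag any response that produced text but no audio. This is only a sketch, not part of the library: ModalityTracker, ServerEvent, Summary, and the onTextOnly callback are hypothetical names, and it assumes that delta events carry a response_id field and that response.done carries response.id.

```typescript
// Minimal shape of the server events this sketch inspects (assumed fields).
type ServerEvent = {
  type: string;
  response_id?: string; // assumed present on delta events
  response?: { id: string }; // assumed present on response.done
};

type Summary = {
  responseId: string;
  audio: number; // count of response.audio.delta events
  transcript: number; // count of response.audio_transcript.delta events
  text: number; // count of response.text.delta events
};

// Hypothetical helper: counts delta events per response and reports
// responses that finished with text deltas but no audio deltas.
class ModalityTracker {
  private counts = new Map<string, Summary>();
  private onTextOnly: (summary: Summary) => void;

  constructor(onTextOnly: (summary: Summary) => void) {
    this.onTextOnly = onTextOnly;
  }

  private get(id: string): Summary {
    return (
      this.counts.get(id) ?? { responseId: id, audio: 0, transcript: 0, text: 0 }
    );
  }

  handle(event: ServerEvent): void {
    if (event.type === "response.done") {
      const id = event.response?.id;
      if (id === undefined) return;
      const summary = this.get(id);
      this.counts.delete(id);
      // Responses with neither text nor audio (e.g. tool calls only) are not flagged.
      if (summary.text > 0 && summary.audio === 0) this.onTextOnly(summary);
      return;
    }

    const id = event.response_id;
    if (id === undefined) return;
    const summary = this.get(id);
    switch (event.type) {
      case "response.audio.delta":
        summary.audio += 1;
        break;
      case "response.audio_transcript.delta":
        summary.transcript += 1;
        break;
      case "response.text.delta":
        summary.text += 1;
        break;
      default:
        return; // ignore every other event type
    }
    this.counts.set(id, summary);
  }
}
```

Feeding every raw server event into handle(), through whatever hook the session wrapper exposes for raw events, would give a per-response log of exactly which responses came back text-only, which could be attached to this report.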

Impact

  • Breaks realtime voice interaction expectations.
  • Adds noticeable latency due to manual TTS fallback.

Questions for Support

  • Is this behavior expected under any conditions?
  • How can I ensure audio is always returned when modalities: ["audio", "text"] and output_audio_format: "pcm16" are set?