Description
I’m using @openai/realtime-api-beta and seeing cases where the model returns text only, even though both audio and text are requested. The event stream sometimes switches from audio events to response.text.delta events only.
Steps to Reproduce
- Create a realtime session
  const session = new RealtimeSession(clientId, userId, rtc);
  await new Promise(async (resolve) => {
    session.onceRaw("session.created", (data) => {
      console.log("session created data", data);
      resolve(null);
    });
    await rtc.connect();
  });
- Update the session with audio settings
  await session.updateSession({
    model: "gpt-4o-realtime-preview",
    input_audio_transcription: {
      model: "whisper-1" as never,
    },
    modalities: ["audio", "text"],
    input_audio_format: "pcm16",
    output_audio_format: "pcm16",
    turn_detection: {
      type: "server_vad",
      threshold: 0.5,
      prefix_padding_ms: 300,
      silence_duration_ms: 500,
      create_response: false,
      interrupt_response: true,
    } as never,
  });
- Update again with tool choice + voice
  await session.updateSession({
    tool_choice: "auto",
    modalities: ["audio", "text"],
    output_audio_format: "pcm16",
    voice: ensureValidVoice(character.voice),
    tools: [
      {
        type: "function",
        name: "parseTags",
        parameters: { type: "object", properties: {} },
      },
      {
        type: "function",
        name: "notifyConversationEnd",
        description: "Signal that the conversation has concluded.",
        parameters: {
          type: "object",
          properties: {
            END_OF_CONVO: {
              type: "boolean",
              description: "Set to true if the conversation has concluded.",
            },
          },
          required: ["END_OF_CONVO"],
        },
      },
    ] as never,
  });
- On speechStopped, request a response
  session.on("speechStopped", () => {
    session.rtc.realtime.send("response.create", {
      response: {
        tool_choice: "auto",
        instructions:
          "Always speak your answer out loud and also return text\n" +
          "Always use audio output modality\n" +
          character.context +
          "\n" +
          (character.ttsInstructions ?? ""),
        modalities: ["audio", "text"],
        output_audio_format: "pcm16",
      },
    });
  });
Expected Behavior
Consistently receive:
- response.audio.delta
- response.audio_transcript.delta
Actual Behavior
- Sometimes only response.text.delta is received (no audio events).
- Appears to silently fall back to text-only mode.
- I must then synthesize audio via TTS as a fallback, which increases latency and degrades UX.
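For reference, the text-only case can be detected from the event stream alone. The following is a minimal sketch, not my exact production code: the `RealtimeEvent` shape and the `classifyResponse` helper are hypothetical names introduced for illustration, and it assumes each delta event carries a `response_id` field and that events are buffered until the response finishes.

```typescript
// Hypothetical minimal event shape; only the fields used below are modeled.
// Assumption: every delta event carries the ID of the response it belongs to.
interface RealtimeEvent {
  type: string;
  response_id?: string;
}

type ResponseKind = "audio" | "text-only" | "empty";

// Classify one response by the delta event types observed for it.
function classifyResponse(
  events: RealtimeEvent[],
  responseId: string
): ResponseKind {
  let sawAudio = false;
  let sawText = false;

  for (const event of events) {
    // Skip events that belong to other responses (or carry no response ID).
    if (event.response_id !== responseId) continue;

    if (
      event.type === "response.audio.delta" ||
      event.type === "response.audio_transcript.delta"
    ) {
      sawAudio = true;
    } else if (event.type === "response.text.delta") {
      sawText = true;
    }
  }

  // Any audio evidence wins; text without audio is the failure case reported here.
  if (sawAudio) return "audio";
  return sawText ? "text-only" : "empty";
}
```

Calling this once a response has finished tells the client whether to run the TTS fallback, and logging the IDs of the "text-only" responses gives concrete examples to attach to this report.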
Impact
- Breaks realtime voice interaction expectations.
- Adds noticeable latency due to manual TTS fallback.
Questions for Support
- Is this behavior expected under any conditions?
- How can I ensure audio is always returned when modalities: ["audio", "text"] and output_audio_format: "pcm16" are set?