gpt-realtime-2 splits acknowledgment + next-step into separate turns, causing
5-20s caller silence
Summary
After upgrading a production voice agent from gpt-realtime (1.5) to
gpt-realtime-2, callers started experiencing 5-20+ seconds of dead
air between consecutive assistant utterances. The model emits an
acknowledgment / intent statement, ends the audio response, commits
the turn, and waits for the caller — even when the natural next
utterance is a follow-up question or a tool call. We rolled back to
gpt-realtime-1.5 with identical prompts, tools, library, and VAD
config, and the pause disappeared immediately, so the model is the
only variable.
Stack
- Model: gpt-realtime-2 (vs gpt-realtime-1.5 baseline)
- openai-python SDK + livekit-agents 1.5.10 + livekit-plugins-openai 1.5.10
- Turn detection: semantic_vad (also tried eagerness=low and
  eagerness=high; no effect on this pattern)
- Session config: default. reasoning.effort is not exposed by our
  SDK / plugin version.
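For readers who want to compare configs, the turn-detection setting above corresponds roughly to the session.update payload sketched below. This is an illustration only, not our production code: the helper name build_turn_detection_update is hypothetical, and the exact nesting of turn_detection inside the session object is an assumption that differs between Realtime API versions, so please check it against the reference for the version you run.

```python
import json

# Eagerness values documented for semantic_vad turn detection.
VALID_EAGERNESS = {"low", "medium", "high", "auto"}


def build_turn_detection_update(eagerness: str = "auto") -> dict:
    """Build a session.update event that selects semantic_vad.

    Hypothetical helper. The position of "turn_detection" inside
    "session" is an assumption and may differ by API version.
    """
    if eagerness not in VALID_EAGERNESS:
        raise ValueError(f"unsupported eagerness: {eagerness!r}")
    return {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "semantic_vad",
                "eagerness": eagerness,
            },
        },
    }


if __name__ == "__main__":
    # Print the payload for the eagerness=low experiment.
    print(json.dumps(build_turn_detection_update("low"), indent=2))
```

Note that nothing in this payload controls how the model splits its own output into responses, which matches what we observed when changing eagerness.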
Repro
Expected behavior
USER: “I’d like to book a consultation.”
AGENT: “Sure — what type of consultation are you interested in?”
Single utterance, single audio response.
Actual behavior on gpt-realtime-2 (caller called out the silence directly):
USER: “I’d like to book a consultation.”
AGENT: “Got it, I can help with that. Let me check availability for you.”
↓ 20+ seconds of dead air ↓
USER: “I have a question about you. You just say you can help with
that, let you check the ability for me, but you just silent
more than 20 seconds. Why?”
AGENT: “Sorry about that pause. It can happen if there’s a brief
delay while I’m getting set to check options. I’m here now.
Which consultation type do you want?”
The agent dispatched no tool between the two assistant utterances.
The “checking” was purely verbal intent; the model then committed the
audio, ended the turn, and waited for user input to continue.
Another variant, where the model does split into two adjacent
assistant audio responses without a tool call:
USER: “I need a facelift consultation.”
AGENT: “Got it — you’re a new patient, so I’ll quickly check which
locations and surgeons are available for facelift consults.”
↓ pause ↓
AGENT: “Do you have a specific doctor you’d like to see?”
The two assistant utterances arrive as two separate audio responses
with a 5+ second pause in between. From the caller’s perspective the
call feels stalled.
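To put numbers on the pauses in both variants, something like the sketch below can be run over a session trace. It assumes a hypothetical log format (a list of Event records with a local receive time t in seconds); only the event type strings are real Realtime server event names. Because response.done marks the end of generation rather than the end of audio playback, the reported gaps are an upper bound on the silence the caller actually hears.

```python
from dataclasses import dataclass


@dataclass
class Event:
    t: float   # local receive time in seconds (hypothetical log field)
    type: str  # Realtime server event type, e.g. "response.done"


def dead_air(events):
    """Return (gap_seconds, ended_by) for each silence after a response.

    A gap starts at response.done and ends at whichever comes first:
    the next response.created ("assistant") or the caller starting to
    speak ("caller").
    """
    gaps = []
    last_done = None
    for ev in sorted(events, key=lambda e: e.t):
        if ev.type == "response.done":
            last_done = ev.t
        elif last_done is not None and ev.type in (
            "response.created",
            "input_audio_buffer.speech_started",
        ):
            ended_by = "assistant" if ev.type == "response.created" else "caller"
            gaps.append((round(ev.t - last_done, 3), ended_by))
            last_done = None  # this gap is closed
    return gaps


if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    trace = [
        Event(0.0, "response.created"),
        Event(4.0, "response.done"),
        Event(9.5, "response.created"),
        Event(12.0, "response.done"),
        Event(33.0, "input_audio_buffer.speech_started"),
    ]
    print(dead_air(trace))
```

Gaps ended by "assistant" correspond to the second variant (two adjacent audio responses); gaps ended by "caller" correspond to the first, where the caller has to break the silence.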
What we tried that did not eliminate the pause
- Upgrade livekit-agents 1.4.1 → 1.5.10 (which contains
  fix(voice): cancel realtime generation when speech is interrupted).
  Library installs cleanly, our LLM eval baseline holds, but live
  audio still shows the pause.
- OPENAI_TURN_EAGERNESS=low (semantic_vad), the cookbook
  recommendation for reasoning models. No effect: eagerness
  governs user-side endpointing, not how the agent schedules its
  own generation cycles.
- Prompt rule forbidding standalone intent statements before
  questions, requiring “ask the question directly OR immediately
  dispatch the tool”. Partial: the agent sometimes fuses the next
  question into the same turn, but still frequently emits the
  intent statement as a separate audio response with a pause.
What did eliminate the pause: rollback
Switching OPENAI_REALTIME_MODEL back to gpt-realtime-1.5 (same
prompts, same tools, same library, same VAD config) makes the
pause disappear immediately. Clean A/B: the only changed
variable is the model.
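A rough sketch of how such an A/B can be sanity-checked is below. The environment variable names are the ones mentioned in this post; the AgentConfig dataclass, the helper names, and the default values are hypothetical and only illustrate the idea of diffing two run configs to confirm that a single field differs.

```python
import os
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AgentConfig:
    model: str
    turn_detection: str
    eagerness: str


def load_config(env=None) -> AgentConfig:
    """Read the agent config from environment variables.

    Defaults here are assumptions for illustration, not our real ones.
    """
    env = os.environ if env is None else env
    return AgentConfig(
        model=env.get("OPENAI_REALTIME_MODEL", "gpt-realtime-1.5"),
        turn_detection="semantic_vad",
        eagerness=env.get("OPENAI_TURN_EAGERNESS", "auto"),
    )


def changed_fields(a: AgentConfig, b: AgentConfig) -> list:
    """Return the sorted names of fields that differ between two configs."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])


if __name__ == "__main__":
    run_a = load_config({"OPENAI_REALTIME_MODEL": "gpt-realtime-2"})
    run_b = load_config({"OPENAI_REALTIME_MODEL": "gpt-realtime-1.5"})
    # For a clean A/B, only the model field should be listed.
    print(changed_fields(run_a, run_b))
```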
Question for OpenAI
Is there a session-level knob or recommended pattern to keep the
reasoning realtime model from committing audio (and ending the turn)
between an acknowledgment / intent thought and the follow-up question
or tool call?
- Is reasoning.effort available on the realtime session config
  somewhere we haven’t found, and would minimal change this
  behavior?
- Is there a way to force the model to either fuse acknowledgment +
  next-action into a single audio response, or skip the
  acknowledgment altogether?
- Is this expected behavior for gpt-realtime-2, and if so, what is
  the recommended UX pattern for a voice agent that needs to call a
  tool immediately after an acknowledgment?
The official preamble guidance (“Use a preamble when you are about
to call a tool that may take noticeable time, and output it
immediately before substantive reasoning or tool use”) addresses
silence DURING a tool call. The pattern we’re hitting is silence
BETWEEN the acknowledgment and the tool call (or follow-up
question), which the preamble guidance doesn’t address.
Happy to provide additional reproduction traces (sanitized) if useful.
Thanks!