gpt-realtime-2 splits acknowledgment + next-step into separate turns, causing 5-20s caller silence (rollback to gpt-realtime-1.5 confirmed as A/B fix)

Summary

After upgrading a production voice agent from gpt-realtime (1.5) to
gpt-realtime-2, callers started experiencing 5-20+ seconds of dead
air between consecutive assistant utterances. The model emits an
acknowledgment / intent statement, ends the audio response, commits
the turn, and waits for the caller — even when the natural next
utterance is a follow-up question or a tool call. We rolled back to
gpt-realtime-1.5 with identical prompts, tools, library, and VAD
config and the pause disappears immediately — so the model is the
only variable.

Stack

  • Model: gpt-realtime-2 (vs gpt-realtime-1.5 baseline)
  • openai-python SDK + livekit-agents 1.5.10 + livekit-plugins-openai 1.5.10
  • Turn detection: semantic_vad (also tried eagerness=low and eagerness=high;
    no effect on this pattern)
  • Session config: default. reasoning.effort is not exposed by our SDK / plugin
    version.
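
For reference, the turn-detection setting above corresponds roughly to a session.update payload like the one sketched below. This assumes the GA Realtime event shape (turn detection under session.audio.input); the helper name build_session_update is hypothetical, and livekit-plugins-openai builds this payload internally, so the exact field placement may differ in your plugin version.

```python
import json

# Eagerness values accepted by semantic_vad in the Realtime docs.
ALLOWED_EAGERNESS = {"low", "medium", "high", "auto"}


def build_session_update(eagerness: str = "low") -> dict:
    """Build a session.update event that selects semantic_vad.

    Assumption: GA event shape, with turn detection nested under
    session.audio.input. Older beta sessions used session.turn_detection.
    """
    if eagerness not in ALLOWED_EAGERNESS:
        raise ValueError(f"unsupported eagerness: {eagerness!r}")
    return {
        "type": "session.update",
        "session": {
            "type": "realtime",
            "audio": {
                "input": {
                    "turn_detection": {
                        "type": "semantic_vad",
                        "eagerness": eagerness,
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_session_update("low"), indent=2))
```

Note that this only changes when the caller's turn is considered finished, so it is not expected to affect the gap between two assistant responses.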

Repro

Expected behavior

USER: “I’d like to book a consultation.”
AGENT: “Sure — what type of consultation are you interested in?”

Single utterance, single audio response.

Actual behavior on gpt-realtime-2 (caller called out the silence directly):

USER: “I’d like to book a consultation.”
AGENT: “Got it, I can help with that. Let me check availability for you.”
↓ 20+ seconds of dead air ↓
USER: “I have a question about you. You just say you can help with
that, let you check the ability for me, but you just silent
more than 20 seconds. Why?”
AGENT: “Sorry about that pause. It can happen if there’s a brief
delay while I’m getting set to check options. I’m here now.
Which consultation type do you want?”

The agent dispatched no tool between the two assistant utterances.
The “checking” was purely verbal intent; the model then committed the
audio, ended the turn, and waited for user input to continue.

Another variant, where the model does split into two adjacent
assistant audio responses without a tool call:

USER: “I need a facelift consultation.”
AGENT: “Got it — you’re a new patient, so I’ll quickly check which
locations and surgeons are available for facelift consults.”
↓ pause ↓
AGENT: “Do you have a specific doctor you’d like to see?”

The two assistant utterances arrive as two separate audio responses
with a 5+ second pause in between. From the caller’s perspective the
call feels stalled.
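
To put numbers on the dead air in both variants, the silence can be measured from the event stream. The sketch below assumes you record a client-side receive timestamp for each server event and that your transport emits output_audio_buffer.started / output_audio_buffer.stopped; the function name and the 5-second threshold are illustrative, not part of any SDK.

```python
from typing import Iterable, List, Tuple

Event = Tuple[float, str]  # (receive time in seconds, server event type)


def dead_air_gaps(events: Iterable[Event], threshold_s: float = 5.0) -> List[Tuple[float, str]]:
    """Return (gap_seconds, ended_by) for each long silence after assistant audio.

    ended_by is "agent" when the assistant resumed on its own (second variant)
    and "caller" when the caller broke the silence (first variant).
    """
    gaps: List[Tuple[float, str]] = []
    stopped_at = None  # time the last assistant audio finished playing
    for ts, etype in events:
        if etype == "output_audio_buffer.stopped":
            stopped_at = ts
        elif stopped_at is not None and etype in (
            "output_audio_buffer.started",
            "input_audio_buffer.speech_started",
        ):
            gap = ts - stopped_at
            if gap >= threshold_s:
                ended_by = "agent" if etype == "output_audio_buffer.started" else "caller"
                gaps.append((gap, ended_by))
            stopped_at = None
    return gaps
```

Caller-ended gaps are only candidates: a caller who takes a while to answer a real question produces the same signature, so those entries need to be checked against the transcript.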

What we tried that did not eliminate the pause

  1. Upgrade livekit-agents 1.4.1 → 1.5.10 (which contains
    fix(voice): cancel realtime generation when speech is interrupted).
    Library installs cleanly, our LLM eval baseline holds, but live
    audio still shows the pause.
  2. OPENAI_TURN_EAGERNESS=low (semantic_vad) — cookbook
    recommendation for reasoning models. No effect — eagerness
    governs user-side endpointing, not how the agent schedules its
    own generation cycles.
  3. Prompt rule forbidding standalone intent statements before
    questions, requiring “ask the question directly OR immediately
    dispatch the tool”. Partial — agent sometimes fuses the next
    question into the same turn, but still frequently emits the
    intent statement as a separate audio response with a pause.

What did eliminate the pause: rollback

Switching OPENAI_REALTIME_MODEL back to gpt-realtime-1.5 (same
prompts, same tools, same library, same VAD config) makes the pause
disappear immediately. Clean A/B: the only changed variable is the
model.

Question for OpenAI

Is there a session-level knob or recommended pattern to keep the
reasoning realtime model from committing audio (and ending the turn)
between an acknowledgment / intent thought and the follow-up question
or tool call?

  • Is reasoning.effort available on the realtime session config
    somewhere we haven’t found, and would minimal change this
    behavior?
  • Is there a way to force the model to either fuse acknowledgment +
    next-action into a single audio response, or skip the
    acknowledgment altogether?
  • Is this expected behavior for gpt-realtime-2, and if so, what is
    the recommended UX pattern for a voice agent that needs to call a
    tool immediately after an acknowledgment?

The official preamble guidance (“Use a preamble when you are about
to call a tool that may take noticeable time, and output it
immediately before substantive reasoning or tool use”) addresses
silence DURING a tool call. The pattern we’re hitting is silence
BETWEEN the acknowledgment and the tool call (or follow-up
question), which the preamble guidance doesn’t address.

Happy to provide additional reproduction traces (sanitized) if useful.

Thanks!

We are experiencing similar but worse issues with the gpt-realtime-2 model. After swapping to the new model, it at times stops mid-sentence. According to all events, this is not caused by barge-in/VAD or a lost connection; the model simply thinks it's done speaking. The result is that both audio and transcript are incomplete. The audio usually doesn’t get as far as the transcript, stopping a few words earlier. It seems to stop abruptly on “output_audio_buffer.stopped”.
This occurs quite often on the new model, and I have found no workaround except rolling back to 1.5, which has no such issue.

Example logs. I log ALL events; “unhandled” simply marks the events I don’t currently handle:
“”"
[voice] unhandled event: conversation.item.added
[voice] unhandled event: conversation.item.done
[voice] unhandled event: response.created
[voice] output_item added
[voice] unhandled event: conversation.item.added
[voice] unhandled event: response.content_part.added
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: output_audio_buffer.started
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] unhandled event: response.output_audio_transcript.delta
[voice] output_audio done:
[voice] output_audio_transcript done
[assistant transcript]: I kan komme i gang med opsætningen ret hurtigt, for eksempel via Google Tag Manager på få minutter. Der findes også
[voice] unhandled event: response.content_part.done
[voice] unhandled event: conversation.item.done
[voice] output_item done
[voice] response done
[voice] unhandled event: output_audio_buffer.stopped
“”"

I have even seen the transcript stop in the middle of a word.

Confirming the exact same behavior as others.

I have just discovered that this seems to be a false-positive content_filter. From what I can find, the new model's filter is highly sensitive and unreliable for languages other than English…
You can detect it via the response.done event:
“”"
response done: namespace(type='response.done', event_id='…', response=namespace(object='realtime.response', id='…', status='incomplete', status_details=namespace(type='incomplete', reason='content_filter'), output=[namespace(id='…', type='message', status='incomplete', role='assistant', content=[namespace(type='output_audio', transcript='Hej, dejligt at fortsætte efter vores korte introduktion. Sig endelig')], phase='final_answer')], conversation_id='…', output_modalities=['audio'], max_output_tokens='inf', audio=namespace(output=namespace(format=namespace(type='audio/pcm', rate=24000), voice='marin')), usage=namespace(total_tokens=1462, input_tokens=1312, output_tokens=150, input_token_details=namespace(text_tokens=1312, audio_tokens=0, image_tokens=0, cached_tokens=0, cached_tokens_details=namespace(text_tokens=0, audio_tokens=0, image_tokens=0)), output_token_details=namespace(text_tokens=75, audio_tokens=75)), metadata=None))
“”"
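
For anyone who wants to flag these automatically, here is a minimal sketch of the check. It assumes the event has already been converted to a plain dict (the dump above uses attribute-style namespace objects, so you may need to adapt the access); the function name truncation_reason is hypothetical.

```python
from typing import Any, Mapping, Optional


def truncation_reason(event: Mapping[str, Any]) -> Optional[str]:
    """Return why a response ended early, or None if it was not cut short.

    Only looks at response.done events whose response.status is "incomplete";
    the reason comes from response.status_details.reason.
    """
    if event.get("type") != "response.done":
        return None
    response = event.get("response") or {}
    if response.get("status") != "incomplete":
        return None
    details = response.get("status_details") or {}
    return details.get("reason") or "unknown"


if __name__ == "__main__":
    sample = {
        "type": "response.done",
        "response": {
            "status": "incomplete",
            "status_details": {"type": "incomplete", "reason": "content_filter"},
        },
    }
    print(truncation_reason(sample))
```

Logging the returned reason next to the partial transcript makes it easy to count how often content_filter is the cause, compared with other reasons for an incomplete response.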