Hey brentlyjr,
I took a look at your use case and was able to find some relevant logs. Here are my suggestions.
Rule out “end-of-utterance interruption” (VAD cutting the tail)
What you’re describing really smells like the assistant getting interrupted right at the end of speech. With the newer gpt-realtime models, semantic VAD + interruption is more aggressive than in the older preview snapshots.
Two things to try (even temporarily, just to prove the cause):
Disable interruption-on-user-speech (or the equivalent setting in your session's turn_detection config) so background noise can't cancel the last word.
Or pause mic streaming while the assistant is speaking. A common pattern is: stop sending input audio as soon as you receive response.output_audio.delta, and resume only after response.output_audio.done.
Also try headphones vs speakers. If headphones fix it, that’s almost definitive proof that mic bleed or background noise is triggering an interruption before the final token is spoken.
Docs: Realtime turn detection & interruption
https://platform.openai.com/docs/realtime/voice#turn-detection
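The mic-pause pattern above can be sketched as a small state machine driven by the server event types. This is a minimal illustration, not a full client: the MicGate class is hypothetical, and in a real app you'd check the gate before each call that sends input audio over the socket.

```python
# Sketch: stop forwarding mic audio while the assistant is speaking.
# The event names match the Realtime API; MicGate itself is a
# hypothetical helper you'd wire into your own send loop.

class MicGate:
    """Tracks whether mic audio should currently be forwarded to the server."""

    def __init__(self):
        self.open = True  # forward mic audio by default

    def handle_event(self, event_type: str):
        if event_type == "response.output_audio.delta":
            # Assistant started (or is still) speaking: mute the mic stream
            # so speaker bleed can't trigger an interruption.
            self.open = False
        elif event_type == "response.output_audio.done":
            # Assistant finished its utterance: resume streaming mic audio.
            self.open = True
```

In your receive loop you'd call `gate.handle_event(event["type"])` on every server event, and only send `input_audio_buffer.append` messages while `gate.open` is true.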
Make sure you’re draining all audio deltas before treating the response as finished
On WebSockets, response.output_audio.done is not audio — it’s just a marker. You must:
Play/buffer every response.output_audio.delta
Only consider the utterance complete once response.output_audio.done arrives
If your playback pipeline assumes the last chunk arrives with the done event, you’ll consistently lose the last word or phrase — especially noticeable with short endings like “Ready?” or “zero.”
Docs: Realtime audio events
https://platform.openai.com/docs/realtime/events#response-output-audio
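To make the "drain every delta" point concrete, here's a minimal sketch of the consumer side. It assumes each `response.output_audio.delta` event carries base64-encoded PCM in a `delta` field (as in the Realtime API event payloads); the event dicts here are hand-built stand-ins for real server messages.

```python
# Sketch: buffer every audio delta, and treat the turn as complete only
# when the done marker arrives. The done event itself carries no audio.
import base64


def collect_audio(events):
    """Return (pcm_bytes, complete) after consuming a stream of event dicts."""
    pcm = bytearray()
    complete = False
    for ev in events:
        if ev["type"] == "response.output_audio.delta":
            # Every chunk must be decoded and buffered/played -- the last
            # word often lives in the final delta, not in the done event.
            pcm.extend(base64.b64decode(ev["delta"]))
        elif ev["type"] == "response.output_audio.done":
            complete = True  # marker only: safe to finalize playback now
    return bytes(pcm), complete
```

If your pipeline instead finalizes playback the moment it sees `done` without having consumed the pending deltas, the tail of the utterance is exactly what gets dropped.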
Consider WebRTC if this is production voice
OpenAI explicitly calls out WebRTC (and SIP) as the recommended path for production voice apps. WebSockets work, but they're more sensitive to timing, buffering, and VAD edge cases, exactly the kinds of issues that can clip the tail of an utterance.
Even though this feels like a model regression, switching to WebRTC often eliminates these “last-word missing” issues entirely because audio capture, playback, and interruption are better synchronized.
Docs: WebRTC vs WebSockets for Realtime
https://platform.openai.com/docs/realtime/webrtc