Realtime API with web sockets skips words

I am using the latest Realtime API with WebSockets and Twilio, and I am consistently seeing it drop a word or phrase at the end of a segment. So if the phrase is “I am going to start. Ready?”, the “Ready?” speech segment never arrives. I have stripped the app down to its bare bones and still see this issue. Has anyone else encountered this problem? I don’t think I can make the app any simpler: it just prompts the model to count down from 10 to 0 with pauses, and it will oftentimes skip the 0.

OPENAI_REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"


A couple of updates on this.

  1. I removed Twilio from the equation and was still able to reproduce this. So essentially I can reproduce the problem with just WebSockets, OpenAI, and the microphone.

  2. When I tested this with an older model, the issue did not occur. This leads me to believe it is a regression in the more recent realtime models.

wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01 (Cannot reproduce with this version)

wss://api.openai.com/v1/realtime?model=gpt-realtime (CAN reproduce with this version)

Curious if anyone else is running into this with the newer model.


Thanks for reporting. No time to test here, but I’ve passed it along to support.

Hope you stick around. We’ve got a great community!


I have tested all the “preview” models and none seem to exhibit this behavior.

gpt-4o-realtime-preview-2025-06-03

gpt-4o-realtime-preview-2024-12-17

gpt-4o-realtime-preview-2024-10-01

The only one that exhibits it is gpt-realtime.

Happy to post my simple test case that I use to reproduce it.

Sorry to keep adding on, but I have done more testing on this and have more data to add. The best test case to reproduce this is to have the agent count backwards from 10 slowly. If it goes slowly enough, it will start to skip numbers while believing it said them: the transcript often shows that it “thinks” it said the number. So the case that exhibits this the most is short, one-word phrases where the model needs to pause. It then never delivers the audio for the short phrase, but “thinks” it said it.

I also WAS able to reproduce it in the older models; it just occurs much more frequently in the most recent model.

And out of curiosity, I decided to see if it was only a WebSocket issue. I built a simple test app using Twilio SIP to OpenAI and was also able to reproduce it with some regularity. So it honestly seems like a pretty widespread issue with short phrases in the Realtime API.
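For anyone who wants to try the countdown repro, it boils down to a single instruction in the session config. A minimal sketch of the `session.update` event to send after connecting (the exact session schema varies between Realtime API versions, so treat the field placement as an assumption and check the current docs):

```python
import json

# Sketch of a session.update event for the countdown repro.
# The instruction forces short, one-word utterances with pauses,
# which is the pattern that most often loses the final audio.
countdown_session = {
    "type": "session.update",
    "session": {
        "instructions": (
            "Count backwards from 10 to 0, slowly. "
            "Say one number at a time and pause briefly between numbers."
        ),
    },
}

payload = json.dumps(countdown_session)
# ws.send(payload)  # send over your open Realtime WebSocket
```

Then listen for the audio deltas and check whether the “0” ever arrives.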

Hey brentlyjr, can you please share the latest request ID of the calls you are making to the Realtime API? Please make sure it is not older than 3 days. Once we have the request ID, we can debug the issue further. Thank you!


I am guessing this is the OpenAI request ID you are looking for, but let me know if that is not the case. I just reproduced this issue at 12:25pm PST on 1/28, so it should be recent enough.

[1769631878.978] Raw JSON message:

{
  "type": "response.created",
  "event_id": "event_D36Na5SifHKoySyZ5WigJ",
  "response": {
    "object": "realtime.response",
    "id": "resp_D36NaexZ790rZQBk4As8q",
    "status": "in_progress",
    "status_details": null,
    "output": [],
    "conversation_id": "conv_D36NUh6QaIxMTvJbgkI2D",
    "output_modalities": [
      "audio"
    ],
    "max_output_tokens": "inf",
    "audio": {
      "output": {
        "format": {
          "type": "audio/pcm",
          "rate": 24000
        },
        "voice": "alloy"
      }
    },
    "usage": null,
    "metadata": null
  }
}


Hey brentlyjr,

I took a look at your use case. I was able to find some logs and below are my suggestions.

Rule out “end-of-utterance interruption” (VAD cutting the tail)

What you’re describing really smells like the assistant getting interrupted right at the end of speech. With the newer gpt-realtime models, semantic VAD + interruption is more aggressive than in the older preview snapshots.

Two things to try (even temporarily, just to prove the cause):

Disable interruption-on-user-speech (or equivalent) so background noise can’t cancel the last word.

Or pause mic streaming while the assistant is speaking. A common pattern is: stop sending input audio as soon as you receive response.output_audio.delta, and resume only after response.output_audio.done.
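The mic-pausing pattern can be implemented as a tiny gate that drops outgoing audio while a response is playing. A sketch (event type names taken from the GA Realtime API; wiring it to your actual mic pipeline is up to you):

```python
class MicGate:
    """Drops outgoing mic audio while the assistant is speaking,
    so mic bleed can't trigger an interruption mid-response."""

    def __init__(self):
        self.assistant_speaking = False

    def on_server_event(self, event: dict) -> None:
        # Gate closes on the first audio delta, reopens on done.
        if event["type"] == "response.output_audio.delta":
            self.assistant_speaking = True
        elif event["type"] == "response.output_audio.done":
            self.assistant_speaking = False

    def should_send_mic_chunk(self) -> bool:
        return not self.assistant_speaking
```

Call `on_server_event` for every message from the server, and check `should_send_mic_chunk()` before appending each mic chunk to the input audio buffer.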

Also try headphones vs speakers. If headphones fix it, that’s almost definitive proof that mic bleed or background noise is triggering an interruption before the final token is spoken.

Docs: Realtime turn detection & interruption

https://platform.openai.com/docs/realtime/voice#turn-detection
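If VAD is the culprit, loosening turn detection in the session config is a quick experiment. A sketch of such an update, with the caveat that the `interrupt_response` flag and its placement are my reading of the turn-detection docs and the session schema differs between API versions, so verify against the current reference before relying on it:

```python
import json

# Hypothetical session.update to stop user speech from cancelling
# the assistant's in-flight audio while testing.
vad_experiment = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "interrupt_response": False,  # don't cut off the tail
        },
    },
}
# ws.send(json.dumps(vad_experiment))
```

If the last word stops disappearing with this setting, the interruption path, not the model's audio generation, is what's clipping it.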

Make sure you’re draining all audio deltas before treating the response as finished

On WebSockets, response.output_audio.done is not audio — it’s just a marker. You must:

Play/buffer every response.output_audio.delta

Only consider the utterance complete once response.output_audio.done arrives

If your playback pipeline assumes the last chunk arrives with the done event, you’ll consistently lose the last word or phrase — especially noticeable with short endings like “Ready?” or “zero.”
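A sketch of that draining pattern, accumulating base64 PCM deltas and only flushing when `done` arrives (playback itself is elided; feed the returned bytes to whatever plays your 24 kHz PCM):

```python
import base64

class ResponseAudioBuffer:
    """Accumulates audio deltas and only treats the response as
    complete when response.output_audio.done arrives."""

    def __init__(self):
        self.chunks: list[bytes] = []

    def on_event(self, event: dict):
        if event["type"] == "response.output_audio.delta":
            # Each delta carries a base64-encoded PCM chunk.
            self.chunks.append(base64.b64decode(event["delta"]))
            return None
        if event["type"] == "response.output_audio.done":
            # done is just a marker: all the audio was in the deltas.
            pcm = b"".join(self.chunks)
            self.chunks = []
            return pcm
        return None
```

The key property is that `done` contributes no audio of its own; if you stop playing at the moment it arrives rather than after the buffered deltas finish, you clip the tail.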

Docs: Realtime audio events

https://platform.openai.com/docs/realtime/events#response-output-audio

Consider WebRTC if this is production voice

OpenAI explicitly calls out WebRTC (and SIP) as the recommended path for production voice apps. WebSockets work, but they’re more sensitive to timing, buffering, and VAD edge cases — exactly the kind that can clip the tail of an utterance.

Even though this feels like a model regression, switching to WebRTC often eliminates these “last-word missing” issues entirely because audio capture, playback, and interruption are better synchronized.

Docs: WebRTC vs WebSockets for Realtime

https://platform.openai.com/docs/realtime/webrtc
