I am using the latest Realtime API with WebSockets and Twilio, and I am consistently seeing it drop a word or phrase at the end of a segment. So if the phrase is “I am going to start. Ready?”, the “Ready” speech segment never arrives. I have simplified the app down to its bare bones and am still seeing this issue. Has anyone else encountered this problem? I don’t think I can make the app any simpler: I prompt it to count down from 10 to 0 with pauses, and it will oftentimes skip the 0.
OPENAI_REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
Couple of updates on this:
- I removed Twilio from the equation and was still able to reproduce this. So essentially I can reproduce the problem with just WebSockets, OpenAI, and the microphone.
- When I tested with an older model, the issue doesn't seem to occur. This leads me to believe it is a regression in the more recent realtime models.
wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01 (Cannot reproduce with this version)
wss://api.openai.com/v1/realtime?model=gpt-realtime (CAN reproduce with this version)
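For anyone who wants to A/B this themselves, the only difference between the two runs is the model query parameter; a minimal sketch (the helper function and constant names are mine):

```python
def realtime_url(model: str) -> str:
    """Build the Realtime API WebSocket URL for a given model."""
    # Everything else about the client stays identical between runs;
    # only this query parameter changes.
    return f"wss://api.openai.com/v1/realtime?model={model}"

OK_MODEL = "gpt-4o-realtime-preview-2024-10-01"  # cannot reproduce
BAD_MODEL = "gpt-realtime"                       # CAN reproduce

print(realtime_url(BAD_MODEL))
```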
Curious if anyone else is running into this with the newer model.
Thanks for reporting. No time to test here, but I’ve passed it along to support.
Hope you stick around. We’ve got a great community!
I have tested all the “preview” models and none seem to exhibit this behavior.
gpt-4o-realtime-preview-2025-06-03
gpt-4o-realtime-preview-2024-12-17
gpt-4o-realtime-preview-2024-10-01
The only one that exhibits it is gpt-realtime.
Happy to post my simple test case that I use to reproduce it.
Sorry to keep adding on, but I have done more testing on this and have more data to add. The best test case to reproduce this is to have the agent count backwards from 10 slowly. If it goes slowly enough, it will start to skip numbers while believing it said them: the transcript often shows that it “thinks” it spoke the number. So the case that exhibits this the most is a short, one-word phrase that follows a pause. The model then produces no audio for the short phrase, but “thinks” it said it.
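For what it's worth, here is a sketch of the two client events I send to trigger it (Python; the instruction wording and the function name are my own, and the event shapes follow the Realtime API's `session.update` / `response.create` client events as I understand them):

```python
import json

def build_repro_events():
    """Return the client events for the countdown repro case."""
    # Configure the session so the model speaks slowly with pauses;
    # the instruction text here is just my repro prompt, not anything
    # prescribed by the API.
    session_update = {
        "type": "session.update",
        "session": {
            "modalities": ["text", "audio"],
            "instructions": (
                "Count down slowly from 10 to 0, pausing about one "
                "second between each number. Say every number out loud."
            ),
        },
    }
    # Ask the model to start responding immediately.
    response_create = {"type": "response.create"}
    return [json.dumps(e) for e in (session_update, response_create)]

# To reproduce: open a WebSocket to
# wss://api.openai.com/v1/realtime?model=gpt-realtime with your
# Authorization header, send these two events, and watch the audio
# deltas per spoken number. The trailing "0" often never produces
# audio even though the final transcript includes it.
```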
I also WAS able to reproduce it in the older models; it just occurs much more frequently in the most recent model.
And out of curiosity, I decided to see whether it was only a WebSocket issue. I built a simple test app using Twilio SIP to OpenAI and was able to reproduce it there as well with some regularity. So it honestly seems like a fairly widespread issue with short phrases in the Realtime API.