[Realtime API] Audio is randomly cutting off at the end

Same happening for us, using a Ruby client we wrote ourselves. Can provide logs/details if it would be helpful.

Thank you! I am facing the same issue. It cuts off the end, making it totally unusable. I just reverted back to 4o mini and Whisper because it actually says the full sentence.

We’re still experiencing this too, and it’s a really big problem - there are a number of real production use cases we’d like to use the Realtime API for, but the mid-sentence cut-off behavior is just not acceptable for production use.

Which is too bad, because this API is incredible next generation tech and not being able to use it for real customers is a shame!

I’d greatly appreciate a fix, and even an indication of how soon a fix might be realistic would be very helpful.

Completely agree. I’ve resorted to using the Realtime API constrained to text only, and then adding a final TTS step. It adds latency of course, but it’s a lot more stable.

If it helps, I use AVM almost daily in the ChatGPT app and it’s always cutting off at the end as well, so I doubt there’s any solution besides waiting.

Interesting, Ashwin, do you mean that you grab the transcript of what the Realtime API is outputting and then run it through TTS on the side? Thanks for the idea.

Any findings so far? I am reading below about people using text-only realtime + TTS to work around this…

Typical conversation for us these days:

“Ey, can you tell me what is the energy consumption right now in the building?”
<We see in the log the LLM doing all the magic in the background with its tools (retrieves real-time information from field devices, uses a Python env to do some basic math, crazy low latency…), and then it answers:> “I have retrieved the consumption of xyz devices and calculated that the consumption of the building right now is …” and then the audio message cuts off.

And we are paying for that, to be clear…

I disable audio output (https://platform.openai.com/docs/api-reference/realtime-client-events/session/update) so it’s just responding with a text completion, not technically a transcript. And then yes, I send that to TTS.

User audio → ai text → ai speech instead of user audio → ai audio, but still faster than the traditional user audio → user text → ai text → ai audio.
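For anyone wanting to try the text-only route, here is a minimal sketch of the `session.update` event that would switch off audio output. It assumes the preview-era session shape where output types are listed under `modalities`; newer API versions may name this field differently, so check the linked reference. The helper name `text_only_session_update` is made up for illustration.

```python
import json


def text_only_session_update() -> str:
    """Build a session.update event asking for text-only responses.

    Assumption: the session object accepts a `modalities` list, as in the
    preview Realtime API. Send the returned string over your existing
    WebSocket connection.
    """
    event = {
        "type": "session.update",
        "session": {"modalities": ["text"]},
    }
    return json.dumps(event)


if __name__ == "__main__":
    print(text_only_session_update())
```

The text that comes back can then be handed to whatever TTS service you already use.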

Oh awesome, thanks very much for the tip. Hopefully this kind of hack won’t be necessary for much longer but appreciate the pointer for how to try getting this thing into production asap…

@ashwinnayak How much delay do you notice with the text-to-speech and speech-to-text approach compared to the original speech-to-speech approach?

Really just depends on the TTS implementation. The added latency = time to first audio chunk. You can also implement streaming text so that you are sending text chunks to TTS as they are received from the realtime API.
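A rough sketch of the streaming idea, in case it helps: buffer the text deltas as they arrive and forward each complete sentence to TTS as soon as it is finished. The function name `sentences_from_deltas` and the naive sentence-splitting rule are my own assumptions, not part of any API; abbreviations like "e.g. " will split early, so tune the regex for your content.

```python
import re
from typing import Iterable, Iterator

# Split after ., ! or ? when followed by whitespace (naive on purpose).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_deltas(deltas: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of partial text chunks."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left when the response ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    chunks = ["The consump", "tion is 42 kW. ", "Anything", " else?"]
    for sentence in sentences_from_deltas(chunks):
        print(sentence)  # each sentence would go to your TTS call here
```

With this, the added latency is roughly the time to the first full sentence plus the TTS time to first audio chunk, rather than the time for the whole reply.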

It is sad though that for such an expensive API we have to work around this and use TTS. Regardless of the latency you might achieve with this approach, it defeats the whole purpose of having GPT-4o itself give its speech the right intonation and speed…

I’m not sure if any of this is relevant to your issue, but here are my observations:

I’ve been experimenting with the Realtime API for a rather unique use case: an Unreal Engine Realtime API plugin for talking 3D avatars. In our initial tests, the Realtime voice frequently cut off. After some investigation, I discovered the issue stemmed from the Voice Activity Detection (VAD) mistaking its own speech for mine. This was likely because the Unreal AudioCapture setup probably lacks active microphone cancellation. I resolved the problem by adjusting the VAD sensitivity to 0.8 and using a narrow microphone. This configuration worked reliably in a point-of-contact setup.
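A minimal sketch of the VAD adjustment, assuming the "sensitivity" mentioned above corresponds to the `threshold` field of server-side VAD in `session.update` (higher values need louder input before speech is detected). The helper name `server_vad_update` is hypothetical.

```python
import json


def server_vad_update(threshold: float = 0.8) -> str:
    """Build a session.update event that raises the server VAD threshold.

    Assumption: `turn_detection` takes `type: "server_vad"` and a
    `threshold` between 0.0 and 1.0.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    event = {
        "type": "session.update",
        "session": {
            "turn_detection": {"type": "server_vad", "threshold": threshold},
        },
    }
    return json.dumps(event)


if __name__ == "__main__":
    print(server_vad_update(0.8))
```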

In our second experiment, we developed 2D talking avatars in browsers. For this, we used the audio backend from OpenAI’s Console open-source demo, which is stable on the same desktop computer - likely thanks to better microphone cancellation in the browser (although I’m not entirely sure). That said, there were still very occasional instances where the Realtime voice cut off, typically when desktop speakers were loud enough to cause interference.

This might be one of the causes, but it’s not the only one, because in our case we use push-to-talk and have disabled VAD, and it still cuts off occasionally. Often it’s the content filter mischaracterizing something and cutting the response off, but there have been occasions where there is no error and the audio simply cuts out towards the end.
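For reference, a sketch of what a push-to-talk setup like this might send, assuming VAD is turned off by setting `turn_detection` to null and the client then commits the audio buffer and requests a response itself when the talk button is released. The function names are made up for illustration.

```python
import json
from typing import List


def disable_vad_event() -> str:
    """session.update that turns off server-side turn detection."""
    return json.dumps(
        {"type": "session.update", "session": {"turn_detection": None}}
    )


def talk_button_released_events() -> List[str]:
    """Events to send, in order, when the user releases the talk button."""
    return [
        json.dumps({"type": "input_audio_buffer.commit"}),  # close the user turn
        json.dumps({"type": "response.create"}),  # ask the model to answer
    ]


if __name__ == "__main__":
    print(disable_vad_event())
    for event in talk_button_released_events():
        print(event)
```

Since cut-offs are reported even in this mode, this only rules out VAD self-interruption as the cause; it is not a fix.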

Negative. It happens using headphones too. And like many others say, it’s always the very last few words of a message. We can easily notice, since in our case those last few words most of the time contain critical information, e.g. a temperature value.

I am convinced it’s a matter of the audio not filling enough of the last streaming packet in some buffer in their pipeline, and then the packet being dropped. That’s why it happens sometimes and not others: it depends on the message size and how it fits into packets.

This might well happen “occasionally” in our case also. Of course the Realtime API is in “Preview”, but it still works well enough that we’re pushing a new project into production in the next couple of weeks, with a “preview” disclaimer on the voice, and our interface is hybrid voice/text chat so the user can always move to text chat if voice fails for some reason.

Agree, robertb. Depending on the use case this might well be production ready. For us it’s not really possible beyond PoC and demonstrations.

We have had many situations where the AI answer is:

“I calculated the average current consumption of xyz equipment devices, and it is …” and the values are lost. Sure, we get the transcription and could show it, but that’s not the idea.

I’ve had this happen to me while using the ChatGPT iOS app, so it seems to affect OpenAI themselves as well. I had mute enabled at the time, so it wasn’t due to noise on my end or anything. The transcript had the full reply but the voice got cut off.