High Cost Due to Silent response.audio.delta Segments in Real-Time API

I’ve noticed an issue with the real-time API’s response.audio.delta events. At times, it returns many response.audio.delta segments containing nothing but silence, which significantly increases the cost.
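If it helps anyone confirm what they’re being billed for, here’s a minimal sketch that decodes a delta and checks whether it’s near-silence. It assumes the deltas are base64-encoded PCM16 (the realtime API’s default output format); SILENCE_RMS is a made-up threshold to tune against your own audio:

```python
import array
import base64
import math

# Hypothetical threshold: PCM16 full scale is 32767, so an RMS
# around 200 is very quiet. Tune to taste.
SILENCE_RMS = 200

def is_silent_delta(event: dict) -> bool:
    """True if a response.audio.delta event decodes to near-silence."""
    pcm = array.array("h")  # 16-bit signed samples (PCM16)
    pcm.frombytes(base64.b64decode(event["delta"]))
    if not pcm:
        return True
    rms = math.sqrt(sum(s * s for s in pcm) / len(pcm))
    return rms < SILENCE_RMS
```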

Has anyone else experienced this issue?

I’ve seen this with chat completions audio too, with silence at the end, but usually when a high temperature skips past a logical stopping point through alternate token selection. Funny noise can also continue the same way.

You have temperature=0.6 as an option. Lowering the temperature from the default 0.8 toward the 0.6 minimum, or using top_p control, could improve the situation.

The lack of lower temperatures is likely enforced to break up any repetitive output that could otherwise continue forever or produce invalid audio, which can happen with top/greedy sampling.
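For reference, temperature on the realtime API is set per session via a session.update event. A rough sketch over a raw WebSocket (connection details simplified; uses the websocket-client package):

```python
import json
import websocket  # pip install websocket-client

ws = websocket.create_connection(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
    header=[
        "Authorization: Bearer YOUR_API_KEY",
        "OpenAI-Beta: realtime=v1",
    ],
)

# Temperature on the realtime API is clamped to [0.6, 1.2];
# 0.6 is the documented floor.
ws.send(json.dumps({
    "type": "session.update",
    "session": {"temperature": 0.6},
}))
```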

One other consideration is that the AI may tend to mirror user audio from the chat history, so that audio should also be cleaned and trimmed (which you can only do with your own voice activity detection).


Faced a similar issue with gpt-4o-audio-preview. Setting the temperature within the range [0.6, 1.2], as mentioned in the API reference, solved the issue.
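For anyone else hitting this: setting the temperature on a gpt-4o-audio-preview chat completion with the official Python SDK looks roughly like this (voice, format, and the message are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    temperature=0.6,  # low end of the [0.6, 1.2] range cited above
    messages=[{"role": "user", "content": "Say hello, briefly."}],
)
```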

I already use temperature: 0.6.
I use the default top_p; what value works best with the realtime API?

Can you elaborate on this? How do I do it technically?
I mean, which API requests should I use, etc.?

You can use a voice detector, such as webRTCVAD, and where it has identified the start and end points for chopping a spoken input out of a buffer, you can tighten up the leading and trailing silence even more with a higher threshold.
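As a rough sketch of that first step, the webrtcvad Python package classifies fixed-size PCM16 frames; the sample rate and aggressiveness below are assumptions to adjust for your audio:

```python
import webrtcvad  # pip install webrtcvad

SAMPLE_RATE = 16000  # webrtcvad accepts 8/16/32/48 kHz
FRAME_MS = 20        # and 10/20/30 ms frames only
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono

vad = webrtcvad.Vad(3)  # aggressiveness 0-3; higher trims more

def speech_map(pcm: bytes) -> list[bool]:
    """One flag per 20 ms frame: does the VAD call it speech?"""
    return [
        vad.is_speech(pcm[i:i + FRAME_BYTES], SAMPLE_RATE)
        for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES)
    ]
```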

The VAD has an element of “learning” to it, adapting to background noise over five seconds or so. So if you have a silence profile of the user’s mic, or can pull out the quietest section of some listening, you can prepend that 5 seconds to the start of the audio, run the VAD to map the input (in 20 ms increments), discard the prefix, and really chop the start and end down to just what is spoken above VAD and volume thresholds.
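And a sketch of that prefix trick, building on speech_map above (it assumes the noise profile is frame-aligned and that the Vad instance adapts across successive calls, as described):

```python
def trim_with_noise_prefix(pcm: bytes, noise_profile: bytes) -> bytes:
    """Prepend ~5 s of the user's background noise so the VAD adapts,
    then trim the real audio to its first/last speech frame."""
    assert len(noise_profile) % FRAME_BYTES == 0  # assumed frame-aligned

    flags = speech_map(noise_profile + pcm)
    flags = flags[len(noise_profile) // FRAME_BYTES:]  # drop the prefix

    speech = [i for i, f in enumerate(flags) if f]
    if not speech:
        return b""  # nothing spoken above the VAD threshold

    start = speech[0] * FRAME_BYTES
    end = (speech[-1] + 1) * FRAME_BYTES
    return pcm[start:end]
```

A volume threshold could be layered on top of the VAD flags to chop the edges even tighter, as mentioned.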

That also reduces your token input expense, besides not giving the model in-context examples of background-noise tokens that it can learn to repeat.

Also avoid instructing the style of voice, to keep the output on a typical path. We can assume that if the voice is instructed to be a giddy and anxious 8-bit robot from Texas, the output will deviate even further from the post-training on a narrator voice, including knowing when to end appropriately.