High Costs Due to Silence or Noisy Segments in gpt-4o-audio-preview Outputs

I wanted to flag an issue we encountered while using the gpt-4o-audio-preview model, which led to unexpected costs and might be worth discussing for those using similar setups.

Here’s the situation:

We were running the audio-preview model with the temperature parameter set to 0, which is our default for text models. This caused the model to emit continuous noise rather than silence or a logical halt when there was no audio content. The behavior wasn’t explicitly mentioned in the documentation, though we later found guidance in the Realtime API section suggesting a temperature range of [0.6, 1.2] for audio. Once we adjusted the temperature to 0.8, the issue was resolved.
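For reference, the call that now works for us looks roughly like this (a minimal sketch assuming the openai Python SDK; the voice and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    temperature=0.8,  # stay inside the [0.6, 1.2] range suggested for audio
    messages=[{"role": "user", "content": "Introduce yourself briefly."}],
)
```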

Unfortunately, during testing we let the connection run for several minutes while the model kept generating output, which quickly racked up tokens and resulted in a much higher bill than anticipated. Part of this was our oversight in not closing the connection sooner, but the behavior itself was hard to anticipate.

Issues

  1. Noise in the Output with Temperature 0: Even when the temperature is zero, the model produces continuous noise, leading to extraneous token generation and high costs.

  2. Silent or Missing Audio Outputs: In some cases, the model does not send the actual audio data (only an audio ID), which can result in a confusing user experience.

Temporary Solutions

  1. Use a Response Timeout: Configure a timeout that automatically cuts off the response after X seconds to prevent unnecessary token generation and billing from prolonged noisy outputs.
  2. Detect and Retry Missing Audio Events: Detect when the model has not sent audio data/events after, say, 2 seconds, and automatically retry. Both guards are sketched below.
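Roughly what we have in mind, as a minimal sketch (assuming the openai Python SDK with streaming; `MAX_SECONDS`, `NO_AUDIO_GRACE`, the voice, and the retry budget are all illustrative):

```python
import base64
import time

from openai import OpenAI

client = OpenAI()

MAX_SECONDS = 30      # hard cutoff for the whole response (illustrative)
NO_AUDIO_GRACE = 2.0  # retry if no audio has arrived within ~2 s (illustrative)
MAX_RETRIES = 2       # illustrative retry budget

def ask_with_guards(messages):
    for _ in range(MAX_RETRIES + 1):
        stream = client.chat.completions.create(
            model="gpt-4o-audio-preview",
            modalities=["text", "audio"],
            audio={"voice": "alloy", "format": "pcm16"},  # pcm16 for streamed audio
            temperature=0.8,
            stream=True,
            messages=messages,
        )
        start = time.monotonic()
        pcm = bytearray()
        try:
            for chunk in stream:
                elapsed = time.monotonic() - start
                if elapsed > MAX_SECONDS:
                    break  # 1. response timeout: stop pulling (and paying for) tokens
                delta = chunk.choices[0].delta if chunk.choices else None
                # Streamed audio rides on the delta as a loosely typed field,
                # so read it defensively.
                audio = getattr(delta, "audio", None)
                data = audio.get("data") if isinstance(audio, dict) else None
                if data:
                    pcm.extend(base64.b64decode(data))
                elif not pcm and elapsed > NO_AUDIO_GRACE:
                    break  # 2. no audio arrived in time: abandon and retry
        finally:
            stream.close()  # closing early should stop further generation/billing
        if pcm:
            return bytes(pcm)
    return None  # every attempt timed out or produced no audio
```

Streaming matters here: with a non-streamed request the server generates (and bills) the full response regardless, whereas closing a stream early should stop further generation.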

Reporting quality: share your experience with BOTH available versioned audio models; the 2024-12-17 snapshot is available and worth checking for improvements.

Better temporary solution: constrain your costs with max_completion_tokens, where 2048 tokens is about 80 seconds of audio.
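For example (a minimal sketch assuming the openai Python SDK; the prompt and voice are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Hard cap on output: ~2048 completion tokens is roughly 80 seconds of audio,
# so a runaway noisy response cannot cost more than that.
response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    max_completion_tokens=2048,
    messages=[{"role": "user", "content": "Answer briefly."}],
)
```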

Parsing: look for “content” alongside “transcript” to ensure you can present some output if the audio modality does not continue. As a backup, this 40x-cheaper text output can be sent to TTS when that happens.
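Something like this, assuming a non-streamed `response` from a request like the one above:

```python
msg = response.choices[0].message

if msg.audio is not None and msg.audio.data:
    audio_b64 = msg.audio.data  # normal path: the audio payload arrived
else:
    # Audio missing: fall back to text. `content` may be populated, or the
    # transcript may be present even when the audio data itself is not.
    fallback_text = msg.content or (msg.audio.transcript if msg.audio else None)
    # fallback_text can then be sent to a TTS model instead.
```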

Audio IDs must be used in assistant chat history to ensure the model keeps talking.
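That is, reference the prior turn by ID rather than replaying the audio bytes (a sketch; `previous` is a hypothetical earlier completion whose assistant turn produced audio):

```python
follow_up = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {"role": "user", "content": "Tell me a joke."},
        # Point at the earlier assistant audio turn by its ID.
        {"role": "assistant", "audio": {"id": previous.choices[0].message.audio.id}},
        {"role": "user", "content": "Another one, please."},
    ],
)
```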


Thanks. We switched to 2024-12-17 but the issue persisted.
We use the audio_id, but the problem isn’t with using the audio_id itself. The issue is twofold:

  1. Sometimes the audio doesn’t come in the response, yet we still get the audio_id at the end.
  2. The output often contains a lot of noise when the temperature parameter is < 0.6.

Have you been able to figure this out? I think I just paid $30 for a very short response after setting my temp=0.1! Does OpenAI reimburse for the noise generated?

We’ve reached out through Customer support chat.
Maybe @stevenh could help move this along more quickly?