High Costs Due to Silence or Noisy Segments in gpt-4o-audio-preview Outputs

I wanted to flag an issue we encountered while using the gpt-4o-audio-preview model, which led to unexpected costs and might be worth discussing for those using similar setups.

Here’s the situation:

We were running the audio-preview model with the temperature parameter set to 0, which is our default for text models. This caused the model to emit continuous noise rather than silence or a logical halt when there was no audio content. The behavior wasn’t explicitly mentioned in the documentation, though we later found guidance in the Realtime API section suggesting a temperature range of [0.6, 1.2] for audio. Once we adjusted the temperature to 0.8, the issue was resolved.
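For reference, the call that now works for us looks roughly like this (a minimal sketch assuming the openai Python SDK; the voice and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    temperature=0.8,  # stay inside the [0.6, 1.2] range suggested for audio
    messages=[{"role": "user", "content": "Introduce yourself briefly."}],
)
```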

Unfortunately, during testing we let the connection run for several minutes while the model kept generating output, which quickly racked up tokens and resulted in a much higher bill than anticipated. Part of this was our oversight in not closing the connection sooner, but the behavior itself was hard to anticipate.

Issues

  1. Noise in the Output with Temperature 0: Even when the temperature is zero, the model produces continuous noise, leading to extraneous token generation and high costs.

  2. Silent or Missing Audio Outputs: In some cases, the model does not send the actual audio data (only an audio ID), which can result in a confusing user experience.

Temporary Solutions

  1. Use a Response Timeout: Configure a timeout that automatically cuts off the response after X seconds to prevent unnecessary token generation and billing from prolonged noisy outputs.
  2. Detect and Retry Missing Audio Events: Detect when the model has not sent audio data/events after, say, 2 seconds, and automatically retry. Both guards are sketched below.
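Roughly what we have in mind, as a minimal sketch (assuming the openai Python SDK with streaming; `MAX_SECONDS`, `NO_AUDIO_GRACE`, the voice, and the retry budget are all illustrative):

```python
import base64
import time

from openai import OpenAI

client = OpenAI()

MAX_SECONDS = 30      # hard cutoff for the whole response (illustrative)
NO_AUDIO_GRACE = 2.0  # retry if no audio has arrived within ~2 s (illustrative)
MAX_RETRIES = 2       # illustrative retry budget

def ask_with_guards(messages):
    for _ in range(MAX_RETRIES + 1):
        stream = client.chat.completions.create(
            model="gpt-4o-audio-preview",
            modalities=["text", "audio"],
            audio={"voice": "alloy", "format": "pcm16"},  # pcm16 for streamed audio
            temperature=0.8,
            stream=True,
            messages=messages,
        )
        start = time.monotonic()
        pcm = bytearray()
        try:
            for chunk in stream:
                elapsed = time.monotonic() - start
                if elapsed > MAX_SECONDS:
                    break  # 1. response timeout: stop pulling (and paying for) tokens
                delta = chunk.choices[0].delta if chunk.choices else None
                # Streamed audio rides on the delta as a loosely typed field,
                # so read it defensively.
                audio = getattr(delta, "audio", None)
                data = audio.get("data") if isinstance(audio, dict) else None
                if data:
                    pcm.extend(base64.b64decode(data))
                elif not pcm and elapsed > NO_AUDIO_GRACE:
                    break  # 2. no audio arrived in time: abandon and retry
        finally:
            stream.close()  # closing early should stop further generation/billing
        if pcm:
            return bytes(pcm)
    return None  # every attempt timed out or produced no audio
```

Streaming matters here: with a non-streamed request the server generates (and bills) the full response regardless, whereas closing a stream early should stop further generation.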

Reporting quality: share your experience with BOTH available versioned audio models; the 2024-12-17 snapshot is available and worth checking for improvements.

Better temporary solution: constrain your costs with max_completion_tokens, where 2048 tokens is about 80 seconds of audio.
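For example (a minimal sketch assuming the openai Python SDK; the prompt and voice are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Hard cap on output: ~2048 completion tokens is roughly 80 seconds of audio,
# so a runaway noisy response cannot cost more than that.
response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    max_completion_tokens=2048,
    messages=[{"role": "user", "content": "Answer briefly."}],
)
```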

Parsing: look for “content” alongside “transcript” to ensure you can present some output if the audio modality does not continue. As a backup, this 40x-cheaper text output can be sent to TTS when that happens.
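Something like this, assuming a non-streamed `response` from a request like the one above:

```python
msg = response.choices[0].message

if msg.audio is not None and msg.audio.data:
    audio_b64 = msg.audio.data  # normal path: the audio payload arrived
else:
    # Audio missing: fall back to text. `content` may be populated, or the
    # transcript may be present even when the audio data itself is not.
    fallback_text = msg.content or (msg.audio.transcript if msg.audio else None)
    # fallback_text can then be sent to a TTS model instead.
```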

Audio IDs must be used in assistant chat history to ensure the model keeps talking.
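That is, reference the prior turn by ID rather than replaying the audio bytes (a sketch; `previous` is a hypothetical earlier completion whose assistant turn produced audio):

```python
follow_up = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {"role": "user", "content": "Tell me a joke."},
        # Point at the earlier assistant audio turn by its ID.
        {"role": "assistant", "audio": {"id": previous.choices[0].message.audio.id}},
        {"role": "user", "content": "Another one, please."},
    ],
)
```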


Thanks. We switched to 2024-12-17 but the issue persisted.
We use the audio_id, but the problem isn’t with using the audio_id itself. The issue is twofold:

  1. Sometimes the audio doesn’t come in the response, yet we still get the audio_id at the end.
  2. The output often contains a lot of noise when the temperature parameter is < 0.6.

Have you been able to figure this out? I think I just paid $30 for a very short response after setting my temp=0.1! Does OpenAI reimburse for the noise generated?

We’ve reached out through Customer support chat.
Maybe @stevenh could help move this along more quickly?