I wanted to flag an issue we encountered while using the gpt-4o-audio-preview model, which led to unexpected costs and might be worth discussing for those using similar setups.
Here’s the situation:
We were running the audio-preview model with temperature set to 0, which is our default for text models. This caused the model to emit continuous noise instead of silence or a clean stop when there was no audio content to produce. The behavior isn't called out in the documentation for this model, though we later found guidance in the Realtime API section recommending a temperature range of [0.6, 1.2] for audio. Once we raised the temperature to 0.8, the issue was resolved.
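Since the culprit was a text-model default of 0 leaking into audio requests, one defensive option is to clamp the temperature into the documented [0.6, 1.2] audio range before issuing a request. A minimal sketch (the helper name is ours, not part of the SDK):

```python
def audio_safe_temperature(requested: float, lo: float = 0.6, hi: float = 1.2) -> float:
    """Clamp a temperature into the [0.6, 1.2] range recommended for audio
    in the Realtime API docs, so a text-model default of 0 can't slip in."""
    return min(max(requested, lo), hi)

audio_safe_temperature(0.0)  # → 0.6, instead of the noise-inducing 0
audio_safe_temperature(0.8)  # → 0.8, passes through unchanged
```

Passing the clamped value as the `temperature` parameter of the audio request keeps shared config defaults from reintroducing the problem.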
Unfortunately, during testing we left the connection open for several minutes while the model kept generating output, which quickly accumulated tokens and resulted in a much higher bill than anticipated. Not closing the connection sooner was partly our oversight, but the unexpected nature of the behavior made it difficult to foresee.
Issues
- Noise in the Output with Temperature 0: even with temperature set to 0, the model produces continuous noise, leading to extraneous token generation and high costs.
- Silent or Missing Audio Outputs: in some cases the model returns only an audio ID without the actual audio data, which makes for a confusing user experience.
Temporary Solutions
- Use a Response Timeout: configure a timeout that automatically cuts the response off after X seconds, preventing runaway token generation and billing from prolonged noisy output.
- Detect and Retry Missing Audio Events: implement a check that detects when no audio data/event has arrived within, say, 2 seconds of the request and automatically retries.
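Both workarounds can be sketched in a few lines. The event stream below is any iterable of `(event_type, payload)` tuples standing in for streamed response chunks; the event names and helper functions are illustrative, not part of the OpenAI SDK:

```python
import time

def consume_with_timeout(events, max_seconds=30):
    """Drain a streamed response, but stop after max_seconds of wall-clock
    time so a runaway noisy response cannot keep accruing tokens."""
    collected = []
    deadline = time.monotonic() + max_seconds
    for event in events:
        if time.monotonic() > deadline:
            break  # cut the response off; the caller should close the connection
        collected.append(event)
    return collected

def first_audio_event(events, wait_seconds=2.0):
    """Return the first payload carrying audio data, or None if nothing
    arrives within wait_seconds -- the caller can then retry the request."""
    deadline = time.monotonic() + wait_seconds
    for event_type, payload in events:
        if event_type == "audio.delta" and payload:
            return payload
        if time.monotonic() > deadline:
            break
    return None
```

A retry loop would then call `first_audio_event` on a fresh stream for each attempt and give up after a fixed number of tries, so a response that carries only an audio ID never blocks the user indefinitely.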