Realtime API is extremely fast

I have been testing the Realtime API over the past few days and noticed that token generation runs much faster than real time, particularly in audio mode. This makes the “response.cancel” event largely redundant: by the time a cancel is issued, all tokens have already been generated and delivered, well before the audio has finished playing back. While the multimodal nature of the Realtime API may justify this behavior in text mode, it creates problems in audio mode, where tokens should ideally be generated at a pace aligned with the actual playback time of the audio.

This fast token generation creates a practical problem: when a user interrupts the model mid-sentence, the remaining tokens have already been delivered and billed, which is particularly concerning given the high cost of this API.
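For context, here is a minimal sketch of the client events involved when a user interrupts playback, based on my reading of the beta docs. The event names (`response.cancel`, `conversation.item.truncate`) come from the API reference; the `item_id` and `audio_end_ms` values are placeholders. Note that `response.cancel` does nothing to recover tokens that were already generated, which is exactly the issue:

```python
import json

def interruption_events(item_id: str, audio_end_ms: int) -> list[str]:
    """Build the client events sent over the Realtime API WebSocket
    when the user interrupts mid-playback. Values are illustrative."""
    # Stop the in-flight response. Because generation outpaces playback,
    # the tokens are usually already delivered (and billed) by now.
    cancel = {"type": "response.cancel"}
    # Truncate the assistant item at the point the user actually heard,
    # so the unplayed audio is dropped from the conversation context.
    truncate = {
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": 0,
        "audio_end_ms": audio_end_ms,
    }
    return [json.dumps(cancel), json.dumps(truncate)]

for event in interruption_events("item_123", 1500):
    print(event)
```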

I attempted to work around this by capping the output tokens and requesting an additional response whenever the limit was reached, but this approach led to undesirable behavior. Furthermore, it is not currently possible to request audio tokens only; the API forces a choice between text+audio and text-only.
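The workaround above can be sketched as two client events: a session update that caps the response size, and a follow-up `response.create` once a response ends early. The event and field names (`session.update`, `max_response_output_tokens`, `modalities`) follow the beta API reference; the specific token cap is illustrative, and whether the continuation resumes cleanly is precisely where I hit the undesirable behavior:

```python
def capped_session_update(max_tokens: int) -> dict:
    """Cap how many tokens each response may generate."""
    return {
        "type": "session.update",
        "session": {
            # The beta only accepts ["text"] or ["text", "audio"];
            # an audio-only modality is not currently allowed.
            "modalities": ["text", "audio"],
            "max_response_output_tokens": max_tokens,
        },
    }

def continue_response() -> dict:
    """Request the next chunk after a response stops at the token cap
    (i.e. it finished with an 'incomplete' status)."""
    return {"type": "response.create"}
```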

I understand this API is still in beta, but I would suggest the following improvements:

  1. Enable the option to set the modality to audio only.
  2. Adjust the generation rate of audio tokens to match the actual audio duration.

Thank you for your consideration.
