Yes, that’s indeed what happens. Assuming the API generates both text and audio in each turn and you send only text, the context sent on each turn typically looks like this:
---- Turn 1 ----
User: input_text1
Assistant: output_text1 + output_audio1
---- Turn 2 ----
User: input_text2 + (input_text1 + output_text1 + output_audio1)
Assistant: output_text2 + output_audio2
And so on.
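To make the cumulative cost concrete, here's a rough back-of-the-envelope sketch in Python. The per-turn token figures are made-up assumptions purely for illustration (they assume audio tokens dominate, as they tend to in practice), not measured numbers:

```python
# Hypothetical per-turn token counts, purely for illustration:
USER_TEXT = 30         # input_text_N
ASSISTANT_TEXT = 60    # output_text_N (the audio transcript)
ASSISTANT_AUDIO = 900  # output_audio_N (assumed to dominate the cost)

def input_tokens_for_turn(n: int) -> int:
    """Tokens the model reads on turn n: the new user text plus the
    accumulated text + audio of every previous turn."""
    history = (n - 1) * (USER_TEXT + ASSISTANT_TEXT + ASSISTANT_AUDIO)
    return USER_TEXT + history

for n in range(1, 6):
    print(f"Turn {n}: ~{input_tokens_for_turn(n)} input tokens")
```

Under these assumptions the input grows by roughly a thousand tokens every turn, which is why the cost compounds so quickly in longer sessions.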
They released prompt caching a couple of days ago to reduce this cumulative cost, which is a welcome update, but it’s still too expensive for my use case.
The only thing that looks a bit odd to me is that output_text1 (which corresponds to the response audio transcript) also seems to be fed back to the model each turn, although this is hard to confirm just by looking at the usage returned in each response.done event. My question is: if the output audio is already being passed in each turn to maintain context, is it necessary to pass the transcript in as well? I hope an OpenAI dev can provide an answer to this.
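For reference, this is roughly how I've been pulling the breakdown out of each response.done event to compare text vs. audio input tokens across turns. The usage field names below are the ones I see in my payloads and may not match yours exactly, so treat this as a sketch rather than a definitive schema:

```python
import json

def log_usage(event_json: str) -> None:
    """Print the token breakdown from a response.done event so the
    text vs. audio split of the input can be tracked turn by turn."""
    event = json.loads(event_json)
    if event.get("type") != "response.done":
        return
    usage = event.get("response", {}).get("usage", {})
    in_details = usage.get("input_token_details", {})  # field names as observed; may differ
    print(
        f"input={usage.get('input_tokens')} "
        f"(text={in_details.get('text_tokens')}, "
        f"audio={in_details.get('audio_tokens')}, "
        f"cached={in_details.get('cached_tokens')}) "
        f"output={usage.get('output_tokens')}"
    )
```

Even with this, the input text tokens lump together system prompt, user text, and any transcripts, so it still doesn't cleanly answer whether the transcript itself is being resent.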