Realtime API, input audio tokens exploding

bojo · August 17, 2025, 10:23pm

I have a use case where I keep the session open for 30+ minutes, and I can see audio tokens adding up very quickly.

My understanding is that the audio buffer keeps building up, even though I force a commit every 10-20 s, apparently because VAD is off.

My questions are:

Is my understanding correct that unless I send input_audio_buffer.clear, the buffer keeps growing, even though I did a commit and a response.create?
If I send input_audio_buffer.clear, would I lose previous context? Or, as long as it's sent after a commit, would the conversation items still have the needed context?

_j · August 18, 2025, 9:43am

It’s important to understand that underneath the “realtime” facade, the AI model in use is still turn-based. A server-side conversation is maintained.

The audio buffer holds the live audio that you are streaming to the endpoint. There are two operational modes:

with server voice activity detection: a new “create a response” is triggered whenever there is a sufficient gap in speech
by sending a “create” trigger yourself and not using the server detection of turns.

The input is a new “message” for the AI, and then it creates an output back to you. Both that input and output are added to a conversation history, extending the context that is sent to the model each time.

That chat grows for the entire session. The only relief is an awkward per-message deletion feature, where you’d have to list previous messages to even get an idea of what can be deleted. There are also odd patterns where a “commit” turns the buffered audio into a message without triggering an immediate response.
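
A minimal sketch of what that per-message deletion could look like from the client side, assuming you have recorded the item IDs yourself from server events as the conversation progressed. The helper name prune_events and the keep_last parameter are hypothetical; only the conversation.item.delete event type and its item_id field come from the API reference. Sending the events over the connection is left out.

```python
import json

def prune_events(item_ids, keep_last):
    """Build delete events for every recorded item except the newest keep_last.

    item_ids is assumed to be ordered oldest to newest (hypothetical
    client-side bookkeeping, not something the API hands you in one call).
    """
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    cutoff = max(len(item_ids) - keep_last, 0)
    stale = item_ids[:cutoff]
    return [
        json.dumps({"type": "conversation.item.delete", "item_id": item_id})
        for item_id in stale
    ]

# Example with made-up IDs: keep the two most recent items.
for event in prune_events(["item_a", "item_b", "item_c", "item_d"], keep_last=2):
    print(event)
```

Deleting old items trades context for cost: the model no longer sees whatever was removed.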

Thus the input tokens sent to the model grow and grow, each turn having more input than the last.
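
To make the growth concrete, here is a rough arithmetic sketch. It assumes every turn resends the whole history (all earlier inputs and outputs) plus the new input, and it uses made-up token counts; real usage will differ, and caching or truncation is ignored.

```python
def billed_input_per_turn(turns):
    """turns: list of (new_input_tokens, output_tokens), one pair per turn.

    Returns the input tokens sent to the model on each turn, assuming the
    full conversation history is resent every time.
    """
    history = 0
    billed = []
    for new_input, output in turns:
        billed.append(history + new_input)
        history += new_input + output
    return billed

# Four identical turns with hypothetical sizes: 100 tokens in, 50 tokens out.
per_turn = billed_input_per_turn([(100, 50)] * 4)
print(per_turn)       # [100, 250, 400, 550]
print(sum(per_turn))  # 1300
```

Only 400 tokens of new audio were spoken, but 1300 input tokens were sent in total, and the gap widens with every turn of a long session.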

Clearing the audio buffer, which you asked about, only affects the received audio that has not yet been used for a turn, and it is only practical if you are manually triggering the “create response”. A UI could have an interface like “send” or “cancel”, where cancel wipes the pending speech instead of making a response.
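
The “send” / “cancel” idea can be sketched as the client events each button would emit. This only builds the JSON payloads; the transport (for example a WebSocket send) is omitted, and the function names are hypothetical. The event types are the ones discussed in this thread.

```python
import base64
import json

def append_event(pcm_chunk: bytes) -> str:
    """Stream one chunk of audio into the input buffer (base64-encoded)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

def send_turn_events() -> list[str]:
    """'Send' button: commit the buffered audio, then ask for a response."""
    return [
        json.dumps({"type": "input_audio_buffer.commit"}),
        json.dumps({"type": "response.create"}),
    ]

def cancel_turn_events() -> list[str]:
    """'Cancel' button: discard buffered audio that was never committed."""
    return [json.dumps({"type": "input_audio_buffer.clear"})]

# Example: a few bytes of placeholder audio, then the user presses "send".
print(append_event(b"\x00\x01" * 4))
for event in send_turn_events():
    print(event)
```

Note that cancel only drops uncommitted audio; anything already committed lives in the conversation history and is unaffected by the clear.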

bojo · August 18, 2025, 12:44pm

Thank you. I disabled server VAD and do a manual response.create.

In this case I need to manually clear the input audio buffer to avoid the accumulated audio, right?

Or is it automatically cleared at commit/response.create?

_j · August 18, 2025, 1:21pm

The buffer is cleared when it is committed, or when it is tokenized and used for a response.

You can go through all the client events in the API reference; that is the only way to really get a picture of how to interact with realtime.

bojo · August 18, 2025, 2:23pm

I tried, but it confused me, because it mentions that input_audio_buffer.clear is not needed with server VAD but can be used if server VAD is disabled. There is no mention that it's cleared anyway on commit (which makes sense).

But now I do not understand: why is the number of audio tokens increasing so much?
