I have a use case where I keep the session open for 30+ minutes, and I can see audio tokens adding up very quickly.
My understanding is that the audio buffer keeps building up, even though I force a commit every 10-20 seconds, apparently because VAD is off.
My questions are:
Is my understanding correct that, unless I send input_audio_buffer.clear, the buffer keeps growing, even though I did a commit and a response.create?
If I send input_audio_buffer.clear, would I lose previous context? Or, as long as it comes after a commit, would the conversation items still hold the needed context?
It’s important to understand that underneath the “realtime” facade, the AI model in use is still turn-based. A server-side conversation is maintained.
The audio buffer contains the live audio that you are streaming to the endpoint. There are two operational modes:
with server voice activity detection (VAD): a new “create a response” is triggered automatically whenever there is a sufficient gap in speech
without server detection of turns: you send the “create” trigger yourself.
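To make the two modes concrete, here is a minimal sketch that builds the corresponding session.update payloads as plain Python dicts. It assumes the session shape in which turn_detection sits directly under session; newer API versions may nest it differently, so check the current reference. The helper names are made up for illustration.

```python
import json


def session_update_server_vad() -> dict:
    """Mode 1: the server detects gaps in speech and triggers responses itself."""
    return {
        "type": "session.update",
        # Assumed field placement; verify against the API version you use.
        "session": {"turn_detection": {"type": "server_vad"}},
    }


def session_update_manual() -> dict:
    """Mode 2: turn detection is off; the client commits audio and asks for responses."""
    return {
        "type": "session.update",
        "session": {"turn_detection": None},  # serialized as JSON null
    }


if __name__ == "__main__":
    print(json.dumps(session_update_server_vad()))
    print(json.dumps(session_update_manual()))
```

In the manual mode nothing happens until you send a commit and a response.create yourself.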
The input is a new “message” for the AI, and then it creates an output back to you. Both that input and output are added to a conversation history, extending the context that is sent to the model each time.
That conversation grows for the entire session. The only way to trim it is an awkward per-message deletion feature, where you would have to list previous messages to even get an idea of what can be deleted. There is also the odd pattern of using “commit” to turn buffered audio into a conversation message without triggering an immediate response.
Thus, the input tokens sent to the model grow and grow, with each turn carrying more input than the last.
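A rough back-of-the-envelope sketch of that growth, with made-up token counts (100 audio tokens per user turn, 200 per reply). It assumes the whole history is resent as input on every turn and ignores caching and any truncation the server may apply. The delete_item_event helper builds the conversation.item.delete event used for per-message deletion; the item ID shown in the usage note is hypothetical.

```python
def input_tokens_per_turn(user_tokens: int, reply_tokens: int, turns: int) -> list[int]:
    """Input tokens sent to the model on each turn when history accumulates."""
    history = 0
    billed = []
    for _ in range(turns):
        # Each turn's input is all prior context plus the new user audio.
        billed.append(history + user_tokens)
        # Both the user input and the model's reply join the history.
        history += user_tokens + reply_tokens
    return billed


def delete_item_event(item_id: str) -> dict:
    """Client event that removes one item from the server-side conversation."""
    return {"type": "conversation.item.delete", "item_id": item_id}


if __name__ == "__main__":
    per_turn = input_tokens_per_turn(user_tokens=100, reply_tokens=200, turns=4)
    print(per_turn)       # [100, 400, 700, 1000]
    print(sum(per_turn))  # 2200
```

Per-turn input grows linearly, so the session total grows roughly quadratically with the number of turns. Sending something like delete_item_event("item_abc123") for old items is one way to cap it, at the cost of the model forgetting those turns.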
Clearing the audio buffer affects only the received audio that has not yet been committed to a turn, and it is only practical if you are manually triggering the “create response”. A UI could offer “send” and “cancel” buttons, where cancel wipes the pending speech instead of making a response.
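A minimal sketch of that send/cancel idea, building only the JSON events you would write to the WebSocket (no network code here). The event type names are the Realtime API client events as I understand them; the function names are hypothetical.

```python
import base64


def append_event(pcm_chunk: bytes) -> dict:
    """Stream one chunk of raw audio into the pending input buffer."""
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    }


def send_events() -> list[dict]:
    """'Send' button: commit the pending audio as a user turn, then ask for a reply."""
    return [
        {"type": "input_audio_buffer.commit"},
        {"type": "response.create"},
    ]


def cancel_events() -> list[dict]:
    """'Cancel' button: discard the pending audio; committed turns are untouched."""
    return [{"type": "input_audio_buffer.clear"}]


if __name__ == "__main__":
    print(append_event(b"\x00\x01\x02\x03"))
    print(send_events())
    print(cancel_events())
```

Since clear only touches uncommitted audio, it does nothing about the conversation history that drives the token count.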
I tried, but it confused me, because it mentions that input_audio_buffer.clear is not needed with server VAD but can be used when server VAD is set to none. There is no mention that the buffer is cleared anyway on commit (which makes sense).
But now I do not understand: why is the number of audio tokens increasing so much?