I’ve been experimenting with the OpenAI Realtime API and noticed something about the billing that I don’t fully understand.
It looks like each new request from the client becomes more expensive than the previous one, even if the new input is very short (for example, just a single word or a short audio clip).
From what I can tell, this happens because the tokens (both text and audio) accumulate in the conversation buffer. In other words, every request to the model includes not only the latest input, but also all the previous messages and responses, which means the input token count grows over time.
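To illustrate what I mean, here is a rough back-of-the-envelope sketch of how the billed input grows even when each new utterance is tiny. The per-turn token counts are made-up numbers for illustration, not measurements:

```python
# Rough sketch: input tokens billed per turn when the full history is resent.
# The per-turn token counts below are illustrative assumptions, not measurements.
history = 0
turns = [(500, 800), (40, 700), (30, 900)]  # (user input, model reply); short follow-ups after turn 1
for turn, (user_tokens, reply_tokens) in enumerate(turns, start=1):
    billed_input = history + user_tokens      # whole history + the new input
    print(f"turn {turn}: billed input tokens = {billed_input}")
    history = billed_input + reply_tokens     # the reply joins the context too
```

Even though turns 2 and 3 add only ~30 to 40 new tokens each, the billed input jumps from 500 to 1340 to 2070.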
So my questions are:
1. Is my understanding correct that the Realtime API always includes the full accumulated context in each new request?
2. Does this mean that every new request will naturally cost more than the previous one, unless I manually trim or reset the conversation history?
3. What are the recommended practices to avoid paying for unnecessary tokens (e.g., clearing old context, caching, or limiting history size)?
Yes. The context window sent to the underlying model for each generation (audio or text) continues to grow, and with the Realtime API it is managed server-side by OpenAI.
The gpt-realtime model has a maximum context window of 16k tokens, most of which is used for input. It is only after this limit is exceeded that OpenAI starts discarding old audio; the window size is not something you can set yourself. At maximum size, once old conversation is being discarded, each turn costs around $0.75.
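As a rough sanity check on that per-turn figure, here is a sketch of the worst-case cost of one turn at a full window. The per-token rates and the output size are placeholder assumptions; substitute the current numbers from OpenAI's pricing page:

```python
# Worst-case cost of one turn at a full context window.
# The RATE values and reply size are placeholder assumptions; check current pricing.
AUDIO_INPUT_RATE = 40.00 / 1_000_000   # assumed $ per uncached audio input token
AUDIO_OUTPUT_RATE = 80.00 / 1_000_000  # assumed $ per audio output token

context_window = 16_000                          # max window per the answer above
output_tokens = 1_500                            # assumed size of one spoken reply
input_tokens = context_window - output_tokens    # the rest is accumulated input

cost = input_tokens * AUDIO_INPUT_RATE + output_tokens * AUDIO_OUTPUT_RATE
print(f"~${cost:.2f} per turn at a full window")  # ~$0.70 with these assumed rates
```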
The realtime audio endpoint has a cache of previously encoded input that is usually hit, giving a significant discount as long as the growing input keeps the same starting point. The AI's most recent spoken audio and the new user input, however, are not yet part of that cache and are billed at the full price.
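You can watch the cache working by inspecting the usage block that arrives with each `response.done` server event. A small sketch; the field names follow the usage payload as I've seen it, so verify them against the current API reference:

```python
# Sketch: log cached vs. uncached input tokens from a response.done event.
# Field names follow the Realtime API usage payload; verify against the docs.
def log_usage(event: dict) -> None:
    usage = event["response"]["usage"]
    details = usage.get("input_token_details", {})
    cached = details.get("cached_tokens", 0)
    total_input = usage.get("input_tokens", 0)
    print(f"input: {total_input} tokens, of which {cached} hit the cache")
    print(f"output: {usage.get('output_tokens', 0)} tokens (never cached)")
```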
There is a truncation option in the API that sets how much of the conversation is discarded as you approach the maximum input expense. Otherwise, you must limit the number of turns or the length of the audio session to manage costs.
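The API also exposes client events for manual trimming: `conversation.item.delete` removes an item from the server-side conversation, and `conversation.item.truncate` shortens an item's audio. A minimal sketch of pruning the oldest items, assuming `ws` is an open Realtime WebSocket and you have been collecting item IDs from `conversation.item.created` events:

```python
import json

# Sketch: drop the oldest conversation items to cap context growth.
# Assumes `ws` is an open Realtime API WebSocket connection and `item_ids`
# is a list of IDs collected from conversation.item.created events.
def prune_history(ws, item_ids: list[str], keep_last: int = 10) -> None:
    for item_id in item_ids[:-keep_last]:   # everything but the newest N items
        ws.send(json.dumps({
            "type": "conversation.item.delete",
            "item_id": item_id,
        }))
    del item_ids[:-keep_last]               # keep the local list in sync
```

Deleting old items trades context (the model forgets what was pruned) for a smaller, cheaper input on every subsequent turn.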
A realtime session inherently carries this turn-based memory context; it is what lets the model answer the latest question in light of everything said before.