I’ve been experimenting with the OpenAI Realtime API and noticed something about the billing that I don’t fully understand.
It looks like each new request from the client becomes more expensive than the previous one, even if the new input is very short (for example, just a single word or a short audio clip).
From what I can tell, this happens because the tokens (both text and audio) accumulate in the conversation buffer. In other words, every request to the model includes not only the latest input, but also all the previous messages and responses, which means the input token count grows over time.
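To illustrate what I mean, here is a rough back-of-the-envelope sketch of how the billed input grows even when each new utterance is tiny. The per-turn token counts are made-up numbers for illustration, not measurements:

```python
# Rough sketch: input tokens billed per turn when the full history is resent.
# The per-turn token counts below are illustrative assumptions, not measurements.
history = 0
turns = [(500, 800), (40, 700), (30, 900)]  # (user input, model reply); short follow-ups after turn 1
for turn, (user_tokens, reply_tokens) in enumerate(turns, start=1):
    billed_input = history + user_tokens      # whole history + the new input
    print(f"turn {turn}: billed input tokens = {billed_input}")
    history = billed_input + reply_tokens     # the reply joins the context too
```

Even though turns 2 and 3 add only ~30 to 40 new tokens each, the billed input jumps from 500 to 1340 to 2070.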
So my questions are:
1. Is my understanding correct that the Realtime API always includes the full accumulated context in each new request?
2. Does this mean that every new request will naturally cost more than the previous one, unless I manually trim or reset the conversation history?
3. What are the recommended practices to avoid paying for unnecessary tokens (e.g., clearing old context, caching, or limiting history size)?
Yes. The context window sent to the underlying model for each generation (audio or text) continues to grow, and with the Realtime API it is managed server-side by OpenAI.
The gpt-realtime model has a maximum context window of 16k tokens, most of which is used for input. It is only after this limit is exceeded that OpenAI starts discarding old audio; the window size is not something you can set yourself. At maximum size, once old conversation is being discarded, each turn costs around $0.75.
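As a rough sanity check on that per-turn figure, here is a sketch of the worst-case cost of one turn at a full window. The per-token rates and the output size are placeholder assumptions; substitute the current numbers from OpenAI's pricing page:

```python
# Worst-case cost of one turn at a full context window.
# The RATE values and reply size are placeholder assumptions; check current pricing.
AUDIO_INPUT_RATE = 40.00 / 1_000_000   # assumed $ per uncached audio input token
AUDIO_OUTPUT_RATE = 80.00 / 1_000_000  # assumed $ per audio output token

context_window = 16_000                          # max window per the answer above
output_tokens = 1_500                            # assumed size of one spoken reply
input_tokens = context_window - output_tokens    # the rest is accumulated input

cost = input_tokens * AUDIO_INPUT_RATE + output_tokens * AUDIO_OUTPUT_RATE
print(f"~${cost:.2f} per turn at a full window")  # ~$0.70 with these assumed rates
```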
The realtime audio endpoint has a cache of previously encoded input that is usually hit, giving a significant discount as long as the growing input keeps the same starting point. The AI's most recent spoken audio and the new user input, however, are not yet part of that cache and are billed at the full price.
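You can watch the cache working by inspecting the usage block that arrives with each `response.done` server event. A small sketch; the field names follow the usage payload as I've seen it, so verify them against the current API reference:

```python
# Sketch: log cached vs. uncached input tokens from a response.done event.
# Field names follow the Realtime API usage payload; verify against the docs.
def log_usage(event: dict) -> None:
    usage = event["response"]["usage"]
    details = usage.get("input_token_details", {})
    cached = details.get("cached_tokens", 0)
    total_input = usage.get("input_tokens", 0)
    print(f"input: {total_input} tokens, of which {cached} hit the cache")
    print(f"output: {usage.get('output_tokens', 0)} tokens (never cached)")
```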
There is a truncation option in the API that sets how much of the conversation is discarded as you approach the maximum input expense. Otherwise, you must limit the number of turns or the length of the audio session to manage costs.
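The API also exposes client events for manual trimming: `conversation.item.delete` removes an item from the server-side conversation, and `conversation.item.truncate` shortens an item's audio. A minimal sketch of pruning the oldest items, assuming `ws` is an open Realtime WebSocket and you have been collecting item IDs from `conversation.item.created` events:

```python
import json

# Sketch: drop the oldest conversation items to cap context growth.
# Assumes `ws` is an open Realtime API WebSocket connection and `item_ids`
# is a list of IDs collected from conversation.item.created events.
def prune_history(ws, item_ids: list[str], keep_last: int = 10) -> None:
    for item_id in item_ids[:-keep_last]:   # everything but the newest N items
        ws.send(json.dumps({
            "type": "conversation.item.delete",
            "item_id": item_id,
        }))
    del item_ids[:-keep_last]               # keep the local list in sync
```

Deleting old items trades context (the model forgets what was pruned) for a smaller, cheaper input on every subsequent turn.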
A realtime session inherently carries this turn-based memory context; it is what lets the model answer the latest question in light of everything said before.