What does "auto" truncation in realtime api actually do?

There is no mention of what the “auto” truncation strategy actually does with respect to the Realtime API. Does it summarise the whole existing thread? Does it remove messages from the beginning or the middle? It would be great to understand the underlying strategy to better adapt it for our use case. For context: our users have really long conversations with the Realtime API (30+ minutes).


The retention-ratio method is documented as “Fraction of post-instruction conversation tokens to retain.”

“Post-instruction” is your clue. So “auto”, described elsewhere as in-the-middle truncation, removes the oldest conversation tokens that cannot fit while preserving the initial system message, which is re-run each turn.

Note that the documentation counts tokens, not turns, as its unit of consideration, so assume:

Suppose you use a model with a 16k-token context window, and suppose that is a true context window, not just the input portion, so part of it is reserved for the output.

Say the reservation for output production is the maximum of 4k tokens. Then, optimistically, the discard threshold is 12k tokens of input that can be run. You also minimize the system message so the most space is left for memory. Instead of an error, you opt for “auto”: token-based, turn-by-turn discarding at that maximum, without any clearly documented “per message” delineation.
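The behavior described above can be sketched as a small simulation. This is a hypothetical reconstruction, not the documented implementation: the 12k budget and per-item token counts are illustrative, and the assumption is that whole items are dropped from the oldest end while the system message always survives.

```python
# Hypothetical sketch of "auto"-style truncation: preserve the system
# message, then keep only the newest conversation items that still fit
# the input budget. All token numbers below are illustrative.

def truncate_auto(system_tokens, items, input_budget=12_000):
    """items: list of (item_id, token_count), oldest first."""
    budget = input_budget - system_tokens
    kept = []
    total = 0
    # walk from newest to oldest, keeping whatever still fits
    for item_id, tokens in reversed(items):
        if total + tokens > budget:
            break
        kept.append((item_id, tokens))
        total += tokens
    kept.reverse()  # restore chronological order
    return kept

items = [("msg1", 4000), ("msg2", 4000), ("msg3", 4000), ("msg4", 3000)]
print(truncate_auto(500, items))  # msg1 no longer fits the 11.5k remainder
```

Note how the oldest item silently drops out once the budget is exceeded, which is exactly what shifts the start of the context and breaks the cache prefix each turn.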

On the cheapest gpt-realtime at $32/M input (versus the 4o versions, where the oldest snapshot can still run you $100/M), you then have:

$0.75 for every response triggered and $0.75 for every accidental interruption, plus the billed output, with every turn breaking the highly discounted cache.

The alternative truncation choice, with an extremely low retention_ratio setting, is almost mandatory over “auto”. With caching being an 80x discount, retaining anything more than 1/80 of the conversation is a huge bump back up in price on the discarding turn. So terminating the session once it runs up to the maximum makes economic sense, but that happens pretty quickly. The memory is toast anyway, and unlike running on Chat Completions with a large context, you can’t substitute in a text transcription: you either restart, or continue to grow the context to keep the cache discount. That means restarting a realtime session to place text and know what is happening, which can’t be done when it is a remote client with full control of your API key over WebRTC.
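In practice, selecting the retention-ratio strategy is a session-level setting. Below is a sketch of the session.update payload; the field names follow the realtime session documentation as I understand it, and the 1/80 ratio matches the cache-discount reasoning above — verify the exact shape against the current API reference before relying on it.

```python
# Sketch of a session.update event choosing retention_ratio truncation
# instead of "auto". Field names assumed from the realtime session docs;
# the ratio value is an illustration of the "all but 1/80" idea above.
import json

session_update = {
    "type": "session.update",
    "session": {
        "truncation": {
            "type": "retention_ratio",
            # keep only ~1/80 of post-instruction tokens when truncation
            # fires, so follow-up turns rebuild on a stable cached prefix
            "retention_ratio": 0.0125,
        }
    },
}
payload = json.dumps(session_update)  # send this string over the websocket
```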


I only understood half of this… Is your advice to use a very low retention_ratio, or the opposite? My understanding:

  1. retention_ratio is used to trim the input context, so a low ratio, e.g. 0.2, results in 80% of the conversation’s tokens being dropped
  2. the earliest input tokens that fall into that percentage are dropped, regardless of whether they belong to assistant or user messages (right??)
  3. “aggressively” dropping tokens with a low retention ratio allows the cache to build up again without busting at every turn, therefore passing on token-usage savings

In my own tests though, I do not see the input_tokens decreasing, even after the token_limit is reached:
2025-11-27 16:25:57,967 - WARNING - Input token limit exceeded: 53027 > 40000
..
2025-11-27 16:25:58,279 - ERROR - Response failed: rate_limit_exceeded, message=Rate limit reached for gpt-4o-realtime.. on tokens per min (TPM): Limit 40000, Used 37721, Requested 5527. Please try again in 4.872s. Visit https://platform.openai.com/account/rate-limits to learn more.

Isn’t the truncation setting supposed to prevent this from occurring BEFORE the limit is breached?

Here’s the reasoning behind the realtime API truncation strategy (and something OpenAI doesn’t offer on the Responses API, even though it is even more applicable there):

For conversation history management:

  • If, at every new input, old messages exceeding the context window were trimmed automatically so the context window stayed maximally “loaded”, removing no more than the overflow, it would be essentially impossible to get a cache discount, because the start of the context window would differ on each turn.
  • Instead, by expiring a large chunk of the conversation history at once, follow-up turns can grow on that same truncated input without any further discarding, and so receive the cache discount again for a while.
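The two bullets above can be contrasted with a toy simulation. The cache model here is a deliberate simplification (a turn is discounted only if its context starts with exactly the previous turn's context), and the window, per-item size, and 20% retention are illustrative assumptions.

```python
# Toy comparison of per-turn trimming vs chunked expiry (all numbers
# illustrative). Caching modeled as: a turn gets a discount only if its
# context begins with exactly the previous turn's context.

def simulate(policy, window=12_000, per_turn=1_000, turns=30):
    context, prev, cache_hits = [], None, 0
    for t in range(turns):
        context = context + [t]  # append this turn's (1k-token) item
        if len(context) * per_turn > window:
            if policy == "per_turn":
                context = context[1:]                   # trim just the overflow
            else:                                       # "chunked" expiry
                context = context[-(len(context) // 5):]  # keep ~20%
        if prev is not None and context[: len(prev)] == prev:
            cache_hits += 1  # prefix unchanged -> discounted turn
        prev = context
    return cache_hits

print("per-turn trimming:", simulate("per_turn"))  # 11: no hits once full
print("chunked expiry:  ", simulate("chunked"))    # 27: hits resume after each expiry
```

Once the window fills, per-turn trimming never gets another discounted turn, while chunked expiry only pays full price on the turn that discards.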

The cost of realtime audio models is extreme and the discount is high, so you would want to take advantage of this mechanism, except where maximum “memory” is something you are willing to pay for. The context window of the realtime audio models is pretty small, especially given that audio uses tokens at about 5x the rate of comparable text.

Okay, this I understand. Thanks for clarifying. Are you, or is anyone else, hitting the rate limit very fast? I see rate-limit errors within the first 3-4 conversation turns, with relatively few input audio tokens.

Here is an example of the token usage across a conversation:

"token_details": {
  "event_CgpZgtKkM4esQT5ehpO9y": {
    "input_tokens": {
      "text_tokens": 5088,
      "audio_tokens": 22,
      "image_tokens": 0,
      "cached_tokens": 0,
      "cached_tokens_details": {
        "text_tokens": 0,
        "audio_tokens": 0,
        "image_tokens": 0
      }
    },
    "output_tokens": {
      "text_tokens": 57,
      "audio_tokens": 0
    }
  },
  "event_CgpZhTcFFqZ6XFaVD7Pup": {
    "input_tokens": {
      "text_tokens": 5156,
      "audio_tokens": 22,
      "image_tokens": 0,
      "cached_tokens": 5120,
      "cached_tokens_details": {
        "text_tokens": 5120,
        "audio_tokens": 0,
        "image_tokens": 0
      }
    },
    "output_tokens": {
      "text_tokens": 8,
      "audio_tokens": 16
    }
  },
  "event_CgpZmnUo4Ht6lTvX2505F": {
    "input_tokens": {
      "text_tokens": 5174,
      "audio_tokens": 64,
      "image_tokens": 0,
      "cached_tokens": 5184,
      "cached_tokens_details": {
        "text_tokens": 5120,
        "audio_tokens": 64,
        "image_tokens": 0
      }
    },
    "output_tokens": {
      "text_tokens": 47,
      "audio_tokens": 0
    }
  },
  "event_CgpZnnmOzY8F5N0L8jB0h": {
    "input_tokens": {
      "text_tokens": 5230,
      "audio_tokens": 48,
      "image_tokens": 0,
      "cached_tokens": 4864,
      "cached_tokens_details": {
        "text_tokens": 4864,
        "audio_tokens": 0,
        "image_tokens": 0
      }
    },
    "output_tokens": {
      "text_tokens": 8,
      "audio_tokens": 26
    }
  },
  "event_CgpZsXjkTaKC3NbkEVWPX": {
    "input_tokens": {
      "text_tokens": 5248,
      "audio_tokens": 92,
      "image_tokens": 0,
      "cached_tokens": 5312,
      "cached_tokens_details": {
        "text_tokens": 5248,
        "audio_tokens": 64,
        "image_tokens": 0
      }
    },
    "output_tokens": {
      "text_tokens": 24,
      "audio_tokens": 0
    }
  },
  "event_CgpZsqs7Z8UBXUg62yE60": {
    "input_tokens": {
      "text_tokens": 5283,
      "audio_tokens": 66,
      "image_tokens": 0,
      "cached_tokens": 4864,
      "cached_tokens_details": {
        "text_tokens": 4864,
        "audio_tokens": 0,
        "image_tokens": 0
      }
    },
    "output_tokens": {
      "text_tokens": 8,
      "audio_tokens": 21
    }
  },
  "event_CgpZz8suUsUVqR9ZXWRJ6": {
    "input_tokens": {
      "text_tokens": 5088,
      "audio_tokens": 29,
      "image_tokens": 0,
      "cached_tokens": 4864,
      "cached_tokens_details": {
        "text_tokens": 4864,
        "audio_tokens": 0,
        "image_tokens": 0
      }
    },
    "output_tokens": {
      "text_tokens": 47,
      "audio_tokens": 51
    }
  },
  "event_Cgpa0XlmBip35WmsnAXGM": {
    "input_tokens": {
      "text_tokens": 5127,
      "audio_tokens": 0,
      "image_tokens": 0,
      "cached_tokens": 5056,
      "cached_tokens_details": {
        "text_tokens": 5056,
        "audio_tokens": 0,
        "image_tokens": 0
      }
    },
    "output_tokens": {
      "text_tokens": 16,
      "audio_tokens": 55
    }
  },
  "event_Cgpa4nZru0iRsB1yNWCRi": {
    "input_tokens": {
      "text_tokens": 0,
      "audio_tokens": 0,
      "image_tokens": 0,
      "cached_tokens": 0,
      "cached_tokens_details": {
        "text_tokens": 0,
        "audio_tokens": 0,
        "image_tokens": 0
      }
    },
    "output_tokens": {
      "text_tokens": 0,
      "audio_tokens": 0
    }
  }
}

Eventually my own custom conversation.item.delete logic kicks in, but it is not fast enough to prevent the rate_limit_exceeded error. I find it extremely doubtful that on a standard tier we should reach the limit so quickly (within 3-4 turns).
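For reference, a pruning approach like the one described above can be sketched as follows. The conversation.item.delete event type is from the realtime client-event reference; the item log, token estimates, and thresholds are hypothetical client-side bookkeeping.

```python
# Sketch of client-side pruning: when estimated input tokens cross a
# threshold, emit conversation.item.delete events for the oldest items
# until usage falls to a target. item_ids and token estimates must be
# tracked by the client; threshold/target values are assumptions.
import json

def prune_events(item_log, used_tokens, threshold=30_000, target=15_000):
    """item_log: list of (item_id, est_tokens), oldest first.
    Returns the JSON strings of the delete events to send."""
    if used_tokens <= threshold:
        return []
    events = []
    for item_id, est in item_log:
        if used_tokens <= target:
            break
        events.append(json.dumps(
            {"type": "conversation.item.delete", "item_id": item_id}))
        used_tokens -= est
    return events

log = [("item_a", 9_000), ("item_b", 9_000), ("item_c", 9_000)]
for event in prune_events(log, used_tokens=31_000):
    print(event)  # in a live session: ws.send(event)
```

Note that deleting items also shifts the cached prefix, so this trades cache discount for staying under the limit, just like the truncation discussion earlier in the thread.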

If you have any advice or thoughts on this, or indeed a relative comparison of your own use of the realtime API, I would love to know!

Thanks

You have over 5k tokens of input being sent each time a response is triggered.

The tier-1 rate limit for realtime models is 40k TPM, and the limiter counts the estimated input plus an allowance for the maximum output tokens.

Thus tier 1 cannot sustain even one conversation that continues at a large input or maximum context loading, because someone can interrupt even faster than the model’s response can be heard.
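The rough arithmetic behind that: the per-turn input from the usage logs above plus an assumed max-output reservation eats the 40k TPM limit in a handful of rapid turns (the exact reservation size is an assumption here).

```python
# Illustrative TPM arithmetic for tier 1: the limiter counts estimated
# input plus a reservation for potential output on each response.

tpm_limit = 40_000          # tier-1 TPM for realtime models
input_per_turn = 5_200      # roughly what the usage logs above show
output_reservation = 4_096  # assumed maximum-output allowance

per_turn = input_per_turn + output_reservation
turns_per_minute = tpm_limit // per_turn
print(per_turn, turns_per_minute)  # 9296 4
```

Four rapid turns per minute is consistent with hitting rate_limit_exceeded after 3-4 quick exchanges.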

A cheaper model to consider:

https://platform.openai.com/docs/models/gpt-realtime-mini

How much to pay in, after a waiting period, to increase your cumulative payments and thus your tier (which does not happen “automatically”: tiers are re-evaluated only at the time of making a new payment, and are based on payment amounts, not usage):

https://platform.openai.com/docs/guides/rate-limits/usage-tiers?context=tier-one

Thanks for the quick response. Most of those tokens in the “create response” trigger are actually cached_tokens (the system prompt). How does that affect hitting the rate limit? I thought tokens served from the cache should not count toward the 40k limit. Are you aware of any docs talking through these details? Or indeed, any best practices that you recommend? Thanks!