We are seeing a consistent progressive latency increase in long-running voice sessions with gpt-realtime-1.5.
This is not limited to tool-calling turns. It also affects normal assistant responses that do not call any tools.
We use the model in a real-time phone / voice interview workflow. Early in the session, the model is fast and responsive. As the session gets longer, the latency grows significantly. By the later parts of the call, the end-to-end delay from user answer to next assistant question can be roughly 3x or more compared to the beginning.
What makes this especially confusing is that we already clear old conversation items during the session, and we also compact old context into a short summary. So we would expect latency to improve after a memory cleanup, but in practice the model often remains slow even after cleanup.
What we observed:
- Early in the call, assistant-turn latency is often around 1 to 2 seconds.
- Later in the same call, assistant-turn latency often grows to around 6 to 8 seconds.
- In some later turns, we also see much larger spikes, for example 20 seconds or more.
- This happens both:
  - on turns that include tool calls
  - and on turns that are just normal assistant speech without any tool call.
Our own server-side processing time increases only moderately over the same period, so the main slowdown appears to be on the model side, not in our orchestration layer.
More specifically, our measurements show roughly this pattern:
- Early session:
  - model / assistant turn: about 1 to 2 seconds
  - our server / graph resume: about 0.7 to 1.1 seconds
- Later session (9 min+):
  - model / assistant turn: about 6 to 8+ seconds
  - our server / graph resume: about 1.7 to 2.2 seconds
By around minute 15 the session gets very slow, and by minute 20 it is effectively unusable. I am building a platform for telephone surveys, so long-running interviews are especially important; they are where the biggest cost savings come from.
So the server does get somewhat slower, but the much larger increase is in the assistant generation time.
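For context, here is roughly how we time an assistant turn. This is a minimal sketch, assuming the standard Realtime WebSocket server events (`input_audio_buffer.speech_stopped`, `response.audio.delta` / `response.output_audio.delta`, `response.done`; exact event names differ between API versions), and the `TurnTimer` class name is just ours for illustration:

```python
import json
import time

class TurnTimer:
    """Times one assistant turn from end of caller speech to first assistant audio."""

    def __init__(self):
        self.user_speech_stopped_at = None
        self.first_audio_at = None

    def on_event(self, raw: str):
        event = json.loads(raw)
        etype = event.get("type")

        if etype == "input_audio_buffer.speech_stopped":
            # Server VAD detected the end of the caller's answer.
            self.user_speech_stopped_at = time.monotonic()
            self.first_audio_at = None

        elif etype in ("response.audio.delta", "response.output_audio.delta"):
            # First chunk of assistant audio for this turn = perceived latency.
            if self.first_audio_at is None and self.user_speech_stopped_at is not None:
                self.first_audio_at = time.monotonic()
                latency = self.first_audio_at - self.user_speech_stopped_at
                print(f"assistant-turn latency: {latency:.2f}s")

        elif etype == "response.done" and self.user_speech_stopped_at is not None:
            # Full generation time for the turn, useful for spotting the big spikes.
            total = time.monotonic() - self.user_speech_stopped_at
            print(f"turn total (speech stop -> response.done): {total:.2f}s")
```

The per-turn numbers above come from timing of this kind, with every incoming server event passed through a handler like this from the WebSocket receive loop.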
We also checked what happens after memory cleanup. We periodically delete older conversation items and replace them with a short summary of prior context. However, even after this cleanup, latency often remains high. So the issue does not appear to be explained simply by “too many raw conversation items still present”.
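For reference, our cleanup step looks roughly like this. It is a minimal sketch, assuming the `conversation.item.delete` and `conversation.item.create` client events over an open Realtime WebSocket connection (`ws`); the `compact_history` helper and variable names are illustrative, and field names may differ slightly between API versions:

```python
import json

async def compact_history(ws, old_item_ids: list[str], summary_text: str):
    # Drop the raw turns we no longer need verbatim.
    for item_id in old_item_ids:
        await ws.send(json.dumps({
            "type": "conversation.item.delete",
            "item_id": item_id,
        }))

    # Re-insert the prior context as a single short summary message.
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "system",
            "content": [
                {"type": "input_text",
                 "text": f"Summary of the interview so far: {summary_text}"},
            ],
        },
    }))
```

Even after sending deletes for most of the older items and keeping only a short summary item like this, the following turns are often not noticeably faster.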
Additional observations:
- Late in the session, input token counts still appear relatively high even after cleanup.
- We often still see input sizes in the rough range of about 9k to 10.5k input tokens.
- We also see cached tokens and audio tokens present, which makes us wonder whether the effective real-time context remains large even after deleting old items (see the usage-logging sketch after this list).
- Because latency remains elevated even after cleanup, we suspect one or more of the following:
  - deleted items are not reducing effective model-side context as much as expected
  - the compacted summary still leaves enough contextual burden to keep latency high
  - long-running audio / realtime sessions have some latency growth characteristics independent of visible conversation item count
  - there may be model-side state accumulation or performance degradation over long sessions
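This is how we pull the per-turn token breakdown. It is a minimal sketch, assuming `response.done` carries a `response.usage` object with an `input_token_details` breakdown (`cached_tokens`, `audio_tokens`, `text_tokens`); these field names reflect the Realtime docs as we understand them and may vary by API version:

```python
import json

def log_usage(raw_event: str):
    """Log the token composition reported for one completed assistant turn."""
    event = json.loads(raw_event)
    if event.get("type") != "response.done":
        return

    usage = event.get("response", {}).get("usage") or {}
    details = usage.get("input_token_details") or {}
    print(
        "input_tokens=%s cached=%s audio=%s text=%s output_tokens=%s" % (
            usage.get("input_tokens"),
            details.get("cached_tokens"),
            details.get("audio_tokens"),
            details.get("text_tokens"),
            usage.get("output_tokens"),
        )
    )
```

The 9k to 10.5k input-token figures mentioned above come from logging of this kind on late-session turns.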
What we would like to understand:
- Is progressive latency increase over long gpt-realtime-1.5 sessions expected behavior?
- Does deleting conversation items in a live realtime session fully reduce the model's effective context cost, or only partially?
- If older turns are replaced by a compact summary in instructions, should we still expect substantial latency reduction, or is the effect usually limited?
- Are there recommended best practices for keeping long realtime voice sessions fast over time?
- Is there any known issue with long-running realtime sessions where assistant-turn latency grows even when old items are deleted?
A minimal summary of the problem:
- Long realtime voice session
- Early turns are fast
- Latency grows progressively over time
- Happens even on non-tool turns
- We already delete old conversation items
- We also summarize prior context
- Cleanup does not reliably restore initial latency
- The main increase appears to be on the model side, not our server side
Please fix fast :D.