I am working on a chatbot application that uses the Responses API.
For conversation state, I am using the Conversations API instead of passing the previous_response_id param: I create one conversation per user, once, and pass it as the conversation param to the client.responses.create method.
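For context, this is roughly what that setup looks like (a minimal sketch assuming the Python SDK; method names like client.conversations.create may vary with your SDK version):

```python
from openai import OpenAI

client = OpenAI()

# Created once per user and persisted alongside the user record.
conversation = client.conversations.create()

# Every subsequent turn references the same conversation id.
response = client.responses.create(
    model="gpt-4.1",
    conversation=conversation.id,
    input=[{"role": "user", "content": "Hello!"}],
)
print(response.output_text)
```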
My question is: given that I couldn’t find anything about conversation pruning in the docs, am I supposed to manage pruning manually by calling delete-item on older items in a conversation?
If I don’t, will the context given to the LLM just keep increasing as new messages are added until the context window limit of the model is reached, or is there any other type of pruning employed?
The point is that my application doesn’t expose explicit conversation management to the user (just a single thread, no “new chat” button - although it wouldn’t be a problem for the user to eventually lose access to very old messages as they move out of some “time window”), so I’m a bit wary of keeping a single conversation object per user forever.
I am having a similar issue. What I did was set a Redis key with a 15-minute TTL for each conversation; when the Redis key expires, I trigger deletion of the conversation.
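If it helps, here is a rough sketch of that pattern (assuming redis-py with keyspace notifications enabled, and that the SDK exposes client.conversations.delete; adjust names to your own setup):

```python
import redis
from openai import OpenAI

client = OpenAI()
r = redis.Redis()

CONV_TTL_SECONDS = 15 * 60

def track_conversation(conversation_id: str) -> None:
    # Encode the conversation id in the key name: the value is already gone
    # by the time the expiry event fires, but the key name is still delivered.
    r.set(f"conv:{conversation_id}", 1, ex=CONV_TTL_SECONDS)

def reap_expired_conversations() -> None:
    # Requires keyspace notifications: r.config_set("notify-keyspace-events", "Ex")
    pubsub = r.pubsub()
    pubsub.psubscribe("__keyevent@0__:expired")
    for msg in pubsub.listen():
        if msg["type"] != "pmessage":
            continue
        key = msg["data"].decode()
        if key.startswith("conv:"):
            client.conversations.delete(key.removeprefix("conv:"))
```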
No mechanism is offered to fit conversation length to a budget. This shortcoming makes server-side conversation state on Responses unsuitable for use, whether through reuse of a response ID or through a “conversation”.
There is a “truncation” parameter. It decides whether you get an error when the conversation is larger than the model’s input context, or whether, only at that maximum, turns are dropped from the beginning based on the model’s context window.
If OpenAI can run their truncation method against the model’s context window length, they can darn well run it against your own maximum-input parameter. They don’t, though.
Manually deleting messages is data loss. If you want a conversation to run on gpt-4, you’d have to prune the stored input plus the latest input down to about 6k tokens so there is room to form a response. Then you switch to gpt-4.1, and you’ve already deleted the million tokens of input you could have sent (and been billed for) on that model.
The best course is not to use the offered server-side conversation state until they come up with a budget setting that is also cache-aware: one that determines which turns are candidates for advancing a cutoff pointer in large increments, by model, by expiry time, by cache discount.
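As a stopgap, the budget-aware pruning can live on the client instead; a minimal sketch (the token estimate and the 30k budget are illustrative assumptions, not anything the API provides):

```python
from openai import OpenAI

client = OpenAI()

MAX_INPUT_TOKENS = 30_000  # illustrative budget; tune per model

def rough_tokens(text: str) -> int:
    # Crude estimate (~4 characters per token); swap in tiktoken for accuracy.
    return len(text) // 4

def build_input(history: list[dict], user_text: str) -> list[dict]:
    """Keep the newest turns that fit under the budget; drop the oldest."""
    messages = history + [{"role": "user", "content": user_text}]
    while len(messages) > 1 and sum(
        rough_tokens(m["content"]) for m in messages
    ) > MAX_INPUT_TOKENS:
        messages.pop(0)  # drop the oldest turn first
    return messages

history: list[dict] = []  # kept in your own store, not on the server
user_text = "What did we decide yesterday?"

response = client.responses.create(
    model="gpt-4.1",
    input=build_input(history, user_text),
    store=False,  # nothing accumulates server-side
)
history.append({"role": "user", "content": user_text})
history.append({"role": "assistant", "content": response.output_text})
```

The full history stays in your own store, so nothing is lost when you later switch to a model with a bigger context window.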
If you do want to use a conversation, there isn’t even the Assistants endpoint’s maximum-number-of-turns threshold.
A lot of wasted effort to not deliver practicality.
There is an automatic process for deleting messages after 30 days.
There is also an option, disabled or auto, for automatically truncating responses when the input exceeds the model’s context window size. This feature is a good start but limited. For instance, I would like more control over which types of messages are truncated - tool messages are kept and can be quite large.
I believe the AI Agent API offers more options in relation to this.
This is an excerpt from the Responses API documentation - Response Object section:
truncation (string or null)
The truncation strategy to use for the model response.
- auto: If the input to this Response exceeds the model’s context window size, the model will truncate the response to fit the context window by dropping items from the beginning of the conversation.
- disabled (default): If the input size will exceed the context window size for a model, the request will fail with a 400 error.
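In practice, a call like the following opts into the drop-from-the-beginning behaviour instead of the 400 error (a minimal sketch, assuming the Python SDK; the conversation id is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1",
    conversation="conv_123",   # placeholder conversation id
    input="Continue where we left off.",
    truncation="auto",         # default is "disabled" -> 400 on overflow
)
print(response.output_text)
```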
And this is from the Microsoft Learn site - Responses API - Delete Response section:
When you keep a single conversation id, OpenAI stores every turn. The model will only see content that fits its context window—older text is dropped at runtime—but the stored conversation keeps growing.
Best practice:
- Keep a small rolling window and summarize older parts.
- Delete unneeded items with conversation.item.delete.
- Optionally set truncation: "auto" as a safety net.
- Use store: false for exchanges you don’t need to keep.
This keeps token usage, cost, and latency predictable and avoids uncontrolled growth of long-running chats.
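A rough sketch of the rolling-window-plus-delete part of that advice (assuming the Python SDK exposes conversation items under client.conversations.items; exact method names, argument order, and list ordering may differ):

```python
from openai import OpenAI

client = OpenAI()

KEEP_LAST_ITEMS = 20  # illustrative rolling-window size

def prune_conversation(conversation_id: str) -> None:
    """Delete everything older than the newest KEEP_LAST_ITEMS items."""
    items = list(client.conversations.items.list(conversation_id))
    # Assumption: items come back newest-first; flip the slice if they don't.
    for item in items[KEEP_LAST_ITEMS:]:
        client.conversations.items.delete(
            item_id=item.id,
            conversation_id=conversation_id,
        )

# Called after each turn, alongside truncation="auto" as the safety net.
prune_conversation("conv_123")
```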
Sorry, ChatGPT, but that answer only repeats what was already seen in this forum topic, and it rests on an incorrect assumption.
You would not want to damage a conversation that may need to be retrieved for customer presentation. You would not want to damage a conversation that may later be run against models with different context lengths.
What you would want is an effective “truncation” interface that does better than what OpenAI offers: one that takes some kind of budget parameter, short of running a million tokens of old chat per turn, per internal iteration.