While integrating compaction into our workflow, we encountered some ambiguity. The docs tell us to send the entirety of the conversation so far as input for the compaction, yet there is also a previous_response_id available on the request. We assumed we could simply hit the /compact endpoint with that id and get the compacted response object back, but the docs are unclear:
The unique ID of the previous response to the model. Use this to create multi-turn conversations. Learn more about conversation state. Cannot be used in conjunction with conversation.
What does it mean to use this to create “multi-turn conversations”? In this context, wouldn’t it make more sense to simply use the cached response of that previous id as input for the compaction?
Compaction is essentially a tool for self-management of conversation, but it becomes a platform lock-in of that self-management if you use the mechanism and discard the original conversation.
It is indeed something you’d want to do only when the cache has already been broken, such as with an idled conversation after any cache would have timed out, or when the context absolutely can’t be run due to length.
A previous response ID as conversation state doesn’t do much to encourage better cache utilization; it is just a different way to send messages. In fact, if you are using “truncation”: “auto”, it is the API deciding when to discard messages and break the cache on you.
I assume it was a (poor) copy-paste from the create endpoint’s parameter description. It could definitely use better wording.
Anyways, I think what they really meant is that you can use the previous_response_id parameter instead of passing the whole input again; that would be it.
Yes, you can have it produce output based on the ID of a previous response.
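For anyone following along, here is a minimal sketch of that call shape. The transport is injected so it runs offline; previous_response_id is the documented parameter, but the rest of the payload and the stubbed return values are my assumptions, not verified API behavior:

```python
# Sketch: continuing a conversation via previous_response_id instead of
# resending the whole input each turn. Payload fields beyond the documented
# ones are assumptions.

def next_turn(post, model, user_text, prev_id=None):
    payload = {"model": model, "input": [{"role": "user", "content": user_text}]}
    if prev_id:
        # The server reconstructs earlier turns from its stored chain.
        payload["previous_response_id"] = prev_id
    return post("/v1/responses", payload)

# Offline stand-in for the API, just to show the call shape.
def fake_post(path, payload):
    return {"id": "resp_2",
            "previous_response_id": payload.get("previous_response_id")}

resp = next_turn(fake_post, "gpt-4.1", "continue the analysis", prev_id="resp_1")
```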
One might think it could then modify that stored response. Retrieve it and see if the messages are gone, i.e. whether it is not a permanently stateful ID?
Then you’d have to run that compaction as one turn of conversation input, getting pretty close to “why don’t I do this all myself”.
One other curiosity: you’d think that a response ID is a discrete snapshot of the input that was run. But no. You will find instead that a response ID is a chain of references to previous response IDs. Delete one of the earlier IDs (if you are collecting them all for a chat session, like you could just collect all the messages) - you then have a conversation where the input history stops where the chain is broken. Perhaps that’s an easier “compaction”: discard some of the chain by killing old IDs on demand, whenever you feel it is a good time to truncate - as long as it keeps working that way.
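A toy model of the chaining behavior described above: each response stores only a pointer to its predecessor plus its own new messages, and reconstructing history walks the chain. This is an illustration of the observed behavior, not OpenAI’s actual implementation:

```python
# response_id -> (previous_response_id, messages added by that turn)
store = {}

def add_response(rid, prev_id, messages):
    store[rid] = (prev_id, messages)

def history(rid):
    """Walk the chain backwards; stop where a link is missing (deleted)."""
    msgs = []
    while rid in store:
        prev, new = store[rid]
        msgs = new + msgs
        rid = prev
    return msgs

add_response("r1", None, ["u1", "a1"])
add_response("r2", "r1", ["u2", "a2"])
add_response("r3", "r2", ["u3", "a3"])
assert history("r3") == ["u1", "a1", "u2", "a2", "u3", "a3"]

del store["r1"]  # delete an earlier ID...
# ...and the input history now stops where the chain is broken:
assert history("r3") == ["u2", "a2", "u3", "a3"]
```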
Oh, that’s what I was thinking too, which was the main purpose of my question: does it work this way? Has anyone successfully used it with that previous response ID?
Just reporting back: it seems to work as @aprendendo.next reported, so I really appreciate it. It threw me off that the object that came back was just inputs. I know the docs say so, but I thought maybe the compactions would be in between the inputs - instead it was just the last object (an encrypted object).
OAI Logs in the Dashboard also not reporting compaction threw me off a bit - they only report back a string of inputs.
On a follow-up request, I referred to info “lost” in the encrypted item, and it came back 100% accurate. It will be interesting to see whether there are true savings here given our current approach (using previous_response_id). It’s not apparent from this first test, and I’d love more data from OAI on it, but again, thanks for the help.
The main thing to consider: scheduling a compaction run powered by AI doesn’t really fit any pattern of “excellence”.
Are you going to do it preemptively, just in case someone doesn’t abandon a chat but instead revisits it? “Paying it forward”.
Are you going to make someone wait while an AI generates thousands of tokens that are not thinking or a response?
Are you switching models with different context lengths? Can you know when to do it if you are switching between gpt-4.1 (1M) and gpt-4o (<128k)? There’s no “here’s how much to keep, or don’t drop anything at all” parameter.
Are you going to schedule it with knowledge of when the cache is already broken by timeout, broken on a particular model or not existing on another? Are you using server-side conversation storage exclusively, needing to retrieve messages and count tokens to even inform your calls to this endpoint?
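If you do schedule compaction yourself, you at least need model-aware limits. A rough sketch: the context-window figures for gpt-4.1 and gpt-4o are from the discussion above, and the 80% trigger threshold is an arbitrary choice of mine:

```python
# Approximate context windows; treat these figures as assumptions to verify
# against the current model docs.
CONTEXT_WINDOW = {
    "gpt-4.1": 1_000_000,  # ~1M tokens
    "gpt-4o": 128_000,     # <128k tokens
}

def should_compact(token_count, model, threshold=0.8):
    """Decide per-model whether the conversation is near its limit."""
    limit = CONTEXT_WINDOW.get(model)
    if limit is None:
        raise ValueError(f"unknown model: {model}")
    return token_count >= threshold * limit

assert should_compact(110_000, "gpt-4o")       # near the 128k limit
assert not should_compact(110_000, "gpt-4.1")  # a tiny fraction of 1M
```

Note this only answers the “when”; as pointed out above, there is still no parameter telling the endpoint how much to keep.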
The main thing I would do, since you cannot observe the quality delivered: only pass the oldest turns for compaction. Even then you’d still potentially destroy “here’s the code base we’ll be discussing this whole chat” inputs with summarization.
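One way to implement “only pass the oldest turns”: split the window, hand only the head to compaction, and keep the recent tail verbatim. The keep_recent count here is an arbitrary choice, and the caveat above about destroying early pinned context still applies:

```python
def split_for_compaction(window, keep_recent=6):
    """Return (oldest turns to compact, recent tail to keep verbatim)."""
    if len(window) <= keep_recent:
        return [], window  # nothing old enough to compact yet
    return window[:-keep_recent], window[-keep_recent:]

head, tail = split_for_compaction(list(range(10)), keep_recent=6)
# head holds the 4 oldest items, tail the 6 most recent
```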
For long-running conversations with the Responses API, you can use the /responses/compact endpoint to shrink the context you send with each turn.
Compaction is stateless: you send the full window to the endpoint, and it returns a compacted window that you provide in the next /responses call.
All prior user messages are kept verbatim.
Prior assistant messages, tool calls, tool results, and encrypted reasoning are replaced with a single encrypted compaction item that preserves the model’s latent understanding while remaining opaque and ZDR-compatible.
Usage flow
Send Responses requests as usual with user messages, assistant replies, and tool interactions.
When the context window grows large, call /responses/compact with the full window (it must still fit within the model’s max context size).
Use the returned compacted window as the input for the next /responses request and continue the workflow.
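The flow above can be sketched with an injected HTTP function so it runs offline. The endpoint path matches the docs quoted here, but the length check is a crude stand-in for real token counting, and the shape of the returned object (an "input" list with user messages kept and the rest collapsed into one encrypted item) is my assumption based on reports earlier in this thread:

```python
def maybe_compact(post, model, window, max_items=50):
    """Step 2: when the window grows large, swap it for the compacted one."""
    if len(window) <= max_items:  # crude proxy; count tokens in practice
        return window
    out = post("/v1/responses/compact", {"model": model, "input": window})
    return out["input"]           # assumed field holding the compacted window

# Offline stand-in mimicking the documented behavior: user messages kept
# verbatim, everything else replaced by a single encrypted compaction item.
def fake_post(path, payload):
    kept = [m for m in payload["input"] if m.get("role") == "user"]
    return {"input": kept + [{"type": "compaction", "encrypted_content": "..."}]}

window = [{"role": "user" if i % 2 == 0 else "assistant", "content": str(i)}
          for i in range(60)]
compacted = maybe_compact(fake_post, "gpt-4.1", window)
# 60 items shrink to 30 user messages plus 1 compaction item
```

The compacted list would then be passed as the input of the next /responses request (step 3).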
Instructions (optional)
The instructions field lets you include a system-style message that applies only to the compaction request. We recommend using this field only if you also supply instructions when creating responses, and ensuring that the same instructions are passed to both the Responses and Compact endpoints.
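That recommendation can be enforced by building both payloads from one source of truth, so the Responses and Compact calls always carry identical instructions. The instructions field is documented above; everything else here is a sketch:

```python
def paired_payloads(model, window, instructions):
    """Build matching payloads for /responses and /responses/compact."""
    base = {"model": model, "input": window, "instructions": instructions}
    return dict(base), dict(base)  # separate copies, identical instructions

create_p, compact_p = paired_payloads("gpt-4.1", [], "Answer tersely.")
```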