Hey everyone, quick question to see if anyone else has run into this.
I’m using the Responses API with conversation_id (no previous_response_id, and I’m not sending any input either; I’m letting the model work purely from the conversation history).
What I’m seeing is that once I hit the token threshold, the response includes a compaction item in the output, as expected. However, in the next response, that compacted state doesn’t seem to be reused; it looks like the model is still processing the full conversation instead of the compacted version.
My understanding was that compaction is handled fully server-side when using conversation_id, so I wouldn’t need to manually pass anything (like the compacted result) into subsequent calls.
Has anyone experienced this?
Am I supposed to explicitly send the compacted content somewhere, or is this expected behavior?
Tested with gpt-5-mini and gpt-5.2, same result in both.
Yeah, your read of the compaction guide makes sense.
With conversation_id, compaction is supposed to be handled server-side, and the compacted state should automatically carry forward into subsequent turns without needing to resend anything manually.
What stands out in your case is:
- You’re not sending any new input
- The token usage doesn’t appear to drop after compaction is triggered
That combination suggests the compacted state may not actually be getting reused on the next turn, which doesn’t seem to match the expected behavior from the docs.
I haven’t personally run into this exact pattern, so here are a couple of things that could help narrow it down:
- Does it still happen if you include even a minimal input in the next turn?
- Does token usage look like the full history, or like what you’d expect from a compacted context?
If you can share a request ID + timestamp, that would make it easier to verify whether this is expected behavior or something off relative to the compaction docs.
I’m not sending input because I’m not running a 1:1 interaction flow. Instead, I push multiple client messages into the conversation, and then generate a single response after a debounce period (i.e. N messages → 1 response). So I rely entirely on the conversation state rather than passing new input on each turn.
Regarding compaction, I can confirm that it does trigger: I see the compaction item being emitted in the response output. However, it doesn’t seem to persist in the conversation or be taken into account in the next response generation. Token usage also looks consistent with the full history rather than a compacted context.
In this conversation_id, the input token usage increased as follows: 6993 → 8316 → 11274 → 11536 → 14278 → 15227 → 18839 → 19260, where the threshold is 10k.
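To make the growth concrete, here is the same usage series with per-turn deltas computed (plain Python over the numbers above):

```python
# Input token usage per turn in this conversation; the compaction
# threshold is 10k, so I'd expect usage to drop at some point after
# crossing it.
usage = [6993, 8316, 11274, 11536, 14278, 15227, 18839, 19260]
THRESHOLD = 10_000

# Per-turn growth: every delta is positive, including all the turns
# after the threshold is crossed.
deltas = [b - a for a, b in zip(usage, usage[1:])]
above = [u for u in usage if u > THRESHOLD]

print(deltas)
print(f"{len(above)} of {len(usage)} turns are above the threshold")
```

Every delta stays positive even after crossing the 10k threshold, which is why this looks like the full history rather than a compacted context.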
Let me know if you need anything else from my side.
The main point to be aware of is that only certain parts of a conversation can be compacted. If a conversation consists mostly of user messages, those will not be compacted, so the gain is usually close to zero. Model reasoning chains and tool calls, on the other hand, can be removed while still preserving most of the informational value of the compacted item.
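To put rough numbers on that (the item shapes below are simplified stand-ins, not the real API schema, and the token counts are made up for illustration):

```python
# Illustration of why compaction gains depend on item types: only
# reasoning and tool-call items are treated as compactable here, while
# user/assistant messages are kept verbatim.
COMPACTABLE_TYPES = {"reasoning", "function_call", "function_call_output"}

def compactable_fraction(items: list[dict]) -> float:
    """Fraction of (estimated) tokens that compaction could remove."""
    total = sum(i["tokens"] for i in items)
    removable = sum(i["tokens"] for i in items if i["type"] in COMPACTABLE_TYPES)
    return removable / total if total else 0.0

# A conversation that is mostly user messages gains almost nothing:
mostly_user = [
    {"type": "message", "tokens": 900},
    {"type": "message", "tokens": 850},
    {"type": "reasoning", "tokens": 50},
]

# One heavy on tool calls and reasoning gains a lot:
tool_heavy = [
    {"type": "message", "tokens": 100},
    {"type": "function_call", "tokens": 400},
    {"type": "function_call_output", "tokens": 900},
    {"type": "reasoning", "tokens": 400},
]
```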
Lastly, I was also a bit confused by the mention of not sending input: a conversation can be created without it, but a Responses API call with a conversation ID returns an error if input is not provided.
Hey @vb, thanks for the clarification, that actually helps a lot.
I didn’t realize user messages are not compacted; I was under the impression they were included in the process, so that explains a big part of what I’m seeing.
Also, you’re right about input; that was my mistake in how I explained it. I am sending it, but as an empty array (input: []), since otherwise the API throws an error.
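For clarity, each generation call I make is shaped roughly like this (just the payload, sketched as a plain dict rather than the actual SDK call; the conversation ID is a placeholder):

```python
def build_request(conversation_id: str, model: str = "gpt-5-mini") -> dict:
    """Shape of the per-turn request: no previous_response_id, and an
    empty input list so the model works purely from the stored
    conversation state (omitting input entirely raises an error when a
    conversation is attached)."""
    return {
        "model": model,
        "conversation": conversation_id,
        "input": [],
    }

payload = build_request("conv_placeholder_123")
```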
Given that, I have a follow-up question:
If compaction doesn’t really help reduce input tokens (especially when most of the conversation is user messages), what’s the recommended way to control token growth? In my case it keeps increasing linearly.
Would you happen to know if there’s any official guidance from OpenAI on this, or is it generally expected that we handle it manually, for example by pruning older messages, generating summaries, or replacing parts of the conversation?
I’d also love to hear how others are approaching this in practice.
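To be concrete about what I mean by pruning, I’m picturing something along these lines (purely a sketch of one option, not anything official; the names and window size are mine):

```python
# One possible manual strategy: keep a rolling summary plus only the
# most recent messages, rebuilding the input each turn instead of
# relying on stored conversation state.
KEEP_LAST = 6  # tunable window of recent messages kept verbatim

def prune_history(summary: str, messages: list[dict]) -> list[dict]:
    """Collapse everything older than the window into a single summary
    message, then append the recent messages verbatim."""
    recent = messages[-KEEP_LAST:]
    pruned = []
    if summary:
        pruned.append({
            "role": "system",
            "content": f"Summary of earlier conversation: {summary}",
        })
    pruned.extend(recent)
    return pruned
```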
Thanks again for your help, really appreciate your response!
Glad this helps clarify the underlying issue. I have already asked the team whether they want to update the documentation to make this clearer going forward.
Regarding your follow-up question, you will likely get a better answer if you can share a bit more about your exact use case. I usually summarize based on project-specific requirements, but that approach is also partly shaped by my own preferences and the structures I create to help me evaluate and optimize later.
Thanks for checking with the team; I’ll keep an eye on the docs and the community in case this gets clarified further.
Regarding my use case, it’s basically a bot that operates on social media for online stores. I receive multiple messages in parallel from users, and then respond using a debounce strategy to make sure everything gets answered in a single response.
Since these are e-commerce scenarios, conversations can get quite long. Users ask multiple questions, trigger catalog searches that can return a lot of items (via function calls), and there are also other tool calls to persist certain pieces of information in the chat.
Because of that, input token usage grows pretty aggressively over time. I’ve seen conversations reaching up to ~280k input tokens, which is starting to become unsustainable.
That’s why I’m trying to understand what the best approach would be here, since relying on compaction alone doesn’t seem to help much in this kind of setup.
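For reference, the debounce flow I described is roughly this shape (a simplified sketch; the names and timeout are illustrative):

```python
# Buffer incoming user messages, and only generate one reply after a
# quiet period of DEBOUNCE_SECONDS with no new messages (N messages ->
# 1 response).
DEBOUNCE_SECONDS = 3.0

class DebounceBuffer:
    def __init__(self) -> None:
        self.pending: list[str] = []
        self.last_arrival = 0.0

    def add(self, message: str, now: float) -> None:
        """Record an incoming message and reset the quiet-period clock."""
        self.pending.append(message)
        self.last_arrival = now

    def ready(self, now: float) -> bool:
        """Generate only once the user has gone quiet."""
        return bool(self.pending) and (now - self.last_arrival) >= DEBOUNCE_SECONDS

    def drain(self) -> list[str]:
        """Hand the whole batch to a single generation call and reset."""
        batch, self.pending = self.pending, []
        return batch
```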