Hey everyone, quick question to see if anyone else has run into this.
I’m using the Responses API with conversation_id (no previous_response_id, and I’m not sending any input either; I’m letting the model work purely from the conversation history).
What I’m seeing is that once I hit the token threshold, the response includes a compaction item in the output, as expected. However, in the next response, that compacted state doesn’t seem to be reused; it looks like the model is still processing the full conversation instead of the compacted version.
My understanding was that compaction is handled fully server-side when using conversation_id, so I wouldn’t need to manually pass anything (like the compacted result) into subsequent calls.
Has anyone experienced this?
Am I supposed to explicitly send the compacted content somewhere, or is this expected behavior?
Tested with gpt-5-mini and gpt-5.2, same result in both.
Yeah, your read of the compaction guide makes sense.
With conversation_id, compaction is supposed to be handled server-side, and the compacted state should automatically carry forward into subsequent turns without needing to resend anything manually.
What stands out in your case is:
- You’re not sending any new input
- The token usage doesn’t appear to drop after compaction is triggered
That combination suggests the compacted state may not actually be getting reused on the next turn, which doesn’t seem to match the expected behavior from the docs.
I haven’t personally run into this exact pattern, so here are a couple of things that could help narrow it down:
- Does it still happen if you include even a minimal input in the next turn?
- Does token usage look like the full history, or like what you’d expect from a compacted context?
If you can share a request ID + timestamp, that would make it easier to verify whether this is expected behavior or something off relative to the compaction docs.
I’m not sending input because I’m not running a 1:1 interaction flow. Instead, I push multiple client messages into the conversation, and then generate a single response after a debounce period (i.e. N messages → 1 response). So I rely entirely on the conversation state rather than passing new input on each turn.
Regarding compaction, I can confirm that it does trigger: I see the compaction item being emitted in the response output. However, it doesn’t seem to persist in the conversation or be taken into account in the next response generation. Token usage also looks consistent with the full history rather than a compacted context.
In this conversation_id, the input token usage increased as follows: 6993 → 8316 → 11274 → 11536 → 14278 → 15227 → 18839 → 19260, where the threshold is 10k.
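To make the growth concrete, here is the same usage series with per-turn deltas computed (plain Python over the numbers above):

```python
# Input token usage per turn in this conversation; the compaction
# threshold is 10k, so I'd expect usage to drop at some point after
# crossing it.
usage = [6993, 8316, 11274, 11536, 14278, 15227, 18839, 19260]
THRESHOLD = 10_000

# Per-turn growth: every delta is positive, including all the turns
# after the threshold is crossed.
deltas = [b - a for a, b in zip(usage, usage[1:])]
above = [u for u in usage if u > THRESHOLD]

print(deltas)
print(f"{len(above)} of {len(usage)} turns are above the threshold")
```

Every delta stays positive even after crossing the 10k threshold, which is why this looks like the full history rather than a compacted context.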
Let me know if you need anything else from my side.
The main point to be aware of is that only certain parts of a conversation can be compacted. If a conversation consists mostly of user messages, those will not be compacted, so the gain is usually close to zero. Model reasoning chains and tool calls, on the other hand, can be removed while still preserving most of the informational value of the compacted item.
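To put rough numbers on that (the item shapes below are simplified stand-ins, not the real API schema, and the token counts are made up for illustration):

```python
# Illustration of why compaction gains depend on item types: only
# reasoning and tool-call items are treated as compactable here, while
# user/assistant messages are kept verbatim.
COMPACTABLE_TYPES = {"reasoning", "function_call", "function_call_output"}

def compactable_fraction(items: list[dict]) -> float:
    """Fraction of (estimated) tokens that compaction could remove."""
    total = sum(i["tokens"] for i in items)
    removable = sum(i["tokens"] for i in items if i["type"] in COMPACTABLE_TYPES)
    return removable / total if total else 0.0

# A conversation that is mostly user messages gains almost nothing:
mostly_user = [
    {"type": "message", "tokens": 900},
    {"type": "message", "tokens": 850},
    {"type": "reasoning", "tokens": 50},
]

# One heavy on tool calls and reasoning gains a lot:
tool_heavy = [
    {"type": "message", "tokens": 100},
    {"type": "function_call", "tokens": 400},
    {"type": "function_call_output", "tokens": 900},
    {"type": "reasoning", "tokens": 400},
]
```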
Lastly, I was also a bit confused by the mention of not sending input: a conversation can be created without it, but a Responses API call with a conversation ID returns an error if input is not provided.
Hey @vb, thanks for the clarification, that actually helps a lot.
I didn’t realize user messages are not compacted; I was under the impression they were included in the process, so that explains a big part of what I’m seeing.
Also, you’re right about input; that was my mistake in how I explained it. I am sending it, but as an empty array (input: []), since otherwise the API throws an error.
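For clarity, each generation call I make is shaped roughly like this (just the payload, sketched as a plain dict rather than the actual SDK call; the conversation ID is a placeholder):

```python
def build_request(conversation_id: str, model: str = "gpt-5-mini") -> dict:
    """Shape of the per-turn request: no previous_response_id, and an
    empty input list so the model works purely from the stored
    conversation state (omitting input entirely raises an error when a
    conversation is attached)."""
    return {
        "model": model,
        "conversation": conversation_id,
        "input": [],
    }

payload = build_request("conv_placeholder_123")
```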
Given that, I have a follow-up question:
If compaction doesn’t really help reduce input tokens (especially when most of the conversation is user messages), what’s the recommended way to control token growth? In my case it keeps increasing linearly.
Would you happen to know if there’s any official guidance from OpenAI on this, or is it generally expected that we handle it manually, for example by pruning older messages, generating summaries, or replacing parts of the conversation?
I’d also love to hear how others are approaching this in practice.
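To be concrete about what I mean by pruning, I’m picturing something along these lines (purely a sketch of one option, not anything official; the names and window size are mine):

```python
# One possible manual strategy: keep a rolling summary plus only the
# most recent messages, rebuilding the input each turn instead of
# relying on stored conversation state.
KEEP_LAST = 6  # tunable window of recent messages kept verbatim

def prune_history(summary: str, messages: list[dict]) -> list[dict]:
    """Collapse everything older than the window into a single summary
    message, then append the recent messages verbatim."""
    recent = messages[-KEEP_LAST:]
    pruned = []
    if summary:
        pruned.append({
            "role": "system",
            "content": f"Summary of earlier conversation: {summary}",
        })
    pruned.extend(recent)
    return pruned
```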
Thanks again for your help, really appreciate your response!
Glad this helps clarify the underlying issue. I have already asked the team whether they want to update the documentation to make this clearer going forward.
Regarding your follow-up question, you will likely get a better answer if you can share a bit more about your exact use case. I usually summarize based on project-specific requirements, but that approach is also partly shaped by my own preferences and the structures I create to help me evaluate and optimize later.
Thanks for checking with the team; I’ll keep an eye on the docs and the community in case this gets clarified further.
Regarding my use case, it’s basically a bot that operates on social media for online stores. I receive multiple messages in parallel from users, and then respond using a debounce strategy to make sure everything gets answered in a single response.
Since these are e-commerce scenarios, conversations can get quite long. Users ask multiple questions, trigger catalog searches that can return a lot of items (via function calls), and there are also other tool calls to persist certain pieces of information in the chat.
Because of that, input token usage grows pretty aggressively over time. I’ve seen conversations reaching up to ~280k input tokens, which is starting to become unsustainable.
That’s why I’m trying to understand what the best approach would be here, since relying on compaction alone doesn’t seem to help much in this kind of setup.
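For reference, the debounce flow I described is roughly this shape (a simplified sketch; the names and timeout are illustrative):

```python
# Buffer incoming user messages, and only generate one reply after a
# quiet period of DEBOUNCE_SECONDS with no new messages (N messages ->
# 1 response).
DEBOUNCE_SECONDS = 3.0

class DebounceBuffer:
    def __init__(self) -> None:
        self.pending: list[str] = []
        self.last_arrival = 0.0

    def add(self, message: str, now: float) -> None:
        """Record an incoming message and reset the quiet-period clock."""
        self.pending.append(message)
        self.last_arrival = now

    def ready(self, now: float) -> bool:
        """Generate only once the user has gone quiet."""
        return bool(self.pending) and (now - self.last_arrival) >= DEBOUNCE_SECONDS

    def drain(self) -> list[str]:
        """Hand the whole batch to a single generation call and reset."""
        batch, self.pending = self.pending, []
        return batch
```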