Documentation issues: Responses endpoint storage, storage persistence, truncation

API reference - Responses, truncation

Issue: the description makes no sense. And frankly, neither does the parameter name.

truncation - string or null, optional, defaults to disabled

The truncation strategy to use for the model response.

  • auto: If the context of this response and previous ones exceeds the model’s context window size, the model will truncate the response to fit the context window by dropping input items in the middle of the conversation.
  • disabled (default): If a model response will exceed the context window size for a model, the request will fail with a 400 error.

Truncate: the end is clipped or removed.

The endpoint itself will return a truncated output if the model writes up to the end of the context window length without finishing.

What does this parameter do though? The confused writer tells us “the model will truncate the response to fit”. Wrong. Cut off the output as a feature?

What is the intent?

In fact, “auto” is supposed to discard some in-the-middle chat turns, to keep the input below a token length that would cause an error. The default is to error out, but we don’t know exactly when it errors out - whether that is when there are 20 or 2000 tokens left after your input.

Thus: what is that input threshold - how much is tossed, and how much remains against a model’s context window length specification? Some reservation of the context window must be left for forming the output - how much of the window is held back for a response and is “not” input? The maximum response length of the model, or the sent max_output_tokens? More for a reasoning model, or scaled by its reasoning effort parameter?

Then, does it act only on discarding stored input, or if I send 1M tokens of many turns as input, will that also be affected? What is the behavior of this “in the middle”? The docs don’t tell. Only that the default behavior is to maximize the input billed, also without explaining the threshold at which the limit is reached and an error is returned by default (at 123k of a 125k window?).
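
For concreteness, here is where the parameter sits on a request - a sketch using the openai Python SDK, with the model name, response ID, and token figure as illustrative placeholders; nothing in it answers the threshold questions above:

from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-4.1-mini",             # illustrative model
    previous_response_id="resp_...",  # chained server-side conversation state
    input="Continue the analysis.",
    max_output_tokens=1024,
    truncation="auto",                # default "disabled" fails with a 400 error instead of dropping turns
)
print(resp.usage)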

I’d try to answer, but that would take work, exploration, and inference based on probing the API with large inputs.


Stored response persistence

Documentation says:

Response objects are saved for 30 days by default. They can be viewed in the dashboard logs page or retrieved via the API. You can disable this behavior by setting store to false when creating a Response.

Evidence in the Responses logs is contrary, suggesting that “default” so far is “forever”…

I have more I could write, but there’s nobody reading, apparently.

5 Likes

Well, I do read these :slight_smile:

We recently had another topic with a similar observation. I honestly didn’t know what to say.

@WhiteRabbit

1 Like

Hello

Best option for now (for devs):
Use a manual sliding window strategy based on token counting (tiktoken or similar).
It’s the only way to avoid silent context drops or unexpected charges from the Responses API, especially when using long conversations.

Trim conversation before API call:

# token_count() sums tokens across messages (e.g., via tiktoken); max_response reserves room for the reply
while token_count(messages) + max_response > limit:
    messages.pop(0)  # drop the oldest turn until the request fits
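
For completeness, a minimal sketch to fill in the undefined names above, assuming tiktoken’s cl100k_base encoding as an approximation of the model’s tokenizer (the real per-message overhead and context window vary by model, so the figures below are illustrative):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximation; newer models use other encodings (e.g., o200k_base)

def token_count(messages):
    # messages: a list of {"role": ..., "content": ...} dicts with string content
    return sum(len(enc.encode(m["content"])) + 4 for m in messages)  # +4 as a rough per-message overhead

limit = 128_000        # assumed context window of the target model
max_response = 1_024   # room reserved for the reply

With these definitions, the two-line loop above runs as written.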

Would love to see OpenAI expose context_tokens_used and a truncation_warning flag directly in the response.
Nicholas

1 Like

The answer is unchanged from my 0-day reply in the Responses topic:

There is no real management: the server-side state will run a model up to the maximum cost, with your option of paying or getting an error.

The needed parameter would be something like:

input_budget: [ int | str=“auto”, “disabled” ]

Discard the oldest chat turns, wholly, that appear after the “system”, “developer”, or “tools” specification context, until the input no longer exceeds the integer token count supplied by the developer.

“auto” is the model context window length minus the sent max_output_tokens (or max_tokens), or, when none is sent, minus the model’s maximum response length, falling back to a reservation of at most 8192 tokens and half the context window for a model with no response limit.

Note: it would also act on the newest multi-turn input, and thus can act partially or independently on a single call’s own input as well.
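
Since no such parameter exists today, a client-side approximation of the described behavior might look like this sketch - the preserved roles, window size, and reservation below are assumptions the developer supplies, and count_tokens can be any tokenizer-backed counter such as the tiktoken helper shown earlier:

# Hypothetical client-side stand-in for the proposed input_budget parameter.
PRESERVED_ROLES = {"system", "developer"}   # assumption: tool definitions are sent outside the message list

def trim_to_budget(messages, budget_tokens, count_tokens):
    """Drop the oldest non-preserved turns, wholly, until the input fits the budget."""
    kept = list(messages)
    while count_tokens(kept) > budget_tokens:
        for i, m in enumerate(kept):
            if m["role"] not in PRESERVED_ROLES:
                del kept[i]          # remove the oldest droppable turn
                break
        else:
            break                    # only preserved context remains; nothing more can go
    return kept

# "auto" as described: the context window minus the reserved output length.
context_window = 128_000             # assumption for an illustrative model
max_output_tokens = 4_096
trimmed = trim_to_budget(messages, context_window - max_output_tokens, token_count)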

OpenAI necessarily already has such a mechanism, so that Responses and past IDs could work from gpt-4 (8k) to gpt-4.1 (1M). They just don’t expose it. I suspect the reasoning may be that non-changing input gives OpenAI even more benefit through caching than the discount you’d want this feature to optimize for.

The point where “discarding (somewhere in the middle?)” happens could be discovered by exceeding the model context with input on top of a prior call already near the maximum, and then seeing what “cached tokens” reports for the remaining identical start of the conversation (which you ensure is > 1024 tokens).
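
A probe along those lines might look like this sketch - assuming the Responses API usage object exposes input_tokens_details.cached_tokens in the current SDK (check your version), and that the identical prefix shared with the prior near-maximum call exceeds 1024 tokens so caching can report on it; the model, ID, and padding are placeholders:

from openai import OpenAI

client = OpenAI()

# An input long enough that, combined with the prior stored turns, the total exceeds the window.
very_long_new_input = "filler " * 50_000   # illustrative padding only

probe = client.responses.create(
    model="gpt-4.1-mini",                  # illustrative model
    previous_response_id="resp_...",       # a prior call already near the context limit
    input=very_long_new_input,
    truncation="auto",
)
details = getattr(probe.usage, "input_tokens_details", None)
cached = getattr(details, "cached_tokens", None) if details else None
print(probe.usage.input_tokens, cached)    # a shrunken cached-token count hints at where turns were dropped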

But why? It would just report on undesirable action out of your control.

You’re right
We can’t prevent mid-context discards. But detecting them still matters.

Here’s a practical method for devs:

  1. Inject identifiable markers at known positions in your prompt (e.g., “MARKER_512”, “MARKER_1024”…).

  2. Saturate the context window, then observe which markers survive in the response context or logs.

  3. If “MARKER_512” is gone but “MARKER_2048” remains → discard happened in the earlier turns.
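
One way to run that check on a chained Responses conversation is sketched below - assuming the conversation was built up via previous_response_id, that the SDK exposes output_text, and that the model reliably echoes the markers it can still see; the model name and ID are placeholders:

import re
from openai import OpenAI

client = OpenAI()

check = client.responses.create(
    model="gpt-4.1-mini",             # illustrative model
    previous_response_id="resp_...",  # the saturated multi-turn conversation built up beforehand
    input="List every MARKER_<number> string visible in this conversation, nothing else.",
    truncation="auto",
)
survivors = set(re.findall(r"MARKER_\d+", check.output_text))
print(sorted(survivors))  # e.g. MARKER_512 missing while MARKER_2048 remains => earlier turns were dropped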

Why this helps:

  • Lets you avoid placing critical logic where the model silently trims.
  • Helps restructure prompts or apply a manual sliding window.
  • Enables debug visibility for large workflows (fine-tuning, long chains, audit trails).

OpenAI exposing context_tokens_used and a truncation_warning would make this clean, but until then, this gives devs control.

Nicholas

This AI doesn’t even know the difference between prompts and messages.

In the future, please answer without the assistance of AI (that only convinces you) and only by your own first-hand knowledge. Everyone here has an AI to ask, and this topic is not seeking a resolution other than action by OpenAI, in terms of correcting documentation and documenting algorithms.

1 Like

Dear _j - thank you for your feedback.

We would appreciate if you could be respectful when interacting with other members of the community, who are genuinely trying to help and suggest workarounds. This promotes community engagement and great discussions.

If I am reading your first post correctly, you seem to be asking for two things:

1/ The truncation algorithm, and more details on its actual behaviour - I can check with Engineering and ask for more detailed documentation.
2/ You are stating that the Response objects are saved forever - what leads you to this conclusion? Can you please elaborate?

Thank you for the feedback, and I will look forward to clarifications on point #2!

2 Likes

…who are genuinely using AI to cause widespread forum disruptions and do not offer any understanding of the core issue being discussed. See the mod history on that one…

Here’s how to read and reach the same conclusion.

  1. Find and look below the header “Stored response persistence”.
  2. Observe the quoted passage from current OpenAI documentation.
  3. Read the linked documentation and confirm the statement “Response objects are saved for 30 days by default” exists.
  4. Continue reading what I wrote: “Evidence in the Responses logs is contrary”.
  5. Look at the pictured screenshot, which is taken in an organization’s dashboard logs.
  6. See the screenshot made in the Responses logs, with stored items from March 12 (practically the date of the introduction of the endpoint) still persisting.
  7. Look at today’s date. Compare today to March 12. Here is a helper:

The screenshot in the first post, offered to anyone to read right below the documentation that is in error, supports the same inferred conclusion. It shows an organization’s logs, and unlike the documentation that says stored responses persist for 30 days by default (or for a “minimum of 30 days” on other pages), nothing at all has ever been automatically removed or expired, ever, from my organization’s Responses endpoint logs since the introduction of the Responses endpoint and the turning-on of “store” logging by default. The age is now 80 days and growing.

[image]

1 Like

I will investigate and escalate both issues to Engineering, as needed. Thank you.

1 Like

Thank you! :grinning_face:

I’ll try to revisit this forum topic and any documentation to see what information is offered about the truncation threshold and where it occurs, and also whether a practical threshold parameter to limit costs will be offered.


I am revisiting a week later, and the documentation is still unchanged and unimproved, in both:

  • How, where, by what increments of turns, and at what length versus the context window does “truncate” drop past messages that exceed the context budget?
  • When will stored responses ever expire? 30 days remains in the documentation; March 12 remains in the logs.

OpenAI was just ordered by a court to retain all chats and API outputs indefinitely, so it could be like that for a while. Not sure why those weren’t already deleted tho.

The “store” entries can be deleted by their response ID, so this persistence cannot serve that retention purpose - and it didn’t change recently, or at all.
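
For anyone wanting to verify or clean up on their own, a sketch of the manual route - assuming the current openai Python SDK exposes responses.retrieve and responses.delete (GET/DELETE /v1/responses/{id}; confirm against your SDK version), with the ID left as a placeholder:

from openai import OpenAI

client = OpenAI()

old_id = "resp_..."  # an ID taken from the March entries in the dashboard logs

try:
    old = client.responses.retrieve(old_id)
    print(old.created_at)            # still retrievable well past the documented 30 days
except Exception as err:
    print("expired or deleted:", err)

client.responses.delete(old_id)      # manual deletion: the only expiry observed so far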