I’ve been tinkering and want to make sure I’m tracking correctly. It appears that managing conversation state on the server side by passing previous_response_id only carries over the last message?
Am I missing something? I’m looking for a way to extend this functionality such that I could, for example, pass a single parameter that would fetch the entire history of the conversation, not just the last message.
It seems to get the whole thing. I tried it just now:
User: Hello! This is a test of whether my script’s history is working. In a few turns, I’ll ask you what the magic word is (it’s banana). Don’t mention it until then!
Assistant: Got it! I’ll wait for you to ask.
User: In the meantime, how are you today?
Assistant: I’m just a program, but I’m here and ready to help! How are you doing?
User: Doing OK I guess.
Assistant: I’m glad to hear that! If there’s anything you’d like to chat about or need help with, feel free to let me know.
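For reference, here is a minimal sketch of the kind of chained test above, assuming the Python openai SDK’s Responses API; the model name and prompts are just placeholders:

```python
from openai import OpenAI

client = OpenAI()

turns = [
    "Hello! This is a test of whether my script's history is working. "
    "In a few turns, I'll ask you what the magic word is (it's banana). Don't mention it until then!",
    "In the meantime, how are you today?",
    "Doing OK I guess.",
    "OK, what is the magic word?",
]

previous_id = None  # no prior response on the first turn
for user_text in turns:
    response = client.responses.create(
        model="gpt-4.1",                   # placeholder model
        input=user_text,                   # only the new user message is sent explicitly
        previous_response_id=previous_id,  # server re-attaches the stored history
    )
    print(response.output_text)
    previous_id = response.id              # chain the next turn onto this one
```

In my run, the final answer did come back with the magic word, so the server-side state clearly carries more than the last message.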
Interesting. I was testing last night and did not get this result. My test case was a little more convoluted though. I’ll have to try this out the way you did. Thank you!
Indeed, on each call we have to provide a previous_response_id for it to act as a conversation, but this increases token usage so much… A first request that uses ~3000 tokens grows to ~15000 tokens after a few turns… Are those tokens charged, or are they cached and not charged?
I’m also wondering whether instructions consume tokens on each turn.
It becomes very expensive in my case because I use RAG to provide context as a system message, which stays linked into the previous response on each turn and increases token usage even more.
A workaround could be to store the conversation and its messages myself and pass only the last N messages as user/assistant messages, but doing this makes previous_response_id irrelevant, so I’m not sure it’s the way to go.
Has anyone faced these issues?
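For what it’s worth, a rough sketch of that rolling-window workaround, assuming the Python openai SDK and Chat Completions-style message lists; the window size, system prompt, and model are placeholders:

```python
from openai import OpenAI

client = OpenAI()

MAX_TURNS = 6          # keep only the last N user/assistant messages (placeholder value)
SYSTEM_PROMPT = "You are a helpful assistant."  # RAG context could be injected here each turn

history = []           # stored on our side instead of relying on previous_response_id

def ask(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    # Send only the system prompt plus the last N messages to cap per-turn input tokens.
    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + history[-MAX_TURNS:]
    completion = client.chat.completions.create(
        model="gpt-4.1",   # placeholder model
        messages=messages,
    )
    answer = completion.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer
```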
Yes, you have discovered that Responses is not production grade. Neither its server-side chat state nor its internal tools are ultimately useful.
Repeated inputs still would not be “free”, as your input context just grows and grows.
The cache discount that is offered works like this: for a prompt over 1024 tokens, if the request is routed to the same server and the cache has not expired within a 5-60 minute service window, you can receive a 50% discount on the input prefix that is in common, counted in 128-token increments (75% on gpt-4.1).
It would be possible for a backend to persist a prior k-v cache context window along with the response id, but that is not what OpenAI offers you. Close analysis of latency trials also shows that caching behavior is present even below the threshold for receiving a discount, but no discount is applied for you.
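As a rough illustration of how that discount plays out on a grown conversation (the per-token price here is a made-up placeholder, not a real rate):

```python
# Hypothetical numbers, just to show the shape of the billing:
PRICE_PER_INPUT_TOKEN = 2.00 / 1_000_000  # placeholder rate, not an actual price
CACHED_PRICE_FACTOR = 0.50                # cached (common-prefix) tokens billed at 50%

input_tokens = 15_000        # accumulated conversation after a few turns
cached_prefix = 12_800       # common prefix from the prior turn, counted in 128-token steps

uncached = input_tokens - cached_prefix
discounted = uncached * PRICE_PER_INPUT_TOKEN + cached_prefix * PRICE_PER_INPUT_TOKEN * CACHED_PRICE_FACTOR
full_price = input_tokens * PRICE_PER_INPUT_TOKEN
print(f"with cache discount: ${discounted:.4f}  without: ${full_price:.4f}")
```

Even in the best case, you still pay for every input token on every turn; the cache only cheapens the repeated prefix, it does not remove it from the bill.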
Instructions are also input tokens, and altering instructions, which are the first message you can control, would be cache-breaking.
A real workaround would be a “threshold” API parameter that actually let you decide between capping the billing and getting an error when sending a model a million tokens, but OpenAI offers no length management. Or simply don’t use the technology; use Chat Completions.
Hi, I hope you’re doing well. I have a question regarding the use of the previous_response_id parameter in OpenAI.
If I use previous_response_id, will it significantly impact the cost?
Is there an optimized way to pass conversation history manually?
I want to understand: does using previous_response_id reduce the cost compared to manually passing the entire conversation history in each API call, as we used to do?
Also, does the previous_response_id parameter work with the Chat Completions API method and in Structured Data Parsing as well?
Thanks, and could you please let me know whether there is any limit to how much conversation history is retained when using the previous_response_id parameter?
The stored response is just a way for the entire chat to be resent, without any interface for self-management and without any value you can pass for the budget you want to spend.
The limit is that the entire conversation is retained; nothing is discarded until you would exceed the total context window of the AI model you are using. At that point you get an API error unless you switch to sending “truncation”: “auto” as an API parameter, which finally allows the chat to continue, discarding turns only at the model’s maximum, not at any setting you can send to limit your cost per turn.
Basically, the limit is the chosen model’s input-token context window applied to the accumulated conversation history.
What previous_response_id does is just retrieve the previous requests and reconstruct a new one internally; if the chosen model can’t handle the resulting context window, the request will fail as it would in any normal request.
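For completeness, a minimal sketch of that truncation setting, again assuming the Python openai SDK’s Responses API; the model name and response id are placeholders:

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1",                      # placeholder model
    input="Continue our conversation.",
    previous_response_id="resp_abc123",   # placeholder id from an earlier turn
    truncation="auto",                    # drop older turns at the context limit instead of erroring
)
print(response.output_text)
```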