I’m comparing the Completions API with the new Responses API.
With Completions, I send the full conversation history in each request. With Responses, I only reference the previous message using previous_response_id, so I expected lower input token usage.
But in practice, I’m seeing almost the same number of input tokens on both sides.
Does the Responses API still internally process the full history? I thought it would manage context more efficiently and reduce cost.
Any insights?
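
Here's roughly what I'm testing, as a simplified sketch with the openai Python SDK (the model name and prompts are just examples):

```python
from openai import OpenAI

client = OpenAI()

# Chat Completions: resend the whole history every turn
history = [{"role": "user", "content": "Hello"}]
chat = client.chat.completions.create(model="gpt-4o-mini", messages=history)
history.append({"role": "assistant", "content": chat.choices[0].message.content})
history.append({"role": "user", "content": "Tell me more"})
chat2 = client.chat.completions.create(model="gpt-4o-mini", messages=history)
print("completions input tokens:", chat2.usage.prompt_tokens)

# Responses: pass only the new turn plus previous_response_id
first = client.responses.create(model="gpt-4o-mini", input="Hello")
second = client.responses.create(
    model="gpt-4o-mini",
    input="Tell me more",
    previous_response_id=first.id,
)
print("responses input tokens:", second.usage.input_tokens)  # ~same count
```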
Exactly, it is still the same input context.
Just a bit more convenient.
The benefit of previous_response_id is that you don't have to store your own context; all you need is the ID. It still stands that whenever the context changes, the entire thing has to be run through the LLM all over again, and you're billed for all input tokens on each iteration.
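
To make this concrete, here's a minimal sketch (openai Python SDK, example model name) showing that input_tokens still grows with every turn even though you only send the newest message:

```python
from openai import OpenAI

client = OpenAI()

prev_id = None
for question in ["What is an API?", "Give an example.", "Now in Python."]:
    response = client.responses.create(
        model="gpt-4o-mini",
        input=question,
        previous_response_id=prev_id,  # None on the first turn
    )
    # input_tokens counts the whole reconstructed history, not just `question`
    print(question, "->", response.usage.input_tokens, "input tokens")
    prev_id = response.id
```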
If your prompt is 1024 tokens or longer, you may benefit from prompt caching discounts on input tokens, but this happens automatically and regardless of whether you use the new Responses API. Other ways to reduce cost include downgrading to a cheaper model, distillation (into another OpenAI model), optimizing your prompts and output format, and using the Batch API or flex processing.
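
If you want to check whether the cache is kicking in, the usage object breaks it out. A sketch, assuming the openai Python SDK (cached_tokens only becomes nonzero once a stable prompt prefix passes the ~1024-token minimum):

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4o-mini",
    input="...a long, stable prompt prefix plus the new user turn...",
)
usage = response.usage
print("input tokens:", usage.input_tokens)
# Tokens served from the prompt cache are billed at a discounted rate
print("cached tokens:", usage.input_tokens_details.cached_tokens)
```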