I am going to start transitioning from the Assistants API to the Responses API sooner rather than later.
There are currently no docs on how to do this, so I'm trying to figure it out on the fly.
My main issue is understanding the thread_id alternative in the Responses API.
I understand that you can use previous_response_id to continue the conversation, but what is actually being fetched?
My concern is tokens. In my use case with the Assistants API I use file search: this retrieves chunks, adds them to the input, and then generates a response.
With the Responses API I have read that previous_response_id adds ALL previous inputs into the current input. Some of my previous inputs are around 20k input tokens (90% coming from retrieval). If the Responses API adds all previous inputs, that's going to get expensive, fast.
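For context, this is how I understand the chaining is supposed to work (a minimal sketch; the vector store ID is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()

# Turn 1: the response state is stored server-side (store defaults to true).
first = client.responses.create(
    model="gpt-4o",
    input="What does the pricing document say about refunds?",
    tools=[{"type": "file_search", "vector_store_ids": ["vs_placeholder"]}],
)

# Turn 2: previous_response_id pulls the stored conversation back into the
# model's input context. My understanding is that the prior turn, including
# its retrieval output, is re-sent as input and billed again on this call.
second = client.responses.create(
    model="gpt-4o",
    input="Summarise that in one sentence.",
    previous_response_id=first.id,
)
```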
Can someone please shed some light on transitioning from the Assistants API to the Responses API, bearing in mind the above concern?
Would really appreciate some help! Thanks in advance!
I have the exact same setup in the Assistants API and the Responses API, except obviously I have to use previous_response_id on every turn for the Responses API.
Both use file search with the exact same settings.
Same conversation: 5 messages in and out.
Tokens used:
Assistants API: 58,363 total (in/out)
Responses API: 145,809 total (in/out)
I can also see a cached tokens figure in the usage output, which keeps increasing on each turn.
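For what it's worth, this is how I'm reading the usage numbers on each turn (field names as I see them on the Responses usage object; the previous_response_id is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-4o",
    input="Next question in the same conversation...",
    previous_response_id="resp_abc123",  # placeholder id from the previous turn
)

# Per-turn usage: input_tokens keeps growing as the history is re-sent;
# cached_tokens is the part of that re-sent prefix served from the prompt cache.
usage = resp.usage
print(usage.input_tokens, usage.output_tokens, usage.total_tokens)
print(usage.input_tokens_details.cached_tokens)
```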
OK, CLEARLY this is not feasible, for two reasons: context window and cost.
If this is the new way, or the way it is meant to be used, then OpenAI just buried themselves, but I'm sure they can't be that stupid.
I was thinking about making the same transition in my application by replacing the Assistants API with the Responses API.
However, it seems worthwhile to wait a little longer; I believe that in the future the Responses API will have more features that could justify the change.
How is this different from the Assistants API? Clearly the token count is different, but why aren't you limiting it the way you previously would with max_tokens?
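Something like this is what I had in mind (assuming max_output_tokens is the Responses-side equivalent; the previous_response_id is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-4o",
    input="Answer briefly.",
    previous_response_id="resp_abc123",  # placeholder
    max_output_tokens=500,  # limits the number of generated output tokens for this call
)
```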
I started the transition this week too, and found it to be insanely fast and much richer in terms of the context gathered (file_search).
In my app, I'm keeping the Thread model, but instead of holding a bunch of messages, it now holds a bunch of responses (each with a bunch of items).
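Roughly, the shape I'm using looks like this (names are hypothetical, just to show the idea):

```python
from dataclasses import dataclass, field

@dataclass
class StoredResponse:
    response_id: str                                  # id returned by client.responses.create(...)
    items: list[dict] = field(default_factory=list)   # output items: messages, file_search calls, etc.

@dataclass
class Thread:
    thread_id: str
    responses: list[StoredResponse] = field(default_factory=list)

    @property
    def last_response_id(self) -> str | None:
        # Passed as previous_response_id on the next turn in this thread.
        return self.responses[-1].response_id if self.responses else None
```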
The main concern here is the large amount placed into the model's input context by the growing "history" from a sequence of API calls.
It is indeed a concern. There is no way to manage how much you want to spend.
With a large server-side conversation state and no management, you pay for the input once, and if the model invokes a file search, you pay for that input again when the AI is internally given another generation to respond to the search results, which can itself be yet another attempt at a search. With input costs piling up like that, the billing for one call can be larger than the entire model context.
The truncation parameter only serves to keep you running at the maximum context, instead of the default behavior of an API error when you have exceeded the model context (which can happen even internally).
Thus: server-side state is not ready for a "chat" production environment. Even in the best case, where the AI doesn't persist with tool calls, you can potentially pay $0.50 per gpt-4o API call.
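For completeness, the truncation setting I mentioned looks like this; it only avoids the context-overflow error, it does nothing to cap what you are billed for (a minimal sketch; the previous_response_id is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-4o",
    input="Another turn in a long conversation.",
    previous_response_id="resp_abc123",  # placeholder
    truncation="auto",  # drop middle items instead of erroring when the context is exceeded
)
```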