We have a production system built on top of the Assistants API and I’ve just finished rewriting it to use the new Responses API. For those who might be interested, I am sharing my experiences.
For reference, ours is a chat application with hundreds of different RAG-based assistants, tens of thousands of users, and thousands of files in each vector store. Conversation history is important in our application. We are built primarily on the gpt-4o-mini model for cost reasons.
Our experiences up until this change were:
After a lot of work on instructions, our response times have been acceptable (5-10s to start streaming, 10-20s to finish streaming about 90% of the time). We would prefer faster, but can accept these numbers.
The vector stores are flaky. Uploading files works, but we maintain a database of what we have uploaded, and when we list files via the API we often find the two are out of sync and have to delete and re-upload to correct it. Also, the time to upload a file and get it processed is slower than we'd like: 1-10s per file.
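For anyone curious, the reconciliation itself is just a diff between our own records and what the list endpoint returns. A minimal sketch (knownFileIds stands in for our own bookkeeping, and in the Assistants-era Node SDK vector stores live under the beta namespace; paths may differ in newer versions):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Compare the files OpenAI reports for a vector store against the IDs we
// recorded locally when uploading. knownFileIds is our own bookkeeping,
// not part of the SDK.
async function findDrift(vectorStoreId: string, knownFileIds: Set<string>) {
  const remoteIds = new Set<string>();
  for await (const file of openai.beta.vectorStores.files.list(vectorStoreId)) {
    remoteIds.add(file.id);
  }
  return {
    missingRemotely: [...knownFileIds].filter((id) => !remoteIds.has(id)),
    unknownLocally: [...remoteIds].filter((id) => !knownFileIds.has(id)),
  }; // candidates to delete and re-upload
}
```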
The Assistants API is fairly convenient. Our mapping between assistants, threads, and messages has worked well for us.
The absence of metadata on vector store files has been a hassle, but we’ve lived with it.
The transition to the Responses API took only about two hours of work. The structure of the API is completely different, but it's simple and straightforward, and since we use TypeScript, the type definitions helped immensely in making the process smooth. We had a streaming client up and running almost immediately, with minimal problems.
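For a sense of scale, the core of that streaming client is not much more than this (a sketch using the official Node SDK; exact event names and shapes may vary with SDK version):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Minimal streaming call against the Responses API: print text deltas as
// they arrive, then note completion.
async function streamAnswer(question: string) {
  const stream = await openai.responses.create({
    model: "gpt-4o-mini",
    input: question,
    stream: true,
  });

  for await (const event of stream) {
    if (event.type === "response.output_text.delta") {
      process.stdout.write(event.delta); // incremental chunk of the answer
    } else if (event.type === "response.completed") {
      console.log("\n[done]");
    }
  }
}
```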
Here’s what I learned:
The Responses API uses the same vector stores we already had, so nothing changed on that front. I noticed that batch uploads are now supported, which we'll want to come back to and exploit. It appears that vector store files now support metadata as well, which we can also take advantage of, but for now we are still using the infrastructure we already built to deal with its absence.
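If you haven't tried batch uploads, the SDK helper makes them close to a one-liner. A sketch (in recent versions of the Node SDK vector stores have moved out of the beta namespace; the file paths are obviously illustrative):

```typescript
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Upload several files to an existing vector store in one batch and wait
// for processing to finish. Older SDK versions expose this under
// openai.beta.vectorStores instead.
async function uploadBatch(vectorStoreId: string, paths: string[]) {
  const batch = await openai.vectorStores.fileBatches.uploadAndPoll(vectorStoreId, {
    files: paths.map((p) => fs.createReadStream(p)),
  });
  console.log(batch.status, batch.file_counts); // e.g. "completed", { completed: 3, ... }
  return batch;
}
```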
The 10k file limit in vector stores continues to be a serious problem for us, limiting the customers that we can address.
Happily, it appears that exactly the same instructions that we were using for Assistants are working equally well with the Responses API.
The new system for history/context management works well and replaces threads with no hassle. We are not yet taking advantage of being able to fork conversations or anything else, but I appreciate the new approach.
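In practice it is just one extra parameter per call. A sketch of two chained turns (output_text is the SDK's convenience accessor for the aggregated text):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Instead of a thread, each follow-up points at the previous response and
// the API reconstructs the conversation context for us.
async function twoTurns() {
  const first = await openai.responses.create({
    model: "gpt-4o-mini",
    input: "Summarize our refund policy.",
  });

  const followUp = await openai.responses.create({
    model: "gpt-4o-mini",
    previous_response_id: first.id,
    input: "Now shorten that to two sentences.",
  });

  console.log(followUp.output_text);
}
```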
Happily, the gpt-4o-mini model is supported – even though the documentation implies that it is not. That’s critical for us.
Regrettably, the speed appears to be unchanged. No surprise, I suppose, given that the same file_search tool and the same model are being used. But I had hoped we might see a speed-up from shedding whatever inefficiencies the Assistants layer added.
I want to compliment OpenAI on a nice clean new API for Responses. Admittedly, I’m frustrated that the Assistants API was left in beta for so long, but I admit that the new API is cleaner and better overall. I’m not 100% sure why we still have both Chat Completions and Responses. But at least we won’t also have Assistants to be confused by.
Regarding speed, it sounds like we are having better luck than a lot of others on this forum. Nevertheless, we have also implemented our solution on top of Google Gemini, and it is MUCH FASTER. My impression, in our application, is that responses from OpenAI using gpt-4o-mini are slightly better than those from Google Gemini 2.0 Flash.
That’s good to know.
I've been waiting for the promised migration documentation (best practices for moving from threads to "previous_response_id", for changing assistants, or for adding additional_instructions in some runs). Also, some in this community have raised warnings about token consumption.
I can tell you that in our case, we’re using gpt-4o-mini and it is very inexpensive, even when a lot of input tokens are being used. In our case, we provide different instructions along with every question. We did this using the Assistants API and continue to do that using the Responses API. That all works nicely. (That works a lot better than having a base set of instructions and supplementing them for each new question/message.)
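Concretely, the pattern is just to rebuild the instructions string for every call while chaining the conversation as usual. A sketch (buildInstructions is a hypothetical stand-in for however your app assembles per-question guidance):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Hypothetical helper: assemble the guidance for this particular question.
function buildInstructions(question: string): string {
  return `You are a support expert. Answer only from the attached content. Topic: ${question.slice(0, 80)}`;
}

// Fresh instructions on every call; the conversation is still chained via
// previous_response_id.
async function ask(question: string, previousResponseId?: string) {
  return openai.responses.create({
    model: "gpt-4o-mini",
    instructions: buildInstructions(question),
    previous_response_id: previousResponseId,
    input: question,
  });
}
```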
I just noticed something that I'd like others to be aware of too, and something I'm definitely not happy about!
On the Pricing page, I just spotted that OpenAI has chosen to charge for every call to the file_search tool, but only from the Responses API. That's $2.50 per 1,000 calls. It doesn't sound like a lot, but it adds up. Consider that with gpt-4o-mini as the model, we are paying $0.15 per 1M input tokens, and our typical input uses about 20-40k tokens. That works out to roughly $5 per 1,000 questions/answers. This new charge adds $2.50 to that (since we always need to consult the RAG content).
That’s a sneaky way of increasing the effective price (compared with Assistants) by 50%!
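A back-of-envelope check of that figure, using the numbers above (not official pricing; plug in your own token counts):

```typescript
// Rough cost per 1,000 answered questions with gpt-4o-mini.
const tokensPerQuestion = 30_000; // midpoint of our 20-40k input tokens
const inputCostPer1000 = (tokensPerQuestion / 1_000_000) * 0.15 * 1000; // ~ $4.50
const fileSearchCostPer1000 = 2.5; // $2.50 per 1,000 file_search tool calls
console.log(`+${((fileSearchCostPer1000 / inputCostPer1000) * 100).toFixed(0)}%`); // ~ +56%
```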
I had always appreciated that OpenAI kept its pricing models very simple. At least this should have been clearly stated in the announcement about the new Responses API.
Thank you for sharing.
There's a lot of interesting information in your post. In my case, I'm halfway through migrating from the Assistants API to the Responses and Agents APIs. In either case, what I've been missing (although it can be solved by building a suitable interface layer) is an environment for creating, inspecting, and modifying configurations, like the one assistants have, which helped us keep the different automations we create organized and grouped by functionality. I suppose OpenAI will make this kind of tool available to everyone in the API management dashboard. (You can tell they released the SDK and the Responses API a bit earlier than expected, I imagine in response to the new things that others are presenting.)
You raise an excellent point. The Assistants playground was really really useful – especially when we were first developing our solution. And there is no equivalent now for Responses.
Thank you for sharing!
You're right about metadata management and search; vector stores require quite heavy infrastructure to make them useful in the UI.
One question about the design: how does the UI work for users attaching files in a conversation? Do they attach whole vector stores, or do you create vector stores on the fly for the documents they need?
The latter is more work but I think more cost efficient.
In our application, we are providing what is essentially an expert system that is “trained” on a large block of content – which is the same content for all users of one of our customers. So we create a vector store for each customer, upload all of their content, and then each user doesn’t have anything to upload.
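In Responses terms, that per-customer setup ends up looking roughly like this (a sketch; the IDs are illustrative):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// One vector store per customer; every user question for that customer runs
// file_search against that store.
async function askForCustomer(customerVectorStoreId: string, question: string) {
  return openai.responses.create({
    model: "gpt-4o-mini",
    input: question,
    tools: [
      {
        type: "file_search",
        vector_store_ids: [customerVectorStoreId],
      },
    ],
  });
}
```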
Of course, each application is going to be different in terms of its needs.
I started out using the assistants API, then implemented chat completions and responses so I can use any one of the 3:
My high-points:
Assistants API: slow, clearly outside the range of what I could make a user accept even with tricks. It also occasionally gets stuck with a thread in the running state, which you can cancel, but sometimes the cancelling gets stuck too, leaving throwing away the thread as the only solution (which kind of defeats the purpose of keeping it). A sketch of that workaround is below, after these points.
Chat Completions: simple and fast(er), but I don't feel good about shuttling the entire thread context back and forth. Also, the JSON schema handling for tool functions and types is different (e.g., in the Assistants API I can specify number or string as ["number", "string"], while in Chat Completions I must pick one, which makes a difference where strictness is a concern; an example schema is below). Also, the API endpoint to get message history gives me a 500 error 80% of the time in a way that seems clearly an issue on their end.
Responses: not as fast as Chat Completions (but faster than the Assistants API).
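The stuck-run workaround mentioned above, as a sketch (method paths follow the v4 Node SDK, where threads and runs live under the beta namespace; newer versions may differ):

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Try to cancel a run that is stuck in "queued"/"in_progress"; if the cancel
// itself never settles within the timeout, give up and abandon the thread.
async function cancelStuckRun(threadId: string, runId: string, timeoutMs = 30_000) {
  await openai.beta.threads.runs.cancel(threadId, runId);
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const run = await openai.beta.threads.runs.retrieve(threadId, runId);
    if (["cancelled", "failed", "expired", "completed"].includes(run.status)) {
      return run.status;
    }
    await new Promise((resolve) => setTimeout(resolve, 1_000));
  }
  return "gave_up"; // at this point the thread gets thrown away
}
```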
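And the kind of union-typed parameter the schema point refers to; as described above, this shape was accepted by the Assistants API but had to be collapsed to a single type for strict Chat Completions function calling (the tool itself is purely illustrative):

```typescript
// A function tool whose one parameter accepts either a number or a string.
const lookupTool = {
  type: "function" as const,
  function: {
    name: "lookup_order",
    description: "Look up an order by numeric ID or by reference string.",
    parameters: {
      type: "object",
      properties: {
        order_id: { type: ["number", "string"] }, // union type on a single field
      },
      required: ["order_id"],
      additionalProperties: false,
    },
  },
};
```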
My gut feel: it all (the APIs) feels beta-level. For a company with that amount of money, they could have a more robust and mature API offering; then again, having lots of money can foster a degree of overconfidence and not caring, unless the plan is to achieve AGI and have the AI build the APIs rather than building out a top-tier API team.
I have a question that might be slightly off-topic from what you discussed in this post. I’m using the Assistants API and noticed a huge discrepancy between the responses from the Playground and the ones from the API. Is there any way to resolve this issue?
We've used the playground extensively. At first we also saw big differences, but those turned out to be because we were using it incorrectly. Note that there are settings within the playground that you would normally control via the API: instructions, tools, files, and other settings. These normally match the way you created the corresponding assistant, but they can get out of sync, so check those. In our case, at least, once we understood this, we always got equivalent responses.
Here, on my side, I just pass the assistant's ID when creating the run, based on the message I previously added to the thread. The response that comes back is usually quite vague compared to the one I get through the playground.
Note that the thread is very important here. It is carrying the context for the conversation. So if you pull a question out of context and ask it in a new thread, you’re not going to get the same answer.
Using the Responses API, we use the new feature of passing in the ID of the previous response as a way of automatically populating the context for the next message. It's simple and easy.
I have not used the Agents SDK so cannot comment on that.
Hi, thanks a lot for the detailed write-up! We are currently deciding between Assistants and Responses for a new implementation. My concern with Responses is the lack of "Threads": they make it the developer's responsibility to pass the previous_response_id. But even with that, we are only passing the previous ID with each call and not the entire conversation history.
Isn't that a problem for the LLM getting the whole context of the conversation? What has your experience been with this?