Responses API... not highly responsive (& what about assistants)?

Hello

At times the Responses API takes about 7 seconds (yes, 7) to give a response (when it already has about 5 previous messages in its context, thanks to previous response IDs).

I never tried the Assistants API, as I was waiting for it to be production ready first.

So… is the Assistants API production ready now? I read somewhere that it will be deprecated in early 2026. When is the Responses API expected to be production grade?

Thanks in advance!

Any internet-connected application should assume that a response may never return at all. With that in mind, you also need to keep the user informed of the current status of any in-flight request by some mechanism.

While 7 seconds is a long time for a conversational application, AI requires large amounts of compute shared among millions of other users, so return times will only ever be an estimate.

OpenAI does offer dedicated model instances if that makes economic sense for your use case; the last time I checked, a couple of years ago, the threshold was around 450 million tokens per day.

The Responses endpoint is not published as a “beta”: it does not take a beta header or carry “beta” in the SDK method (unlike Assistants, which seems destined never to leave that status). So we must regard it as production-ready and as feature-complete, parameter-wise, as Chat Completions (which also just got new API input parameters).

Responsiveness

I have two metrics I would consider:

  • time to first token (latency)
  • token production rate (response time varies by length)

Both come from the model employed. A delegating endpoint such as Assistants or Responses (or even Batch) uses the underlying model but adds its own operations, which can contribute some processing time.
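
A minimal way to observe both metrics is to stream and timestamp the deltas. This is an untested sketch: the model name and prompt are placeholders, and the event names follow the Responses streaming event types (“response.output_text.delta”, “response.completed”).

```python
import time
from openai import OpenAI  # official openai Python SDK assumed

client = OpenAI()

start = time.perf_counter()
first_token_at = None
last_delta_at = None
max_gap = 0.0
usage = None

stream = client.responses.create(
    model="gpt-4o-mini",  # placeholder; use whatever model you are measuring
    input="Write three sentences about latency.",
    stream=True,
)

for event in stream:
    now = time.perf_counter()
    if event.type == "response.output_text.delta":
        if first_token_at is None:
            first_token_at = now                         # time to first token
        if last_delta_at is not None:
            max_gap = max(max_gap, now - last_delta_at)  # stalls between deltas
        last_delta_at = now
    elif event.type == "response.completed":
        usage = event.response.usage

total = time.perf_counter() - start
if first_token_at and usage:
    ttft = first_token_at - start
    print(f"time to first token:       {ttft:.2f}s")
    print(f"total time:                {total:.2f}s")
    print(f"output tokens:             {usage.output_tokens}")
    print(f"tokens/second:             {usage.output_tokens / max(total - ttft, 1e-6):.1f}")
    print(f"longest inter-delta pause: {max_gap:.2f}s")
```

Run the same prompt through Chat Completions with an identical timer, and any difference is attributable to the endpoint rather than the model.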

Responses

Server-side chat state requires a database lookup, but it can save you network transmission time. Isolating the impact of reusing prior responses is difficult, and it would affect the latency; the practical comparison is how long the same exchange takes when you manage your own conversation and send it in full each time.
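
To put a number on that, run the same two-turn exchange both ways and time each call. A rough sketch, assuming the current openai Python SDK (the model name is a placeholder; previous_response_id and store are the documented Responses parameters):

```python
import time
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder

def timed(label, **kwargs):
    t0 = time.perf_counter()
    resp = client.responses.create(model=MODEL, **kwargs)
    print(f"{label}: {time.perf_counter() - t0:.2f}s")
    return resp

# Server-side chat state: chain turns with previous_response_id.
r1 = timed("turn 1 (stored)", input="Name a large city.")
timed("turn 2 (previous_response_id)",
      input="Now name another.", previous_response_id=r1.id)

# Self-managed chat state: send the whole conversation each time, store nothing.
history = [{"role": "user", "content": "Name a large city."}]
r3 = timed("turn 1 (self-managed)", input=history, store=False)
history += [{"role": "assistant", "content": r3.output_text},
            {"role": "user", "content": "Now name another."}]
timed("turn 2 (self-managed)", input=history, store=False)
```

Times vary run to run, so repeat the comparison a few times before drawing conclusions.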

Classifying performance

Libraries generally don’t expose the actual transmission time of your full API request (httpx, which the openai Python SDK uses, does give a full “response.elapsed”), but you can be a bit tricky and yield the JSON body through a generator to see how fast it is consumed.

That total elapsed time can be compared to the OpenAI-reported generation time returned in an undocumented header, which should be shorter; or to a timer you wrap around making the API call and producing its stream iterator, or around finishing it. They’ll probably all be within milliseconds of each other. All of these techniques let you calculate the token production rate, and you can also see whether there are long pauses between deltas, possibly an effect of output content inspection for recitation (copyright reproduction).
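
For the header comparison, the SDK’s with_raw_response accessor exposes the underlying HTTP response alongside the parsed object. A sketch, assuming that accessor behaves as expected and that the server-side timing header is the commonly observed openai-processing-ms (not a documented contract):

```python
import time
from openai import OpenAI

client = OpenAI()

t0 = time.perf_counter()
raw = client.responses.with_raw_response.create(
    model="gpt-4o-mini",  # placeholder model
    input="Explain TCP slow start in two sentences.",
)
wall = time.perf_counter() - t0

resp = raw.parse()                                   # the usual Response object
server_ms = raw.headers.get("openai-processing-ms")  # server-reported generation time

print(f"wall clock:           {wall * 1000:.0f} ms")
print(f"openai-processing-ms: {server_ms}")
print(f"output tokens:        {resp.usage.output_tokens}")
if server_ms:
    rate = resp.usage.output_tokens / (int(server_ms) / 1000)
    print(f"tokens/second (server-side): {rate:.1f}")
```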

So: find out whether your complaint is about the total model time, which can just be a random slow model running in Redneck Joe’s A40 server shack (the mini and nano models experience this more), or about the startup time, which you can affect by endpoint choice (vs Chat Completions) or by use of a previous response.

Measuring initial latency this way implies streaming. The AI will also seem more responsive to an end user if you show tokens as they are produced.

Recommendation:

  • don’t build on Assistants: wasted effort and lost conversations when it is shut down;
  • don’t use the Responses chat state with previous response ID or “store”, unless particular tool use demands it. It has no working budget limitation besides the model’s maximum input, so manage the budget yourself (see the sketch below).
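
If you do manage your own conversation, one way to impose a working budget is to trim the oldest turns before each call. A rough sketch, using tiktoken’s o200k_base encoding as an approximate counter (exact tokenization varies by model):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # approximate; exact counts vary by model

def count_tokens(messages):
    return sum(len(enc.encode(m["content"])) for m in messages)

def trim_to_budget(messages, budget=8_000):
    """Drop the oldest turns (keeping the first message, e.g. a developer/system
    instruction) until the conversation fits the working budget."""
    trimmed = list(messages)
    while len(trimmed) > 2 and count_tokens(trimmed) > budget:
        del trimmed[1]  # remove the oldest non-instruction message
    return trimmed
```

Call trim_to_budget(history) right before each request, so input size (and its latency and cost) stays bounded.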

BTW: the AI model is not mentioned, but it is a primary concern. Reasoning models such as o4-mini or o3 have their own internal thinking time before responding, based on the difficulty of the task and your requested reasoning effort.
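
If you are on a reasoning model and latency matters more than depth, you can dial that down with the reasoning parameter. A minimal example (model name and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Lower reasoning effort trades some answer depth for faster responses.
resp = client.responses.create(
    model="o4-mini",
    reasoning={"effort": "low"},  # "low" | "medium" | "high"
    input="Summarize why time to first token grows with reasoning effort.",
)
print(resp.output_text)
```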