The Responses endpoint is not published as a “beta”: it doesn’t take a beta header or carry “beta” in the SDK method name (unlike Assistants, which seems destined never to leave that status). So we must regard it as production-ready and as feature-complete, parameter-wise, as Chat Completions (which also just got new API input parameters).
Responsiveness
There are two metrics I would consider:
- time to first token (latency)
- token production rate (response time varies by length)
Both are primarily determined by the model employed. A delegating endpoint such as Assistants or Responses (or even Batch) wraps the underlying model in its own operations, which can add some processing time.
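To make the two metrics concrete, here is a minimal sketch of measuring both on a streaming call, assuming the current openai Python SDK; the model name and prompt are just placeholders.

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4.1-mini",  # assumption: substitute whatever model you are measuring
    messages=[{"role": "user", "content": "Write three sentences about latency."}],
    stream=True,
)

first_token_at = None
deltas = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # metric 1: time to first token
        deltas += 1  # each content delta roughly tracks one or a few tokens
end = time.perf_counter()

print(f"time to first token: {first_token_at - start:.3f}s")
print(f"production rate: {deltas / (end - first_token_at):.1f} deltas/s")  # metric 2
```

Run it a few times; a single sample tells you little about a shared inference fleet.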
Responses
Server-side chat state requires a database lookup on OpenAI’s end, but it can save you network transmission time. The impact of reusing prior responses is hard to isolate, and it would show up in the latency, but you can compare it against managing the conversation yourself and sending the full history each time.
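A rough way to try that comparison yourself, assuming the Responses API parameters as they appear in the current SDK (previous_response_id, store); timings bounce around, so repeat the pair of calls and average.

```python
import time
from openai import OpenAI

client = OpenAI()

# Turn 1, stored server-side so it can be referenced later
t0 = time.perf_counter()
first = client.responses.create(
    model="gpt-4.1-mini",  # placeholder model
    input="Name a color.",
    store=True,
)

# Turn 2a: server-side chat state via previous_response_id
t1 = time.perf_counter()
client.responses.create(
    model="gpt-4.1-mini",
    input="Now name another.",
    previous_response_id=first.id,
)
t2 = time.perf_counter()

# Turn 2b: self-managed conversation, resending the history yourself
client.responses.create(
    model="gpt-4.1-mini",
    store=False,
    input=[
        {"role": "user", "content": "Name a color."},
        {"role": "assistant", "content": first.output_text},
        {"role": "user", "content": "Now name another."},
    ],
)
t3 = time.perf_counter()

print(f"server-side state: {t2 - t1:.3f}s  vs  self-managed: {t3 - t2:.3f}s")
```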
Classifying performance
Libraries generally don’t expose the actual transmission time of your full API request (httpx, which the openai Python SDK uses, does provide `response.elapsed`), but you can be a bit tricky and yield the JSON body through a generator to see how fast it is consumed.
That total elapsed time can be compared to the OpenAI-reported generation time returned in an undocumented response header, which should be shorter, or to a timer you place around making the API call, producing its stream iterator, or finishing the stream. They’ll probably all agree to within milliseconds. Any of these techniques lets you calculate the token production rate, and you can also see whether there are long pauses between deltas, possibly an effect of output content inspection for recitation (copyright reproduction).
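Here is a sketch of one such comparison, assuming the SDK’s with_raw_response accessor and the “openai-processing-ms” header; the header is undocumented, so treat its presence and exact meaning as an assumption.

```python
import time
from openai import OpenAI

client = OpenAI()

t0 = time.perf_counter()
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4.1-mini",  # placeholder model
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
elapsed = time.perf_counter() - t0                    # your total round trip
completion = raw.parse()                              # the usual ChatCompletion object
server_ms = raw.headers.get("openai-processing-ms")   # OpenAI-reported generation time

print(f"client round trip: {elapsed * 1000:.0f} ms, server reported: {server_ms} ms")
print(f"completion tokens: {completion.usage.completion_tokens}")
```

The gap between the two numbers approximates your network plus endpoint overhead; dividing completion tokens by the generation time gives the production rate.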
So: figure out whether your complaint is about total model time, which can simply be a randomly slow model running in Redneck Joe’s A40 server shack (the mini and nano models suffer this more often), or about startup time, which you can affect by your choice of endpoint (vs Chat Completions) or by your use of a previous response.
Initial latency discovery here implies streaming. The AI will seem more responsive to an end user if you show tokens as they are produced.
Recommendation:
- don’t build on Assistants: it will be wasted effort and lost conversations when it is shut down;
- don’t use Responses server-side chat state (previous response ID or “store”) unless a particular tool use demands it; it has no working budget limitation besides the model’s maximum input.
BTW: the AI model itself hasn’t been mentioned, but it is a primary concern. Reasoning models such as o4-mini or o3 add their own internal thinking time before responding, based on the difficulty of the task and your requested effort.