Responses API: create run -> fetch result?

The Assistants API, which will be deprecated in a year, lets you create a run that executes asynchronously and then query for the run result using the run ID.
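For context, the pattern I mean looks roughly like this with the openai Python SDK (the thread and assistant IDs here are just placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Kick off a run on an existing thread; it executes asynchronously on OpenAI's side.
run = client.beta.threads.runs.create(
    thread_id="thread_abc123",
    assistant_id="asst_abc123",
)

# Later, possibly from a different process, poll the run by its ID
# (after persisting run.id somewhere).
run = client.beta.threads.runs.retrieve(
    thread_id="thread_abc123",
    run_id=run.id,
)
if run.status == "completed":
    messages = client.beta.threads.messages.list(thread_id="thread_abc123")
```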

Is this possible with the Responses API? As far as I could tell, it isn't immediately obvious how to do that, if it's possible at all.

Many of us are running in serverless environments (Vercel / Lambda / Cloud Run) and:

  1. the cost of keeping a function alive just to wait on a response would be orders of magnitude lower if we didn't have to do that, and
  2. when redeploying code, providers typically force containers to stop after a short grace period, so pending requests to OpenAI would be lost without additional, more complex architecture (messaging or database state management).

(And that's setting aside the fact that some providers don't even allow hobbyists a long enough timeout.)

Responses is, like Chat Completions, inherently a keep-the-connection-open service, whether you are waiting for a final HTTP result or have enabled a stream of events.

So it is not possible.

The statefulness comes from being able to send a past response ID into a follow-up request, not from an independent chat session that actions can be kicked off against.


There is an endpoint for retrieving the object and output of a past completed response by its response_id, but that ID is only known, and the content only obtainable, once the initial responses call has succeeded.
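In other words, roughly this with the Python SDK (model name is just a placeholder); the retrieval only helps once the first call has already returned:

```python
from openai import OpenAI

client = OpenAI()

# The response ID only becomes known from the return value of this call,
# which itself requires holding the connection open until it finishes.
response = client.responses.create(
    model="gpt-4o",          # placeholder model name
    input="Write a short haiku about the sea.",
    store=True,
)

# Later, while the stored object is retained, it can be re-fetched by ID.
fetched = client.responses.retrieve(response.id)
print(fetched.output_text)
```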

Closing the connection on a stream stops generation on Chat Completions, and likely does the same here, so the idea of grabbing the first stream event, which carries the response ID, and then closing the connection would likely not work. There is a field documented as "Details about why the response is incomplete," but without more documentation that could just be reporting "max tokens too small" rather than serving as a run status.
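If you wanted to test that idea anyway, the attempt would look something like this sketch (assuming the current Python SDK's streaming event names):

```python
from openai import OpenAI

client = OpenAI()

stream = client.responses.create(
    model="gpt-4o",          # placeholder model name
    input="Produce a long, detailed plan...",
    store=True,
    stream=True,
)

response_id = None
for event in stream:
    # The very first event carries the response object and therefore its ID...
    if event.type == "response.created":
        response_id = event.response.id
        break
# ...but dropping the connection here most likely cancels generation,
# so retrieving response_id later would not return a finished answer.
```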

There is talk of bringing the Responses API up to feature parity with Assistants, but that might not extend to kicking off runs with events or triggers alone.

Thanks for the response! You can see why this would be a challenge for developers, correct?

Would you recommend using the Assistants API for now despite it being slated for deprecation?

Assistants has drawbacks, especially in contrast to what you can build yourself on Responses when you send your own list of input messages every time instead of relying on state persistence: conversation-length management and tailored knowledge tools you can infuse the AI with.

However, it seems like the quick, independent calls of Assistants, even though they add delays and require your own orchestration, still hold an advantage for you.

The main impediment in many hosting environments is hitting a platform timeout while waiting for a long reasoning response to be returned. The additional events that Responses now emits when streaming may serve as more of a keep-alive if you are otherwise force-closed after 60 seconds of inactivity.

Again, thank you for the reply.

Could I get your input on my situation? When users make their first request, they essentially generate a long and complex plan; for example, imagine it's an exercise plan (I can solve the long-timeout problem myself).

Then they can follow up on that plan with a virtual trainer, at which point they switch to having a conversation, and the trainer should have ongoing context about the user's plan and their previous conversation.

Ideally, I want to stick with the API that's going to have long-term support, so I think that's the Responses one, but I'm trying to understand how to implement this use case. Do I set the store flag on every message the user sends? And for each new message, do I give the API the previous message ID?

To track a chat session, you’d really just need to record the latest response ID, unless you want chat editing to be possible (restarting at an earlier point).

A stored request from the past has 30-day persistence.

So, in theory, you could load your context with the system message, static documentation, and whatever else makes up the non-user-specific part of your assistant, and then just pass it a minor question like "please introduce yourself".

Then you’d have a previous response ready for reuse for a while.
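A minimal sketch of that priming idea plus per-turn chaining with the Python SDK (the model name, the role used for the static context, and the message contents are just placeholder assumptions):

```python
from openai import OpenAI

client = OpenAI()

# One-time "priming" call: the system message, the generated plan, and other
# static context go into the stored conversation, plus a trivial user turn.
primed = client.responses.create(
    model="gpt-4o",          # placeholder model name
    input=[
        {"role": "system", "content": "You are a virtual trainer. The user's exercise plan is: ..."},
        {"role": "user", "content": "Please introduce yourself."},
    ],
    store=True,
)
last_id = primed.id  # persist this per user, e.g. in your database

# Each follow-up turn only needs the newest user message plus the last ID.
turn = client.responses.create(
    model="gpt-4o",
    input="How should I adjust week 2 if I miss a workout?",
    previous_response_id=last_id,
    store=True,
)
last_id = turn.id  # record the latest ID for the next turn
```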

That may give you some ideas about its utility. For your concern, though, it doesn't save much over simply using the "instructions" parameter now offered, or over the full input list of messages required to produce a unique answer, since the generation time you report as costly still mainly depends on the output token length.
