Handling Run Failures and Rate Limits

I have two questions regarding this:

  1. If a run fails, do the messages that were already added to the thread by that run get deleted? I had a run that first added a message to the thread, then called a few tools, and then failed with a rate_limit_exceeded error after I called submit_tool_outputs. When I checked the thread again, I didn't find the message the run had added before the error.

  2. When a run fails, isn't there any way to, for example, wait a few seconds and then resume it, in case the cause was exceeding the TPM or RPM limit?

You can get your rate-limit status from an API call's response headers.
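For example, the openai Python SDK exposes those headers through `with_raw_response`. A minimal sketch, assuming the standard `x-ratelimit-*` header names; the duration format of the reset headers (e.g. `6m0s`) is an assumption based on observed responses:

```python
import re

# Reading rate-limit headers with the openai Python SDK (sketch):
#
#   from openai import OpenAI
#   client = OpenAI()
#   raw = client.chat.completions.with_raw_response.create(
#       model="gpt-4o", messages=[{"role": "user", "content": "hi"}]
#   )
#   remaining = int(raw.headers["x-ratelimit-remaining-tokens"])
#   wait_secs = parse_reset(raw.headers["x-ratelimit-reset-tokens"])

def parse_reset(duration: str) -> float:
    """Parse a reset duration such as '1s', '6m0s', or '250ms' into seconds.
    (Format assumed from observed x-ratelimit-reset-* header values.)"""
    seconds = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
    return sum(
        float(value) * seconds[unit]
        for value, unit in re.findall(r"(\d+(?:\.\d+)?)(ms|s|m|h)", duration)
    )
```

Note this works for direct chat completions calls, which is exactly the catch discussed below: the model calls that Assistants makes internally never pass through your code, so you never see their headers.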


However, the Assistants API doesn't pace itself around your rate limits, and can make multiple model calls in quick succession, leading to failure at your expense. The failure behavior isn't documented, but discarding any AI responses that were produced internally makes sense, because there is no "resume" method for a run.
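Since a run cannot be resumed, the closest workaround is to detect the failure and start a fresh run after a pause. A minimal sketch: the Run object's `status` and `last_error` fields are part of the Python SDK, while `start_run` is a hypothetical callable you supply that creates a run on the thread and polls it to a terminal state:

```python
import time

def run_with_retry(start_run, max_attempts=3, delay=60.0, sleep=time.sleep):
    """If a run fails with rate_limit_exceeded, wait out the rate-limit
    window and start a *new* run (there is no resume).

    start_run: hypothetical callable that creates and polls a run to a
    terminal state, returning an object with .status and .last_error
    (as Run objects in the openai Python SDK have).
    """
    run = None
    for attempt in range(max_attempts):
        run = start_run()
        error = getattr(run, "last_error", None)
        if run.status != "failed" or error is None or error.code != "rate_limit_exceeded":
            return run  # completed, or failed for a non-retryable reason
        if attempt < max_attempts - 1:
            sleep(delay)  # let the TPM/RPM window reset before retrying
    return run
```

Note this retries the whole run, so any messages the failed run added internally are simply regenerated; it does not recover them.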

Setting tier-1 recently to such a low limit was obviously motivated by greed to get people paying more to increase the unworkable amount granted to those who have paid under $50.

If you receive a tool call, that is the only point where you can delay the AI's operation yourself into the next minute. However, for a user waiting for a response, that would make for a poor experience.
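A sketch of that delay, assuming the openai Python SDK's `submit_tool_outputs` call, and noting that a pending run only stays valid for a limited time (roughly ten minutes) before it expires, so the delay must stay well under that:

```python
import time

def submit_tool_outputs_delayed(client, run, tool_outputs, delay=60.0, sleep=time.sleep):
    """Hold the tool results until the next rate-limit window, then return
    them so the run can continue.

    client: assumed to be an openai.OpenAI instance (Assistants API).
    run: the Run object currently in requires_action status.
    Keep delay well under the run's expiry (~10 minutes) or the run dies.
    """
    sleep(delay)  # push the model's next call into the next RPM/TPM minute
    return client.beta.threads.runs.submit_tool_outputs(
        run_id=run.id,
        thread_id=run.thread_id,
        tool_outputs=tool_outputs,
    )
```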

Even delaying the tool call is not the ideal solution, but is that even achievable in Python? I do not see how to access the rate-limit headers in the Assistants API, which the delay logic would depend on.

I have added a 60-second delay on "rate_limit_exceeded" plus retry functionality to my API, but every subsequent request on the thread still fails with "rate_limit_exceeded", even if I set the delay to e.g. 5 minutes. Anyone else experiencing this?

I have had this; it tends to happen when you have not only exceeded your rate limit but also your daily limit, on those endpoints that have one, e.g. vision preview.

Thanks Fox, but I am only using a basic assistant with gpt-4o, which unfortunately rules out your suggestion. If I start another process from scratch, it works fine until 20k+ total tokens have been processed; anywhere after that point, the "rate_limit_exceeded" loop starts. However, if I add the 60-second delay before each request from the start of the process, I do not get this error and the task finishes. It feels like once you hit "rate_limit_exceeded", the thread is essentially locked.

It sounds like you are growing the thread in size by adding more messages.

If you do not employ the new truncation_strategy run API parameter to limit the number of past turns, the chat history sent to the model on each call can grow beyond the paltry limit given to tier-1 users. Assistants will try to fill the 128k model's input context with past chat, with no concern that you have a 30k-token rate limit, where any single API call to the model above that size will be denied.
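A sketch of using that parameter, assuming the openai Python SDK; the helper just builds the documented truncation_strategy object, and the IDs in the usage comment are placeholders:

```python
def truncation_params(last_n: int) -> dict:
    """Build the truncation_strategy run parameter so only the most
    recent last_n thread messages are sent to the model on each call,
    keeping the per-call token count under a tier-1 rate limit."""
    return {"type": "last_messages", "last_messages": last_n}

# Usage with the SDK (thread/assistant IDs are placeholders):
#   run = client.beta.threads.runs.create(
#       thread_id="thread_abc",
#       assistant_id="asst_abc",
#       truncation_strategy=truncation_params(10),
#   )
```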

If you were to use a model like gpt-3.5-turbo, you'd not hit such a limitation, because its 60k-token rate limit for tier-1 easily accommodates the maximum of 16k tokens that can be sent to the model. Assistants automatically shrinks the chat history from threads to fit the model.

Thanks Jay, moving up a tier fixed the issue, but I still don’t understand why:

Case 1: Set 60 sec delay between requests → Works
Case 2: Send requests until you get “rate_limit_exceeded”, then introduce 60 sec delay for subsequent requests → Locks the thread indefinitely.

The request token sizes are identical in both cases.

So far, in Case 1 I was unable to even reach the "rate_limit_exceeded" error. And in Case 2, I was unable to "unlock" the thread once it hit "rate_limit_exceeded", even when I tried to interact with it the next day.