I am making calls to the API from Python via the openai library.
I have included a timer in my request loop so that I never exceed the rate limits.
After a few hours of running the code without issues, I got the following error message:
openai.error.APIConnectionError: Error communicating with OpenAI: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
which comes from line 528 of api_requestor.py.
My question is: why is this happening? Is it due to a temporary drop in my internet connection? Is there anything within my control to prevent it, or is this entirely down to instability or overload on the API side? It is very frustrating when it happens, because I lose all my output data: it had not yet been saved to disk when the error was raised.
Based on the GPT-4 scaling I imagine it’ll be a bit of a bumpy ride until it’s sorted.
Buckle up folks, technology is changing fast and these interruptions are expected.
I was getting bad gateway issues during yesterday’s outage.
Today it’s mostly the Remote end closed connection without response error above, (104, 'Connection reset by peer'), or a timeout error like here.
Are all of these errors due to some systemic issue on the API side?
I really encourage you to include extra layers such as retries with backoff, fallback strategies and so on (if you’re not doing it yet). This technology is new, and OpenAI’s engineering team is doing an AMAZING job in scaling their servers up to cover all the increasing demand that they’re getting. But this is still to be expected: things can fail, and we (developers) are the ones responsible for coming up with sound strategies for when they do.
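To make that concrete, here’s a stripped-down sketch of what I mean by a backoff-plus-fallback layer (the model names, retry counts and delays are placeholders I picked for illustration, not anything official):

import time
import openai

def complete_with_fallback(prompt, models=("gpt-4", "gpt-3.5-turbo"), tries_per_model=3):
    # Try each model in order; back off between attempts, then fall back to the next one.
    for model in models:
        delay = 2
        for attempt in range(tries_per_model):
            try:
                return openai.ChatCompletion.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                )
            except (openai.error.APIConnectionError,
                    openai.error.RateLimitError,
                    openai.error.Timeout,
                    openai.error.ServiceUnavailableError) as e:
                print(f"{model} attempt {attempt + 1} failed: {e}")
                time.sleep(delay)
                delay *= 2  # exponential backoff
    raise RuntimeError("All models and retries exhausted")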
Still happening a bit on 3/22, especially during US work hours. For sure build some safeguards into your code while the poor IT and DevOps folks at OpenAI experience the biggest and fastest scaling challenge anyone has ever faced.
I’m using @backoff.on_exception(backoff.expo, openai.error.RateLimitError) from the backoff library, but today I still see APIConnectionError and timeouts. Can you suggest how to account for these errors so that my loop of requests does not break?
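In case it clarifies what I’m after, this is roughly what I’m thinking of trying: listing the connection and timeout exceptions in the same decorator (the extra exception classes and the max_tries/max_time values are my own guesses, not an official recommendation):

import backoff
import openai

@backoff.on_exception(
    backoff.expo,
    (
        openai.error.RateLimitError,
        openai.error.APIConnectionError,
        openai.error.Timeout,
        openai.error.ServiceUnavailableError,
    ),
    max_tries=8,
    max_time=300,
)
def completion_with_backoff(**kwargs):
    # Thin wrapper so every call in the loop gets the same retry policy.
    return openai.Completion.create(**kwargs)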
I’m based in the EU and have run into these aborted connections a lot over the last week. It’s definitely correlated with US working hours…
I’d happily pay more (2-4x the token rate) for a more reliable endpoint at this stage, but I don’t see that option. I hope the team can figure this out soon, and in the meantime I’ll be implementing various retry mechanisms like @AgusPG recommended above.
Following up on this topic, in case it helps: there’s an API that lets you do a quick health check of every OpenAI model, so you can make your request strategy depend on it. It’s also pretty easy to implement a health-check service like this yourself, by doing dumb API calls from time to time. But in case you want to try it out, folks, you can check it here.
Yep, that’s pretty much it. Say that you have a client timeout of 30s per model. Models X and Y are down. It takes you 1 minute to get to model Z and get a completion out of it. This is a killer for conversational interfaces, where the user will just run away if they don’t have their answer quickly.
Pinging the models in advance and keeping a logbook of the health of each model prevents you from continuously trying to get completions from models that are having an outage. So you go straight for model Z (and only retry on it) if you suspect that models X and Y are having an outage.
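For what it’s worth, a minimal sketch of that logbook idea could look something like this (the model names are placeholders, and a real service would refresh the logbook on a schedule rather than on demand):

import openai

# Last known health per model, refreshed by cheap "ping" completions.
health = {"gpt-4": True, "gpt-3.5-turbo": True}

def ping(model):
    try:
        openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,
        )
        return True
    except Exception:
        return False

def refresh_health():
    for model in health:
        health[model] = ping(model)

def pick_model():
    # Go straight for the first model that looked healthy on the last check.
    for model, ok in health.items():
        if ok:
            return model
    raise RuntimeError("No healthy model in the logbook")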
Just adding a “solution” I’ve found. I tried to catch different specific errors, but there are so many different errors the platform can throw (such as timeout, remote disconnection and bad gateway, just to mention a few) that it’s best to just catch Exception broadly for now (although not ideal). I’ve found this to work quite well for me:
import time
import openai

for sample in samples:
    inference_not_done = True          # reset the flag for every sample
    while inference_not_done:
        try:
            response = openai.Completion.create(...)   # same call as before
            inference_not_done = False
        except Exception as e:
            print("Waiting 10 minutes")
            print(f"Error was: {e}")
            time.sleep(600)
I do not agree with catching generic exceptions; it’s a bad practice. Also, you do not want to handle all your exceptions in the same way. There are some errors where it’s worth retrying, some others where it’s worth falling back, and some others that you should never retry.
You can customize your app to handle exceptions based on status codes instead (such as “retry with this specific payload for all 5xx errors”). For instance, in Python, aiohttp_retry does a pretty decent job here. It’s the one that I’m currently using.
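As an illustration (not my exact production code), a status-code-based retry with aiohttp_retry could look roughly like this; the endpoint, model and retry parameters are just placeholders:

import asyncio
import os

from aiohttp_retry import ExponentialRetry, RetryClient

async def complete(prompt):
    # Retry up to 5 times with exponential backoff, but only on 5xx responses.
    retry_options = ExponentialRetry(attempts=5, statuses={500, 502, 503, 504})
    async with RetryClient(retry_options=retry_options) as client:
        async with client.post(
            "https://api.openai.com/v1/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            json={"model": "text-davinci-003", "prompt": prompt, "max_tokens": 64},
        ) as resp:
            resp.raise_for_status()
            return await resp.json()

result = asyncio.run(complete("Say this is a test"))
print(result["choices"][0]["text"])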