API aborts my connection without a reason - anything I can do?

I am making Python calls to the API via the openai library.
I have included a timer in my request loop so that I never exceed the rate limits.
After a few hours of running the code without issues, I got the following error message:

openai.error.APIConnectionError: Error communicating with OpenAI: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

which comes from line 528 of api_requestor.py.

My question is: why is this happening? Is this due to a temporary loss of my internet connection? Is there anything within my control to prevent it, or is this entirely down to instability or overload on the API side? It is very frustrating when it happens, because I lost all my output data: it had not yet been saved to disk when the error was raised.

9 Likes

Have you solved this? Noticed this yesterday and today as well.

1 Like

Happens to me as well. Did you figure it out?

1 Like

Happened to me as well just now. Also getting bad gateway sometimes.

Same here: 502 Bad Gateway errors, slow response times, connection resets.

Can’t usually do much with a 5xx error.

Based on the GPT-4 scaling I imagine it’ll be a bit of a bumpy ride until it’s sorted.
Buckle up folks, technology is changing fast and these interruptions are expected.

1 Like

I was getting bad gateway issues during yesterday’s outage.
Today it’s mostly the above “Remote end closed connection without response”, (104, 'Connection reset by peer'), or a timeout error like here.

Are all of these errors due to some systemic issue on the API side?

1 Like

If it’s intermittent, in most cases, yes.

Thanks everyone, good to know I’m not the only one getting these issues. It seems it’s just a matter of trying again and again until it works, then.

1 Like

I really encourage you to include extra layers such as retries with backoff, fallback strategies, and so on (if you’re not doing so yet). This technology is new, and OpenAI’s engineering team is doing an AMAZING job of scaling their servers up to cover all the increasing demand they’re getting. But failures are still to be expected. Things can fail, and we (developers) are the ones responsible for coming up with sound strategies for when they do.
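For instance, a retry-with-backoff helper needs only the standard library. This is a sketch, not a prescription; the wrapped call is just a placeholder for whatever request you make:

```python
import random
import time

def retry_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0):
    # Call fn(); on failure, wait exponentially longer (with jitter) and retry.
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.5))
```

In practice you would wrap the actual request, e.g. `retry_with_backoff(lambda: openai.Completion.create(...))` (names as in the pre-1.0 openai library).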

7 Likes

Still happening a bit on 3/22, especially during US work hours. For sure build some safeguards into your code while the poor IT and DevOps folks at OpenAI experience the biggest and fastest scaling challenge anyone has ever faced.

1 Like

I’m using @backoff.on_exception(backoff.expo, openai.error.RateLimitError) from the backoff library, but today I still see APIConnectionError and timeouts. Can you suggest how to account for these errors so that my loop of requests does not break?

You probably just need to set up a try/except statement to handle all the errors.

In the except clause you can either pass out some sort of placeholder (e.g., None) or else re-queue the content for later.
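That suggestion might look roughly like this. It is only a sketch: `call` stands in for your actual request function, and a bare `Exception` is used for brevity where you would normally catch the specific openai error classes:

```python
import collections

def process_all(items, call):
    # Process every item with call(item); failed items go back on the queue
    # a bounded number of times, and permanent failures become None.
    queue = collections.deque((item, 0) for item in items)
    results = {}
    while queue:
        item, attempts = queue.popleft()
        try:
            results[item] = call(item)
        except Exception:                # in practice, catch specific API errors
            if attempts < 2:
                queue.append((item, attempts + 1))  # re-queue for later
            else:
                results[item] = None     # placeholder after repeated failures
    return results
```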

Thanks, so I tried this:

for i in rlist: 
    try: 
        # mycode
        pass
    except TimeoutError:
        print("error")
        continue

But the loop still breaks. Is TimeoutError correct here?

I’m based in the EU and have run into these aborted connections a lot over the last week. It’s definitely correlated with US working hours…

I’d happily pay more (2–4x the token rate) for a more reliable endpoint at this stage, but I don’t see that option. I hope the team can figure this out soon; in the meantime, I’ll be implementing various retry mechanisms like @AgusPG recommended above.

1 Like

Following up on this topic, in case it helps: there’s an API that lets you do a quick health check of every OpenAI model, so you can make your request strategy depend on it. It’s also pretty easy to implement a health-check service like this one yourself, by making cheap API calls from time to time. But in case you wanna try it out, folks, you can check it here.

1 Like

@AgusPG

I like the Try Model X → Try Model Y → Try Model Z → Retry Later approach.

Is there a benefit to Ping Models X, Y, Z → try Model Y if X is down, Model Z if Y is down, etc.?

My only guess is that you could achieve lower overall latencies if you know ahead of time. Is this the only benefit?

Yep, that’s pretty much it. Say that you have a client timeout of 30s per model. Models X and Y are down. It takes you 1 minute to get to model Z and get a completion out of it. This is killer for conversational interfaces, where the user will just run away if they don’t have their answer quickly :rofl:.

Pinging the models in advance and having a logbook of the health of each model prevents you from continuously trying to get completions of models that are having an outage. So you go straight for model Z (and only retry on it) if you suspect that models X and Y are having an outage.

This improves the UX, in my view :slight_smile:
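A rough sketch of such a logbook (the model names, the TTL, and how you ping are all assumptions; you would record the result of each ping or failed completion here):

```python
import time

class ModelLogbook:
    # Track the last known health of each model, refreshed by cheap pings.
    def __init__(self, models, ttl=60.0):
        self.models = list(models)  # preference order: X, Y, Z, ...
        self.ttl = ttl              # how long a ping result stays fresh
        self.status = {}            # model -> (healthy, timestamp)

    def record(self, model, healthy):
        self.status[model] = (healthy, time.time())

    def pick(self):
        # First model not known to be down; stale entries count as unknown.
        for m in self.models:
            healthy, ts = self.status.get(m, (True, 0.0))
            if healthy or time.time() - ts > self.ttl:
                return m
        return self.models[-1]  # everything looks down: last resort
```

With this in place you go straight to the first model whose outage has not been observed recently, instead of burning client timeouts on X and Y every request.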

2 Likes

Just adding a “solution” I’ve found. I tried to capture different specific errors, but there are so many different errors the platform can throw (timeouts, remote disconnections, bad gateways, just to mention a few) that it’s best to use a bare except statement for now (although not ideal). I’ve found this to work quite well for me:

for sample in samples:
    inference_not_done = True  # reset per sample, or only the first one retries
    while inference_not_done:
        try:
            response = openai.Completion.create(...)
            inference_not_done = False
        except Exception as e:
            print("Waiting 10 minutes")
            print(f"Error was: {e}")
            time.sleep(600)
1 Like

I do not agree with catching generic exceptions; it’s a bad practice. Also, you do not want to handle all your exceptions in the same way. There are some errors where it’s worth retrying, some others where it’s worth falling back, and some others that you should never retry.

You can customize your app to handle exceptions on status codes, instead (such as “retry with this specific payload for all 5xx errors”). For instance, in Python aiohttp_retry does a pretty decent job here. It’s the one that I’m currently using.
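As a rough synchronous illustration of that idea (with aiohttp_retry you would configure it via something like `ExponentialRetry(statuses={500, 502, 503, 504})` on a `RetryClient`; `do_request` below is a placeholder returning a status code and body):

```python
import time

RETRYABLE = {500, 502, 503, 504}  # transient server-side errors

def request_with_status_retry(do_request, max_tries=4, base_delay=0.5):
    # do_request() -> (status, body). Retry only on transient 5xx codes;
    # anything else (including 4xx client errors) is returned immediately.
    for attempt in range(max_tries):
        status, body = do_request()
        if status in RETRYABLE and attempt < max_tries - 1:
            time.sleep(base_delay * 2 ** attempt)
            continue
        return status, body
```

The point is the split: a 502 gets a backoff-and-retry, while a 401 or 400 is handed back at once, since retrying it would never succeed.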

Hope it helps!

2 Likes