Have you solved this? Noticed this yesterday and today as well.
Happens to me as well. Did you figure it out?
Happened to me as well just now. Also getting bad gateway sometimes.
Same here. 502 bad gateways, slow response time, conn resets
Can’t usually do much with a 5xx error.
Based on the GPT-4 scaling, I imagine it’ll be a bit of a bumpy ride until it’s sorted.
Buckle up folks, technology is changing fast and these interruptions are expected.
I was getting bad gateway issues during yesterday’s outage.
Today it’s mostly the above Remote end closed connection without response, (104, 'Connection reset by peer'), or timeout errors like here.
Are all of these errors due to some systemic issue on the API side?
If it’s intermittent, in most cases, yes.
Thanks everyone, good to know I’m not the only one getting these issues. It seems it’s just a matter of trying again and again until it works, then.
I really encourage you to include extra layers such as retries with backoff, fallback strategies, and so on (if you’re not doing it yet). This technology is new, and OpenAI’s engineering team is doing an AMAZING job scaling their servers up to cover all the increasing demand they’re getting. But failures are still expected. Things can fail, and we developers are responsible for coming up with sound strategies for when they do.
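For anyone who wants a starting point, here’s a minimal retry-with-backoff plus model-fallback sketch. It assumes the pre-1.0 openai Python SDK; the function name, model names, and retry counts are just illustrative:

import random
import time

import openai

def complete_with_retries(prompt, models=("gpt-4", "gpt-3.5-turbo"), max_retries=5):
    # Fallback strategy: if one model keeps failing, move on to the next one.
    for model in models:
        # Retry strategy: exponential backoff with jitter, per model.
        for attempt in range(max_retries):
            try:
                return openai.ChatCompletion.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    request_timeout=30,
                )
            except openai.error.OpenAIError as e:
                delay = 2 ** attempt + random.random()  # ~1s, 2s, 4s, ...
                print(f"{model} failed ({e}); retrying in {delay:.1f}s")
                time.sleep(delay)
    raise RuntimeError("All models and retries exhausted")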
Still happening a bit on 3/22, especially during US work hours. For sure build some safeguards into your code while the poor IT and DevOps folks at OpenAI experience the biggest and fastest scaling challenge anyone has ever faced.
I’m using @backoff.on_exception(backoff.expo, openai.error.RateLimitError) from the backoff library, but today I still see APIConnectionError and timeouts. Can you suggest how to account for these errors so that my loop of requests does not break?
You probably just need to set up a try/except statement to handle all errors.
In the except clause you can either pass out some sort of placeholder (e.g., None) or re-queue the content for later.
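Something like this sketch, where rlist and the Completion call are placeholders for your own loop (assuming the pre-1.0 openai SDK):

import openai

results = []
for item in rlist:
    try:
        response = openai.Completion.create(model="text-davinci-003", prompt=item)
        results.append(response)
    except Exception as e:
        # Broad catch: log the error and emit a placeholder instead of crashing.
        print(f"Request failed: {e}")
        results.append(None)  # or re-queue `item` to try again later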
Thanks, so I tried this:
for i in rlist:
    try:
        ...  # my code
    except TimeoutError:
        print("error")
        continue
But the loop still breaks. Is TimeoutError correct here?
I’m based in the EU and have run into these aborted connections a lot over the last week. It’s definitely correlated with US working hours…
I’d happily pay more (2-4x the token rate) for a more reliable endpoint at this stage, but I don’t see that option. I hope the team can figure this out soon, and in the meantime I’ll be implementing retry mechanisms like @AgusPG recommended above.
Following up on this topic, in case it helps. There’s an API that lets you do a quick health check of every OpenAI model, so you can make your request strategy depend on it. It’s also still pretty easy to implement a health-check service like this one yourself, by making dumb API calls from time to time. But in case you want to try it out, folks, you can check it here.
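If you’d rather roll your own, here’s a sketch of that “dumb calls from time to time” idea, again assuming the pre-1.0 openai SDK (the model list, names like MODEL_HEALTH, and the interval are illustrative):

import time

import openai

MODELS = ["gpt-4", "gpt-3.5-turbo"]
MODEL_HEALTH = {m: True for m in MODELS}

def ping_models():
    # A cheap 1-token request per model; any exception marks the model unhealthy.
    for model in MODELS:
        try:
            openai.ChatCompletion.create(
                model=model,
                messages=[{"role": "user", "content": "ping"}],
                max_tokens=1,
                request_timeout=10,
            )
            MODEL_HEALTH[model] = True
        except Exception:
            MODEL_HEALTH[model] = False

# Run it periodically, e.g. from a background thread:
# while True: ping_models(); time.sleep(60)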
I like the Try Model X → Try Model Y → Try Model Z → Retry Later approach.
Is there a benefit to Ping Model X, Y, Z → try Model Y if X is down, Model Z if Y is down, etc.?
My only guess is that you could achieve lower overall latency if you know ahead of time. Is this the only benefit?
Yep, that’s pretty much it. Say you have a client timeout of 30s per model, and models X and Y are down. It takes you a minute to get to model Z and get a completion out of it. That’s a killer for conversational interfaces, where the user will just run away if they don’t get their answer quickly.
Pinging the models in advance and keeping a logbook of each model’s health prevents you from continuously trying to get completions out of models that are having an outage. So you go straight to model Z (and only retry on it) if you suspect models X and Y are having an outage.
This improves the UX, in my view.
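To make that concrete, here’s a routing sketch that consults a health logbook like the one sketched above (MODELS and MODEL_HEALTH are the illustrative names from that sketch):

def healthy_completion(prompt):
    for model in MODELS:
        if not MODEL_HEALTH.get(model, False):
            continue  # skip models the logbook says are down; no timeout wasted
        try:
            return openai.ChatCompletion.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                request_timeout=30,
            )
        except Exception:
            MODEL_HEALTH[model] = False  # update the logbook and fall through
    raise RuntimeError("No healthy model right now; retry later")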
Just adding a “solution” I’ve found. I tried to capture different specific errors, but there are so many different errors the platform can throw (timeout, remote disconnection, bad gateway, just to mention a few) that it’s best to do a blanket except statement for now (although not ideal). I’ve found this to work quite well for me:
for sample in samples:
    inference_not_done = True  # reset per sample, so every sample gets retried
    while inference_not_done:
        try:
            response = openai.Completion.create(...)
            inference_not_done = False
        except Exception as e:  # blanket catch: retry on any error
            print("Waiting 10 minutes")
            print(f"Error was: {e}")
            time.sleep(600)
I do not agree with catching generic exceptions; it’s a bad practice. Also, you do not want to handle all your exceptions the same way: some errors are worth retrying, some are worth falling back on, and some you should never retry.
You can customize your app to handle exceptions based on status codes instead (such as “retry this specific payload for all 5xx errors”). For instance, in Python, aiohttp_retry does a pretty decent job here. It’s the one that I’m currently using.
Hope it helps!
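For illustration, a sketch of status-code-based retries with aiohttp_retry (the endpoint, payload, and retry settings are illustrative; check the library docs for your version):

import asyncio

from aiohttp_retry import ExponentialRetry, RetryClient

async def main():
    # Retry only on 5xx responses, with exponential backoff between attempts.
    retry_options = ExponentialRetry(attempts=5, start_timeout=1.0,
                                     statuses={500, 502, 503, 504})
    async with RetryClient(retry_options=retry_options) as client:
        async with client.post(
            "https://api.openai.com/v1/completions",
            headers={"Authorization": "Bearer YOUR_API_KEY"},
            json={"model": "text-davinci-003", "prompt": "ping", "max_tokens": 1},
        ) as response:
            print(response.status, await response.json())

asyncio.run(main())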
I have been using the ‘train_3_motive_en.csv’ model to generate some texts since yesterday. Yesterday it ran well. However, I have been getting this error for the last 7 hours:
openai.error.APIConnectionError: Error communicating with OpenAI: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Any idea what is going on?
Best Regards,
Zahurul