Has anyone else noticed significantly more API errors depending on the time of day? (GPT-4)

I’ve been noticing a ton more errors during “off-time”, usually around 8 PM - 2 AM stateside time.

It happened again this morning, 9 of 10 completions resulted in one of these two errors:

urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600)

or

openai.error.APIError: Bad gateway. {"error":{"code":502,"message":"Bad gateway.","param":null,"type":"cf_bad_gateway"}} 502 {'error': {'code': 502, 'message': 'Bad gateway.', 'param': None, 'type': 'cf_bad_gateway'}}

Is anyone else having an issue with this?

Might be possible, though I don’t think that’s the case.

What I do notice is that when stateside traffic is active, the API slows down a lot, with longer completion times.

A bad gateway comes from some intermediary server, not from OpenAI or your local setup, so that would be an “internet” issue somewhere on the path from your machine to the OpenAI server. The read timeout is something that can be affected by your server/host environment. Are you hosting locally on your own hardware or with a 3rd-party provider?

That intermediary would be the Cloudflare firewall reporting that it can’t reach the OpenAI server.

https://community.cloudflare.com/t/community-tip-fixing-error-502-504-bad-gateway/44008 or Troubleshooting Cloudflare 5XX errors · Cloudflare Support docs

Probable reason from that page:
“The server at the origin is overloaded or unreachable at the time the request was made. This could be due to the server crashing, traffic spikes, or lack of connectivity to the server. Check your origin server logs for clues as to what happened.”
“An HTTP 502 or 504 error occurs when Cloudflare is unable to establish contact with your origin web server.”

So, in contrast to the above response, it is OpenAI server unresponsiveness.

Looking at the API response times, there was a period a few hours ago where the system seems to have been unresponsive. If that matches up with your timeline, then it’s likely the cause.

This is a week’s worth of response times for 256 tokens, so to answer your original question (sort of): there is time-of-day response variation due to load.

Yes, that would be the timeframe; the only issue is that it’s still ongoing.

To answer your earlier question, I’m hosting through GCloud, using their Redis Memorystore feature to communicate. This issue is a little confusing because everything works fine when I finish up work for the day, then the next morning there are intermittent or constant issues for a while, even though I’ve changed nothing related to server communication.

I’ve implemented retry error handling for the 502 errors and it seems to be working; it’s just this 500 error that still happens on roughly 1 of 10 completions, and I can’t get it to catch the exception yet.
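
For context, the kind of retry handling I mean looks roughly like this sketch (the call_with_retries name, the backoff values, and the gpt-4/messages placeholders are just for illustration, assuming the 0.x openai Python package):

import time
import openai
import urllib3

def call_with_retries(messages, max_retries=3):
    # Illustrative wrapper: retry on APIError (the 502-style failures) and on
    # read timeouts, with a short exponential backoff, re-raising once retries run out.
    for attempt in range(max_retries):
        try:
            return openai.ChatCompletion.create(model="gpt-4", messages=messages)
        except (openai.error.APIError, urllib3.exceptions.ReadTimeoutError):
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...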

Edit: @_j If I’m getting a 502 or 504, do you think it’s on OpenAI’s server side 100% of the time? I’ve read that there may still be a problem with the code itself even though the error points elsewhere.

I’m thinking that there might be some maximum no-reply timeout on the GCloud side; that would tend to make what is effectively a small variation in performance look like a binary change.

Not a GC aficionado, so I’m not sure if that timeout is a configurable value.

If we look at the past 24 hours, you can see an increase, but for the month as a whole it’s actually quicker by about 10%.

It’ll be an OpenAI issue if there is only one CF hop to OpenAI; if CF is using other services to route traffic, then it could be on their end.

That’s an interesting thought. I’ll do some digging today and see what I can find in App Engine.

OK, so I got into GAE’s error reporting. 502 errors are actually very rare; 99% of the errors are this one:

urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600)

Here is a stack trace of one of them:

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/urllib3/connectionpool.py", line 798, in urlopen
    retries = retries.increment(
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/urllib3/connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/urllib3/connectionpool.py", line 468, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/urllib3/connectionpool.py", line 357, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=600)

After some further reading, it seems I should be getting something close to this error if it were GAE timing out, which I haven’t seen:

<class 'google.appengine.runtime.DeadlineExceededError'>:

Now here’s the odd thing: even though the error says it’s coming from OpenAI’s side, when I remove GAE from the equation and run my app locally, the errors disappear. It’s also odd to me that my error handling is built specifically to catch and retry 502s and ReadTimeoutErrors, but it only works for the 502s.
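
One theory I want to test is that the timeout isn’t surfacing as the class I’m catching, since it can get re-wrapped at different layers; something like this broader catch is what I have in mind (a sketch only, and the exact list of classes is my assumption, not something I’ve confirmed):

import urllib3
import requests
import openai

TIMEOUT_ERRORS = (
    urllib3.exceptions.ReadTimeoutError,  # the raw error in the traceback above
    requests.exceptions.Timeout,          # requests can wrap it as a ReadTimeout/Timeout
    openai.error.Timeout,                 # the openai client may re-wrap it again
)

messages = [{"role": "user", "content": "test"}]  # placeholder
try:
    response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
except TIMEOUT_ERRORS:
    pass  # retry / log here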

Also, as far as I know, there should only be one CF hop to OpenAI.

This may be a long shot, but could it have to do with the urllib3 package itself? It’s very outdated (mine is v1.26.16; the current version is 2.0.4), but I’m unable to update it because it then conflicts with a google-oauth package I need to use.

What happens if you put v1.26.16 on your local environment? Does that start to get the ReadTimeOuts?

It doesn’t; I’ve been using the same version both locally and on GAE.

Edit to add: it’s the same urllib3 version that auto-installs when I install the openai package.

I did spot this on Stack Overflow. It’s an interesting way to handle it, but… not a complete removal of the issue.

Thanks for the link, I’ll keep that as a backup in case all else fails.

I tried instituting a 5-second delay between completion requests, as I read about in this thread, but that didn’t work either.

This seems to be a problem with GPT-4. I rolled my app back to 3.5-turbo and it works perfectly (and in about 25% of the time GPT-4 takes). Frustrating, to say the least.

Can you force a change of location for your App Engine? Just wondering if there is some issue with that particular node.

Do you mean the regional location or something else? I’m definitely not a GC aficionado either :smiling_face_with_tear:

Oh, and I forgot to add: I started logging the completion times, and the ones that error out are nowhere close to OpenAI’s default 600-second timeout. The longest I saw was 190 seconds, so technically it shouldn’t be triggering this error in the first place.
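
Since the default is so long, I might also try passing a shorter client-side timeout so slow calls fail fast and get retried instead of hanging; a sketch below (I believe the 0.x openai package accepts a request_timeout argument, but I haven’t confirmed it, and the 120-second value is an arbitrary choice):

import time
import openai

start = time.monotonic()
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "ping"}],  # placeholder prompt
    request_timeout=120,  # fail fast instead of waiting out the 600s default
)
print(f"completion took {time.monotonic() - start:.1f}s")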

Yea, like with Azure, where I can pick a location for my apps (EU, US, etc.); I wondered if something similar existed for GC.

Unfortunately no, whatever location I pick during the GAE setup for that project is permanent.

Edit: I updated urllib3 from v1.26.16 to v2.0.3 and the issue has gotten much better (from roughly 1 of 10 completions erroring to roughly 1 of 30). This is still strange given that the outdated version is what auto-installs with the openai package, but I’ll take it.

I discovered something else as well. When I’m testing locally, it runs through the standard Flask/Python development server, but when the app is deployed, Google App Engine uses a Gunicorn WSGI web server.

I’m thinking this has something to do with it because A) it’s one of the only differences between testing locally and using GAE, and B) every time the error occurs, I see 8-10 of these in my debug log:

[2023-09-05 12:12:06 +0000] [24] [INFO] Worker exiting (pid: 24)
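
If it is the Gunicorn worker timeout killing these requests, overriding the entrypoint in app.yaml is the first thing I’ll try; this is just a sketch assuming a Flask object named app in main.py, and I haven’t verified the flags against GAE’s defaults:

# app.yaml (sketch)
runtime: python38
# --timeout 0 disables Gunicorn's worker timeout (the default is 30 seconds)
entrypoint: gunicorn -b :$PORT --workers 2 --timeout 0 main:app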

Super interesting, thank you for sharing your findings. If you ever crack it completely, I’d love to hear about it.

I tried App Engine for one of my use cases, but unfortunately I needed streaming and SSEs. None of the App Engine variants allow server-sent events, so… it was a no-go for me.