Infinite wait for OpenAI server response with GPT5 on /completions under specific conditions

Seriously, anyone else?

I’m using chat completions (which is usually faster than responses) and querying GPT5 with often just under the max token load (i.e. I’m not getting the kickback from the OpenAI server saying that the prompt + response went over the context window/token limit).

So I’m querying with, say, 180k tokens, at high verbosity and medium reasoning. Occasionally I can get the same context window to work with “low” reasoning on gpt5. But often that will stall the same way (and I’ve tried all verbosity settings, no change most of the time, for these “hung” conditions).

I increased my python SDK calls to the /completions endpoint to timeout after 20m. Previously I had it at 10m.

So now I wait for 20m for it to time out instead of 10m.
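For reference, the relevant bit of my setup (a minimal sketch of the v1+ Python SDK usage; `messages` is my assembled context, not shown):

```python
from openai import OpenAI

# Client-side timeout in seconds: 20 minutes now, previously 600 (10m).
client = OpenAI(timeout=1200.0)

# Parameters match what I log later in this thread.
resp = client.chat.completions.create(
    model="gpt-5",
    reasoning_effort="medium",
    verbosity="high",
    top_p=1.0,
    temperature=1.0,
    messages=messages,  # ~180k tokens of context
)
```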

Note, for temp debugging, I’ve increased the timeout even longer under certain conditions, and then waited again, say, 30m, for not even a single token in response (when testing with streaming). But I don’t like to sit around and play that game; I usually just switch to a different model to push the conversation through that moment, and then switch back to GPT5 afterwards. Incredibly often, that works. Which leads me to believe the “complexity of the question I’m asking” is what’s causing some kind of hang-up/reasoning loop, because the next prompt will usually differ by only +/- 1000 tokens - otherwise identical context windows, just “one message further along” in a long-running multi-turn automatic coding project flow.
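When testing with streaming, I detect the “no first token” case with a small watchdog along these lines (a sketch; `first_token_watchdog` is my own helper name, and `stream` stands in for the SDK’s streaming iterator):

```python
import queue
import threading


def first_token_watchdog(stream, watchdog_s=120.0):
    """Pull chunks from a (possibly blocking) stream on a worker thread,
    raising TimeoutError if the very first chunk never arrives in time.
    After the first chunk, waits indefinitely (the stream is flowing)."""
    q = queue.Queue()

    def pump():
        try:
            for chunk in stream:
                q.put(("chunk", chunk))
            q.put(("done", None))
        except Exception as exc:
            q.put(("error", exc))

    threading.Thread(target=pump, daemon=True).start()

    waiting_for_first = True
    while True:
        try:
            kind, item = q.get(timeout=watchdog_s if waiting_for_first else None)
        except queue.Empty:
            raise TimeoutError(f"no first token within {watchdog_s}s")
        waiting_for_first = False
        if kind == "chunk":
            yield item
        elif kind == "error":
            raise item
        else:  # "done"
            return
```

The worker thread is needed because a hung HTTP stream blocks its iterator forever; you can’t time out a plain `for chunk in stream` loop from inside itself.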

This is often with relatively complex multi-faceted coding aspects where I’m sharing 20+ files in the context window, but only 1-2 previous user/assistant messages from the conversation.

Anyone else having these kinds of issues?

It seems very specific to GPT5. All the other models, whether reasoning or non-reasoning, do eventually successfully respond to the same prompt/context, often with 3-5 minutes max for other reasoning models like o3.

But GPT5 just won’t do it seemingly. Tried the same but with streaming and I don’t even get a first token after 20 minutes.

Of course since I’m not using responses API I can’t see the “reasoning” that may or may not be being generated.

But is there any kind of other known issue regarding completions endpoint with GPT5?

It seems relatively insane/unreasonable to ever wait 20 minutes - or possibly wait infinitely - for a hung server with no response??

@vb @_j you guys helped me with this before - and I increased my timeout using the SDK. But basically, I’m still having the exact same issue, except now I wait longer for no response whatsoever from the OpenAI server!

@stevecoffey Saw you address something similar to this on a different post a few weeks ago; in that case you recommended turning off the “store” option in the responses API so that it “acted like” the chat completions endpoint.

But again, my problem is:

TL;DR:

Chat completions endpoint - infinite wait times for GPT5, without any response or errors from the server, under large context window loads (often with medium reasoning and medium or higher verbosity). And if I increase the context window loads beyond the limit, I do get those errors.

I’ve been using the chat completions endpoint for about a year to the tune of a couple thousand bucks, so I’m not a total noob when it comes to using the SDK and the system overall. I’ve tried debugging the issue: I updated my OpenAI SDK packages to the latest version and re-wrote all my caller code to be compliant with the new v1+ package (I was previously using an older version, as _j pointed out to me a couple weeks ago).

Thanks for any thoughts or help

1 Like

I think it could be:

After lots of reasoning, this model forgets it is supposed to also emit a final channel. Like it was happy to just think.

I’m like, “thanks for all the thinking, were you planning on responding?”. Which is a “continue” message when you are using ChatGPT, or passing reasoning back in on Responses and want to pay again and see if the cache works.

Or that at the end of the output, the logits are uncertain, the probability of “hang up the call” is up there with “transfer the call”…

You can try a final post-prompt developer message along the lines of "final channel output: required, mandatory. End of commentary dialog always must open a final dialog user-facing response". Just an idea.
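In Python terms that’s just appending a trailing developer message to whatever you already send (a sketch; the helper name is mine, the wording is the idea above):

```python
def with_final_channel_nudge(messages):
    """Return a copy of the message list with a trailing developer
    message nudging the model to always emit a user-facing final answer
    after its reasoning/commentary. Does not mutate the input list."""
    nudge = {
        "role": "developer",
        "content": (
            "final channel output: required, mandatory. "
            "End of commentary dialog always must open a final "
            "dialog user-facing response."
        ),
    }
    return messages + [nudge]
```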

You also don’t know if the input is even just held up. I’ve had that when sending large testing loads, just no response forever, like the rate limiter is in a loop of token guesstimating. A call that never should have started because of its parameters. You can see if it is that you never get a response, or you get a usage report that is only “reasoning”.

That’s kind of what I was thinking, and I would just be curious as to why there is no safeguard for that at the OpenAI server level, or whether such a thing would be of interest to them… It seems to consistently happen only when I’m at the “top percentile of max usage potential” - i.e. near-max settings, near-max context window, etc. - which you would think they would really want to support, and would want to carefully ensure returns meaningful errors if there is looping/failure. Why does the server not have its own “timeout”, or some kind of failure mode, for this kind of scenario?

And wait, are you saying that I can actually get reasoning output (streaming or otherwise) using /completions?? I thought that was only available via responses… I would love to be able to see into the “reasoning” aspect, but I always thought that was unsupported on /completions, and for whatever reason I have really stuck with completions and never used /responses.

Apparently not.


I tried resolving this issue by setting service_tier = priority.
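For anyone searching later, the only change was the extra parameter (a sketch, chat completions via the Python SDK; `client` and `messages` as in my earlier posts):

```python
resp = client.chat.completions.create(
    model="gpt-5",
    service_tier="priority",  # opt in to priority processing
    reasoning_effort="medium",
    verbosity="high",
    messages=messages,
)
```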

However now for GPT5 calls I’m getting continual (and quick) 520 errors back from openai server (cloudflare error with connecting to open ai server).

I see something in the priority service tier documentation indicating that it doesn’t work with large context windows for gpt5? Is this true? Would this be the cause of this error or is something else going on??

I don’t see any ongoing issues on openai status that would seem to indicate this being a real server side issue? But then why don’t I get a meaningful error response and instead get the 520?

@vb

I’ll bring this up with staff at our meeting and get back to you with feedback.
Until then I can only suggest to reduce the amount of tokens you are sending per request.

1 Like

Written three months ago (back when it also said “Enterprise”), your likely source:

Is Priority processing available for long context, fine-tuned models, embeddings, etc.?

Not at this time. We will evaluate in the future whether to offer Priority processing on additional products beyond our latest models.

What is “long context”? This is the only hint you will get.

Estimated: the rate limiter calculation which is usually an over-estimate on English.
Where? Rate limit page, and only a lower TPM/RPM showing there for gpt-4.1 (although ‘long’ might be evaluated more broadly.)

I suppose you could see if this is the threshold of failure 100k vs 140k. I’m up to other stuff right now that doesn’t involve pressing the “priority” button on a novel.

I appreciate it. Yes, there’s always a workaround. My use case is interesting in that, again, it would seem that I’m consistently using this model every day at the “maximum” capacity of the model - i.e. we are working with a repo that is large enough that 10-15% of it is already 150k+ tokens.

Thus, using the GPT5 model with some reasoning is highly effective. In fact, it’s the best LLM coding experience I’ve ever had. And normal wait times of 2-7 minutes for us is still worthwhile - because we are using it within automated multi-turn workflows, so we just set it and forget it for hours.

BUT the big issue is these “silent failures” (indefinite timeout with no response from the server) and the newer 520 errors from Cloudflare (which actually we are seeing happen regardless of whether we use the service_tier: priority option). It almost seems like the indefinite-timeout issue shares the same root cause - i.e. using very large context windows close to or near the maximum - but with no “entry point failure” at the server side, just indefinite processing/indefinite hang-up.

This really disrupts the workflows and leaves the GPT5 use case “at its highest level of capacity” quite a bit hampered. We auto-retry until a few failures and then just bail out - but then we have to notify the user of the failure - and what’s the failure?? An indefinite timeout or a generic 520 error. So how do we handle that? We have to start removing some of the codebase documents from the context window and “hope” that it works when we try again - waiting several minutes to see if the call completes.

We’ve never had any issues like this with any other model over the past year from OpenAI. We always got a reliable response indicating what the failure was - usually that the model’s response + prompt went over the CW size - and thus we could at least automate a tiered truncation process to attempt the same call again with a slightly smaller window, allowing us to always “push the model to the max” in terms of CW size.
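That tiered truncation is conceptually just a shrinking token budget. A minimal sketch (the helper name, floor, and percentage are my own choices, using integer math to stay deterministic):

```python
def tiered_truncation_plan(estimated_tokens, floor=100_000, percent=85):
    """Yield progressively smaller token budgets to retry with after a
    context-window failure: each retry gets `percent`% of the previous
    budget, stopping once we drop to or below `floor`."""
    budget = estimated_tokens
    while budget > floor:
        budget = budget * percent // 100
        yield budget
```

Each yielded budget drives one truncate-and-retry attempt; once the plan is exhausted, we give up and surface the failure to the user.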

This practice is what we are continuing to try and use with GPT5, but again these issues are really making it difficult because we are guessing as to what the issue is and not getting any clear feedback from the server.

The workaround on my end is going to be a UI setting that tokenizes before the call to OpenAI is made, and to try different “limits” pre-call to see if I can reliably avoid these issues at, say, 150k context windows, instead of trying to get closer with 180k/200k.
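Pending a real tokenizer pass, even a rough pre-call estimate catches the worst cases - the usual ~4 chars/token heuristic for English (a sketch; the helper names are mine, and a proper tokenizer like tiktoken would be more accurate):

```python
def rough_token_estimate(messages, chars_per_token=4):
    """Very rough pre-call token estimate (~4 chars/token for English).
    Only looks at string content; tool payloads etc. would need more."""
    total_chars = sum(len(m.get("content") or "") for m in messages)
    return total_chars // chars_per_token


def fits_budget(messages, budget_tokens=150_000):
    """True if the estimated context size is within the chosen budget."""
    return rough_token_estimate(messages) <= budget_tokens
```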

Still, I would imagine the point would be to have this implemented server-side at OpenAI, because otherwise clients who are trying to make use of the model’s full potential are, I would assume, experiencing these kinds of issues.

Especially given that this is often explicitly tied to the difference between using the “medium” reasoning level and the “minimal” reasoning level - i.e. whether the call succeeds or silently fails!

Or maybe nobody is trying to use it at this level and I’m just silly, ha!

Or maybe I have to give up on completions and switch over to using Responses… I’m guessing GPT5 was designed primarily with Responses in mind and perhaps that’s why this kind of usage on Completions is getting less success?

That’s probably the best way to resolve this issue quickly. As a fan of the Completions API myself, I know it feels a bit unorthodox, but the Responses API has been rolled out alongside the new models and is better equipped to handle the requirements and edge cases.

Hope this helps!

1 Like

Just in case anyone is ever interested in actually looking into this, this is what it looks like:

2025-09-30 12:04:50 [detail] [CallOpenAI]
[CallOpenAI][AUDIT] No stop_next_llm or stop_all code matched for thread 171, conversation 16862. Proceeding with LLM call.

2025-09-30 12:04:50 [detail] [CallOpenAI]
[CallOpenAI][gpt-5] API parameters (excluding messages): {'model': 'gpt-5', 'top_p': 1.0, 'reasoning_effort': 'medium', 'verbosity': 'high', 'temperature': 1.0, 'service_tier': 'priority'}

2025-09-30 12:06:54 [detail] ***ANY***
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 520 "

2025-09-30 12:06:54 [detail] ***ANY***
Retrying request to /chat/completions in 0.429196 seconds

2025-09-30 12:09:07 [detail] ***ANY***
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 520 "

2025-09-30 12:09:07 [detail] ***ANY***
Retrying request to /chat/completions in 0.765088 seconds

2025-09-30 12:11:55 [detail] ***ANY***
HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 520 "

2025-09-30 12:11:55 [error] [OpenAI_Retry]
[OpenAI_Retry] Attempt 1/3 failed: openai.InternalServerError - </html> (http=520)

2025-09-30 12:11:55 [error] [OpenAI_Retry]
[OpenAI_Retry] Non-retryable; failing out. reason=non_retryable_http_status_520, http=520, type=openai.InternalServerError

Then the actual cloudflare response HTML: (redacted):

2025-09-30 12:11:55 [error]
api.openai.com | 520: Web server is returning an unknown error
Error code 520 - 2025-09-30 18:11:55 UTC - Los Angeles

“There is an unknown connection issue between Cloudflare and the origin web server. As a result, the web page can not be displayed.”

“There is an issue between Cloudflare’s cache and your origin web server. Cloudflare monitors for these errors and automatically investigates the cause. To help support the investigation, you can pull the corresponding error log from your web server and submit it to our support team. Please include the Ray ID (which is at the bottom of this error page).”

Cloudflare Ray ID: 9875ab6e09ab6a26

1 Like

Hi all, I have a similar issue. Making GPT-5 calls using the responses API. I can see the call being completed in the OpenAI Dashboard Logs (and I’m paying for the token usage…), but my server API call never actually resolves and just hangs until I get a timeout.

This is the code example:

This is my server console:
——————–

Starting GPT-5 research for: https://www.url.com/

(15 minutes later)

Onboarding API error: Error: Request timed out.
at OpenAI.makeRequest (src/client.ts:655:15)
at async POST (app/api/onboarding/route.ts:248:22)
246 |
247 | // Call GPT-5 with web_search tool

248 | const response = await client.responses.create({
| ^
249 | model: "gpt-5-mini",
250 | tools: [{ type: "web_search" }],
251 | input: researchPrompt, {
status: undefined,
headers: undefined,
requestID: undefined,
error: undefined,
code: undefined,
param: undefined,
type: undefined
}
POST /api/onboarding 500 in 904848ms

I’m only at about 60k tokens in, and 10k tokens out when completed in the logs overview…

Snippet from the dashboard roughly 10 mins before my call to the endpoint times out…

@vb @Foxalabs @OpenAI_Support @_j

Does anyone know if there has ever been any resolution to this kind of issue that I’m not finding in the forums or online?

Still to this day, sending a request to chat completions with parameters like this:

2026-01-11 17:40:54,998 - INFO - [CallOpenAI][gpt-5] API parameters (excluding messages): {'model': 'gpt-5', 'top_p': 1.0, 'reasoning_effort': 'medium', 'verbosity': 'medium', 'temperature': 1.0}

Occasionally this results in an indefinite non-response from the server - no data, no error, no server-side timeout. You know, waiting 20 minutes for a response, etc. I can try the same call again and again (i.e. no change to the context window), and never get a response.

I used to only notice this happening when I was “close” to context window limits. For example if I was sending 200k tokens. Now, it’s happening at lower levels - for example today at only 140k tokens.

The routine behavior is that it’s anytime I’m asking something somewhat complex. I.e. if there is really a lot going on in the codebase being presented to the model, and I’m asking for several items to be addressed/considered (even if not fixed simultaneously) in a single prompt.

If anyone was interested in actually reproducing, I presume it would be very easy if I just send you the dump of the API call I made and you could try it yourself.

It’s just crazy to me that the server never responds at all. Without approaching token limits, etc., the only way to get around it is to set reasoning to a lower level in the model parameters or to make a less complex request.
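Programmatically, the “lower the reasoning level” workaround amounts to walking down a parameter ladder after each silent hang (a sketch; the ladder values are my own choices, not anything prescribed by OpenAI):

```python
# Ordered from most to least expensive; each step trades quality for a
# better chance of getting any response at all.
FALLBACK_LADDER = [
    {"reasoning_effort": "medium", "verbosity": "medium"},
    {"reasoning_effort": "low", "verbosity": "medium"},
    {"reasoning_effort": "minimal", "verbosity": "low"},
]


def next_params_after_timeout(current):
    """After a silent hang, return the next (cheaper) parameter set to
    retry with, or None when the ladder is exhausted or unrecognized."""
    for i, params in enumerate(FALLBACK_LADDER):
        if params == current and i + 1 < len(FALLBACK_LADDER):
            return FALLBACK_LADDER[i + 1]
    return None
```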

Back in the day the models would just give you a kind of crappy response. Here, it’s just not responding at all!

What would be extremely interesting is to be able to see what the model’s reasoning is during this process… as obviously it gets caught in some kind of loop somewhere… but the fact that this isn’t inhibited at the model level (runaway reasoning, infinite loops?)… it sure would be cool to see what’s really going on.

1 Like

I was facing this issue. I switched to GPT-5.2 and it solved the problem.

Yes, it’s worth a try to repro your issue.
Send me your code and I can run it in a loop until something happens, or not.