I've been getting "request timed out" errors from the API when calling GPT-5, specifically with reasoning set to medium or high and verbosity set to high.
This happens with a relatively large context window, usually between 100k and 200k tokens.
The same request usually works fine if I switch reasoning to low and/or verbosity to medium.
Of course, this is undesirable, as I'm not able to use the higher reasoning capabilities when processing large context windows.
I'll note that this happens when I'm instructing the LLM to operate a highly automated system and produce several kinds of "structured output" (in a non-formal sense), i.e. it has to produce different kinds of special code blocks containing metadata, as well as a variety of code changes, document updates, and "tool calls" (again, non-formal).
This is with the Chat Completions endpoint.
So none of the structured outputs or tool calls are formalized the way they are in the Responses API; it's all in-house processing once the assistant response is received (i.e. all tool calling and output processing is handled at the system level after the assistant response arrives, nothing internal or similar to the Agents/Responses SDK).
What gives? I presume the model is just bailing out on the backend due to “too much reasoning” or something like that?
I'd like to hear whether anyone else has been experiencing this issue, or whether staff can weigh in on this kind of experience. Presumably there's no way for me to pass anything in my call about "allowing a longer-running call to complete" on my end. In fact, I'm pretty sure I've seen longer-running calls complete before, so I'm guessing that the "request timed out" is some kind of internal failure rather than an actual timeout, though it happens after almost exactly 10 minutes.
I can make the same call again successfully once my current prompt has changed. (I'm truncating the context window with every API call to include only the most recent 2-3 role: user messages and 2-3 assistant responses, as well as a special role: user message that includes all codebase content and documentation en masse.)
Thus the issue really arises whenever I'm providing large sets of feedback and feature requests in a single prompt in order to continue a long-running, multi-turn coding implementation plan. Once I've switched the model to reasoning = low and gotten the LLM to process my feedback and requests and add them to documentation, and that message has then been truncated from the conversation, I can switch back to reasoning = high with an otherwise identical context window and have the request succeed normally.
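For reference, the truncation scheme I described looks roughly like this (a simplified sketch, not my actual code; the `pinned` flag marking the en-masse codebase message is made up for illustration):

```python
def truncate_history(messages, keep_turns=3):
    """Keep the system prompt, the pinned codebase/documentation message,
    and only the most recent `keep_turns` user messages and `keep_turns`
    assistant responses, preserving their original interleaved order."""
    system = [m for m in messages if m["role"] == "system"]
    pinned = [m for m in messages if m.get("pinned")]  # en-masse codebase message
    rest = [m for m in messages if m["role"] != "system" and not m.get("pinned")]

    users = [m for m in rest if m["role"] == "user"][-keep_turns:]
    assistants = [m for m in rest if m["role"] == "assistant"][-keep_turns:]

    # Reassemble in original order so the turns stay correctly interleaved.
    kept = set(map(id, users + assistants))
    tail = [m for m in rest if id(m) in kept]
    return system + pinned + tail
```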
I think you’re onto something with the Chat Completions API, since the newer Responses API adds background mode, which is a clean way to handle long-running tasks.
For the Completions API, there are other options. Using the Python SDK, for example, you can set a global timeout when creating the client:
```python
from openai import OpenAI

client = OpenAI(timeout=100)  # seconds
```
Or, you can set a per-call timeout, which overrides the client default just for that request:
Just to clarify: you're saying that the timeout is NOT on OpenAI's server side in a hard-set kind of way? That I can modify the default timeout limit of 10 minutes, because the timeout settings are part of the OpenAI Python SDK that I'm using here?
Oh, very interesting. Thanks J, that's super helpful.
Given that I built this API module at the very end of 2023/early 2024 and have been developing the same system ever since (and I didn't write any of the code myself), you're seeing code artifacts from whatever GPT-4's knowledge cutoff was at the time I was coding... haha!!
But, amazingly, it's only getting better now, so I'll just give it a bump to update my openAIAPI module to current standards (whatever GPT-5's knowledge cutoff is now, lol).