GPT-5 API call fails with large context windows and medium reasoning

I’ve been getting “request timed out” errors with the API when calling GPT-5, specifically with reasoning set to medium or high and verbosity set to high.

This is when calling with a relatively large context window, usually over 100k tokens and below 200k tokens.

The same request usually works OK if I switch reasoning to LOW and/or verbosity to medium.

Of course, this is undesirable, as I’m not able to use the higher reasoning capabilities when processing large context windows.

  • I’ll note that this happens while instructing the LLM to operate a highly automated system and produce several kinds of “structured output” (in a non-formal sense), i.e. it has to emit different kinds of special code blocks containing metadata, as well as a variety of code changes, document updates, and “tool calls” (again non-formal).

This is via the Chat Completions endpoint.

  • So none of the structured outputs or tool calls are formalized in the same sense as the Responses API; it’s all in-house processing once the assistant response is received (i.e. all tool calling/output processing is handled at the system level after receiving the assistant response - nothing internal or similar to the Agents/Responses SDK).
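For context, that kind of system-level post-processing can be sketched roughly like this (a hypothetical illustration only, not the poster’s actual code; `extract_blocks` and the `tool-call` tag are assumptions):

```python
import re

# Hypothetical sketch: pull fenced blocks tagged with a custom label
# (e.g. ```tool-call) out of a raw assistant reply, for in-house dispatch.
FENCE_RE = re.compile(r"```([\w-]+)\n(.*?)```", re.DOTALL)

def extract_blocks(reply: str) -> list[tuple[str, str]]:
    """Return (tag, body) pairs for every tagged fenced block in the reply."""
    return [(m.group(1), m.group(2)) for m in FENCE_RE.finditer(reply)]
```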

What gives? I presume the model is just bailing out on the backend due to “too much reasoning” or something like that?

I’d like to hear if anyone else has been experiencing this issue, or if staff can weigh in. Presumably there’s no way for me to pass anything in my call about allowing a longer-running call to complete on my end. In fact, I’m pretty sure I’ve seen longer-running calls complete before, so I’m guessing the “request timed out” is more of an internal failure than an actual timeout, though it occurs after almost exactly 10 minutes:

2025-08-30 23:01:20 [detail] [CallOpenAI]

[CallOpenAI][gpt-5] API parameters (excluding messages): {'model': 'gpt-5', 'top_p': 1.0, 'reasoning_effort': 'medium', 'verbosity': 'high', 'temperature': 1.0}

2025-08-30 23:11:28 [error] [OpenAI Retry]

[OpenAI Retry] Attempt 1/3 failed: openai.error.Timeout - Request timed out (http=None)

@OpenAI_Support @vb @edwinarbus

Further notes:

  • I can make the same “call” again successfully once my current prompt has changed (I’m truncating the context window with every API call to only include the most recent 2-3 role: user messages and 2-3 assistant responses, as well as a special role: user message that includes all codebase content and documentation en masse).
    • Thus the issue really arises whenever I’m providing large sets of feedback and feature requests in a single prompt in order to continue a long-running multi-turn coding implementation plan. Once I’ve switched the model to reasoning = low, gotten the LLM to process my set of feedback and requests and add it to documentation, and that message has then been truncated from the conversation, I can switch back to reasoning = high with an otherwise identical context window and have the request succeed normally.
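As an aside, the truncation scheme described above can be sketched like this (a hypothetical illustration; `truncate_history`, its signature, and the pinned codebase message are my own framing, not the poster’s code):

```python
# Hypothetical sketch of the truncation scheme: keep one pinned "codebase"
# user message plus only the most recent N user/assistant exchanges.
def truncate_history(messages, codebase_msg, keep_last=3):
    """messages: list of {'role': ..., 'content': ...} dicts, oldest first."""
    users = [m for m in messages if m["role"] == "user"]
    assistants = [m for m in messages if m["role"] == "assistant"]
    recent = set(map(id, users[-keep_last:] + assistants[-keep_last:]))
    kept = [m for m in messages if id(m) in recent]  # preserves original order
    return [codebase_msg] + kept
```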

Hi @lucid.dev!

I think you’re onto something regarding the Chat Completions API: the newer Responses API adds background mode, which is a clean way to handle long-running tasks.

For the Completions API, there are other options. Using the Python SDK, for example, you can set a global timeout when creating the client:

from openai import OpenAI
client = OpenAI(timeout=100)  # seconds

Or, you can set a per-call timeout, which overrides the client default just for that request:

client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    timeout=30  # overrides client default
)

A couple of notes:

  • The timeout parameter is SDK-side only — it’s not sent to the API.

  • The SDK’s default timeout is 10 minutes.

  • Since the Python SDK uses httpx under the hood, you can customize timeout behavior even further if needed.
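For example (a sketch; the exact numbers are arbitrary), you can pass an `httpx.Timeout` to the client to split the budget between connection setup and reading the response:

```python
import httpx
from openai import OpenAI

# Sketch: give connection setup 10 seconds, but allow up to 15 minutes
# for reading the (possibly very slow) model response.
client = OpenAI(
    timeout=httpx.Timeout(900.0, connect=10.0),
)
```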


Deeply appreciative of your reply!

Just to clarify: you’re saying that the timeout is NOT set on OpenAI’s server side in a hard-coded kind of way? That I can modify the default timeout limit of 10 minutes, because the timeout settings are part of the OpenAI Python module (SDK?) that I’m using here?

try:
    response = await self._await_coro_with_cancel(
        openai.ChatCompletion.acreate(**copy.deepcopy(api_params)),
        _cancel_check,
        poll_interval=0.2
    )

So I don’t add “timeout” into the API params; I would just include it in the call itself, something like this?

response = await self._await_coro_with_cancel(
    openai.ChatCompletion.acreate(**copy.deepcopy(api_params), timeout=900),
    _cancel_check,
    poll_interval=0.2
)

Short answer: Yes, that’s correct.

The timeout is a waiting limit on the SDK side. It’s not an API parameter because it’s never sent to the API.


Thanks a million, buddy!!

Just to note: the method in the code snippet is EXTREMELY old.

You can read how to update to SDK >= 1.0.0 in a posting from 2023…

And for your complete demo, cranking the timeout and maximum token budget way up for the full model:

"""Chat Completions gpt-5, SDK async streaming, super-basic example"""
import asyncio, json
from openai import AsyncOpenAI

async def stream_chat(prompt):
    '''Example chat completions - only parses text response content
       - Library directly employs environment variable OPENAI_API_KEY'''
    client = AsyncOpenAI(timeout=1200)
    kwargs = {
        "model": "gpt-5-mini", "max_completion_tokens": 32_000,
        "stream": True, "stream_options": {"include_usage": True},
        "messages": [
          {"role": "developer", "content": "You are ChatAPI, a conversational AI."},
          {"role": "user", "content": prompt},
        ],
    }
    reply = ""
    print("API Call:\n", json.dumps(kwargs, indent=2), "\n=======")
    async for chunk in await client.chat.completions.create(**kwargs):
        if chunk.choices:
            delta = chunk.choices[0].delta
            if delta.content:
                print(delta.content, end="", flush=True)  # streaming print function
                reply += delta.content  # gatherer for later
        if chunk.usage:
            print("\n--\nUsage:\n", json.dumps(chunk.usage.model_dump(), indent=2))

if __name__ == "__main__":
    asyncio.run(stream_chat("Write a haiku poem about AI token usage costs"))

Oh very interesting. Thanks J that’s super helpful.

Given that I built this API module at the very end of 2023/early 2024 and have been developing the same system ever since - and I didn’t write any of the code myself - you’re seeing code artifacts from whatever GPT-4’s knowledge cutoff was at the time I was coding… haha!!!

But, amazingly, it’s only getting better - so I’ll just give it a bump to update my openAIAPI module to current standards (whatever GPT-5’s knowledge cutoff is now, lol).

But seriously thanks for the tip