o3 API timeout versus no issue with 4o with the same prompt

Hello,

We’re running our queries through the API on o3 and 4o.
With the exact same prompt:

  • 4o: no issue
  • o3: times out

We’ve tried everything we can on the code side to avoid this timeout, but it seems to be on the API side. Is there a parameter to avoid this?

Thanks!

The timeout can also come from the worker platform where you are running. Many will close idle-looking connections on you after 60 seconds or so.

Make the same call locally with your code.

Yes, that’s what I thought too, so we tried that, and it’s the same thing locally: works with 4o, times out with o3…

I can assure you that o3 doesn’t hang up on you before you get an answer.

The delay in answering you here came from devising and running a longer question-answering task, just to demonstrate.

...
with only a low-cost 39 Ω, 1-W resistor and
without touching the constant-current driver or the mechanics of the
lamp.
total time: 77.0s
`{'prompt_tokens': 353, 'completion_tokens': 3354, 'total_tokens': 3707, 'prompt_tokens_details': {'cached_tokens': 0, 'audio_tokens': 0}, 'completion_tokens_details': {'reasoning_tokens': 1792, 'audio_tokens': 0, 'accepted_prediction_tokens': 0, 'rejected_prediction_tokens': 0}}`

The model certainly has the capacity to take longer if you just want it to write forever. GPT-4.5 also has a low production rate, if you want to check whether this is model-specific or timeout-driven, but it will want to wrap up after crossing 1500 tokens.

So you’d be looking at your code and the various timeouts of the libraries involved. The precise elapsed time and the exact error raised could give you a clue.
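For example, with httpx the exception class itself tells you which leg timed out. A quick sketch (no auth header here, so a real call would get a 401 rather than hang; the except clauses are the point):

import httpx

url = "https://api.openai.com/v1/chat/completions"
payload = {"model": "o3", "messages": [{"role": "user", "content": "hello"}]}

try:
    r = httpx.post(url, json=payload, timeout=600.0)
    print(r.status_code, f"{r.elapsed.total_seconds():.1f}s")
except httpx.ConnectTimeout:
    print("never reached the server")         # network/firewall leg
except httpx.ReadTimeout:
    print("connected, then gave up waiting")  # the slow-generation case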

An OpenAI SDK new enough to support o3 should have a long default timeout, if that is what you are using.
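With the 1.x openai Python package, for example, you can set the timeout explicitly rather than trusting the default (a sketch; the exact knobs are version-dependent):

from openai import OpenAI

client = OpenAI(timeout=900.0)  # client-wide; reads OPENAI_API_KEY from the env

# ...or override a single call:
response = client.with_options(timeout=900.0).chat.completions.create(
    model="o3",
    messages=[{"role": "user", "content": "hello"}],
)
print(response.choices[0].message.content)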

We use a cURL call for now.
I understand what you’re saying. It goes through with every possible model, except o3.
And it’s weird, because the cURL call times out with everything set to 600 seconds at every level.
But the query still appears on the OpenAI log page. It uses around 10k tokens in+out.
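Something like this, with both of curl’s own timeouts pinned at 600 seconds (payload trimmed; the real prompt is much longer):

curl https://api.openai.com/v1/chat/completions \
  --connect-timeout 600 --max-time 600 \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "o3", "messages": [{"role": "user", "content": "..."}]}'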

Another thing to note: if you are using the Responses endpoint, the answer comes back as an array, with an (empty) reasoning summary as the first item and the actual content as the second. Naive parsing of the JSON response can therefore also be a reason you receive nothing, even on “hello”.
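A minimal parsing sketch that tolerates that array shape (field names as the Responses endpoint returns them today; verify against the current docs):

def extract_output_text(response_json: dict) -> str:
    """Collect assistant text from a /v1/responses result, skipping the
    reasoning item that comes first in the `output` array."""
    parts: list[str] = []
    for item in response_json.get("output", []):
        if item.get("type") != "message":  # skips "reasoning" items
            continue
        for block in item.get("content", []):
            if block.get("type") == "output_text":
                parts.append(block.get("text", ""))
    return "".join(parts)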


It’s a challenge to show both “API example” and “good practice” at once, but I poked away at this a bit for you - Python code to chat with o3 at the console.

__doc__ = "Chat Completions Python demo chatbot; no SDK, no stream, no async"
import os
import sys
import time
import logging
# `pip install httpx` if you don't have it
import httpx

logger = logging.getLogger(__name__)

def send_chat_request(
    conversation_messages: list[dict[str, str]],
    model: str = "gpt-4o-mini",
    max_tokens: int | None = 4000,
    *,
    timeout: float = 900.0,
) -> tuple[httpx.Response, float]:
    """
    Call the OpenAI chat-completions endpoint with the supplied message list.
    OPENAI_API_KEY environment variable is used

    Parameters
    ----------
    conversation_messages : List[Dict[str, str]]
        A list of dicts, each with 'role' and 'content' keys.
    model : str, optional
    max_tokens : int or None, optional
    timeout (keyword) : float, optional
    Returns
    -------
    Tuple[httpx.Response, float]
        Response object and elapsed time in seconds.
        (you can easily add response headers with rate limits)

    Raises
    ------
    ValueError
        If OPENAI_API_KEY environment variable is unset.
    httpx.HTTPStatusError
        If the response has an HTTP error status, like 429 = no credits.
    httpx.RequestError
        If a network error occurs.
    """
    api_url = "https://api.openai.com/v1/chat/completions"
    api_key = os.environ.get("OPENAI_API_KEY")  # don't hard-code
    if not api_key:
        raise ValueError("ERROR: Set the OPENAI_API_KEY environment variable.")

    headers = {"Authorization": f"Bearer {api_key}"}
    start_time = time.time()
    payload = {
        "model": model,
        "messages": conversation_messages,
    }
    if max_tokens is not None:  # omit the cap entirely when None
        payload["max_completion_tokens"] = max_tokens
    try:
        response = httpx.post(
            api_url,
            headers=headers,
            json=payload,
            timeout=timeout,
        )
        response.raise_for_status()
        return response, time.time() - start_time

    except httpx.HTTPStatusError as err:
        logger.error(f"HTTP Err {err.response.status_code}: {err.response.text}")
        raise
    except httpx.RequestError as err:
        logger.error(f"Request Error: {err}")
        raise


# Chat application pattern as script, where exit/break gives you ai_response
MODEL_NAME = "o3"  # start with "gpt-4o-mini"
MAX_TOKENS = None       # Reasoning models need high value or None
MAX_HISTORY_LENGTH = 20  # 20 == 10 user inputs
SYSTEM_PROMPT = """
You are a helpful AI assistant, employing your expertise and vast world knowledge.
With internal planning, you fulfill every input truthfully, accurately, and robustly.
""".strip()

system_message = {
    "role": "developer" if MODEL_NAME.startswith("o") else "system",
    "content": SYSTEM_PROMPT
}
conversation_history: list[dict[str, str]] = []
ai_response: httpx.Response | None = None

print(f"Type your prompt to {MODEL_NAME}.  Enter “exit” to quit.", end="\n\n")

# A chatbot session sends repeatedly, growing a message context list
while True:
    user_input = input("prompt> ").strip()
    if user_input.lower() == "exit":
        print("\nExiting.  Inspect `resp` in a REPL for full details if desired.")
        break
    user_message = {"role": "user", "content": user_input}
    recent_history = conversation_history[-MAX_HISTORY_LENGTH:]
    messages = [system_message, *recent_history, user_message]

    # Here, send_chat_request is purposefully allowed to raise traceback
    ai_response, ai_duration = send_chat_request(
        messages,
        model=MODEL_NAME,
        max_tokens=MAX_TOKENS,
    )
    # Parse out stuff we want and expect: just text content from assistant
    try:
        data = ai_response.json()
        assistant_reply = data["choices"][0]["message"]["content"]
        ai_usage = data["usage"]
    except (KeyError, IndexError, ValueError) as parse_err:
        print(f"Failed to parse response – {parse_err}", file=sys.stderr)
        continue

    # Add to a conversation history only after success (or could retry it)
    conversation_history.append(user_message)
    conversation_history.append({"role": "assistant", "content": assistant_reply})
    print("assistant>", assistant_reply)
    print(f"total time: {ai_duration:.1f}s")
    print(ai_usage)

Thanks. I don’t think it’s this, since shorter prompts work fine with o3 and with other models.
It’s really an issue with longer prompts on o3, and only o3.
I’ve tried everything on the server side at this point. Very weird.

Try a `man curl` on your machine to see whose build of the tool you’re using, and check its timeout parameters…

You could probably set up a little server script somewhere that responds with a cheap “hello” only after two minutes of sleep, and watch curl or the server connection supervisor die on that too. For instance:
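A minimal sketch using only the standard library (port and delay are arbitrary):

# slow_hello.py - answers "hello" only after a two-minute sleep,
# to reproduce client/proxy timeouts without spending API credits
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class SlowHello(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(120)  # simulate a long o3 generation
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    do_POST = do_GET  # same slow answer for POST

if __name__ == "__main__":
    HTTPServer(("", 8000), SlowHello).serve_forever()

Point the same curl command at http://localhost:8000/ with your 600-second settings; if it dies before two minutes, the timeout lives in your stack, not in the model.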
