o1 response time has been terrible lately

I wrote in the o3 price reduction forum topic:

o1-pro was down to non-functional in ChatGPT… then o3-pro was released. Perhaps the same “reasoning models on the back burner” treatment was applied to others…

Or perhaps the effect delivered was to:

  • cut down on an individual user’s usage of compute resources by making them wait minutes between API calls in a chat;
  • make it seem that the AI was “thinking hard”, when the actual product metrics reveal simply a low token output rate, especially for o3-pro (here’s how we can use legacy GPUs…)

I sent your question as model input to an “OpenAI product assistant” on Chat Completions:

o1-2024-12-17: 8.0s; in: 212; out: 860 - that’s 100 tokens a second realized
o3-2025-04-16: 34.4s; in: 212; out: 1,363 (320 internal) - down to 40 TPS

Those aren’t terrible. Are you also including tools that the AI might get caught up in using in loops?
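If you want to reproduce that kind of measurement, here’s a minimal sketch: time a Chat Completions call and divide completion tokens by wall-clock seconds. It assumes the official `openai` Python SDK and an `OPENAI_API_KEY` in the environment; the prompt is a placeholder, not the exact “product assistant” input I used.

```python
# Time a Chat Completions call and report realized tokens per second.
import time
from openai import OpenAI

client = OpenAI()

def realized_tps(model: str, question: str) -> None:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],  # placeholder prompt
    )
    elapsed = time.perf_counter() - start
    usage = resp.usage
    # reasoning_tokens is only reported for reasoning models; guard the access
    details = getattr(usage, "completion_tokens_details", None)
    reasoning = getattr(details, "reasoning_tokens", 0) or 0
    print(
        f"{model}: {elapsed:.1f}s; in: {usage.prompt_tokens}; "
        f"out: {usage.completion_tokens} ({reasoning} internal) - "
        f"{usage.completion_tokens / elapsed:.0f} TPS realized"
    )

for m in ("o1-2024-12-17", "o3-2025-04-16"):
    realized_tps(m, "Why has o1 response time been so slow lately?")
```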

o3 output: 80% complete fabrication of fact

I’m sorry you’ve been running into this—what you’re describing isn’t expected behavior for the o1 models. Below is a checklist of things you can do right away, plus some background on what typically causes the symptoms you’re seeing and how to get help if the problem persists.

────────────────────────────────
1. Check that the issue isn’t a known incident
────────────────────────────────
• Always look at status.openai.com first.
– If you see a “Degraded performance” or “Partial outage” banner that coincides with the time of your calls, that is almost certainly the cause.
– You can subscribe to e-mail or Slack alerts so you know when an incident starts and when it is resolved.
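If you’d rather check this from a script than a browser, here is a rough sketch. It assumes status.openai.com exposes the standard Statuspage JSON endpoint (`/api/v2/status.json`), which may not hold for every incarnation of the page; if the request fails, just open the page normally.

```python
# Quick programmatic status check before blaming your own code.
# Assumption: status.openai.com is a standard Statuspage deployment.
import requests

def openai_status() -> str:
    resp = requests.get("https://status.openai.com/api/v2/status.json", timeout=10)
    resp.raise_for_status()
    return resp.json()["status"]["description"]  # e.g. "All Systems Operational"

if __name__ == "__main__":
    print(openai_status())
```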

────────────────────────────────
2. Retrieve recent request IDs and open a support ticket
────────────────────────────────
• Each API response includes a header called openai-request-id. If the call hangs and you never get a response, capture the ID(s) from the most recent successful request before the hang, and the one that finally errors or times out.
• Send those IDs, the exact UTC timestamps, your Organization ID, and (if possible) the approximate prompt and settings (model, max_tokens, temperature, etc.) to help.openai.com → “Submit a request”.
– This lets the engineering team pull internal logs and pinpoint whether you’re hitting a service-side regression or a usage pattern that needs adjustment.
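A hedged sketch of capturing that ID with the Python SDK’s raw-response wrapper. The header name is an assumption: recent versions of the `openai` package tend to surface it as `x-request-id` rather than the `openai-request-id` name cited above, so the snippet checks both.

```python
# Grab the request ID from the raw HTTP response so it can be attached
# to a support ticket. Assumes the official `openai` Python SDK (v1+);
# the model and prompt are placeholders.
from openai import OpenAI

client = OpenAI()

raw = client.chat.completions.with_raw_response.create(
    model="o1-2024-12-17",
    messages=[{"role": "user", "content": "ping"}],
)

# Header naming is an assumption: check both candidate names.
request_id = raw.headers.get("x-request-id") or raw.headers.get("openai-request-id")
completion = raw.parse()  # the usual ChatCompletion object

print("request id:", request_id)
print("finish reason:", completion.choices[0].finish_reason)
```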

────────────────────────────────
3. Common client-side settings that accidentally create long hangs
────────────────────────────────

  1. Very large max_tokens
    – If max_tokens is high (e.g., ≥8 000) the server must hold a GPU worker for the entire generation window. If the prompt is short and the call normally finishes quickly, but once in a while the model decides to keep talking, you can wait minutes.
    – Mitigation: cap max_tokens closer to the average you expect (e.g., 1 500), or break your work into multiple shorter calls.

  2. Long user-side timeouts
    – The OpenAI backend will terminate a request at ~600 s (10 min). If your HTTP client’s read timeout is longer, you might keep waiting even after the job has been cancelled server-side.
    – Mitigation: set your read timeout to 610–620 s and use exponential-backoff retries. If you still get no bytes by then, abandon and retry the request (a client-configuration sketch covering this and the max_tokens cap follows this list).

  3. Large contexts with repeated instructions
    – o-series models do a full forward pass on the entire context at every generation step; if the same huge blob is injected for every message you’ll see a super-linear slowdown.
    – Mitigation: deduplicate boilerplate, move long reference text into a vector store or partial look-ups, or chunk the document (see “batching” below).
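For items 1 and 2, a minimal sketch of those client-side guardrails with the Python SDK. The model name and limits are illustrative, and note that o-series models take `max_completion_tokens` rather than `max_tokens` on Chat Completions.

```python
# Capped completion budget, a read timeout just past the server's ~10-minute
# cutoff, and exponential-backoff retries. Assumes the official `openai` SDK.
import httpx
from openai import OpenAI

client = OpenAI(
    timeout=httpx.Timeout(620.0, connect=10.0),  # item 2: don't wait past the server cutoff
    max_retries=3,                               # SDK retries with exponential backoff
)

resp = client.chat.completions.create(
    model="o1-2024-12-17",  # illustrative model
    messages=[{"role": "user", "content": "Summarize this ticket in three bullets."}],
    max_completion_tokens=1500,  # item 1: cap near the output length you actually expect
)
print(resp.choices[0].message.content)
```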

────────────────────────────────
4. Practical steps to control cost when the model aborts
────────────────────────────────
• Stream the response (stream=True); a minimal streaming sketch follows this list.
– When you stream tokens you only pay for what is actually delivered. If the call dies halfway through, you lose less time and money.
• Break large documents into smaller chunks and parallelize.
– Instead of one 30-minute monolith request, send 20 smaller requests in parallel. Even if 1-2 stall, you can retry just those pieces.
• Set a per-request budget guardrail.
– Before sending the request, estimate token usage and skip or trim anything that would exceed your maximum cost for the job.
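A short streaming sketch, assuming the official `openai` Python SDK; the model and prompt are placeholders, and you should confirm that the model you use supports stream=True.

```python
# With stream=True you receive tokens as they are generated, so a stalled
# call fails fast and you keep whatever partial output already arrived.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="o1-2024-12-17",  # placeholder; verify streaming support for your model
    messages=[{"role": "user", "content": "Explain the retry strategy in two paragraphs."}],
    stream=True,
)

collected = []
for chunk in stream:
    if not chunk.choices:          # some chunks (e.g. usage-only) carry no choices
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        collected.append(delta)
        print(delta, end="", flush=True)

print()  # whatever arrived is in `collected`, even if the stream dies midway
```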

────────────────────────────────
5. Differences between o1 and o3 that sometimes surprise people
────────────────────────────────
• o3 has a more aggressive “early termination” policy if the model predicts it has completed the answer, while o1 will keep going until you run into max_tokens.
– This is why o3 can look “anomalous” (shorter replies), but it also means o3 often finishes far faster.
• If your workload isn’t highly instruction-sensitive, you might try gpt-4o-mini (available under the gpt-4o-mini name). It is newer than o1, cheaper, and typically 2-3× faster.

────────────────────────────────
6. Next steps if the issue continues
────────────────────────────────

  1. Gather: request IDs, org ID, timestamps, approximate prompt size, max_tokens, whether you used streaming.
  2. Open a ticket at help.openai.com (or e-mail support@openai.com).
  3. Reference “long hangs on o1 beginning around ” in the subject so it can be routed to the runtime-performance queue.

We want to get you unblocked quickly, so don’t hesitate to provide as much detail as you can. Again, apologies for the trouble, and thank you for flagging the issue.