For the last few weeks, the o1 API has been awful in terms of response time. It just hangs for 30 minutes on a single document that, a month ago, would have taken only 3 minutes. And then, if it continues at all, it quits the session completely in the middle of execution. I have incurred so much cost from o1 quitting abruptly or responding incredibly slowly. This has been going on for a few weeks now with no sign of improvement. Even today, after waiting more than 10 minutes for a single small document to process, I force-quit the session. o3 still suffers from some anomalies, hence I choose to stick with o1 until o3 improves.
I wrote in the o3 price reduction forum topic:
o1-pro was down to non-functional in ChatGPT... then o3-pro was released. Perhaps the same "reasoning models on the back burner" treatment was applied to others…
Or the effect delivered:
- cut down on an individual user's usage of compute resources by making them wait minutes between API calls in a chat;
- make it seem that the AI was "thinking hard", when the actual product metrics reveal simply a low token output rate, especially for o3-pro ("here's how we can use legacy GPUs…")
I sent your question as model input to an "OpenAI product assistant" on Chat Completions:
o1-2024-12-17: 8.0s; in: 212; out: 860 - that's 100 tokens a second realized
o3-2025-04-16: 34.4s; in: 212; out: 1,363 (320 internal) - down to 40 TPS
Those aren't terrible. Are you also including tool calls that the AI might get caught up looping on?
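For reference, here is a minimal sketch of how latency and realized tokens-per-second figures like those above can be collected from Chat Completions. It assumes the official openai Python SDK (v1.x) with an API key in the environment; the prompt is a placeholder, not the exact input used:

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure(model: str, prompt: str) -> None:
    """Time one Chat Completions call and report realized output tokens per second."""
    start = time.monotonic()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.monotonic() - start
    u = resp.usage
    tps = u.completion_tokens / elapsed if elapsed > 0 else float("inf")
    print(f"{model}: {elapsed:.1f}s; in: {u.prompt_tokens}; "
          f"out: {u.completion_tokens} - {tps:.0f} tokens/s realized")

# Placeholder prompt; substitute the actual question that was sent.
measure("o1-2024-12-17", "Why would o1 API calls hang for 30 minutes on a small document?")
measure("o3-2025-04-16", "Why would o1 API calls hang for 30 minutes on a small document?")
```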
o3 output: 80% complete fabrication of fact
I'm sorry you've been running into this; what you're describing isn't expected behavior for the o1 models. Below is a checklist of things you can do right away, plus some background on what typically causes the symptoms you're seeing and how to get help if the problem persists.
────────────────────────────────
1. Check that the issue isn't a known incident
────────────────────────────────
• Always look at status.openai.com first.
– If you see a "Degraded performance" or "Partial outage" banner that coincides with the time of your calls, that is almost certainly the cause.
– You can subscribe to e-mail or Slack alerts so you know when an incident starts and when it is resolved.
────────────────────────────────
2. Retrieve recent request IDs and open a support ticket
────────────────────────────────
• Each API response includes a header called openai-request-id. If the call hangs and you never get a response, capture the ID(s) from the most recent successful request before the hang, and the one that finally errors or times out (see the sketch below).
• Send those IDs, the exact UTC timestamps, your Organization ID, and (if possible) the approximate prompt and settings (model, max_tokens, temperature, etc.) to help.openai.com → "Submit a request".
– This lets the engineering team pull internal logs and pinpoint whether you're hitting a service-side regression or a usage pattern that needs adjustment.
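A minimal sketch of capturing the request ID programmatically, assuming the official openai Python SDK (v1.x); the exact response-header name can vary, so two common spellings are checked:

```python
from openai import OpenAI

client = OpenAI()

# with_raw_response exposes the HTTP headers alongside the parsed completion.
raw = client.chat.completions.with_raw_response.create(
    model="o1-2024-12-17",
    messages=[{"role": "user", "content": "ping"}],
)
# Header naming may differ by deployment; keep whichever spelling is present.
request_id = raw.headers.get("openai-request-id") or raw.headers.get("x-request-id")
completion = raw.parse()  # the usual ChatCompletion object

print("request id:", request_id)
print("finish reason:", completion.choices[0].finish_reason)
```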
────────────────────────────────
3. Common client-side settings that accidentally create long hangs
────────────────────────────────
- Very large max_tokens
  – If max_tokens is high (e.g., ≥8,000) the server must hold a GPU worker for the entire generation window. If the prompt is short and the call normally finishes quickly, but once in a while the model decides to keep talking, you can wait minutes.
  – Mitigation: cap max_tokens closer to the average you expect (e.g., 1,500), or break your work into multiple shorter calls.
- Long user-side timeouts
  – The OpenAI backend will terminate a request at ~600 s (10 min). If your HTTP client's read timeout is longer, you might keep waiting even after the job has been cancelled server-side.
  – Mitigation: set your read timeout to 610-620 s and use an exponential back-off retry. If you still get no bytes by then, abandon and retry the request (see the sketch after this list).
- Large contexts with repeated instructions
  – o-series models do a full forward pass on the entire context at every generation step; if the same huge blob is injected for every message you'll see a super-linear slowdown.
  – Mitigation: deduplicate boilerplate, move long reference text into a vector store or partial look-ups, or chunk the document (see "batching" below).
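A sketch of the timeout-plus-retry pattern above, assuming the official openai Python SDK (v1.x); the 610 s read timeout, retry count, and output cap are illustrative values taken from the text, not fixed recommendations (note that o-series models accept max_completion_tokens rather than max_tokens):

```python
import time

import httpx
from openai import OpenAI, APIConnectionError, APITimeoutError

# Read timeout just above the ~600 s server-side limit described above.
client = OpenAI(
    timeout=httpx.Timeout(connect=10.0, read=610.0, write=30.0, pool=10.0)
)

def call_with_retry(messages, max_attempts: int = 3):
    """Retry timed-out or dropped calls with exponential back-off."""
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(
                model="o1-2024-12-17",
                messages=messages,
                max_completion_tokens=1500,  # cap output near the expected average
            )
        except (APITimeoutError, APIConnectionError):
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # back off 1 s, 2 s, ... before retrying
```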
────────────────────────────────
4. Practical steps to control cost when the model aborts
────────────────────────────────
• Stream the response (stream=True).
– When you stream tokens you only pay for what is actually delivered. If the call dies halfway through, you lose less time and money (see the sketch at the end of this section).
• Break large documents into smaller chunks and parallelize.
– Instead of one 30-minute monolithic request, send 20 smaller requests in parallel. Even if 1-2 stall, you can retry just those pieces.
• Set a per-request budget guardrail.
– Before sending the request, estimate token usage and skip or trim anything that would exceed your maximum cost for the job.
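A minimal sketch of the streaming and chunking ideas above, assuming the official openai Python SDK (v1.x) and a model/account tier for which stream=True is enabled; the chunk size, prompt, and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI()

def process_chunk(chunk: str) -> str:
    """Stream one chunk's response so partial output is kept if the call dies."""
    stream = client.chat.completions.create(
        model="o1-2024-12-17",
        messages=[{"role": "user", "content": f"Summarize this section:\n\n{chunk}"}],
        stream=True,
    )
    parts = []
    for event in stream:
        delta = event.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)

def process_document(text: str, chunk_chars: int = 8000) -> list[str]:
    """Split a large document into independent requests that can be retried individually."""
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    return [process_chunk(c) for c in chunks]
```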
────────────────────────────────
5. Differences between o1 and o3 that sometimes surprise people
────────────────────────────────
• o3 has a more aggressive "early termination" policy if the model predicts it has completed the answer, while o1 will keep going until you run into max_tokens.
– This is why o3 can look "anomalous" (shorter replies), but it also means o3 often finishes far faster.
• If your workload isn't highly instruction-sensitive, you might try gpt-4o-mini (available under the gpt-4o-mini model name). It is newer than o1, cheaper, and typically 2-3× faster.
────────────────────────────────
6. Next steps if the issue continues
────────────────────────────────
- Gather: request IDs, org ID, timestamps, approximate prompt size, max_tokens, and whether you used streaming.
- Open a ticket at help.openai.com (or e-mail support@openai.com).
- Reference "long hangs on o1 beginning around …" in the subject so it can be routed to the runtime-performance queue.
We want to get you unblocked quickly, so don't hesitate to provide as much detail as you can. Again, apologies for the trouble, and thank you for flagging the issue.