Update: it’s back to normal now. Whatever problem the server was having yesterday seemed to be resolved.
I’m accessing gpt-5 using the Responses API, and the OpenAI status page says everything is working fine.
It has been working all this week with no problem, but today I’m getting extremely slow responses and sometimes an internal server error:
‘An error occurred while processing your request. You can retry your request, or contact us through our help center at help.openai.com if the error persists. Please include the request ID req_2fda1a57a25b492ba527e2214511891a in your message.’
Is anyone having this issue? I’m in the northeast USA.
Is there an easy way to find out if I’m being throttled?
Chat Completions, with “service_tier”: “priority” as an API parameter - thus GPT-5 without ‘Responses’ arbitrarily deciding to include or drop past reasoning items that were resent, which degrades the cache:
input tokens: 13147
output tokens: 6116
uncached: 987
non-reasoning: 1188
cached: 12160
reasoning: 4928
HTTP 200 (48011 ms)
That’s 127 tokens per second, currently a weekend evening or past bedtime for much of the world. Still a long wait looking at nothing, but that’s from the reasoning about the task. Far faster than at the day-0 model release, indicating ‘efficiencies were found’.
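For reference, a minimal sketch of that kind of call - Chat Completions with service_tier set to priority via the official Python SDK. The exact usage field names depend on your SDK version, so treat those as assumptions:

```python
# Sketch: Chat Completions with the priority service tier (assumes the openai
# Python SDK and that your account has access to "priority").
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.monotonic()
response = client.chat.completions.create(
    model="gpt-5",
    service_tier="priority",  # pay-more-for-faster processing tier
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the task status."},
    ],
)
elapsed_ms = (time.monotonic() - start) * 1000

usage = response.usage
print(f"HTTP 200 ({elapsed_ms:.0f} ms)")
print("input tokens:", usage.prompt_tokens)
print("output tokens:", usage.completion_tokens)
print("cached:", usage.prompt_tokens_details.cached_tokens)
print("reasoning:", usage.completion_tokens_details.reasoning_tokens)
```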
Otherwise, I’m only making small calls to gpt-5-mini on Responses, to keep on top of and classify the Responses endpoint’s recent failures. That very topic should degrade your trust in the endpoint until OpenAI says what’s going on or what was failing with their state persistence.
For usage like this, with small server state, gpt-5-mini still leaves you staring at a blank screen for longer than one would want before anything is seen - even when streaming. I don’t have benchmarks to report vs typical on Responses, other than typing minimal chats at it just now.
What would you like help with right now?
--Usage-- in/cached: 24/0; out/reasoning:226/64
About six seconds to streaming:
--Usage-- in/cached: 201/0; out/reasoning:388/192
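If you want to put a number on that “staring at a blank screen” time, one rough way is to time the gap to the first streamed text delta. A sketch, assuming the openai Python SDK’s Responses streaming helper; the event type string may differ by SDK version:

```python
# Sketch: measure time-to-first-token on the Responses endpoint with streaming.
import time
from openai import OpenAI

client = OpenAI()

start = time.monotonic()
first_token_at = None

with client.responses.stream(
    model="gpt-5-mini",
    input="What would you like help with right now?",
) as stream:
    for event in stream:
        # "response.output_text.delta" is the text delta event in current SDKs
        if event.type == "response.output_text.delta" and first_token_at is None:
            first_token_at = time.monotonic()
    final = stream.get_final_response()

print(f"time to first text: {first_token_at - start:.1f} s")
print(f"total time: {time.monotonic() - start:.1f} s")
print("usage:", final.usage)
```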
They seem to have made sure that the new “Your Health” feature on the platform site doesn’t say anything bad except for hard errors that get reported. So if it does dip, that means there’s an issue of significance.
However, 500 errors are often triggered by bad inputs; I would try replaying the exact same ‘chat’, if it doesn’t rely on a ‘conversations’ object that has changed, and classify the success/fail ratio on re-sending that API call body.
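A quick way to do that replay is to re-POST the identical request body and tally the status codes. A sketch with raw HTTP; the body below is a placeholder for the actual JSON of the call that produced the 500:

```python
# Sketch: re-send the exact same Responses request body N times and
# classify the success/failure ratio.
import os
import time
import requests

URL = "https://api.openai.com/v1/responses"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
BODY = {"model": "gpt-5", "input": "exact same input that failed"}  # placeholder

results = {}
for _ in range(10):
    r = requests.post(URL, headers=HEADERS, json=BODY, timeout=300)
    results[r.status_code] = results.get(r.status_code, 0) + 1
    time.sleep(2)  # keep the request rate low while testing

print(results)  # e.g. {200: 8, 500: 2}
```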
Apart from using service_tier='priority' (be aware of the increased prices) or trying Chat Completions (which can’t run some tools), another thing that worked for me sometimes was to temporarily reduce the request rate.
The health screenshots don’t give me enough to conclude much more about volume, other than that this is likely a single-user scenario, with a period of no usage that would be atypical for an established user base.
But your comment does give rise to thoughts about ‘request rate’ - noting the routing by cache:
The API endpoint routes requests to the same server based on whether a hash of the initial input tokens, about the first 256, matches prior requests, per some specific formula.
If you have a deployed application that always uses the same large system prompt, it is going to employ this pinning to a server instance to increase your cache hits.
That also means parallel requests will increase the load you experience on that instance.
You have "prompt_cache_key" as another top-level parameter not to enhance this caching hit frequency, but to break up the routing by pattern. Its value is also included in the hashing.
use a user ID, or even a chat session ID, and you will be signaling a preference for ‘chat’ caching, instead of ‘system message’ caching, and get more load distribution.
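To make the routing idea concrete, here is a toy illustration of prefix-hash routing - not OpenAI’s actual formula, just the shape of the mechanism described above: an identical prefix (plus the same prompt_cache_key) lands in the same bucket, while per-user keys spread the load:

```python
# Toy illustration only - NOT OpenAI's routing formula. It shows why a shared
# system-prompt prefix pins requests to one bucket, and why a per-user
# prompt_cache_key spreads them across buckets.
import hashlib

NUM_INSTANCES = 8      # pretend server pool
PREFIX_CHARS = 1024    # stand-in for "about the first 256 tokens"

def route(prompt: str, prompt_cache_key: str = "") -> int:
    material = prompt[:PREFIX_CHARS] + prompt_cache_key
    digest = hashlib.sha256(material.encode()).hexdigest()
    return int(digest, 16) % NUM_INSTANCES

system_prompt = "You are a long, fixed system prompt. " * 40  # > PREFIX_CHARS

# Same prefix, no cache key: every request hashes to the same bucket.
print({route(system_prompt + f" user msg {i}") for i in range(5)})

# Same prefix, per-user cache key: requests spread across buckets.
print({route(system_prompt, prompt_cache_key=f"user-{i}") for i in range(5)})
```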
One even sees another opportunity here in the face of 500 server errors. If you automatically retry, would you persist against the same cache server instance, not changing your API call? Or would you want to inject some updated text into prompt_cache_key, so you move elsewhere in OpenAI server land?
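One way to act on that: on a server error, retry the same call but with a changed prompt_cache_key, so the retry may be routed to a different server/cache instance. A sketch only - the retry policy, key scheme, and backoff are illustrative choices, not a recommendation:

```python
# Sketch: retry a Responses call after a server error, varying prompt_cache_key
# on each attempt so the retry can land on a different cached server instance.
import time
import uuid
from openai import OpenAI, InternalServerError

client = OpenAI()

def create_with_rerouting_retry(base_key: str, max_attempts: int = 3):
    for attempt in range(max_attempts):
        # First attempt keeps the normal key (best cache hit); retries append
        # a random suffix to move elsewhere in the routing.
        key = base_key if attempt == 0 else f"{base_key}-retry-{uuid.uuid4().hex[:8]}"
        try:
            return client.responses.create(
                model="gpt-5",
                input="same request body as before",  # placeholder
                prompt_cache_key=key,
            )
        except InternalServerError:
            time.sleep(2 ** attempt)  # simple backoff before retrying
    raise RuntimeError("still failing after retries")

response = create_with_rerouting_retry(base_key="user-1234")
print(response.output_text)
```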