GPT-5-mini API unstable and slow, repeated timeouts and empty responses

I see similar posts from around a month ago, but there doesn’t seem to be any resolution. Using the Chat Completions API with gpt-5-mini, I am seeing a huge amount of instability in getting responses back at all, as well as general slowness.

Because we repeatedly pass in history, a generous upper bound on our input size is around 100k tokens. I repeatedly get openai.APITimeoutError despite having tried 5- and 10-minute timeouts, and 10 minutes is nearly prohibitive for our purposes. I was previously running gpt-4o-mini and gpt-4.1-mini with 2-minute timeouts and no issues. The model also frequently fails to complete Pydantic-defined structured outputs, and I am getting empty completions that are not errors, even with minimal reasoning. I know this is somewhat resolvable by increasing max_completion_tokens, but that is not desirable for us. We never had these problems with gpt-4o-mini or gpt-4.1-mini, and we cannot downgrade because we are specifically benchmarking gpt-5.

Does anyone have recommendations or a resolution for these issues? Maybe it is related to an API update or to not having migrated to the Responses API, but this feels ridiculous when the previous models and API had minimal issues, if any, for our purposes.


I can tell you that we are using gpt-5-mini in our production system that handles chat conversations. It is a bit slower than gpt-4o-mini, but that is an acceptable trade-off for us given that its reasoning seems better. We see time to first streamed character around 3–5 seconds and complete responses at about 10 seconds, and that includes file_search and sometimes our function tools, so those are multi-turn generations. Have you taken a look at the reasoning flow? You can see it by running in streaming mode and looking at the event stream.
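A sketch of the inspection this reply suggests, assuming the Responses API streaming interface where each event carries a `type` string (e.g. `response.output_text.delta`); the tally helper just groups event types so you can see how much of the stream is reasoning versus output:

```python
from collections import Counter


def tally_event_types(event_types: list[str]) -> Counter:
    """Group streamed event types, folding '.delta' events into one bucket."""
    return Counter(t.rsplit(".delta", 1)[0] for t in event_types)


def inspect_stream(prompt: str) -> Counter:
    """Stream one response and tally its events.

    Requires the openai package and an API key; event type names assume
    the Responses API streaming format.
    """
    from openai import OpenAI  # local import: only needed for the live call

    client = OpenAI()
    stream = client.responses.create(model="gpt-5-mini", input=prompt, stream=True)
    return tally_event_types([event.type for event in stream])
```

If most of the events are reasoning-related and few or none are output-text deltas, that points at the reasoning budget being the bottleneck rather than network slowness.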


We did try streaming and hit the same issues: sometimes we randomly get empty content with no refusal, followed by an end chunk with a `length` finish reason. Doubling the timeout again and tripling max_completion_tokens seems to improve stability, but running our pipeline still eventually produces some kind of error. I am surprised your streaming is that fast; I am seeing up to 15 seconds for the first chunk (of an empty string).

You’re not alone in frustrations. Our experience has been that whenever you get “off the beaten path” with OpenAI stuff, you find a lot of “rough edges”. My sense is that their development philosophy is “move fast and break things” – which doesn’t seem ideal for a platform that is supposed to be the foundation of other products (like yours and mine).

Having said that, in our case, we use both OpenAI and Google – partly because we need the redundancy because OpenAI has had so many service-affecting outages. We can choose either as primary. But we use both for generating and it’s very interesting comparing the responses side-by-side for the same context. In general, OpenAI responses are better when they work but they are also much slower. (Comparing gpt-5-mini with gemini-2.5-flash.)

I wish I could offer some other specific suggestions for you. Best of luck, though. If you do figure it out, I bet we’ll all be interested to hear what you learn.


Same here. I’ve been getting a lot of timeouts with GPT-5 Mini, and even with a fallback it’s a big problem. The API doesn’t refuse the connection under heavy load; instead, it accepts the request, tries to respond, and then never completes (it times out).
As a result, we lose a lot of time in our flow just waiting.
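For anyone wiring up the fallback mentioned here, a minimal retry-then-fallback sketch: the backoff schedule, the 120-second timeout, and the fallback model name (gpt-4.1-mini, mentioned earlier in the thread) are all illustrative choices, not a recommended configuration:

```python
import time


def backoff_delays(attempts: int, base: float = 2.0, cap: float = 60.0) -> list[float]:
    """Exponential backoff schedule: base, base*2, base*4, ..., capped at `cap`."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]


def call_with_fallback(messages: list[dict],
                       primary: str = "gpt-5-mini",
                       fallback: str = "gpt-4.1-mini"):
    """Retry the primary model on timeouts, then fall back to a faster model.

    Requires the openai package and an API key.
    """
    from openai import OpenAI, APITimeoutError  # local import for the live call

    client = OpenAI(timeout=120.0)
    for delay in backoff_delays(3):
        try:
            return client.chat.completions.create(model=primary, messages=messages)
        except APITimeoutError:
            time.sleep(delay)  # back off before retrying the primary model
    return client.chat.completions.create(model=fallback, messages=messages)
```

Capping retries keeps the worst-case wall-clock wait bounded, which is the "lose a lot of time just waiting" problem described above.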
