In initial tests it looks like GPT-5 (even mini and nano) is extremely slow compared to 4.1 or 4o. The scenarios I have tried involve function calling.
Is this expected?
Been coding with it for an hour; it takes forever to produce an answer, and I’m not even convinced it’s smarter/better/using the full conversation context. It’s possible to shorten the answer by clicking on it, but then we might as well stick to 4.1, since we’re accustomed to the behaviour and know what’s expected. I’ll give it more time, probably all morning, before deciding whether to switch back to 4.1 for coding. Disappointed, to say the least.
Yes, it’s slower than 4.1 and 4o by default, and that’s expected:
It is a reasoning model, and its reasoning_effort (or effort for Responses) defaults to medium, so it is spending time “thinking” (and costing you more tokens/money).
Try setting reasoning_effort (or effort for Responses) to “minimal” and speed should improve significantly. Obviously that may affect the response quality, but give it a go.
In my experience, even the full model then becomes quite “snappy”.
https://platform.openai.com/docs/api-reference/chat/create#chat_create-reasoning_effort
https://platform.openai.com/docs/api-reference/responses/create#responses_create-reasoning
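For reference, a minimal sketch of setting this via the Python SDK on Chat Completions (model and prompt are just placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Chat Completions: reasoning effort is a flat, top-level parameter.
resp = client.chat.completions.create(
    model="gpt-5-mini",          # placeholder; any GPT-5 family model
    reasoning_effort="minimal",  # "minimal" | "low" | "medium" (default) | "high"
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(resp.choices[0].message.content)
```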
I’m not even convinced it’s smarter, period; not smarter than the best-in-class model you can send a refactor request to and get a one-and-done result from.
More time spent “vibing” than it would take to write the code yourself, between the corrections and examples needed, it going outside the scope, chatting at you in the code, and even adding prints just to “chat”.
Low effort: too dumb; medium effort: too disobedient and instruction-ignoring.
The AI has a behavior issue: it writes a lot of hedging. As in “this code is gonna run and fail silently, because I don’t know what I’m doing from all the speculations I made”.
Speed:
The billing lets us infer the token production rate we can’t observe directly, which is what we’ll have to rely on to track day-by-day performance.
| Model (minimal) | Trials | Avg Latency (s) | Avg Stream Rate (tok/s) | Avg Total Rate (tok/s) |
|---|---|---|---|---|
| gpt-4.1-mini | 10 | 0.628 | 75.441 | 72.007 |
| gpt-5-mini | 10 | 1.067 | 90.294 | 83.732 |
Note: reasoning_effort = minimal can use as low as 8 tokens of reasoning. The default can consume all of a writing-assignment benchmark’s 1024 max_completion_tokens internally. Typical:
model gpt-5-mini: 1024 generated (incl. 8 reasoning), 1007 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
6 | # Human Aspirations in a Post-Money World ## Introduction …
4 | Human Aspirations in a Post-Money World Introduction …
4 | Title: Human Aspirations in a Post‑Money World Introduction…
3 | Human Aspirations in a Post‑Money World Introduction Money…
1 | Title: Human Aspirations in a Post-Money World Introduction…
1 | Human Aspirations in a Post‑Money World Introduction — what…
1 | Human Aspirations in a Post‑Money World Introduction For m…
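For anyone curious how numbers like these are gathered: below is a rough sketch of a single streaming trial, assuming the Python SDK, Chat Completions streaming with stream_options include_usage, and usage.completion_tokens for the token count. It is an illustration of the method, not the exact harness behind the table above.

```python
import time
from openai import OpenAI

client = OpenAI()

def bench(model: str, prompt: str, max_tokens: int = 1024, effort: str | None = None):
    """One trial: latency to first visible content, stream rate, and total rate (tok/s)."""
    kwargs = {"reasoning_effort": effort} if effort else {}  # only valid on reasoning models
    start = time.perf_counter()
    first = None
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_completion_tokens=max_tokens,
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries token usage
        **kwargs,
    )
    usage = None
    for chunk in stream:
        if chunk.usage is not None:
            usage = chunk.usage                  # only present on the last chunk
        if chunk.choices and chunk.choices[0].delta.content and first is None:
            first = time.perf_counter()          # time of first visible token
    end = time.perf_counter()
    tokens = usage.completion_tokens if usage else 0
    latency = (first or end) - start
    stream_rate = tokens / max(end - (first or start), 1e-9)
    total_rate = tokens / max(end - start, 1e-9)
    return latency, stream_rate, total_rate

# Example (hypothetical prompt):
# print(bench("gpt-5-mini", "Essay: Human Aspirations in a Post-Money World", effort="minimal"))
```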
Nice analysis, I’m guessing that the stream doesn’t start until the reasoning has ended, so it looks like it is on better infrastructure, but has more work to do, hence the latency.
I guess it catches up if the output is long enough, otherwise it is slower?
“Catching up” is dubious with reasoning. It would take more output than the model is willing to write, while gpt-4.1 finishes first simply by writing less.
But this is a poor time to evaluate, before applications have transitioned and the inference machines have transitioned. All of gpt-4.1 has been running significantly slower for months compared to its launch-day performance.
GPT-5 with medium reasoning and too low a max_tokens finishes faster, but there is much less to read before truncation. A call log of a small trial:
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 640 reasoning), 375 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 769 reasoning), 246 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 626 reasoning), 389 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 481 reasoning), 534 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 472 reasoning), 543 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 713 reasoning), 302 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 528 reasoning), 487 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 672 reasoning), 343 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 559 reasoning), 456 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 725 reasoning), 290 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
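Each log line above can be reconstructed from the response’s usage object; a small sketch, assuming Chat Completions via the Python SDK, tiktoken’s o200k_base encoding for the delivered-token count, and a hypothetical log_line helper:

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")

def log_line(model: str, messages: list[dict], max_tokens: int = 1024) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=messages,
        max_completion_tokens=max_tokens,
    )
    usage = resp.usage
    details = usage.completion_tokens_details
    reasoning = details.reasoning_tokens if details else 0
    # "delivered" = what actually reaches the user, re-tokenized locally
    delivered = len(enc.encode(resp.choices[0].message.content or ""))
    note = f" (incl. {reasoning} reasoning)" if reasoning else ""
    return (f"model {model}: {usage.completion_tokens} generated{note}, "
            f"{delivered} delivered of {max_tokens} max, o200k_base")
```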
Let the AI write all it wants for the same prompts, and gpt-5-mini, without a verbosity setting, comes in a resounding second, with more for you to read. Both respect their output caps well.
model gpt-4.1-mini: 1551 generated, 1551 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1403 generated, 1403 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1559 generated, 1559 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1531 generated, 1531 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1467 generated, 1467 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1404 generated, 1404 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1297 generated, 1297 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1387 generated, 1387 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1416 generated, 1416 delivered of 10240 max, o200k_base
model gpt-5-mini: 3000 generated (incl. 512 reasoning), 2479 delivered of 10240 max, o200k_base
model gpt-5-mini: 3290 generated (incl. 832 reasoning), 2449 delivered of 10240 max, o200k_base
model gpt-5-mini: 3117 generated (incl. 640 reasoning), 2468 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1397 generated, 1397 delivered of 10240 max, o200k_base
model gpt-5-mini: 3129 generated (incl. 768 reasoning), 2352 delivered of 10240 max, o200k_base
model gpt-5-mini: 3592 generated (incl. 768 reasoning), 2815 delivered of 10240 max, o200k_base
model gpt-5-mini: 3274 generated (incl. 704 reasoning), 2561 delivered of 10240 max, o200k_base
model gpt-5-mini: 3193 generated (incl. 768 reasoning), 2416 delivered of 10240 max, o200k_base
model gpt-5-mini: 3154 generated (incl. 576 reasoning), 2569 delivered of 10240 max, o200k_base
model gpt-5-mini: 3495 generated (incl. 768 reasoning), 2718 delivered of 10240 max, o200k_base
model gpt-5-mini: 3575 generated (incl. 704 reasoning), 2862 delivered of 10240 max, o200k_base
The benchmark launches calls asynchronously in parallel. The only opportunity for caching is successive runs or unexpected queueing by the server (and the prompt is too small anyway).
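Roughly, the parallel launch looks like the sketch below, using AsyncOpenAI and asyncio.gather; this is an illustration of the approach, not the literal benchmark code.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def one_trial(model: str, prompt: str, max_tokens: int = 10240):
    # Each trial is an independent request with an identical (small) prompt,
    # so there is no deliberate reuse that would trigger prompt caching.
    return await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_completion_tokens=max_tokens,
    )

async def run_all(model: str, prompt: str, n: int = 10):
    # Fire all trials at once rather than back-to-back.
    return await asyncio.gather(*(one_trial(model, prompt) for _ in range(n)))

# results = asyncio.run(run_all("gpt-5-mini", "Essay: Human Aspirations in a Post-Money World"))
```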
| Model | Trials | Avg Latency (s) | Avg Stream Rate (tok/s) | Avg Total Rate (tok/s) |
|---|---|---|---|---|
| gpt-4.1-mini | 10 | 0.728 | 74.663 | 70.582 |
| gpt-5-mini | 10 | 6.405 | 112.284 | 102.904 |
2 | Human Aspirations in a Post‑Money World Introduction Money…
1 | Introduction The idea of a “post-money world” conjures imag…
1 | Title: Human Aspirations in a Post‑Money World Introduction…
1 | Human Aspirations in a Post‑Money World Introduction Imagi…
1 | Human Aspirations in a Post‑Money World Introduction — What…
1 | Human Aspirations in a Post‑Money World Introduction — what…
1 | Introduction When money loses its central role as the measu…
1 | Introduction The idea of a post-money world — where money n…
1 | Introduction Money has been one of humanity’s most powerful…
A test could be “time to 1000 readable tokens”, but that doesn’t answer any need.
So it looks like we are aligned. Great job!
gpt-5 reasoning medium: 2m45s
gpt-5 reasoning low: 1m53s
gpt-5 reasoning minimal: 1m56s
gpt-4.1: 32s
Prompts were ~20k tokens.
Definitely.
The same prompt that takes 2-3 seconds with gpt-4.1 takes around 30-70 seconds with gpt-5. This is not usable in a production-level application.
See the following Trace log, the user prompt was just “Hello”.
Oh! My bad. I was only looking at small prompts. Yes that’s not good at all!
I think you will find it’s acceptable with small jobs.
No that’s just weird. I cannot repro or explain that.
There is also a system prompt of around 5k tokens, apart from “Hello”.
Which API do you use? This may be related to the Responses API. I have experienced many problems before that seemed model-related but were actually the Responses API.
Chat Completions!
Bingo
Can you repro this issue with another reasoning model with Responses?
Make very sure you are using the special format for the reasoning level in Responses btw, which differs from Completions. (links in earlier post)
This does smell like too many cycles going on inside the run, possibly because the specified reasoning level is being ignored?
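For reference, the Responses API nests the effort under reasoning, whereas Chat Completions takes a flat reasoning_effort; a minimal sketch with placeholder model and input:

```python
from openai import OpenAI

client = OpenAI()

# Responses API: effort is nested under "reasoning".
resp = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "minimal"},
    input="Hello",
)
print(resp.output_text)
```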
I am getting the same behavior. Simple tasks in the Agents API with 3-4k tokens are taking 70 seconds.
Edit: I think I may have found part of the problem. Even with verbosity and reasoning at their lowest settings, gpt-5 has a tendency to write a LOT more tokens than 4.1.
If you specify the length of the output, it equalizes somewhat. I’m still finding large structured outputs are significantly slower on gpt-5 though.
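For what it’s worth, here is a sketch of pinning all of that down on Chat Completions; it assumes the SDK exposes GPT-5’s verbosity parameter there, and the prompt, cap, and values are placeholders:

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-5",
    reasoning_effort="minimal",   # lowest reasoning setting
    verbosity="low",              # lowest verbosity setting (GPT-5 family, if exposed by your SDK version)
    max_completion_tokens=800,    # hard cap; placeholder value
    messages=[{
        "role": "user",
        # Explicitly stating the desired length in the prompt helps equalize output size.
        "content": "Summarize this ticket in at most 150 words: ...",
    }],
)
print(resp.choices[0].message.content)
```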
It seems like it is related to the Responses API, because it has nothing to do with the Agents SDK. Considering the Responses API is very problematic, I’m almost sure.
Can you also post your problem with details here?
I experienced the same thing. I was very excited for a new model given the excellent work I’ve been able to do with prior models. Very disappointed in the mobile version. There isn’t the option to switch back, and the answers are painfully slow, leaving it rather unusable for any real work given how long it would now take to get through one full conversation. I hope they see this feedback and address it.
We have been designing a very large portal with ChatGPT (previously 4o) for 3 months. With the new version 5, we can already say:
“ChatGPT 5 goes much deeper and is far better in the code it delivers.” I was really praying for it.
BUT
After yesterday’s promising presentation, we are all shocked by the slow response times. Every programming-related request now takes between 60 and 120 seconds.
A real disaster! And we are very disappointed.
Furthermore, the “computer use” function is not available, which is also very disappointing.
Markus
Project: https://www.Experts4Germany.com
Agree. Response times are a disaster…
This is absolutely terrible. In a fresh thread using the 5 Pro model, a simple “ping” took 1m 19s to generate “pong”. I mean, come on. If OpenAI wants me to keep shelling out $200 a month, they need to do better.