GPT-5 is very slow compared to 4.1 (Responses API)

In initial tests, it looks like the GPT-5 models (even mini and nano) are extremely slow compared to 4.1 or 4o. The scenarios I have tried are for function calling.

Is this expected?

20 Likes

Been coding with it for an hour; it takes forever to produce an answer, and I’m not even convinced it’s smarter, better, or using the full conversation context. It’s possible to shorten the answer by clicking on it, but then we might as well stick with 4.1, since we’re accustomed to its behaviour and know what to expect. I’ll give it more time, probably all morning, before deciding whether to switch back to 4.1 for coding. Disappointed, to say the least.

9 Likes

Yes, it’s slower than 4.1 and 4o by default and that’s expected:

It is a reasoning model, and its reasoning_effort (or reasoning.effort in the Responses API) defaults to medium, so it is spending time “thinking” (and costing you more tokens/money).

Try setting reasoning_effort (or reasoning.effort in Responses) to “minimal” and speed should improve significantly. Obviously that may affect response quality, but give it a go.

In my experience, even the full model then becomes quite “snappy”.

https://platform.openai.com/docs/api-reference/chat/create#chat_create-reasoning_effort

https://platform.openai.com/docs/api-reference/responses/create#responses_create-reasoning
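
For example, a minimal sketch with the Python SDK (the model choice and prompts here are placeholders; note the parameter is flat in Chat Completions but nested under reasoning in Responses):

```python
from openai import OpenAI

client = OpenAI()

# Chat Completions: reasoning_effort is a flat string parameter.
chat = client.chat.completions.create(
    model="gpt-5-mini",
    reasoning_effort="minimal",
    messages=[{"role": "user", "content": "Classify this ticket: 'My invoice is wrong.'"}],
)

# Responses API: the effort lives inside a reasoning object.
resp = client.responses.create(
    model="gpt-5-mini",
    reasoning={"effort": "minimal"},
    input="Classify this ticket: 'My invoice is wrong.'",
)
```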

5 Likes

I’m not even convinced it’s smarter, period; smarter than the best-in-class model you can send a refactor request to and get a one-and-done answer from.

More time spent “vibing” than it would take to write the code yourself, with the corrections and examples needed, it going outside the scope, chatting at you in the code, and even adding prints so it can “chat”.

Low effort: too dumb; medium effort: too disobedient and instruction-ignoring.

The AI has a behavior issue: it writes a lot of hedging, as in “this code is going to run and fail silently, because I don’t know what I’m doing from all the speculation I made”.


Speed:

The billing lets us infer the token production rate we can’t observe directly, which is what we’ll have to rely on to track day-by-day performance.

| Model (minimal) | Trials | Avg Latency (s) | Avg Stream Rate (tok/s) | Avg Total Rate (tok/s) |
| --- | --- | --- | --- | --- |
| gpt-4.1-mini | 10 | 0.628 | 75.441 | 72.007 |
| gpt-5-mini | 10 | 1.067 | 90.294 | 83.732 |

Note:

  • Latency is the time to first token, i.e. the delay the user experiences.
  • Stream rate meters only the streaming window after the first token (token counts via tiktoken; see the measurement sketch below).
  • Total rate uses usage.completion_tokens over the entire call duration.
  • reasoning_effort = minimal can use as few as 8 reasoning tokens. The default (medium) can consume all 1024 max_completion_tokens of this writing-assignment benchmark internally.
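
A rough sketch of how these three metrics could be measured (this is not the author’s actual benchmark code; it assumes the openai and tiktoken packages, and the prompt and settings are illustrative):

```python
import time
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")

start = time.monotonic()
first_token_at = None
streamed_tokens = 0
completion_tokens = None

stream = client.chat.completions.create(
    model="gpt-5-mini",
    reasoning_effort="minimal",
    max_completion_tokens=1024,
    stream=True,
    stream_options={"include_usage": True},
    messages=[{"role": "user", "content": "Write an essay: Human Aspirations in a Post-Money World"}],
)

for chunk in stream:
    now = time.monotonic()
    if chunk.usage is not None:
        # Final chunk carries usage when include_usage is set.
        completion_tokens = chunk.usage.completion_tokens
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    if delta:
        if first_token_at is None:
            first_token_at = now          # latency: time to first visible token
        streamed_tokens += len(enc.encode(delta))

end = time.monotonic()
latency = first_token_at - start
stream_rate = streamed_tokens / (end - first_token_at)  # visible tokens after first token
total_rate = completion_tokens / (end - start)          # usage.completion_tokens over whole call
print(f"latency={latency:.3f}s stream={stream_rate:.1f} tok/s total={total_rate:.1f} tok/s")
```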

Typical:

model gpt-5-mini: 1024 generated (incl. 8 reasoning), 1007 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base

Unique responses for gpt-4.1-mini (by first 60 chars):

6 | # Human Aspirations in a Post-Money World ## Introduction …
4 | Human Aspirations in a Post-Money World Introduction …

Unique responses for gpt-5-mini (by first 60 chars):

4 | Title: Human Aspirations in a Post‑Money World Introduction…
3 | Human Aspirations in a Post‑Money World Introduction Money…
1 | Title: Human Aspirations in a Post-Money World Introduction…
1 | Human Aspirations in a Post‑Money World Introduction — what…
1 | Human Aspirations in a Post‑Money World Introduction For m…

2 Likes

Nice analysis. I’m guessing the stream doesn’t start until the reasoning has ended, so it looks like it’s on better infrastructure but has more work to do, hence the latency.

I guess it catches up if the output is long enough, otherwise it is slower?

1 Like

“Catching up” is dubious with reasoning. It would take more output than the model is willing to write, while gpt-4.1 finishes first simply by not wanting to write as much.

But it’s a poor time to evaluate, before applications have transitioned and before inference machines have transitioned. The whole gpt-4.1 family has been running significantly slower for months versus its launch-day performance.

GPT-5 with medium reasoning and too low a max_tokens: done faster, but with much less to read before truncation. A call log of a small trial:

model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 640 reasoning), 375 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 769 reasoning), 246 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 626 reasoning), 389 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 481 reasoning), 534 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 472 reasoning), 543 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 713 reasoning), 302 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 528 reasoning), 487 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 672 reasoning), 343 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 559 reasoning), 456 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-5-mini: 1024 generated (incl. 725 reasoning), 290 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base
model gpt-4.1-mini: 1024 generated, 1024 delivered of 1024 max, o200k_base

Let the AI write all it wants for the same prompts, and gpt-5-mini without a verbosity setting comes in a resounding second, with more for you to read. Both models know their output caps well.

model gpt-4.1-mini: 1551 generated, 1551 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1403 generated, 1403 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1559 generated, 1559 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1531 generated, 1531 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1467 generated, 1467 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1404 generated, 1404 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1297 generated, 1297 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1387 generated, 1387 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1416 generated, 1416 delivered of 10240 max, o200k_base
model gpt-5-mini: 3000 generated (incl. 512 reasoning), 2479 delivered of 10240 max, o200k_base
model gpt-5-mini: 3290 generated (incl. 832 reasoning), 2449 delivered of 10240 max, o200k_base
model gpt-5-mini: 3117 generated (incl. 640 reasoning), 2468 delivered of 10240 max, o200k_base
model gpt-4.1-mini: 1397 generated, 1397 delivered of 10240 max, o200k_base
model gpt-5-mini: 3129 generated (incl. 768 reasoning), 2352 delivered of 10240 max, o200k_base
model gpt-5-mini: 3592 generated (incl. 768 reasoning), 2815 delivered of 10240 max, o200k_base
model gpt-5-mini: 3274 generated (incl. 704 reasoning), 2561 delivered of 10240 max, o200k_base
model gpt-5-mini: 3193 generated (incl. 768 reasoning), 2416 delivered of 10240 max, o200k_base
model gpt-5-mini: 3154 generated (incl. 576 reasoning), 2569 delivered of 10240 max, o200k_base
model gpt-5-mini: 3495 generated (incl. 768 reasoning), 2718 delivered of 10240 max, o200k_base
model gpt-5-mini: 3575 generated (incl. 704 reasoning), 2862 delivered of 10240 max, o200k_base

The benchmark launches calls asynchronously, in parallel (see the sketch below). The only opportunity for caching is successive runs or unexpected queueing by the server (and the prompt is too small anyway).
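
A minimal sketch of launching the trials concurrently with the async client, so no call can benefit from an earlier one (the prompt and trial counts are illustrative, not the exact benchmark):

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
PROMPT = "Write an essay: Human Aspirations in a Post-Money World"

async def one_call(model: str) -> int:
    resp = await client.chat.completions.create(
        model=model,
        max_completion_tokens=10240,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return resp.usage.completion_tokens

async def main() -> None:
    # All 20 requests go out at once, interleaving both models.
    tasks = [one_call(m) for m in ("gpt-4.1-mini", "gpt-5-mini") for _ in range(10)]
    print(await asyncio.gather(*tasks))

asyncio.run(main())
```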

Analysis with longer reasoning, to show that reasoning = latency here:

| Model | Trials | Avg Latency (s) | Avg Stream Rate (tok/s) | Avg Total Rate (tok/s) |
| --- | --- | --- | --- | --- |
| gpt-4.1-mini | 10 | 0.728 | 74.663 | 70.582 |
| gpt-5-mini | 10 | 6.405 | 112.284 | 102.904 |
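
To put numbers on the “does it catch up?” question above: a back-of-envelope break-even, assuming equal output lengths and that total time is roughly latency plus tokens divided by stream rate (which is not what actually happens, since gpt-5 tends to write more):

```python
def break_even_tokens(extra_latency_s: float, slower_rate: float, faster_rate: float) -> float:
    """Output length at which the faster streamer has paid off its extra first-token latency."""
    return extra_latency_s / (1 / slower_rate - 1 / faster_rate)

# Minimal effort (first table): break-even at roughly 200 output tokens.
print(break_even_tokens(1.067 - 0.628, 75.441, 90.294))
# Medium effort (this table): roughly 1265 output tokens before the latency is recovered.
print(break_even_tokens(6.405 - 0.728, 74.663, 112.284))
```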

Throw ~700 reasoning tokens in front and every output goes down a different path.

Unique responses for gpt-5-mini (by first 60 chars):

2 | Human Aspirations in a Post‑Money World Introduction Money…
1 | Introduction The idea of a “post-money world” conjures imag…
1 | Title: Human Aspirations in a Post‑Money World Introduction…
1 | Human Aspirations in a Post‑Money World Introduction Imagi…
1 | Human Aspirations in a Post‑Money World Introduction — What…
1 | Human Aspirations in a Post‑Money World Introduction — what…
1 | Introduction When money loses its central role as the measu…
1 | Introduction The idea of a post-money world — where money n…
1 | Introduction Money has been one of humanity’s most powerful…


A test could be “time to 1000 readable tokens”, but that doesn’t answer any need.

4 Likes

So it looks like we are aligned. Great job!

1 Like

| Model (reasoning) | Time |
| --- | --- |
| gpt-5 (medium) | 2m 45s |
| gpt-5 (low) | 1m 53s |
| gpt-5 (minimal) | 1m 56s |
| gpt-4.1 | 32 s |

Prompts were ~20k tokens.

4 Likes

Definitely.

The same prompt that takes 2-3 seconds with gpt-4.1 takes around 30-70 seconds with gpt-5. This is not usable in a production-level application.

See the following trace log; the user prompt was just “Hello”.

3 Likes

Oh! My bad. I was only looking at small prompts. Yes that’s not good at all!

I think you will find it’s acceptable with small jobs.

No that’s just weird. I cannot repro or explain that.

There is also a system prompt of around 5k tokens, apart from the “Hello”.

Which API do you use? This may be related to the Responses API. I have experienced many problems before that seemed model-related but were actually caused by the Responses API.

2 Likes

Chat Completions!

Bingo :bullseye:

2 Likes

Can you repro this issue with another reasoning model with Responses?

Make very sure you are using the special format for the reasoning level in Responses, by the way, which differs from Completions. (Links in the earlier post.)

This does smell like too many cycles going on inside the run, possibly because the specified reasoning level is being ignored?

I am getting the same behavior. Simple tasks in the Agents API with 3-4k tokens are taking 70 seconds.

Edit: I think I may have found part of the problem. Even with verbosity and reasoning at their lowest settings, GPT-5 has a tendency to write a LOT more tokens than 4.1.

If you specify the length of the output, it equalizes somewhat. I’m still finding that large structured outputs are significantly slower on GPT-5, though.
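
For reference, a hedged sketch of dialing everything down in the Responses API, assuming I have the parameter shape right (lowest reasoning effort, lowest verbosity, plus an explicit length target written into the prompt; the prompt wording itself is just an example):

```python
from openai import OpenAI

client = OpenAI()

# Minimal reasoning + low verbosity + an explicit length instruction in the prompt.
resp = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "minimal"},
    text={"verbosity": "low"},
    input="List three risks of deploying an unmonitored cron job, in at most 80 words.",
)
print(resp.output_text)
```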

1 Like

It seems like it is related to the Responses API, because it has nothing to do with the Agents SDK itself. Considering how problematic the Responses API is, I’m almost sure.

Can you also post your problem with details here?

I experienced the same thing. I was very excited for a new model, given the excellent work I’ve been able to do with prior models. Very disappointed in the mobile version. There isn’t an option to switch back, and the answers are painfully slow, leaving it rather unusable for any real work given how long it would now take to get through one full conversation. I hope they see this feedback and address it.

2 Likes

We have been designing a very large portal with ChatGPT (previously 4o) for 3 months. With the new version 5, we can already say:
“ChatGPT 5 goes much deeper and is far better in the code it delivers.” I was really praying for it.

BUT
After yesterday’s promising presentation, we are all shocked by the slow response times. Every programming-related request now takes between 60 and 120 seconds.
A real disaster! And we are very disappointed.

Furthermore, the “computer use” function is not available, which is also very disappointing.

Markus
Project: https://www.Experts4Germany.com

5 Likes

Agreed. Response times are a disaster…

1 Like

This is absolutely terrible. In a fresh thread using the 5 Pro model, a simple “ping” took 1m 19s to generate “pong”. I mean, come on. If OpenAI wants me to keep shelling out my $200 a month, they need to do better.