The only thing I'd note is that, besides -08-06 coming in last for almost all of the requests, there were two 11-20 stragglers that took a bit longer still.
Replace that reliably-slowest model with another and benchmark the call blast again:
Since the input was now cached, I added randomization to both the system message and `prompt_cache_key` to ensure load distribution and cache-breaking beyond a single route (before, only the `user` field varied).
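A minimal sketch of what that cache-busting randomization might look like, assuming the Chat Completions `prompt_cache_key` parameter; the model name, system text, and nonce format are illustrative placeholders:

```python
import random
import string

def build_request(model: str, prompt: str) -> dict:
    """Build Chat Completions kwargs with a random nonce in both the
    system message and prompt_cache_key, so repeated benchmark calls
    spread across routes instead of hitting one cached prefix."""
    nonce = "".join(random.choices(string.ascii_lowercase + string.digits, k=12))
    return {
        "model": model,
        "messages": [
            # A unique prefix per trial defeats prompt-cache hits
            {"role": "system", "content": f"You are a helpful assistant. [run:{nonce}]"},
            {"role": "user", "content": prompt},
        ],
        # Distinct cache keys also avoid pinning every call to one route
        "prompt_cache_key": f"bench-{nonce}",
    }

# Kwargs would then be passed as client.chat.completions.create(**build_request(...))
```

Two consecutive calls produce different system prefixes and different cache keys, which is the property the benchmark needs.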
| Model | Trials | Avg Latency (s) | Avg Stream Rate (tok/s) | Avg Total Rate (tok/s) |
|---|---|---|---|---|
| gpt-4o-2024-05-13 | 10 | 0.722 | 110.215 | 94.781 |
| gpt-4.1 | 10 | 0.732 | 64.934 | 59.270 |
| gpt-4o-2024-11-20 | 10 | 0.721 | 82.984 | 73.920 |
| Model | Trials | Avg Latency (s) | Avg Stream Rate (tok/s) | Avg Total Rate (tok/s) |
|---|---|---|---|---|
| gpt-4o-2024-05-13 | 10 | 0.776 | 104.944 | 90.425 |
| gpt-4.1 | 10 | 0.767 | 65.305 | 59.017 |
| gpt-4o-2024-11-20 | 10 | 0.633 | 81.658 | 73.653 |
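For reference, the three table metrics can be derived from per-trial wall-clock timestamps; a sketch, assuming latency means time to first streamed token (the timestamp values below are made up):

```python
def rates(t_start: float, t_first_token: float, t_done: float, n_tokens: int):
    """Compute benchmark metrics from one streamed trial.

    latency     : time from request to first streamed token (s)
    stream rate : tokens/s from first token until completion
    total rate  : tokens/s from request start until completion
    """
    latency = t_first_token - t_start
    stream_rate = n_tokens / (t_done - t_first_token)
    total_rate = n_tokens / (t_done - t_start)
    return latency, stream_rate, total_rate

# Example: 500 tokens, first token after 0.7 s, finished 5.7 s after start
latency, stream, total = rates(0.0, 0.7, 5.7, 500)
```

The total rate is always below the stream rate because it charges the first-token latency against the same token count, which matches the gap visible in both tables.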
It looks like your fastest option currently is to pay a bit more for 05-13, where you likely pay in proportion to the compute expended by that first-generation model as well.
Another thing I noticed today is that suddenly, when I use o3, it talks about a much smaller token limit. I used o3 for a while, and when I used it just now I suddenly get “The requested length (“at least 25 000 tokens”) exceeds the maximum that can be generated or displayed in a single answer on this platform (current hard limit: 8 192 tokens—including both user query and assistant answer).”. However, o3 does not have a token limit of 8,192, so I don't understand why it says that? https://platform.openai.com/docs/models/compare
Then I tried switching from o3 to o1, and it did not give a warning, but the output was worse than what I previously got from o1. So now I have a situation whereby gpt-5 is not working well, o3 seems to have odd context limits, and o1 output has worsened. I guess I will need to switch my API to Claude soon?
```
Today is 2025-08-12 Monday
Reasoning: high effort
Input budget: 400k tokens
Output budget: 128k tokens
Free tokens, uncounted towards word budget: repeated sections of prior messages
Model Class: GPT-5, self-reasoning
```
You might get some acknowledgement because it knows about gpt-4-turbo, which is 128k input / 4k output. However, it is mostly pointless on reasoning models, because a “developer” message is a degraded-trust instruction: as a consumer of the product, you sit low in the trust hierarchy without “system”-level control. It will even spit out the name of an OpenAI product into your application.
The solution is not to ask, and not to provide developer information for the AI to inspect, judge, and ultimately reject internally because it knows best.
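One way to act on that advice is to enforce limits through request parameters rather than stating budgets in prose the model can inspect and second-guess. A sketch of such a request body, with parameter names as in the Responses API (`max_output_tokens`, `reasoning.effort`); the model name and values here are placeholders:

```python
def build_reasoning_request(model: str, user_text: str) -> dict:
    """Request-body sketch: length limits and effort go in API
    parameters enforced by the server, not in a developer message
    the model can judge and reject internally."""
    return {
        "model": model,
        "input": [{"role": "user", "content": user_text}],
        # Hard cap on generated tokens, applied server-side
        "max_output_tokens": 25_000,
        # Reasoning effort set as a parameter, not as prompt text
        "reasoning": {"effort": "high"},
    }

payload = build_reasoning_request("o3", "Write the full report.")
```

The point of the sketch is the division of labor: anything that can be a parameter should be one, leaving the prompt for the task itself.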
Thanks, it’s probably somewhat off topic indeed, although I do have slowness concerns. Almost all API outputs have become very slow compared to yesterday and the last week(s), regardless of whether I use o1, o3, or gpt-5. Previously the same type of ask took 3 minutes to output; now it takes 10 minutes with gpt-5.
Regarding the suggestions, I understand that I can improve the system/developer messages; however, my issue is that without any change to my prompts/input or to the requested models, I suddenly get these new messages. I’ve been using o1 and o3 for weeks for a specific purpose, and now I suddenly get these strange responses. From my POV that means it’s unstable, but I will try the suggestions and see whether they improve the response/output.