Has gpt-4o-2024-11-20 recently become slower?

Has anyone recently noticed a performance downgrade in the gpt-4o-* models?

Over the last two days, API calls to gpt-4o-2024-11-20 have been taking twice as long as before…

I guess it might be OpenAI redistributing internal compute capacity to speed up the GPT-5 models… and now we have two slow model families…

| Model (512 max) | Trials | Avg Latency (s) | Avg Stream Rate (tok/s) | Avg Total Rate (tok/s) |
|---|---|---|---|---|
| gpt-4o-2024-05-13 | 10 | 1.282 | 81.570 | 67.592 |
| gpt-4o-2024-08-06 | 10 | 1.201 | 40.453 | 37.003 |
| gpt-4o-2024-11-20 | 10 | 1.030 | 65.218 | 57.375 |

The only thing I note is that, besides -08-06 coming in last on almost every request, there were two -11-20 stragglers that still took a bit longer. A sketch of how numbers like these can be collected follows below.
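For context, here is a minimal sketch of a timing harness that could produce numbers like these (my reconstruction, not the actual script; I'm assuming latency = time to first streamed token, stream rate = tokens after the first divided by the time they took, total rate = tokens over the whole call, and I treat each streamed content delta as roughly one token):

```python
import time
from openai import OpenAI  # assumes the official openai Python SDK

client = OpenAI()

def benchmark(model: str, prompt: str, max_out: int = 512):
    """One timed streaming call: returns (latency, stream_rate, total_rate)."""
    start = time.perf_counter()
    first = None
    tokens = 0  # approximation: each streamed content delta ~ one token
    stream = client.chat.completions.create(
        model=model,
        max_completion_tokens=max_out,
        stream=True,
        messages=[{"role": "user", "content": prompt}],
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()  # first token marks the latency
            tokens += 1
    end = time.perf_counter()
    latency = first - start                                         # time to first token
    stream_rate = (tokens - 1) / (end - first) if tokens > 1 else 0  # rate once streaming began
    total_rate = tokens / (end - start)                             # rate over the whole call
    return latency, stream_rate, total_rate
```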

Now replace that reliably slowest model (-08-06) with another (gpt-4.1) and run the call blast again:

Since the input was now cached, I added randomization to both the “system” message and the “prompt_cache_key” to ensure load distribution and cache-breaking beyond a single route (before, I had relied on “user” alone).
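In case anyone wants to reproduce this, a minimal sketch of that randomization, assuming a recent Python SDK that exposes the `prompt_cache_key` request parameter (the message text is illustrative):

```python
import uuid
from openai import OpenAI

client = OpenAI()

def uncached_call(model: str, user_prompt: str):
    nonce = uuid.uuid4().hex  # fresh randomness per request
    return client.chat.completions.create(
        model=model,
        max_completion_tokens=512,
        # A random prefix in the system message changes the leading tokens,
        # so the prompt cache cannot hit on a previous request.
        messages=[
            {"role": "system", "content": f"[run {nonce}] You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
        # A random cache key also avoids pinning every request to the same
        # cached route, spreading the blast across deployments.
        prompt_cache_key=nonce,
    )
```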

| Model | Trials | Avg Latency (s) | Avg Stream Rate (tok/s) | Avg Total Rate (tok/s) |
|---|---|---|---|---|
| gpt-4o-2024-05-13 | 10 | 0.722 | 110.215 | 94.781 |
| gpt-4.1 | 10 | 0.732 | 64.934 | 59.270 |
| gpt-4o-2024-11-20 | 10 | 0.721 | 82.984 | 73.920 |

| Model | Trials | Avg Latency (s) | Avg Stream Rate (tok/s) | Avg Total Rate (tok/s) |
|---|---|---|---|---|
| gpt-4o-2024-05-13 | 10 | 0.776 | 104.944 | 90.425 |
| gpt-4.1 | 10 | 0.767 | 65.305 | 59.017 |
| gpt-4o-2024-11-20 | 10 | 0.633 | 81.658 | 73.653 |

It looks like your fastest option currently is to pay a bit more for -05-13, where the higher price likely correlates with the computation expended by that first-generation model as well.


Another thing I noticed today is that o3 suddenly talks about a much smaller token limit. I have used o3 for a while, and just now I suddenly get “The requested length (“at least 25 000 tokens”) exceeds the maximum that can be generated or displayed in a single answer on this platform (current hard limit: 8 192 tokens—including both user query and assistant answer).” However, o3 does not have a token limit of 8,192, so I don’t understand why it says that. https://platform.openai.com/docs/models/compare

Then I tried switching from o3 to o1, and it did not give a warning, but the output was worse than what I previously got from o1. So now I have a situation whereby gpt-5 is not working well, o3 seems to have odd context limits, and o1 output has worsened. I guess I will need to switch my API to Claude soon?

Do you have a slow model concern? Or shall a moderator take care of your off-topic posting?

It says that because the AI model has a knowledge cutoff date about a year old and doesn’t know what its own limits are.

And then, because you as a developer didn’t orient the AI model properly with a developer message, it produced a refusal based on its own assumptions.

Better:

```text
developer

Today is 2025-08-12 Monday
Reasoning: high effort
Input budget: 400k tokens
Output budget: 128k tokens
Free tokens, uncounted towards word budget: repeated sections of prior messages
Model Class: GPT-5, self-reasoning
```
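For illustration, a message like that can be passed through the “developer” role of the Responses API; a sketch only, with the budget lines taken from the example above (they are orientation text the model reads, not documented or enforced limits):

```python
from openai import OpenAI

client = OpenAI()

# Orientation text from the example above: plain instruction text,
# not values the API validates or enforces.
DEV_ORIENTATION = """Today is 2025-08-12 Monday
Input budget: 400k tokens
Output budget: 128k tokens"""

response = client.responses.create(
    model="o3",
    reasoning={"effort": "high"},  # reasoning effort for o-series models
    input=[
        {"role": "developer", "content": DEV_ORIENTATION},
        {"role": "user", "content": "Write a detailed long-form report on ..."},
    ],
)
print(response.output_text)
```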

You might get some acknowledgement because it knows about gpt-4-turbo, which is 128k input / 4k output. However, this is mostly pointless on reasoning models, because “developer” is a degraded-quality instruction channel, and you are a consumer of a product, low in the trust hierarchy, without “system” control. It will even spit out the name of an OpenAI product into your application.

The solution is not to ask, and not to provide developer information for the AI to inspect, judge, and ultimately reject internally because it knows best.

Thanks, it’s probably somewhat off topic indeed, although I do have slowness concerns. Almost all API outputs have become very slow compared to yesterday and the last week(s), regardless of whether I use o1, o3, or gpt-5. Previously it took 3 minutes to produce output, while now the same type of request takes 10 minutes on gpt-5.

Regarding the suggestions: I understand that I can improve the system/developer messages, but my issue is that, without any change to my prompts/input or to the requested models, I suddenly get these new messages. I’ve been using o1 and o3 for weeks for a specific purpose, and now I suddenly get these strange responses. From my point of view that means it’s unstable, but I will try whether the suggestions improve the output.


Yes, one can discount the slowness as a problem of scaling up model deployment to meet demand, and hope it gets better.

This topic, though, is about gpt-4o, which, per the benchmarks above, did not degrade to the point of confirming the original suspicion.