The only thing I'd note is that, besides -08-06 coming in last for almost all of the requests, there were two 11-20 stragglers that took a bit longer still.
Replace that reliably-slowest model with another and benchmark the call blast again:
Since the input was now cached, I added randomization to both the system message and `prompt_cache_key` to ensure load distribution and cache-breaking beyond a single route (before, only the `user` field varied).
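A minimal sketch of what that cache-busting randomization might look like, assuming the Chat Completions `prompt_cache_key` parameter; the model name, system text, and nonce format are illustrative placeholders:

```python
import random
import string

def build_request(model: str, prompt: str) -> dict:
    """Build Chat Completions kwargs with a random nonce in both the
    system message and prompt_cache_key, so repeated benchmark calls
    spread across routes instead of hitting one cached prefix."""
    nonce = "".join(random.choices(string.ascii_lowercase + string.digits, k=12))
    return {
        "model": model,
        "messages": [
            # A unique prefix per trial defeats prompt-cache hits
            {"role": "system", "content": f"You are a helpful assistant. [run:{nonce}]"},
            {"role": "user", "content": prompt},
        ],
        # Distinct cache keys also avoid pinning every call to one route
        "prompt_cache_key": f"bench-{nonce}",
    }

# Kwargs would then be passed as client.chat.completions.create(**build_request(...))
```

Two consecutive calls produce different system prefixes and different cache keys, which is the property the benchmark needs.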
| Model | Trials | Avg Latency (s) | Avg Stream Rate (tok/s) | Avg Total Rate (tok/s) |
|---|---|---|---|---|
| gpt-4o-2024-05-13 | 10 | 0.722 | 110.215 | 94.781 |
| gpt-4.1 | 10 | 0.732 | 64.934 | 59.270 |
| gpt-4o-2024-11-20 | 10 | 0.721 | 82.984 | 73.920 |
| Model | Trials | Avg Latency (s) | Avg Stream Rate (tok/s) | Avg Total Rate (tok/s) |
|---|---|---|---|---|
| gpt-4o-2024-05-13 | 10 | 0.776 | 104.944 | 90.425 |
| gpt-4.1 | 10 | 0.767 | 65.305 | 59.017 |
| gpt-4o-2024-11-20 | 10 | 0.633 | 81.658 | 73.653 |
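For reference, the three table metrics can be derived from per-trial wall-clock timestamps; a sketch, assuming latency means time to first streamed token (the timestamp values below are made up):

```python
def rates(t_start: float, t_first_token: float, t_done: float, n_tokens: int):
    """Compute benchmark metrics from one streamed trial.

    latency     : time from request to first streamed token (s)
    stream rate : tokens/s from first token until completion
    total rate  : tokens/s from request start until completion
    """
    latency = t_first_token - t_start
    stream_rate = n_tokens / (t_done - t_first_token)
    total_rate = n_tokens / (t_done - t_start)
    return latency, stream_rate, total_rate

# Example: 500 tokens, first token after 0.7 s, finished 5.7 s after start
latency, stream, total = rates(0.0, 0.7, 5.7, 500)
```

The total rate is always below the stream rate because it charges the first-token latency against the same token count, which matches the gap visible in both tables.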
It looks like your fastest option currently is to pay a bit more for 05-13, where you likely pay in proportion to the compute expended by that first-generation model as well.
Another thing I noticed today is that suddenly, when I use o3, it talks about a much smaller token limit. I used o3 for a while, and when I used it just now I suddenly get “The requested length (“at least 25 000 tokens”) exceeds the maximum that can be generated or displayed in a single answer on this platform (current hard limit: 8 192 tokens—including both user query and assistant answer).”. However, o3 does not have a token limit of 8,192, so I don't understand why it says that? https://platform.openai.com/docs/models/compare
Then I tried switching from o3 to o1, and it did not give a warning, but the output was worse than what I previously got from o1. So now I have a situation whereby gpt-5 is not working well, o3 seems to have odd context limits, and o1 output has worsened. I guess I will need to switch my API to Claude soon?
```
Today is 2025-08-12 Monday
Reasoning: high effort
Input budget: 400k tokens
Output budget: 128k tokens
Free tokens, uncounted towards word budget: repeated sections of prior messages
Model Class: GPT-5, self-reasoning
```
You might get some acknowledgement because it knows about gpt-4-turbo, which is 128k input / 4k output. However, it is mostly pointless on reasoning models, because a “developer” message is a degraded-trust instruction: as a consumer of the product, you sit low in the trust hierarchy without “system”-level control. It will even spit out the name of an OpenAI product into your application.
The solution is not to ask, and not to provide developer information for the AI to inspect, judge, and ultimately reject internally because it knows best.
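One way to act on that advice is to enforce limits through request parameters rather than stating budgets in prose the model can inspect and second-guess. A sketch of such a request body, with parameter names as in the Responses API (`max_output_tokens`, `reasoning.effort`); the model name and values here are placeholders:

```python
def build_reasoning_request(model: str, user_text: str) -> dict:
    """Request-body sketch: length limits and effort go in API
    parameters enforced by the server, not in a developer message
    the model can judge and reject internally."""
    return {
        "model": model,
        "input": [{"role": "user", "content": user_text}],
        # Hard cap on generated tokens, applied server-side
        "max_output_tokens": 25_000,
        # Reasoning effort set as a parameter, not as prompt text
        "reasoning": {"effort": "high"},
    }

payload = build_reasoning_request("o3", "Write the full report.")
```

The point of the sketch is the division of labor: anything that can be a parameter should be one, leaving the prompt for the task itself.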
Thanks, it’s probably somewhat off topic indeed, although I do have slowness concerns. Almost all API outputs have become very slow compared to yesterday and the last week(s), regardless of whether I use o1, o3, or gpt-5. Previously the same type of ask took 3 minutes to output; now it takes 10 minutes with gpt-5.
Regarding the suggestions, I understand that I can improve the system/developer messages; however, my issue is that without any change to my prompts/input or to the requested models, I suddenly get these new messages. I’ve been using o1 and o3 for weeks for a specific purpose, and now I suddenly get these strange responses. From my POV that means it’s unstable, but I will try the suggestions and see whether they improve the response/output.