GPT-4o-mini randomly much slower than GPT-3.5-turbo

I use GPT-4o-mini and GPT-3.5-turbo-0125 to answer the same query, sampling each model 10 times.

GPT-3.5-turbo speed is consistent (about 5 seconds for 500 tokens).

GPT-4o-mini is usually slightly slower than GPT-3.5-turbo, but about 30% of the time it takes far longer (19 seconds for 500 tokens).
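For reference, this is roughly how I'm timing the calls (a minimal sketch with the official Python SDK; the prompt and token cap are placeholders, not my actual query):

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "..."  # placeholder for the query used in the comparison

def time_model(model: str, n: int = 10) -> list[float]:
    """Send the same query n times and return wall-clock latencies in seconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
            max_tokens=500,
        )
        latencies.append(time.perf_counter() - start)
    return latencies

for model in ("gpt-3.5-turbo-0125", "gpt-4o-mini"):
    times = time_model(model)
    print(model, [round(t, 1) for t in times])
```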

Thanks for flagging this. Do you have an example request_id that we can take a look to debug?

Thanks for getting back to me. I tested again and the problem seems to have gone away. GPT-4o-mini is now about the same speed as gpt-3.5-turbo (maybe a bit faster). Any idea whether that means it was a temporary issue that has been permanently fixed, or whether I'm likely to run into it again during peak times? I recently switched to Gemini Flash for most of my use cases, which is significantly faster than both of these OpenAI models, but if 4o-mini can at least be consistent, it may be worth looking at even if it's a little slower.

This issue has appeared again. Now gpt-4o-mini takes up to 10 seconds longer to process a request. These are the request IDs for the exact same queries with force_json:
gpt-4o-mini: chatcmpl-AVcR6x6a1B7gkn88VAfMrkInQ1dq3
gpt-3.5-turbo: chatcmpl-AVcU6bm3IQjDw53hoCmuBHMVkquSa

If you are using a strict JSON schema as the response_format, the backend first has to compile the schema into a grammar for constrained decoding, which can take several seconds (this is documented). The compiled grammar should be cached for subsequent runs with the same schema.

gpt-3.5-turbo does not support the json_schema response format at all.

Thanks for your quick response. We don't use a strict response_format yet; we use the older JSON mode by passing response_format={"type": "json_object"}. That works for both models. Do you think switching to a json_schema response_format would be faster?
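For reference, our current request is essentially this (a sketch; the actual prompt is omitted and the model name is swapped per test):

```python
from openai import OpenAI

client = OpenAI()

# Plain JSON mode: the model is steered toward valid JSON output,
# but no particular schema is enforced. Works with both models we test.
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # or "gpt-3.5-turbo-0125"
    messages=[
        {"role": "system", "content": "Reply in JSON."},
        {"role": "user", "content": "..."},  # placeholder query
    ],
    response_format={"type": "json_object"},
)
print(resp.id)  # request id, e.g. "chatcmpl-..."
print(resp.choices[0].message.content)
```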

It is json_schema that would be slower, not faster. It enforces JSON at a stricter level than plain JSON mode (which relies only on model training): the output is constrained to exactly the structure of the provided schema. Building that constraint is what incurs the penalty on the first generation with a new schema.
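For comparison, a strict-schema request looks roughly like this (a sketch; the schema here is an illustrative placeholder). The first call with a new schema pays the grammar-compilation cost; repeats with the same schema should hit the cache:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative schema; strict mode requires additionalProperties: false.
schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # gpt-3.5-turbo does not support json_schema
    messages=[{"role": "user", "content": "..."}],  # placeholder query
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "answer", "strict": True, "schema": schema},
    },
)
print(resp.choices[0].message.content)
```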

You can set "stream": true and receive the output as chunks over server-sent events. That lets you see whether it is the initial request setup (time to first token) or the token production rate that is affected.
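Something like this separates the two (a minimal sketch with the Python SDK; the query is a placeholder):

```python
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "..."}],  # placeholder query
    response_format={"type": "json_object"},
    stream=True,
)

first_token_at = None
chunks = 0
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        chunks += 1  # each content chunk is roughly one token

total = time.perf_counter() - start
if first_token_at is not None:
    print(f"time to first token: {first_token_at:.2f}s")
    print(f"output rate: {chunks / (total - first_token_at):.1f} chunks/s")
```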


Thanks a lot! That makes sense. I will try streaming to see what is happening. Do you have any idea why it is so much slower for 4o-mini than for 3.5? For us it is up to 10 seconds slower for the exact same request.

The service "is what it is". OpenAI has previously also given users in low payment tiers lower output rates.

Perhaps the gpt-3.5-turbo models are underutilized, just waiting for only you?