GPT-4o-mini randomly much slower than GPT-3.5-turbo

I use GPT-4o-mini and GPT-3.5-turbo-0125 to answer the same query, sampling each model 10 times.

GPT-3.5-turbo speed is consistent (about 5 seconds for 500 tokens).

GPT-4o-mini is usually slightly slower than GPT-3.5-turbo, but about 30% of the time it takes far longer (19 seconds for 500 tokens).
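For reference, this is roughly how I'm timing the calls (a minimal sketch with the official Python SDK; the prompt and token cap are placeholders, not my actual query):

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "..."  # placeholder for the query used in the comparison

def time_model(model: str, n: int = 10) -> list[float]:
    """Send the same query n times and return wall-clock latencies in seconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
            max_tokens=500,
        )
        latencies.append(time.perf_counter() - start)
    return latencies

for model in ("gpt-3.5-turbo-0125", "gpt-4o-mini"):
    times = time_model(model)
    print(model, [round(t, 1) for t in times])
```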

Thanks for flagging this. Do you have an example request_id that we can take a look to debug?

Thanks for getting back to me. I tested again and the problem seems to have gone away. GPT-4o-mini is now about the same speed as gpt-3.5-turbo (maybe a bit faster). Any idea whether that means it was a temporary issue that has been permanently fixed, or whether I'm likely to run into it again during peak times? I recently switched to Gemini Flash for most of my use cases, which is significantly faster than both of these OpenAI models, but if 4o-mini can at least be consistent, it may be worth looking at even if it's a little slower.

This issue has appeared again. Now gpt-4o-mini takes up to 10 seconds longer to process a request. These are the request IDs for the exact same queries with force_json:
gpt-4o-mini: chatcmpl-AVcR6x6a1B7gkn88VAfMrkInQ1dq3
gpt-3.5-turbo: chatcmpl-AVcU6bm3IQjDw53hoCmuBHMVkquSa

If you are using a strict JSON schema as the response_format, the backend first has to compile the schema into a grammar for constrained decoding, which can take several seconds (this is documented). The compiled grammar should be cached for subsequent runs with the same schema.

gpt-3.5-turbo does not support the json_schema response format at all.

Thanks for your quick response. We don't use a strict response_format yet; we use the older JSON mode by passing response_format={"type": "json_object"}. That works for both models. Do you think switching to a json_schema response_format would be faster?
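For reference, our current request is essentially this (a sketch; the actual prompt is omitted and the model name is swapped per test):

```python
from openai import OpenAI

client = OpenAI()

# Plain JSON mode: the model is steered toward valid JSON output,
# but no particular schema is enforced. Works with both models we test.
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # or "gpt-3.5-turbo-0125"
    messages=[
        {"role": "system", "content": "Reply in JSON."},
        {"role": "user", "content": "..."},  # placeholder query
    ],
    response_format={"type": "json_object"},
)
print(resp.id)  # request id, e.g. "chatcmpl-..."
print(resp.choices[0].message.content)
```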

It is json_schema that would be slower, not faster. It enforces JSON at a stricter level than plain JSON mode (which relies only on model training): the output is constrained to exactly the structure of the provided schema. Building that constraint is what incurs the penalty on the first generation with a new schema.
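For comparison, a strict-schema request looks roughly like this (a sketch; the schema here is an illustrative placeholder). The first call with a new schema pays the grammar-compilation cost; repeats with the same schema should hit the cache:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative schema; strict mode requires additionalProperties: false.
schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # gpt-3.5-turbo does not support json_schema
    messages=[{"role": "user", "content": "..."}],  # placeholder query
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "answer", "strict": True, "schema": schema},
    },
)
print(resp.choices[0].message.content)
```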

You can set "stream": true and receive the output as chunks over server-sent events. That lets you see whether it is the initial request setup (time to first token) or the token production rate that is affected.
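Something like this separates the two (a minimal sketch with the Python SDK; the query is a placeholder):

```python
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "..."}],  # placeholder query
    response_format={"type": "json_object"},
    stream=True,
)

first_token_at = None
chunks = 0
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        chunks += 1  # each content chunk is roughly one token

total = time.perf_counter() - start
if first_token_at is not None:
    print(f"time to first token: {first_token_at:.2f}s")
    print(f"output rate: {chunks / (total - first_token_at):.1f} chunks/s")
```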


Thanks a lot! That makes sense. I will try streaming to see what is happening. Do you have any idea why it is so much slower for 4o-mini than for 3.5? For us it is up to 10 seconds slower for the exact same request.

The service "is what it is". OpenAI has previously also given users in low payment tiers lower output rates.

Perhaps the gpt-3.5-turbo models are underutilized, just waiting for only you?