Thanks for getting back to me. I tested again and the problem seems to have gone away. 4o mini is now about the same speed as (maybe even a bit faster than) gpt-3.5-turbo. Any idea whether that means it was a temporary issue that has been permanently fixed, or whether I’m likely to run into it again during peak times? I recently switched to Gemini Flash for most of my use cases, which is significantly faster than both of these OpenAI models, but if 4o mini can at least be consistent, it may be worth looking at even if it’s a little slower.
This issue seems to have appeared again. gpt-4o-mini now takes up to 10 seconds longer to process a request. These are the request IDs for the exact same queries with force_json:
4o-mini: ‘chatcmpl-AVcR6x6a1B7gkn88VAfMrkInQ1dq3’
3.5: ‘chatcmpl-AVcU6bm3IQjDw53hoCmuBHMVkquSa’
If you are using a strict JSON schema as the response_format, the backend first has to build the grammar that enforces it, which can take several seconds (this is documented). The result should be cached for subsequent runs with the same schema.
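For reference, a strict-schema request looks roughly like this with the Python SDK (a sketch only; the model, prompt, and schema here are placeholders for illustration):

```python
from openai import OpenAI

client = OpenAI()

# The first call with a new schema may take several extra seconds while the
# grammar is built; later calls with the same schema should hit the cache.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract the order details as JSON."},
        {"role": "user", "content": "Two widgets, ship to Berlin."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "order",  # placeholder name
            "strict": True,   # exact schema enforcement
            "schema": {
                "type": "object",
                "properties": {
                    "item": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "destination": {"type": "string"},
                },
                "required": ["item", "quantity", "destination"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)
```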
Thanks for your quick response. We don’t use a fixed response_format schema yet; we use the older JSON mode by passing response_format = {"type": "json_object"}. That works for both models. Do you think switching to a schema-based response_format would be faster?
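For reference, our current call looks roughly like this (simplified sketch; the real prompt and client setup are more involved):

```python
from openai import OpenAI

client = OpenAI()

# Older JSON mode: the output is valid JSON, but no schema is enforced.
# The prompt has to mention JSON for the API to accept this mode.
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # the same call works with gpt-3.5-turbo
    messages=[
        {"role": "system", "content": "Reply with a JSON object."},
        {"role": "user", "content": "Summarize this ticket as JSON."},
    ],
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)
```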
It is json_schema that would be slower, at least on the first request. It enforces JSON at an even higher level than the model’s token training alone, by constraining the structure that can be produced to exactly the provided schema. Building that constraint is what incurs the penalty on the first generation with a given schema.
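If you want to confirm that first-call penalty yourself, a rough sketch like this (placeholder schema, prompt, and model) should show the first request with a fresh schema taking noticeably longer than the repeats:

```python
import time

from openai import OpenAI

client = OpenAI()

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "answer",  # placeholder
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
            "additionalProperties": False,
        },
    },
}

# Expect the first call to pay the schema setup cost; repeats of the same
# schema should be served from the cached artifact and return faster.
for i in range(3):
    t0 = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Answer in one short sentence."}],
        response_format=response_format,
    )
    print(f"call {i + 1}: {time.perf_counter() - t0:.2f}s")
```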
You can pass "stream": true and receive chunks of server-sent event output. That lets you see whether it is the initial request setup or the token production rate that is affected.
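A rough way to measure that with the Python SDK (model and prompt are placeholders):

```python
import time

from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Return a small JSON object about the weather."}],
    response_format={"type": "json_object"},
    stream=True,
)

first_chunk_at = None
chunks = 0
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        chunks += 1
end = time.perf_counter()

# A long time-to-first-chunk points at request setup; a slow chunk rate
# points at generation speed instead.
if first_chunk_at is not None:
    print(f"time to first chunk: {first_chunk_at - start:.2f}s")
    print(f"generation time:     {end - first_chunk_at:.2f}s over {chunks} chunks")
```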
Thanks a lot! That makes sense. I will try streaming mode to see what is happening. Do you also have an idea why it is so much slower for 4o-mini than for 3.5? For us it is up to 10 seconds slower for the exact same request.