Using it as a backbone for voice, having every 20th response take over 1s isn’t good enough, and my understanding was that priority service was also going to give consistent latency even at peak periods.
Is this solvable? Any thoughts in advance much appreciated.
From what I can observe, priority tier requests to these models (5.4-mini and -nano) are currently being processed as default service tier. I’ve pinged the team to confirm whether this is expected beyond the documentation note that priority processing is not guaranteed. To me this looks like a bug, but I could be wrong.
I also noticed that chat.completions appears to be a bit faster, in case switching is easy as a temporary workaround.
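If switching is on the table, the request shape differs mainly in that the Responses API takes `input` while chat.completions takes role/content `messages`. A minimal sketch of the translation, keeping the model name the same (the model name below is a placeholder taken from this thread, not a verified identifier):

```python
# Sketch: translating a Responses-API-style payload into a chat.completions-style
# one. The model name is a placeholder from this thread, not a verified identifier.

def to_chat_completions(responses_request: dict) -> dict:
    """Map a Responses API payload to the chat.completions shape."""
    req = dict(responses_request)          # work on a copy
    prompt = req.pop("input")
    # chat.completions expects a list of role/content messages instead of "input"
    req["messages"] = [{"role": "user", "content": prompt}]
    return req

responses_request = {
    "model": "gpt-5.4-nano",      # model name stays the same across endpoints
    "input": "Say hello in one word.",
    "service_tier": "priority",   # tier can be requested on both endpoints
}

chat_request = to_chat_completions(responses_request)
# chat_request could then be sent with client.chat.completions.create(**chat_request)
```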
That’s great - thanks for letting me know. Hope it is a bug and that it will get resolved.
If they do not get processed as priority, is there anything in the logs to confirm this on my end? I assume this will hopefully be resolved in future so that they do?
I’m trying to use this directly with 11labs, so unfortunately I can’t see the logs, but could I get the completions endpoint to work like I could with the responses endpoint? Is the completions endpoint any different from, say, swapping the completions endpoint out for OpenRouter (which does work..) - i.e. does the model name stay the same?
Appreciate this - just to be clear: this fix relates to priority giving consistent results, and not to the fact that completions is faster than responses? Or are you seeing completions is more consistent too?
I observed in a sample of n=100 that time to first token with responses is higher than with chat.completions, particularly for the requests that take unexpectedly long.
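For reference, this comparison can be reproduced by streaming both endpoints, logging time to first token per request, and comparing tail percentiles. The measurement loop itself needs an API key, so this sketch only shows the percentile bookkeeping (nearest-rank method), with the TTFT samples stubbed in as hypothetical numbers:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples <= it."""
    ordered = sorted(samples)
    k = math.ceil(p / 100 * len(ordered))
    return ordered[max(k - 1, 0)]

# Hypothetical TTFT samples in seconds (stand-ins for two n=100 runs);
# in practice, collect these by timing stream=True requests to each endpoint.
responses_ttft = [0.35] * 90 + [1.2] * 10   # long tail on responses
chat_ttft = [0.33] * 95 + [0.7] * 5         # milder tail on chat.completions

for name, samples in [("responses", responses_ttft), ("chat.completions", chat_ttft)]:
    print(name, "p50:", percentile(samples, 50), "p95:", percentile(samples, 95))
```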
Additionally, I checked gpt-5-nano, which appears not to work with priority either… But I am still waiting for official confirmation, because I think it makes sense to have full control over latency, especially with the small models.
Whether or not you can expect a service_tier of “priority” to actually be delivered can be determined from the pricing page, which shows the doubled price you must pay after picking Priority - a table in which the gpt-5.4-nano model does not appear:
What I suspect, given that the token generation rate doesn’t follow the 12x cost reduction, is that “nano” is seen as a model capable of running on legacy hardware and thus is routed to a lower-performance pool more often than not.
Another aspect barely in your control is cache hits: on a model that can generate fast, this lookup could take a not-insignificant time relative to its benefit, especially with the 24-hour retention on the table. You could “distribute” calls by using a unique prompt_cache_key API parameter per call and see if your P99 latency is improved.
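A minimal sketch of that experiment, assuming the `prompt_cache_key` request parameter: give each call a fresh key so no two calls can share a cache entry, then compare tail latency against a run with a fixed key. The model name is a placeholder from this thread:

```python
import uuid

def request_kwargs(prompt: str, bypass_cache: bool) -> dict:
    """Build request kwargs; a unique prompt_cache_key per call prevents
    requests from landing on the same prompt cache entry."""
    kwargs = {
        "model": "gpt-5.4-nano",  # placeholder model name from this thread
        "input": prompt,
    }
    if bypass_cache:
        # Fresh key per call: no two requests share a cache entry
        kwargs["prompt_cache_key"] = f"no-cache-{uuid.uuid4()}"
    else:
        # Fixed key: requests with the same prefix can share a cache entry
        kwargs["prompt_cache_key"] = "voice-agent-v1"
    return kwargs

a = request_kwargs("hello", bypass_cache=True)
b = request_kwargs("hello", bypass_cache=True)
# a and b carry distinct cache keys, so the cache lookup cannot link them
```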
I haven’t broadly classified how much “reasoning” is done versus gpt-5-nano (which was excessive for still lower performance), or how much is needed to maintain quality, but “low” doesn’t seem to deliberate as much. The amount of reasoning per question is going to be quite variable in length, and happenstance, with a model where the low parameter count makes every token less certain. The first step is to see if “low” or “none” reasoning effort can fulfill the task, as reasoning is the latency-by-design in these thinking models, which even at “none” are still considering internally whether they shall refuse the answer. Then answer this: can the first-output-token time be directly correlated with the reasoning token generation count in “usage”?
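One way to answer that last question: log TTFT alongside the reasoning token count reported in each response’s usage block, and compute a simple correlation. A sketch of the bookkeeping with the per-request logs stubbed in as hypothetical numbers (the usage field path and values are assumptions, not measured data):

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-request logs: time to first token (seconds) and the
# reasoning token count pulled from the "usage" block of each response.
ttft = [0.30, 0.35, 0.42, 0.90, 1.30]
reasoning_tokens = [0, 8, 20, 150, 320]

r = pearson(ttft, reasoning_tokens)
# r close to 1.0 would mean reasoning length explains the slow first tokens
```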
Thanks for pointing out that the pricing table explains it.
That said, this feels a bit indirect. The request appears to be silently downgraded to default processing, while the pricing table only shows what is possible instead of clearly stating that these models currently cannot be used with this service tier.
A “service_tier”: “default” field is delivered in the return. It is also possible for a supporting model to be downgraded to default.
If requesting “flex”, you will get an error on a non-supported model, but it seems the decision was made to keep a failing or unsupported “priority” silent, except for the service tier reported in the response object.
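Since the downgrade is silent, the only defense is to inspect the tier echoed back in the response object. A minimal sketch, assuming the response has been parsed into a dict (the response payload below is hypothetical):

```python
def tier_honored(requested_tier: str, response: dict) -> bool:
    """Compare the tier we asked for against the tier the response reports.
    A silent downgrade shows up as requested "priority", delivered "default"."""
    return response.get("service_tier") == requested_tier

# Hypothetical parsed response from a request made with service_tier="priority"
response = {"id": "resp_123", "model": "gpt-5.4-nano", "service_tier": "default"}

if not tier_honored("priority", response):
    # There is no error to catch, only this field to inspect and log
    print("downgraded to:", response.get("service_tier"))
```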
btw - one of the issues I was having with completions is that 5.4-nano allows reasoning: none on the Responses API, but on completions only allows default.
I’m not seeing a noticeable difference at the moment.
btw, I am not setting this at request level (I can’t); I am doing it at project level.
Does this mean I won’t ever be able to get consistently low latency with priority+5.4-nano then?
Actually, I didn’t get an error earlier today when trying to use flex with these models. It didn’t work but I guess the point is that we are looking at moving parts.
It is not possible to consistently get 500ms response times across a diverse network, including DDoS protection layers and AI scanning of the output for recitation/copyright reproduction.
I mean, we say that, but my p75 is 520ms, so it’s seemingly possible - I guess the blocker here is that priority isn’t offered on the models where it makes most sense to offer it: nano/mini.