Using it as a backbone for voice, having every 20th response take over 1s isn’t good enough, and my understanding was that priority service was also going to give consistent latency even at peak periods.
Is this solvable? Any thoughts in advance much appreciated.
From what I can observe, priority tier requests to these models (5.4-mini and -nano) are currently being processed as default service tier. I’ve pinged the team to confirm whether this is expected beyond the documentation note that priority processing is not guaranteed. To me this looks like a bug, but I could be wrong.
I also noticed that chat.completions appears to be a bit faster, in case switching is easy as a temporary workaround.
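If switching is on the table, the request shape differs mainly in that the Responses API takes `input` while chat.completions takes role/content `messages`. A minimal sketch of the translation, keeping the model name the same (the model name below is a placeholder taken from this thread, not a verified identifier):

```python
# Sketch: translating a Responses-API-style payload into a chat.completions-style
# one. The model name is a placeholder from this thread, not a verified identifier.

def to_chat_completions(responses_request: dict) -> dict:
    """Map a Responses API payload to the chat.completions shape."""
    req = dict(responses_request)          # work on a copy
    prompt = req.pop("input")
    # chat.completions expects a list of role/content messages instead of "input"
    req["messages"] = [{"role": "user", "content": prompt}]
    return req

responses_request = {
    "model": "gpt-5.4-nano",      # model name stays the same across endpoints
    "input": "Say hello in one word.",
    "service_tier": "priority",   # tier can be requested on both endpoints
}

chat_request = to_chat_completions(responses_request)
# chat_request could then be sent with client.chat.completions.create(**chat_request)
```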
That’s great - thanks for letting me know. Hope it is a bug and that it will get resolved.
If they do not get processed as priority, is there anything in the logs to confirm this on my end? I assume this will hopefully be resolved in future so that they do?
I’m trying to use this directly with 11labs, so unfortunately I can’t see the logs, but could I get the completions endpoint to work like I could with the responses endpoint? Is the completions endpoint any different from, say, swapping the completions endpoint out for OpenRouter (which does work..) - i.e. does the model name stay the same?
Appreciate this - just to be clear: this fix relates to priority giving consistent results, and not to the fact that completions is faster than responses? Or are you seeing completions is more consistent too?
I observed in a sample of n=100 that time to first token with responses is higher than with chat.completions, particularly for the requests that take unexpectedly long.
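For reference, this comparison can be reproduced by streaming both endpoints, logging time to first token per request, and comparing tail percentiles. The measurement loop itself needs an API key, so this sketch only shows the percentile bookkeeping (nearest-rank method), with the TTFT samples stubbed in as hypothetical numbers:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples <= it."""
    ordered = sorted(samples)
    k = math.ceil(p / 100 * len(ordered))
    return ordered[max(k - 1, 0)]

# Hypothetical TTFT samples in seconds (stand-ins for two n=100 runs);
# in practice, collect these by timing stream=True requests to each endpoint.
responses_ttft = [0.35] * 90 + [1.2] * 10   # long tail on responses
chat_ttft = [0.33] * 95 + [0.7] * 5         # milder tail on chat.completions

for name, samples in [("responses", responses_ttft), ("chat.completions", chat_ttft)]:
    print(name, "p50:", percentile(samples, 50), "p95:", percentile(samples, 95))
```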
Additionally, I checked gpt-5-nano, which appears not to work with priority either… But I am still waiting for official confirmation, because I think it makes sense to have full control over latency, especially with the small models.
Whether or not you can expect a service_tier of “priority” to actually be delivered can be determined from the pricing page, which shows the doubled price you must pay after picking Priority - a table in which the gpt-5.4-nano model does not appear:
What I suspect, given that the token generation rate doesn’t follow the 12x cost reduction, is that “nano” is seen as a model capable of running on legacy hardware and thus is routed to a lower-performance pool more often than not.
Another aspect barely in your control is cache hits: on a model that can generate fast, this lookup could take a not-insignificant time relative to its benefit, especially with the 24-hour retention on the table. You could “distribute” calls by using a unique prompt_cache_key API parameter per call and see if your P99 latency is improved.
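A minimal sketch of that experiment, assuming the `prompt_cache_key` request parameter: give each call a fresh key so no two calls can share a cache entry, then compare tail latency against a run with a fixed key. The model name is a placeholder from this thread:

```python
import uuid

def request_kwargs(prompt: str, bypass_cache: bool) -> dict:
    """Build request kwargs; a unique prompt_cache_key per call prevents
    requests from landing on the same prompt cache entry."""
    kwargs = {
        "model": "gpt-5.4-nano",  # placeholder model name from this thread
        "input": prompt,
    }
    if bypass_cache:
        # Fresh key per call: no two requests share a cache entry
        kwargs["prompt_cache_key"] = f"no-cache-{uuid.uuid4()}"
    else:
        # Fixed key: requests with the same prefix can share a cache entry
        kwargs["prompt_cache_key"] = "voice-agent-v1"
    return kwargs

a = request_kwargs("hello", bypass_cache=True)
b = request_kwargs("hello", bypass_cache=True)
# a and b carry distinct cache keys, so the cache lookup cannot link them
```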
I haven’t broadly classified how much “reasoning” is done versus gpt-5-nano (which was excessive for still lower performance), or how much is needed to maintain quality, but “low” doesn’t seem to deliberate as much. The amount of reasoning per question is going to be quite variable in length, and happenstance, with a model where the low parameter count makes every token less certain. The first step is to see if “low” or “none” reasoning effort can fulfill the task, as reasoning is the latency-by-design in these thinking models, which even at “none” are still considering internally whether they shall refuse the answer. Then answer this: can the first-output-token time be directly correlated with the reasoning token generation count in “usage”?
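One way to answer that last question: log TTFT alongside the reasoning token count reported in each response’s usage block, and compute a simple correlation. A sketch of the bookkeeping with the per-request logs stubbed in as hypothetical numbers (the usage field path and values are assumptions, not measured data):

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-request logs: time to first token (seconds) and the
# reasoning token count pulled from the "usage" block of each response.
ttft = [0.30, 0.35, 0.42, 0.90, 1.30]
reasoning_tokens = [0, 8, 20, 150, 320]

r = pearson(ttft, reasoning_tokens)
# r close to 1.0 would mean reasoning length explains the slow first tokens
```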
Thanks for pointing out that the pricing table explains it.
That said, this feels a bit indirect. The request appears to be silently downgraded to default processing, while the pricing table only shows what is possible instead of clearly stating that these models currently cannot be used with this service tier.
A “service_tier”: “default” field is delivered in the return. It is also possible for a supporting model to be downgraded to default.
If requesting “flex”, you will get an error on a non-supported model, but it seems the decision was made to keep a failing or unsupported “priority” silent, except for the service tier reported in the response object.
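Since the downgrade is silent, the only defense is to inspect the tier echoed back in the response object. A minimal sketch, assuming the response has been parsed into a dict (the response payload below is hypothetical):

```python
def tier_honored(requested_tier: str, response: dict) -> bool:
    """Compare the tier we asked for against the tier the response reports.
    A silent downgrade shows up as requested "priority", delivered "default"."""
    return response.get("service_tier") == requested_tier

# Hypothetical parsed response from a request made with service_tier="priority"
response = {"id": "resp_123", "model": "gpt-5.4-nano", "service_tier": "default"}

if not tier_honored("priority", response):
    # There is no error to catch, only this field to inspect and log
    print("downgraded to:", response.get("service_tier"))
```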
btw - one of the issues I was having with completions is that 5.4-nano allows reasoning: none on the Responses API, but on completions only allows default.
I’m not seeing a noticeable difference at the moment.
btw, I am not setting this at request level (I can’t); I am doing it at project level.
Does this mean I won’t ever be able to get consistently low latency with priority+5.4-nano then?
Actually, I didn’t get an error earlier today when trying to use flex with these models. It didn’t work but I guess the point is that we are looking at moving parts.
It is not possible to consistently get 500ms response times across a diverse network, including DDoS protection layers and AI scanning of the output for recitation/copyright reproduction.
I mean, we say that, but my p75 is 520ms, so it’s seemingly possible - I guess the blocker here is that priority isn’t offered on the models where it makes most sense to offer it: nano/mini.