GPT-4o mini slow inference

Hi everyone,

Does anyone have any idea why 4o mini is way slower than 4o although it is much smaller?

Do you think it is running on slower GPUs, or did they nerf its performance for monetary reasons?

3 Likes

Yes, I see that this is also being reported on the unofficial status page.

This is likely a temporary issue, since being faster is one of the key selling points of offering smaller models.

Thanks for flagging.

2 Likes

Interesting link.

In addition to usage fluctuations, I think it might be related to the fact that 4o has received recent updates to reduce GPU usage, while 4o-mini has been lagging behind for a while.

1 Like

Totally agree with you. That is why you can find the May 2024, Nov 2024, and March 2025 4o models in the API, and there are huge fluctuations in latency and speed between them.
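For reference, here is a quick way to see which dated 4o snapshots your account exposes, so you can pin benchmarks to a specific one. This is a minimal sketch using the official `openai` Python SDK and assumes `OPENAI_API_KEY` is set in the environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Print every 4o-family snapshot visible to this account
for model in client.models.list():
    if model.id.startswith("gpt-4o"):
        print(model.id)
```

You can then pass any of those dated IDs as the `model` parameter to compare latency across snapshots.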

Great question! Latency can be caused by a few factors aside from model size. For example:

  • Engine load balancing: In some cases, 4o mini requests may wait longer in the queue depending on how the system routes traffic.
  • Caching behavior: Enabling caching can sometimes increase latency because it pins the request to a specific engine that may not be the fastest available at that moment.

So while 4o mini is indeed designed for high throughput, things like queue times and caching strategy can still impact latency; the sketch below shows one way to check whether your requests are actually hitting the cache.
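This is a minimal sketch using the official `openai` Python SDK; it assumes `OPENAI_API_KEY` is set, and the `prompt_tokens_details` field may be absent on older SDK versions. On the OpenAI API, prompt caching kicks in automatically past roughly 1,024 prompt tokens:

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prompt caching only applies past a minimum prompt length (~1,024 tokens),
# so pad the prompt with filler text to make it long enough.
long_prompt = ("Background: " + "lorem ipsum " * 600
               + "\n\nQuestion: Summarize the background in one sentence.")

def probe(model: str = "gpt-4o-mini") -> None:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": long_prompt}],
        max_tokens=64,
    )
    details = resp.usage.prompt_tokens_details  # may be None on older SDKs
    cached = details.cached_tokens if details else 0
    print(f"{model}: {time.perf_counter() - start:.2f}s, cached_tokens={cached}")

probe()  # first call: cold cache
probe()  # repeat call: should report cached tokens, often with lower latency
```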

That said, I’ve flagged this internally. Thank you for surfacing it 🙏

5 Likes

Thank you so much for the insights.

I also noticed 4o-mini has a much lower TPS (tokens per second) compared to 4o. I have seen some threads online about it getting roughly 4x slower overnight about six months ago.

However, the 4o-mini API on Microsoft Azure did not drop in performance. Through some heuristic measurements, I found that this performance downgrade is consistent with a switch from H100 to A100 GPUs.
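For anyone who wants to reproduce this kind of measurement, here is a rough sketch of a streaming TPS benchmark with the official `openai` Python SDK. The model names and prompt are just examples, and chunk counts only approximate token counts (use the `usage` field for exact numbers):

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def measure_tps(model: str, prompt: str) -> float:
    """Stream a completion and estimate output tokens per second."""
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=512,
    )
    for chunk in stream:
        # Each content-bearing chunk is roughly one token
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_chunks += 1
    elapsed = time.perf_counter() - (first_token_at or start)
    return n_chunks / elapsed if elapsed > 0 else 0.0

for model in ("gpt-4o", "gpt-4o-mini"):
    print(f"{model}: ~{measure_tps(model, 'Explain TCP slow start.'):.1f} tokens/s")
```

Running the same script against both the OpenAI and Azure deployments of 4o-mini would make the comparison above directly reproducible.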

The H100 vs A100 theory makes a lot of sense and likely matches some of what we’ve been seeing too.

Thanks for sharing your testing!