GPT-4o mini slow inference

Great question. Latency can be caused by a few factors aside from model size. For example:

  • Engine load balancing: In some cases, 4o mini requests may wait longer in queue depending on how the system routes traffic.
  • Caching behavior: Enabling caching can sometimes increase latency because it pins the request to a specific engine that may not be the fastest available at that moment.

So while 4o mini is indeed designed for high throughput, things like queue times and caching strategy can still impact latency.
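One way to tell these factors apart is to time a few streaming requests yourself: time to first token mostly reflects queueing and prompt processing, while the remainder is generation. Here's a minimal sketch using the Python SDK (it assumes `OPENAI_API_KEY` is set in your environment; the prompt and run count are arbitrary):

```python
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_latency(model: str = "gpt-4o-mini", runs: int = 5) -> None:
    """Time identical streaming requests to separate queue/prompt time
    (time to first token) from total generation time."""
    for i in range(runs):
        start = time.perf_counter()
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Say hello in one sentence."}],
            stream=True,
        )
        first_token = None
        for chunk in stream:
            if first_token is None:
                # First chunk arriving ≈ queue wait + prompt processing
                first_token = time.perf_counter() - start
        total = time.perf_counter() - start
        print(f"run {i + 1}: first token {first_token:.2f}s, total {total:.2f}s")

measure_latency()
```

If time to first token is high or varies a lot between runs while generation time stays flat, that points at routing/queueing rather than the model itself.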

That said, I’ve flagged this internally. Thank you for surfacing it 🙏
