GPT-4o mini slow inference

Great question. Latency can be caused by a few factors aside from model size. For example:

  • Engine load balancing: In some cases, 4o mini requests may wait longer in queue depending on how the system routes traffic.
  • Caching behavior: Enabling caching can sometimes increase latency because it pins the request to a specific engine that may not be the fastest available at that moment.

So while 4o mini is indeed designed for high throughput, things like queue times and caching strategy can still impact latency.
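One way to tell these factors apart is to time a few streaming requests yourself: time to first token mostly reflects queueing and prompt processing, while the remainder is generation. Here's a minimal sketch using the Python SDK (it assumes `OPENAI_API_KEY` is set in your environment; the prompt and run count are arbitrary):

```python
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_latency(model: str = "gpt-4o-mini", runs: int = 5) -> None:
    """Time identical streaming requests to separate queue/prompt time
    (time to first token) from total generation time."""
    for i in range(runs):
        start = time.perf_counter()
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Say hello in one sentence."}],
            stream=True,
        )
        first_token = None
        for chunk in stream:
            if first_token is None:
                # First chunk arriving ≈ queue wait + prompt processing
                first_token = time.perf_counter() - start
        total = time.perf_counter() - start
        print(f"run {i + 1}: first token {first_token:.2f}s, total {total:.2f}s")

measure_latency()
```

If time to first token is high or varies a lot between runs while generation time stays flat, that points at routing/queueing rather than the model itself.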

That said, I’ve flagged this internally. Thank you for surfacing it 🙏
