Hi everyone,
I’ve been testing gpt-4.1-mini in streaming mode and I’m noticing some latency inconsistencies that are starting to affect the user experience.
Here are some examples from my logs (time = total response time, len = characters in the response):
- 1.21 s (84 chars)
- 1.33 s (91 chars)
- 1.36 s (106 chars)
- 1.20 s (110 chars)
- 1.32 s (50 chars)
- 1.39 s (128 chars)
- 1.42 s (75 chars)
- 1.40 s (62 chars)
- 1.61 s (72 chars)
As you can see, most responses hover around 1.2–1.6 seconds, which already feels a bit high for a mini model. But at certain hours, latency spikes dramatically (sometimes to 15–40 seconds), which ruins the real-time experience I’m trying to build.
I’m not sure whether this is due to server load, HTTP/2 session reuse, or something else, but the inconsistency is very noticeable.
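One way to narrow it down: if the spikes come from queueing or connection setup rather than slow token generation, the time to the *first* chunk will spike while the per-chunk pace stays roughly flat. Here's a minimal timing harness I'd use to split those apart — `fake_stream` below is just a stand-in for the real SDK stream iterator, not actual API code:

```python
import time

def measure_stream(chunks):
    """Time a streaming response: time-to-first-chunk vs. total time.

    `chunks` is any iterator yielding text pieces (e.g. the content
    deltas from a streaming API call). Returns (ttfc, total, text).
    """
    start = time.perf_counter()
    ttfc = None
    parts = []
    for piece in chunks:
        if ttfc is None:
            # First chunk arrived: everything before this point is
            # connection setup + server-side queueing, not generation.
            ttfc = time.perf_counter() - start
        parts.append(piece)
    total = time.perf_counter() - start
    return ttfc, total, "".join(parts)

# Stand-in for a real API stream (hypothetical chunk timings):
def fake_stream():
    for piece in ["Hello", ", ", "world"]:
        time.sleep(0.01)  # simulate per-chunk network/generation delay
        yield piece

ttfc, total, text = measure_stream(fake_stream())
print(f"first chunk after {ttfc:.3f}s, total {total:.3f}s, {len(text)} chars")
```

Logging both numbers per request should make it obvious whether the 15–40 s outliers are spent waiting for the first chunk (load/routing) or spread across the stream (throughput).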
Has anyone else experienced the same issue with gpt-4.1-mini? Is this expected behavior, or should I consider this an anomaly and open a support ticket?
Thanks in advance!