GPT-4o-mini is really slow

I’m working on a new application that uses GPT-4o-mini to generate data from rather big text inputs, but I’ve been experiencing significant delays in response times.

Regardless of whether I’m requesting raw text or structured JSON outputs, everything seems sluggish.

Input token counts range between 20k and 30k, with completions averaging around 700 tokens.

I'm averaging 25 to 30 output tokens per second, which is frustratingly slow for real-world use: at ~700 completion tokens, a full response takes close to 30 seconds, making the model hard to use in my application.
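For anyone who wants to sanity-check the numbers, this is roughly how throughput can be measured with the Python SDK (a minimal sketch; `input.txt` stands in for the real input, and end-to-end timing includes time-to-first-token, so pure generation speed is slightly higher than what this prints):

```python
import time

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

long_prompt = open("input.txt").read()  # placeholder for a 20k-30k token input

start = time.perf_counter()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": long_prompt}],
)
elapsed = time.perf_counter() - start

out_tokens = response.usage.completion_tokens
print(f"{out_tokens} completion tokens in {elapsed:.1f}s "
      f"-> {out_tokens / elapsed:.1f} tokens/s")
```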

I previously used this model in another application and don't recall it being anywhere near this sluggish.

Am I missing something, or has there been a recent change affecting performance?

I see 4o (not mini) yield 30-50 output tokens per second.

Since your context is pretty large, the 30 OTPS doesn’t surprise me.

You can try streaming the response if the user is “seeing” it in real time.
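Something like this, if it helps (a minimal sketch with the Python SDK; `input.txt` standing in for your large input is the only assumption):

```python
from openai import OpenAI

client = OpenAI()

prompt = open("input.txt").read()  # placeholder for the large input

# Stream tokens as they are generated instead of waiting for the full response.
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```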

Maybe others can chime in on 4o-mini with 30k input tokens.


Unfortunately, streaming the responses to users is not a solution to my problem.

Another hack is to fine-tune the model, maybe even just a light fine-tune.

This generally yields lower latencies, at least historically, but it costs more to run.

That might be something to try.
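Kicking off a job is only a few lines if you want to experiment (a rough sketch; `training_data.jsonl` is a hypothetical file of chat-format examples, and the fine-tunable snapshot name may have changed, so check the docs):

```python
from openai import OpenAI

client = OpenAI()

# Upload a JSONL file of chat-format training examples.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job against a fine-tunable snapshot.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # snapshot name is an assumption; verify in the docs
)

print(job.id, job.status)
```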

I’ve always had slower responses with lots of input tokens.

Any way to trim the input down?

Were you sending the same number of input tokens in that other application?
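If trimming is an option, here's a rough way to count and hard-cap the input with tiktoken (a sketch; GPT-4o-family models use the o200k_base encoding, and `document.txt` plus the 15k budget are placeholders; naive truncation can cut mid-sentence, so chunking or summarizing first is usually better):

```python
import tiktoken

# GPT-4o-family models use the o200k_base encoding.
enc = tiktoken.get_encoding("o200k_base")

def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens tokens of the input text."""
    tokens = enc.encode(text)
    return text if len(tokens) <= max_tokens else enc.decode(tokens[:max_tokens])

big_input = open("document.txt").read()  # placeholder for the real input
trimmed = truncate_to_budget(big_input, 15_000)
print(len(enc.encode(big_input)), "->", len(enc.encode(trimmed)))
```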