I’m working on a new application that uses GPT-4o-mini to generate data from fairly large text inputs, but I’ve been seeing significant delays in response times.
Regardless of whether I’m requesting raw text or structured JSON outputs, everything seems sluggish.
Input token counts range from 20k to 30k, with completions averaging around 700 tokens.
Output averages 25 to 30 tokens per second, so a full response takes almost 30 seconds, which is far too slow for real-world application use.
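For reference, this is roughly how I’m measuring throughput, a minimal sketch using the standard `openai` Python SDK with streaming (the helper name `measure_tps` is just for illustration):

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_tps(prompt: str) -> float:
    """Stream a completion and return completion tokens per second."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries token usage
    )
    first_token_at = None
    completion_tokens = 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # time to first token
        if chunk.usage:  # only populated on the last chunk
            completion_tokens = chunk.usage.completion_tokens
    elapsed = time.perf_counter() - (first_token_at or start)
    return completion_tokens / elapsed if elapsed > 0 else 0.0
```

Timing from the first streamed token separates generation speed from time-to-first-token, so the ~25–30 tok/s figure reflects decoding throughput rather than queueing or prompt-processing latency.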
I previously used this model in another application and don’t recall it being anywhere near this slow.
Am I missing something, or has there been a recent change affecting performance?