OpenAI image analysis via the API is very slow

We’re using a fine-tuned model and doing image analysis with the OpenAI chat.completions API.

The time to get a response from the API is very slow: 12 seconds. If we send a second request with the same data, it’s 6 seconds, because it’s cached.

We tried GPT-5 and it’s worse at 13 seconds, so the fine-tuned model isn’t the only slow part.

Is there a way to do image analysis in under 6 seconds? Does switching to the Responses API help?

One thing you can try is using priority processing.

Be aware, though, that it is more expensive and only available for a few models. You can find the detailed prices and supported models here.

You can try, but there is no “priority” tier for inference on fine-tuned models (and on the pricing page there is no option above the fine-tuning section to select anything other than the batch discount). It is a “safe” parameter to attempt, though, since the call falls back to the default tier if the upgrade cannot be delivered.

The Responses API response object will tell you it was of no use (without you needing to go to the “usage” page and see that no “model | priority” line was delivered). For example, this was just returned for me from a fine-tuned model call with image input and "service_tier": "priority" hard-coded in:

'service_tier': 'default'
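For reference, here is a minimal sketch of that kind of call and check, assuming the current openai Python SDK; the fine-tuned model ID and image URL below are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# "ft:gpt-4.1:acme::abc123" is a placeholder; substitute your own fine-tuned model ID
response = client.responses.create(
    model="ft:gpt-4.1:acme::abc123",
    service_tier="priority",  # safe to request; the API falls back if it cannot deliver it
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Describe this image."},
                {"type": "input_image", "image_url": "https://example.com/photo.jpg"},
            ],
        }
    ],
)

# The tier actually delivered is echoed back; "default" means no priority upgrade was applied.
print(response.service_tier)
```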


Fine-tuned models also take a while to “cold start”. The second request is likely faster not because of any caching, but because the dormant model has been made ready for use. You can get the same latency behavior by sending “hello” to a disused fine-tuned model.
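If you want to check whether it is a cold start rather than prompt caching, a quick sketch along these lines (the model ID is a placeholder) times two identical, trivial requests back to back; a large gap between the two timings with a text-only “hello” points at model spin-up rather than caching:

```python
import time
from openai import OpenAI

client = OpenAI()

def timed_hello(model: str) -> float:
    """Send a trivial prompt and return the round-trip time in seconds."""
    start = time.perf_counter()
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "hello"}],
        max_tokens=5,
    )
    return time.perf_counter() - start

# "ft:gpt-4.1:acme::abc123" is a placeholder fine-tuned model ID
for attempt in range(2):
    print(f"request {attempt + 1}: {timed_hello('ft:gpt-4.1:acme::abc123'):.1f}s")
```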

Do you really need chat.completions for image analysis? Unless you are developing an actual chat with multiple message exchanges between the user and a chatbot, you are better off using the POST /responses endpoint, which is faster and consumes far fewer tokens, even with image processing.

That statement is untrue.

The exact same tokenized data can be sent and received by either Responses or Chat Completions. Vision is no different, except by malfunction.

Responses is, if anything, likely to be slower, especially when you don’t need chat state persisted server-side: it returns a larger network payload, and it is essentially an edge service that runs API calls against the models for you, with its own internal tools available.
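To illustrate, an image request on Chat Completions carries the same content as one on Responses, only under different field names; a minimal sketch with a placeholder model ID and image URL:

```python
from openai import OpenAI

client = OpenAI()

# The same user turn as a Responses call, expressed in Chat Completions field names.
response = client.chat.completions.create(
    model="ft:gpt-4.1:acme::abc123",  # placeholder fine-tuned model ID
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```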

Try lightweight, fast models, such as those named “mini” or “nano”.

This topic is about fine-tuning and vision.

You would find significant challenges in attempts at using “mini” or “nano” models with fine-tuning on images.

You can see that “try to use” has already been attempted for you, and is not a valid recommendation.

From personal experience, we had an image analysis agent that was using chat.completions and each request consumed about 30,000 tokens.

We swapped that for the Responses endpoint with store: false and the same prompts; token usage dropped to about 1,500 per request, and the response time was noticeably faster.
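Roughly what the swapped call looks like; a sketch only, with a placeholder model and image URL rather than our actual setup:

```python
from openai import OpenAI

client = OpenAI()

# store=False: nothing is persisted server-side, since every request here is one-shot.
response = client.responses.create(
    model="gpt-4.1",  # placeholder; use whichever model fits your case
    store=False,
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Describe this image."},
                {"type": "input_image", "image_url": "https://example.com/photo.jpg"},
            ],
        }
    ],
)
print(response.output_text)
```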

I believe chat.completions was being used in the first place simply because of incorrect guidance from online tutorials that one of our team members followed, since there was no need to persist anything between requests (every request had a one-shot purpose and that chat would never be revisited). Not saying that’s OP’s case, but it could help someone coming across this thread.

@caiosm1005 I’m using a fine-tuned gpt-4.1 model, and it doesn’t support the Responses API; I tried. I’m not even sure whether any fine-tuned model can be used with the Responses API. Maybe gpt-4o?

Anyway, GPT-5 works fine with the Responses API, but I’m not confident going to production with a model that isn’t fine-tuned, and I’m not confident using older models like gpt-4o. Also, the Responses API with gpt-5, store=false, and structured output is slower than a fine-tuned model.

With the fine-tuned model, the first request is 24 seconds and the second is 6 seconds; I was not able to get gpt-5 below 24 seconds. The responses were great, but I just can’t go to production with a model that isn’t fine-tuned.

@aprendendo.next I tried it; I’m not sure it makes any difference. The first request is slow and the second request is fast regardless, with no speed difference, but I added that flag anyway.

@_j I just ended up using the Chat Completions API as I already had, but thanks to you I now have a scheduled task that runs every 5 minutes to keep the fine-tuned model warm; I think it goes dormant after 5 minutes. As you told me, it has a cold start, and now I’ve addressed that. Thank you.
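For anyone who wants to replicate the keep-warm task, here is a rough sketch of the idea; the model ID, the 5-minute interval, and the bare loop are assumptions, and a cron job or proper scheduler works just as well:

```python
import time
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4.1:acme::abc123"  # placeholder fine-tuned model ID
INTERVAL_SECONDS = 5 * 60          # assumed warm-up window; tune to what you observe

while True:
    try:
        # A tiny request is enough to keep the fine-tuned model loaded.
        client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,
        )
    except Exception as exc:
        print(f"keep-warm ping failed: {exc}")
    time.sleep(INTERVAL_SECONDS)
```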
