You can try, but there is no “priority” tier for inference on fine-tuned models (and on the pricing page, there is no header above the fine-tuning section to select anything other than the batch discount). It is a “safe” parameter to attempt, though, as it simply falls back if the tier upgrade cannot be delivered.
The Responses API response object will tell you it was of no use (without your needing to check the “usage” page to see that no “model | priority” line was delivered); for example, this was just returned for me on a fine-tuned model call with image input and "service_tier": "priority" hard-coded in:
'service_tier': 'default'
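A minimal sketch of that check with the Python SDK, assuming you pass service_tier on the request and read it back from the response (the fine-tuned model name below is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

# Request the priority tier; the parameter is "safe" and falls back if unavailable
response = client.responses.create(
    model="ft:gpt-4.1:my-org:example:abc123",  # hypothetical fine-tuned model ID
    input="hello",
    service_tier="priority",
)

# The response reports what was actually delivered;
# "default" here means the priority upgrade was not applied
print(response.service_tier)
```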
Fine-tuned models also take a while to ‘cold start’. The second request is faster not because of any caching, but because the dormant model has been made ready for use. You can observe the same latency behavior by sending a simple “hello” to a disused fine-tuned model.
Do you really need chat.completions for image analysis? Unless you are developing an actual chat with multiple message exchanges between the user and a chatbot, you are better off using the POST responses endpoint, which is faster and consumes far fewer tokens even with image processing.
The exact same tokenized data can be sent and received by either Responses or Chat Completions. Vision is no different, barring a malfunction.
Responses is, if anything, likely to be slower, especially when you don’t need chat state persisted server-side: it returns a larger network payload, and it is essentially an edge service that runs API calls against the models for you, with its own internal tools available.
From personal experience, we had an image analysis agent that was using chat.completions and each request consumed about 30,000 tokens.
We swapped it for the responses endpoint with store: false and the same prompts; tokens dropped to about 1,500 per request and response times were noticeably faster.
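Roughly what that swap looks like, as a sketch with the Python SDK (the model name, prompt, and image URL are placeholders, not our actual setup):

```python
from openai import OpenAI

client = OpenAI()

# One-shot image analysis on the responses endpoint; nothing is persisted server-side
response = client.responses.create(
    model="gpt-4o",  # whichever vision-capable model you use
    store=False,
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Describe this image."},
                {"type": "input_image", "image_url": "https://example.com/photo.jpg"},
            ],
        }
    ],
)

print(response.output_text)
print(response.usage.total_tokens)
```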
I believe chat.completions was being used in the first place simply because of incorrect guidance from internet tutorials that one of our team members followed, since there was no need to persist anything between requests (every request had a one-shot purpose and that chat would never be revisited). Not saying that’s OP’s case, but it could help someone coming across this thread.
@caiosm1005 I’m using a gpt-4.1 fine-tuned model, and it doesn’t support the Responses API; I tried. I’m not even sure any fine-tuned model can be used with the Responses API. Maybe gpt-4o?
Anyway, GPT-5 works fine with the Responses API, but I’m not confident going to production with a model that’s not fine-tuned, and I’m not confident using older models like gpt-4o. Also, the Responses API with gpt-5, store=false, and structured output is slower than a fine-tuned model.
With the fine-tuned model, the first request takes 24 s and the second 6 s; I was not able to get gpt-5 below 24 seconds. Its responses were great, but I just can’t go to production with a model that’s not fine-tuned.
@aprendendo.next I tried it; not sure it makes any difference. The first request is slow and the second request is fast regardless, so no speed difference, but I added that flag anyway.
@_j I just ended up using the Chat Completions API as I already had, but thanks to you I now have a scheduled task that runs every 5 minutes to keep the fine-tuned model warm (I think it goes dormant after about 5 minutes). As you told me, it has a cold start, and now I’ve addressed that. Thank you.
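In case it helps anyone else, the keep-warm ping can be as small as something like this (a sketch; the model ID is a placeholder and the 5-minute interval is just what seemed to work here):

```python
# keep_warm.py — schedule this every 5 minutes (cron, Task Scheduler, etc.)
# Sends a tiny request so the fine-tuned model doesn't go dormant between real calls.
from openai import OpenAI

client = OpenAI()

client.chat.completions.create(
    model="ft:gpt-4.1:my-org:example:abc123",  # your fine-tuned model ID
    messages=[{"role": "user", "content": "hello"}],
    max_tokens=1,  # keep the ping as cheap as possible
)
```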