Inconsistent Fine-Tune Behavior: Chat vs. Responses API (GPT-4o)

Hey folks 👋

I’m working on a product that uses vision and recently fine-tuned GPT-4o for it. I’m running into a strange issue:

  • When I use the Chat Completions API, the fine-tuned model gives the expected (fine-tuned) results.
  • But when I use the Responses API, it behaves like the regular GPT-4o and ignores the fine-tuned behavior.
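
Roughly what the two calls look like on my side, simplified (the model ID, prompt, and image URL here are placeholders rather than my real values):

```python
from openai import OpenAI

client = OpenAI()

# Placeholders for illustration only
FT_MODEL = "ft:gpt-4o-2024-08-06:my-org:my-vision-task:abc123"
IMAGE_URL = "https://example.com/sample.jpg"
PROMPT = "Extract the fields from this image."

# Chat Completions: this returns the fine-tuned behavior
chat = client.chat.completions.create(
    model=FT_MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url", "image_url": {"url": IMAGE_URL}},
        ],
    }],
)
print(chat.choices[0].message.content)

# Responses: same fine-tuned model ID, but the output reads like base GPT-4o
resp = client.responses.create(
    model=FT_MODEL,
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": PROMPT},
            {"type": "input_image", "image_url": IMAGE_URL},
        ],
    }],
)
print(resp.output_text)
```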

Has anyone else seen this? Any idea why the Responses API might not be picking up the fine-tuned model?

Appreciate any insights!

1 Like

Thought I’d see what I can test for you.

  • Immediately after the roughly 10-second “warmup” latency on Chat Completions, I get a big timeout “nothingburger” from Responses:

  • Then one more fast call, then one more with lag. With enough persistence, though, I get fine-tuned responses produced exclusively. The extreme latency on Responses continues, much higher than on Chat Completions: 10 seconds to see the first of 10 tokens.

That’s on an existing gpt-4o fine-tune, without vision input or vision tuning.

It could be that the seemingly poorer performance I get on Responses points to the model being run without its fine-tuned weights.


Do you have a strongly unique system message? If so, try reducing top_p so the output always begins your way.
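
Something along these lines, as a sketch (model ID and prompts are placeholders):

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="ft:gpt-4o-2024-08-06:my-org:my-task:abc123",  # placeholder fine-tuned model ID
    messages=[
        {"role": "system", "content": "<your strongly unique system message>"},
        {"role": "user", "content": "<your usual prompt>"},
    ],
    top_p=0.1,        # narrow the sampling so output begins the trained way
    temperature=0.2,  # optionally reduce randomness as well
)
print(resp.choices[0].message.content)
```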

1 Like

I have tried many combinations of temperature and top_p; it still gives the biased response.

It sounds like you have uncovered a platform issue with Responses. You describe the model name fulfilling the request, but apparently with none of the effect your fine-tuned model is supposed to bring about.

I would report this issue and revert to Chat Completions (there really is not much offered by Responses; it takes features away and doesn’t replace them with anything better).

To report it, use “help” on the platform site, select through the options, and then describe the bug clearly and unambiguously as OpenAI’s problem and not yours, with backing evidence such as response IDs.

2 Likes

Hey! I reached out to OpenAI and they confirmed it’s a platform limitation — Responses API doesn’t currently support fine-tuned models, even if you pass the correct model ID. It defaults to the base behavior.

They recommended sticking with Chat Completions API for now, as that’s the only supported way to use fine-tuned models like GPT-4o.

Appreciate your help — you were spot on!

2 Likes

I used a fine-tuned model in Responses via the Playground in an attempt to reproduce this yesterday. It worked fine. Unless they specified that this was for vision, you were talking to a hallucinating AI.

2 Likes

Hey, thanks for testing it out! Quick question, did you fine-tune your model using vision input, and was the input an image when you tried it in Responses?

In my case, the setup involves image + text inputs, and the model’s behavior is expected to change based on both. The fine-tuned version works perfectly via Chat Completions, but in Responses, it behaves like the base model and ignores the fine-tuning.

1 Like

Fine-tuned text-only models seem to work with the Responses API, but fine-tuned vision models don’t; they fall back to base behavior.

1 Like

I especially don’t believe the fabricating support bot that gives the “I can take no action, go away” answer. I am able to use Responses over and over with a fine-tuned gpt-4o model and others, producing exactly the behavior reported earlier.


There is no failing of a gpt-4o-2024-08-06:

Or a gpt-4.1-nano from a few days ago:

So the person who sent that reply, and whoever approved anything that could produce such an answer, needs to be ferreted out and removed from that role.

Because a broody fine-tuned goth bot can do the same speculation better:

Appreciate the strong take, but just to clarify, my case specifically involves a fine-tuned GPT-4o model with vision (image) input. From all my tests, text-only fine-tuned models work fine with Responses, just as you described.

Curious if you’ve tested that use case too?

1 Like

One thing that your fine-tuning might not account for: OpenAI breaking inference and your trigger patterns with their own system message that comes before your own.

Reproduction on Responses with gpt-4o-2024-08-06:

A fine-tuned model reproducing what comes in the system message before yours:

Or what you get on a base model, without OpenAI scanning and potentially blocking fine-tuned image responses or other undesirable output:

Knowledge cutoff: 2023-10

Image input capabilities: Enabled


Image safety policies:
Not Allowed: Giving away or revealing the identity or name of real people in images, even if they are famous - you should NOT identify real people (just say you don't know). Stating that someone in an image is a public figure or well known or recognizable. Saying what someone in a photo is known for or what work they've done. Classifying human-like images as animals. Making inappropriate statements about people in images. Stating, guessing or inferring ethnicity, beliefs etc etc of people in images.
Allowed: OCR transcription of sensitive PII (e.g. IDs, credit cards etc) is ALLOWED. Identifying animated characters.

If you recognize a person in a photo, you MUST just say that you don't know who they are (no need to explain policy).

Your image capabilities:
You cannot recognize people. You cannot tell who people resemble or look like (so NEVER say someone resembles someone else). You cannot see facial structures. You ignore names in image descriptions because you can't tell.

Adhere to this in all languages.

Thus, you could try your fine-tune with that additional system message at the start (the knowledge cutoff and image input capabilities) and see if matching what is actually being run against the model improves the inference and adherence to your examples.
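
As a sketch of what I mean (model ID, system message, prompt, and image URL are placeholders), you would prepend that preamble ahead of your own system message so the request matches what is actually being run:

```python
from openai import OpenAI

client = OpenAI()

# The preamble the platform appears to inject before your system message
PLATFORM_PREAMBLE = (
    "Knowledge cutoff: 2023-10\n"
    "Image input capabilities: Enabled"
)

# Placeholder: your own fine-tuned system message
MY_SYSTEM = "You extract structured fields from label images and reply with JSON only."

resp = client.chat.completions.create(
    model="ft:gpt-4o-2024-08-06:my-org:my-vision-task:abc123",  # placeholder ID
    messages=[
        {"role": "system", "content": PLATFORM_PREAMBLE + "\n\n" + MY_SYSTEM},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the fields."},
                {"type": "image_url", "image_url": {"url": "https://example.com/label.jpg"}},
            ],
        },
    ],
)
print(resp.choices[0].message.content)
```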


The Responses API is now erroring out on gpt-4.1 (full) fine-tunes with images, but working on Chat Completions. I have another topic addressing that to follow up in. I do not have an older model that would decisively show that vision tuning on gpt-4o is “on”, but one is coming:

1 Like

I can definitely second this. My fine-tuned models, trained on a very specific task and a very specific format, are indistinguishable from base models whenever you change something. They learn to exhibit the new behavior only for that specific prompt.

It may be worth adding variant examples to the dataset so the fine-tunes still perform as expected when the instructions change.
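
For example, a sketch of generating a few wording variants of the same training example before writing the JSONL (the task, wordings, image URL, and answer here are hypothetical):

```python
import json

# Hypothetical example: one image and answer, several instruction wordings
image_url = "https://example.com/label_001.jpg"
answer = '{"carrier": "DHL", "tracking": "JD014600003828"}'

system_variants = [
    "You extract shipping-label fields and reply with JSON only.",
    "Return the carrier and tracking number from the label as JSON.",
    "Extract structured data from the label image. Output: JSON.",
]
user_variants = [
    "Extract the fields from this label.",
    "What carrier and tracking number is this?",
    "Give me the label data as JSON.",
]

# Each combination becomes its own training example in the chat JSONL format
with open("train_variants.jsonl", "w") as f:
    for sys_msg in system_variants:
        for user_msg in user_variants:
            example = {
                "messages": [
                    {"role": "system", "content": sys_msg},
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": user_msg},
                            {"type": "image_url", "image_url": {"url": image_url}},
                        ],
                    },
                    {"role": "assistant", "content": answer},
                ]
            }
            f.write(json.dumps(example) + "\n")
```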

1 Like

I can basically confirm the symptom on a brand new model.

I fine-tuned a very overfitted model on vision. Inference using its trained prompt pattern, but with a held-out image, gives the expected JSON.

Switching only to Responses: the model still looks at the image, but shows none of the fine-tuned behavior.

Basically: including an image disables the fine-tuned weights but still allows inference sampling.

An older gpt-4o fine-tune that is brief:

Add an image == No personality

Conclusion: Responses + Images = Broken fine-tuning weights.
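
If anyone wants to poke at it, a minimal repro sketch along these lines (model ID, prompt, and image URL are placeholders): the same Responses request with and without the image.

```python
from openai import OpenAI

client = OpenAI()

FT_MODEL = "ft:gpt-4o-2024-08-06:my-org:my-vision-task:abc123"  # placeholder
PROMPT = "Extract the label fields."
IMAGE_URL = "https://example.com/held_out_label.jpg"

def ask(include_image: bool) -> str:
    content = [{"type": "input_text", "text": PROMPT}]
    if include_image:
        content.append({"type": "input_image", "image_url": IMAGE_URL})
    resp = client.responses.create(
        model=FT_MODEL,
        input=[{"role": "user", "content": content}],
    )
    return resp.output_text

print("text only: ", ask(False))  # fine-tuned format/personality shows up
print("with image:", ask(True))   # output reads like base gpt-4o
```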

1 Like