Inconsistent Fine-Tune Behavior: Chat vs. Responses API (GPT-4o)

Hey folks 👋

I’m working on a product that uses vision and recently fine-tuned GPT-4o for it. I’m running into a strange issue:

  • When I use the Chat Completions API, the fine-tuned model gives the expected (fine-tuned) results.
  • But when I use the Responses API, it behaves like the regular GPT-4o and ignores the fine-tuned behavior.
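
Roughly what the two calls look like on my side, simplified (the model ID, prompt, and image URL here are placeholders rather than my real values):

```python
from openai import OpenAI

client = OpenAI()

# Placeholders for illustration only
FT_MODEL = "ft:gpt-4o-2024-08-06:my-org:my-vision-task:abc123"
IMAGE_URL = "https://example.com/sample.jpg"
PROMPT = "Extract the fields from this image."

# Chat Completions: this returns the fine-tuned behavior
chat = client.chat.completions.create(
    model=FT_MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url", "image_url": {"url": IMAGE_URL}},
        ],
    }],
)
print(chat.choices[0].message.content)

# Responses: same fine-tuned model ID, but the output reads like base GPT-4o
resp = client.responses.create(
    model=FT_MODEL,
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": PROMPT},
            {"type": "input_image", "image_url": IMAGE_URL},
        ],
    }],
)
print(resp.output_text)
```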

Has anyone else seen this? Any idea why the Responses API might not be picking up the fine-tuned model?

Appreciate any insights!

1 Like

Thought I’d see what I can test for you.

  • Immediately after the roughly 10-second “warmup” latency on Chat Completions, I get a big timeout “nothingburger” from Responses:

  • Then one more fast call, then one more with lag. With enough persistence, though, I get fine-tuned responses produced exclusively. The extreme latency on Responses continues, much higher than on Chat Completions: 10 seconds to see the first of 10 tokens.

That’s on an existing gpt-4o fine-tune, without vision input or vision tuning.

It could be that the seemingly poorer performance I get on Responses points to the model being run without its fine-tuned weights.


Do you have a strongly unique system message? If so, try reducing top_p so the output always begins your way.
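
Something along these lines, as a sketch (model ID and prompts are placeholders):

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="ft:gpt-4o-2024-08-06:my-org:my-task:abc123",  # placeholder fine-tuned model ID
    messages=[
        {"role": "system", "content": "<your strongly unique system message>"},
        {"role": "user", "content": "<your usual prompt>"},
    ],
    top_p=0.1,        # narrow the sampling so output begins the trained way
    temperature=0.2,  # optionally reduce randomness as well
)
print(resp.choices[0].message.content)
```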

1 Like

I have tried many combinations of temperature and top_p; it still gives the biased response.

It sounds like you have uncovered a platform issue with Responses. You describe the model name fulfilling the request, but apparently with none of the effect your fine-tuned model is supposed to bring about.

I would report this issue and revert to Chat Completions (there really is not much offered by Responses; it takes features away and doesn’t replace them with anything better).

To report it, use “help” on the platform site, select through the options, and then describe the bug clearly and unambiguously as OpenAI’s problem and not yours, with backing evidence such as response IDs.

2 Likes

Hey! I reached out to OpenAI and they confirmed it’s a platform limitation — Responses API doesn’t currently support fine-tuned models, even if you pass the correct model ID. It defaults to the base behavior.

They recommended sticking with Chat Completions API for now, as that’s the only supported way to use fine-tuned models like GPT-4o.

Appreciate your help — you were spot on!

2 Likes

I used a fine-tuned model in Responses via the Playground in an attempt to reproduce this yesterday. It worked fine. Unless they specified that this was for vision, you were talking to a hallucinating AI.

2 Likes

Hey, thanks for testing it out! Quick question, did you fine-tune your model using vision input, and was the input an image when you tried it in Responses?

In my case, the setup involves image + text inputs, and the model’s behavior is expected to change based on both. The fine-tuned version works perfectly via Chat Completions, but in Responses, it behaves like the base model and ignores the fine-tuning.

1 Like

Fine-tuned text-only models seem to work with the Responses API, but fine-tuned vision models don’t; they fall back to base behavior.

1 Like

I especially don’t believe the fabricating support bot that gives the “I can take no action, go away” answer. I am able to use Responses over and over with a fine-tuned gpt-4o model and others, producing exactly the behavior reported earlier.


There is no failing of a gpt-4o-2024-08-06:

Or a gpt-4.1-nano from a few days ago:

So the person who sent that reply, and whoever approved anything that could produce such an answer, needs to be ferreted out and removed from that role.

Because a broody fine-tuned goth bot can do the same speculation better:

Appreciate the strong take, but just to clarify, my case specifically involves a fine-tuned GPT-4o model with vision (image) input. From all my tests, text-only fine-tuned models work fine with Responses, just as you described.

Curious if you’ve tested that use case too?

1 Like

One thing that your fine-tuning might not account for: OpenAI breaking inference and your trigger patterns with their own system message that comes before your own.

Reproduction on Responses with gpt-4o-2024-08-06:

A fine-tuned model reproducing what comes in the system message before yours:

Or what you get on a base model, without OpenAI scanning and potentially blocking fine-tuned image responses or other undesirable output:

Knowledge cutoff: 2023-10

Image input capabilities: Enabled


Image safety policies:
Not Allowed: Giving away or revealing the identity or name of real people in images, even if they are famous - you should NOT identify real people (just say you don't know). Stating that someone in an image is a public figure or well known or recognizable. Saying what someone in a photo is known for or what work they've done. Classifying human-like images as animals. Making inappropriate statements about people in images. Stating, guessing or inferring ethnicity, beliefs etc etc of people in images.
Allowed: OCR transcription of sensitive PII (e.g. IDs, credit cards etc) is ALLOWED. Identifying animated characters.

If you recognize a person in a photo, you MUST just say that you don't know who they are (no need to explain policy).

Your image capabilities:
You cannot recognize people. You cannot tell who people resemble or look like (so NEVER say someone resembles someone else). You cannot see facial structures. You ignore names in image descriptions because you can't tell.

Adhere to this in all languages.

Thus, you could try your fine-tune with that additional system message at the start (the knowledge cutoff and image input capabilities) and see if matching what is actually being run against the model improves the inference and adherence to your examples.
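
As a sketch of what I mean (model ID, system message, prompt, and image URL are placeholders), you would prepend that preamble ahead of your own system message so the request matches what is actually being run:

```python
from openai import OpenAI

client = OpenAI()

# The preamble the platform appears to inject before your system message
PLATFORM_PREAMBLE = (
    "Knowledge cutoff: 2023-10\n"
    "Image input capabilities: Enabled"
)

# Placeholder: your own fine-tuned system message
MY_SYSTEM = "You extract structured fields from label images and reply with JSON only."

resp = client.chat.completions.create(
    model="ft:gpt-4o-2024-08-06:my-org:my-vision-task:abc123",  # placeholder ID
    messages=[
        {"role": "system", "content": PLATFORM_PREAMBLE + "\n\n" + MY_SYSTEM},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the fields."},
                {"type": "image_url", "image_url": {"url": "https://example.com/label.jpg"}},
            ],
        },
    ],
)
print(resp.choices[0].message.content)
```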


The Responses API is now erroring out on gpt-4.1 (full) fine-tunes with images, but working on Chat Completions. I have another topic addressing that to follow up in. I do not have an older model that would decisively show that vision tuning on gpt-4o is “on”, but one is coming:

1 Like

I can definitely second this. My fine-tuned models, trained on a very specific task and a very specific format, are indistinguishable from base models whenever you change something. They learn to exhibit the new behavior only for that specific prompt.

It may be worth adding variant examples to the dataset so the fine-tunes still perform as expected when the instructions change.
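
For example, a sketch of generating a few wording variants of the same training example before writing the JSONL (the task, wordings, image URL, and answer here are hypothetical):

```python
import json

# Hypothetical example: one image and answer, several instruction wordings
image_url = "https://example.com/label_001.jpg"
answer = '{"carrier": "DHL", "tracking": "JD014600003828"}'

system_variants = [
    "You extract shipping-label fields and reply with JSON only.",
    "Return the carrier and tracking number from the label as JSON.",
    "Extract structured data from the label image. Output: JSON.",
]
user_variants = [
    "Extract the fields from this label.",
    "What carrier and tracking number is this?",
    "Give me the label data as JSON.",
]

# Each combination becomes its own training example in the chat JSONL format
with open("train_variants.jsonl", "w") as f:
    for sys_msg in system_variants:
        for user_msg in user_variants:
            example = {
                "messages": [
                    {"role": "system", "content": sys_msg},
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": user_msg},
                            {"type": "image_url", "image_url": {"url": image_url}},
                        ],
                    },
                    {"role": "assistant", "content": answer},
                ]
            }
            f.write(json.dumps(example) + "\n")
```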

1 Like

I can basically confirm the symptom on a brand new model.

I fine-tuned a very overfitted model on vision. Inference using its trained prompt pattern, but with a held-out image, gives the expected JSON.

Switching only to Responses: the model still looks at the image, but shows none of the fine-tuned behavior.

Basically: including an image disables the fine-tuned weights but still allows inference sampling.

An older gpt-4o fine-tune that is brief:

Add an image == No personality

Conclusion: Responses + Images = Broken fine-tuning weights.
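
If anyone wants to poke at it, a minimal repro sketch along these lines (model ID, prompt, and image URL are placeholders): the same Responses request with and without the image.

```python
from openai import OpenAI

client = OpenAI()

FT_MODEL = "ft:gpt-4o-2024-08-06:my-org:my-vision-task:abc123"  # placeholder
PROMPT = "Extract the label fields."
IMAGE_URL = "https://example.com/held_out_label.jpg"

def ask(include_image: bool) -> str:
    content = [{"type": "input_text", "text": PROMPT}]
    if include_image:
        content.append({"type": "input_image", "image_url": IMAGE_URL})
    resp = client.responses.create(
        model=FT_MODEL,
        input=[{"role": "user", "content": content}],
    )
    return resp.output_text

print("text only: ", ask(False))  # fine-tuned format/personality shows up
print("with image:", ask(True))   # output reads like base gpt-4o
```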

1 Like