then one more fast, then one more with lag. With enough persistence, though, I get fine-tuned responses produced exclusively. And the extreme latency continues on Responses, much higher than on Chat Completions: 10 seconds to see the first of 10 tokens.
That’s on an existing gpt-4o fine-tune, one without vision or vision tuning involved.
It could be that the seemingly poorer performance I get on Responses points to the model being run without the fine-tuned weights.
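If it helps anyone reproduce the latency gap, this is roughly how I would time the first token on both endpoints with the Python SDK. It is only a sketch; the fine-tune ID is a made-up placeholder:

```python
import time
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4o-2024-08-06:my-org::example123"  # placeholder fine-tune ID


def ttft_responses(prompt: str) -> float:
    """Time from request to first streamed output token on Responses."""
    start = time.monotonic()
    stream = client.responses.create(model=MODEL, input=prompt, stream=True)
    for event in stream:
        if event.type == "response.output_text.delta":
            return time.monotonic() - start
    return float("nan")


def ttft_chat(prompt: str) -> float:
    """Same measurement on Chat Completions for comparison."""
    start = time.monotonic()
    stream = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.monotonic() - start
    return float("nan")


print("Responses TTFT:       ", ttft_responses("Say hello."))
print("Chat Completions TTFT:", ttft_chat("Say hello."))
```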
Do you have a strongly unique system message? If so, try reducing top_p so the output always begins the way your fine-tune trained it to.
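Something like this, as a rough sketch against Chat Completions (the model ID and system message below are made-up placeholders):

```python
from openai import OpenAI

client = OpenAI()

MODEL = "ft:gpt-4o-2024-08-06:my-org::example123"  # placeholder fine-tune ID
SYSTEM = "You are the parts catalog bot. Always answer in the trained catalog format."  # placeholder

completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "What gasket fits model 761?"},
    ],
    top_p=0.2,  # narrow nucleus sampling so the output starts the trained way
)
print(completion.choices[0].message.content)
```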
It sounds like you have uncovered a platform issue with Responses. You describe that the fine-tuned model name is accepted and a response is fulfilled, but apparently with none of the effect that your fine-tuned model is supposed to bring about.
I would report this issue and revert to Chat Completions (there really is not much offered by Responses; it takes features away and doesn’t replace them with anything better).
Reporting is done through “help” on the platform site: select through the options, then describe the bug clearly and unambiguously as being OpenAI’s problem and not yours, with backing evidence such as response IDs.
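A quick sketch of collecting that evidence with the Python SDK: run the same prompt through both endpoints and keep the IDs from each response object (the model ID is a placeholder):

```python
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4o-2024-08-06:my-org::example123"  # placeholder fine-tune ID

# Same prompt against both endpoints, so the IDs can be quoted in the report.
chat = client.chat.completions.create(
    model=MODEL, messages=[{"role": "user", "content": "Test prompt"}]
)
resp = client.responses.create(model=MODEL, input="Test prompt")

print("chat.completions id:", chat.id)
print("responses id:       ", resp.id)
print("chat output:", chat.choices[0].message.content)
print("resp output:", resp.output_text)
```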
Hey! I reached out to OpenAI and they confirmed it’s a platform limitation — Responses API doesn’t currently support fine-tuned models, even if you pass the correct model ID. It defaults to the base behavior.
They recommended sticking with Chat Completions API for now, as that’s the only supported way to use fine-tuned models like GPT-4o.
I used a fine-tuned model in Responses via the Playground yesterday in an attempt to reproduce this. It worked fine. Unless they specified that the limitation is specific to vision, you were talking to a hallucinating AI.
Hey, thanks for testing it out! Quick question, did you fine-tune your model using vision input, and was the input an image when you tried it in Responses?
In my case, the setup involves image + text inputs, and the model’s behavior is expected to change based on both. The fine-tuned version works perfectly via Chat Completions, but in Responses, it behaves like the base model and ignores the fine-tuning.
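For reference, these are roughly the two calls I’m comparing, sketched with the Python SDK; the fine-tune ID and image URL are placeholders:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4o-2024-08-06:my-org::example123"  # placeholder vision fine-tune ID
IMAGE_URL = "https://example.com/sample.jpg"       # placeholder image
PROMPT = "Classify this image in the trained format."

# Chat Completions: the fine-tuned behavior shows up here.
chat = client.chat.completions.create(
    model=MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url", "image_url": {"url": IMAGE_URL}},
        ],
    }],
)

# Responses: same inputs, but the reply reads like the base model.
resp = client.responses.create(
    model=MODEL,
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": PROMPT},
            {"type": "input_image", "image_url": IMAGE_URL},
        ],
    }],
)

print("chat:", chat.choices[0].message.content)
print("resp:", resp.output_text)
```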
I especially don’t believe the fabricating support bot that gives the “I can take no action, go away” answer. I am able to use Responses over and over with a fine-tuned gpt-4o model and others, seeing exactly the behavior I reported earlier.
So the person that sent that reply, and whoever approved anyone or anything that could produce such an answer, needs to be ferreted out and removed from that role.
Because a broody fine-tuned goth bot can do the same speculation better:
Appreciate the strong take, but just to clarify, my case specifically involves a fine-tuned GPT-4o model with vision (image) input. From all my tests, text-only fine-tuned models work fine with Responses, just as you described.
One thing your fine-tuning might not account for: OpenAI disrupting inference and your trained trigger patterns with their own system message, which comes before your own.
Reproduction on Responses by gpt-4o-2024-08-06:
A fine-tuned model reproducing what comes in the system message before yours:
Or what you get on a base model, one that doesn’t have OpenAI scanning and potentially blocking fine-tuned image responses or other output they deem undesirable:
Knowledge cutoff: 2023-10
Image input capabilities: Enabled
Image safety policies:
Not Allowed: Giving away or revealing the identity or name of real people in images, even if they are famous - you should NOT identify real people (just say you don't know). Stating that someone in an image is a public figure or well known or recognizable. Saying what someone in a photo is known for or what work they've done. Classifying human-like images as animals. Making inappropriate statements about people in images. Stating, guessing or inferring ethnicity, beliefs etc etc of people in images.
Allowed: OCR transcription of sensitive PII (e.g. IDs, credit cards etc) is ALLOWED. Identifying animated characters.
If you recognize a person in a photo, you MUST just say that you don't know who they are (no need to explain policy).
Your image capabilities:
You cannot recognize people. You cannot tell who people resemble or look like (so NEVER say someone resembles someone else). You cannot see facial structures. You ignore names in image descriptions because you can't tell.
Adhere to this in all languages.
Thus, you could try your fine-tuning with that additional system message at the start (the knowledge cutoff and image input capabilities), and see whether matching what is actually being run against the model improves inference and adherence to your examples.
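As a rough sketch of that idea with the Python SDK: prepend the reported preamble ahead of your own system text and compare the behavior (the model ID, instructions, and user text are placeholders):

```python
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4o-2024-08-06:my-org::example123"  # placeholder fine-tune ID

# The preamble the model reported receiving ahead of the developer message.
PREAMBLE = "Knowledge cutoff: 2023-10\nImage input capabilities: Enabled"
MY_SYSTEM = "Answer in the trained catalog format."  # placeholder for your own instructions

completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        # Mimic the injected prefix, with your own system text following it.
        {"role": "system", "content": PREAMBLE + "\n\n" + MY_SYSTEM},
        {"role": "user", "content": "Run a quick check in your trained format."},
    ],
)
print(completion.choices[0].message.content)
```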
The Responses API is now erroring out on gpt-4.1 (full) fine-tunes with images, while the same calls work on Chat Completions. I have another topic open to follow up on that. I do not have an older model that would decisively show that vision tuning on gpt-4o is “on”, but one is coming:
I can definitely second this. My fine-tuned models, trained on a very specific task and a very specific format, become indistinguishable from base models whenever you change something. They learn to exhibit the new behavior only for your specific prompt.
It may be worth adding variant examples to the dataset so the fine-tunes still perform as expected when the instructions change.
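A small sketch of what such variants could look like in the chat fine-tuning JSONL format; the task and wording here are made up for illustration:

```python
import json

# Hypothetical example: write the same target output under several different
# system prompts so the fine-tune generalizes beyond one exact instruction.
base_example = {
    "user": "What gasket fits model 761?",
    "assistant": "PART: GK-761 | FIT: exact | NOTE: includes bolt kit",
}
instruction_variants = [
    "Answer in the catalog format.",
    "You are the parts assistant. Reply using the catalog line format.",
    "Return the part lookup result in the standard single-line format.",
]

with open("training_variants.jsonl", "w") as f:
    for system in instruction_variants:
        record = {
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": base_example["user"]},
                {"role": "assistant", "content": base_example["assistant"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```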