I fine-tuned a GPT-4.1 model using the function calling feature. All of my training examples followed a strict format. Here’s a simplified example:
```json
{
  "messages": [
    { "role": "user", "content": "Book a hotel in Paris" },
    {
      "role": "assistant",
      "function_call": {
        "name": "extract_booking_request",
        "arguments": "{\"location\": \"Paris\", \"action\": \"book\", \"reject_message\": \"no\"}"
      }
    }
  ]
}
```
In training:
- I never included fields like "action": "search" or "location": "London".
- I used only valid, specific values for all fields.
- I did not include a system prompt or function schema in the fine-tuning file.
Issue:
After training, when I send a user message like:
```python
messages = [{"role": "user", "content": "Book a hotel in Paris"}]
```
The model responds with a function call like:
"function_call": {
"name": "extract_booking_request",
"arguments": {
"location": "Paris",
"action": "search" ← ❌ this value was never in training data
}
}
It invents field values that were never used in any of the fine-tuned examples.
❓ What I want to understand:
Why does a fine-tuned GPT-4.1 model invent values that were not in the training data?
Should I include a system prompt or function schema during inference, even if I didn’t during fine-tuning?
Is this expected behavior, or am I missing something to force it to follow the training pattern more strictly?
Thanks in advance for any insights from the OpenAI team or other community experts.
The AI invents and infers because you are not training a blank slate.
Fine-tuning reinforcement is run on a model that already has strong learned patterns for using functions and for inferring values for keys, on top of its general language intelligence.
You can write lots of "location": "bananatown" examples, but the AI will still use its prior learning to fill in "London" when the user isn't asking about fruity villages. Producing a value you never trained on when that is actually the smart inference, or declining to book a hotel when it has no idea what is being booked, is natural.
What you’ll want to do for your use case is just what you would have done without fine-tuning (see the sketch below):
- specify a strict function schema
- use parameter property description fields
- use enums to enforce strictly validated field values
You will need to pass functions (“tools”) in your API calls so that the output is emitted as a proper tool call for a tool recipient and gets the parallel tool call wrapper. Construct your API calls exactly as you trained, and train on exactly the inference pattern you will use. You can trim some of the more pedantic language from your prompting and system messages, but you must train with, and then send, a consistent system message that is unique to your application for good training activation.
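As a rough sketch only (the model id, system message wording, and schema details below are placeholders inferred from this thread, not anything official), a strict tool definition with descriptions and enums, plus an inference call that mirrors the training format, could look like this with the current openai Python SDK:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical strict schema: descriptions plus enums so the model cannot
# invent values like "search" for "action". Field names are assumptions
# taken from the example earlier in the thread.
booking_tool = {
    "name": "extract_booking_request",
    "description": "Extract a structured hotel booking request from the user's message.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City the user wants a hotel in, e.g. 'Paris'.",
            },
            "action": {
                "type": "string",
                "enum": ["book"],
                "description": "The requested action; only 'book' is allowed.",
            },
            "reject_message": {
                "type": "string",
                "enum": ["no", "yes"],
                "description": "'yes' only when the request must be rejected.",
            },
        },
        "required": ["location", "action", "reject_message"],
        "additionalProperties": False,
    },
}

# The same system message should appear verbatim in every training example
# and in every inference call; the wording here is only a placeholder.
SYSTEM_MESSAGE = "You extract hotel booking requests as function calls."

response = client.chat.completions.create(
    model="ft:gpt-4.1-2025-04-14:my-org::abc123",  # placeholder fine-tuned model id
    messages=[
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": "Book a hotel in Paris"},
    ],
    tools=[{"type": "function", "function": booking_tool}],
    tool_choice="auto",
)

print(response.choices[0].message.tool_calls)
```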
I, too, think that models fine-tune more easily when you’re using engineered prompts. You may need more examples if you want to get to the point where you can omit the schema, but then you’d have to justify the additional cost to yourself. If your goal with fine-tuning is just to reduce tokens, there’s a chance it’s not worth the trouble.
Actually, when I use function calling with a strict schema, I do get very accurate results — so that part is working well.
But now I’m wondering:
What is the real benefit of fine-tuning in this case?
The whole reason I went with fine-tuning was to reduce token usage and cost.
However, if I’m still required to include the full function schema or system prompt after fine-tuning, and given that fine-tuned models have higher pricing per million tokens,
then fine-tuning is actually increasing my cost instead of helping reduce it.
Is there a more efficient way to benefit from fine-tuning without having to send all the function definitions again at inference time?
You need a lot of examples in order to bake in your instructions, otherwise it’ll just learn to hallucinate, which is what you’re seeing here. OpenAI recommends starting at 80, but in my experience I’ve always needed around 200. How many examples are in your dataset? How many of them call your function?
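If it helps to check, here is a small, assumed sketch (the file name and message keys are guesses based on the formats shown in this thread) for counting how many training examples include an assistant function/tool call:

```python
import json

# Count training examples, and how many contain an assistant function/tool call.
# "training_data.jsonl" and the message keys below are assumptions.
total = with_calls = 0
with open("training_data.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        total += 1
        if any(
            m.get("role") == "assistant" and ("function_call" in m or "tool_calls" in m)
            for m in example["messages"]
        ):
            with_calls += 1

print(f"{total} examples total, {with_calls} with a function call")
```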
Usually it’s for refining the style / personality of the responses, or for improving accuracy on a specific task in a cheaper model. If your entire goal is just to remove the schema, it probably isn’t worth it to do all of this, especially if you’re not calling the model at a large scale.
In my experience, gpt-4.1-mini and nano are really poor with function calling (they struggle to intelligently distinguish similar functions - a result of quantisation fallout?).
My guess (please be skeptical about what I say): you used fine-tuning to “save” the tokens spent on the tool description, which removed “grounding context” from your data set… (Ignore this if your training samples contain the full tool description schema.)
So instead of tuning for better “selection” among provided fields, you fine-tuned for better “field guessing” based on provided input…
I would definitely include the full schema in all data samples, and if possible I would also mix in several other tool calls from the same domain, so that the model learns the pattern of matching input to schema to task to output (the whole chain) while the schema also varies. (A sketch of one such training line follows below.)
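Purely as an illustration (the schema, values, and system message are assumptions based on the earlier example, and this uses the current tools/tool_calls training format rather than the legacy function_call one), a training line that carries its own schema could be built like this:

```python
import json

# Hypothetical, trimmed tool schema; field names and enum values are
# assumptions taken from the example earlier in the thread.
booking_tool = {
    "name": "extract_booking_request",
    "description": "Extract a structured hotel booking request.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City to book in."},
            "action": {"type": "string", "enum": ["book"]},
            "reject_message": {"type": "string", "enum": ["no", "yes"]},
        },
        "required": ["location", "action", "reject_message"],
    },
}

# One fine-tuning example with the tool schema included next to the messages.
example = {
    "messages": [
        {"role": "system", "content": "You extract hotel booking requests as function calls."},
        {"role": "user", "content": "Book a hotel in Paris"},
        {
            "role": "assistant",
            "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "extract_booking_request",
                    "arguments": json.dumps(
                        {"location": "Paris", "action": "book", "reject_message": "no"}
                    ),
                },
            }],
        },
    ],
    "tools": [{"type": "function", "function": booking_tool}],
}

# Each example becomes one line of the JSONL training file.
print(json.dumps(example))
```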
Fine-tuning on single-schema samples makes sense when the input is too long or the schema is too complex…
But ideally, the issue with tool calls should be solved well before calling them, at the workflow design stage: make your tools simple and single-purpose, so that they are easy for the model to “understand”.
Fine-tuning tool calls makes sense when the task is complex and can otherwise be solved only by expensive models, so that the fine-tuned lower-class model performs the same as (or often better than) the higher-class one, is faster, and stays cheaper despite the extra cost of fine-tuning.
Totally agree, tool description is important, and fine-tuning without it will “confuse” the model.
Think of fine-tuning as “behavior modifications” as the first effect. The additional weights “tune” the predicted logits, as the name implies.
This can be an AI that you have only ever shown one-sentence responses in examples, so it learns the style and learns to wrap up the output quickly.
It can indeed be reinforcement of when to use particular parameter values, showing by example instead of by instruction. With enough “I want to book a hotel” examples, the AI may find that filling out “action” with “BOOK” as you trained is more certain than “book a hotel” or “search for hotels”.
However, getting it to blindly parrot decisions takes overfitting, which is poorer at general intelligence: many n_epochs at an increased learning-rate multiplier over a large set of examples. The examples will also need to be proportionate, so you don’t end up with a monotone repeater that gives a single wrong output for many different cases.
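For reference, and only as an assumed sketch (the file id and base model name are placeholders, and this assumes the current openai Python SDK), those knobs are set as hyperparameters when creating the fine-tuning job:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder training file id and base model name. Raising n_epochs and the
# learning-rate multiplier pushes toward the memorization/overfitting
# described above, at the expense of general intelligence.
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-4.1-2025-04-14",
    hyperparameters={
        "n_epochs": 3,
        "learning_rate_multiplier": 2,
    },
)

print(job.id, job.status)
```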
Using a fine-tuned model takes work and training cost, only to leave you with a more expensive AI to run. A simple upgrade to a better, well-instructed model that is not billed at fine-tuning rates might reach parity, at the same ultimate cost.
Agreed, fine-tuning just to eliminate the schema is like using war bombs as pesticides. It isn’t cheaper at smaller scale, and it also robs you of any chance of changing the instructions without starting over.
I can do this with the base model when I use the schema enum property, no need to fine-tune. I tried it and the response was always perfect, but it costs me tokens when I add the prompt and schema with each message.
I have thousands of messages every hour, so I don’t care about the fine-tuning cost, I only care about how to reduce the tokens sent with each message.
I now use a big prompt with a big schema. It runs well, but it costs me a huge number of tokens per day.
I tried using the cheaper model GPT-4.1 mini, but the response was bad. For example,
when the user said “TOMORROW I HAVE DATE AND MAY BE BOOK HOTEL”,
the cheaper GPT-4.1 mini model made a booking, and this is the wrong response because the user did not ask for it, he just said “maybe book”.