Finetuning with tool calls and tool responses

Hello,
we decided to fine-tune GPT-4.1 or GPT-4.1-mini to improve both tool-use ability and final response generation.

In our use case, we let the model access several tools (around 20).

We opted to use large, expensive reasoning models (e.g. o3 and o4) to determine in advance which tools are best to call, and in which order, for a given user query.

The fine-tuning docs say that training samples must follow the chat-completion format.

So, ideally, we should come up with a set of chats like the following example:

{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "<User query>"
        },
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "id": "call_id_1",
                    "type": "function",
                    "function": {
                        "name": "tool_name_1",
                        "arguments": "{\"arg1\": \"value1\", \"arg2\": \"value2\"}"
                    }
                }
            ]
        },
        {
            "role": "tool",
            "tool_call_id": "call_id_1",
            "content": [
                {
                    "type": "text",
                    "text": "Tool1 response content"
                }
            ]
        },
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "id": "call_id_2",
                    "type": "function",
                    "function": {
                        "name": "tool_name_2",
                        "arguments": "{\"arg1\": \"value1\", \"arg2\": \"value2\"}"
                    }
                }
            ]
        },
        {
            "role": "tool",
            "tool_call_id": "call_id_2",
            "content": [
                {
                    "type": "text",
                    "text": "Tool2 response content"
                }
            ]
        },
        {
            "role": "assistant",
            "content": "Final response"
        }
    ],
    "parallel_tool_calls": false,
    "tools": [
        // list of available tools
    ]
}

Schematically:

  • system message
  • user message
  • assistant that says ‘I want to use tool 1’
  • response from tool 1
  • repeat until no more tools are needed
  • final assistant response
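
A minimal Python sketch of how one such sample could be assembled and sanity-checked before being written to a JSONL training file. The helper names (`tool_call`, `tool_result`, `check_tool_pairing`) and the tool definition are hypothetical, made up for illustration. The sketch assumes the Chat Completions convention that `arguments` is a JSON-encoded string, and it uses a plain string for the tool content.

```python
import json

# Hypothetical tool definition, for illustration only.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "tool_name_1",
            "description": "Placeholder tool.",
            "parameters": {
                "type": "object",
                "properties": {"arg1": {"type": "string"}},
                "required": ["arg1"],
            },
        },
    }
]


def tool_call(call_id, name, arguments):
    """Build one assistant tool call; arguments are serialized to a JSON string."""
    return {
        "id": call_id,
        "type": "function",
        "function": {"name": name, "arguments": json.dumps(arguments)},
    }


def tool_result(call_id, text):
    """Build the tool message that answers the call with the given ID."""
    return {"role": "tool", "tool_call_id": call_id, "content": text}


def check_tool_pairing(messages):
    """Return True if every tool call is answered before the next non-tool turn."""
    pending = set()
    for msg in messages:
        if msg["role"] == "tool":
            if msg["tool_call_id"] not in pending:
                return False  # tool result without a matching call
            pending.discard(msg["tool_call_id"])
            continue
        if pending:
            return False  # a new turn started while calls were still unanswered
        if msg["role"] == "assistant":
            for call in msg.get("tool_calls", []):
                pending.add(call["id"])
    return not pending


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "<User query>"},
    {"role": "assistant",
     "tool_calls": [tool_call("call_id_1", "tool_name_1", {"arg1": "value1"})]},
    tool_result("call_id_1", "Tool1 response content"),
    {"role": "assistant", "content": "Final response"},
]

sample = {"messages": messages, "parallel_tool_calls": False, "tools": TOOLS}
assert check_tool_pairing(messages)

# One sample per line in the training file.
jsonl_line = json.dumps(sample)
```

The pairing check is only a local sanity test; it says nothing about how the training API itself validates or weights the messages.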

Assuming the JSON above is correct and tool outputs can be inserted in the order shown, our questions are:

  1. Is the loss computed only on ‘assistant’ role messages? We do not want the model to ‘learn to predict’ tool outputs (this is crucial).
  2. Is it good practice to teach the model both how to make tool calls AND how to write the final response in the same training sample? Are there better alternatives to the JSON above?
  3. In the final training sample submitted to the training API, do we need to insert fake IDs for the tool calls, or is it fine to leave placeholders as in the example (e.g. ‘call_id_1’)?
  4. What if we want parallel tool calls? Do we write one assistant message that calls multiple tools, followed by several role=‘tool’ messages, one per tool call?

Thank you all for reading, I hope this post will help others as well.

Resource: https://platform.openai.com/docs/api-reference/fine-tuning/chat-input

There is literally no guidance or documentation on training.

You get examples of “marv the sarcastic chatbot”, a system prompt that needs no fine-tuning.

I would think the entire chat would serve well. My reasoning is that you are training the model on sequences with supervised learning. It sees an input pattern, it completes the rest. That can be producing the rest of a function call after “search the database”, terminated by the end-of-message token, or the rest of “Here is your answer” after a function return.

The AI already knows how to send to functions. At most, tuning just reduces the amount of description you need to provide.

Repeating different lengths of the same chat example could lead to overfitting on writing functions, without good interpolation.
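
To make “different lengths of the same chat” concrete: one reading is that a single conversation gets split into several training examples, each ending on an assistant turn. The sketch below only illustrates that reading (the function name is hypothetical); it is not a recommendation, since the concern above is precisely that this may overfit.

```python
def prefixes_ending_on_assistant(messages):
    """Yield every prefix of `messages` whose last message is an assistant turn."""
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            yield messages[: i + 1]


chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "<User query>"},
    {"role": "assistant", "tool_calls": [
        {"id": "call_id_1", "type": "function",
         "function": {"name": "tool_name_1", "arguments": "{}"}},
    ]},
    {"role": "tool", "tool_call_id": "call_id_1", "content": "Tool1 response content"},
    {"role": "assistant", "content": "Final response"},
]

# Two examples result: one ending on the tool call, one on the final response.
examples = list(prefixes_ending_on_assistant(chat))
```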

You ask good questions: does the weighting also train the self-attention, and is it therefore a token-by-token training that rewards each token’s state only within the “assistant” part?

I think a conclusion might be “possibly”. When (previously) using completion base models, the only input was a “prompt” field, and this was distinct from the completion you provided:

{"prompt":"user:Bob---What is Earth?\n\n\n","completion":"AI:SarcasmBot---Earth? Meh, just another ball of dirt.###"}

Were they equivalent, one could just train on prompt alone.

So: are you trying to train function calling that is under-described, that is, trimming the tool specification descriptions to the point where the functions, or what is expected of them, are no longer obvious? That is the case where I would end many of the chat examples at the assistant’s function call.


The AI does not receive or generate tool call or function IDs. The IDs only keep returns matched to their calls, in the same order, when results come back asynchronously through parallel function calling (yet IDs are still mandated and force call-return pairs).

The function usage should follow the same pattern as you would have stored in a chat completions history, including parallel tool calls. The only departure is the placement of the final “assistant” for output training.
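
As a sketch of what that pattern looks like for parallel calls: a single assistant message carries several `tool_calls`, followed by one `tool` message per call, in the same order. The helper name `parallel_turn` and the tool names are hypothetical, and the sample would presumably set `"parallel_tool_calls": true` instead of `false`; that part is an assumption, not something confirmed in this thread.

```python
import json


def parallel_turn(calls, results):
    """Build one assistant message with several tool calls plus their tool results.

    calls:   list of (call_id, tool_name, arguments_dict)
    results: dict mapping call_id to the tool's text output
    """
    assistant = {
        "role": "assistant",
        "tool_calls": [
            {
                "id": call_id,
                "type": "function",
                "function": {"name": name, "arguments": json.dumps(args)},
            }
            for call_id, name, args in calls
        ],
    }
    # One tool message per call, kept in the same order as the calls.
    tool_messages = [
        {"role": "tool", "tool_call_id": call_id, "content": results[call_id]}
        for call_id, _, _ in calls
    ]
    return [assistant] + tool_messages


turn = parallel_turn(
    calls=[
        ("call_id_1", "tool_name_1", {"arg1": "value1"}),
        ("call_id_2", "tool_name_2", {"arg1": "value1"}),
    ],
    results={
        "call_id_1": "Tool1 response content",
        "call_id_2": "Tool2 response content",
    },
)
```

The returned list would sit between the user message and the final assistant response in the sample’s `messages`.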

There is certainly tons of missing documentation that should be written. More experimentation, more fails, more misapplication of fine-tuning: more profit?

Thank you for your reply, really appreciated!

The idea of fine-tuning with tool calls is to encourage the final model to use tools more than it otherwise would.

In practice: we noticed that reasoning models can plan tool usage really well (multi-step, parallel, etc.), so we wanted to “transfer” this multi-tool capacity to the end model (e.g. gpt-4.1-mini).

As a plus, the final response in the training dataset will also be generated by a heavy model, so, as in distillation, we would like gpt-4.1-mini to mimic the well-prepared responses.

The actual tool call is the AI writing its initial token generation to a tool recipient instead of closing its prefix and writing to a user.

The AI can also invoke a tool in the same response after writing to a user. This takes a good amount of function description and prompting to ensure, but can have the AI actually planning, “to answer this question, I’ll use the problems_database followed by the company customer_lookup functions, hang in there while I start..” - another possibility one could explore with fine-tuning.

I’ll give you a free hint:

Chat_Completions("logit_bias": {316: +5}, ...)

You can’t demote initially writing to a user, but you can promote sending to a tool. Note that logit_bias currently works only if you don’t use sampling parameters, and OpenAI has not responded to questions about why it is broken in conjunction with temperature or top_p.
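
For anyone who wants to try the hint, here is a rough sketch using the `openai` Python package. Token ID 316 and the +5 bias are taken from the hint above and have not been verified here; `build_request` and `send` are hypothetical helper names. The request leaves out temperature and top_p on purpose, following the note above.

```python
def build_request(model, messages, tools=None, token_id=316, bias=5):
    """Assemble Chat Completions keyword arguments with a bias on one token ID."""
    kwargs = {
        "model": model,
        "messages": messages,
        # The API takes token IDs as string keys and bias values from -100 to 100.
        "logit_bias": {str(token_id): bias},
    }
    if tools:
        kwargs["tools"] = tools
    # No temperature or top_p here, per the note about logit_bias above.
    return kwargs


def send(kwargs):
    """Send the request; needs the `openai` package and OPENAI_API_KEY set."""
    from openai import OpenAI

    client = OpenAI()
    return client.chat.completions.create(**kwargs)


request = build_request(
    "gpt-4.1-mini",
    [{"role": "user", "content": "<User query>"}],
)
```

Calling `send(request)` performs the actual API call; comparing how often the first message is a tool call with and without the bias would be one way to check whether the hint holds for your tools.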