First post, long time reader.
I have a question regarding fine-tuning a model with function calling / tools.
I want to fine tune 4o-mini based on the good responses from 4o. Basically my prompt starts with a set of instructions and tools, then I let the model execute the instructions with the tools provided. 4o is really good at doing this, but it is costly. So I thought I could collect a lot of “good responses” from the main model and use that to train the mini model. So I did this, but the fine tuned model is worst now than just 4o-min.
To fine tune the model I used the OpenAI dashboard / playground, which automatically use all the stored calls, but this also includes the intermediate steps, like when you send the result of a tool execution and you wait for a response.
So, after all this introduction, my question is: should I be only fine tuning the model with full length conversations (so 1 full complete execution log), rather than than with intermediate steps like is done via the dashboard?