Why does the use of tools in fine-tuning represent a 40% increase in the number of trained tokens?

I suspect it is because fine-tuning with tools also includes the multi_tool_use tool for parallel tool calls, along with its description. Also, the legacy functions format may not support some parts of a schema and simply ignore them.

I calculate a difference of 911 tokens per example; you can also divide that by the number of epochs trained to get the extra token count for a single pass over one example.
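
If you want to reproduce a rough estimate yourself, here is a minimal sketch that just counts the tokens of the serialized tool definitions in each training example with tiktoken. This is my own approximation, not OpenAI's documented accounting: the exact text the backend injects (wrapper tool, reformatted descriptions, etc.) is not public, so treat this as a lower bound. The file path and function name are mine.

```python
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by gpt-3.5/gpt-4

def tool_overhead_tokens(example: dict) -> int:
    """Tokens consumed by the 'tools' block of one JSONL training example."""
    tools = example.get("tools", [])
    return len(enc.encode(json.dumps(tools)))

with open("train.jsonl") as f:  # hypothetical path to your training file
    examples = [json.loads(line) for line in f]

per_example = [tool_overhead_tokens(ex) for ex in examples]
print("mean tool-spec tokens per example:",
      sum(per_example) / max(len(per_example), 1))
```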

Here is where a lot of the documented bloat comes from:

Then you also have the tokens the AI spends emitting its output in a different manner, if the fine-tuning backend is programmed to invoke that parallel container wrapper even for a single tool call, plus the tool call IDs it gives and receives to match up (and enforce) the pairing of each call with its output in the AI's language.
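
To illustrate what that ID bookkeeping looks like, here is the documented Chat Completions message shape for a tool call and its result; the function name, arguments, and ID are made up:

```python
# The assistant emits an ID with each tool call...
assistant_turn = {
    "role": "assistant",
    "tool_calls": [
        {
            "id": "call_abc123",          # ID the model emits
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": "{\"city\": \"Paris\"}",
            },
        },
    ],
}

# ...and the tool result must echo that ID back so input and output match up.
tool_result = {
    "role": "tool",
    "tool_call_id": "call_abc123",        # must match the emitted ID
    "content": "{\"temp_c\": 18}",
}
```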

So:
functions = less undesired behavior, and less injected text that you didn't write and can't improve.
tools = better handling of nesting, such as descriptions in nested objects, and more JSON schema parameters converted into description text when the tool spec is placed into context (see the sketch below).
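
To make the comparison concrete, here is the same hypothetical get_weather training example written both ways, as I understand the training-file format; only the structural difference between the two matters here:

```python
# Legacy 'functions' format: flat spec, assistant answers with function_call.
legacy_example = {
    "messages": [
        {"role": "user", "content": "Weather in Paris?"},
        {
            "role": "assistant",
            "function_call": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
        },
    ],
    "functions": [
        {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
}

# Newer 'tools' format: the same spec nested under type "function", with the
# assistant answering via tool_calls and IDs. This is the path that triggers
# the extra backend-injected text discussed above.
tools_example = {
    "messages": [
        {"role": "user", "content": "Weather in Paris?"},
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "id": "call_abc123",
                    "type": "function",
                    "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
                }
            ],
        },
    ],
    "tools": [{"type": "function", "function": legacy_example["functions"][0]}],
}
```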

It is a shame that OpenAI obfuscates the actual operation of the AI in terms of the tokens actually consumed.
