What you are looking at here is an assistant that is being told it previously emitted first one, and then progressively more, tool calls. So we need a little loop to automate building the requests:
from openai import OpenAI

client = OpenAI()

for toolcount in range(1, 6):
    # one fabricated tool call the assistant supposedly made earlier
    call = {"id": "x", "type": "function",
            "function": {"name": "calculator", "arguments": "2"}}
    # assistant message replaying `toolcount` copies of that call
    assistant = {"role": "assistant", "tool_calls": []}
    # tool message returning the (fake) result for that call id
    tool = {"role": "tool", "tool_call_id": "x", "content": "3"}
    for _ in range(toolcount):
        assistant["tool_calls"].append(call)
    request = {"model": "gpt-3.5-turbo", "max_tokens": 1, "messages": []}
    request["messages"].append(assistant)
    request["messages"].append(tool)
    # print(request)
    a = client.chat.completions.with_raw_response.create(**request)
    chat_completion = a.parse()
    print(f" --{chat_completion.usage.prompt_tokens}")
 --18
 --62
 --80
 --98
 --116
What I see is a big jump when transitioning from one to two calls (+44 tokens), and then a constant rate of +18 tokens per additional call.
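A quick sanity check on those increments, using nothing but the prompt_tokens values printed above:

# differences between consecutive prompt_token counts from the run above
prompt_tokens = [18, 62, 80, 98, 116]
deltas = [b - a for a, b in zip(prompt_tokens, prompt_tokens[1:])]
print(deltas)  # [44, 18, 18, 18]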
My guess is that this reflects the internal language the AI itself uses: emitting multiple tool calls requires a much larger container format, so the AI transitions to a different method of writing them to the tool-recipient API backend, and replaying to the AI what it called in the past carries that same overhead.
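If that holds, you could budget for the overhead with a simple linear model. To be clear, this is just a fit to the five measurements above, not anything documented by OpenAI, and it may shift with model or tokenizer changes:

def estimated_prompt_tokens(n_calls: int) -> int:
    """Rough fit to the measurements above (speculative, not an official formula)."""
    if n_calls <= 1:
        return 18                 # single-call container
    return 26 + 18 * n_calls      # multi-call container: 62, 80, 98, 116 for n = 2..5

print([estimated_prompt_tokens(n) for n in range(1, 6)])  # [18, 62, 80, 98, 116]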