Strange token cost calculation for tool_calls

Hi,

I am trying to understand how prompt_tokens are calculated for one/multiple tool_calls and corresponding tool messages.

Here is an example with a single tool call and a single tool message:

{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "How much is 2 plus 2?"
        }
      ]
    },
    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "a",
          "type": "function",
          "function": {
            "name": "calculator",
            "arguments": "{\"a\":2, \"b\":2}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "a",
      "content": "4"
    }
  ]
}

In response I am getting "prompt_tokens": 40.

If I copy/paste the same tool call and tool message (1 user message, 2 tool calls, and 2 tool messages in total), I get 89 tokens, so adding a second (tool call + tool message) pair seems to cost 49 tokens.

But if I copy/paste them again (1 user message, 3 tool calls, and 3 tool messages in total), I get 114 tokens, which means this additional (tool call + tool message) pair costs only 25 tokens. If I go further with 4/5/etc. pairs, the count increases by 25 each time.
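
To make the pattern explicit, here is a quick summary of those numbers (plain Python, nothing API-specific):

# prompt_tokens observed for 1, 2, and 3 (tool call + tool message) pairs
observed = {1: 40, 2: 89, 3: 114}
increments = {n: observed[n] - observed[n - 1] for n in (2, 3)}
print(increments)  # {2: 49, 3: 25} -- the second pair costs 49, the third only 25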

So when the number of tool calls goes from 1 to 2, the second tool call seems to cost roughly double for some reason. Is this a bug?

Thanks!

After some playing around, it seems that adding a second tool call does not “double” the cost, but rather adds an extra 24-25 tokens.

No, it is not a bug.

Your API request JSON is not placed directly into the AI context with all its formatting. What is counted is the language that the AI actually sees.

The language that the AI emits to call a tool is purposefully hidden from you and not documented. Those are the tokens you pay for from a “tool call” message in chat history.

The tool’s return value is also placed in the AI context, as a role message with a name.
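
If you want a rough feel for where the counts come from, you can approximate the overhead of ordinary messages with tiktoken. This is only a sketch of the cookbook-style heuristic for plain content; the hidden tool-call rendering is not included, so it will not reproduce the numbers in this thread exactly:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def rough_prompt_tokens(messages):
    # Cookbook-style approximation: ~3 overhead tokens per message,
    # plus the encoded string fields, plus ~3 tokens priming the reply.
    # Tool calls and tool results use an undocumented internal rendering,
    # so their real overhead is larger than this estimate.
    total = 3  # reply priming
    for message in messages:
        total += 3
        for key, value in message.items():
            if isinstance(value, str):
                total += len(enc.encode(value))
            if key == "name":
                total += 1
    return total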

@_j I understand, but my point is that I would expect the cost of tool calls to increase proportionally with the number of tool calls, whereas the jump from 1 to 2 tool calls seems to add an extra 24-25 tokens on top of the proportional increase.

They seem to increase proportionally for me.

Let’s take the roles out so we can add them on demand:

user = {"role": "user", "content": "1"}
assistant = {
    "role": "assistant",
    # "content": "ok",
    "tool_calls": [
        {
            "id": "x",
            "type": "function",
            "function": {"name": "calculator", "arguments": "2"},
        }
    ],
}
tool = {"role": "tool", "tool_call_id": "x", "content": "3"}

Then let’s make a “request” framework, and then add more assistant/tool pairs and see the increase each run:

request = {"model": "gpt-3.5-turbo", "max_tokens": 1, "messages": []}

# request['messages'].append(user)
request["messages"].append(assistant)
request["messages"].append(tool)
# assistant + tool = 18 prompt

request["messages"].append(assistant)
request["messages"].append(tool)  # + 15
# (assistant + tool) x 2 = 33 prompt

request["messages"].append(assistant)
request["messages"].append(tool)  # + 15 more
# (assistant + tool) x 3 = 48 prompt

Using 1-token contents, each pair takes 15 tokens, plus an additional 3 tokens for the final unseen “assistant” prompt.
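
In other words, for these 1-token contents the usage fits a simple linear pattern (just a sketch against these specific numbers, not an official formula):

# Rough fit: 3 base tokens + 15 per (assistant tool call + tool result) pair
for pairs, observed in [(1, 18), (2, 33), (3, 48)]:
    predicted = 3 + 15 * pairs
    print(pairs, observed, predicted)  # predictions match the observed usage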

And just in case you want the rest to run this, and also see exactly what’s being sent:

from openai import OpenAI
client = OpenAI()
a = client.chat.completions.with_raw_response.create(**request)

chat_completion = a.parse()

print("-- message content --")
print(chat_completion.choices[0].message.content)
print("-- usage --")
print(chat_completion.usage.model_dump())
print("remaining-requests: "
      f"{a.headers.get('x-ratelimit-remaining-requests')}")
print("-- API request --")
print(a.http_request.content)

@_j I am talking about the case where there is a single assistant message with 1/2/3/etc. tool_calls, not multiple assistant messages with a single tool_call each.

{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "a",
          "type": "function",
          "function": {
            "name": "a",
            "arguments": "a"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "a",
      "content": "a"
    }
  ]
}

17 prompt tokens


{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "a",
          "type": "function",
          "function": {
            "name": "a",
            "arguments": "a"
          }
        },
        {
          "id": "b",
          "type": "function",
          "function": {
            "name": "b",
            "arguments": "b"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "a",
      "content": "a"
    },
    {
      "role": "tool",
      "tool_call_id": "b",
      "content": "b"
    }
  ]
}

58 prompt tokens (+41)


{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "a",
          "type": "function",
          "function": {
            "name": "a",
            "arguments": "a"
          }
        },
        {
          "id": "b",
          "type": "function",
          "function": {
            "name": "b",
            "arguments": "b"
          }
        },
        {
          "id": "c",
          "type": "function",
          "function": {
            "name": "c",
            "arguments": "c"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "a",
      "content": "a"
    },
    {
      "role": "tool",
      "tool_call_id": "b",
      "content": "b"
    },
    {
      "role": "tool",
      "tool_call_id": "c",
      "content": "c"
    }
  ]
}

74 prompt tokens (+16)

{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "a",
          "type": "function",
          "function": {
            "name": "a",
            "arguments": "a"
          }
        },
        {
          "id": "b",
          "type": "function",
          "function": {
            "name": "b",
            "arguments": "b"
          }
        },
        {
          "id": "c",
          "type": "function",
          "function": {
            "name": "c",
            "arguments": "c"
          }
        },
        {
          "id": "d",
          "type": "function",
          "function": {
            "name": "d",
            "arguments": "d"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "a",
      "content": "a"
    },
    {
      "role": "tool",
      "tool_call_id": "b",
      "content": "b"
    },
    {
      "role": "tool",
      "tool_call_id": "c",
      "content": "c"
    },
    {
      "role": "tool",
      "tool_call_id": "d",
      "content": "d"
    }
  ]
}

90 prompt tokens (+16)
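
Summarizing those four runs (a rough fit against the observed numbers with 1-token ids, names, arguments, and contents; not an official formula):

# Observed prompt_tokens for one assistant message with n tool calls
# plus n matching tool messages (all fields 1 token each)
observed = {1: 17, 2: 58, 3: 74, 4: 90}
# Rough fit: a single call is cheap; a large fixed overhead appears at n == 2;
# each further (call + tool message) adds 16 tokens.
for n, tokens in observed.items():
    predicted = 17 if n == 1 else 58 + 16 * (n - 2)
    print(n, tokens, predicted)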

You are looking at an assistant that is being told it first emitted one tool call, and then that it emitted multiple tool calls.

So we need a little loop to automate building the requests:

from openai import OpenAI
client = OpenAI()

for toolcount in range(1, 6):
    call = {"id": "x", "type": "function",
        "function": {"name": "calculator", "arguments": "2"}}
    assistant = {"role": "assistant", "tool_calls": []}
    tool = {"role": "tool", "tool_call_id": "x", "content": "3"}
    for _ in range(toolcount):
        assistant["tool_calls"].append(call)
    request = {"model": "gpt-3.5-turbo", "max_tokens": 1, "messages": []}
    request["messages"].append(assistant)
    request["messages"].append(tool)
    # print(request)

    a = client.chat.completions.with_raw_response.create(**request)
    chat_completion = a.parse()
    print(f" --{chat_completion.usage.model_dump()['prompt_tokens']}")

--18
--62
--80
--98
--116

What I see is a big jump when transitioning from one call to two, and then a steady rate of +18 per additional call.

My guess is that this reflects the new language the AI also uses: emitting multiple tool calls involves a much larger container, the AI switches to a different method of writing them to the tool-recipient backend, and telling the AI what it called in the past carries just as much overhead.

@_j This is my guess as well, but it would be nice to get confirmation from someone at OpenAI that this is expected behavior and not a bug.

Any conclusions yet? Counting tokens for multiple tool_calls in a message is confusing enough that I can’t calculate usage when using parallel tool calling with stream = true.
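
One workaround sketch, assuming a recent openai Python SDK and a placeholder calculator tool: rebuild the parallel tool calls from the streamed fragments yourself, and, where the API supports it, request usage in a final chunk via stream_options instead of counting by hand:

from openai import OpenAI

client = OpenAI()

# Hypothetical tool definition, just for the sketch
tools = [{
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Add two numbers",
        "parameters": {
            "type": "object",
            "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
            "required": ["a", "b"],
        },
    },
}]

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Add 2+2 and also 3+3"}],
    tools=tools,
    stream=True,
    # If your account/SDK supports it, this returns usage in a final chunk:
    # stream_options={"include_usage": True},
)

# Rebuild the parallel tool calls from the streamed deltas, keyed by index
calls = {}
for chunk in stream:
    if not chunk.choices:
        continue  # a usage-only chunk has no choices
    delta = chunk.choices[0].delta
    for tc in delta.tool_calls or []:
        slot = calls.setdefault(tc.index, {"id": "", "name": "", "arguments": ""})
        if tc.id:
            slot["id"] = tc.id
        if tc.function and tc.function.name:
            slot["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            slot["arguments"] += tc.function.arguments

print(calls)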

Sending a multi-tool call back to the AI as the previously sent “assistant” message in conversation history would show the AI complying with having produced this wrapper container for functions (which you don’t see in the initial assistant tool output):

## multi_tool_use

// This tool serves as a wrapper for utilizing multiple tools. Each tool that can be used must be specified in the tool sections. Only tools in the functions namespace are permitted.
// Ensure that the parameters provided to each tool are valid according to that tool’s specification.
namespace multi_tool_use {

// Use this function to run multiple tools simultaneously, but only if they can operate in parallel. Do this even if the prompt suggests using the tools sequentially.
type parallel = (_: {
// The tools to be executed in parallel. NOTE: only functions tools are permitted
tool_uses: {
// The name of the tool to use. The format should either be just the name of the tool, or in the format namespace.function_name for plugin and function tools.
recipient_name: string,
// The parameters to pass to the tool. Ensure these are valid according to the tool’s own specifications.
parameters: object,
},
}) => any;

} // namespace multi_tool_use
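
For illustration only, here is a hypothetical arguments object matching that schema, as the duplicated calculator call from the start of the thread might look inside the wrapper; the real internal rendering is not exposed by the API, so the exact form is a guess:

# Hypothetical multi_tool_use.parallel arguments (guesswork, not visible via the API)
payload = {
    "tool_uses": [
        {"recipient_name": "functions.calculator", "parameters": {"a": 2, "b": 2}},
        {"recipient_name": "functions.calculator", "parameters": {"a": 2, "b": 2}},
    ]
}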