Parallel tool calls in chat completions cause token count overestimation from the API

TL;DR

Parallel tool calls inflate the billed prompt token count; reworking them into sequential call-result pairs brings the cost back to normal.

Issue description

I have recently noticed a disturbing issue with the Chat Completions API in the parallel tool calls use-case: whenever an assistant message with multiple tool calls is followed by multiple tool messages (with the tool results), the prompt token count that the API calculates (and bills for) is much higher than the actual token count of the messages in the request.
It reproduces both on the public API and on a private Azure-hosted API, with any of the chat models (3.5, 4, 4 preview, 4o).

How to reproduce

As a reproducible example, here's a sample request:

import requests
import os
import tiktoken

api_key = os.getenv('OPENAI_API_KEY')
model = "gpt-4o"

# Cut off the real values so the code stays highlighted;
# get the full texts from the gist linked below
NY = '## Detailed Weather Report for New York City\n\n### General Overview\n\nNew York City, often simply referred to a...'
BOS = "**Boston Weather Report**\n\nGood day, Bostonians and visitors! This is your comprehensive weather guide..."
OK = "Okinawa, Japan, in August 2024, experiences quintessential tropical weather, marked by hot temperatures, high humidity..."
KYIV = '### Comprehensive Weather Report for Kyiv\n\n#### General Overview\n\nKyiv, historically known as Kiev, is the...'


json_payload = {
    "model": model,
    "messages": [
        {
            "role": "user",
            "content": "What's the weather like today in Boston, New York, Okinawa and Kyiv?"
        },
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
            {
                "id": "call_tHfowN9l9wbWUayh7ooPDuZa",
                "type": "function",
                "function": {
                    "name": "get_current_weather_overview",
                    "arguments": "{\"location\": \"Boston, MA\"}"
                }
            },
            {
                "id": "call_9GbV6TvnBpbmKd6E9pdXysmk",
                "type": "function",
                "function": {
                    "name": "get_current_weather_overview",
                    "arguments": "{\"location\": \"New York, NY\"}"
                }
            },
            {
                "id": "call_OELzWov7HL8pepyMg1cc9k8t",
                "type": "function",
                "function": {
                    "name": "get_current_weather_overview",
                    "arguments": "{\"location\": \"Okinawa, Japan\"}"
                }
            },
            {
                "id": "call_yxqZl8P3U3E2eeIdqaIBvg3i",
                "type": "function",
                "function": {
                    "name": "get_current_weather_overview",
                    "arguments": "{\"location\": \"Kyiv, Ukraine\"}"
                }
            }
            ],
            "refusal": None
        },
        {
            "role": "tool",
            "content": BOS,
            "tool_call_id": "call_tHfowN9l9wbWUayh7ooPDuZa"
        },
        {
            "role": "tool",
            "content": NY,
            "tool_call_id": "call_9GbV6TvnBpbmKd6E9pdXysmk"
        },
        {
            "role": "tool",
            "content": OK,
            "tool_call_id": "call_OELzWov7HL8pepyMg1cc9k8t"
        },
        {
            "role": "tool",
            "content": KYIV,
            "tool_call_id": "call_yxqZl8P3U3E2eeIdqaIBvg3i"
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather_overview",
                "description": "Get the current weather overview in a given location",
                "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                    }
                },
                "required": ["location"]
                }
            }
        }
    ],
    "tool_choice": "auto"
}

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}" 
    },
    json=json_payload   
)

encoding = tiktoken.encoding_for_model(model)
# Count only the plain text content of the messages; roles, tool call
# arguments and tool definitions should add only a modest structural
# overhead on top of this baseline.
total_content = "\n".join(m['content'] for m in json_payload['messages'] if m['content'] is not None)

api_count = response.json()['usage']['prompt_tokens']
tt_count = len(encoding.encode(total_content))
overhead = (api_count/tt_count - 1) * 100

print(f"API responded: {api_count} prompt tokens")
print(f"Tiktoken counted: {tt_count} tokens in all messages content")
print(f"Overhead is {overhead:.2f}% ({api_count - tt_count} tokens)")

GitHub gist with the tool message content values (generated texts of ~1.6k tokens each): /alex-semblyai/ed400cc4a2767fa63364ada35cc1462f


As a result, I get the following token count mismatch:

API responded: 7073 prompt tokens
Tiktoken counted: 5902 tokens in all messages content
Overhead is 19.84% (1171 tokens)

The redundant token count is more than 1k, which is far more than can be explained by the extra tokens the API adds under the hood for message structure and function definitions.
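
To sanity-check that, here is a rough estimate of the expected overhead, meant to be appended to the script above. The per-message and reply-priming constants come from OpenAI's cookbook recipe on counting chat tokens with tiktoken; tokenizing the tool call names and arguments as plain text is my own approximation, since the injected syntax is not published:

# Rough structural overhead: ~3 wrapper tokens per message plus ~3 tokens
# priming the reply (constants from the tiktoken cookbook recipe).
structural = 3 * len(json_payload['messages']) + 3

# Tool call names and arguments, tokenized as plain text (an
# approximation: the real injected syntax is undocumented).
tool_call_text = "".join(
    tc['function']['name'] + tc['function']['arguments']
    for m in json_payload['messages']
    for tc in (m.get('tool_calls') or [])
)

expected = tt_count + structural + len(encoding.encode(tool_call_text))
print(f"Rough expectation: ~{expected} prompt tokens")  # nowhere near 7073

Even with a generous allowance for the tool definitions, this lands around 6k tokens; tellingly, that is close to the 6090 the restructured request below is actually billed.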

Moreover, the overhead ratio differs a lot between requests. For example, in one of my test runs tiktoken counted ~11k prompt tokens, while the API counted ~56k.


The workaround

I also found a workaround: after tool execution, restructure the messages into sequential call-result pairs, like this (a generic helper sketch follows the diagram):

--- Before ---
M0: user
M1: assistant
      - tool call 1
      - tool call 2
M2: tool
      - tool call 1 result
M3: tool
      - tool call 2 result

--- After ---
M0: user
M1: assistant
      - tool call 1
M2: tool
      - tool call 1 result
M3: assistant
      - tool call 2
M4: tool
      - tool call 2 result
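
For arbitrary histories, here is a minimal sketch of a helper that performs this restructuring automatically (my own code, assuming every tool call in an assistant message has exactly one matching tool result among the tool messages that directly follow it):

def sequentialize(messages):
    """Split each assistant message carrying parallel tool calls into
    sequential (assistant call, tool result) message pairs."""
    out, i = [], 0
    while i < len(messages):
        msg = messages[i]
        calls = msg.get('tool_calls') or []
        if msg.get('role') == 'assistant' and len(calls) > 1:
            # Collect the tool results that directly follow this message
            results, j = {}, i + 1
            while j < len(messages) and messages[j].get('role') == 'tool':
                results[messages[j]['tool_call_id']] = messages[j]
                j += 1
            # Emit one call-result pair per tool call, preserving order
            for call in calls:
                out.append({**msg, 'tool_calls': [call]})
                out.append(results[call['id']])
            i = j
        else:
            out.append(msg)
            i += 1
    return out

Applied to the messages list from the payload above, it produces exactly the restructured list used in the sample below.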

Once we do so, the overhead disappears, even though the message content is exactly the same. A reproducible code sample:

import requests
import os
import tiktoken

api_key = os.getenv('OPENAI_API_KEY')
model = "gpt-4o"

# Cut off the real values so the code stays highlighted;
# get the full texts from the gist linked above
NY = '## Detailed Weather Report for New York City\n\n### General Overview\n\nNew York City, often simply referred to a...'
BOS = "**Boston Weather Report**\n\nGood day, Bostonians and visitors! This is your comprehensive weather guide..."
OK = "Okinawa, Japan, in August 2024, experiences quintessential tropical weather, marked by hot temperatures, high humidity..."
KYIV = '### Comprehensive Weather Report for Kyiv\n\n#### General Overview\n\nKyiv, historically known as Kiev, is the...'

json_payload_restructured = {
    "model": model,
    "messages": [
        {
            "role": "user",
            "content": "What'\''s the weather like in today in Boston, New York, Okinawa and Kyiv?"
        },
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": "call_tHfowN9l9wbWUayh7ooPDuZa",
                    "type": "function",
                    "function": {
                        "name": "get_current_weather_overview",
                        "arguments": "{\"location\": \"Boston, MA\"}"
                    }
                }
            ],
            "refusal": None
        },
        {
            "role": "tool",
            "content": BOS,
            "tool_call_id": "call_tHfowN9l9wbWUayh7ooPDuZa"
        },
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": "call_9GbV6TvnBpbmKd6E9pdXysmk",
                    "type": "function",
                    "function": {
                        "name": "get_current_weather_overview",
                        "arguments": "{\"location\": \"New York, NY\"}"
                    }
                }
            ],
            "refusal": None
        },
        {
            "role": "tool",
            "content": NY,
            "tool_call_id": "call_9GbV6TvnBpbmKd6E9pdXysmk"
        },
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": "call_OELzWov7HL8pepyMg1cc9k8t",
                    "type": "function",
                    "function": {
                        "name": "get_current_weather_overview",
                        "arguments": "{\"location\": \"Okinawa, Japan\"}"
                    }
                }
            ],
            "refusal": None
        },
        {
            "role": "tool",
            "content": OK,
            "tool_call_id": "call_OELzWov7HL8pepyMg1cc9k8t"
        },
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": "call_yxqZl8P3U3E2eeIdqaIBvg3i",
                    "type": "function",
                    "function": {
                        "name": "get_current_weather_overview",
                        "arguments": "{\"location\": \"Kyiv, Ukraine\"}"
                    }
                }
            ],
            "refusal": None
        },
        {
            "role": "tool",
            "content": KYIV,
            "tool_call_id": "call_yxqZl8P3U3E2eeIdqaIBvg3i"
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather_overview",
                "description": "Get the current weather overview in a given location",
                "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                    }
                },
                "required": ["location"]
                }
            }
        }
    ],
    "tool_choice": "auto"
}

restructured_response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}" 
    },
    json=json_payload_restructured
)

encoding = tiktoken.encoding_for_model(model)
# Same content-only baseline as in the first script
total_content = "\n".join(m['content'] for m in json_payload_restructured['messages'] if m['content'] is not None)

api_count = restructured_response.json()['usage']['prompt_tokens']
tt_count = len(encoding.encode(total_content))
overhead = (api_count/tt_count - 1) * 100

print(f"API responded: {api_count} prompt tokens")
print(f"Tiktoken counted: {tt_count} tokens in all messages content")
print(f"Overhead is {overhead:.2f}% ({api_count - tt_count} tokens)")

And we get the following token counts:

API responded: 6090 prompt tokens
Tiktoken counted: 5902 tokens in all messages content
Overhead is 3.19% (188 tokens)
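
Comparing the two runs directly (assuming both scripts were executed in the same session, so both response objects are still in scope):

saved = response.json()['usage']['prompt_tokens'] \
    - restructured_response.json()['usage']['prompt_tokens']
print(f"Restructuring saved {saved} prompt tokens")  # 983 for the payloads above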


Related materials

I found mentions of similar issues on the dev forum, but no documentation references, official feedback, or resolution advice:

  • Strange token cost calculation for tool_calls - API - OpenAI Developer Forum
  • Inconsistent token billing for tool_calls in gpt-3.5-turbo-1106 - API / Bugs - OpenAI Developer Forum
  • Token Count: Playground vs Tokenizer - GPT builders - OpenAI Developer Forum

The token usage info in the function calling documentation (Function Calling - OpenAI API) mentions only that function definitions are injected into the system message; I haven't seen anything there related to this issue:

Under the hood, functions are injected into the system message in a syntax the model has been trained on. This means functions count against the model’s context limit and are billed as input tokens <…>
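
For what it's worth, the tool definitions in the reproduction above cannot account for the gap. A quick check, appended to the first script (serializing the definitions as JSON is only a rough proxy for the undocumented injected syntax):

import json

# Tokenize the JSON-serialized tool definitions as a rough proxy for
# their billed size (the real injected syntax is not published).
tools_text = json.dumps(json_payload['tools'])
print(f"~{len(encoding.encode(tools_text))} tokens of tool definitions")

That comes out to a few dozen tokens, nowhere near the 1k+ discrepancy observed.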

Conclusion

I'm not totally sure whether this is a bug, a feature, or a misuse on my part, but the effect increases the cost of requests with parallel tool usage unreasonably and unexpectedly.
In some cases the cost overhead may reach up to 500% (according to my observations).
It is worth mentioning in the documentation's “token usage” section at least, and possibly providing some advice on how to overcome it.
The workaround I found is quite inconvenient, and I'm also not sure how it affects the LLM outputs.