I want to manually count token usage since the Assistants API doesn’t provide any usage information right now.
I’m charged $0.03 as soon as the run steps include a tool call to the code interpreter, but how are the input/output tokens calculated?
Is the following assumption right? The input to the code interpreter counts as completion tokens, since the code was generated by GPT. Once the code interpreter finishes, both its input and its output count as prompt tokens, since they are appended to the conversation context.
Example:
user: Write Python code that calculates 2+2 and give me the output.
code_interpreter('print(2+2)')
assistant: The output was “4”
I would now count print(2+2) and The output was "4" as completion tokens. As prompt tokens I would count the user message (Write Python code that calculates 2+2 and give me the output.), plus print(2+2) as the input to the code interpreter, plus 4 as the output of the code interpreter. Are there any additional tokens that I’m missing?
So this consumed 47 completion tokens. Using https://platform.openai.com/tokenizer I get 23 tokens for the input to the code interpreter and 16 for the answer, which makes 39; the leftover 8 are probably some kind of marker for the code interpreter call. HOWEVER, this also consumed 288 prompt tokens, even though I did not provide any instructions. Are we paying for some default system prompt?
I suspect there is some description of code interpreter, so that the LLM knows when to trigger it. It’s similar to any generic tool where we provide the description.
For ChatGPT, some folks have found the system prompt that describes code interpreter.
Tools
Python
When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 60.0 seconds. The drive at ‘/mnt/data’ can be used to save and persist user files. Internet access for this session is disabled. Do not make external web requests or API calls as they will fail.
Good call! I ran some tests and “reverse engineered” their token usage. It’s kind of hacky and I’ve only tested it for my use case: I expect exactly one user message that is followed by exactly one code interpreter call, with retrieval and function calling disabled and no instructions. I also haven’t tested whether the name of the assistant has an impact, but it should be pretty straightforward to take that into account if you need to. It should also work for multiple code interpreter calls and user messages, but I did not test that.
Findings:
Base system prompt for an assistant called “Test1” with no instructions and only code interpreter enabled consumes 105 prompt tokens
Each user message adds 4 prompt tokens on top of the message’s own tokens
Each assistant message adds 2 completion tokens on top of the message’s own tokens
Each code interpreter call adds 6 completion tokens on top of its input tokens (the input to the code interpreter is generated by the model, so it counts as completion tokens). Additionally, each code interpreter call adds 14 prompt tokens on top of the tokens for its input and its output log
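For the simplest case (one user message, one code interpreter call, one final answer), the findings above reduce to a bit of arithmetic. The sketch below is only an illustration of that arithmetic, assuming the run makes two completions (one that writes the code, one that writes the answer). The function name and the constant names are made up for this example, and the overheads are the ones measured above for one specific setup.

```python
# Overheads reported in the findings above (measured for one specific setup).
BASE_PROMPT = 105       # system prompt with only code interpreter enabled
PER_USER = 4            # prompt overhead per user message
PER_ASSISTANT = 2       # completion overhead per assistant message
PER_CI_PROMPT = 14      # prompt overhead per code interpreter call
PER_CI_COMPLETION = 6   # completion overhead per code interpreter call


def estimate_single_call_usage(user_tokens, code_tokens, log_tokens, answer_tokens):
    """Estimate (prompt, completion) tokens for one user message,
    one code interpreter call and one final assistant answer."""
    # First completion: the model sees the system prompt and the user message,
    # then writes the code.
    first_prompt = BASE_PROMPT + PER_USER + user_tokens
    # Second completion: the same context plus the tool call (code and log),
    # then the model writes the final answer.
    second_prompt = first_prompt + PER_CI_PROMPT + code_tokens + log_tokens
    prompt = first_prompt + second_prompt
    completion = (PER_CI_COMPLETION + code_tokens) + (PER_ASSISTANT + answer_tokens)
    return prompt, completion


# 23 and 16 are the token counts quoted earlier in the thread;
# 20 and 2 are placeholder values, not measurements.
prompt, completion = estimate_single_call_usage(
    user_tokens=20, code_tokens=23, log_tokens=2, answer_tokens=16
)
print(prompt, completion)
```

With 23 code tokens and 16 answer tokens the completion side comes out at 6 + 23 + 2 + 16 = 47, which matches the number reported earlier. The prompt side depends on the user message and log lengths, which are placeholders here.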
Here’s a hacky function that calculates the token usage (I verified it using the usage dashboard on the OpenAI platform but obviously I’m not 100% certain it is correct):
from typing import List, Tuple

# Import paths assume the openai Python SDK v1 beta types; adjust them to your installed version.
from openai.types.beta.threads import ThreadMessage
from openai.types.beta.threads.runs import RunStep


def _get_token_usage(messages: List[ThreadMessage], steps: List[RunStep]) -> Tuple[int, int]:
    # Overheads measured for an assistant with no instructions and only code interpreter enabled.
    base_prompt = 105
    per_user = 4
    per_assistant = 2
    per_code_interpreter = 14  # prompt-side overhead per code interpreter call
    per_code_interpreter_output = 6  # completion-side overhead per code interpreter call

    # Replay messages and run steps in chronological order.
    objects = messages + steps
    objects.sort(key=lambda o: o.created_at)

    total_prompt_tokens = 0
    total_completion_tokens = 0
    prompt_tokens = base_prompt  # size of the context sent with the next completion
    for o in objects:
        # i think each run step can be considered to trigger a completion
        if o.object == "thread.run.step":
            total_prompt_tokens += prompt_tokens
            if o.type == "tool_calls":
                step_details = o.step_details
                for tool_call in step_details.tool_calls:
                    if tool_call.type == "code_interpreter":
                        input_tokens = count_openai_tokens(tool_call.code_interpreter.input, "gpt-4-1106-preview")
                        prompt_tokens += per_code_interpreter + input_tokens
                        # the input to the code interpreter is generated by GPT => count as completion tokens
                        total_completion_tokens += per_code_interpreter_output + input_tokens
                        for output in tool_call.code_interpreter.outputs:
                            if output.type == "logs":
                                # the logs are appended to the context of the next completion
                                output_tokens = count_openai_tokens(output.logs, "gpt-4-1106-preview")
                                prompt_tokens += output_tokens
                            else:
                                raise ValueError(f"Unsupported code interpreter output type: {output.type}")
                    else:
                        raise ValueError(f"Unsupported tool call type: {tool_call.type}")
        if o.object == "thread.message":
            if o.role == "user":
                prompt_tokens += per_user
                prompt_tokens += count_openai_tokens(o.content[0].text.value, "gpt-4-1106-preview")
            elif o.role == "assistant":
                # NOTE: assistant text is not added to prompt_tokens; only tested with a single user message
                total_completion_tokens += per_assistant
                total_completion_tokens += count_openai_tokens(o.content[0].text.value, "gpt-4-1106-preview")
    return total_prompt_tokens, total_completion_tokens
Note that you call this after your run has completed (i.e. run status == completed) with messages and steps obtained after that. count_openai_tokens just uses tiktoken to get the length of the encoding.
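For completeness, here is a minimal sketch of what a helper like count_openai_tokens could look like. The (text, model) signature is inferred from how the helper is called in this thread. The optional encoding argument is an addition for this example only, so the helper can be tried with a stand-in encoder. Whether tiktoken recognizes a given model name depends on the installed tiktoken version.

```python
def count_openai_tokens(text, model, encoding=None):
    """Return the number of tokens in `text` for the given model name."""
    if encoding is None:
        # Imported lazily so that passing a custom `encoding` works
        # even where tiktoken is not installed.
        import tiktoken

        # Look up the tokenizer that tiktoken associates with this model name.
        encoding = tiktoken.encoding_for_model(model)
    # The token count is simply the length of the encoded sequence.
    return len(encoding.encode(text))
```

You would call it the same way as in the function above, e.g. count_openai_tokens(tool_call.code_interpreter.input, "gpt-4-1106-preview").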
One more note: Pretty sure that what really happens is that each log output has its own base token usage. I only expect a single log output so it works but if you have multiple outputs then you would have to adjust the code.