How to count tokens from code interpreter usage?

Hi there,

I want to manually count token usage since the Assistants API doesn’t provide any usage information right now.

So I’m charged $0.03 as soon as the run steps include a tool call to the code interpreter, but how are the input/output tokens calculated?

Am I right with the following assumption? The input to the code interpreter would count as completion tokens, since the code was generated by GPT. Once the code interpreter finishes, I would count both its input and its output as prompt tokens, since they are appended to the conversation context.

user: Write Python code that calculates 2+2 and give me the output.
assistant: The output was “4”

I would now count print(2+2) and “The output was "4"” as completion tokens, and “Write Python code that calculates 2+2 and give me the output.” + print(2+2) (input to the code interpreter) + “4” (output of the code interpreter) as prompt tokens. Are there any additional tokens that I’m missing?
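To make that assumption concrete, here is a small sketch that sorts the example’s strings into the two buckets. The count_tokens function is a stand-in (a naive whitespace split); real counting would use tiktoken, but the classification logic is the point here:

```python
# Hypothetical sketch: classify the example's strings into prompt vs.
# completion token buckets. count_tokens is a stand-in counter.
def count_tokens(text: str) -> int:
    # naive whitespace split; replace with tiktoken for real counts
    return len(text.split())

user_msg = "Write Python code that calculates 2+2 and give me the output."
ci_input = "print(2+2)"                # generated by the model
ci_output = "4"                        # tool output, fed back into context
assistant_msg = 'The output was "4"'

# model-generated text -> completion tokens
completion = count_tokens(ci_input) + count_tokens(assistant_msg)
# everything appended to the context for the next call -> prompt tokens
prompt = count_tokens(user_msg) + count_tokens(ci_input) + count_tokens(ci_output)
```

Note that the code interpreter input appears in both buckets: once as a completion (when generated) and again as a prompt (when the conversation, including the tool call, is re-sent).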

Thanks in advance!

[Screenshot of the usage dashboard for this run, 2024-01-14]

So this consumed 47 completion tokens (using tiktoken I get 23 tokens for the input to the code interpreter and 16 for the answer, i.e. 39, so the leftover 8 are probably some marker for the code interpreter call). HOWEVER, it also consumed 288 prompt tokens, even though I did not provide any instructions. Are we paying for some default system prompt?

Welcome to the forum!

Great question!

I suspect there is some built-in description of the code interpreter tool, so that the LLM knows when to trigger it. It’s similar to any generic tool for which we provide a description.

For ChatGPT, some folks have found the system prompt that describes code interpreter.

When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 60.0 seconds. The drive at ‘/mnt/data’ can be used to save and persist user files. Internet access for this session is disabled. Do not make external web requests or API calls as they will fail.

Good call! I ran some tests and “reverse engineered” the token usage. It’s kind of hacky and I’ve only tested it for my use case: exactly one user message followed by exactly one code interpreter call, with retrieval and function calling disabled and no instructions. I also haven’t tested whether the name of the assistant has an impact; it should be straightforward to take that into account if you need to. It should also work for multiple code interpreter calls and user messages, but I did not test that.


  • The base system prompt for an assistant called “Test1”, with no instructions and only code interpreter enabled, consumes 105 prompt tokens
  • Each user message adds 4 tokens on top of its message content tokens
  • Each assistant message adds 2 tokens on top of its message content tokens
  • Each code interpreter call adds 6 tokens to the completion tokens (the input to the code interpreter itself also counts as completion tokens, since the model generated it). Additionally, each code interpreter call adds 14 tokens to the prompt tokens, on top of the code interpreter’s input and output log (which also become prompt tokens)
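Putting those constants together, here is a sketch of the expected totals for a single user message followed by a single code interpreter call. The token counts for the user message (12) and the log output (1) below are made-up placeholders; the code interpreter input (23) and answer (16) counts come from my measurements above, and the completion side reproduces the observed 47:

```python
# Sketch using the empirically derived constants above.
BASE_PROMPT = 105             # system prompt, code interpreter only
PER_USER_MSG = 4              # overhead per user message
PER_ASSISTANT_MSG = 2         # overhead per assistant message
PER_CI_CALL_PROMPT = 14       # prompt overhead per code interpreter call
PER_CI_CALL_COMPLETION = 6    # completion overhead per code interpreter call

def expected_usage(user_tokens, ci_input_tokens, ci_output_tokens, answer_tokens):
    # 1st completion: system prompt + user message -> generates the CI input
    prompt_1 = BASE_PROMPT + PER_USER_MSG + user_tokens
    # 2nd completion: everything above, plus the CI call and its log output
    prompt_2 = prompt_1 + PER_CI_CALL_PROMPT + ci_input_tokens + ci_output_tokens
    completion = (PER_CI_CALL_COMPLETION + ci_input_tokens
                  + PER_ASSISTANT_MSG + answer_tokens)
    # billed prompt tokens are the sum over both completions
    return prompt_1 + prompt_2, completion

# 23 CI-input tokens and 16 answer tokens give 6 + 23 + 2 + 16 = 47
# completion tokens, matching the leftover-8 observation above.
```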

Here’s a hacky function that calculates the token usage (I verified it against the usage dashboard on the OpenAI platform, but obviously I’m not 100% certain it is correct):

from typing import List, Tuple

# ThreadMessage and RunStep are the openai SDK types returned by the
# Assistants API; count_openai_tokens is defined below.

def _get_token_usage(messages: List[ThreadMessage], steps: List[RunStep]) -> Tuple[int, int]:
    base_prompt = 105                # system prompt, code interpreter only
    per_user = 4                     # overhead per user message
    per_assistant = 2                # overhead per assistant message
    per_code_interpreter = 14        # prompt overhead per code interpreter call
    per_code_interpreter_output = 6  # completion overhead per code interpreter call

    # interleave messages and run steps in chronological order
    objects = messages + steps
    objects.sort(key=lambda o: o.created_at)

    total_prompt_tokens = 0
    total_completion_tokens = 0

    prompt_tokens = base_prompt

    for o in objects:
        # i think each run step can be considered to trigger a completion
        if o.object == "thread.run.step":
            total_prompt_tokens += prompt_tokens

            if o.type == "tool_calls":
                for tool_call in o.step_details.tool_calls:
                    if tool_call.type != "code_interpreter":
                        raise ValueError(f"Unsupported tool call type: {tool_call.type}")

                    input_tokens = count_openai_tokens(tool_call.code_interpreter.input, "gpt-4-1106-preview")
                    prompt_tokens += per_code_interpreter + input_tokens

                    # the input to the code interpreter is generated by GPT => count as completion tokens
                    total_completion_tokens += per_code_interpreter_output + input_tokens

                    for output in tool_call.code_interpreter.outputs:
                        if output.type != "logs":
                            raise ValueError(f"Unsupported code interpreter output type: {output.type}")
                        prompt_tokens += count_openai_tokens(output.logs, "gpt-4-1106-preview")
        elif o.object == "thread.message":
            if o.role == "user":
                prompt_tokens += per_user
                prompt_tokens += count_openai_tokens(o.content[0].text.value, "gpt-4-1106-preview")
            elif o.role == "assistant":
                total_completion_tokens += per_assistant
                total_completion_tokens += count_openai_tokens(o.content[0].text.value, "gpt-4-1106-preview")

    return total_prompt_tokens, total_completion_tokens

Note that you should call this after your run has completed (i.e. run status == completed), with messages and steps fetched after that point. count_openai_tokens just uses tiktoken to get the length of the encoding.

One more note: I’m pretty sure each log output actually has its own base token overhead. I only expect a single log output, so this works for my case, but if you have multiple outputs you would have to adjust the code.

Hope this helps if someone is looking for it!