Hello everybody. I am trying a couple of things with the help of the Assistants API, but there are some small issues. I gave the assistant the instruction “Address the user as ‘Jonny Depp’” while creating it. The prompt I gave it was “Whats your name?” and it responded with “Hello, Jhonny Depp! My name is Assistant. How can I assist you today?”. I checked the usage in the response and it said that it used 47 total tokens. I calculated them myself and it should be about 34 tokens. Why is that?
Assistants can use a lot more tokens than just the input when they have tools enabled.
For this simple input, though, the overhead comes from the chat format that messages are placed into. Each message is wrapped in a container of additional tokens carrying the role, and the AI must also be prompted with extra tokens to produce its response.
Overhead above the content tokens:
7 tokens: first message
4 tokens: additional messages
In some cases you may also be billed for an extra unseen output token or two, possibly the AI signaling to the endpoint backend whether to invoke a tool.
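If you want to sanity-check the billed prompt count yourself, you can combine a tokenizer count of the message contents with the per-message overhead figures above. A rough sketch using the tiktoken package (assuming the cl100k_base encoding; the exact overhead can vary by model):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def estimate_prompt_tokens(messages):
    """messages: list of dicts like {"role": "system", "content": "..."}"""
    total = 0
    for i, msg in enumerate(messages):
        # Overhead figures from above: 7 tokens for the first message,
        # 4 tokens for each additional message.
        overhead = 7 if i == 0 else 4
        total += overhead + len(enc.encode(msg["content"]))
    return total

print(estimate_prompt_tokens([
    {"role": "system", "content": "Address the user as 'Jonny Depp'"},
    {"role": "user", "content": "Whats your name?"},
]))
```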
Ok. Thank you so much. However, I still have two questions about the Assistants API. First: when I provide the instruction to the assistant directly when creating / modifying it, will I be billed for the tokens it uses? Second: when using the file tool / vector storage and the AI takes a chunk of a file, will I be billed for those tokens too? Thanks!
You are billed for what actually goes into and out of a language model at run time.
So: API calls for setting up and changing assistants and files cost nothing (unless you create a vector store over 1 GB, which is billed for storage).
It is when a run is invoked to answer the latest user input that everything compiled as input context is sent to the AI model you’ve chosen, and the response is formed.
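As a minimal sketch with the openai Python SDK (assuming a recent version that exposes the beta Assistants endpoints and `create_and_poll`; the model name is just a placeholder), the setup calls involve no language model, and it is only the run that reports token usage:

```python
from openai import OpenAI

client = OpenAI()

# Setup calls: no language model is invoked here, so no token billing.
assistant = client.beta.assistants.create(
    model="gpt-4o-mini",
    instructions="Address the user as 'Jonny Depp'",
)
thread = client.beta.threads.create(
    messages=[{"role": "user", "content": "Whats your name?"}],
)

# The run is where instructions + thread messages are compiled and sent to the model.
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=assistant.id,
)
print(run.usage)  # prompt_tokens, completion_tokens, total_tokens for this run
```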
The vector storage is presented to the AI as a search tool. The tool itself takes tokens to give the instructions for usage. If the AI makes a search instead of replying to you, that means two AI model calls, with the second call receiving a large amount of retrieved document text as additional input.
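You can see this in the run steps: when the model chose to search before answering, the run contains a tool-call step plus a message-creation step, each reporting its own usage. A sketch, assuming the same client and run objects as above:

```python
# List the steps of the completed run; a file_search run typically shows a
# "tool_calls" step (the search) followed by a "message_creation" step (the reply).
steps = client.beta.threads.runs.steps.list(
    thread_id=thread.id,
    run_id=run.id,
)
for step in steps.data:
    print(step.type, step.usage)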
The chat within a thread also grows, so subsequent calls consume more tokens the longer the AI’s “memory” of chat history extends.
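If that growth becomes a cost problem, one option is to cap how much history is resent on each run, for example with the truncation_strategy parameter (assuming an Assistants API version that supports it):

```python
# Only the last few thread messages are placed into the model's input context.
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=assistant.id,
    truncation_strategy={"type": "last_messages", "last_messages": 5},
)
```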