Unfortunately, while you can retrieve the messages in a thread that are an assistant's final output, and the messages that are user input, and run their text through a token-counting module such as tiktoken, you cannot obtain a true token count.
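For illustration, here is a minimal sketch of that approximate count, assuming the openai Python client and a cl100k_base encoding; the thread ID is hypothetical, and the figure it produces understates the real consumption for the reasons that follow.

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")  # assumption: this encoder matches the model

# Hypothetical thread ID; list only the user/assistant messages that ARE exposed.
# (First page only; a real audit would paginate.)
messages = client.beta.threads.messages.list(thread_id="thread_abc123")

approx_tokens = 0
for message in messages.data:
    for part in message.content:
        if part.type == "text":
            approx_tokens += len(enc.encode(part.text.value))

# A lower bound only: tool calls, tool outputs, assistant instructions,
# and chat-format container overhead are all invisible here.
print(f"Approximate visible tokens: {approx_tokens}")
```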
Internal tool calls and tool responses, which may include large sections of file search documents and code, are NOT exposed to you.
The contents of a thread are NOT encoded as tokens, as they may be sent to models requiring different token encoders with different efficiencies.
The container format that messages are placed in is an encoding specific to the “chat” format of a model.
You also cannot obtain, in advance, the token cost of a submission.
OpenAI has a budgeting mechanism so that the token input capacity of an AI model is not exceeded, but it is neither presented to you nor directly controllable.
Enabling internal tool functions adds large text blocks of instructions to a run; these are specific to the chosen assistant and come on top of that assistant’s other instructions.
Multiple internal tool calls can keep appending to a thread and trigger automatic re-runs, so you pay repeatedly for a growing thread.
The billing records that might reveal the individual cost of each call to a model are obfuscated.
Therefore, Assistants is a platform where billing is unpredictable, and the only control you are offered for limiting the total expense will abort the process after you have already been billed for a partial generation, one that may deliver no output to the user, on contents you cannot audit.
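That control looks roughly like this, as a sketch assuming the v2 run parameters max_prompt_tokens and max_completion_tokens; the IDs are hypothetical, and note that the run simply ends “incomplete” after the spend has happened.

```python
import time
from openai import OpenAI

client = OpenAI()

# Hypothetical IDs; these budget caps are the only expense limiter offered
run = client.beta.threads.runs.create(
    thread_id="thread_abc123",
    assistant_id="asst_abc123",
    max_prompt_tokens=4000,      # cap on input tokens across all internal calls
    max_completion_tokens=500,   # cap on generated tokens across all internal calls
)
while run.status in ("queued", "in_progress"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id="thread_abc123", run_id=run.id)

if run.status == "incomplete":
    # Budget exhausted mid-run: you were billed for everything consumed,
    # but there may be no usable output for the user.
    print("Aborted:", run.incomplete_details.reason)
```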
I hope this clarifies the advantages that using the Assistants endpoint offers you.
Thank you.
It is clear, but it is unbelievable that you cannot know the costs through the API, only on the platform as a monetary cost.
I hope that OpenAI lets us check the costs through the API too in the future; otherwise, it is difficult to use this in an application for end customers.
You can obtain the base token cost of one iteration of an assistant, as configured, by running a thread containing only “reply only ‘hello’” (provided it does not err and invoke tools first instead), as sketched below.
However, the amplification of that cost by a growing conversation, internally repeated model calls, and especially file search results, which may add 10k-16k tokens to a thread per search, is unpredictable and depends on the user input.
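A minimal sketch of that baseline probe, assuming the openai Python client; the assistant ID is hypothetical, and the resulting expense must then be read from the platform’s usage page rather than from the API.

```python
import time
from openai import OpenAI

client = OpenAI()

# One-shot probe: a fresh thread whose only content is a trivial instruction
run = client.beta.threads.create_and_run(
    assistant_id="asst_abc123",  # hypothetical ID
    thread={"messages": [{"role": "user", "content": "reply only 'hello'"}]},
)
while run.status in ("queued", "in_progress"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=run.thread_id, run_id=run.id)

# If no tools were invoked, the cost of this run (read from the usage
# dashboard) approximates the fixed per-iteration overhead of this
# assistant's instructions and enabled tools.
print("Run status:", run.status)
```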