Assistants API token usage - token usage is more than the whole attached file plus prompts

I have yet to understand the token usage of the Assistants API. Here are my statistics:

  1. Added a file that contains 3600 words.
  2. My bot instructions are 17 words.
  3. Runtime instructions are 69 words.
  4. The question asked consists of 8 words.

That is 3694 words in total, which comes to approximately 5346 tokens.
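A quick sanity check of the arithmetic above (the tokens-per-word ratio is simply what these figures imply, not a fixed constant):

```python
# Numbers quoted in this post.
words = 3600 + 17 + 69 + 8    # file + bot instructions + runtime instructions + question
tokens_from_tokenizer = 5346  # full text pasted into the tokenizer
observed_input = 6052         # input tokens billed on the first reply

print(words)                                   # 3694
print(round(tokens_from_tokenizer / words, 2)) # 1.45 tokens per word
print(observed_input - tokens_from_tokenizer)  # 706 input tokens unaccounted for
```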

With the first reply, the token usage shows 6159 tokens, with input tokens alone at 6052.

My question is: even if the entire content is added as context, it should be less than 6000 tokens, so how can there be 6052 input tokens? Can somebody shed some light on this?

Token usage is given in screenshot:
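(For the record, the same numbers can be read from the run object instead of the dashboard. This is a sketch assuming the openai Python SDK's beta Assistants API; the thread/run IDs are placeholders, and the completion figure below is derived from the post: 6159 total minus 6052 input.)

```python
# Sketch: reading token usage from a completed Assistants run.

def summarize_usage(usage: dict) -> str:
    """Format the usage object attached to a finished run."""
    return (f"prompt={usage['prompt_tokens']} "
            f"completion={usage['completion_tokens']} "
            f"total={usage['total_tokens']}")

# With a live client it would look roughly like this (untested sketch):
#   from openai import OpenAI
#   client = OpenAI()
#   run = client.beta.threads.runs.retrieve(thread_id="thread_...", run_id="run_...")
#   print(summarize_usage(run.usage.model_dump()))

# The figures from the screenshot in this thread:
print(summarize_usage({"prompt_tokens": 6052,
                       "completion_tokens": 107,
                       "total_tokens": 6159}))
# prints: prompt=6052 completion=107 total=6159
```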

That seems quite unlikely. Token usage is going to be higher than the word count.

Paste your text:

I added the full document and instructions and it comes to 5346 tokens (even my input token count is greater than this), but that is the full text of the attached files. Is it supposed to use chunks of matched context from the vector search? Or will it input the full content each time?

Based on the results, I have modified the title and thread values in my first post.
Here is the screenshot.

The documentation answers that the Assistants agent framework pays no mind to your budget…

Retrieval currently optimizes for quality by adding all relevant content to the context of model calls. We plan to introduce other retrieval strategies to enable developers to choose a different tradeoff between retrieval quality and model usage cost.

“All relevant content” = all that will fit in the model’s context length.

The assistant and its internal functions for retrieval and other tools have their own language that also consumes tokens.


But the thing is that, even using this much input context, the quality of the output is not that great. I tested the same document against other startups' products, and they provide accurate answers. For example, I have a table with distances between two locations; the table contains 10 rows with From, To, and the distance for each. The assistant gave me a wrong answer, but two other RAG-as-a-service products gave exactly the same answer as in the table. I think we need to wait for this to mature; I am worried about putting it into production.


A little update: with the new GPT-3.5 Turbo 0125 release, input token usage has been significantly reduced.

Opening this again because I am facing a similar issue, with a minor difference. Not only can I not interpret the number of prompt tokens used, but I cannot predict it either.

I am using an agent in retrieval mode, the gpt-4-0125-preview model, and 260 system instruction tokens. I want to understand how prompt tokens are computed, so I ran two experiments, each on the same agent but with different file configurations.

  • Experiment 1. The attached file is 1261 tokens, and the user message is 15 tokens. My calculation is 15+1261+260=1536 expected prompt tokens. I run the thread and the assistant does not quote the file (quote=‘’), but it is clearly using the file to write the response. I inspect the thread and observe a usage of 2821 prompt tokens, so I have +1285 prompt tokens.

  • Experiment 2. The attached file is 2291 tokens and the user message is 11 tokens, so the calculation would be 2291+11+260=2562 tokens. Once again, I get an empty quote and an answer with content from the attached file. But this time the prompt tokens are 3577, which is +1015 tokens.
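The gap in the two experiments can be made explicit with a few lines of arithmetic (the figures are the ones reported above):

```python
# Recomputing the two experiments: the "overhead" is the gap between
# the hand-computed prompt tokens and what the run actually billed.
experiments = [
    # (user message, file, system instructions, observed prompt tokens)
    (15, 1261, 260, 2821),   # experiment 1
    (11, 2291, 260, 3577),   # experiment 2
]
for msg, file_toks, sys_toks, observed in experiments:
    expected = msg + file_toks + sys_toks
    print(f"expected={expected} observed={observed} overhead={observed - expected}")
# expected=1536 observed=2821 overhead=1285
# expected=2562 observed=3577 overhead=1015
```

Note that the overhead differs between runs (1285 vs. 1015 tokens), which is exactly the puzzle here.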

I understand that the difference between my computation and the observed prompt token usage is due to the fact that

The assistant and its internal functions for retrieval and other tools have their own language that also consumes tokens.

What I do not understand is why this is not a fixed amount. Any ideas?
Also, why does it not quote parts of the file in the FileCitation when it is clearly injecting parts of the file into the answer?


Assistants has a retrieval tool called myfiles_browser that allows the AI to search knowledge. It is not disabled even when the full file has already been injected anyway (regardless of relevance), and the AI doesn’t know what is behind that search function. So you’ll still get an AI that runs a search, which is the first model call you have to pay for; then more search results are loaded into the context length, and the model is called again. If you’re lucky, it answers you then, instead of continuing to run other functions inappropriately (like emitting code to Python for no reason…). All misbehavior from the poor implementation is on your dime…
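That loop can be sketched as a toy model (an assumption about the mechanics, not the actual billing code): every model call re-sends the whole context so far, and every search appends its results to that context before the next call.

```python
def billed_prompt_tokens(base_context: int, chunks_per_search: list[int]) -> int:
    """Toy model: sum the context sent on each model call in a tool loop."""
    total, context = 0, base_context
    for chunk in chunks_per_search:
        total += context     # the model call that decides to search
        context += chunk     # search results get appended to the context
    total += context         # the final model call that writes the answer
    return total

# One search that pulls in an 800-token chunk on top of a 1536-token context:
print(billed_prompt_tokens(1536, [800]))   # 1536 + 2336 = 3872
# No search at all bills the base context once:
print(billed_prompt_tokens(1536, []))      # 1536
```

Because the number of searches and the size of the retrieved chunks vary per thread, the overhead varies too.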

Thank you!

Just to check that I got it right: besides injecting the file into the prompt, a search function that consumes internally produced tokens might be called, retrieving additional chunks of the file. It will run until the AI believes it knows the answer. Because some threads will trigger the search function more than others, the amount of ‘internal’ prompt tokens will also depend on the thread.

Is this what you mean? Thanks a lot!!

PS: where did you learn about this myfiles_browser function? I am trying to get my head around how this actually works.

One has to exploit the assistant a bit to discover how it works (which is: naively). I may have made a dump or two of the internal tools here on the forum, if one does a search…

More so, there’s nothing telling the AI “this search will give you dog grooming products”; the AI has to find out for itself that a search is as pointless as the tokens of the first document that is placed as a distraction from quality.