How can I calculate the context tokens in the Assistants API?

Currently, there’s no output of the context tokens used for a given message or thread. So how can I calculate the context tokens used in a thread?

I have seen some people here running trials where they calculate the context based on the output and input tokens, but that has proven unreliable, as there are additions between messages that we just can’t explain.

Without being able to properly calculate the context tokens, we can’t effectively predict the cost of usage for a specific user, which means we’re uncertain about our pricing model.

Any ideas are very welcome and helpful! Ty

You are correct: you cannot.

The Assistants backend operates autonomously: it injects its own functions and prompts, retrieves and injects whatever files it wants, iterates as it wants, and keeps hidden function conversations as it wants, all never revealed to you.

Read through all the different methods and convoluted ways you have to interact with Assistants anyway, and you will find that a much more efficient use of your coding time is developing your own conversation management techniques, your own controlled vector-database RAG, and your own Jupyter sandbox virtual environments. Then you can actually count your own tokens instead of them being purposefully hidden.
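For the self-managed route, a minimal sketch of conversation management with a token budget might look like the following. The 4-characters-per-token figure is a crude English-text heuristic, not an exact count; use a real tokenizer (e.g. tiktoken) when accuracy matters.

```python
# Rough sketch of self-managed context control for a chat-style
# message list. approx_tokens() uses a ~4-chars-per-token heuristic,
# which is only a ballpark for English text.

def approx_tokens(text):
    """Very rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def trim_history(messages, budget_tokens):
    """Keep the system message plus as many of the most recent
    messages as fit within budget_tokens (by the rough estimate)."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept = []
    used = sum(approx_tokens(m["content"]) for m in system)
    for msg in reversed(rest):  # walk newest-first
        cost = approx_tokens(msg["content"])
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))
```

Because you build the message list yourself before every call, the context size is fully under your control, unlike with Assistants.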

Yep, it’s a shame that OpenAI isn’t clearer about this.

BUT here’s the problem: for my use case, the Assistants API is WAY MORE efficient than Chat Completions or any other alternative I’ve tried so far.

BUT that being said, GPT-4 is just WAY too expensive for me to put into prod without knowing the up-front costs, so we’re now trialing gpt-3.5-turbo-1106 (the decision-making quality of the model dropped a lot), but for us, we might be able to offer it in different tiers to our customers.

And in terms of context tokens, we’re running trials to identify a middle ground, an average we can use efficiently to “predict” the cost.

So far, we’ve seen a ratio of roughly 10k context tokens per 500 output tokens, so we’re going to use an average like this to charge our customers for a premium feature.
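That ratio can be turned into a back-of-the-envelope per-run cost estimate. The per-1K-token prices below are placeholders, not actual OpenAI rates; substitute the current published pricing for whatever model you deploy.

```python
# Rough per-run cost estimate from the ~10k-in / 500-out ratio
# observed above. price_in_per_1k / price_out_per_1k are PLACEHOLDER
# rates; plug in the real published prices for your model.

def estimated_run_cost(prompt_tokens=10_000, completion_tokens=500,
                       price_in_per_1k=0.001, price_out_per_1k=0.002):
    """Dollar cost of one run at the given per-1K-token rates."""
    return (prompt_tokens / 1000) * price_in_per_1k \
         + (completion_tokens / 1000) * price_out_per_1k
```

Multiplying this per-run figure by the expected runs per user per month gives a ceiling you can price a tier against.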

Not much else we can do, to be honest.

Actually, it seems each Run in a Thread returns the number of tokens it used in its usage field at the end:

"usage": {
    "prompt_tokens": 123,
    "completion_tokens": 456,
    "total_tokens": 579

So by storing the cumulative Run token usage per Thread in a database, you can get some control over the cost of Assistant use (re-creating a Thread when its context tokens reach a threshold, disabling the feature, etc.).
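A minimal sketch of that accounting, assuming each completed Run exposes a usage dict shaped like the response quoted above (in production the dict would be a database table keyed by thread ID):

```python
# Per-thread usage accounting from completed Runs' usage dicts
# (prompt_tokens / completion_tokens / total_tokens).

from collections import defaultdict

class ThreadUsageTracker:
    def __init__(self, token_threshold=100_000):
        # A dict stands in for a database table keyed by thread_id.
        self.totals = defaultdict(
            lambda: {"prompt_tokens": 0, "completion_tokens": 0,
                     "total_tokens": 0}
        )
        self.token_threshold = token_threshold

    def record_run(self, thread_id, usage):
        """Add one Run's usage dict to the thread's running totals."""
        for key in ("prompt_tokens", "completion_tokens", "total_tokens"):
            self.totals[thread_id][key] += usage.get(key, 0)

    def should_recreate(self, thread_id):
        """True once accumulated usage crosses the threshold."""
        return self.totals[thread_id]["total_tokens"] >= self.token_threshold
```

After each Run completes, call record_run() with its usage object, then check should_recreate() before starting the next Run on that thread.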

However, we still cannot control the cost before the Run, or tell the Assistant not to process once a threshold is reached.

I’ll try re-creating the Thread with a summarized thread history to reduce context tokens. I’m not so sure which tokens count as context tokens between prompt_tokens and completion_tokens, though.
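One hypothetical way to sketch that re-creation step: collapse the older turns into a single summary message and carry over only the most recent ones into the new Thread. Everything here is an assumption, not the Assistants API itself; summarize() is a placeholder for an actual summarization call (e.g. a cheap model run).

```python
# Hypothetical "re-create with a summary" helper. summarize() is a
# stub standing in for a real summarization call; the returned list
# would seed the new Thread's messages.

def summarize(messages):
    """Placeholder: in practice, call a model to summarize these turns."""
    return "Summary of %d earlier messages." % len(messages)

def build_fresh_thread_messages(messages, keep_last=4):
    """Return messages to seed a new Thread: one summary of the older
    turns, followed by the keep_last most recent turns verbatim."""
    old, recent = messages[:-keep_last], messages[-keep_last:]
    seeded = []
    if old:
        seeded.append({"role": "user",
                       "content": "Context so far: " + summarize(old)})
    return seeded + recent
```

Since the summary replaces most of the accumulated history, each subsequent Run's prompt_tokens should drop back toward the size of the summary plus the recent turns.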