Assistant API - What are Context Tokens in the Billing calculation?

I noticed on my billing page, under “Usage” > “Activity”, that today for example I have 4104 Generated Tokens + 55572 Context Tokens, and 59676 Tokens billed in total! What are Context Tokens? And how can I control them?


Hi and welcome to the Developer Forum!

Are you using Assistants? That seems to be the tokens used to generate context from your uploaded data.

1 Like

Thank you Foxabilo! Yes, I am using Assistants now, but this kind of calculation also happened while I was using the Playground and Chat! So it seems not to be related to Assistants exclusively, because while using Chat there is no uploaded data!

Sure, but if you are using a combination of Assistants and normal chat, you will see them all bunched together unless there are days separating them.

1 Like

“Context” is OpenAI’s new language for “prompt” or input. It’s what is loaded into the AI model before it generates a language output, and maybe not an output to you at first.

Likely meant to blur the line with the pricing page’s “input” and “output”, so you don’t directly observe what happens when they load the Assistants model with a maximum of “context” from chat “threads” and “knowledge retrieval”, and let the AI go wild with piecemeal browsing of documents, writing code, and iterating on errors with all that context. You get a bill.

Once the size of the Messages exceeds the context window of the model, the Thread will attempt to include as many messages as possible that fit in the context window and drop the oldest messages.

Retrieval currently optimizes for quality by adding all relevant content to the context of model calls.

It’s not going to exceed the context window of GPT-4-turbo, but at the max_tokens setting a single maximally packed call can run about 124k input tokens, i.e. roughly $1.24 per call at $0.01 per 1k tokens.
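The arithmetic behind that per-call ceiling can be sketched as below. The figures are the ones quoted in this thread (a $0.01 per 1k input-token price and a ~124k usable input budget) and are assumptions that may be out of date:

```python
# Back-of-envelope ceiling on one maximally packed GPT-4-turbo call.
INPUT_PRICE_PER_1K = 0.01    # USD per 1k input tokens (assumed figure)
MAX_INPUT_TOKENS = 124_000   # assumed usable input: 128k window minus output budget

def max_input_cost(tokens=MAX_INPUT_TOKENS, price_per_1k=INPUT_PRICE_PER_1K):
    """Worst-case input ('context') cost of a single call, in USD."""
    return tokens / 1000 * price_per_1k

print(round(max_input_cost(), 2))  # roughly 1.24 USD per fully packed call
```

So even without retrieval going wrong, a handful of maximally packed runs per day adds up quickly.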

1 Like

Hello, we are having the same problem. Is there any workaround to control costs?

Seems not, as of now. What is utterly insane is that there is no token feedback at all when using the Assistants API. If there were something like the “usage” key in Chat Completions, integrating custom token limits for any model would be a piece of cake.

1 Like

So true. The pro is less engineering, but the cons outweigh the pros on pricing. We just finished implementing a product finder, and we noticed the staggering number of context tokens during testing. It is pretty wild and not transparent at all.

We have downgraded the model to 3.5 for its narrower context window, but now retrieval becomes pretty useless.


I’m sticking with third-party providers for retrieval as of now. A lot more flexible, and cheaper if you host them locally. Good luck with your project :grinning:

Oh nice, thank you! Are there any particular third parties you would recommend we explore?

Probably Weaviate. Although there is a learning curve, they support rerankers, keyword search, hybrid search, etc. What you do is define a function tool named something like “search_products” with an appropriate schema, and then pass the arguments given by the LLM on to a vector store search function. Although it requires quite a bit more work, a local vector store gives you a plethora of techniques and options.
For example, since you’re implementing a product finder, a local vector store search function could allow the model to specify:

  • Date: if you have a release date or similar for each product.
  • Categories: if your products are already sorted into different categories, allowing the model to choose which one to search could help a lot.
  • In stock: simply allow the model to filter out items not in stock at the moment.
  • Price: filter on price based on any potential user request.

Just to name a few. But as I said, it will require more work! :grin:
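The structured-filter idea above can be sketched like this. The tool name, schema fields, and catalog shape are all illustrative assumptions, and the filter function stands in for a real vector-store search (Weaviate or similar) that would also rank by the `query` text:

```python
# Hypothetical "search_products" function tool schema, as you would register it
# with the model, plus a local stand-in for the vector store search it drives.
SEARCH_PRODUCTS_TOOL = {
    "type": "function",
    "function": {
        "name": "search_products",
        "description": "Search the product catalog with optional filters.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "category": {"type": "string"},
                "in_stock": {"type": "boolean"},
                "max_price": {"type": "number"},
            },
            "required": ["query"],
        },
    },
}

def search_products(catalog, query, category=None, in_stock=None, max_price=None):
    """Apply the structured filters the model chose. A real implementation
    would also do a vector similarity search on `query`."""
    hits = []
    for p in catalog:
        if category is not None and p["category"] != category:
            continue
        if in_stock is not None and p["in_stock"] != in_stock:
            continue
        if max_price is not None and p["price"] > max_price:
            continue
        hits.append(p)
    return hits

catalog = [
    {"name": "Trail Shoe", "category": "footwear", "in_stock": True, "price": 89.0},
    {"name": "Rain Jacket", "category": "outerwear", "in_stock": False, "price": 120.0},
]
print(search_products(catalog, "shoes", category="footwear", max_price=100))
```

The point of the design is that the model only emits small structured arguments; the heavy context (the catalog) never enters the prompt, which is exactly what keeps context tokens down.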

1 Like

that seems to be the tokens used to generate context from your uploaded data.

Are those tokens only billed once, when generating the context of the uploaded files? Or are they related to the messages/runs?

At the moment, an assistant can make multiple calls to retrieve data for a single response, I am unsure if that includes processing the same data more than once. I think potentially if two sequential requests make retrievals, it may happen.

I just did some testing on this and am still quite confused about where the context tokens are coming from.


  • yes, I am using an Assistant
  • no, there are no files
  • no tools are enabled (such as retrieval or code interpreter)
  • no, I am not re-creating and/or changing the Assistant itself between requests
  • no, I am not creating new Threads between requests - there is just one Thread and I am using that one; the rest I have deleted

I did a few (about 4) requests and my context tokens got bloated by thousands of tokens!

I started with approx. 2000 tokens and ended up with 15000! The Generated Tokens are totally sensible: if I use tiktoken to calculate the tokens from the input and output messages, that is exactly what is added to Generated Tokens. But every request adds about 3000 tokens to the Context Tokens!

How I do it:

  1. create an Assistant
  2. create a Thread
  3. add message
  4. execute run on the thread
  5. repeat 3 and 4

It looks to me like every time I execute a run, it feeds the whole history into the model. Well, I thought that is why I have the Thread and leave it open.


  • Am I missing something?
  • Is it expected behaviour?
  • How can it be controlled / eliminated?

Anyone with answers? Thanks in advance!


Yes, this is how any LLM works. You need to feed the previous messages as context.
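Because every run re-sends the whole thread history as input, the billed context tokens grow roughly quadratically with the number of turns. A rough model of that growth (the per-turn sizes are illustrative assumptions, not actual API accounting):

```python
# Each run re-feeds the entire thread history as input ("context") tokens,
# so total context tokens scale with the running sum of the history size.
def cumulative_context_tokens(turn_sizes):
    """turn_sizes[i] = tokens added to the thread at turn i (user msg + reply).
    Returns the total context tokens billed across all runs."""
    total, history = 0, 0
    for size in turn_sizes:
        history += size   # the thread now contains this turn too
        total += history  # run i re-sends the whole history so far
    return total

# Four turns of ~750 tokens each: later runs re-send 2x, 3x, 4x the first
# run's context, which matches the "every request adds ~3000" observation.
print(cumulative_context_tokens([750, 750, 750, 750]))
```

This is why a thread that only holds ~3,000 tokens of conversation can still rack up five-figure context-token totals after a handful of runs.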

Without basically destroying the thread and re-creating it with an injected user or system message, you can’t. It’s madness.
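The destroy-and-recreate workaround can be sketched as below. This is only the truncation step; the kept messages would then be replayed into a freshly created thread. The function name and message shape are assumptions for illustration:

```python
# Keep the first (e.g. system/instruction) message plus the most recent turns,
# dropping everything in between before replaying into a new thread.
def truncate_history(messages, keep_last=4):
    """Return a shortened copy of the message list: first message + last N."""
    if len(messages) <= keep_last + 1:
        return list(messages)
    return [messages[0]] + messages[-keep_last:]

msgs = [{"role": "system", "content": "You are helpful."}] + [
    {"role": "user", "content": f"msg {i}"} for i in range(10)
]
kept = truncate_history(msgs, keep_last=3)
print(len(kept))  # 4 messages survive out of 11
```

Crude, but it caps the history that gets re-sent on every run, which is the only lever over context tokens this thread has found.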

The good news(ish) is that they are “looking into this”.

I really just wish they would give us more control over modifying the thread. Such as performing truncation ourselves and adding assistant messages. I seriously cannot believe we cannot add Assistant messages.

I’m really trying to hold onto using Assistants here, but it’s been a failure so far. Still can’t even use Vision, Dall-E, or voice conversations, which is what I was hoping for.


Yep … as Assistants are still in beta, there is hope. :hand_with_index_finger_and_thumb_crossed:

1 Like

I’m also in a similar situation: I’m using the Assistants API for a product/service I’m building. I added some dynamic data handling as well, and it is not good! My context tokens are way higher.
This is just madness! :sweat_smile:

1 Like

It has been more than one month since I started this thread and contacted support. Support only has a nonsense explanation for these mysterious “Context Tokens”; I think the guy did not understand it himself and was only trying hopelessly to justify it! What I really think now is that OpenAI does not give a sh** about their customers, since everybody just wants to use ChatGPT! They make billions and do not even have a customer service phone number or email! Their Assistants API is useless for production purposes since the token cost is not predictable! I hope some other, more decent company might come up with better developer solutions! Come on Elon Musk, do something about this … let everybody leave OpenAI and come to your “X-GPT”!!!


I confirm the observations and frustrations above:

  • I started using the Assistants API in early December, first without code interpreter or retrieval. I had costs of <$2 a day and didn’t ask myself any questions.
  • Mid-December I started using retrieval and added 1 PDF file of 10 pages/2000 words. Since then, my costs have ranged from $2-20/day.

The average cost/day clearly went up since I started using ‘retrieval’, but I see no direct link between the daily cost and how much I use the Assistant. It feels random. I had a peak usage of $20 yesterday (which is why I suddenly started looking into this), but I have no clue why.

When I look into ‘my activity’ the high costs are all related to these mysterious ‘context tokens’.

Releasing something in beta while still having some issues is fine, but benefiting from this by charging testers unexplainably high amounts compared to normal costs, without any warning/documentation/communication, that is a scam.

ADDITION 2 days later: I contacted the Openai support and got extra credits as compensation but no further explanation.

For the Assistants API:

  • How many files can we upload? If there is a limit, how can we delete unused files, as there is no API to do that?
  • How many assistants can we create, and how will the price be calculated? Is it per assistant (and if so, will the number of files attached to an assistant affect the cost?), or based on the size of the data we have on the OpenAI servers?
  • How is the price mentioned on your pricing page calculated?
    Code interpreter: $0.03 / session
    Retrieval: $0.20 / GB / assistant / day (free until 02/01/2024)
  • At one point you were saying that “Messages also have the same file size and token limits as Assistants (512 MB file size limit and 2,000,000 token limit)” (screenshots attached), while your model description says GPT-4 Turbo will only store 128k tokens. Which one is correct?
  • How can we reduce the cost?