High token consumption due to large instructions in the Assistants API

Hi all,

I’m starting to work with the Assistants API.

I want to create an assistant that acts as an annotator for a specific dataset. I have a document of around 3.5K words (~5K tokens) containing very specific instructions on how to annotate the dataset.

I create the assistant using the ‘instructions’ field with a plain-text version of those guidelines. Then I want to run the assistant N times, sending only the specific document it must annotate (around ~60 tokens on average).

The thing is that the first call to the assistant on a thread consumes the tokens for the guidelines, which I understand. But on every subsequent call, the API usage metadata shows all previous messages accumulating. This means every request consumes as input tokens the guidelines plus ALL previous reviews.

What is the correct approach here? How can these instructions be consumed only once, so that each new request only spends tokens on the new document?

Here are the token usage results after processing 3 documents:

Processing review 1/1112
Review 1 processed (Time: 7.31s, Total Tokens: 5406, Prompt Tokens: 5326, Completion Tokens: 80)
Processing review 2/1112
Review 2 processed (Time: 7.00s, Total Tokens: 5556, Prompt Tokens: 5476, Completion Tokens: 80)
Processing review 3/1112
Review 3 processed (Time: 9.63s, Total Tokens: 5744, Prompt Tokens: 5664, Completion Tokens: 80)

As you can see, Prompt Tokens accumulates across requests, making the consumption too high.
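The growth in the numbers above is consistent with the full thread history being re-sent on every run. Here is a rough back-of-the-envelope sketch; the token sizes are assumptions loosely based on the figures in this thread, not exact API accounting:

```python
# Rough model of prompt-token growth on a single Assistants API thread.
# Assumed sizes (approximations, not exact API accounting):
GUIDELINES = 5000   # instructions, re-sent on every run
DOC = 60            # one review to annotate
REPLY = 80          # completion tokens produced per run

def prompt_tokens(run_index):
    """Approximate prompt tokens for run `run_index` (1-based) on one shared thread.

    Every run re-sends the instructions plus the full message history:
    all previously submitted docs and all previous assistant replies.
    """
    history = (run_index - 1) * (DOC + REPLY)
    return GUIDELINES + DOC + history

print([prompt_tokens(i) for i in range(1, 4)])  # grows by DOC + REPLY each run
```

The real deltas (150, then 188 tokens) vary because reviews and message overhead differ in length, but the pattern is the same: each run pays for everything that came before it.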

Welcome to the community @joaquim.motger

You can avoid accumulating costs by creating a separate thread for each of the 60-token docs, provided the requests are unrelated to each other.
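To see why separate threads help, compare the two strategies with a rough token model (the sizes below are assumptions based on the figures in this thread, not exact API accounting):

```python
# Assumed sizes, loosely matching the numbers reported above:
GUIDELINES = 5000  # instructions, included in every run under either strategy
DOC = 60           # one review to annotate
REPLY = 80         # completion tokens per run

def total_prompt_tokens_one_thread(n_docs):
    # Run i re-sends the instructions plus all (i - 1) earlier doc/reply pairs,
    # so the total cost grows quadratically with the number of docs.
    return sum(GUIDELINES + DOC + (i - 1) * (DOC + REPLY)
               for i in range(1, n_docs + 1))

def total_prompt_tokens_fresh_threads(n_docs):
    # A new thread per doc: every run is just instructions + the new doc,
    # so the total cost grows linearly with the number of docs.
    return n_docs * (GUIDELINES + DOC)

print(total_prompt_tokens_fresh_threads(1112))  # linear in n_docs
print(total_prompt_tokens_one_thread(1112))     # quadratic in n_docs
```

Note that the guidelines are still paid for on every run either way; fresh threads only remove the accumulating message history.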


Hi, thank you for your response! The issue here is that the guidelines are consumed on each request. If I understood correctly, these are system prompts, and therefore they should not count toward token usage on each request. But they do.

How can I make the assistant aware of the guidelines without them being consumed on each request?

That’s exactly how instructions are supposed to work. They are included for every run.

Hint: consider using the gpt-4o-mini model. It works great and is a tiny fraction of the cost of other models for input tokens.