Token Optimization for Assistants API - Excessive token count


I am currently using the OpenAI API to send messages to assistants and have encountered an issue with token usage. Each call I make seems to consume approximately 4000 tokens, which is perplexing considering my messages are about 50 tokens and the responses are typically around 200 tokens. I have lengthy instructions set up for the assistant, and I am wondering if this is impacting the token consumption.

To give you a better understanding, here’s the process I follow with my code:

  1. Check for an existing thread ID; if none is present, create a new thread.
  2. Determine the assistant ID from a static ID in my code.
  3. Generate the input, around 200-400 tokens.
  4. Add a message to the thread.
  5. Execute the thread (Run command).
  6. Check the status of the run.
  7. Retrieve the steps of the run.
  8. Finally, when the run status is “completed”, retrieve the thread messages to view the response, ensuring to handle cases where the response might not be immediately available or where there’s an error.
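Steps 5-8 above can be sketched as a small polling helper. This is only an illustration: `fetch_status` is a stand-in for whatever retrieval call your SDK uses (e.g. something like `client.beta.threads.runs.retrieve(...)` in the official Python library), so the helper itself stays testable without an API key:

```python
import time

def wait_for_run(fetch_status, poll_interval=1.0, timeout=120.0):
    """Poll until the run reaches a terminal status.

    fetch_status: a zero-argument callable returning the run's current
    status string, e.g.
        lambda: client.beta.threads.runs.retrieve(
            thread_id=thread_id, run_id=run_id).status
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        # Terminal statuses: stop polling and let the caller decide
        # whether to fetch messages or report an error.
        if status in ("completed", "failed", "cancelled", "expired"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("run did not reach a terminal status in time")
```

With this in place, step 8 only fetches the thread messages once `wait_for_run(...)` returns `"completed"`; any other terminal status goes to the error-handling branch.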

My specific question is: How can I optimize token usage when interacting with OpenAI’s API, especially considering the length of my instructions? I tried using fewer tokens in the instructions; I don’t get the same quality of responses, but fewer tokens are billed. Is there a way to prevent these long instructions from increasing the token cost per call? I would like to understand whether the instructions are billed on every call, contributing to the high token usage, and how to avoid being billed for them on every call.
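To make the arithmetic concrete, here is a back-of-the-envelope accounting under the assumption that the instructions and the prior thread history are sent as context, and therefore billed as prompt tokens, on every run; all figures are illustrative, not measured:

```python
# Illustrative per-run token accounting, assuming the assistant's
# instructions and the earlier thread messages are billed as prompt
# tokens on every run (all numbers hypothetical):
instructions = 3500   # long assistant instructions, resent each run
history     = 250     # earlier thread messages replayed as context
new_message = 50      # the new user message
completion  = 200     # the assistant's response
per_run = instructions + history + new_message + completion
print(per_run)  # 4000
```

Under that assumption, ~50 tokens of message and ~200 of response can easily add up to ~4000 billed tokens per call once the instructions and history are counted.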

Any advice or shared experiences, particularly from those who have dealt with similar situations or have in-depth knowledge of OpenAI’s token billing system, would be greatly appreciated.

Thank you in advance for your assistance.


Hi! Welcome to the forums!

Your instructions absolutely eat your tokens. Retrievals eat your tokens, a continuation of the thread eats your tokens. Actions eat tokens. Everything eats tokens. Om nom nom.

It’s my personal opinion that if you want to be cost conscious, assistants aren’t the best option out there.

Here’s a good post that summarizes it well:


With v2 of the API some of this has improved: we now have prompt and completion token counts at the run and step levels.
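For example, v2 run objects expose a `usage` field, so a small helper can total prompt and completion tokens across your runs. Sketched here against plain dicts, since the exact SDK object shape may differ from this:

```python
def total_usage(runs):
    """Sum prompt/completion token counts over runs' `usage` fields.

    Each run is treated as a dict with an optional "usage" sub-dict,
    mirroring the shape of the v2 run object's usage data.
    """
    totals = {"prompt_tokens": 0, "completion_tokens": 0}
    for run in runs:
        usage = run.get("usage") or {}  # usage may be absent or None
        totals["prompt_tokens"] += usage.get("prompt_tokens", 0)
        totals["completion_tokens"] += usage.get("completion_tokens", 0)
    return totals
```

Totals like this make it obvious when prompt tokens dwarf completion tokens, which is the signature of long instructions being rebilled on every run.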

Additionally, we can set maximum prompt and completion tokens per run. Not a perfect solution: it doesn’t auto-truncate to fit, the run just stops when it hits the limit.
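As a sketch, the relevant v2 run-creation parameters look roughly like this (parameter names as documented for Assistants v2; the assistant ID is a placeholder, and the specific numbers are arbitrary):

```python
# Hypothetical kwargs for client.beta.threads.runs.create(...) under
# Assistants API v2. The max_* values are hard caps, not smart
# compression: exceeding the prompt budget ends the run incomplete.
# truncation_strategy separately limits how much thread history is
# replayed into the prompt on each run.
run_kwargs = {
    "assistant_id": "asst_placeholder",  # placeholder, not a real ID
    "max_prompt_tokens": 2000,           # cap on billed input tokens
    "max_completion_tokens": 500,        # cap on billed output tokens
    "truncation_strategy": {             # replay only the last N messages
        "type": "last_messages",
        "last_messages": 4,
    },
}
```

Capping history replay with a truncation strategy is the closest built-in lever for keeping per-run prompt costs from growing with thread length.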

The recommendation for a production application would still be a custom RAG system, but the Assistants API is slowly getting there. It’ll probably be a good choice once it comes out of beta.