Assistants 2.0 Token Usage - Usage is too high

Hey everyone! I’m using the new beta version with Assistants 2.0 and I want to count the number of tokens my assistant is consuming for pricing reasons. I have a couple of issues/questions.

  1. When I create a NEW thread and ask the assistant a simple question like “How much is 2+2?”, and the answer is short like “The answer is 4”, run.usage reports at least 350 tokens when I’d expect around 30. I’m thinking it’s probably because of the assistant instructions, the new thread, and the completion + prompt, but I’m not sure if this is why.

  2. When I reuse the same thread and keep asking simple things like “How much is 2+2?” or “How much is 2-1?”, the token count grows very quickly. I will attach an image comparing the consumed tokens with the expected tokens per prompt, calculated with the Tokenizer.

Are tokens cumulative within the same thread? If they are, should I only count run.usage at the end of my assistant process to calculate costs? If someone knows why this is happening, I’d be grateful to hear it.
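
For reference, here is roughly how I am creating the run and reading the usage (a minimal sketch with the `openai` Python SDK v1.x Assistants beta helpers; the assistant ID is a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ASSISTANT_ID = "asst_..."  # placeholder for my assistant

# New thread with a single, tiny question
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="How much is 2+2?"
)

# create_and_poll waits until the run finishes, then usage is populated
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=ASSISTANT_ID
)
print(run.usage)  # prompt_tokens, completion_tokens, total_tokens
```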

Thank you so much in advance.

The assistant’s instruction prompt costs input tokens on every run created with that assistant.

The thread’s chat history accumulates and is resent on each run until it hits the context window limit (Assistants v1) or your truncation parameter (v2 gives you some control), so the longer the chat history, the more tokens a single run uses.

This is a long-standing cost issue with the Assistants API; if you want better token usage, you may need to work with Chat Completions yourself.
As far as I know, the Assistants API is just an abstraction wrapper around Chat Completions, and its token usage is not optimal at the current stage.
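
If you want to sanity-check why one run costs what it does, you can roughly estimate the resent input yourself with tiktoken (a sketch; it ignores per-message formatting overhead and any tool/file_search tokens, so treat it as a lower bound):

```python
import tiktoken

# cl100k_base covers gpt-4/gpt-3.5; use o200k_base for 4o-class models
enc = tiktoken.get_encoding("cl100k_base")

instructions = "You are a helpful assistant..."  # your assistant's instructions
history = [
    "How much is 2+2?", "The answer is 4.",
    "How much is 2-1?",  # the newest user message
]

# Each run resends the instructions plus the whole thread history so far
estimated_input = len(enc.encode(instructions)) + sum(
    len(enc.encode(m)) for m in history
)
print(estimated_input)
```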

Hi @pondin6666! Thank you for replying. I would like to know your opinion on my situation: I want GPT to remember all the previous questions/things that I asked it.

I started first with Chat Completions, but if I want GPT to remember everything, then I have to send everything again every time, and the token cost and processing time of the application will grow exponentially. I’ll attach an example:

So at a huge scale, where the questions and answers I exchange with GPT run to 30k tokens, would Chat Completions still be the best option? Wouldn’t the assistant charge me only 50, 50, 50 (150 tokens instead of 300 in the example) because the thread already remembers the history? I just want to make sure which one is the best option for my situation.

Thank you in advance, once again :slight_smile:

The models don’t “remember” anything. Each request is handled by a fresh model; the facade of “memory” comes from resending the history of the conversation.

The Assistants framework does a lot of heavy lifting for you, including managing the state of the conversation (as a Thread). They still follow the same practices as using ChatCompletions.

So in your case the input would be:

  1. 50 Input + Output
  2. 100 + Output 1 + Output
  3. 150 + [Output1, Output2] + Output

Plus the system message on every request.
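
As a rough worked example of that growth (illustrative numbers only; real counts also include per-message formatting overhead):

```python
system = 300              # assistant instructions, resent on every run
questions = [50, 50, 50]  # tokens per user question
answers = [20, 20, 20]    # tokens per model answer

history = 0
for q, a in zip(questions, answers):
    run_input = system + history + q  # everything so far is resent
    print(run_input)                  # 350, 420, 490 -> grows every turn
    history += q + a                  # the new exchange joins the history
```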

To retain context you MUST send ALL of it each time. There is no way to bypass this; it is a fundamental requirement of LLMs, and using Assistants does not change it. You can try to reduce the size by summarizing the content, but the complete context still has to be sent on each request.
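
In ChatCompletions terms, that means keeping one messages list, appending every exchange, and resending the whole thing on each call (a minimal sketch with the `openai` Python SDK; the model name is only an example):

```python
from openai import OpenAI

client = OpenAI()
messages = [{"role": "system", "content": "You check texts for congruence."}]

def ask(question: str) -> str:
    messages.append({"role": "user", "content": question})
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    answer = resp.choices[0].message.content
    # keep the reply so the next call still carries the full context
    messages.append({"role": "assistant", "content": answer})
    return answer

ask("How much is 2+2?")
ask("How much is 2-1?")  # resends the system prompt and the first exchange too
```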

So, do you need to send the full questionnaire each time? Is it important that the model knows the answers to questions #1 and #2 in order to ask #3?

Out of curiosity, why are you using an LLM to act like a form?

Using a RAG with Assistant v1.0

### Thread

| Field | Value |
| --- | --- |
| Thread ID | thread_D2nAxqHRzfl1JW3wXTGHhspY |
| Created | 4/30/2024, 6:23 PM |
| Length | 16 messages |
| Tokens | 49889 total · 49232 in, 657 out |
| File search stores | None |
| Code interpreter files | None |
| Metadata | None |

Hey @RonaldGRuckus Thanks for replying! Sorry if I didn’t explain it well: I’m not doing a questionnaire, I was just using those questions as an example.

What I’m actually doing is asking GPT to analyze huge amounts of text and check whether these texts are congruent with each other. I also want to store the contents and GPT’s answers. GPT’s answers are 4,000 tokens MAX, so since my content is larger, I split the content so that GPT can return everything I want.

I’m looking for the least expensive option between Completions and Assistants. For example, the instructions that I send in the system message with Completions: I believe these instructions don’t have to be resent with each message to the assistant, but I’m not sure whether that’s the case or whether the instructions are actually being sent every time and consuming extra tokens, along with other background processes.

So what I just want to know is: in your opinion, would it be cheaper to use Completions or Assistants? In other words, is it cheaper in the end to process all the information and “remember” it in threads, or just to send it again with Completions?

Thank you in advance! :grin:

Hey,

It seems like this type of work would only require a single question (along with context) and a response. ChatCompletions sounds perfect for this task. No RAG required (unless you want to upload documents). The main difference here is how the conversation is managed: Assistants stores and manages it for you, while you would have to manage it yourself using ChatCompletions. The cost would still be the same, though.

Keep in mind, Assistants uses ChatCompletions under the hood.

Truthfully, there wouldn’t be much of a cost difference between ChatCompletions and Assistants for your use-case. Because what you are doing is simple semantic work, ChatCompletions would be just as effective.

For both ChatCompletions & Assistants the System prompt is sent for each and every request.

Also, keep in mind that even though the model’s max output is 4k tokens you can simply run it twice to continue from where it left off.

Hey again @RonaldGRuckus Thank you so much for your previous answers, I’ll take them into consideration for my application. It seems that ChatCompletions is perfect for this task then.

Just another question hehe :sweat_smile: Do you know how I could manage the algorithm if I’m expecting an answer of more than 4k tokens, let’s say 6k? How could my code retrieve the other 2k tokens of the answer?

Thank you in advance once again!

No problem.

You can use finish_reason to determine the next step:

finish_reason (string): The reason the model stopped generating tokens. This will be stop if the model hit a natural stop point or a provided stop sequence, length if the maximum number of tokens specified in the request was reached, content_filter if content was omitted due to a flag from our content filters, tool_calls if the model called a tool, or function_call (deprecated) if the model called a function.

https://platform.openai.com/docs/api-reference/chat/object
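
A minimal sketch of that continuation loop (assuming the `openai` Python SDK; the model name and prompt are only examples):

```python
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Analyze these texts for congruence: ..."}]
full_answer = ""

while True:
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    choice = resp.choices[0]
    full_answer += choice.message.content

    if choice.finish_reason != "length":
        break  # natural stop (or filter/tool call): the answer is complete

    # The model hit the output-token cap: keep the partial answer in the
    # history and ask it to continue from where it stopped.
    messages.append({"role": "assistant", "content": choice.message.content})
    messages.append({"role": "user", "content": "Continue exactly where you left off."})

print(full_answer)
```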