Token consumption: prompt tokens grow with every message when using Threads (Assistants)

We have a back-and-forth conversation with an assistant.
We place message 1 on a thread and ask the assistant to run; the prompt token count is x1.
We then place message 2 on the same thread and ask the same assistant to run; the prompt token count is x1 + x2.
We then place message 3 on the same thread and ask the same assistant to run; the prompt token count is x1 + x2 + x3.
This is very expensive because our conversations easily run to 55 messages (and 55 responses), and at that point we're paying 10 USD for the conversation.
With this cumulative cost structure, we cannot complete our conversations, which can run up to 100 messages.
Why do we need to keep paying the token count of every previous message on each run?
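
To illustrate the arithmetic, here is a rough sketch of how the billed prompt tokens accumulate, assuming (hypothetically) a flat 200 prompt tokens per message:

```python
# Rough illustration (not an API call): if every run re-reads the whole thread,
# run i is billed roughly x1 + x2 + ... + xi prompt tokens.
message_tokens = [200] * 55   # hypothetical: ~200 prompt tokens per message

total_billed = 0
running_context = 0
for x in message_tokens:
    running_context += x              # this run sees all previous messages plus the new one
    total_billed += running_context   # prompt tokens billed for this run

print(running_context)  # 11,000 tokens in the prompt of run 55
print(total_billed)     # 308,000 prompt tokens billed over 55 runs: quadratic, not linear
```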

Update one: a screenshot of our recorded token count, which reaches 125k tokens on message 45. That message itself is only 9,000 characters.

Update two:
We do use a truncation strategy now when asking an assistant to run on a thread.
However, the original behavior isn't intuitive and is, in our opinion, possibly incorrect.
Also, we honestly do want the assistant running the thread on message X50 to have access to the complete chat history (X1–X49), and with truncation_strategy it does not.
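
For reference, this is roughly how we set it when creating a run with the Python SDK (the IDs and the window of 10 messages are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Placeholder IDs; "last_messages" limits the run's context to the N most recent
# thread messages (per the truncation_strategy run parameter).
run = client.beta.threads.runs.create(
    thread_id="thread_abc123",
    assistant_id="asst_abc123",
    truncation_strategy={"type": "last_messages", "last_messages": 10},
)
```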

Hi!

You will need to make a choice:
Either feed the model the entire conversation history each time or provide an abstracted/truncated version.

This is an inherent limitation of the models.

Thank you for your message.
However, the API isn't stateless. The Thread is persisted on OpenAI's servers, and we don't send the entire conversation with each prompt.
This is exactly the point of my original question: why do we need to keep paying the token count of every previous message on each run?

Because the LLM can only respond based on the tokens it sees, and each of those tokens contributes to the expense.

Now there are really three paths:
(a) full context (all messages)
(b) truncated context (n messages where n < all messages)
(c) summarized (abbreviated) context.

I believe that you have tried (a) and (b). Perhaps (c) is still an option?
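
A minimal sketch of (c), assuming you summarize the older turns with chat.completions and keep only the summary plus the most recent messages (the model name and prompts here are illustrative):

```python
from openai import OpenAI

client = OpenAI()

def summarized_context(history, keep_recent=6, model="gpt-4o-mini"):
    """Collapse older turns into one short summary; keep only recent turns verbatim."""
    older, recent = history[:-keep_recent], history[-keep_recent:]
    if not older:
        return recent

    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Summarize this conversation in a few sentences, keeping facts, decisions, and open questions."},
            {"role": "user", "content": transcript},
        ],
    ).choices[0].message.content

    # The model sees one summary message plus the latest turns instead of the full history.
    return [{"role": "system", "content": f"Summary of the earlier conversation: {summary}"}] + recent
```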

Yes, I used the wrong word. It’s a limitation of the models, not the API. The Assistants API manages the state for us, but that doesn’t change the fact that the model needs to process the entire conversation history on each turn.

I’ll edit the previous post to correct this.

Yes, it seems (c) is the affordable approach that makes sense.

Assistants is code that runs to create a call to a model, assembling that call from the assistant’s instructions, the tool specifications, the chat history, and the most recent input.

The model has no memory; by necessity, it receives every token that the Assistants backend sends.

Assistants will also iterate internally when the AI calls one of OpenAI’s tools instead of sending a response to the user: it runs the tool and re-sends everything to the AI, with the tool output added to the thread.

The call to a model acting as a chatbot places the prior turns of conversation and responses before the most recent turn. Past tool calls and tool responses from the thread are also included, unseen by you when using Assistants threads.
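
Conceptually, the call that the backend assembles looks something like this hand-written illustration (not the actual internal payload, which is not exposed):

```python
# Hand-written illustration only; the real assembled payload is internal to Assistants.
assembled_call = {
    "model": "gpt-4o",
    "tools": [{"type": "code_interpreter"}],                  # tool specifications
    "messages": [
        {"role": "system", "content": "<assistant instructions>"},
        {"role": "user", "content": "message 1"},             # prior turns from the thread
        {"role": "assistant", "content": "response 1"},
        {"role": "assistant", "tool_calls": "<past tool call, unseen by you>"},
        {"role": "tool", "content": "<past tool output, unseen by you>"},
        {"role": "user", "content": "most recent input"},     # the newest message
    ],
}
```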

You do not have to send the entire conversation to maintain an illusion of memory. Most conversations can appear to have memory with just a few turns to show what is most recently under discussion.

Assistants has a recently added run parameter called truncation_strategy to limit how much chat is sent from a thread. It offers only one strategy: fewer turns of conversation. OpenAI certainly has another strategy to ensure a model’s context length is not exceeded, but it is not exposed or customizable and instead tends toward the maximum.

Technique (d) is best: self-management, using chat.completions.
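
A minimal sketch of (d), assuming you keep the message list yourself and send only the last few turns to chat.completions (the model name and turn count are illustrative):

```python
from openai import OpenAI

client = OpenAI()
history = []  # you own this list instead of an OpenAI-hosted thread

def chat(user_input, max_messages=12, model="gpt-4o-mini"):
    history.append({"role": "user", "content": user_input})
    # Send a system message plus only the most recent turns: the "illusion of memory".
    context = [{"role": "system", "content": "You are a helpful assistant."}] + history[-max_messages:]
    reply = client.chat.completions.create(model=model, messages=context).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```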
