Token concumption: Prompt tokens exponentially increase when using Threads (Assistants)

We have a back and forth conversation with an assistant.
We place a message 1 on a thread and ask the assistant to run, the prompt token is x1.
We then place another message 2 on the same thread and ask the same assistant to run, the prompt token is x1+x2.
We then place another message 3 on the same thread and ask the same assistant to run, the prompt token is x1+x2+x3.
This is very expensive because our conversations easily expand 55 messages (and 55 responses) and at that point we’re paying 10 USD for the conversation.
With this exponential cost structure, we cannot complete our conversation which could go up to 100 messages.
Why do we need to keep paying the exponential token count for each previous message?

Update one: screenshot of our registered token count, going up to 125k tokens on message 45. That message itself is only 9000 characters.

Update two:
we do use truncation strategy now when asking an assistant to run on a thread.
However, the original behavior isn’t intuitive and in our opinion possibly incorrect?
Also, we honestly do want the assistant doing the thread run on message X50 to have access to the complete chat history (X1 - X49), and by using truncation_strategy it does not.

Hi!

You will need to make a choice:
Either feed the model the entire conversation history each time or provide an abstracted/truncated version.

This is an inherent limitation of the models.

Thank you for your message.
However, the API isn’t stateless. The Thread is persisted on OpenAI servers. We don’t send the entire conversation with each prompt.
This is exactly why my original question: Why do we need to keep paying the exponential token count for each previous message?

Because the LLM can only respond based on the token that it sees. Each token leads to contributing to expense.

Now there are really three paths :
(a) full context (all messages)
(b) truncated context (n messages where n < all messages)
(c) summarized (abbreviated) context.

I believe that you have tried (a) and (b). Perhaps (c) is still an option?

Yes, I used the wrong word. It’s a limitation of the models, not the API. The assistant’s API manages the state for us, but it doesn’t change the fact that the model needs to process the entire conversation history with each turn.

I’ll edit the previous post to correct this.

Yes, it seems (c) is the affordable way to go that makes sense

Assistants is code that runs to create a call to a model, assembling that call from an assistant, tool specifications, chat history, most recent input.

The model has no memory, it receives all tokens that the assistants backend sends by necessity.

Assistants also will internally iterate when the AI calls on OpenAI’s tools instead of sending response to a user, running the code and again re-sending to AI with more response added to a thread.

The call to a model that is acting as a chatbot will have prior turns of conversation and response placed before the most recent turn. Also placed from a thread is past tool calls and the tool responses, unseen by you when using assistants threads.

You do not have to send the entire conversation to maintain an illusion of memory. Most conversations can appear to have memory with just a few turns to show what is most recently under discussion.

Assistants has a recently-added run parameter called truncation_strategy to limit how much chat is sent from a thread. It only has one strategy, less turns of conversation. OpenAI certainly has other strategy to ensure context length of a model is not exceeded, but that is not exposed or customizable, instead tending to maximum.

Technique (d) is best: self management, using chat.completions.

1 Like