Add smarter controls to truncate Thread chat history (Assistants API, Runs API)

There are a couple of issues with the current truncation strategy.

The default ‘auto’ setting is unreasonably expensive. (It also feels predatory towards anyone who doesn’t do enough research on this before relying on it.)

When set to auto, messages in the middle of the thread will be dropped to fit the context length of the model (or max_prompt_tokens, if one is set).

If max_prompt_tokens is not set, the only limit is the context length. For gpt-4-turbo, with its 128k context window, every new message sent to a thread that is already at the context limit would cost $1.28 in prompt tokens alone.
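A quick back-of-the-envelope check, assuming gpt-4-turbo's input pricing of $10 per 1M tokens (adjust for whatever the current rate is):

```python
# Rough prompt cost per run for a thread that already fills the context window.
# Pricing assumption: gpt-4-turbo input tokens at $10 per 1M (verify current rates).
CONTEXT_TOKENS = 128_000
INPUT_PRICE_PER_MILLION_USD = 10.00

cost_per_run = CONTEXT_TOKENS / 1_000_000 * INPUT_PRICE_PER_MILLION_USD
print(f"Prompt cost per run: ${cost_per_run:.2f}")  # -> Prompt cost per run: $1.28
```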

Since the default is not viable for anyone who cares about costs, there are currently two ways to limit it:

1. Set truncation_strategy to last_messages when creating the Run

[Screenshot: truncation_strategy options (2024-06-28)]

This is not a great truncation strategy: it caps the number of messages, not the number of tokens, and token count is what actually drives cost.
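For context, this is roughly what that looks like with the Python SDK (IDs and values are illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Keep only the last 10 messages of the thread in the prompt for this run.
# Note: the limit is a message count, not a token budget.
run = client.beta.threads.runs.create(
    thread_id="thread_abc123",
    assistant_id="asst_abc123",
    truncation_strategy={
        "type": "last_messages",
        "last_messages": 10,
    },
)
```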

I would propose the following:

  • Keep the ‘auto’ strategy, but allow users to set a token limit on the retained history
  • If using last_messages, allow a token budget to be set instead of only a number of last messages (sketched below)
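A purely hypothetical sketch of what that could look like (max_history_tokens does not exist in the current API; it is the kind of parameter I am asking for):

```python
# HYPOTHETICAL: max_history_tokens is not a real parameter today.
# The idea is a token budget for the retained history instead of a message count.
run = client.beta.threads.runs.create(
    thread_id="thread_abc123",
    assistant_id="asst_abc123",
    truncation_strategy={
        "type": "last_messages",
        "max_history_tokens": 4_000,
    },
)
```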

Additionally, I would add a new truncation strategy that keeps a running summary of the conversation. I am 90% sure this is the strategy ChatGPT uses, since you can always ask it for a summary of the whole conversation.

LangChain has an implementation of this called ConversationSummaryBufferMemory.
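As a rough illustration of the pattern (this is LangChain, not the Assistants API; the model and token limit are illustrative):

```python
from langchain.memory import ConversationSummaryBufferMemory
from langchain_openai import ChatOpenAI

# Recent messages are kept verbatim; once the buffer exceeds max_token_limit,
# older messages are folded into a running summary generated by the LLM.
memory = ConversationSummaryBufferMemory(
    llm=ChatOpenAI(model="gpt-4-turbo"),
    max_token_limit=2_000,
)

memory.save_context(
    {"input": "Hi, I'm planning a trip to Japan."},
    {"output": "Great! When are you planning to go?"},
)
memory.save_context(
    {"input": "In October, for two weeks."},
    {"output": "October is a lovely time - peak autumn foliage."},
)

# Returns the running summary plus the most recent raw messages.
print(memory.load_memory_variables({}))
```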

2. Set max_prompt_tokens and max_completion_tokens when creating the Run

This is the other solution proposed in the docs, and it sucks. It will terminate your whole run if you exceed the limit, with no specific control over how many tokens go to the chat history versus the other parts of the run.
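For reference, this is roughly how those caps are set today (values illustrative); when a budget is exceeded, the run simply ends instead of trimming the history:

```python
from openai import OpenAI

client = OpenAI()

# Hard token budgets for the whole run, not just the chat history.
run = client.beta.threads.runs.create_and_poll(
    thread_id="thread_abc123",
    assistant_id="asst_abc123",
    max_prompt_tokens=20_000,
    max_completion_tokens=1_000,
)

# If a budget is exceeded, the run ends with status "incomplete".
if run.status == "incomplete":
    print(run.incomplete_details)  # e.g. reason: "max_prompt_tokens"
```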

The docs mention this, but it is so vague that I have no idea what “best effort” actually means:

The run will make a best effort to use only the number of completion tokens specified, across multiple turns of the run.