There are a couple of issues with the current truncation strategy.
The default ‘auto’ setting is ridiculous and expensive. (Also seems predatory for anyone who doesn’t do enough research on this)
When set to auto, messages in the middle of the thread will be dropped to fit the context length of the model, max_prompt_tokens.
The only limit in this case is the context length. So for gpt-4-turbo, with its 128k-token context, once a thread has reached the context limit every new message would cost $1.28 in prompt tokens alone.
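To make the arithmetic explicit, here is a rough sketch. The price of $0.01 per 1K prompt tokens is an assumption inferred from the $1.28 figure above, not taken from a price list, and the function names are made up for illustration.

```python
# Assumed price, inferred from the $1.28 figure in the post; check current pricing.
PRICE_PER_1K_PROMPT_TOKENS_USD = 0.01
CONTEXT_LIMIT_TOKENS = 128_000  # gpt-4-turbo context length


def prompt_cost_usd(prompt_tokens: int) -> float:
    """Prompt-token cost of a single run."""
    return prompt_tokens / 1000 * PRICE_PER_1K_PROMPT_TOKENS_USD


def thread_cost_usd(new_messages: int, prompt_tokens: int = CONTEXT_LIMIT_TOKENS) -> float:
    """Cumulative prompt cost when every new message re-sends a full context."""
    return new_messages * prompt_cost_usd(prompt_tokens)


if __name__ == "__main__":
    print(f"one message at the limit: ${prompt_cost_usd(CONTEXT_LIMIT_TOKENS):.2f}")
    print(f"fifty more messages:      ${thread_cost_usd(50):.2f}")
```

The point is that the cost is per message, so a long-running thread keeps paying the full-context price on every turn.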
Since this option is not viable for anyone who cares about costs, there are two ways to limit the cost:
1. Set the run truncation_strategy to last_messages (in runs)
This is not a great summarisation strategy and also has no limit on token size (which is what matters with cost).
I would propose the following:
- Use the ‘auto’ strategy but allow users to set a token limit
- If using last_messages allow a token amount to be set instead of only the number of last messages
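As an illustration of what a token-based last_messages limit could look like on the client side, here is a minimal sketch. It is not something the API offers today; the helper names are hypothetical and the token counter is a crude stand-in (roughly 4 characters per token) that you would swap for a real tokenizer.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: assume about 4 characters per token."""
    return max(1, len(text) // 4)


def truncate_to_token_budget(
    messages: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Keep the newest messages whose combined token count fits max_tokens."""
    kept: List[str] = []
    used = 0
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Unlike a fixed message count, this keeps the prompt size (and therefore the cost) bounded no matter how long individual messages are.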
Additionally, I would add a new truncation strategy where you keep a running summary of the conversation. I am 90% sure this is the strategy that ChatGPT uses as you can always ask for a summary of the whole conversation.
LangChain has an implementation of this called ConversationSummaryBufferMemory.
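A minimal sketch of the running-summary idea, under stated assumptions: this is neither LangChain's implementation nor a claim about how ChatGPT works. The class, the token counter, and the toy summarizer are all hypothetical; in practice the summarize callable would be a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: assume about 4 characters per token."""
    return max(1, len(text) // 4)


def toy_summarize(old_summary: str, evicted: List[str]) -> str:
    """Placeholder summarizer: keeps the first 3 characters of each evicted message."""
    return (old_summary + " " + " ".join(m[:3] for m in evicted)).strip()


@dataclass
class SummaryBuffer:
    summarize: Callable[[str, List[str]], str]  # (old summary, evicted messages) -> new summary
    max_buffer_tokens: int
    summary: str = ""
    recent: List[str] = field(default_factory=list)

    def add(self, message: str) -> None:
        """Append a message, folding the oldest ones into the summary when over budget."""
        self.recent.append(message)
        evicted: List[str] = []
        while (
            len(self.recent) > 1
            and sum(rough_token_count(m) for m in self.recent) > self.max_buffer_tokens
        ):
            evicted.append(self.recent.pop(0))
        if evicted:
            self.summary = self.summarize(self.summary, evicted)

    def prompt_context(self) -> List[str]:
        """Summary (if any) followed by the verbatim recent messages."""
        header = [f"Summary so far: {self.summary}"] if self.summary else []
        return header + self.recent
```

The recent messages stay verbatim while older ones are compressed, so the prompt stays under a token budget without dropping the middle of the thread outright.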
2. Set max_prompt_tokens and max_completion_tokens when creating the Run
This is the other solution proposed in the docs and it sucks. It terminates the whole run if you exceed the limit, and it gives no specific control over how much of the budget goes to chat history versus other elements of the run.
They mention this but it is so vague that I have no idea what “best effort” could mean:
The run will make a best effort to use only the number of completion tokens specified, across multiple turns of the run.
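For reference, a minimal sketch of how the two options discussed above can be set when creating a run with the OpenAI Python SDK. The helper name and the numeric defaults are arbitrary choices for illustration, and client is assumed to be an openai.OpenAI() instance from an SDK version that supports these run parameters; verify against the current API reference before relying on it.

```python
def create_limited_run(
    client,
    thread_id: str,
    assistant_id: str,
    last_messages: int = 10,            # arbitrary example value
    max_prompt_tokens: int = 4000,      # arbitrary example value
    max_completion_tokens: int = 1000,  # arbitrary example value
):
    """Create a run that caps both the message count and the token usage.

    `client` is assumed to expose `beta.threads.runs.create`, as the
    OpenAI Python SDK does for the Assistants API.
    """
    return client.beta.threads.runs.create(
        thread_id=thread_id,
        assistant_id=assistant_id,
        # Option 1: only the N most recent messages go into the prompt.
        truncation_strategy={"type": "last_messages", "last_messages": last_messages},
        # Option 2: token caps for the whole run, across all of its turns.
        max_prompt_tokens=max_prompt_tokens,
        max_completion_tokens=max_completion_tokens,
    )
```

Neither setting gives a token cap on the chat history specifically, which is the gap the proposals above are trying to close.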