GPT-4o Assistant Thread Length Limit?

With Assistants, threads may grow very long (perhaps hitting the limit mentioned earlier). However, the entire conversation is not necessarily sent to the model on each run.

If you were to use gpt-3.5-turbo-0613 with its 4k-token context, the Assistants framework would reserve however many tokens it needs for a reply, perhaps a stock 2,000, and then pass only as many recent chat turns from the thread as fit in the remaining token budget for that model.
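As a rough sketch of how that fitting might work (this is an illustration, not OpenAI's actual implementation; the 4,096 context and 2,000-token reply reservation are the figures from above):

```python
import tiktoken

def fit_recent_turns(messages, model="gpt-3.5-turbo",
                     context_window=4096, reply_reservation=2000):
    """Walk the thread backwards, keeping the newest turns that fit
    in the context window after reserving room for the reply."""
    enc = tiktoken.encoding_for_model(model)
    budget = context_window - reply_reservation
    kept = []
    for msg in reversed(messages):  # newest turns first
        cost = len(enc.encode(msg["content"])) + 4  # rough per-message overhead
        if cost > budget:
            break
        budget -= cost
        kept.append(msg)
    return list(reversed(kept))  # restore chronological order
```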

Choose a 128k-context model, and the same thread can switch back to sending many more messages.

OpenAI has not provided a token limit you can set yourself for messages, but they did provide a number-of-past-turns option. That is the truncation option I mentioned earlier, which you can read about in the Assistants API documentation.

truncation_strategy is the technical tool to limit the number of messages passed to the AI.
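With the Python SDK you can set it per run. A minimal sketch; the IDs are placeholders and the last_messages value of 10 is just an illustration:

```python
from openai import OpenAI

client = OpenAI()

# Limit this run to the 10 most recent messages in the thread
run = client.beta.threads.runs.create(
    thread_id="thread_abc123",      # placeholder thread ID
    assistant_id="asst_abc123",     # placeholder assistant ID
    truncation_strategy={
        "type": "last_messages",
        "last_messages": 10,
    },
)
```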

Using file_search is where you have little control: up to 20 chunks of 800+ tokens each can be placed into the AI's context. The Assistants framework doesn't take into account the per-tier rate maximum that can be sent to the model when it composes instructions, messages, tools, and retrieval or search context, and it will happily bill you $0.50+ per call. The only mitigation is to make your documents tiny so the chunks are tiny.
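To see where that bill comes from, here is the back-of-the-envelope worst case (the $5/1M input-token price is an assumption for illustration; check current pricing for your model):

```python
# Worst-case file_search load: 20 chunks of ~800 tokens each
chunks = 20
tokens_per_chunk = 800
search_context = chunks * tokens_per_chunk    # 16,000 tokens

# Add instructions, tools, and a long thread, and a run can
# approach a full 128k context window on every call
other_input = 112_000                         # assumed: rest of a filled 128k window
total_input = search_context + other_input    # 128,000 tokens

price_per_million = 5.00                      # assumed input price, USD per 1M tokens
print(f"~${total_input / 1_000_000 * price_per_million:.2f} per call")  # ~$0.64
```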
