How does an Assistants API thread defy the token limit?

I ran an experiment in which I introduced some seed information, dispersed within the first 27 messages, and then continued the thread on a topic with no overlap with the seed information.

After more than 10,000 raw tokens of unrelated messages following the seed information ("raw" meaning message content only, with no other overhead counted, so a conservative figure), a user question was posed regarding that information:

  • Info1
  • Several unrelated messages
  • Info2
  • Several unrelated messages
  • Info3
  • Several unrelated messages
  • Info4
  • 10k tokens of unrelated messages
  • Question to summarize “info”
  • Assistant: …
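The layout above can be sketched locally. This is a minimal simulation of the thread structure, not a call to the real API; token counts use a rough 4-characters-per-token heuristic rather than the model's actual tokenizer, so the figures are illustrative only.

```python
# Hedged sketch: approximate the experiment's message layout and verify
# that the raw token count ahead of the question exceeds the 8k context.
# approx_tokens uses a crude ~4 chars/token heuristic (an assumption),
# not the model's real tokenizer.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def build_thread():
    messages = []
    for i in range(1, 5):                              # Info1..Info4
        messages.append(("user", f"Info{i}: <seed fact {i}>"))
        for _ in range(8):                             # several unrelated messages
            messages.append(("user", "unrelated chatter " * 20))
    filler = "more unrelated discussion " * 40
    # pad until ~10k raw tokens of unrelated content precede the question
    while sum(approx_tokens(c) for _, c in messages) < 10_000:
        messages.append(("user", filler))
    messages.append(("user", "Please summarize all of the Info items."))
    return messages

thread = build_thread()
total = sum(approx_tokens(c) for _, c in thread)
print(f"{len(thread)} messages, ~{total} raw tokens (model context: 8192)")
```

Under any rough tokenizer, the seed messages sit well outside an 8k window by the time the question is asked.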

Based on other posts (such as “Max number of tokens a Thread can use equal the Context Length of the used model?”), I’d expect the seed information to have aged outside of the maximum context (8k tokens for gpt-4 0613).

What I observed is that this seed information, which lies beyond the 8k-token limit, is perfectly summarized.

How might this be occurring?


Response received from the OpenAI engineering team:

This is part of our truncation logic that's built into our Assistants API (from our docs):

"Once the size of the Messages in a Thread exceeds the context window of the model, the Thread will attempt to include as many messages as possible that fit in the context window and drop the oldest messages. Note that this truncation strategy will evolve over time to become more sophisticated."

Therefore, some old messages will get retained as part of our truncation strategy and others will get dropped, similar to ChatGPT. Unfortunately, we can't share more about this truncation strategy, so sorry about that!

This confirms that there is some secret sauce beyond a simple sliding window on the chat history.
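For contrast, here is what a plain sliding window would do. This is a sketch of the naive strategy the observation appears to rule out, not OpenAI's actual implementation: keep only the newest messages that fit the budget and drop everything older.

```python
# Hedged sketch of naive sliding-window truncation (NOT OpenAI's real
# strategy): walk the history newest-first, keep messages until the
# token budget is exhausted, and drop everything older.

def sliding_window(messages, count_tokens, limit=8192):
    kept, used = [], 0
    for msg in reversed(messages):      # newest first
        cost = count_tokens(msg)
        if used + cost > limit:
            break                        # everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))          # restore chronological order
```

Under this strategy, 10k tokens of filler placed after the seed messages would push every Info message out of the window entirely, so the fact that the model still summarized them points to selective retention rather than a pure recency window.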


Nice insight! Yeah it seemed weird. Numerous people have shown ChatGPT retaining very old “seed” messages when passing the context length, while forgetting others.

For ChatGPT I'm not sure. It could be that they perform a similarity search of the latest message against the conversation once the token limit is reached, to determine which content to remove and which to retain. It would be nice to get some more insight here.
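The similarity-search idea could look something like this. This is purely speculative, a sketch of the hypothesis rather than anything OpenAI has confirmed; the embeddings here are toy hand-made vectors, whereas a real system would call an embedding model for each message.

```python
import math

# Speculative sketch of similarity-based retention (the hypothesis above,
# not a known OpenAI mechanism): rank old messages by cosine similarity
# to the latest query's embedding and keep the best fits within a budget.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retain_by_similarity(messages, query_vec, token_budget):
    """messages: list of (text, embedding, token_cost) tuples."""
    ranked = sorted(messages, key=lambda m: cosine(m[1], query_vec), reverse=True)
    kept, used = [], 0
    for text, vec, cost in ranked:      # greedily keep highest-similarity
        if used + cost <= token_budget:
            kept.append(text)
            used += cost
    return kept
```

A scheme like this would naturally retain old seed messages relevant to the final question while dropping unrelated chatter, matching the behavior observed in the experiment.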

Thanks for sharing


I was wondering the same about similarity search. I suppose that would imply creating embeddings for each message. So far, OpenAI has declined to provide any implementation details on its "truncation strategy".

Thanks for the feedback!
