Assistants API - Thread Tokens vs Thread Management

Hello,

What is the optimal way to manage large threads in the Assistants API? I am trying to create an assistant that generates stories, but I keep encountering issues when my thread exceeds the token limit.

I understand that I need to manage this, but I’m unsure of the best approach. I know I can shorten the message context and create summaries. After that, should I delete messages from the current thread, or should I start a new thread while using the same assistant?
Are there any tried-and-tested solutions, ideally with a short description of how they work?

Currently, I’m receiving the following error message: “Request too large for gpt-4o in organization org-vl10PQxe0NxrwCqvMYDWWyeC on tokens per min (TPM): Limit 30000, Requested 30564.” This indicates that I need to reduce the input or output tokens to run successfully.
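The numbers in that error suggest a simple pre-flight check: estimate the request's token count before submitting it and trim the context until it fits. A rough sketch of that idea (the ~4 characters per token heuristic and the helper names here are my own assumptions, not anything the API provides):

```python
# Rough pre-flight token budget check before submitting a run.
# Heuristic: ~4 characters per token for English text (an approximation,
# not the model's real tokenizer).
TPM_LIMIT = 30_000          # from the error message above
RESPONSE_RESERVE = 2_000    # leave headroom for the model's output

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def trim_to_budget(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose estimated total fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):      # walk newest-first
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))         # restore chronological order

history = ["chapter one " * 12_000, "a short note", "latest instruction"]
trimmed = trim_to_budget(history, TPM_LIMIT - RESPONSE_RESERVE)
```

For an accurate count you would use the model's actual tokenizer (e.g. the `tiktoken` library) instead of the character heuristic.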

Thank you for your guidance!

Written some time ago… see if it helps

The problem is that the Assistants API ignores your account’s rate limits when it builds model context from a thread, and its internal iterative processes (tool calls, retrieval passes) consume even more of a minute’s token budget.

The only solution that OpenAI offers (purposefully) is to raise your usage tier through additional payment history: pay $50+ in total, after waiting 7+ days before further payments count toward the next tier.

You can indirectly improve the situation: limit the number of messages (not the number of tokens) taken from a thread per run, or reduce the chunk size of vector store files; see the API documentation.
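The per-run message limit mentioned above is set via the `truncation_strategy` parameter when creating a run. A minimal sketch with the `openai` Python SDK (the IDs are placeholders, and since the Assistants API is a beta surface, check the current reference before relying on it):

```python
# Limit how many thread messages are pulled into each run's model context.
# "last_messages" tells the API to include only the N most recent messages.
run_params = {
    "assistant_id": "asst_placeholder",   # placeholder ID
    "truncation_strategy": {
        "type": "last_messages",
        "last_messages": 10,              # cap the context at 10 messages
    },
}

# With the SDK this would be submitted as (requires an API key):
# from openai import OpenAI
# client = OpenAI()
# run = client.beta.threads.runs.create(
#     thread_id="thread_placeholder", **run_params
# )
```

This bounds context growth per run, but long individual messages can still blow past a low TPM limit, so it complements rather than replaces summarization.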

Thanks, but overall I will need very long stories.
My plan was to summarize the story for the agent, and then probably the best option is some kind of shrinking of the messages in the thread. I don’t want to open a new thread, because I think I would end up with lots of threads and my story would look like:

Story :
Thread 1
Thread 2
Thread 3
Thread n

And I was wondering if I can do that within one thread.
I’m looking for the best solution that somebody has hopefully already implemented :smiley:
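The single-thread shrinking you describe can be sketched as a compaction step: once the thread grows past a threshold, collapse everything except the recent tail into one summary message. The function names and the pluggable `summarize` callback below are hypothetical, not part of the SDK; with the real API you would delete the old messages (the threads messages delete endpoint) and append the summary as a new message in the same thread.

```python
from typing import Callable

def compact_thread(messages: list[str],
                   keep_last: int,
                   summarize: Callable[[list[str]], str]) -> list[str]:
    """Collapse all but the last `keep_last` messages into one summary entry."""
    if len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(old)   # in practice, an LLM call over the old messages
    return [f"[Story so far] {summary}"] + recent

# Stand-in summarizer for this sketch (a real one would call the model):
def toy_summarize(msgs: list[str]) -> str:
    return f"{len(msgs)} earlier messages condensed"

history = [f"scene {i}" for i in range(1, 8)]
compacted = compact_thread(history, keep_last=3, summarize=toy_summarize)
```

Applied to an actual thread, you would delete the summarized messages and insert the `[Story so far]` message, so the thread itself stays small while remaining the single home of the story.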