Assistant Thread limitations

Hi,

I'm trying to understand how the Assistants API works. I was under the impression that by creating an assistant and a thread I could manage a session.

My use case is that I need to process a large amount of text and convert it to a specific format. With a thread, I thought the first few messages would hold my instructions, and that I could write some code to send requests to that thread with the data I need processed and the AI would process it.

However, I soon ran into this error:

 Request too large for gpt-4o in organization org-abc on tokens per min (TPM): Limit 30000, Requested 30502. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.

From my research, I think that each time I send a new query, all the history in the thread is added to that query; otherwise I can't explain how a query of 300 tokens plus a response surpasses 1000 tokens.
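That is consistent with how thread context would grow if every run re-sends the accumulated history. A pure-Python illustration of the growth (the token counts below are made-up illustrative numbers, not real API figures):

```python
# Illustrative only: how re-sending thread history inflates input tokens.
# Each new run's input = instructions + ALL prior messages + the new query.

INSTRUCTION_TOKENS = 500   # hypothetical size of the assistant's instructions
QUERY_TOKENS = 300         # hypothetical size of each new query
REPLY_TOKENS = 400         # hypothetical size of each model reply

def input_tokens_for_run(turn: int) -> int:
    """Tokens sent on the Nth run (1-indexed) if nothing is truncated."""
    history = (turn - 1) * (QUERY_TOKENS + REPLY_TOKENS)
    return INSTRUCTION_TOKENS + history + QUERY_TOKENS

growth = [input_tokens_for_run(t) for t in range(1, 6)]
print(growth)  # grows by QUERY_TOKENS + REPLY_TOKENS every turn
```

With these numbers, every additional turn adds 700 input tokens to the next request, so the cost per query climbs linearly until the rate limit is hit.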

Please let me know if this assumption is correct.

I found that I might be able to use truncation_strategy to control this, but what I need is to keep the original X messages (say, 20) and keep deleting the new ones I send, replacing them with data from the file I'm reading. Otherwise it would be simpler to use the Chat API and send the instructions plus the query every time I interact with the AI.
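For reference, truncation_strategy is passed when creating a run. A minimal sketch, assuming the openai Python SDK v1.x (the thread and assistant IDs are placeholders):

```python
# Sketch: limit how much thread history each run re-sends.
# Assumes the openai Python SDK v1.x; IDs are caller-supplied placeholders.

truncation = {
    "type": "last_messages",   # keep only the most recent N messages
    "last_messages": 20,
}

def create_run(thread_id: str, assistant_id: str):
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment
    client = OpenAI()
    return client.beta.threads.runs.create(
        thread_id=thread_id,
        assistant_id=assistant_id,
        truncation_strategy=truncation,
    )
```

Note that this keeps the *last* N messages, not the first N, so it does not cover the "keep the original instructions, drop the newer data messages" case described above.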

Check the token count first:
https://platform.openai.com/tokenizer

Review your code with print/log statements to see what and how much text you are sending.

If you are sure it is not a coding error, I suggest reporting it as a bug:

https://community.openai.com/t/how-to-properly-report-a-bug-to-openai

I think my mistake was to assume that a session was being created, while in actuality it just stores all the previous messages and appends them to the assistant's context on every query I make. That means all the previous messages are included, and that is why I exceed the token limit per conversation.

I'm not sure how it works in the web version, but based on my tests I think it just truncates older messages silently, so the user keeps thinking the model remembers the whole conversation when in fact only the last messages are remembered.

I'm just going to use the Chat API with my custom prompt instructions and process each query separately with that.

It's going to cost a bit more, but at least it will work.
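That stateless, per-query approach can be sketched like this, assuming the openai Python SDK v1.x (the instructions string is a placeholder for your actual prompt):

```python
# Sketch: stateless Chat Completions — instructions + one query per call,
# no accumulated history, so the cost per query stays flat.
# Assumes the openai Python SDK v1.x.

INSTRUCTIONS = "Convert the input text to the target format."  # placeholder

def build_messages(query: str) -> list:
    # A fresh message list every call: no thread history is re-sent.
    return [
        {"role": "system", "content": INSTRUCTIONS},
        {"role": "user", "content": query},
    ]

def process(query: str) -> str:
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=build_messages(query),
    )
    return resp.choices[0].message.content
```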

When you increase your tier, your token limits increase as well. Right now you are hitting the Tier 1 limit.

Another solution is to create a new thread each time. With a long-running thread, the token count keeps increasing because the thread's accumulated context is sent with every run.

https://platform.openai.com/docs/guides/rate-limits/usage-tiers
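Whichever API is used, the input can also be chunked so each request stays under the per-minute budget. A pure-Python sketch using the rough 4-characters-per-token estimate (the limit is the Tier 1 value from the error message above; the headroom figure is an arbitrary example):

```python
# Sketch: split a large text into chunks that each fit under the TPM budget,
# leaving headroom for the instructions and the model's reply.

TPM_LIMIT = 30_000         # Tier 1 limit from the error message
HEADROOM = 5_000           # example reserve for instructions + response
CHARS_PER_TOKEN = 4        # rough English-text approximation

def chunk_text(text: str) -> list:
    max_chars = (TPM_LIMIT - HEADROOM) * CHARS_PER_TOKEN
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Each chunk can then be sent as its own query (with the Chat API, or a fresh thread) without any single request exceeding the limit.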