I’m not sure 4o assistants are limited only by tokens per minute/day. While interacting with a 4o assistant of mine yesterday, I hit a hard limit on thread length three times. I had been under the impression that the limit was token-based, but I noticed this morning that all three times it happened when the thread reached 100 messages.
I think it’s fair to say 100 is a suspiciously round number.
When I hit this limit, I received the error message:
"Run failed: Request too large for gpt-4o in organization org-[id] on tokens per min (TPM): Limit 30000, Requested [>30000]. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more."
That URL leads to a page-not-found error, but the rate-limits page (https://platform.openai.com/account/rate-limits) suggests there should only be per-minute and per-day limits. Unfortunately, the threads in which this occurred remain unresponsive (apart from returning that same error message) 24 hours later.
This is a bit frustrating. Dumping the old messages into a new thread rapidly blows up its token usage - and cost - so that’s not a solution. And curating a dump of old messages by hand to trim them down to a workable context for continuing a conversation gets onerous very quickly.
I imagine what’s happening is that there’s no sliding context window in the Playground, so the entire thread’s messages get used to generate each new message. But I’m reminded of the Python “Argument Clinic” … I could be arguing in my spare time.
I’m handling it by greatly expanding the System Prompt into a 3D-Kanban arrangement of comonads for XSCALE-format Epics, Story-form, Features, and BDD Scenarios, plus an explanation of YAGNI and Whole-Board Thinking. This both minimizes the Assistant’s use of tokens and constrains its behavior to minimize hallucination.
While that takes up more tokens in prompting than I’d like, it seems to give us enough room to work together happily without costing a fortune or losing context.
Yes: your huge thread conversation history and other input, which would cost $0.70 just to send to gpt-4-turbo, is not being limited by a cap on the number of messages; it is being limited because your tier 1 organization cannot make a single request that large under its rate limit.
You can use the API parameter truncation_strategy to reduce the number of past turns that are sent. Note, though, that file_search alone can use nearly 30,000 tokens on a single request, with no control over how many chunks are returned or how relevant the results are.
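Roughly, the parameter goes on the run; a minimal sketch, assuming the openai Python SDK and placeholder thread/assistant IDs:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Only the most recent turns of the thread are sent to the model,
# instead of the entire conversation history.
run = client.beta.threads.runs.create(
    thread_id="thread_abc123",   # placeholder thread ID
    assistant_id="asst_abc123",  # placeholder assistant ID
    truncation_strategy={
        "type": "last_messages",
        "last_messages": 10,     # how many recent turns to keep
    },
)
```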
Good day, everyone! I’m very glad that the issue of overly long threads concerns not only me. Unfortunately, I couldn’t figure out the methods Peter proposed for reducing the number of tokens in the processed thread. Perhaps they are too complex for me at this stage, but please correct me if I’m wrong. Have I understood correctly that we currently have no technical tool to limit the number of messages stored in a thread? Or is there a way to extract only certain messages from the thread for processing? Thank you!
With assistants, the threads may grow long (perhaps hitting some artificial limit mentioned earlier). However, the entire conversation is not necessarily sent.
If you were to use gpt-3.5-turbo-0613 with its 4k token context, the Assistants runtime would consider how much is needed for a reply, perhaps a stock 2,000 tokens, and then pass only as many recent chat turns from the thread as fit in the remaining token budget for that model.
Choose a 128k context model, and that thread can again switch to sending many more messages.
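If you want to estimate that budgeting yourself, here is a rough sketch of the idea using tiktoken; the o200k_base encoding, the per-message overhead, and the 2,000-token reply reserve are illustrative assumptions, not what the Assistants runtime actually does:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by the gpt-4o family

def recent_turns_that_fit(messages, context_window, reply_reserve=2000):
    """Walk the chat backwards and keep as many recent turns as fit the budget."""
    budget = context_window - reply_reserve
    kept = []
    for msg in reversed(messages):
        cost = len(enc.encode(msg["content"])) + 8  # rough per-message overhead
        if cost > budget:
            break
        budget -= cost
        kept.append(msg)
    return list(reversed(kept))

# A 4k-context model keeps far fewer turns than a 128k-context one:
# recent_turns_that_fit(history, 4096) vs. recent_turns_that_fit(history, 128000)
```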
OpenAI has not provided a token limit you can set yourself for messages, but they have provided an option for the number of past turns. That is the truncation option I mentioned earlier, which you can read about in the Assistants API documentation.
truncation_strategy is the technical tool to limit the number of messages passed to the AI.
Using file_search is where you have little control: up to 20 chunks of 800+ tokens each can be placed in the AI’s context. The Assistants framework doesn’t take into account your tier’s maximum that can be sent to the model when composing instructions, messages, tools, and retrieval or search context, and it can try to bill you $0.50+ per call. The only mitigation is to make your documents tiny so the chunks are tiny.
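Depending on the API version, there may be a little more control than that; the sketch below assumes the Assistants v2 chunking_strategy and file_search max_num_results parameters exist as shown, so verify the names against the current documentation:

```python
from openai import OpenAI

client = OpenAI()

# Smaller chunks at upload time mean smaller retrieved passages later.
client.beta.vector_stores.files.create(
    vector_store_id="vs_abc123",           # placeholder vector store ID
    file_id="file_abc123",                 # placeholder file ID
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 400,  # default is 800
            "chunk_overlap_tokens": 200,   # at most half the chunk size
        },
    },
)

# Fewer retrieved chunks per run means fewer tokens added to the context.
run = client.beta.threads.runs.create(
    thread_id="thread_abc123",
    assistant_id="asst_abc123",
    tools=[{"type": "file_search", "file_search": {"max_num_results": 5}}],
)
```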
I want to thank Mr. Regular for the advice provided. Thank you very much, it all worked out. Thanks to this method, I was able to reduce the length of the threads, and the model stopped giving an error for overly long input. Next time I will read the instructions more carefully.
I have had similar issues with GPT-4o. The exact same prompts and documents (only two small Word documents) worked on GPT-3.5, but I now receive the message “Rate limit reached for gpt-4o in organization…” in both the Playground and via the API.
“Rate limit reached for gpt-4o in organization…on tokens per min (TPM): Limit 30000, Used 13641, Requested 20691. Please try again in 8.664s.”
Could you explain the solution in simple terms? Do I have to make a change to the API code or some type of configuration parameter?
In my case, the problem was solved by modifying the code. I use the truncation_strategy parameter and simply send fewer dialogue turns to the model. Other methods were suggested earlier in the thread, but I couldn’t figure out how to implement them, while truncation_strategy helped me.
The detailed use of truncation_strategy is described in the Assistants API documentation.
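In concrete terms, the change is one extra argument when the run is created; a minimal end-to-end sketch, assuming the openai Python SDK and a placeholder assistant ID:

```python
from openai import OpenAI

client = OpenAI()

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Summarize where we left off yesterday.",
)

# truncation_strategy is the only change from a plain run: only the
# last few turns of the thread are sent to the model.
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id="asst_abc123",  # placeholder assistant ID
    truncation_strategy={"type": "last_messages", "last_messages": 6},
)

if run.status == "completed":
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    print(messages.data[0].content[0].text.value)
```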
Why should I have to send fewer messages if I’m ready to pay? It’s like the USSR: the price is listed, but you can’t actually get the product.