Hey, I'm using Chatrace to build an AI chatbot and I've noticed that it sometimes skips questions and won't answer them. Then I saw this error:
“Rate limit reached for 10KTPM-200RPM in organization on tokens per min. Limit: 10000 / min. Please try again in 6ms. Contact us through our help center at help.openai.com if you continue to have issues.”
Can anybody suggest what I should do?
Is the prompt too long? It's currently at about 16K characters.
You are encountering that error because gpt-4 has rather low tokens-per-minute limits.
Tokens are the AI's internal encoding, representing words and parts of words as pieces.
The rate limiter doesn't actually count the tokens, though: it estimates them from the characters you input. However, it does count the value you specify in max_tokens against the rate limit, in tokens.
If you specify a large max_tokens, you may be blocking yourself even though you only get a small response from that call. You can reduce the value of that parameter or, more effectively, remove it entirely so it doesn't count against you before you've even used the AI.
Improving the performance of the AI solution you've written comes down to your instructions and to sending the model well-formed messages.
The prompt's cost is estimated from the characters it contains, so trimming it can help too.
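As a rough sketch of how that estimate plays out, here is a budget calculation assuming the roughly-four-characters-per-token heuristic from OpenAI's help article and that max_tokens is counted in full; the helper name and constants are mine, for illustration only:

```typescript
// Rough per-request budget: the rate limiter is assumed to charge roughly
// (input characters / 4) plus the full max_tokens value up front.
// The divisor of 4 is the heuristic from OpenAI's help article, not a real tokenizer.
function estimateRequestTokens(promptChars: number, maxTokens?: number): number {
  const inputEstimate = Math.ceil(promptChars / 4);
  const outputReservation = maxTokens ?? 0; // omitting max_tokens reserves nothing up front
  return inputEstimate + outputReservation;
}

// Example: a 16,000-character prompt with max_tokens: 2000 reserves ~6,000 tokens
// of a 10,000 TPM budget before the model has produced a single word.
console.log(estimateRequestTokens(16_000, 2_000)); // 6000
```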
Instead, I would just use a software solution that holds back the next gpt-4 request until the following minute once you have sent over 15,000 characters, or whatever value you find eliminates the error.
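A minimal sketch of that idea, assuming a budget of roughly 15,000 characters per minute (the constant and helper name are mine; tune the budget to whatever makes the error disappear):

```typescript
// Per-minute character budget: once it is spent, hold the next request
// until the current one-minute window rolls over.
const CHAR_BUDGET_PER_MINUTE = 15_000; // tune to whatever eliminates the error

let windowStart = Date.now();
let charsSentThisWindow = 0;

async function waitForBudget(promptChars: number): Promise<void> {
  const elapsed = Date.now() - windowStart;
  if (elapsed >= 60_000) {
    // a new minute has started: reset the window
    windowStart = Date.now();
    charsSentThisWindow = 0;
  } else if (charsSentThisWindow + promptChars > CHAR_BUDGET_PER_MINUTE) {
    // budget exhausted: sleep until the next minute, then reset
    await new Promise((resolve) => setTimeout(resolve, 60_000 - elapsed));
    windowStart = Date.now();
    charsSentThisWindow = 0;
  }
  charsSentThisWindow += promptChars;
}

// usage: await waitForBudget(prompt.length); then send the gpt-4 request
```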
A more advanced approach is to use the rate-limit-remaining value returned in the HTTP headers, but that won't tell you how long to wait before sending another request unless your usage pattern is bursts of requests each minute.
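If you do go the header route, OpenAI's responses carry x-ratelimit-* headers you can read directly; the names below are the documented ones at the time of writing, so verify them against your own responses. A minimal sketch with fetch:

```typescript
// Make one chat completions call and read the rate-limit headers off the response.
// Header names (x-ratelimit-remaining-tokens, x-ratelimit-reset-tokens) should be
// verified against your own responses; the model and prompt are placeholders.
async function checkRateLimitHeaders(): Promise<void> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4",
      messages: [{ role: "user", content: "Hello" }],
    }),
  });

  const remainingTokens = response.headers.get("x-ratelimit-remaining-tokens");
  const resetTokens = response.headers.get("x-ratelimit-reset-tokens"); // e.g. "6ms" or "1s"
  console.log({ remainingTokens, resetTokens });
}
```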
Thank you!
Do you have an estimate of how many characters the prompt should be to avoid this error?
Currently I'm at about 17k characters and get this often.
What I do is build a simple model of the token system. It's a variable that starts at the tokens-per-minute rate limit; every second I add the tokens-per-minute limit divided by 60 and subtract any tokens sent that second. If the value in the variable goes above the maximum tokens-per-minute limit I cap it at that limit, and if the value is close to 0 I wait until it is greater than the number of tokens I need to send that second. It's a basic rate limiter that matches what OpenAI are doing on their side, so you never bump into the limits.
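Here is a sketch of that token-bucket model, using the 10,000 TPM figure from the error message; the names and the one-second sleep are my own choices:

```typescript
// Token bucket mirroring the TPM limit: it refills at limit/60 tokens per second,
// is capped at the full per-minute limit, and a request has to drain its estimated
// cost from the bucket before it is sent.
const TOKENS_PER_MINUTE = 10_000; // from the error message; adjust for your account
const REFILL_PER_SECOND = TOKENS_PER_MINUTE / 60;

let bucket = TOKENS_PER_MINUTE;
let lastRefill = Date.now();

function refill(): void {
  const now = Date.now();
  const elapsedSeconds = (now - lastRefill) / 1000;
  bucket = Math.min(TOKENS_PER_MINUTE, bucket + elapsedSeconds * REFILL_PER_SECOND);
  lastRefill = now;
}

async function take(estimatedTokens: number): Promise<void> {
  refill();
  while (bucket < estimatedTokens) {
    // not enough budget yet: wait a second, refill, and check again
    await new Promise((resolve) => setTimeout(resolve, 1000));
    refill();
  }
  bucket -= estimatedTokens;
}

// usage: await take(estimatedCostOfRequest); then call the API
```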
…I will pay you to put this in my app. I have a simple Express backend, and a chatbot that's hitting the rate limit more and more. Looking for help with this if you're down @Foxalabs Foxabilo
You can do an exact rate limiter: FIFO message token entries count against the rate until they expire after a minute. It works like the “25 GPT-4 messages every three hours” limit, where you have to wait for the first of your 25 to drop out before you can send the 26th. Store the token metadata along with the quota remaining from the headers; with the headers you can better align and adapt to the rollover second within that minute. You also have to understand that the prompt input that could get you blocked is only estimated, while max_tokens is counted exactly.
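A rough sketch of that FIFO window, assuming each sent request is recorded with its estimated token cost and expires after 60 seconds (the names and the 10,000 TPM figure are assumptions):

```typescript
// Sliding-window limiter: every sent request is logged with its estimated token
// cost and timestamp; entries older than one minute drop out of the FIFO, and a
// new request waits until enough old entries have expired.
type SentEntry = { sentAt: number; tokens: number };

const TOKENS_PER_MINUTE = 10_000; // replace with your account's limit
const sentLog: SentEntry[] = [];

function tokensInWindow(now: number): number {
  // expire entries older than 60 seconds from the front of the FIFO
  while (sentLog.length > 0 && now - sentLog[0].sentAt > 60_000) {
    sentLog.shift();
  }
  return sentLog.reduce((sum, entry) => sum + entry.tokens, 0);
}

async function reserve(tokens: number): Promise<void> {
  while (tokensInWindow(Date.now()) + tokens > TOKENS_PER_MINUTE) {
    if (sentLog.length === 0) break; // this request alone exceeds the limit
    // wait for the oldest entry to fall out of the window, then re-check
    const waitMs = 60_000 - (Date.now() - sentLog[0].sentAt) + 10;
    await new Promise((resolve) => setTimeout(resolve, Math.max(waitMs, 0)));
  }
  sentLog.push({ sentAt: Date.now(), tokens });
}
```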
The oddity is how the rate limit blocks requests and when they count against you. You can dash off 100 short “write Shakespeare, 1000 words” prompts in a second without a max_tokens value to count against you; then for two minutes you won't be able to send anything.
We are using a client-side rate limiter to limit and prioritize gpt-4 requests, but we are still struggling to tune it properly. How are the tokens estimated from the character count?
This is what we have in the code -
estimated_tokens: Math.max(
tokens + TOKEN_MARGIN,
ALL_MODELS[modelVariant].maxModelTokens -
ALL_MODELS[modelVariant].requestTokens,
// sometimes OpenAI does character count / 4
// see - https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
(message.length + this.systemMessage.length) / 4,
).toString(),
Note that in the above snippet, the tokens variable holds the token count estimated via tiktoken.
Our scheduler only works because we send requests back to the scheduler a few more times when we hit the rate limits.
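For what it's worth, a bare-bones version of that re-queue pattern, retrying on HTTP 429 with a growing delay (the retry count and delays are arbitrary):

```typescript
// Retry a request a few times when the API answers 429 (rate limited),
// backing off a little longer on each attempt: 1s, 2s, 4s, ...
async function sendWithRetry(
  doRequest: () => Promise<Response>,
  maxRetries = 3,
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await doRequest();
    if (res.status !== 429 || attempt >= maxRetries) {
      return res;
    }
    await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
  }
}
```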
If you specify a large max_tokens, you may be blocking yourself even though you only get a small response from that call. You can reduce the value of that parameter or, more effectively, remove it entirely so it doesn't count against you before you've even used the AI.
That was EXACTLY what fixed the issue for me. Thank you!