Hey, I'm using Chatrace to build an AI chatbot, and I've noticed that it sometimes skips questions and won't answer them. Then I saw this error:
“Rate limit reached for 10KTPM-200RPM in organization on tokens per min. Limit: 10000 / min. Please try again in 6ms. Contact us through our help center at help.openai.com if you continue to have issues.”
Can anybody suggest what I should do?
Is the prompt too long? It's currently about 16K characters.
The reason you are encountering that error is that gpt-4 has rather low tokens-per-minute limits.
Tokens are the AI's internal encoding, representing words and parts of words as pieces.
The rate limit doesn’t actually count the tokens though: it estimates based on characters you input. However, it does consider the value you specify in max_tokens as counting against the rate limit in tokens.
If you specify a large max_tokens, you may be blocking yourself even though you only get a small response from that call. You can reduce the value of that parameter or, more effectively, remove it entirely so it doesn't count against you before you've even used the AI.
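To see why a large max_tokens hurts, here is a rough back-of-the-envelope estimator of what the rate limiter might charge a request. The ~4 characters per token ratio is a common rule of thumb, not an official formula, and the exact server-side estimation is an assumption here:

```python
def estimated_request_tokens(prompt: str, max_tokens: int = 0) -> int:
    """Rough client-side guess at what the rate limiter will charge.

    Assumption: the limiter estimates about one token per 4 characters
    of input, plus the full max_tokens reserved for the response.
    """
    return len(prompt) // 4 + max_tokens


# A 16,000-character prompt alone estimates to ~4,000 tokens;
# adding max_tokens=4000 doubles the charge against a 10,000 TPM limit.
print(estimated_request_tokens("x" * 16000))        # prompt only
print(estimated_request_tokens("x" * 16000, 4000))  # prompt + reserved output
```

With numbers like these, two or three back-to-back requests can exhaust a 10K TPM budget, which matches the error you're seeing.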
The performance of the AI solution you've written will have to be improved through your instructions and by sending the model correctly formed messages.
Thanks! Will making the prompt shorter help in this situation or not?
The prompt's size is estimated from the characters it contains, so shortening it can help.
Rather, I would use a software solution that holds back the next gpt-4 request until the following minute once you have sent over 15,000 characters, or whatever value you find eliminates the error.
A more advanced approach is to use the rate-limit-remaining value returned in the HTTP headers, but that won't tell you how long to wait before sending another request unless your usage pattern is bursts of requests each minute.
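The hold-back idea above can be sketched as a small per-minute character budget. The 15,000-character budget is the value suggested here, not an official limit, and the injectable clock/sleep parameters are just there to make the throttle easy to test:

```python
import time


class MinuteThrottle:
    """Block the next request once a per-minute character budget is spent."""

    def __init__(self, char_budget=15000, clock=time.monotonic, sleep=time.sleep):
        self.char_budget = char_budget  # characters allowed per 60s window
        self.clock = clock
        self.sleep = sleep
        self.window_start = clock()
        self.chars_sent = 0

    def wait_for_slot(self, prompt_chars):
        """Sleep until `prompt_chars` fits in the current minute's budget."""
        now = self.clock()
        if now - self.window_start >= 60:
            # A fresh minute has started; reset the budget.
            self.window_start = now
            self.chars_sent = 0
        if self.chars_sent + prompt_chars > self.char_budget:
            # Budget exhausted: sleep out the remainder of this minute.
            self.sleep(60 - (now - self.window_start))
            self.window_start = self.clock()
            self.chars_sent = 0
        self.chars_sent += prompt_chars


# Usage: call throttle.wait_for_slot(len(prompt)) before each API request.
```

This matches the "wait until the next minute" pattern: it never spreads requests out, it just refuses to overshoot the per-minute budget.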
Do you have an estimate of how many characters the prompt should be to not encounter such an error?
Currently I'm at about 17k characters and get this often.
What I do is build a simple model of the token system. It's a variable initially set to the tokens-per-minute rate limit. Every second I add on the per-minute limit divided by 60 and subtract any tokens sent that second. If the value in the variable exceeds the per-minute limit, I cap it at that limit; if the value is close to 0, I wait until it exceeds the number of tokens I need to send. It's a basic rate limiter that matches what OpenAI are doing on their side, ensuring you never bump into the limits.
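The scheme described above is a classic token bucket. Here is a minimal sketch of it, assuming the 10,000 TPM limit from the error message; the clock/sleep parameters are only there to make the limiter testable:

```python
import time


class TokenBucket:
    """Token-bucket limiter mirroring a tokens-per-minute rate limit."""

    def __init__(self, tpm_limit=10000, clock=time.monotonic, sleep=time.sleep):
        self.tpm_limit = tpm_limit
        self.refill_per_sec = tpm_limit / 60  # budget regained each second
        self.tokens = float(tpm_limit)        # start with a full minute's budget
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def _refill(self):
        # Add back elapsed-time budget, capped at the per-minute limit.
        now = self.clock()
        self.tokens = min(self.tpm_limit,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now

    def acquire(self, tokens_needed):
        """Block until `tokens_needed` tokens are available, then spend them."""
        self._refill()
        while self.tokens < tokens_needed:
            deficit = tokens_needed - self.tokens
            self.sleep(deficit / self.refill_per_sec)
            self._refill()
        self.tokens -= tokens_needed


# Usage: bucket.acquire(estimated_tokens) before each gpt-4 request.
```

Because the bucket refills continuously rather than once a minute, this smooths requests out instead of stalling for a whole minute, which is closer to how the server-side limiter behaves.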