Rate limit errors (429) on a Tier 2 account with gpt-3.5-turbo

I'm in Tier 2 and have had a paid account for about a year, without ever missing a payment. My current soft limit is $25 (I'm notified via email when I reach it) and my hard limit is $100 (monthly budget). I've exceeded the soft limit only once, a few months ago. In the last couple of months my monthly charge has been $10-20. I'm using a combination of gpt-4-1106-preview and gpt-3.5-turbo; in my last tests, I've used gpt-3.5-turbo.

According to my tier, I'm allowed 3.5K RPM and 80K TPM with gpt-3.5-turbo. I'm allowed even more with gpt-4-1106-preview, but its usage is restricted while the model is under evaluation, and they don't specify how or to what extent.

In both cases, I repeatedly get the error 429 rate_limit_exceeded. Here's an example extract from the last test on gpt-3.5-turbo:
“Rate limit reached for gpt-3.5-turbo in organization on tokens_usage_based per min: Limit 80000, Used 78918, Requested 2068. Please try again in 739ms. Visit https://platform.openai.com/account/rate-limits to learn more”

I'm not sure how this is possible, since I don't see myself reaching those limits in my account's usage. In the last 2 days, I've been charged only $0.44 on gpt-3.5-turbo. Nor are my prompts and expected responses that heavy: each prompt can be at most 800 words (I placed a throttle in my app), and it explicitly asks GPT to deliver a response half that length, i.e. 400 words. Often GPT delivers even less than that.

In my last test, I sent 158 batches for processing through a FastAPI (Python 3) API, using async calls and a queue, with at least a 2-second delay between each of up to 4 tries in case of an error. I received back only 36 successful responses; the rest failed due to the rate limit.
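
For reference, the retry pattern per batch is roughly this (a simplified sketch, not my exact code; the function and variable names are illustrative):

    import asyncio
    import httpx

    MAX_TRIES = 4
    RETRY_DELAY = 2  # wait at least 2 seconds before each retry

    async def summarize_batch(client: httpx.AsyncClient, payload: dict, headers: dict):
        """Send one batch to the chat completions endpoint, retrying on errors."""
        for attempt in range(MAX_TRIES):
            resp = await client.post(
                "https://api.openai.com/v1/chat/completions",
                headers=headers,
                json=payload,
                timeout=999,
            )
            if resp.status_code == 200:
                return resp.json()
            await asyncio.sleep(RETRY_DELAY)  # back off before the next try
        return None  # all 4 tries failed (counted as a rate-limit failure)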

I must be missing something in the picture, but I don't know how to explain the gap between what I'm told I can use and what I can actually use. It can't be that OpenAI is misleading its customers about usage and throttling. At the very least, I'm not sure I can trust the usage data in their dashboard.

Is that via direct API calls or via Assistants?

Is that using chat with functions, or some chat history?

Are you using non-streaming calls, logging the reported token usage, and logging the rate-limit headers returned with each request?

Any chance your timeouts are low and you are hanging up on a model that is still completing for you? The OpenAI libraries can retry in such a scenario.


I have a theory that, due to the algorithm, you effectively get a "constant refill" of rate limit with the API. Shooting off 80,000 tokens' worth in the same second depletes your allowance, and pushes out the full reset time, more than spreading the same amount over a minute does.

Just yesterday I posted a little code for grabbing the "try again in" value from the error and actually waiting that long, but you'd have to apply it to pausing your queue (untested, because I'd have to send 17k tokens per second to trigger it). Or you can monitor the rate-limit headers and throttle early.
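
Something like this (an untested sketch, not the exact code from that post) for pulling the wait time out of the 429 message and sleeping on it before letting the queue continue:

    import re
    import time

    def sleep_for_rate_limit(error_message: str) -> None:
        """Parse 'Please try again in 739ms' / 'in 1.422s' out of a 429 error
        message and sleep that long before retrying."""
        match = re.search(r"try again in ([\d.]+)\s*(ms|s)", error_message)
        if match:
            value, unit = float(match.group(1)), match.group(2)
            time.sleep(value / 1000 if unit == "ms" else value)

    # e.g. on a 429 response:
    # sleep_for_rate_limit(resp.json()["error"]["message"])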

@_j Let me try to give more details below:

  • Among other things, I always find it confusing how to differentiate between roles in the OpenAI world. If it helps, here's an extract from my script where the API call is made:
    api_url = "https://api.openai.com/v1/chat/completions"
    headers = {"Authorization": f"Bearer {OAI_API}"}
    request_message = f"Generate a slightly shorter version of the following text that's at least {summarize_target_length} words long:\n{text}"
    payload = {"model": "gpt-3.5-turbo",
               "messages": [{"role": "system", "content": request_message}],
               "max_tokens": max_tokens,
               "temperature": temperature}
    While the role is specified as "system", the GPT response usually reads "assistant" by default. Here's an extract from the output of a successful response:
    Response: {'id': 'chatcmpl-####f', 'object': 'chat.completion', 'created': ######, 'model': 'gpt-3.5-turbo-0613', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '…'}, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 1022, 'completion_tokens': 398, 'total_tokens': 1420}}
    And here's an extract of the error from a failed example:
    "error": {
        "message": "Rate limit reached for gpt-3.5-turbo in organization org-Xez8ja2nqmgN39FQJL6EGhCF on tokens_usage_based per min: Limit 80000, Used 77801, Requested 4096. Please try again in 1.422s. Visit https://platform.openai.com/account/rate-limits to learn more.",
        "type": "tokens_usage_based",
        "param": null,
        "code": "rate_limit_exceeded"
    }

  • I don’t think I’m using any chat history. It’s just the app splitting a large text into batches and then calling the API to get a summary of each batch with the same request message as shared above.

  • Yes, it can best be described as non-streaming, if by that you mean sending a single request and receiving a single response per request.

  • My timeout is always set to 999

The thing is, I don't think I exceed the number of requests or tokens per minute. Each prompt is limited to 800 words. I took a sample of a few of my requests and ran them through the tokenizer at https://platform.openai.com/tokenizer, getting about 1K tokens per request. As you can see from the successful extract, GPT is quite frugal on completion, averaging about 300-600 tokens per response. I usually specify max_tokens at 4K, but I also ran tests with 2K and 3K, with similar results.

On top of that, I'm queuing my requests and adding at least a 2-second break between retries, more than what the API asks for when it gives me an error.

First, we need a little glossary

A model’s context length is a shared memory space for both accepting input, and then forming a response.
  • gpt-3.5-turbo: 4,096 tokens shared.
  • gpt-4-1106-preview: 128,000 tokens shared, but a response is capped at 4,096 tokens.

max_tokens is a reservation of context length only for forming output. Specifying it will subtract the entire amount from the remaining context length window for accepting input.

max_tokens also informs the API endpoint of how much a request should count against you, before there is an actual response that can be counted.

Rate-limit blocking is done by estimating the size of the input rather than actually counting tokens. Additionally, your max_tokens setting is immediately counted against your consumed rate.
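
Back-of-the-envelope with your own numbers (the estimator on OpenAI's side isn't published, so this is only an approximation):

    # What the rate limiter counts per request, before any response exists
    estimated_input_tokens = 1_000      # ~800-word prompt, roughly 1 token per 4 characters
    reserved_output_tokens = 4_096      # your max_tokens, counted up front
    counted_per_request = estimated_input_tokens + reserved_output_tokens  # ~5,100

    tpm_limit = 80_000
    print(tpm_limit // counted_per_request)   # ~15 requests in flight already hit the limit

    # versus what is actually generated:
    actual_per_request = 1_000 + 450          # completions average ~300-600 tokens
    print(tpm_limit // actual_per_request)    # ~55 requests' worth of real usage per minute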


Here’s what I think is the problem:

  • you are sending parallel async calls
  • you are specifying a max_tokens far larger than the AI is likely to write, and close to the model's maximum context length (or the maximum it can produce)
  • all of the unresolved requests are consuming token rate, because each one's parallel reservation of tokens is counted against you immediately.

Solution:

  • don’t use max_tokens
  • budget your average usage per call based on about 1 token per 4 characters of input, plus an estimate of the output size.
  • space the rate out, running fewer parallel processes or adding delays so that a minute's worth of requests is spread over a full minute (see the pacing sketch below).

Result:
Without max_tokens you will be able to send input much faster. You may only see the effect minutes later, when the true output size of the completed response is counted against you.
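
Here's a minimal sketch of that spacing, assuming a token budget below your limit and an estimated per-call size (both numbers, and the send_one coroutine, are illustrative placeholders):

    import asyncio

    TPM_BUDGET = 60_000          # stay under the 80K TPM limit, with headroom
    EST_TOKENS_PER_CALL = 1_500  # ~1K prompt plus a few hundred tokens of summary

    async def paced_send(batches, send_one):
        """Send batches sequentially, spaced so a minute's worth of estimated
        tokens is spread over a full minute. `send_one` is whatever coroutine
        actually makes the API call."""
        delay = 60 / (TPM_BUDGET / EST_TOKENS_PER_CALL)  # seconds between sends
        results = []
        for batch in batches:
            results.append(await send_one(batch))
            await asyncio.sleep(delay)
        return results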

Solution 2:
In a "rewrite smaller" task, you can use the token-counting library tiktoken: measure the actual input size and set max_tokens to the same value, which leaves leeway in case the AI can't shrink the text much or writes extra.

Result:
Rate-limit accounting will then be much closer to the tokens actually consumed, so you don't have to worry about the effects of over-reserving affecting you for a longer period afterwards.
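
A minimal sketch of that, assuming the cl100k_base encoding used by gpt-3.5-turbo (the leeway factor is just an illustrative choice):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by gpt-3.5-turbo

    def max_tokens_for_rewrite(text: str, leeway: float = 1.1) -> int:
        """Measure the real input size and reserve about the same for the output,
        with a little leeway in case the model can't shrink the text much."""
        return int(len(enc.encode(text)) * leeway)

    # then: payload["max_tokens"] = max_tokens_for_rewrite(text)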


@_j Genius! Thank you so much. Your suggestion made it work. I hadn't grasped that max_tokens could be self-imposing a rate limit until your answer; by my own calculation, it shouldn't have been causing excessive rate-limit errors. But you're right! By removing max_tokens, I sort of liberated my API calls.

Perhaps noteworthy: despite removing max_tokens, I was still dealing with the API failing to deliver substantial summaries, often far smaller than requested in my prompts. Once I moved back from gpt-3.5-turbo to gpt-4-1106-preview, my summaries became far bigger and of higher quality.

Great to hear!

gpt-3.5-turbo-1106 also exists, and it has a bit more of the "write longer without substance" quality to it.

Now when you fill out the "rate increase" form for more tokens, you can say: "because your backwards system of counting max_tokens against me before you even generate responses limits me greatly, while that parameter is needed to maintain the safety of not producing 4,000 tokens of looping nonsense output."


You’re a diamond! Thank you so much!! :clap: :gem: :100: