Moderation raises 429 rate limit error for long input

When the input to the Moderation API is long, it raises a 429 rate limit error without actually reaching the rate limit. This is misleading as waiting and resending the same request would result in the same error.

Code to Reproduce the Error (edited again: another user was able to reproduce this error; see the replies below)

from openai import OpenAI
import time
import requests
client = OpenAI()
response = requests.get("https://raw.githubusercontent.com/da03/moderation_issue/main/example.txt")

flag_produce_error = True # when True, produces a 429 error; False, no error
if flag_produce_error:
    text = response.text[:6000]
else:
    text = response.text[:5999]

print (f'Number of characters: {len(text)}')
try:
    response = client.moderations.create(input=text)
except Exception as e:
    print ('error', e)

Result of the Above Code

Using 6,000 characters (by setting flag_produce_error to True in the above code) fails with a 429 rate limit error, while using 5,999 characters (by setting flag_produce_error to False) succeeds.

Number of characters: 6000
error Error code: 429 - {'error': {'message': 'Rate limit reached for text-moderation-007 in organization org-WAXHZpHsjoSbCdNbzcNU959e on tokens per min (TPM): Limit 150000, Used 138657, Requested 20493. Please try again in 3.66s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}

Expected Behavior

It should have raised an error code/message reflecting that the underlying issue is input length, not the rate limit.
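Until the message is fixed, one way to sanity-check such a 429 client-side is to pull the token figures out of the error text. parse_429_message below is a hypothetical helper, not part of the API; the message format follows the error shown above:

```python
import re

def parse_429_message(msg: str) -> dict:
    """Extracts the Limit/Used/Requested token figures from a 429 error message."""
    figures = {}
    for key in ("Limit", "Used", "Requested"):
        m = re.search(rf"{key} (\d+)", msg)
        if m:
            figures[key.lower()] = int(m.group(1))
    return figures

msg = ("Rate limit reached for text-moderation-007 on tokens per min (TPM): "
       "Limit 150000, Used 138657, Requested 20493. Please try again in 3.66s.")
print(parse_429_message(msg))  # {'limit': 150000, 'used': 138657, 'requested': 20493}
```

If "Requested" is far larger than your own token count of the same input, the 429 is more likely this length/encoding problem than genuine rate pressure.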

Follow-Up Question

What’s the context length limit of Moderation? The example I used is taken from real user-ChatGPT conversations (WildChat) so I thought it should work.

Update on Apr 17, 2024

After a deeper analysis on the WildChat dataset inspired by pondin6666’s observation, I suspect that these errors are linked to inputs containing non-Latin characters.

Key Findings:

  • Language-specific Error Rates: The errors disproportionately affect texts in certain languages:
    • Korean: Accounts for 66.44% of all errors, yet only 0.51% of the dataset.
    • Chinese: Makes up 10.96% of errors, 13.54% of the dataset.
    • English: Constitutes 6.85% of errors, while making up 54.92% of the dataset. These cases mostly contain special characters like ψ or ‱
    • Japanese, Hindi: Also show significant error rates compared to their presence in the dataset.
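One contributing factor may simply be that character count is a poor proxy for payload size in non-Latin scripts. A stdlib-only illustration (the sample strings are mine, not from WildChat):

```python
samples = {
    "English": "Hello, how are you today?",
    "Chinese": "你好，你今天怎么样？",
    "Korean": "안녕하세요, 오늘 어떠세요?",
}
for lang, s in samples.items():
    # CJK and Hangul characters take 3 bytes each in UTF-8, vs 1 for ASCII
    print(f"{lang}: {len(s)} chars -> {len(s.encode('utf-8'))} UTF-8 bytes")
```

Token counts follow a similar pattern: the same character budget goes much further in English than in Korean or Chinese.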

Practical Workaround:

In response, I’ve written a workaround by segmenting large text inputs into smaller chunks. The implementation of this workaround is detailed in the repository linked below.

For those interested in replicating the issue or trying out the workaround, I have documented everything, including code and failing examples in different languages, in this GitHub repository: GitHub Repo Link.

The documentation no longer lists an input limit. I sent it 2.5M tokens, with only an extended wait for the response indicating it was working (a size significantly in excess of anyone's TPM; my org doesn't get x-ratelimit headers back from moderations, so I couldn't watch the accounting).

Since the prior documentation already described that moderations internally does chunking and then just reports the highest score, it makes sense that the technique could be vastly extended.
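That documented merge rule is straightforward to reproduce client-side when doing your own chunking: OR the flags, take the per-category maxima. A minimal sketch (the dict shape mirrors a Moderation result; the helper itself is hypothetical):

```python
def merge_moderation_results(results: list[dict]) -> dict:
    """Merges per-chunk moderation results the way the docs describe:
    OR the booleans, take the max of each category score."""
    merged = {
        "flagged": any(r["flagged"] for r in results),
        "category_scores": {},
    }
    for r in results:
        for cat, score in r["category_scores"].items():
            merged["category_scores"][cat] = max(
                merged["category_scores"].get(cat, 0.0), score)
    return merged

chunks = [
    {"flagged": False, "category_scores": {"violence": 0.01}},
    {"flagged": True,  "category_scores": {"violence": 0.97}},
]
print(merge_moderation_results(chunks))
# {'flagged': True, 'category_scores': {'violence': 0.97}}
```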

Yes, I went a bit overboard writing example code to get the x-ratelimit headers…
import json, openai, tiktoken  # tiktoken must be installed for token counting

def tcount(text: str) -> int:
    """Counts the encoded tokens of the input text using BPE."""
    t = tiktoken.get_encoding("cl100k_base")
    return len(t.encode(text))

def process_dict(data: dict) -> dict:
    """Processes the input dictionary to filter and format its contents."""
    processed_data = {}
    for k, v in data.items():
        if '/' not in k and '-' not in k:
            if isinstance(v, float):
                processed_data[k] = f"{v:.6f}"
            else:
                processed_data[k] = v
    return processed_data

def mod_call(input_text: str) -> tuple:
    """Handles the moderation API call.
    Returns the categories, category_scores, and headers."""
    client = openai.Client()
    try:
        response = client.moderations.with_raw_response.create(input=input_text)
    except Exception as e:
        if hasattr(e, 'message'):
            print(f"{e.__class__.__name__}: {e.message}")
        else:
            print(f"ERROR: {e}")
        raise  # response would be undefined below, so re-raise after logging
    
    categories = response.parse().results[0].categories.model_dump()
    category_scores = response.parse().results[0].category_scores.model_dump()
    headers = response.http_response.headers

    return categories, category_scores, headers

def print_report(categories: dict, category_scores: dict = None) -> None:
    """Function to print formatted categories and scores.
    Scores are optional."""
    processed_cats = process_dict(categories)
    print(json.dumps(processed_cats, indent=2))
    if category_scores is not None:
        processed_scores = process_dict(category_scores)
        print(json.dumps(processed_scores, indent=2))


def print_x_headers(headers: dict) -> None:
    """Prints just x-ratelimit headers"""
    x_headers = {k: v for k, v in headers.items() if k.lower().startswith('x-rate')}
    print(json.dumps(x_headers, indent=2))

def main():
    input_text = "Go and kill yourself, and your entire race! " * 5  # Example input
    # Uncomment below to use a file's content as repeated input
    # with open('my_file.txt', 'r') as file:
    #     input_text = file.read() * 2

    print(f"tokens: {tcount(input_text)}, {input_text[:40]}")

    categories, category_scores, headers = mod_call(input_text)
    print_report(categories)  # can add category_scores as a second argument
    print_x_headers(headers)

if __name__ == "__main__":
    main()

Now to your error:

The rate error you provided looks like you already had a send count that would take five more minutes to reset back to the original value of 150k available.

Thus, to really see what's going on, you'd have to wait longer than the reset time you were given. Send one token to get the updated, reset rate.

Then, to test, you can run a loop and see that the “used” field of the headers increases by no more than you actually sent, and that it continuously refills your allowance.
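For orgs that do get the headers back, the relevant fields follow OpenAI's documented x-ratelimit-* naming. A small sketch of reading them off a raw response's headers (the sample dict below is illustrative, not a real response):

```python
def read_rate_state(headers: dict) -> dict:
    """Pulls the token-bucket state out of x-ratelimit-* headers, if present."""
    keys = ("x-ratelimit-limit-tokens",
            "x-ratelimit-remaining-tokens",
            "x-ratelimit-reset-tokens")
    return {k: headers[k] for k in keys if k in headers}

sample = {"x-ratelimit-limit-tokens": "150000",
          "x-ratelimit-remaining-tokens": "129507",
          "x-ratelimit-reset-tokens": "8.197s",
          "x-request-id": "req_redacted"}
print(read_rate_state(sample))
```

In a loop, watching "remaining" climb back toward "limit" between calls confirms the bucket is refilling as advertised.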

The moderations endpoint is for checking language model inputs, not for arbitrary tasks.

If the rate is still too low for you, you can send only the new, original user inputs that would be placed into context, not the whole context. The users also don't have to know if you randomly drop checks due to rate errors.

Thanks for the reply and for sharing the code for counting tokens and extracting states from header!

you already had a send count that would take five more minutes

I assume you meant 5 seconds? The error message above said “Please try again in 5.046s”. That said, to double-check whether it's a matter of wait time, I waited 5 minutes before sending the request again and still got the same error, as shown below (the code below can now reproduce the issue, since I use requests to download the prompt that triggered the error):

from openai import OpenAI
import time
import requests
client = OpenAI()

# Get text, an example taken from the WildChat dataset
response = requests.get("https://raw.githubusercontent.com/da03/moderation_issue/main/example.txt")
text = response.text

# 1. 429 Rate Limit Error for long request
print (f'Number of characters: {len(text)}')
try:  
    response = client.moderations.create(input=text)
except Exception as e:
    print ('error', e)

print ()

# 2. Same error after waiting for 5 minutes
# Make sure that 5 minutes have passed
print ('Waiting for 5 min')
time.sleep(300)
print (f'Number of characters: {len(text)}')
try:
    response = client.moderations.create(input=text)
except Exception as e:
    print ('error', e)

# 3. No error for shorter request, even after only waiting for 1 min
time.sleep(60)
text = text[:5000]
print (f'Number of characters: {len(text)}')
try:
    response = client.moderations.create(input=text)
except Exception as e:
    print ('error', e)

The output of the above code is below. Note that the first two attempts failed (the second despite waiting 5 minutes, while the first error message said “Please try again in 6.406s”), and the third succeeded after the input was truncated to 5,000 characters, so I believe it's simply a length issue:

Number of characters: 6872
error Error code: 429 - {'error': {'message': 'Rate limit reached for text-moderation-007 in organization org-WAXHZpHsjoSbCdNbzcNU959e on tokens per min (TPM): Limit 150000, Used 142589, Requested 23427. Please try again in 6.406s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}

Waiting for 5 min
Number of characters: 6872
error Error code: 429 - {'error': {'message': 'Rate limit reached for text-moderation-007 in organization org-WAXHZpHsjoSbCdNbzcNU959e on tokens per min (TPM): Limit 150000, Used 143513, Requested 23427. Please try again in 6.776s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
Number of characters: 5000

see that the “used” field of headers increases by no more than you actually sent

Unfortunately, the x-header-extracting part of the code you provided (print_x_headers) only returns an empty dictionary. My openai version is 1.19.0. When I inspect the returned headers, they don't contain anything that starts with “x-rate”, as shown below:

Headers([('date', 'Tue, 16 Apr 2024 01:02:59 GMT'), ('content-type', 'application/json'), ('transfer-encoding', 'chunked'), ('connection', 'keep-alive'), ('openai-version', '2020-10-01'), ('openai-organization', '[redacted]'), ('x-request-id', 'req_33b0580cd018110c4e8a0e5c0e78ae79'), ('openai-processing-ms', '157'), ('strict-transport-security', 'max-age=15724800; includeSubDomains'), ('cf-cache-status', 'DYNAMIC'), ('set-cookie', '[redacted]'), ('set-cookie', '[redacted]'), ('server', 'cloudflare'), ('cf-ray', '[redacted]'), ('content-encoding', 'gzip'), ('alt-svc', 'h3=":443"; ma=86400')])

Other Observations

Lastly, I noticed that the errors mostly happen for non-English languages, such as Chinese and Korean, but this is simply my observation on the WildChat dataset; I did not measure it quantitatively.

TPM is not checked per request session; it's based on the OpenAI account.
Are you sure this account only accepted your moderation requests while you were testing? Other API requests at the same time occupy the quota.

It looks like a normal rate limit issue: your moderation text is very large, and your quota cannot keep up with your rate of token usage, so you face the 429 error.
So you may need to upgrade your tier.

My account ([my email]) is tier 5, and I only used the Moderation API for this experiment (with me being the only user). Besides, there is a control group: as shown in the code above, if I simply truncate the input to 5,000 characters, there are no issues. Note that the text before truncation has 6,872 characters; I don't think it's possible to exceed the 150,000 TPM / 1,000 RPM limit of my account (according to https://platform.openai.com/account/limits) when I only ran this single prompt.

Then it seems the moderations endpoint isn't giving the headers typical of others. I get "x-request-id": "req_78f451... as my only x-header back, and thought it was peculiar but not unexpected.

It seems it is particular to your account, then, or there is something specific about your data that makes the “remaining” go crazy. I added 4 main loops to my script in sequence, each sending 58,746 tokens of OpenAI docs the moderation endpoint might like to read (I had to pick docs without <|endoftext|>, which is refused by moderations), and got no rate errors, nor any report of what was remaining.

Since OpenAI staff don't patrol here looking for things to fix, you could send a help message (try not to trigger a bot contractor by saying “account issue”), or even use “request an exception” on limits and see if it gets read by a human.

Sorry, the moderation rate limit seems separate from other model usage, and we don't know how OpenAI manages it.

If you are the only one testing the API in this account, and after sleeping 300 seconds the quota window still shows 143513 used, which is even higher than five minutes before, there might be some problem or secret that we do not know.

Hmm, the request headers I printed out were produced using your code (including the prompt); I just added a print statement in print_x_headers.


In fact, I was able to pinpoint where it breaks: at 6,000 characters (by setting flag_produce_error to True in the code below), it breaks; at 5,999 characters (by setting flag_produce_error to False), it works. There doesn't seem to be anything special about the 5,999th character (it's the Chinese character “要”, which means “want” or “need”).

from openai import OpenAI
import time
import requests
client = OpenAI()
response = requests.get("https://raw.githubusercontent.com/da03/moderation_issue/main/example.txt")

flag_produce_error = True # when True, produces a 429 error; False, no error
if flag_produce_error:
    text = response.text[:6000]
else:
    text = response.text[:5999]

print (f'Number of characters: {len(text)}')
try:
    response = client.moderations.create(input=text)
except Exception as e:
    print ('error', e)

By the way, I want to mention that I am not the only one who has seen this error. A collaborator of mine found the same issue and fixed it by truncating the input whenever an error is encountered, like below:

json_data["input"] = json_data["input"][:int(0.9*len(json_data["input"]))]

But I hope to figure out the real underlying issue.
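Wrapped in a retry loop, that truncation strategy looks roughly like this (a sketch: moderate stands in for the real API call, and the 10% cut mirrors the one-liner above):

```python
def moderate_with_truncation(text: str, moderate, max_tries: int = 5):
    """Retries a failing moderation call, shaving 10% off the input each time.
    `moderate` is any callable that raises on a 429-style error."""
    for _ in range(max_tries):
        try:
            return moderate(text)
        except Exception:
            text = text[:int(0.9 * len(text))]  # drop the last 10%
    raise RuntimeError("moderation still failing after truncation")

# Stub that fails for inputs longer than 8 characters, to exercise the loop:
def fake_moderate(t):
    if len(t) > 8:
        raise ValueError("429")
    return {"flagged": False}

print(moderate_with_truncation("x" * 10, fake_moderate))  # {'flagged': False}
```

The obvious downside is that whatever was truncated away is never moderated at all.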

the header printing routine:

def print_x_headers(headers: dict) -> None:
    """Prints just x-ratelimit headers"""
    x_headers = {k: v for k, v in headers.items() if k.lower().startswith('x-rate')}
    print(json.dumps(x_headers, indent=2))

When invoked as print_x_headers(headers), with headers grabbed via the with_raw_response method, it will indeed print. But it prints nothing if we don't get x-ratelimit headers back (as we do from other endpoints), because it filters.

You can use .startswith('') with an empty string, which doesn't filter the headers and gives you a somewhat readable report of everything you do get. The available keys don't answer any question about rate, then:

date
content-type
transfer-encoding
connection
openai-version
openai-organization
x-request-id
openai-processing-ms
strict-transport-security
cf-cache-status
set-cookie
set-cookie
server
cf-ray
content-encoding
alt-svc
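Concretely, relaxing the filter to an empty prefix (which startswith always matches) turns the same routine into a full header dump; a hypothetical variant:

```python
import json

def print_all_headers(headers: dict) -> None:
    """Same filter as print_x_headers, but '' matches every key."""
    all_headers = {k: v for k, v in headers.items() if k.lower().startswith('')}
    print(json.dumps(all_headers, indent=2))

sample = {"date": "Tue, 16 Apr 2024", "x-request-id": "req_example"}
print_all_headers(sample)  # prints both keys
```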

I suppose one can use their recommended 2,000-token split point for your own chunking. Have a regex start looking for linefeeds, or then spaces when it gets desperate, after 1,000 characters of Chinese, because of its high token usage.
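A splitter along those lines: cut near a character budget, preferring a linefeed, then a space, then a hard cut. The budget is in characters (roughly 1,000 for Chinese under a 2,000-token target, per the estimate above); the function is a sketch, not a library API:

```python
def split_at_boundaries(text: str, max_chars: int = 1000) -> list[str]:
    """Splits text into chunks of at most max_chars characters,
    preferring to cut at a linefeed, then a space, then anywhere."""
    chunks = []
    while len(text) > max_chars:
        window = text[:max_chars]
        cut = window.rfind("\n")
        if cut <= 0:
            cut = window.rfind(" ")
        if cut <= 0:
            cut = max_chars  # desperate: hard cut mid-run
        chunks.append(text[:cut])
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks

parts = split_at_boundaries("one two three four five", max_chars=10)
print(parts)  # ['one two', 'three', 'four five']
```

Each chunk can then be moderated separately and the results merged by taking maxima.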

Yes, the final workaround I used was indeed to chunk longer prompts into shorter ones and then take their maximum category scores as the result. In case others encounter the same issue, here is my workaround:

import re
import time
from openai import OpenAI

def query_moderation(content, max_num_retries=3, wait_time=60):
    client = OpenAI()
    num_retries = 0
    finished = False
    while (not finished) and num_retries <= max_num_retries:
        if num_retries > 0:
            print (f'retrying {num_retries} times')
        try:
            response = client.moderations.create(input=content)
            finished = True
        except Exception as e:
            err_msg = f'{e}'
            print (err_msg)
            m = re.search(r"Please try again in (\d+\.?\d*)s", err_msg)
            num_retries += 1
            if m:
                sleep_time = min(float(m.group(1)) * 1.2, wait_time)
                print (f'sleeping: {sleep_time} seconds')
                time.sleep(sleep_time)
            else:
                time.sleep(wait_time)
    if not finished:
        content_length = len(content)
        half_length = int(round(content_length / 2))
        content_firsthalf = content[:half_length]
        content_secondhalf = content[half_length:]
        print (f'splitting, old length: {content_length} into new length: {half_length}')
        output_firsthalf = query_moderation(content_firsthalf, max_num_retries, wait_time)
        output_secondhalf = query_moderation(content_secondhalf, max_num_retries, wait_time)
        output = {'flagged': output_firsthalf['flagged'] or output_secondhalf['flagged']}
        output['categories'] = {}
        for k in output_firsthalf['categories']:
            output['categories'][k] = output_firsthalf['categories'][k] or output_secondhalf['categories'][k]
        output['category_scores'] = {}
        for k in output_firsthalf['category_scores']:
            output['category_scores'][k] = max(output_firsthalf['category_scores'][k], output_secondhalf['category_scores'][k])
    else:
        output = response.results[0].model_dump()
    return output

However, I still hope that OpenAI can solve this issue, or at least fix their error message, especially considering how simple it is to produce this error (at least for me, but I’m not sure if this issue is only specific to my account, would appreciate it if someone can test my code below and let me know):

from openai import OpenAI
import time
import requests
client = OpenAI()
response = requests.get("https://raw.githubusercontent.com/da03/moderation_issue/main/example.txt")
print (response.text[5999])
flag_produce_error = True # when True, produces a 429 error; False, no error
if flag_produce_error:
    text = response.text[:6000]
else:
    text = response.text[:5999]

print (f'Number of characters: {len(text)}')
try:
    response = client.moderations.create(input=text)
except Exception as e:
    print ('error', e)

Hmm, after testing in Python, the issue was the encoding, not the input size or the rate limit.

response = requests.get("https://raw.githubusercontent.com/da03/moderation_issue/main/example.txt")
text = response.text[:5999]  # it works
text = response.text[:6000]  # 429 rate limit
text = response.text[:6205]  # a random number larger than 6000 ... and it works
text = str(response.text.encode("raw_unicode_escape"), encoding='utf-8')  # no truncation, and it's OK!
res = client.moderations.create(input=text)

So the 429 rate-limit error is something buggy, and that causes confusion 😓
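The inflated “Requested” figures are consistent with this reading: raw_unicode_escape turns every character above U+00FF into a six-character \uXXXX ASCII sequence, so an escaped (or internally re-escaped) CJK string is several times longer than the original. This is one plausible explanation, not a confirmed diagnosis:

```python
text = "杀掉全人类"  # a 5-character toxic Chinese phrase ("kill all humans")
escaped = str(text.encode("raw_unicode_escape"), encoding="utf-8")

print(len(text))     # 5
print(escaped)       # \u6740\u6389\u5168\u4eba\u7c7b
print(len(escaped))  # 30: six ASCII characters per CJK character
```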


Also, awareness of encoding is important when doing truncation:


>>> chinese = "要"
>>> bytes(chinese.encode("utf-8"))
b'\xe8\xa6\x81'

Thank you for confirming the issue I encountered! The example you showed uses a different encoding/decoding pair, but if we use the same encoding for both directions, the 429 error remains, as shown in the code below:

from openai import OpenAI
import time
import requests
client = OpenAI()
response = requests.get("https://raw.githubusercontent.com/da03/moderation_issue/main/example.txt")
text = response.text
text = str(text.encode("raw_unicode_escape"), encoding='utf-8')  # no truncation, and it's OK! BUT it encodes and decodes with different encodings! You can print the result and check.
text = str(text.encode("raw_unicode_escape"), encoding='raw_unicode_escape')  # 429 error
text = str(text.encode("utf-8"), encoding='utf-8')  # 429 error

res = client.moderations.create(input=text)

When the encoding and decoding use different systems, the string converted back from bytes differs from the original. For example, for a toxic input, if we encode to bytes and decode back with the same system, Moderation correctly flags it:

text = "杀掉全人类" # toxic input ("kill all humans")
text = str(text.encode("utf-8"), encoding='utf-8') #flagged=True
res = client.moderations.create(input=text)
print (res)

However, when we encode and decode with different systems, Moderation is no longer able to flag it correctly (which might not be very surprising):

text = "杀掉全人类" # toxic input ("kill all humans")
text = str(text.encode("raw_unicode_escape"), encoding='utf-8') #flagged = False
res = client.moderations.create(input=text)
print (res)

In fact, being able to call text.encode at all indicates that text is already a string, not bytes, so there should be no need to encode it to bytes and decode it back to a string, unless we want to send bytes (but I tested and found that client.moderations.create expects a string input, not bytes).

Here’s an example where I printed out the problem of encoding/decoding using different standards:

>>> text = "杀掉全人类" # toxic input
>>> text = str(text.encode("utf-8"), encoding='utf-8')
>>> print (text)
杀掉全人类
>>> text = str(text.encode("raw_unicode_escape"), encoding='raw_unicode_escape')
>>> print (text)
杀掉全人类
>>> text = str(text.encode("raw_unicode_escape"), encoding='utf-8') # different encoding/decoding standards
>>> print (text)
\u6740\u6389\u5168\u4eba\u7c7b

I believe my truncation is correct, since I applied it to a string object, not bytes:

>>> chinese = "要"
>>> len(chinese)
1
>>> chinese = "要"*3
>>> len(chinese)
3
>>> chinese[:2]
'要要'
>>> chinese[:1]
'要'

Interesting.

I posted {“input”: “\u6740\u6389\u5168\u4eba\u7c7b”} directly in Postman, and the result was flagged true on violence (the model seems capable of recognizing Unicode escapes, and maybe even a default embedding float array string), but the same escaped string sent via the Python lib did not have its violence detected.

Digging into the OpenAI Python lib is beyond our role here; I think we'd better just wait for OpenAI to correct it in the future and work around it for now.

So a simple workaround may be to drop the last character and retry until the exception disappears. In this case (https://raw.githubusercontent.com/da03/moderation_issue/main/example.txt), dropping the last character and retrying just works. Your collaborator's workaround goes in a suitable direction but carries some danger; truncating just the last one or two characters (retrying once more) may be a bit more robust and loses little context.
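For reference, that drop-one-character retry can be sketched as follows (moderate stands in for the real API call; the stub only mimics the 6,000-character failure seen above):

```python
def moderate_drop_tail(text: str, moderate, max_drops: int = 3):
    """Retries the moderation call, dropping one trailing character per failure."""
    for _ in range(max_drops + 1):
        try:
            return moderate(text)
        except Exception:
            text = text[:-1]  # lose one character of context and retry
    raise RuntimeError("still failing after dropping trailing characters")

# Stub that fails only at exactly 6000 characters, like the example text:
def fake_moderate(t):
    if len(t) == 6000:
        raise ValueError("429")
    return len(t)

print(moderate_drop_tail("x" * 6000, fake_moderate))  # 5999
```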
