My application is intermittently running into issues where the content moderation endpoint returns a 429 "Too Many Requests". I looked into the rate limits for text-moderation-latest, which are set at 150k TPM and 1,000 RPM.
A little background:
Our application is fairly small, and because this has been happening intermittently, I added logging to output the input length every time we make a moderation request. Before the failures started occurring today, we had made 13 requests in the previous minute, with character input lengths of (161, 191, 172, 48, 59, 18, 20, 2000, 78, 59, 138, 116, 2000). Based on this, I ran everything through the tokenizer, and it doesn't seem we could be hitting the 150k token limit either, yet we are strangely still getting rate limited. Because passing the moderation check is required before we call the model endpoint, when this occurs all requests start to fail for all users. The moderation request retries with backoff up to 6 times when it receives a 429… but this appears to make things worse, because once we get into this state the rate limiting tends to last several minutes before things return to normal.
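For reference, the retry logic is essentially the following (a minimal sketch with the openai Python client; the helper name and delays are illustrative, not our exact code):

import time

import openai
from openai import OpenAI

client = OpenAI()

def moderate_with_backoff(text, max_retries=6):
    # Retry the moderation call on 429s with exponential backoff.
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.moderations.create(input=text, model="text-moderation-latest")
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the last retry
            time.sleep(delay)
            delay *= 2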
Just wanted to check whether anyone else has seen similar rate-limiting issues with the moderation endpoint, or to get input and suggestions on approaches to mitigate it.
The moderations endpoint's usage limit is set significantly lower than the AI language models', especially since you can employ several different models that all rely on the same shared moderation.
13 is nowhere near 1,000, though. It seems more like a network error; maybe your client isn't promptly closing connections. You can get the full response message and headers, which include your rate limit and rate consumption, to see whether they refer to the rate limit at all.
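For example, with the openai Python client you can pull the raw response and inspect the rate-limit headers directly (a sketch, assuming the moderation endpoint returns the standard x-ratelimit-* headers):

from openai import OpenAI

client = OpenAI()

raw = client.moderations.with_raw_response.create(
    input="example user text",
    model="text-moderation-latest",
)
# Standard rate-limit headers, if the endpoint returns them
for name in ("x-ratelimit-limit-requests", "x-ratelimit-remaining-requests",
             "x-ratelimit-limit-tokens", "x-ratelimit-remaining-tokens"):
    print(name, raw.headers.get(name))

moderation = raw.parse()  # the usual parsed moderation result
print(moderation.results[0].flagged)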
Will OpenAI allow an upgrade if you request one? You have to pick "other" in the request form, since moderations is not in the drop-down, and the answer you get back is "Sorry."
I agree. When I looked at this I had the same thought: there is no way to moderate at the same rate I am technically allowed to request with gpt-3.5-turbo. This makes it challenging: if I didn't moderate, requests would succeed, but that puts my account at risk of breaking OpenAI policies due to malicious users/inputs.
I am using LangChain to construct my OpenAI client, along with the OpenAIModerationChain, which unfortunately doesn't give much granularity in the response beyond reporting 429 "Too Many Requests" and throwing an error.
To try something… I am currently migrating off LangChain for moderation and going to use a basic REST request, to at least have more granular control over the responses and decide what to do. Unless something was going wrong internally in the client and eating my tokens (it doesn't appear that way from my usage dashboard, but anything is possible), I don't expect this to resolve my issue, but at least I can see the responses more clearly.
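Roughly what I have in mind for the plain REST call (a sketch using requests; error handling trimmed and the function name is just illustrative):

import os
import requests

def moderate(text):
    resp = requests.post(
        "https://api.openai.com/v1/moderations",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"input": text, "model": "text-moderation-latest"},
        timeout=10,
    )
    if resp.status_code == 429:
        # The body and any rate-limit headers are now visible before deciding what to do
        print(resp.status_code, resp.text)
        print({k: v for k, v in resp.headers.items() if k.lower().startswith("x-ratelimit")})
    resp.raise_for_status()
    return resp.json()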
If I send a list of multiple strings, I'll get rate limited despite being way under the threshold, but I won't be rate limited if I combine everything into one string.
import tiktoken
from openai import OpenAI

client = OpenAI()

def num_tokens_from_string(s: str) -> int:
    # count tokens with tiktoken's cl100k_base encoding (assumed)
    return len(tiktoken.get_encoding("cl100k_base").encode(s))

lens = [500] * 32
mod_in = ["a" * i + "$" * i for i in lens]  # 32 test strings, 1000 characters each
n_lens = list(map(num_tokens_from_string, mod_in))
print(sum(n_lens), " new lengths:", n_lens)
print(num_tokens_from_string("".join(mod_in)))
# 32 strings of 191 tokens each = 6112 total, but this batched call
# errors on tokens per min (TPM): Limit 150000, Requested 194560
client.moderations.create(input=mod_in, model="text-moderation-latest")
# Works fine on 1 single string of 6019 tokens
client.moderations.create(input="".join(mod_in), model="text-moderation-latest")
I think how they quantize usage per minute has a bug when you send batches of strings: it seems to assume the requests are coming in much more quickly than they actually are.
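A rough sanity check of the numbers from the batched call above (my own reading of them, not confirmed):

# The rejected batch of 32 strings reported "Requested 194560" tokens.
requested = 194560
batch_total = 6112           # actual token total across all 32 strings
print(requested / 32)        # 6080.0 -- roughly the whole batch's token count
print(batch_total * 32)      # 195584 -- close to the 194560 the limiter charged
# One possible reading: each list item is being counted as though it were the entire batch.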