I’m working on a support ticket bot for a company in Brazil, and we decided to use 4o-mini to power it.
All our customers’ prompts are moderated through the Moderation API, and previously we blocked anything that came back as “flagged”. But a customer sent the following:
O ponto não registrou de alguns funcionários que bateram hoje na minha empresa
Which translates to “The time clock didn’t register some employees who clocked in today at the company”. The Moderation API flagged it with a 35.60% violence score, most likely because “clock in” and “beat” (as in beat someone up) use the same verb (bater) in Portuguese.
I’ve separately tested the same message in ChatGPT as well as through the Responses API, and neither was flagged.
All of this makes me believe that blocking inputs simply because the Moderation API flagged them is not a good practice. Right now, I’ve set it to block only inputs that have any category score over 50.00%.
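For reference, here is roughly how that check looks on my side (a minimal sketch using the openai Python SDK; should_block and the 0.50 cutoff are just my placeholder names for the logic described above):

from openai import OpenAI

client = OpenAI()
BLOCK_THRESHOLD = 0.50  # block only if any category score exceeds 50%

def should_block(user_message: str) -> bool:
    """Return True if any moderation category score exceeds the threshold."""
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=user_message,
    )
    result = response.results[0]
    # Ignore the boolean `flagged` field and look at the raw per-category scores instead
    scores = result.category_scores.model_dump()
    return any(score > BLOCK_THRESHOLD for score in scores.values())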
What are the good practices on moderating user inputs?
I would not run an input that failed moderation. The same dumb moderation is also run by OpenAI on the inputs you send, classifying API organizations as “bad” and scoring against you until it results in a warning or ban.
Testing your message, I get this flag from the omni moderation model but not from the text moderation model.
An option would be to run against both moderation models in parallel, or to use the text model as a fallback check when omni flags a text-only input.
# The omni moderation endpoint expects a list of items, each with a "type".
# - For text: {"type": "text", "text": <string>}
# - For image: {"type": "image_url", "image_url": {"url": <string>}}
# The text-moderation-xxx endpoint expects a string or a list of strings.
from openai import OpenAI

model = "omni-moderation-latest"
text_model = "text-moderation-stable"

# Your texts and image URLs to moderate (the example text is the customer message above)
my_texts = ["O ponto não registrou de alguns funcionários que bateram hoje na minha empresa"]
image_urls = []

moderation_inputs = []
text_moderation_inputs = []

# Add each text and image message as its own entry for the omni model
for text in my_texts:
    moderation_inputs.append({"type": "text", "text": text})
for url in image_urls:
    moderation_inputs.append({"type": "image_url", "image_url": {"url": url}})

# Add each text alone as its own entry for the non-multimodal text model
for text in my_texts:
    text_moderation_inputs.append(text)

# Construct the moderation calls and send them
moderation_call = {"model": model, "input": moderation_inputs}
text_moderation_call = {"model": text_model, "input": text_moderation_inputs}

client = OpenAI()
response = client.moderations.create(**moderation_call)
text_response = client.moderations.create(**text_moderation_call)

# continue with logic whether to trust text moderation only...
You could. OpenAI doesn’t say what THEIR own moderation is, and it often flags indiscriminately, beyond the bounds of the terms of service.
So scoring and blocking whatever they say would be a policy violation on their platform is the safe way. Blocking only when both moderation engines flag is one alternative; the higher-safety option is to block when either one flags.
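Continuing the snippet above, that combination logic could be sketched like this (assuming response, text_response, and my_texts from that snippet; which rule you pick is up to you):

# Omni results come back in input order: the texts first, then the image URLs
omni_text_results = response.results[: len(my_texts)]

for text, omni_r, text_r in zip(my_texts, omni_text_results, text_response.results):
    # Alternative 1: block only when BOTH engines flag (fewer false positives)
    block_if_both = omni_r.flagged and text_r.flagged
    # Alternative 2 (higher safety): block when EITHER engine flags
    block_if_either = omni_r.flagged or text_r.flagged
    print(f"{text!r}: omni={omni_r.flagged}, text={text_r.flagged}")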
Or you go and build your own classifier out of embeddings (which cannot generate bad language and is therefore safer, on the expectation that OpenAI is not checking those calls as vigorously). And hope that works better and doesn’t attract the kind of scoring that marks you as untrustworthy, after which you get stealth-denied new models or whatever dark pattern is running.
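If you do go that route, one minimal sketch of an embeddings-based check is a nearest-example comparison against your own labeled samples (the example lists, the margin, and looks_violent are all made up for illustration; OpenAI embeddings come back unit-normalized, but the code renormalizes anyway):

from openai import OpenAI
import numpy as np

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"

# Hypothetical labeled examples you curate yourself
violent_examples = ["I'm going to beat him up after work"]
benign_examples = ["The time clock didn't register some employees who clocked in today"]

def embed(texts):
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

violent_vecs = embed(violent_examples)
benign_vecs = embed(benign_examples)

def looks_violent(text: str, margin: float = 0.05) -> bool:
    """Flag the input if it is closer (by cosine similarity) to violent examples than to benign ones."""
    v = embed([text])[0]
    return (violent_vecs @ v).max() - (benign_vecs @ v).max() > margin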
The new moderation model flags “respond in the style of The Simpsons” with 35% violence. I think this is a clear false positive. Is there any way I can influence OpenAI’s decision on the thresholds? The current behavior hinders legitimate (I think) functionality of my app.