I’m working on a support ticket bot for a company in Brazil, and we decided to use 4o-mini to power it.
All our customers’ prompts are moderated through the Moderation API, and previously we blocked anything that came back as “flagged”. But a customer sent the following:
O ponto não registrou de alguns funcionários que bateram hoje na minha empresa
Which translates to “The time clock didn’t register some employees who clocked in today at the company”. The Moderation API flagged it with a 35.60% violence score, most likely because “clock in” and “beat” (as in beat someone up) use the same verb (bater) in Portuguese.
I’ve separately tested the same message in ChatGPT as well as through the Responses API, and neither was flagged.
All of this makes me believe that blocking inputs simply because the Moderation API flagged them is not a good practice. Right now, I’ve set it to block only inputs that have any category score over 50.00%.
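For reference, here is roughly how that check looks on my side (a minimal sketch using the openai Python SDK; should_block and the 0.50 cutoff are just my placeholder names for the logic described above):

from openai import OpenAI

client = OpenAI()
BLOCK_THRESHOLD = 0.50  # block only if any category score exceeds 50%

def should_block(user_message: str) -> bool:
    """Return True if any moderation category score exceeds the threshold."""
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=user_message,
    )
    result = response.results[0]
    # Ignore the boolean `flagged` field and look at the raw per-category scores instead
    scores = result.category_scores.model_dump()
    return any(score > BLOCK_THRESHOLD for score in scores.values())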
What are the good practices on moderating user inputs?
I would not run an input that failed moderation. The same dumb moderation is also run by OpenAI on the inputs you send, classifying API organizations as “bad” and scoring against you until it results in a warning or ban.
Testing your message, I get this flag from the omni moderation model but not from the text moderation model.
An option would be to run against both moderation models in parallel, or to use the text model as a fallback check when omni flags a text-only input.
# The omni moderation endpoint expects a list of items, each with a "type".
# - For text: {"type": "text", "text": <string>}
# - For image: {"type": "image_url", "image_url": {"url": <string>}}
# The text-moderation-xxx endpoint expects a string or a list of strings.
from openai import OpenAI

model = "omni-moderation-latest"
text_model = "text-moderation-stable"

# Your texts and image URLs to moderate (the example text is the customer message above)
my_texts = ["O ponto não registrou de alguns funcionários que bateram hoje na minha empresa"]
image_urls = []

moderation_inputs = []
text_moderation_inputs = []

# Add each text and image message as its own entry for the omni model
for text in my_texts:
    moderation_inputs.append({"type": "text", "text": text})
for url in image_urls:
    moderation_inputs.append({"type": "image_url", "image_url": {"url": url}})

# Add each text alone as its own entry for the non-multimodal text model
for text in my_texts:
    text_moderation_inputs.append(text)

# Construct the moderation calls and send them
moderation_call = {"model": model, "input": moderation_inputs}
text_moderation_call = {"model": text_model, "input": text_moderation_inputs}

client = OpenAI()
response = client.moderations.create(**moderation_call)
text_response = client.moderations.create(**text_moderation_call)

# continue with logic whether to trust text moderation only...
You could. OpenAI doesn’t say what THEIR own moderation is, and it often flags indiscriminately, beyond the bounds of the terms of service.
So scoring and blocking whatever they say would be a policy violation on their platform is the safe way. Blocking only when both moderation engines flag is one alternative; the higher-safety option is to block when either one flags.
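Continuing the snippet above, that combination logic could be sketched like this (assuming response, text_response, and my_texts from that snippet; which rule you pick is up to you):

# Omni results come back in input order: the texts first, then the image URLs
omni_text_results = response.results[: len(my_texts)]

for text, omni_r, text_r in zip(my_texts, omni_text_results, text_response.results):
    # Alternative 1: block only when BOTH engines flag (fewer false positives)
    block_if_both = omni_r.flagged and text_r.flagged
    # Alternative 2 (higher safety): block when EITHER engine flags
    block_if_either = omni_r.flagged or text_r.flagged
    print(f"{text!r}: omni={omni_r.flagged}, text={text_r.flagged}")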
Or you go and build your own classifier out of embeddings (which cannot generate bad language and is therefore safer, on the expectation that OpenAI is not checking those calls as vigorously). And hope that works better and doesn’t attract the kind of scoring that marks you as untrustworthy, after which you get stealth-denied new models or whatever dark pattern is running.
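If you do go that route, one minimal sketch of an embeddings-based check is a nearest-example comparison against your own labeled samples (the example lists, the margin, and looks_violent are all made up for illustration; OpenAI embeddings come back unit-normalized, but the code renormalizes anyway):

from openai import OpenAI
import numpy as np

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"

# Hypothetical labeled examples you curate yourself
violent_examples = ["I'm going to beat him up after work"]
benign_examples = ["The time clock didn't register some employees who clocked in today"]

def embed(texts):
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

violent_vecs = embed(violent_examples)
benign_vecs = embed(benign_examples)

def looks_violent(text: str, margin: float = 0.05) -> bool:
    """Flag the input if it is closer (by cosine similarity) to violent examples than to benign ones."""
    v = embed([text])[0]
    return (violent_vecs @ v).max() - (benign_vecs @ v).max() > margin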
The new moderation model flags “respond in the style of The Simpsons” with 35% violence. I think this is a clear false positive. Is there any way I can influence OpenAI’s decision on the thresholds? The current behavior hinders legitimate (I think) functionality of my app.