Description:
I’m encountering a false positive with the moderation system (omni-moderation-latest) while testing with sample.
A simple query:
“Is elimination the easiest method here?” (referring to solving equations using the elimination method) is being flagged under self-harm, violence.
Reproducible Code:
import json
from openai import OpenAI
client = OpenAI(api_key= “YOUR_API_KEY”)
response = client.moderations.create(
model= “omni-moderation-latest”,
input= “Is elimination the easiest method here?”
)
print(response)
output:
results=[Moderation(categories=Categories(self-harm=True, violence=True)
Observed Behavior:
flagged: true
Category triggered: self_harm, violence
This occurs despite the query being purely academic in nature.
Hi @Hawkes,
Yeah, this can happen sometimes when the input is very short or doesn’t include enough context.
The word “elimination” on its own can be interpreted in different ways, and without something like “algebra” or “equations” in the sentence, the model may lean toward a more cautious classification.
You could try testing with a slightly more explicit version like:
“In algebra, is the elimination method the easiest way to solve this system of equations?”
That usually helps the model understand the intent better and reduces the chance of it being flagged.
~Smith
Shouldn’t it already detect this context in an active conversation, though? Because it sounds like OP was already having a conversation about math, so it was already within a specific context.
Thanks, after appending the prior conversation context, it’s better now.
It seems like the moderation check is being applied only to the single query, without considering the context. Thanks, I’ll try adding more context—it should help.