I am leveraging the moderation endpoint, and the scores it reports are very low.
The sentence should receive a much higher score. Has anyone observed similar behavior? Is anything missing here?
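For reference, this is roughly the kind of call I mean; a minimal sketch using the openai Python SDK, where the model name and client setup are assumptions rather than my exact code:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Send one sentence to the moderation endpoint and look at its scores.
response = client.moderations.create(
    model="text-moderation-latest",  # assumption: current moderation model alias
    input="the sentence I expected to score much higher",
)

result = response.results[0]
print(result.flagged)          # overall boolean flag
print(result.category_scores)  # per-category scores, which come back very low
```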
I don’t think that input should trigger anything. Moderations is not a filter of impure thoughts.
New moderation and embedding models were just released at the same time. The moderation values are scaled differently, and are likely based on similar embedding techniques that give more separation.
We assume the flagging threshold will follow OpenAI policy.
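If the goal is a yes/no decision, that distinction matters in code. A minimal sketch, assuming the current openai Python SDK (1.x, where the response objects are pydantic models); the model name and the 0.5 cutoff are placeholders only:

```python
from openai import OpenAI

client = OpenAI()

result = client.moderations.create(
    model="text-moderation-latest",  # assumption: alias for the newly released model
    input="text under review",
).results[0]

# The booleans reflect OpenAI's own policy thresholds for the new scaling.
print("flagged by policy:", result.flagged)
print("categories tripped:", [
    name for name, hit in result.categories.model_dump().items() if hit
])

# A custom cutoff is applied to the raw scores instead; the value below is
# only a placeholder and would need tuning per category on the new scale.
CUSTOM_CUTOFF = 0.5
scores = result.category_scores.model_dump()
print("over custom cutoff:", {k: v for k, v in scores.items() if v >= CUSTOM_CUTOFF})
```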
Here is the result of genuinely bad input sent to the new moderation endpoint:
{
"harassment": "0.994325",
"harassment_threatening": "0.994892",
"hate": "0.965772",
"hate_threatening": "0.967947",
"self_harm": "0.000038",
"self_harm_instructions": "0.000000",
"self_harm_intent": "0.000004",
"sexual": "0.000236",
"sexual_minors": "0.000003",
"violence": "0.999673",
"violence_graphic": "0.523509"
}
The AI is pretty certain of violence…
We can also see that it can be detrimental and unintelligent in use:
I prefix it with “In the prosecutor’s filing, it was stated that the defendant should be charged with a hate crime because of his online posting, {my_text}”, and the input still gets three flags. "harassment_threatening": "0.375120" is a true flag even at that low value.
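To see how much the added context moves the scores, the two versions can be run side by side. A minimal sketch, assuming the openai Python SDK, with the test sentence left as a placeholder:

```python
from openai import OpenAI

client = OpenAI()

my_text = "..."  # placeholder for the sentence under test
prefixed = (
    "In the prosecutor's filing, it was stated that the defendant should be "
    "charged with a hate crime because of his online posting, " + my_text
)

# Compare flags and per-category scores for the bare and prefixed versions.
for label, text in (("bare", my_text), ("prefixed", prefixed)):
    result = client.moderations.create(input=text).results[0]
    print(label, "flagged:", result.flagged)
    print(label, "scores:", result.category_scores.model_dump())
```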