I’m trying to avoid false positives when an image might include, say, a weapon or a skull, or be sexually suggestive or show, for instance, some revealing clothing, but isn’t actually a problem to publish. Should I add logic along the lines of: if the image is flagged but the score is < 50%, allow it anyway? Or should I set thresholds per category, so that there’s zero tolerance for sexual/minors, harassment, etc., but more tolerance for sexual/violent content? How would I go about choosing a threshold? E.g. for sexual content, is 50% and above basically pornography, or could it be a woman in a short skirt?
Any guidance / thoughts / where to look for docs would be welcome.
OpenAI could improve transparency and make the moderations endpoint much more useful by publishing a “flag_threshold” per category in the API response. That would let you re-weight the scores instead of making the guesses you currently must make about the cutoff.
The “omni” moderation model is supposed to be better normalized, with scores described as a “probability of policy violation”, and it only reflects OpenAI’s own concerns. However, as you’ve discovered, the actual thresholds for emitting a flag, as with the prior moderation model, are not provided, are not 0.50, and are likely hand-tuned per category against test sets after training, according to how much probability is tolerated in each.
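Since the flag cutoffs aren’t published, one practical workaround is to ignore `flagged` entirely and apply your own per-category cutoffs to `category_scores`. A minimal sketch, assuming the current `openai` Python SDK; the threshold numbers are placeholders you would tune yourself, not anything OpenAI documents:

```python
from openai import OpenAI

client = OpenAI()

# Per-category cutoffs you tune yourself (illustrative numbers, not OpenAI's).
THRESHOLDS = {
    "sexual/minors": 0.0,           # zero tolerance: any nonzero score rejects
    "harassment/threatening": 0.1,
    "violence/graphic": 0.3,
    "sexual": 0.6,                  # tolerate "short skirt" territory
    "violence": 0.6,                # tolerate e.g. a decorative skull
}
DEFAULT_THRESHOLD = 0.4             # fallback for categories not listed above


def review_image(image_url: str) -> tuple[bool, dict]:
    """Return (allow, offending_scores) using our own cutoffs, not `flagged`."""
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=[{"type": "image_url", "image_url": {"url": image_url}}],
    )
    result = response.results[0]
    # by_alias=True keeps the documented "sexual/minors"-style category names
    scores = result.category_scores.model_dump(by_alias=True)

    offending = {
        category: score
        for category, score in scores.items()
        if score is not None and score > THRESHOLDS.get(category, DEFAULT_THRESHOLD)
    }
    return (not offending, offending)
```

The structure is exactly what you describe: near-zero tolerance where the risk is unacceptable, looser cutoffs where “suggestive but publishable” content tends to land.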
Another thing they could offer (but don’t) is a way to bias not the flag level but the output value itself. Perhaps the obfuscation is deliberate: it keeps the endpoint useful mainly for screening input to OpenAI’s own models, not for you to find application elsewhere.
Do note that:
- The company “moderates” and scores your organization anyway, on categories that are not in the moderations endpoint and which they don’t reveal, so sending flagged content onward is extra jeopardy; a false positive is still a “positive”;
- Moderation input is chunked, and the highest scores are collected. The results can defy sense: text or images that pass individually may be flagged in combination, or the opposite. More or less text, such as adding or dropping history messages, can also flip the result (see the sketch after this list);
- It is thus almost a random disposition generator, yet it is used to pass judgement on humans; an internal version of it (never surfaced to you) is, unethically, used to take away credits and opportunity by banning organizations without human review.
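Because of that combination effect, it’s worth scoring the exact pairing you intend to publish rather than the pieces in isolation. A small example of how you might check this, with placeholder text and URL, again using the `openai` Python SDK:

```python
from openai import OpenAI

client = OpenAI()


def sexual_score(inputs: list[dict]) -> float:
    """Return the 'sexual' category score for a given combination of inputs."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=inputs,
    ).results[0]
    return result.category_scores.sexual


text_part = {"type": "text", "text": "Summer lookbook, beachwear edition"}
image_part = {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}

# The same material can score differently alone vs. in combination,
# so characterize the exact pairing you intend to publish.
print("text alone:  ", sexual_score([text_part]))
print("image alone: ", sexual_score([image_part]))
print("text + image:", sexual_score([text_part, image_part]))
```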
For now, you’ll need to send your own examples of violations, drawn from whatever content your app actually collects, and characterize the moderation scores you get back; calls to the moderations endpoint itself won’t earn you a strike.
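One way to characterize the scores is to run a labeled sample of your own content, split into “would publish” and “would reject”, through the endpoint and compare where each group lands per category. A rough sketch, with a hypothetical `SAMPLE` list standing in for your own data:

```python
import statistics

from openai import OpenAI

client = OpenAI()

# Hypothetical labeled sample from your own app: (image_url, acceptable_to_publish)
SAMPLE = [
    ("https://example.com/short-skirt.jpg", True),
    ("https://example.com/decorative-skull.jpg", True),
    ("https://example.com/explicit.jpg", False),
    # ... more of your own examples, the more borderline the better
]


def split_scores(category: str) -> tuple[list[float], list[float]]:
    """Collect one category's scores, split by whether you'd publish the item."""
    publishable, unacceptable = [], []
    for url, acceptable in SAMPLE:
        result = client.moderations.create(
            model="omni-moderation-latest",
            input=[{"type": "image_url", "image_url": {"url": url}}],
        ).results[0]
        value = result.category_scores.model_dump(by_alias=True)[category]
        (publishable if acceptable else unacceptable).append(value)
    return publishable, unacceptable


ok, bad = split_scores("sexual")
# Inspect where the two groups sit; a workable cutoff lies above most of
# `ok` and below most of `bad`.
print("publishable  max / median:", max(ok), statistics.median(ok))
print("unacceptable min / median:", min(bad), statistics.median(bad))
```

If the two distributions overlap heavily for a category, no single cutoff will separate them, and you’ll want a human-review band between your allow and reject thresholds.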
(Then, when OpenAI blocks your training file for containing the very negative examples the AI is supposed not to produce? Go figure that one out…)