Today we are introducing a new moderation model, omni-moderation-latest, in the Moderation API. The new model supports both text and images (text is supported for all categories, images for a subset), adds two new text-only harm categories, and produces more accurate scores. The Moderation API has been updated in a backwards-compatible way to support these new features.
Example request
When you specify one of the new models, such as "model": "omni-moderation-latest", you can now include an object with "type": "image_url" in your input array to provide an image. The API accepts either URLs or base64-encoded images.
curl https://api.openai.com/v1/moderations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "input": [
      {
        "type": "text",
        "text": "I want to kill them!"
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "https://upload.wikimedia.org/wikipedia/en/thumb/7/7c/Crowd_in_street.jpg/440px-Crowd_in_street.jpg"
        }
      }
    ],
    "model": "omni-moderation-latest"
  }'
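For reference, the same request can be made from Python. The sketch below assumes the official openai Python package (v1+) and that OPENAI_API_KEY is set in your environment; the data-URL format mentioned in the comment for base64 images is an assumption based on how other OpenAI endpoints accept inline images.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.moderations.create(
    model="omni-moderation-latest",
    input=[
        {"type": "text", "text": "I want to kill them!"},
        {
            "type": "image_url",
            "image_url": {
                # A hosted URL, or a base64-encoded image as a data URL,
                # e.g. "data:image/jpeg;base64,<BASE64_BYTES>"
                "url": "https://upload.wikimedia.org/wikipedia/en/thumb/7/7c/Crowd_in_street.jpg/440px-Crowd_in_street.jpg"
            },
        },
    ],
)

print(response.results[0].flagged)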
Image inputs are supported for the following six categories: violence (violence and violence/graphic), self-harm (self-harm, self-harm/intent, and self-harm/instructions), and sexual (sexual but not sexual/minors). The remaining categories are currently text-only, and we are working to expand multimodal support to more categories in the future.
Example response
In addition to the per-category boolean flags and scores, the response now includes a category_applied_input_types object, which indicates which input modalities were taken into account for each category in your request.
{
  "id": "modr-XXXXX",
  "model": "omni-moderation-001",
  "results": [
    {
      "flagged": true,
      "categories": {
        "sexual": false,
        "hate": false,
        "harassment": false,
        "self-harm": false,
        "sexual/minors": false,
        "hate/threatening": false,
        "violence/graphic": false,
        "self-harm/intent": false,
        "self-harm/instructions": false,
        "harassment/threatening": true,
        "violence": true,
        "illicit": false,
        "illicit/violent": false
      },
      "category_scores": {
        "sexual": 1.2282071e-06,
        "hate": 0.010696256,
        "harassment": 0.29842457,
        "self-harm": 1.5236925e-08,
        "sexual/minors": 5.7246268e-08,
        "hate/threatening": 0.0060676364,
        "violence/graphic": 4.435014e-06,
        "self-harm/intent": 8.098441e-10,
        "self-harm/instructions": 2.8498655e-11,
        "harassment/threatening": 0.63055265,
        "violence": 0.99011886,
        "illicit": 0.231049303,
        "illicit/violent": 0.242101936
      },
      "category_applied_input_types": {
        "sexual": ["text", "image"],
        "hate": ["text"],
        "harassment": ["text"],
        "self-harm": ["text", "image"],
        "sexual/minors": ["text"],
        "hate/threatening": ["text"],
        "violence/graphic": ["text", "image"],
        "self-harm/intent": ["text", "image"],
        "self-harm/instructions": ["text", "image"],
        "harassment/threatening": ["text"],
        "violence": ["text", "image"],
        "illicit": ["text"],
        "illicit/violent": ["text"]
      }
    }
  ]
}
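As a rough sketch of how this object can be used, the snippet below assumes the response above has been parsed into a Python dict (response_body here is a hypothetical variable holding the raw JSON string) and pulls out both the flagged categories and the categories where the image was taken into account:

import json

result = json.loads(response_body)["results"][0]  # response_body: the raw JSON string shown above

# Categories the model flagged for this input
flagged_categories = [name for name, flagged in result["categories"].items() if flagged]

# Categories for which the image, not just the text, was taken into account
image_checked = [
    name
    for name, input_types in result["category_applied_input_types"].items()
    if "image" in input_types
]

print(flagged_categories)  # ["harassment/threatening", "violence"]
print(image_checked)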
When using the new model, the results also contain two additional categories: illicit, which covers instructions or advice on how to commit wrongdoing (a phrase like "how to shoplift", for example), and illicit/violent, which covers the same kind of content when the wrongdoing also involves violence.
More accurate scores, especially for low-resource languages
The new model improved by 42% on our multilingual eval, with gains in 98% of the languages tested. For low-resource languages like Khmer and Swati, it improved by 70%, and we saw the biggest improvements in Telugu (6.4x), Bengali (5.6x), and Marathi (4.6x).
(A higher AUPRC indicates better model performance in distinguishing between safe and unsafe examples.)
While the previous model had limited support for non-English languages, the new model now performs better in Spanish, German, Italian, Polish, Vietnamese, Portuguese, French, Chinese, Indonesian, and English than the previous model did in English.
Calibrated scores
The new model's scores are designed to represent the probability that a piece of content violates the relevant policy; this property is referred to as "calibration". It is useful for power users: as we launch new models in the future, thresholds you set against one model should behave very similarly with the next, allowing more seamless model switching.
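For example, here is a minimal sketch of applying your own per-category thresholds to the calibrated category_scores from the response above (the threshold values are arbitrary and purely illustrative):

# Per-category thresholds chosen for your own use case; values are illustrative only.
CUSTOM_THRESHOLDS = {
    "violence": 0.7,
    "harassment/threatening": 0.5,
}

def violates_policy(category_scores: dict) -> bool:
    # Because the scores are calibrated probabilities, the same thresholds
    # should behave similarly across current and future models.
    return any(
        category_scores.get(category, 0.0) >= threshold
        for category, threshold in CUSTOM_THRESHOLDS.items()
    )

print(violates_policy({"violence": 0.99011886, "harassment/threatening": 0.63055265}))  # True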
The Moderation API remains free. To get started, see our Moderation API guide.