Upgrading the Moderation API with a new multimodal moderation model

Today we are introducing a new moderation model, omni-moderation-latest, in the Moderation API. The new model supports both text and images as input (text for all categories, images for a subset of categories), adds two new text-only harm categories, and produces more accurate scores. The Moderation API has been updated in a backwards-compatible way to support these new features.

Example request

When you specify one of the new models, such as "model": "omni-moderation-latest", each entry in your input array can now be an object with "type": "image_url" to provide an image. The API accepts either URLs or base64-encoded images.

curl https://api.openai.com/v1/moderations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "input": [
      {
        "type": "text",
        "text": "I want to kill them!"
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "https://upload.wikimedia.org/wikipedia/en/thumb/7/7c/Crowd_in_street.jpg/440px-Crowd_in_street.jpg"
        }
      }
    ],
    "model": "omni-moderation-latest"
  }'
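
For reference, the same request through the official Python SDK might look roughly like the sketch below (assuming a recent openai v1.x library that already accepts the multimodal input array; the client reads OPENAI_API_KEY from the environment):

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.moderations.create(
    model="omni-moderation-latest",
    input=[
        {"type": "text", "text": "I want to kill them!"},
        {
            "type": "image_url",
            "image_url": {
                "url": "https://upload.wikimedia.org/wikipedia/en/thumb/7/7c/Crowd_in_street.jpg/440px-Crowd_in_street.jpg"
            },
        },
    ],
)

print(response.results[0].flagged)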

Image inputs are supported for the following six categories: violence (violence and violence/graphic), self-harm (self-harm, self-harm/intent, and self-harm/instructions), and sexual (sexual but not sexual/minors). The remaining categories are currently text-only, and we are working to expand multimodal support to more categories in the future.

Example response

In addition to the boolean flags and scores per category, the response now includes a category_applied_input_types object, which indicates which input modalities were taken into account for each category in your request.

{
  "id": "modr-XXXXX",
  "model": "omni-moderation-001",
  "results": [
    {
      "flagged": true,
      "categories": {
        "sexual": false,
        "hate": false,
        "harassment": false,
        "self-harm": false,
        "sexual/minors": false,
        "hate/threatening": false,
        "violence/graphic": false,
        "self-harm/intent": false,
        "self-harm/instructions": false,
        "harassment/threatening": true,
        "violence": true,
        "illicit": false,
        "illicit/violent": false,
      },
      "category_scores": {
        "sexual": 1.2282071e-06,
        "hate": 0.010696256,
        "harassment": 0.29842457,
        "self-harm": 1.5236925e-08,
        "sexual/minors": 5.7246268e-08,
        "hate/threatening": 0.0060676364,
        "violence/graphic": 4.435014e-06,
        "self-harm/intent": 8.098441e-10,
        "self-harm/instructions": 2.8498655e-11,
        "harassment/threatening": 0.63055265,
        "violence": 0.99011886,
        "illicit": 0.231049303,
        "illicit/violent": 0.242101936,
      },
      "category_applied_input_types": {
	  "sexual": ["text", "image"],
        "hate": ["text"],
        "harassment": ["text"],
        "self-harm": ["text", "image"],
        "sexual/minors": ["text"],
        "hate/threatening": ["text"],
        "violence/graphic": ["text", "image"],
        "self-harm/intent": ["text", "image"],
        "self-harm/instructions": ["text", "image"],
        "harassment/threatening": ["text"],
        "violence": ["text", "image"],
        "illicit": ["text"],
        "illicit/violent": ["text"],
      }
    }
  ]
}

When using the new model, the results contain two additional categories: illicit, which covers instructions or advice on how to commit wrongdoing (for example, a phrase like “how to shoplift”), and illicit/violent, which covers the same kind of content when the wrongdoing also involves violence.
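
Continuing from the Python sketch above, reading the flags and the new field off a result might look like this (a sketch; the category_applied_input_types attribute name is assumed to mirror the JSON key, and older SDK releases may not expose it or the new illicit categories yet):

result = response.results[0]

print(result.flagged)                                  # overall verdict
print(result.categories.violence)                      # per-category boolean flag
print(result.category_scores.harassment_threatening)   # per-category score

# Which modalities (text and/or image) fed each category's score.
# Assumed attribute name; if your SDK version does not expose it yet,
# inspect result.model_dump() instead.
print(result.category_applied_input_types)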

More accurate scores, especially for low-resource languages

The new model improved by 42% on our multilingual evaluation and improved in 98% of the languages tested. For low-resource languages like Khmer or Swati, it improved by 70%, and we saw the biggest improvements in Telugu (6.4x), Bengali (5.6x), and Marathi (4.6x).

(Improvements are measured by AUPRC; a higher AUPRC indicates better model performance in distinguishing between safe and unsafe examples.)

While the previous model had limited support for non-English languages, the new model’s performance in each of Spanish, German, Italian, Polish, Vietnamese, Portuguese, French, Chinese, Indonesian, and English exceeds even the previous model’s performance on English.

Calibrated scores

The new model’s scores are designed to represent the probability that a piece of content violates the relevant policies; this is what we mean by “calibrated”. This is useful for power users: as we launch new models in the future, thresholds you set against a previous model should continue to behave very similarly, allowing more seamless model switching.
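
As a rough illustration of threshold-based blocking against these calibrated scores (continuing from the Python sketch above; the cut-off values are arbitrary placeholders, not recommendations):

scores = response.results[0].category_scores

# Arbitrary illustrative thresholds -- tune these for your own product and policies.
should_block = (
    scores.violence >= 0.5
    or scores.harassment_threatening >= 0.7
    or scores.illicit >= 0.8   # assumes an SDK release that already knows the new category
)

if should_block:
    print("Blocking this content.")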

The Moderation API remains free to use. To get started, see our Moderation API guide.


Thanks! Added a quick AI summary… :slight_smile:

OpenAI has unveiled a new moderation model, omni-moderation-latest, now available through the Moderation API. Built on GPT-4o, this model supports both text and image inputs, enhancing the detection of harmful content with greater accuracy, especially in non-English languages.

Key Features:

  1. Multimodal Harm Classification:
  • Evaluates images alone or in combination with text across six harm categories, including violence, self-harm, and sexual content.
  • Current multimodal support covers specific categories, with plans to expand to more in the future.
  2. New Harm Categories:
  • Introduces two additional text-only categories:
    • Illicit: Covers instructions or advice on committing wrongdoing (e.g., “how to shoplift”).
    • Illicit/Violent: Pertains to wrongdoing involving violence.
  3. Improved Multilingual Accuracy:
  • Shows a 42% improvement over previous models on internal evaluations.
  • Performance enhanced in 98% of tested languages, with significant gains in low-resource languages like Khmer, Swati, Telugu, Bengali, and Marathi.
  • Outperforms previous English-language performance in languages such as Spanish, German, Italian, and Chinese.
  4. Calibrated Probability Scores:
  • Provides more granular control over moderation decisions.
  • Scores accurately represent the likelihood of content violating policies, ensuring consistency across future models.

Benefits for Developers:

  • Free Access: The new model is free for all developers through the Moderation API, with rate limits based on usage tier.
  • Enhanced Safety: Aids in building safer products by leveraging advanced safety systems.
  • Real-World Applications: Companies like Grammarly and ElevenLabs use the Moderation API to enforce safety guardrails and prevent policy violations in their AI products.

Getting Started:

Developers can begin implementing the omni-moderation-latest model by visiting the Moderation API guide.

https://openai.com/index/upgrading-the-moderation-api-with-our-new-multimodal-moderation-model/

Compatibility Notes

The “illicit” categories are not returned unless “model”: “omni-moderation-latest” is specified explicitly. The Python library, though, already gives a null for them.

Another oddity: the Python library has returned, and will continue to return, an underscored version such as harassment_threatening alongside the original hyphen-and-slash version like harassment/threatening. For one of the new categories, however, only “illicit_violent” is returned, not the original slashed form.
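
A quick way to check which spellings your installed SDK version actually exposes is to compare attribute access with a dump of the parsed model (a sketch; the illicit_violent attribute is assumed to exist only in releases that already model the new categories):

import json
from openai import OpenAI

client = OpenAI()

resp = client.moderations.create(
    model="omni-moderation-latest",
    input="I want to kill them!",
)
scores = resp.results[0].category_scores

# Attribute access uses the snake_case names the SDK generates.
print(scores.harassment_threatening)
print(scores.illicit_violent)  # assumed to exist in up-to-date SDK releases

# Dumping the parsed model shows which key spellings (snake_case,
# slashed, or both) your installed version actually carries.
print(json.dumps(scores.model_dump(), indent=2))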


The values returned with no model specified are quite different from omni-moderation’s, even though the API reference currently indicates that omni-moderation-latest is the default.

Instead, it turns out that specifying text-moderation-stable returns the same values as specifying no model at all, and yet mentions of text-moderation-stable and text-moderation-latest have been removed from the API reference.
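
A comparison like the table below can be reproduced with something along these lines (a sketch; the prompt is a placeholder since the text behind these numbers was not shared, so scores will differ):

from openai import OpenAI

client = OpenAI()

# Placeholder prompt; substitute your own test text.
TEXT = "I want to kill them!"

default_resp = client.moderations.create(input=TEXT)  # no model specified
omni_resp = client.moderations.create(input=TEXT, model="omni-moderation-latest")

default_scores = default_resp.results[0].category_scores.model_dump()
omni_scores = omni_resp.results[0].category_scores.model_dump()

# Keys come back in the SDK's snake_case spelling (e.g. harassment_threatening).
for category in sorted(omni_scores):
    print(f"{category:24s} {default_scores.get(category)!s:>12} {omni_scores[category]!s:>12}")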

Category                  No model    New calibrated model
harassment                0.461984    0.893408
harassment/threatening    0.688145    0.919809
hate                      0.696739    0.880271
hate/threatening          0.795729    0.921030
illicit                   null        0.945309
illicit/violent           null        0.793710
self-harm                 0.000060    0.004716
self-harm/intent          0.000026    0.004590
self-harm/instructions    0.000002    0.000446
sexual                    0.000802    0.002757
sexual/minors             0.000167    0.000086
violence                  0.891316    0.944318
violence/graphic          0.464043    0.176227

The new probability-normalized scores indicate a pretty good “certainty”…