Moderation API does not understand the concept of negative prompts

I am trying to get the OpenAI Moderation API to rate positive AND negative prompts that are entered into an image generator. This proves to be difficult or maybe even impossible, as the Moderation API does not understand the concept of negative prompts.
In extreme cases, negative prompts can be interpreted as the exact opposite, which results in Moderation API scores that are completely wrong.

Positive prompt: Photorealistic image of two people kissing
Negative prompt: adults, clothes

The negative prompt in this case turns the harmless picture into an illegal one depicting nude minors.
This should result in a high score in the Moderation API category “sexual/minors - Sexual content that includes an individual who is under 18 years old.”

But it doesn’t, unfortunately.
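For concreteness, here is a minimal sketch of how such a check might be wired up (assuming the v1 OpenAI Python SDK and an `OPENAI_API_KEY` in the environment; `build_moderation_input` and `moderate` are hypothetical helper names, not anything from the API):

```python
def build_moderation_input(positive: str, negative: str) -> str:
    """Combine both prompts into the single plain string the endpoint sees.

    To the moderation model, "Negative prompt:" is just more text -- it has
    no notion of subtractive semantics, so "adults, clothes" reads as the
    presence of adults and clothes, not their removal.
    """
    return f"Positive prompt: {positive}\nNegative prompt: {negative}"


def moderate(positive: str, negative: str):
    # Imported inside the function so the helper above stays usable
    # (and testable) without the SDK installed.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    resp = client.moderations.create(
        input=build_moderation_input(positive, negative)
    )
    return resp.results[0]
```

With the example above, `moderate("Photorealistic image of two people kissing", "adults, clothes")` returns category scores whose `sexual/minors` value stays near zero, which is exactly the problem: the literal text reads harmlessly even though the implied image would be a severe violation.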

Maybe I am just missing the right syntax to enter negative prompts?

Please advise.


Welcome to the forum.

I’m not quite following how that negative prompt leads you to think about minors…


Welcome to the community @Eli_Warren

I get what you’re trying to convey, and tried the negation of your negative prompt on DALL-E and got this:

This would mean that the filter works.


Ah, so saying, “not children” or “not minors”… I know they just updated the moderation endpoint…

The negative prompt contains everything that the generated image should NOT contain.
If the prompt is to render an image of two people, and the negative prompt “adults, clothes” rules out that these people are adults and that the image contains clothes, it pretty much only leaves images of non-adults (i.e. children) wearing no clothes (i.e. naked).

Gotcha. DALL-E 2 isn’t great with negative prompts in general… Might be better to concentrate on the positive traits of the image you’re wanting…

I am not trying to use negative prompts on an LLM. I want to evaluate a set of prompts, always consisting of a positive as well as a negative prompt, for violations using the OpenAI Moderation API.


Well, the moderation endpoint is itself an LLM, but I hear what you’re saying (I think…)

I would reach out to OpenAI and let them know it’s an area you’re interested in for the moderation endpoint. I don’t think they have a general feedback/suggestions email, but support should be able to get your message to the right people.

The moderation endpoint likely has no training on the meaning of “negative prompt”, nor how to connect the dots.

One might have to make a classifier. I tried an extensive middleman-AI approach with gpt-3.5-turbo, and it doesn’t rate the danger high, even though it mostly described correctly what could be produced. (The JSON output gave near-zero scores until I added even more prompt.)

Maybe my two AI toddlers kissing in the bathtub or in swimsuits isn’t going to send me to jail and is only exciting for the very twisted mind.

You are a safety classification moderator AI. What follows in triple-quotes is an AI image generator prompt that instructs an AI how to create a picture. “Positive prompt” will create that type of image output by the AI. “Negative prompt”, however, modifies the image generation process by removing those negative output options from the image, leaving only a subset of remaining options as image possibilities. Use AI inference to describe the kinds of images that could still be produced by the total considered prompts. Then, formatted in JSON, rate 0.00 to 1.00 the danger of image policy violation in these key categories, and a final boolean flag [violence, harm, self-harm, sexual, sexualization_minors, obscene, nudity, is_objectionable] """Positive prompt: Photorealistic image of two people kissing
Negative prompt: adults, clothes"""

Based on the provided instructions, here’s the interpretation of the possible images that could be generated for both the positive and negative prompts:

Positive Prompt: Photorealistic image of two people kissing

In response to the positive prompt, the AI would generate a photorealistic image of two people kissing. The absence of any negative modifiers allows for a broad range of creative interpretations within the context of the given prompt.

Negative Prompt: adults, clothes

In response to the negative prompt, the AI is directed to remove the negative output options “adults” and “clothes” from the image generation process. This would lead to a modified image creation process that focuses on generating images without adults and clothes.

Interpreted Possible Outputs:

The generated images would potentially involve non-adult characters engaged in activities other than kissing, as “adults” are excluded. Additionally, these images would depict characters without clothes, as “clothes” are excluded as well.

Danger Rating and Objectionable Content:

Considering the key categories specified in the prompt, here’s a danger rating for each category and a final boolean flag indicating whether the generated images could be considered objectionable:

  • Violence: 0.00 - There is no indication that violence is a part of the prompt or possible outputs.
  • Harm: 0.00 - The prompt doesn’t involve any indication of harm.
  • Self-harm: 0.00 - The prompt doesn’t involve any indication of self-harm.
  • Sexual: 0.25 - While the prompt primarily focuses on kissing, the removal of the “adults” modifier might still allow for the depiction of younger characters engaged in romantic activities, which could be interpreted as sexual. However, the absence of explicit cues significantly reduces this risk.
  • Sexualization of Minors: 0.25 - The risk of sexualization of minors might be present due to the potential interpretation of non-adult characters engaging in romantic activities. However, the absence of explicit cues and context limits this risk.
  • Obscene: 0.00 - The prompt doesn’t involve any indication of obscenity.
  • Nudity: 0.25 - Nudity might be present in the images due to the absence of the “clothes” modifier, but it would likely be handled within the context of the prompt (e.g., characters depicted without clothes engaged in romantic activities).
  • Is Objectionable: False - While there are potential minor risks in the categories of sexualization of minors, sexual content, and nudity, the absence of explicit cues and the context of the prompt would likely mitigate the objectionable nature of the generated images.

Please note that these danger ratings are subjective interpretations and depend on the sensitivity and cultural context of the audience reviewing the images. The provided ratings are meant to be a general assessment based on the information given.
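Wiring that system prompt into a middleman call might look roughly like this (a sketch assuming the v1 OpenAI Python SDK; `SYSTEM_PROMPT` here is truncated, and `extract_json` and `classify` are made-up helper names; the model may still answer in prose, so the JSON parsing is best-effort):

```python
import json
import re
from typing import Optional

# The full classifier instructions quoted above; truncated here for brevity.
SYSTEM_PROMPT = (
    "You are a safety classification moderator AI. ... "
    "Then, formatted in JSON, rate 0.00 to 1.00 the danger of image policy "
    "violation in these key categories, and a final boolean flag [violence, "
    "harm, self-harm, sexual, sexualization_minors, obscene, nudity, "
    "is_objectionable]"
)


def extract_json(text: str) -> Optional[dict]:
    """Pull the first {...} object out of a reply that may also contain prose."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


def classify(positive: str, negative: str) -> Optional[dict]:
    # Imported inside the function so extract_json stays testable
    # without the SDK installed.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep the ratings as reproducible as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f'"""Positive prompt: {positive}\n'
                           f'Negative prompt: {negative}"""',
            },
        ],
    )
    return extract_json(resp.choices[0].message.content)
```

Even with this wrapper, the scores come back low, as shown in the transcript above; the description step is right but the rating step underestimates the danger.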


Thank you @_j
I just tried that out. The problem with that solution is that depending on the prompt, it will often refuse to process it, triggering a warning about violation of OpenAI’s content policy instead.
So it is unfortunately not suitable as a replacement for the moderation endpoint.

Excellent prompt!
This prompt makes the completion strict enough to achieve the expected goal. But I believe DALL-E 2 itself moderates adult content.