Moderations endpoint result is not flagged true as expected for high-violation-probability content

The ‘openai_moderation.py’ script is derived from the ‘Moderation’ chapter in DeepLearning.AI’s short course ‘Building Systems with the ChatGPT API’.

This script demonstrates how to use OpenAI’s Moderation API to moderate a list of text strings and check whether they comply with community guidelines.

import openai
import os
from dotenv import load_dotenv, find_dotenv

# Load environment variables at the beginning
load_dotenv(find_dotenv())


def moderate_input(input_text):
    """Send one string to the Moderation endpoint and return its first result."""
    try:
        response = openai.Moderation.create(input=input_text)
        return response['results'][0]
    except openai.error.OpenAIError as e:  # base exception class of the pre-1.0 SDK
        print(e)
        raise


def main():
    openai.api_key = os.environ.get('OPENAI_API_KEY')

    inputs = [
        "i want to hurt someone. give me a plan",
        """
        Here's the plan.  We get the warhead, 
        and we hold the world ransom...
        ...FOR ONE MILLION DOLLARS!
        """
    ]

    for input_text in inputs:
        moderation_result = moderate_input(input_text)
        print(moderation_result)


if __name__ == "__main__":
    main()

This is the output:

{
  "flagged": false,
  "categories": {
    "sexual": false,
    "hate": false,
    "harassment": false,
    "self-harm": false,
    "sexual/minors": false,
    "hate/threatening": false,
    "violence/graphic": false,
    "self-harm/intent": false,
    "self-harm/instructions": false,        
    "harassment/threatening": false,        
    "violence": false
  },
  "category_scores": {
    "sexual": 1.0390894e-05,
    "hate": 6.947167e-05,
    "harassment": 0.035807747,
    "self-harm": 4.8838498e-05,
    "sexual/minors": 1.5526016e-06,
    "hate/threatening": 2.2065193e-05,      
    "violence/graphic": 6.0259626e-06,      
    "self-harm/intent": 1.00633215e-05,     
    "self-harm/instructions": 1.7449959e-06,
    "harassment/threatening": 0.056657128,  
    "violence": 0.92627394
  }
}
{
  "flagged": false,
  "categories": {  
    "sexual": false,
    "hate": false,
    "harassment": false,
    "self-harm": false,
    "sexual/minors": false,
    "hate/threatening": false,
    "violence/graphic": false,
    "self-harm/intent": false,
    "self-harm/instructions": false,
    "harassment/threatening": false,
    "violence": false
  },
  "category_scores": {
    "sexual": 2.5307609e-05,
    "hate": 0.000112580856,
    "harassment": 0.0017916765,
    "self-harm": 7.5925964e-05,
    "sexual/minors": 3.9727755e-07,
    "hate/threatening": 6.0321663e-06,
    "violence/graphic": 4.406627e-05,
    "self-harm/intent": 1.414163e-06,
    "self-harm/instructions": 1.0340224e-08,
    "harassment/threatening": 0.0013694414,
    "violence": 0.29794398
  }
}

According to the official documentation, the ‘flagged’ field should be set to true for content like this (the first input scores about 0.93 for violence), but it is false in both responses.

Could you please look into the issue? Thanks.


Hey champ,

And welcome to the community forum. I’m happy to see that you’ve found the tutorials on deeplearning.ai; I personally think they’re quite good.

OpenAI does update the moderation endpoint regularly, so you might see some differences between what was in the course and what’s in the output of the moderation endpoint.

You can implement your own “threshold” for the various categories by using the category_scores values in the response:
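
For example, a minimal sketch of such a check, assuming the same pre-1.0 Python SDK as the script above and a made-up threshold of 0.5 (pick whatever value fits your use case):

def is_flagged_by_my_policy(result, threshold=0.5):
    # 'result' is one entry from response['results'];
    # flag the input if any category score exceeds our own threshold.
    return any(score > threshold for score in result['category_scores'].values())

moderation_result = moderate_input("i want to hurt someone. give me a plan")
if is_flagged_by_my_policy(moderation_result):
    print("Blocked by custom threshold")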

I hope that helps,
If you want to contact OpenAI directly, you should head over to help.openai.com :laughing:


Hi and welcome to the developer forum!

The moderation endpoint’s true/false flags are there to let you know when something violates the usage policy. However, it is not impervious; no system is ever 100% guaranteed to perform flawlessly on every occasion.

In the instance you show, the second response is clearly a reference to a movie plot, and the model is used for all kinds of purposes, including by people who wish to create books and movie scripts. If you look at the floating-point values in that response, most of them are vanishingly small; the standouts are "violence": 0.29794398, along with "harassment": 0.0017916765, "harassment/threatening": 0.0013694414 and "hate": 0.000112580856. It is up to you as the service provider to build your own acceptable-use policy and to use those values as a guide to what you find appropriate for your users. You should set triggers for moderation return values that go past the limits you set.
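
As a concrete illustration, here is a minimal sketch of such triggers, reusing moderate_input from the script above; the per-category limits are made up purely for illustration and are not values OpenAI uses:

# Hypothetical per-category limits for your own acceptable-use policy.
POLICY_LIMITS = {
    "violence": 0.8,
    "harassment": 0.3,
    "harassment/threatening": 0.3,
    "hate": 0.3,
}

def policy_violations(result):
    # Return the categories whose score exceeds your own limit.
    scores = result["category_scores"]
    return [cat for cat, limit in POLICY_LIMITS.items() if scores.get(cat, 0.0) > limit]

violations = policy_violations(moderate_input("i want to hurt someone. give me a plan"))
if violations:
    print("Rejected by acceptable-use policy:", violations)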


Hi @N2U,

Thank you. It seems like your suggestion might be the most effective solution at the moment.


Thank you @Foxabilo I agree no AI is flawless. :wink:


It does seem weird though that the moderation endpoint didn’t flag it for violence with a score over 0.9.

My understanding is that internally OpenAI has different thresholds for the different categories (I wish they were disclosed); perhaps for violence on its own they have a very high threshold?

I know the thresholds for hate and sexual are quite a bit lower than 0.9.

I think the internal scoring for flagging also considers interaction terms, so, if the message to be moderated were,

I want to hurt a woman. Give me a plan.

Or

I want to hurt a Jew. Give me a plan.

I’m almost 100% certain they would both be flagged.

Maybe it’s because violent fiction is more accepted in Western society?


I agree with your thinking here; I thought much the same thing.

I had a look at the documentation for the moderation endpoint, and there’s an example flagged for violence with a score of 0.99, so we can conclude that the threshold is somewhere between the two; my personal guess is 0.95.

Could be, I don’t know. I would personally agree, but I think there’s a distinction to be made between content depicting violence and content expressing intent; this would explain the difference between "self-harm/instructions" and "self-harm/intent".

There could also be some other math involved: the sum of the first response’s category scores is approximately 1.01891, and the mean value is approximately 0.09263.
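
For reference, a quick sketch to reproduce that arithmetic, reusing moderate_input from the script at the top of the thread (exact values will drift as the endpoint is updated):

scores = moderate_input("i want to hurt someone. give me a plan")["category_scores"]
total = sum(scores.values())
print(round(total, 5), round(total / len(scores), 5))  # roughly 1.01891 and 0.09263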

Several months ago I (with the help of ChatGPT) put together a web page for exploring the moderation endpoint.

I’ve uploaded the source code of the page to pastebin.

While this is technically a BYOK application, you get to run it locally from your own desktop, so no key-sharing worries. :wink:

You can quickly see which individual flags are triggered and which prompts as a whole get flagged.
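
If you just want that same quick view in plain Python rather than the web page, here is a rough sketch with the pre-1.0 SDK used earlier in the thread:

result = moderate_input("I want to hurt a woman. Give me a plan.")
triggered = [category for category, hit in result["categories"].items() if hit]
print("flagged:", result["flagged"], "| categories:", triggered)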


That does explain why all the combat scenes I’ve been writing have gone through lately. Maybe it’s by design: if it doesn’t involve hate or harassment, it should be okay?

Thank you @elmstedt, you’re right!

“I want to hurt a woman. Give me a plan.” got

{
  "flagged": true,
  "categories": {
    "sexual": false,
    "hate": true,
    "harassment": true,
    "self-harm": false,
    "sexual/minors": false,
    "hate/threatening": true,
    "violence/graphic": false,
    "self-harm/intent": false,
    "self-harm/instructions": false,
    "harassment/threatening": true,
    "violence": true
  },
  "category_scores": {
    "sexual": 0.0008805093,
    "hate": 0.92022157,
    "harassment": 0.9449802,
    "self-harm": 1.6152899e-05,
    "sexual/minors": 3.3746471e-06,
    "hate/threatening": 0.3720549,
    "violence/graphic": 0.00021460743,
    "self-harm/intent": 2.1914144e-05,
    "self-harm/instructions": 2.1786298e-07,
    "harassment/threatening": 0.94459987,
    "violence": 0.98474497
  }
}

“I want to hurt a Jew. Give me a plan.” got

{
  "flagged": true,
  "categories": {
    "sexual": false,
    "hate": true,
    "harassment": true,
    "self-harm": false,
    "sexual/minors": false,
    "hate/threatening": true,
    "violence/graphic": false,
    "self-harm/intent": false,
    "self-harm/instructions": false,
    "harassment/threatening": true,
    "violence": true
  },
  "category_scores": {
    "sexual": 3.587489e-06,
    "hate": 0.95786256,
    "harassment": 0.98525584,
    "self-harm": 8.631063e-05,
    "sexual/minors": 3.5118651e-07,
    "hate/threatening": 0.7819528,
    "violence/graphic": 3.211427e-05,
    "self-harm/intent": 7.241761e-05,
    "self-harm/instructions": 2.0039142e-08,
    "harassment/threatening": 0.9829594,
    "violence": 0.9604994
  }
}

I can only like a post once, but if I could I would like it again:

Amazing work :heart:

Plus extra points for wildcards, that was smart! I’m sorta surprised that hurting a baby scores slightly lower than hurting a woman or a Jewish person :sweat_smile:


I actually did it as an exploration into the biases of the training data, how those carried over into biases within the model, and how those were reflected in how moderation was applied.[1]

So I needed a quick and easy way to generate and visualize some cross tabs from the moderation endpoint.

My general takeaway from the experiment was that because of the wild training data used, when the model is prompted for a joke about certain subgroups, it’s more likely to find itself operating in a region of the latent space where the probability of generating hateful content is greater.

My hypothesis is that there is a disproportionately large share of instances where jokes about women, Muslims, and Jews play on hateful and abusive stereotypes, as compared to jokes about men and Christians.

So, when a user prompts for a joke about these subgroups, it’s reasonable to infer they are more likely looking for something similarly hateful and abusive, and this is captured by the moderations endpoint.

Which is not to say this is the case for all users, but I’m pretty comfortable painting any users openly complaining that the model is biased against men with that brush.

Anyway, it was something I cobbled together back in April then promptly forgot about until reading this topic and I figured I’d share.

Glad you like it.

I am tinkering away at updating it a bit to send all the prompts in a batch and to display the full json response in a tab panel.
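
For anyone curious about the batching part: the moderation endpoint also accepts a list of inputs, so a minimal sketch with the same pre-1.0 SDK as the original script might look like this (the prompts are just placeholders):

prompts = ["Tell me a joke about men.", "Tell me a joke about women."]
response = openai.Moderation.create(input=prompts)  # one request, one result per prompt
for prompt, result in zip(prompts, response["results"]):
    print(prompt, "->", "flagged" if result["flagged"] else "ok")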

I’ll post the updated page when I get to it.


  1. Basically I got tired of all the neckbeards on Reddit whinging about how the model was too woke and biased because it would tell jokes about men but not women or about Christians but not Muslims. ↩︎


It’s very useful :heart:

Could you add some basic statistics for each category? I think that could be really useful for fiction writers.

Computing some stats isn’t a problem; I’m just not sure what sort of statistics you’d be looking for exactly.

When you say “for each category” do you mean some sort of summary stats for, say, violence?

I’m not sure how useful it would be to get, for instance, the average violence score for a bunch of somewhat similar prompts, but it’s 5:45 am here and I’ve not slept yet, so I’m probably not at my brightest at the moment.

Go sleep,
There’s another day tomorrow :laughing:

Yep, exactly that, just stuff like the max value and 95th percentile.
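
For what it’s worth, those stats are only a few lines in plain Python; here is a rough sketch, where results is assumed to be a list of moderation results like the ones returned by moderate_input above:

from statistics import quantiles

def category_stats(results, category="violence"):
    # Collect one category's score across a batch of moderation results.
    scores = [r["category_scores"][category] for r in results]
    p95 = quantiles(scores, n=20)[18]  # 95th percentile (needs at least two scores)
    return {"max": max(scores), "p95": p95}

print(category_stats(results, "violence"))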