Moderations.create - how to save and parse output?

Let’s do moderations!

First, we're going to need the prerequisites: Python 3.8-3.11. Then you'll need to run pip install --upgrade openai to get the latest version of the Python library with its new client object.
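If you want to confirm the environment before going further, here's a quick sanity check (this assumes the 1.x library, which exposes openai.__version__):

import sys
import openai

print(sys.version)         # should report 3.8-3.11 per the prerequisites
print(openai.__version__)  # should be 1.x for the new client object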

OpenAI’s example

from openai import OpenAI
client = OpenAI()
client.moderations.create(input="I want to kill them.")

Lame. It fires off the API call but discards the return value, so you never see the results.

Useful example

from openai import OpenAI
client = OpenAI()
text = "I like kittens."
api_response = client.moderations.create(input=text)
response_dict = api_response.model_dump()  # convert the Pydantic model to a plain dict
is_flagged = response_dict['results'][0]['flagged']  # True if OpenAI flags the input

Now you have a plain dictionary object. I also pull out a boolean, is_flagged, that you can check to see whether the input was flagged.

In similar fashion, you can find the categories set to true that produced the flag.
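For example, a minimal sketch continuing from the response_dict above (the variable names are my own):

is_flagged = response_dict['results'][0]['flagged']
if is_flagged:
    # Keep only the categories whose boolean value is True
    flagged_categories = [
        category
        for category, value in response_dict['results'][0]['categories'].items()
        if value
    ]
    print(f"Flagged for: {', '.join(flagged_categories)}")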

See pretty results

You might not want one garbled line running off the screen, but instead nicely formatted output, with the categories alphabetized and the number values not shown in exponential form. Let's add some utility functions for an interactive script.

import json

def process(data):
    # Recursively sort dictionary keys and format any float values
    if isinstance(data, dict):
        sorted_data = {k: process(v) for k, v in sorted(data.items())}
        return {k: format_floats(v) for k, v in sorted_data.items()}
    elif isinstance(data, list):
        return [process(item) for item in data]
    else:
        return data

def format_floats(data):
    if isinstance(data, float):
        # Format floats to 10 decimal places as strings
        return f"{data:.10f}"
    else:
        return data

text = "I drown kittens."
api_response = client.moderations.create(input=text)
response_dict = api_response.model_dump()

formatted_dict = process(response_dict)
print(json.dumps(formatted_dict, indent=2))

We get output meant for humans:

{
  "id": "modr-8PkrTu6sR6pT1ztdSRAwVslnt6OtS",
  "model": "text-moderation-006",
  "results": [
    {
      "categories": {
        "harassment": false,
        "harassment/threatening": false,
        "harassment_threatening": false,
        "hate": false,
        "hate/threatening": false,
        "hate_threatening": false,
        "self-harm": false,
        "self-harm/instructions": false,
        "self-harm/intent": false,
        "self_harm": false,
        "self_harm_instructions": false,
        "self_harm_intent": false,
        "sexual": false,
        "sexual/minors": false,
        "sexual_minors": false,
        "violence": false,
        "violence/graphic": false,
        "violence_graphic": false
      },
      "category_scores": {
        "harassment": "0.0026197021",
        "harassment/threatening": "0.0043704621",
        "harassment_threatening": "0.0043704621",
        "hate": "0.0000743081",
        "hate/threatening": "0.0000794773",
        "hate_threatening": "0.0000794773",
        "self-harm": "0.0000493223",
        "self-harm/instructions": "0.0000000002",
        "self-harm/intent": "0.0000661878",
        "self_harm": "0.0000493223",
        "self_harm_instructions": "0.0000000002",
        "self_harm_intent": "0.0000661878",
        "sexual": "0.0000032877",
        "sexual/minors": "0.0000095750",
        "sexual_minors": "0.0000095750",
        "violence": "0.6199731827",
        "violence/graphic": "0.0040242169",
        "violence_graphic": "0.0040242169"
      },
      "flagged": false
    }
  ]
}

Pick one of the duplicated items

Sorting reveals that the new moderation Pydantic model object has an issue, seen in all its output methods: every category with a slash in its name is duplicated under an underscore name, and the same goes for the hyphen in self-harm.

This is likely meant to anticipate Python attribute access, where a slash or hyphen would break the name, but it is also silly, so we pick one and kill the other. Let's go with the logic that the underscore version looks better and is more reliable.

from openai import OpenAI
import json

client = OpenAI()

def process(data):
    if isinstance(data, dict):
        sorted_data = {
            k: process(v)
            for k, v in sorted(data.items())
            if '/' not in k and '-' not in k  # Drop duplicate keys containing '/' or '-'
        }
        return {k: format_floats(v) for k, v in sorted_data.items()}
    elif isinstance(data, list):
        return [process(item) for item in data]
    else:
        return data

def format_floats(data):
    if isinstance(data, float):
        # Format floats to 7 decimal places as strings
        return f"{data:.7f}"
    else:
        return data

text = "I drown kittens."
api_response = client.moderations.create(input=text)
response_dict = api_response.model_dump()

formatted_dict = process(response_dict)
print(json.dumps(formatted_dict, indent=2))

Now we get a better display that we can understand and act on:


"category_scores": {
  "harassment": "0.0026898",
  "harassment_threatening": "0.0043335",
  "hate": "0.0000773",
  "hate_threatening": "0.0000787",
  "self_harm": "0.0000494",
  "self_harm_instructions": "0.0000000",
  "self_harm_intent": "0.0000658",
  "sexual": "0.0000034",
  "sexual_minors": "0.0000100",
  "violence": "0.6200027",
  "violence_graphic": "0.0038763"
},

You might want 10 decimal places to see really low values.

Further moderations

Drowning kittens is not flagged (a flag means the input likely violates OpenAI policy), but it is more violent than we might want kids to say or receive.

In that case, you can write your own thresholds for flagging. That will take a lot of experimentation, because OpenAI doesn't publish the score values that trigger a flag in each category, so you have to find that baseline yourself and decide how much to adjust each threshold.
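A minimal sketch of what that could look like, continuing from the response_dict above. The threshold values here are illustrative placeholders for you to tune, not OpenAI's internal flagging levels:

CUSTOM_THRESHOLDS = {
    # Hypothetical per-category thresholds; tune these to your own tolerance
    "violence": 0.5,
    "violence_graphic": 0.3,
    "harassment_threatening": 0.4,
}

def custom_flag(response_dict, thresholds=CUSTOM_THRESHOLDS):
    # Return the categories whose raw score exceeds our own threshold
    scores = response_dict["results"][0]["category_scores"]
    return {
        category: score
        for category, score in scores.items()
        if score > thresholds.get(category, 1.0)  # 1.0 = never flag
    }

exceeded = custom_flag(response_dict)
if exceeded:
    print("Custom flag:", exceeded)

With the "I drown kittens." example, the violence score of about 0.62 would trip the hypothetical 0.5 threshold even though OpenAI itself returned flagged: false.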
