Moderations.create - how to save and parse output?

Hi everyone, I am trying to store the output of the moderation endpoint to analyze later. However, I can’t parse any of the info because the minute I save it in a pandas DataFrame or as a .csv file it turns into a string. The API documentation says the output should be JSON, but I do not seem to be getting that. When I try the following Python code:

response = client.moderations.create(input="Sample text goes here.")
response.results[0]

I get the output:

Moderation(categories=Categories(harassment=False, harassment_threatening=False, hate=False, hate_threatening=False, self_minus_harm=False, self_minus_harm_instructions=False, self_minus_harm_intent=False, sexual=False, sexual_minors=False, violence=False, violence_graphic=False, self-harm=False, sexual/minors=False, hate/threatening=False, violence/graphic=False, self-harm/intent=False, self-harm/instructions=False, harassment/threatening=False), category_scores=CategoryScores(harassment=9.828779730014503e-05, harassment_threatening=7.383290494544781e-07, hate=2.796228727675043e-05, hate_threatening=8.01505279923731e-08, self_minus_harm=1.2543721084057324e-07, self_minus_harm_instructions=1.4000808290504096e-09, self_minus_harm_intent=7.407836477568708e-08, sexual=0.0029249193612486124, sexual_minors=6.004794613545528e-06, violence=8.165345207089558e-05, violence_graphic=5.865722414455377e-07, self-harm=1.2543721084057324e-07, sexual/minors=6.004794613545528e-06, hate/threatening=8.01505279923731e-08, violence/graphic=5.865722414455377e-07, self-harm/intent=7.407836477568708e-08, self-harm/instructions=1.4000808290504096e-09, harassment/threatening=7.383290494544781e-07), flagged=False)

This can’t be parsed into JSON using json.loads() or anything similar, because the Moderation object is not recognized. I am confused because the documentation says that the exact same code should produce the following output:

[
    {
      "flagged": true,
      "categories": {
        "sexual": false,
        "hate": false,
        "harassment": false,
        "self-harm": false,
        "sexual/minors": false,
        "hate/threatening": false,
        "violence/graphic": false,
        "self-harm/intent": false,
        "self-harm/instructions": false,
        "harassment/threatening": true,
        "violence": true,
      },
      "category_scores": {
        "sexual": 1.2282071e-06,
        "hate": 0.010696256,
        "harassment": 0.29842457,
        "self-harm": 1.5236925e-08,
        "sexual/minors": 5.7246268e-08,
        "hate/threatening": 0.0060676364,
        "violence/graphic": 4.435014e-06,
        "self-harm/intent": 8.098441e-10,
        "self-harm/instructions": 2.8498655e-11,
        "harassment/threatening": 0.63055265,
        "violence": 0.99011886,
      }
    }
  ]

I’m not sure if I’m completely misinterpreting this, but I am really struggling to store this on my machine and then open it up in another environment and parse the object. Does anyone know how to do this? Thanks in advance!

Let’s do moderations!

First, we’re going to need the prerequisites: Python 3.8-3.11. Then you’ll need to run pip install --upgrade openai to get the latest version of the Python library with its new client object.

OpenAI’s example

from openai import OpenAI
client = OpenAI()
client.moderations.create(input="I want to kill them.")

Lame. It doesn’t even capture the results in a variable so you can do anything with them.

Useful example

from openai import OpenAI
client = OpenAI()

text = "I like kittens."
api_response = client.moderations.create(input=text)
# Convert the Pydantic response model to a plain Python dictionary
response_dict = api_response.model_dump()
# Boolean: was the input flagged by the moderation endpoint?
is_flagged = response_dict['results'][0]['flagged']

Now you have a dictionary object. The last line also gives you a boolean you can check to see whether the input got flagged.

In similar fashion, you can find the “true” categories that resulted in the flag.
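
For example, a minimal sketch using the response_dict from the snippet above (the category key names come from the dumped dictionary shown further down):

# Collect the category names whose boolean value is True, i.e. the ones behind the flag
flagged_categories = [
    category
    for category, value in response_dict['results'][0]['categories'].items()
    if value
]
print(flagged_categories)  # e.g. ['violence'] - an empty list if nothing was flagged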

See pretty results

You might actually not want a garbled line running off the screen but instead a nicely formatted output, with the categories alphabetized and the number values not shown in exponential form. Let’s add some more utility for an interactive script.

import json

def process(data):
    if isinstance(data, dict):
        sorted_data = {k: process(v) for k, v in sorted(data.items())}
        return {k: format_floats(v) for k, v in sorted_data.items()}
    elif isinstance(data, list):
        return [process(item) for item in data]
    else:
        return data

def format_floats(data):
    if isinstance(data, float):
        # Format floats to 10 decimal places as strings
        return f"{data:.10f}"
    else:
        return data

text = "I drown kittens."
api_response = client.moderations.create(input=text)
response_dict = api_response.model_dump()

formatted_dict = process(response_dict)
print(json.dumps(formatted_dict, indent=2))

We get output meant for humans:

{
  "id": "modr-8PkrTu6sR6pT1ztdSRAwVslnt6OtS",
  "model": "text-moderation-006",
  "results": [
    {
      "categories": {
        "harassment": false,
        "harassment/threatening": false,
        "harassment_threatening": false,
        "hate": false,
        "hate/threatening": false,
        "hate_threatening": false,
        "self-harm": false,
        "self-harm/instructions": false,
        "self-harm/intent": false,
        "self_harm": false,
        "self_harm_instructions": false,
        "self_harm_intent": false,
        "sexual": false,
        "sexual/minors": false,
        "sexual_minors": false,
        "violence": false,
        "violence/graphic": false,
        "violence_graphic": false
      },
      "category_scores": {
        "harassment": "0.0026197021",
        "harassment/threatening": "0.0043704621",
        "harassment_threatening": "0.0043704621",
        "hate": "0.0000743081",
        "hate/threatening": "0.0000794773",
        "hate_threatening": "0.0000794773",
        "self-harm": "0.0000493223",
        "self-harm/instructions": "0.0000000002",
        "self-harm/intent": "0.0000661878",
        "self_harm": "0.0000493223",
        "self_harm_instructions": "0.0000000002",
        "self_harm_intent": "0.0000661878",
        "sexual": "0.0000032877",
        "sexual/minors": "0.0000095750",
        "sexual_minors": "0.0000095750",
        "violence": "0.6199731827",
        "violence/graphic": "0.0040242169",
        "violence_graphic": "0.0040242169"
      },
      "flagged": false
    }
  ]
}

Pick one of the duplicated items

Sorting shows us the new moderation Pydantic model object has an issue, seen in all of its dump methods: category keys with a slash are duplicated with an underscore, and the same goes for the hyphen in self-harm.

This could be anticipating the need for Python attribute access, where a slash would break parsing, but it is also silly, so we pick one and kill the other. Let’s use the logic that the underscore version looks better and is more reliable.

from openai import OpenAI
import json

client = OpenAI()

def process(data):
    if isinstance(data, dict):
        sorted_data = {
            k: process(v)
            for k, v in sorted(data.items())
            if '/' not in k and '-' not in k  # Filter out key-value pairs with '/' and '-'
        }
        return {k: format_floats(v) for k, v in sorted_data.items()}
    elif isinstance(data, list):
        return [process(item) for item in data]
    else:
        return data

def format_floats(data):
    if isinstance(data, float):
        # Format floats to 7 decimal places as strings
        return f"{data:.7f}"
    else:
        return data

text = "I drown kittens."
api_response = client.moderations.create(input=text)
response_dict = api_response.model_dump()

formatted_dict = process(response_dict)
print(json.dumps(formatted_dict, indent=2))

Now we get a better display that we can understand and take action on:


"category_scores": {
  "harassment": "0.0026898",
  "harassment_threatening": "0.0043335",
  "hate": "0.0000773",
  "hate_threatening": "0.0000787",
  "self_harm": "0.0000494",
  "self_harm_instructions": "0.0000000",
  "self_harm_intent": "0.0000658",
  "sexual": "0.0000034",
  "sexual_minors": "0.0000100",
  "violence": "0.6200027",
  "violence_graphic": "0.0038763"
},

You might want 10 decimal places to see really low values.
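
And if, like the original question, the goal is to store the results on disk and parse them later in another environment, a minimal sketch (the filename here is just an example) is to dump the plain dictionary to a JSON file and read it back with the standard library:

import json

# Write the plain-dict version of the response to disk (filename is an example)
with open("moderation_results.json", "w") as f:
    json.dump(response_dict, f, indent=2)

# Later, in another environment, read it back and parse it as ordinary JSON
with open("moderation_results.json") as f:
    loaded = json.load(f)

print(loaded["results"][0]["flagged"])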

Further moderations

Drowning kittens is not flagged (flagging means the text violates OpenAI policy), but it is an example more violent than we might want kids to say or receive.

In that case, you can write your own thresholds for flagging. That will take a lot of experimentation, because OpenAI doesn’t publish the score that equates to a flag for each category, so you have to find that baseline yourself and decide how much to adjust each threshold.
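
As a rough sketch of what that could look like (the threshold values below are made up for illustration, not OpenAI’s actual cutoffs):

# Hypothetical per-category thresholds - tune these by experimenting on your own data
my_thresholds = {
    "violence": 0.5,
    "harassment": 0.5,
    "sexual": 0.2,
}

scores = response_dict['results'][0]['category_scores']

# Apply your own flag if any score exceeds your threshold for that category
my_flag = any(
    scores.get(category, 0.0) > threshold
    for category, threshold in my_thresholds.items()
)
print(my_flag)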


Thank you!! This is exactly what I was looking for.
