Let’s do moderations!
First, we’re going to need the prerequisites: Python 3.8-3.11. Then you’ll need to run pip install --upgrade openai
to get the latest version of the Python library with its new client object.
OpenAI’s example
from openai import OpenAI
client = OpenAI()
client.moderations.create(input="I want to kill them.")
Lame. Doesn’t even let you get the results.
Useful example
from openai import OpenAI
client = OpenAI()
text = "I like kittens."
api_response = client.moderations.create(input=text)
response_dict = api_response.model_dump()
is_flagged = response_dict['results'][0]['flagged']
Now you have a dictionary object. I also pull out a boolean you can check to see whether the input got flagged.
In similar fashion, you can find the “true” categories that resulted in the flag, as in the sketch below.
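For example, here’s a minimal sketch, building on the response_dict from above, that collects whichever category names came back true (the variable names here are just my own choices):

# Hypothetical helper: list the category names that came back true
flagged_categories = [
    category
    for category, value in response_dict['results'][0]['categories'].items()
    if value
]
print(flagged_categories)  # e.g. [] for "I like kittens."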
See pretty results
You might actually not want a garbled line going off the screen but instead a nicely formatted output, with the categories alphabetized, and the number values not shown in an exponential form. Let’s add some more utility for an interactive script.
import json

def process(data):
    if isinstance(data, dict):
        sorted_data = {k: process(v) for k, v in sorted(data.items())}
        return {k: format_floats(v) for k, v in sorted_data.items()}
    elif isinstance(data, list):
        return [process(item) for item in data]
    else:
        return data

def format_floats(data):
    if isinstance(data, float):
        # Format floats to 10 decimal places as strings
        return f"{data:.10f}"
    else:
        return data

text = "I drown kittens."
api_response = client.moderations.create(input=text)
response_dict = api_response.model_dump()
formatted_dict = process(response_dict)
print(json.dumps(formatted_dict, indent=2))
We get output meant for humans:
{
  "id": "modr-8PkrTu6sR6pT1ztdSRAwVslnt6OtS",
  "model": "text-moderation-006",
  "results": [
    {
      "categories": {
        "harassment": false,
        "harassment/threatening": false,
        "harassment_threatening": false,
        "hate": false,
        "hate/threatening": false,
        "hate_threatening": false,
        "self-harm": false,
        "self-harm/instructions": false,
        "self-harm/intent": false,
        "self_harm": false,
        "self_harm_instructions": false,
        "self_harm_intent": false,
        "sexual": false,
        "sexual/minors": false,
        "sexual_minors": false,
        "violence": false,
        "violence/graphic": false,
        "violence_graphic": false
      },
      "category_scores": {
        "harassment": "0.0026197021",
        "harassment/threatening": "0.0043704621",
        "harassment_threatening": "0.0043704621",
        "hate": "0.0000743081",
        "hate/threatening": "0.0000794773",
        "hate_threatening": "0.0000794773",
        "self-harm": "0.0000493223",
        "self-harm/instructions": "0.0000000002",
        "self-harm/intent": "0.0000661878",
        "self_harm": "0.0000493223",
        "self_harm_instructions": "0.0000000002",
        "self_harm_intent": "0.0000661878",
        "sexual": "0.0000032877",
        "sexual/minors": "0.0000095750",
        "sexual_minors": "0.0000095750",
        "violence": "0.6199731827",
        "violence/graphic": "0.0040242169",
        "violence_graphic": "0.0040242169"
      },
      "flagged": false
    }
  ]
}
Pick one of the duplicated items
Sorting shows us that the new moderation Pydantic model object has an issue, seen in all methods: every category with a slash in its name is duplicated under an underscore name, and the same goes for the self-harm categories with a hyphen.
This could be anticipating attribute-style access in Python, where the slash or hyphen would break the reference, but carrying both is also silly, so we pick one and kill the other. Let’s go with the underscore version, which looks better and is more reliable.
from openai import OpenAI
import json
client = OpenAI()
def process(data):
    if isinstance(data, dict):
        sorted_data = {
            k: process(v)
            for k, v in sorted(data.items())
            if '/' not in k and '-' not in k  # Filter out key-value pairs with '/' and '-'
        }
        return {k: format_floats(v) for k, v in sorted_data.items()}
    elif isinstance(data, list):
        return [process(item) for item in data]
    else:
        return data

def format_floats(data):
    if isinstance(data, float):
        # Format floats to 7 decimal places as strings
        return f"{data:.7f}"
    else:
        return data

text = "I drown kittens."
api_response = client.moderations.create(input=text)
response_dict = api_response.model_dump()
formatted_dict = process(response_dict)
print(json.dumps(formatted_dict, indent=2))
Now we get a better display that we can understand and take action on:
…
      "category_scores": {
        "harassment": "0.0026898",
        "harassment_threatening": "0.0043335",
        "hate": "0.0000773",
        "hate_threatening": "0.0000787",
        "self_harm": "0.0000494",
        "self_harm_instructions": "0.0000000",
        "self_harm_intent": "0.0000658",
        "sexual": "0.0000034",
        "sexual_minors": "0.0000100",
        "violence": "0.6200027",
        "violence_graphic": "0.0038763"
      },
You might want 10 decimal places to see really low values.
Further moderations
Drowning kittens is not flagged (a flag means the input violates OpenAI policy), but it is more violent than we might want kids to say or receive.
In that case, you can write your own thresholds for flagging, as sketched below. That will take a lot of experimentation, because OpenAI doesn’t publish the score value at which each category gets flagged, so you have to find that baseline yourself and decide how much you want to tighten each one.
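Here’s a minimal sketch of the idea, again using the raw response_dict from the earlier script. The threshold numbers and the function name are placeholders of my own, not anything OpenAI publishes:

# Hypothetical custom thresholds - tune these by experiment, they are not OpenAI's
MY_THRESHOLDS = {
    "violence": 0.3,
    "harassment": 0.5,
    "sexual_minors": 0.01,
}

def my_flag(response_dict, thresholds=MY_THRESHOLDS, default=0.8):
    """Return the categories whose score exceeds our own threshold."""
    scores = response_dict['results'][0]['category_scores']
    return {
        category: score
        for category, score in scores.items()
        if score > thresholds.get(category, default)
    }

custom_flags = my_flag(response_dict)
if custom_flags:
    print("Blocked by our own policy:", custom_flags)

With "I drown kittens." and the scores shown above, a violence threshold of 0.3 would catch the input even though OpenAI’s own flag stayed false.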