I literally had to butcher my dataset from 120 lines down to 15 to get it to work. This was for gpt-4o-mini-2024-07-18. Here’s the script I made to get it to work (and I still had to manually remove a few lines that mentioned politics):
```python
import json
import openai

# OpenAI API key (to be filled in)
openai.api_key = ""

# Paths for input and output files
input_file = "input_dataset.jsonl"
output_file = "approved_output.jsonl"

# Threshold for category confidence
threshold = 0.001

# Read all lines from the input file and remove duplicates
with open(input_file, 'r', encoding='utf-8') as infile:
    lines = infile.readlines()
unique_lines = list(set(lines))  # Remove duplicate lines

# Process each unique line
with open(output_file, 'w', encoding='utf-8') as outfile:
    for line in unique_lines:
        try:
            data = json.loads(line)
            messages = data.get('messages', [])
            all_messages_approved = True  # Flag to track if all messages are approved

            # Submit each individual message content to the Moderation API
            for message in messages:
                content = message.get('content', '')
                if content:  # Ensure there's content to submit
                    response = openai.Moderation.create(input=content)
                    results = response["results"][0]

                    # Check if any category has a score higher than the threshold
                    for category, score in results["category_scores"].items():
                        if score > threshold:
                            all_messages_approved = False
                            break

                if not all_messages_approved:
                    break

            if all_messages_approved:
                # Only write the original line if all messages are approved
                outfile.write(line)
        except Exception as e:
            print(f"Error processing line: {e}")
```
The fact that a threshold of 0.001 is needed is INSANE. Any message with a category score above that caused the moderation error. They are clearly concerned about people abusing the fine-tuning system. Hopefully, that helps identify the issues in your datasets.
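If you want to try the same filter without burning API calls while you tune it, here is a rough sketch that separates the threshold check from the moderation request. The names `is_clean`, `filter_examples`, and `score_fn` are made up for illustration; `score_fn` is assumed to be any function that takes a message string and returns a dict of category scores, so you can plug in a real Moderation call or a stub. It also assumes each JSONL line is an object with a `messages` list of string contents.

```python
import json
from typing import Callable, Dict, Iterable, List


def is_clean(scores: Dict[str, float], threshold: float = 0.001) -> bool:
    """Return True only if no category score exceeds the threshold."""
    return all(score <= threshold for score in scores.values())


def filter_examples(
    lines: Iterable[str],
    score_fn: Callable[[str], Dict[str, float]],  # hypothetical scoring hook
    threshold: float = 0.001,
) -> List[str]:
    """Keep unique JSONL lines whose messages all score at or below threshold."""
    kept: List[str] = []
    seen = set()
    for raw in lines:
        line = raw.strip()
        if not line or line in seen:
            continue  # skip blank lines and duplicates, preserving input order
        seen.add(line)
        try:
            messages = json.loads(line).get("messages", [])
        except (json.JSONDecodeError, AttributeError):
            continue  # not valid JSON, or not a JSON object
        contents = [m.get("content", "") for m in messages]
        # Only non-empty contents are scored, as in the script above
        if all(is_clean(score_fn(c), threshold) for c in contents if c):
            kept.append(line)
    return kept
```

With a stub like `lambda text: {"violence": 0.5 if "bad" in text else 0.0}` as `score_fn` you can check the filtering logic offline, then swap in a function that calls the Moderation endpoint and returns its category scores. Unlike `list(set(lines))`, this keeps the original line order, which makes it easier to see which examples were dropped.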