Usage policy violations with fine-tuned model - how can I avoid this?

Before fine-tuning, I filtered every prompt in my dataset through OpenAI’s moderation API, and I’m now doing the same for all user-generated input in my prompts. Even after that heavy filtering, my fine-tuned model still occasionally generates responses that violate the usage policy (discriminatory/violent language). I also run these outputs through the moderation endpoint, so nothing abusive should reach the user, but I’m still concerned that these abusive generations are putting my account at risk.

My suspicion is that a significant portion of the filtered dataset contained abusive material that the moderation endpoint missed, on top of a significant amount of abusive post-training user input slipping through. These two factors could compound to produce the inappropriate generations I’m seeing.

What I want to know is: short of hand-filtering all of the training data, re-training, and then hand-moderating all incoming user input, what can I do to reduce or eliminate these abusive generations? And if there’s nothing I can do, will this affect my OpenAI account standing?

I should add that the rate of abusive generations is still pretty low (around 1–2% of responses), but it does add up. A big thank you to anyone who can offer advice!

Welcome to the developer forum!

In general, so long as you are running both your input and the model’s output through the moderation endpoint (note: that is both directions, in and out), you are performing the required due diligence. If you are still getting policy violations despite doing this, then you should look at building a more robust moderation layer by taking advantage of the floating-point category scores included in the moderation endpoint’s response.

These numeric scores allow you to set your own limits for the various categories and build a set of moderation thresholds that trigger lower than the endpoint’s defaults, which should therefore reduce the number of subsequent policy violation messages you receive.
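For illustration, here is a minimal sketch of that approach in Python, assuming the official `openai` SDK. The category names below are real fields on the moderation response, but the threshold values are placeholders you would tune against your own traffic:

```python
from openai import OpenAI

client = OpenAI()

# Stricter per-category limits than the endpoint's own binary
# "flagged" verdict. These values are illustrative placeholders;
# tune them against your own data.
CUSTOM_THRESHOLDS = {
    "hate": 0.10,
    "violence": 0.10,
    "sexual": 0.20,
    "self_harm": 0.10,
}

def passes_moderation(text: str) -> bool:
    """Return False if the text trips the endpoint's own flag or
    exceeds any of our stricter per-category score limits."""
    result = client.moderations.create(input=text).results[0]
    if result.flagged:  # the endpoint's own verdict is a hard floor
        return False
    scores = result.category_scores
    return all(
        getattr(scores, category) <= limit
        for category, limit in CUSTOM_THRESHOLDS.items()
    )

# Run the same check in both directions: on the user's input before
# the completion call, and on the model's output before it is returned.
```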

So long as you are making a good-faith attempt at moderation using the endpoints provided, you are complying with the contractual requirements of your agreement. However, if you find that violations are still being generated, I’d look into the method above to help you stay within the limits.

Thank you for your helpful reply! I don’t know why I didn’t consider using stricter thresholds from the moderation endpoint’s outputs. That will definitely be the direction I take in the future.
