Usage policy violations with fine-tuned model - how can I avoid this?

Before fine-tuning, I filtered every prompt in my dataset through OpenAI’s moderation API, and I’m now doing the same for all user-generated input in my prompts. Even after that heavy filtering, my fine-tuned model still occasionally generates responses that violate the usage policy (discriminatory/violent language). I also run these outputs through the moderation endpoint, so nothing abusive should reach the user, but I’m still concerned that these abusive generations are putting my account at risk.

My suspicion is that a significant portion of the filtered dataset contained abusive material that the moderation endpoint missed, on top of a significant amount of abusive post-training user input slipping through. These two factors could compound to produce the inappropriate generations I’m seeing.

What I want to know is: short of hand-filtering all of the training data, re-training, and then hand-moderating all incoming user input, what can I do to reduce or eliminate these abusive generations? And if there’s nothing I can do, will this affect my OpenAI account standing?

I should add that the rate of abusive generations is still pretty low (around 1–2% of responses), but it does add up. A big thank you to anyone who can offer advice!

Welcome to the developer forum!

In general, so long as you are running both your input and the model’s output through the moderation endpoint (note: that is both directions, in and out), you are performing the required due diligence. If you are still getting policy violations despite doing this, then you should look at building a more robust moderation layer by taking advantage of the floating-point category scores included in the moderation endpoint’s response.

These numeric scores allow you to set your own limits for the various categories and build a set of moderation thresholds that trigger lower than the endpoint’s defaults, which should therefore reduce the number of subsequent policy violation messages you receive.
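For illustration, here is a minimal sketch of that approach in Python, assuming the official `openai` SDK. The category names below are real fields on the moderation response, but the threshold values are placeholders you would tune against your own traffic:

```python
from openai import OpenAI

client = OpenAI()

# Stricter per-category limits than the endpoint's own binary
# "flagged" verdict. These values are illustrative placeholders;
# tune them against your own data.
CUSTOM_THRESHOLDS = {
    "hate": 0.10,
    "violence": 0.10,
    "sexual": 0.20,
    "self_harm": 0.10,
}

def passes_moderation(text: str) -> bool:
    """Return False if the text trips the endpoint's own flag or
    exceeds any of our stricter per-category score limits."""
    result = client.moderations.create(input=text).results[0]
    if result.flagged:  # the endpoint's own verdict is a hard floor
        return False
    scores = result.category_scores
    return all(
        getattr(scores, category) <= limit
        for category, limit in CUSTOM_THRESHOLDS.items()
    )

# Run the same check in both directions: on the user's input before
# the completion call, and on the model's output before it is returned.
```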

So long as you are making a good-faith attempt at moderation using the endpoints provided, you are complying with the contractual requirements of your agreement. However, if you find that violations are still being generated, I’d look into the method above to help you stay within the limits.

Thank you for your helpful reply! I don’t know why I didn’t consider using stricter thresholds from the moderation endpoint’s outputs. That will definitely be the direction I take in the future.
