Before fine-tuning, I filtered every prompt in my dataset through OpenAI's moderation API, and I now do the same for all user-generated input in my prompts. Even so, my fine-tuned model still occasionally generates responses that violate the usage policy (discriminatory/violent language). I also run these outputs through the moderation endpoint, so nothing abusive should reach the user, but I'm still concerned that these abusive generations are putting my account at risk.
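For reference, my filtering step looks roughly like the sketch below. It separates the filtering logic from the moderation check so the check is swappable; `openai_flagged` is my assumption of how the call looks with the current official Python SDK (model name and env-var API key are assumptions on my end):

```python
# Sketch: filter a prompt dataset through a moderation check.
# The check function is injected so it can be the real API call or a stub.
from typing import Callable, Iterable, List


def filter_dataset(prompts: Iterable[str],
                   is_flagged: Callable[[str], bool]) -> List[str]:
    """Keep only prompts that the moderation check does NOT flag."""
    return [p for p in prompts if not is_flagged(p)]


def openai_flagged(text: str) -> bool:
    """Moderation check via the OpenAI SDK (assumes OPENAI_API_KEY is set).

    Model name "omni-moderation-latest" is what I believe the current
    default is -- treat it as an assumption.
    """
    from openai import OpenAI  # imported lazily so stubs need no SDK
    client = OpenAI()
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    return result.results[0].flagged


# Example with a stub check instead of the live API:
kept = filter_dataset(["hello", "bad prompt"], lambda p: "bad" in p)
print(kept)  # ["hello"]
```

One thing I've been considering: the moderation response also includes per-category scores, so instead of relying only on the boolean `flagged`, I could apply stricter custom thresholds on `category_scores` to catch borderline material the default decision misses.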
My suspicion is that a non-trivial portion of the filtered dataset contained abusive material that the moderation endpoint missed, on top of a non-trivial amount of post-launch user input slipping through. These two factors could compound to produce the inappropriate generations I'm seeing.
What I want to know is: short of hand-filtering all of the training data, re-training, and then hand-moderating all incoming user input, what can I do to reduce or eliminate these abusive generations? And if there's nothing I can do, will this affect my OpenAI account standing?
I should add that the rate of abusive output is still fairly low (around 1-2% of generations), but it adds up. A big thank you to anyone who can offer advice!