Fine-tuning GPT-3.5 on hateful content

I am trying to fine-tune GPT-3.5 to produce ‘counterspeech’ in response to hate speech inputs. Counterspeech is any response that seeks to undermine the hateful content. However, when I attempt to fine-tune, the UI tells me:

The job failed due to an invalid training file. This training file was blocked because too many examples were flagged by our moderation API for containing content that violates OpenAI’s usage policies in the following categories: hate. Use the free OpenAI Moderation API to identify these examples and remove them from your training data.

Presumably, if I have to remove the hateful examples from my training data, I cannot go ahead with the project at all? Every example necessarily contains hate: each one pairs a hate speech prompt with the counterspeech response I am trying to fine-tune on.
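For reference, this is roughly how I understand the suggested check would work: a sketch (not working project code) that scans a chat-format JSONL training file with the free Moderation endpoint to see exactly which examples are flagged. It assumes the `openai` Python SDK v1+, an `OPENAI_API_KEY` in the environment, and a hypothetical `train.jsonl` filename.

```python
import json


def example_texts(jsonl_line: str) -> str:
    """Concatenate the message contents of one chat-format training example."""
    record = json.loads(jsonl_line)
    return "\n".join(m["content"] for m in record["messages"])


def flagged_indices(flags: list) -> list:
    """Indices of the examples that the moderation endpoint flagged."""
    return [i for i, flagged in enumerate(flags) if flagged]


if __name__ == "__main__":
    # Network calls kept out of import time so the helpers above stay testable.
    from openai import OpenAI

    client = OpenAI()
    flags = []
    with open("train.jsonl") as f:  # hypothetical filename
        for line in f:
            resp = client.moderations.create(input=example_texts(line))
            flags.append(resp.results[0].flagged)
    print("flagged examples:", flagged_indices(flags))
```

In my case I would expect essentially every index to come back flagged, which is the problem.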

Any help would be greatly appreciated, thank you!

Hi and welcome to the Dev Community!

This will not be possible given the way most large API-accessible models are currently moderated.
Your best bet would be to find an uncensored open-source model on Hugging Face and fine-tune it locally or on cloud compute.
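If you go the open-model route, it may still be worth keeping the pairs in the same chat-style JSONL format that OpenAI's fine-tuning endpoint expects, so the dataset stays reusable. A minimal sketch of building such a file (the pair list, system prompt, and output path are placeholders, not real data):

```python
import json


def to_chat_example(prompt: str, response: str) -> dict:
    """One fine-tuning example in the chat messages format."""
    return {
        "messages": [
            # Hypothetical system prompt; adjust to your task framing.
            {"role": "system", "content": "Respond to hateful messages with counterspeech."},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]
    }


def write_jsonl(pairs, path):
    """Write (hate_speech, counterspeech) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, response in pairs:
            f.write(json.dumps(to_chat_example(prompt, response)) + "\n")


if __name__ == "__main__":
    pairs = [("<hate speech example>", "<counterspeech example>")]  # placeholder data
    write_jsonl(pairs, "counterspeech_train.jsonl")
```

Most open-model fine-tuning toolchains can consume this messages format directly or with a small conversion step.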
