I am trying to fine-tune a GPT model (via the OpenAI fine-tuning API) to produce 'counterspeech' in response to hate speech inputs. Counterspeech is any response that seeks to undermine the hateful content. However, when I start the fine-tuning job, the UI reports:
The job failed due to an invalid training file. This training file was blocked because too many examples were flagged by our moderation API for containing content that violates OpenAI’s usage policies in the following categories: hate. Use the free OpenAI Moderation API to identify these examples and remove them from your training data.
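For reference, my understanding is that the check the message suggests would look roughly like this (a minimal sketch only; the file name is a placeholder and I'm assuming the standard chat-format JSONL used for fine-tuning):

```python
# Sketch: run each training example through the Moderation API and report
# which ones are flagged under the "hate" category.
# Assumptions: chat-format JSONL training data; "counterspeech_train.jsonl"
# is a placeholder file name.
import json
from openai import OpenAI

client = OpenAI()

with open("counterspeech_train.jsonl") as f:
    for i, line in enumerate(f):
        example = json.loads(line)
        # Concatenate all message contents in the example into one string
        text = "\n".join(m["content"] for m in example["messages"])
        result = client.moderations.create(input=text).results[0]
        if result.flagged and result.categories.hate:
            print(f"Example {i} flagged for hate")
```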
Presumably, if I have to remove every flagged example from my training data, I won't be able to go ahead with the project at all? By design, every example contains hateful content: each one pairs a hate speech prompt with the counterspeech response I am trying to fine-tune on.
Any help would be greatly appreciated, thank you!