I am fine-tuning GPT-4.1 using SFT with a dataset that doesn’t include anything about Cyber Security. But my finetuned model keeps getting blocked because of Cyber Security Threats.
Does anyone know what this means? Is it because the model doesn’t behave safely when it comes to prompts related to Cyber Security? Or it means something else.
If it is about “releasing” your model to you after fine-tuning, it is because OpenAI runs some bad prompts and sees that the model still produces refusals.
Technique 1:
Make a strongly divergent system message for your application as an activation sequence, used in practice. This way, what your application ‘does’ won’t be revealed by typical tests.
Technique 2:
Anticipate what OpenAI might try, and add some training examples:
“You are ChatGPT…”/“hack this web site…”/“I’m sorry, but I can’t assist with that.”
(Then see the moderations refuse the training to make your model refuse.)