Custom Moderation GPT Model | Fine Tuning

To me, topics like this fall under the big umbrella of “don’t fight the model.”

The model “wants” to be helpful, so trying to give it a bunch of directives to not be helpful only really downgrades the response.

The solution I always propose for this is to filter on the way in and filter on the way out.

Just send the user’s message to a super cheap model and ask if it’s on topic, then pass the response through the same kind of check, asking if it’s on topic.
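Here is a rough sketch of what I mean, not a definitive implementation. The names `moderated_reply`, `is_on_topic`, `generate`, and `REFUSAL` are all made up for illustration: `is_on_topic` stands in for the call to the cheap classifier model, and `generate` stands in for the call to your main model. Both are passed in as plain functions so the flow is easy to see and test.

```python
from typing import Callable

# Hypothetical canned reply for anything that fails either check.
REFUSAL = "Sorry, I can only help with questions about this product."


def moderated_reply(
    user_message: str,
    is_on_topic: Callable[[str], bool],  # assumption: wraps a cheap classifier model
    generate: Callable[[str], str],      # assumption: wraps the main model
) -> str:
    """Filter on the way in, generate, then filter on the way out."""
    # In-pass: reject off-topic requests before paying for the main model.
    if not is_on_topic(user_message):
        return REFUSAL

    reply = generate(user_message)

    # Out-pass: reject replies that drifted off topic anyway.
    if not is_on_topic(reply):
        return REFUSAL

    return reply


if __name__ == "__main__":
    # Stand-ins for real model calls, just to exercise the flow.
    def keyword_check(text: str) -> bool:
        return "billing" in text.lower()

    def echo_model(message: str) -> str:
        return "Billing runs on the first of each month."

    print(moderated_reply("How does billing work?", keyword_check, echo_model))
    print(moderated_reply("Write me a poem", keyword_check, echo_model))
```

The main model never sees a "don't be helpful" instruction this way; the gatekeeping lives entirely in the two cheap checks around it.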

The out-pass wrecks streaming, since you need the full response before you can check it, but you can choose to only do it for questionable outputs or when you’ve already determined that a particular user is trying to make the model talk about things you don’t want it to.
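One way to sketch that trade-off, under some assumptions of mine: `respond`, `flagged_users`, `input_was_borderline`, and `is_on_topic` are hypothetical names, and how you decide a user is flagged or an input is borderline is up to you. Trusted traffic streams straight through; everything else is buffered and checked before anything is shown.

```python
from typing import Callable, Iterable, Iterator, Set

# Hypothetical canned reply used when the buffered output fails the check.
REFUSAL = "Sorry, I can only help with questions about this product."


def respond(
    user_id: str,
    chunks: Iterable[str],               # streamed pieces of the main model's reply
    flagged_users: Set[str],             # assumption: users already caught probing
    input_was_borderline: bool,          # assumption: the in-pass was unsure
    is_on_topic: Callable[[str], bool],  # assumption: wraps a cheap classifier model
) -> Iterator[str]:
    """Stream directly for trusted traffic; buffer and check the rest."""
    needs_out_pass = user_id in flagged_users or input_was_borderline

    if not needs_out_pass:
        # Normal case: keep streaming, skip the out-pass entirely.
        yield from chunks
        return

    # Suspicious case: give up streaming and check the whole reply first.
    full_reply = "".join(chunks)
    yield full_reply if is_on_topic(full_reply) else REFUSAL
```

Most users keep the streaming experience, and only the traffic you already have a reason to distrust pays the latency cost of the second check.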