API endpoint to replicate examples on the "GPT-4 for content moderation" blogpost

Really interesting use case demonstrated in this blog post: using GPT-4 for content moderation. I was wondering how I can replicate something similar.
The Moderation endpoint (OpenAI Platform) seems to return scores for a pre-defined list of categories. But, similar to how it is demonstrated in the example blog post, how can I provide my own content policy?
Or do I just provide the content policy as part of my prompt and ask the model to classify which policy it violated (without using the moderation endpoint)?
Any pointers would be appreciated.

Hi and welcome to the developer forum!

The system described does not use the moderation endpoint directly. It is a combination of a smaller trained model (this bit I am still seeking clarity on) and GPT-4 acting as the trainer, using a set of example scenarios and a pre-classified set of desired responses. The smaller model can then be fine-tuned to respond in the desired fashion.
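If you just want the prompt-based version in the meantime, a minimal sketch could look like the following. To be clear, the policy text, category labels, and the `build_moderation_messages` helper are my own made-up examples, not anything published with the blog post:

```python
# Sketch of classifying content against a caller-supplied policy
# via a chat model. Everything named here is hypothetical.

EXAMPLE_POLICY = """\
K1: content that gives instructions for acquiring or building weapons.
K2: content that praises or encourages violence.
K0: none of the above.
"""

def build_moderation_messages(policy: str, content: str) -> list[dict]:
    """Build a chat-completion message list asking the model to
    classify `content` against the supplied `policy`."""
    system = (
        "You are a content moderator. Classify the user's text "
        "against the following policy and answer with exactly one "
        "category label.\n\nPolicy:\n" + policy
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": content},
    ]

messages = build_moderation_messages(EXAMPLE_POLICY, "some user text")

# With the openai Python SDK you would then send these messages, e.g.:
#   from openai import OpenAI
#   client = OpenAI()
#   reply = client.chat.completions.create(model="gpt-4", messages=messages)
#   label = reply.choices[0].message.content.strip()
```

The label comes back as free text, so you would also want to validate it against your category list before trusting it.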

I’ll go through it in more detail and try to give you a round-up of methods.


The article doesn’t provide a solution, nor even a paper. It is just an exploration: they want you to burn through GPT-4 and fine-tune a base model on the results, so you pay for your own experiment in making a moderator.

From the chart, it seems some work was done, but no end product for you, only for them.

Every example used to produce positives or negatives requires human moderation results, plus human review of GPT-4’s quality against the human moderator, in order to then teach GPT-4 with better human-written rules and reduce moderation errors. And of course you can’t train GPT-4 itself beyond a prompt, so your GPT-4 with better system rules is not the product.
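To make that review step concrete, here is a small sketch of comparing model labels against human labels to see which categories drive disagreement (the labels and data below are invented for illustration; the blog’s actual loop refines the policy wording wherever GPT-4 and the human reviewers diverge):

```python
from collections import Counter

# Hypothetical labelled rows: (text, human_label, model_label).
examples = [
    ("text a", "K1", "K1"),
    ("text b", "K0", "K2"),
    ("text c", "K2", "K2"),
    ("text d", "K0", "K2"),
]

def disagreement_by_label(rows):
    """Count human/model disagreements, keyed by the model's label.

    Categories that accumulate disagreements are candidates for
    clarifying the policy text before re-prompting the model.
    """
    counts = Counter()
    for _, human, model in rows:
        if human != model:
            counts[model] += 1
    return counts

print(disagreement_by_label(examples))  # Counter({'K2': 2})
```

Here the model over-applies `K2`, which in the blog’s workflow would prompt a human to tighten that clause of the policy and re-run the labelling pass.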

Why even involve the language AI when you’ve already created a human set of pass/no-pass to train a base model on?