Moderation Endpoint vs Content Filter alpha

vaibhav.garg · January 27, 2023, 9:07am

I am trying to evaluate potential strategies to “filter out” objectionable content sent by the user.

Here are my observations:

The new moderation API endpoint is too lenient. I think it has no concept of generically offensive content and classifies against a fairly narrow set of categories. For example, “why don’t you go and f*** yourself?” is not flagged, and all the category scores are quite low!

Content-filter-alpha correctly returns 2 (unsafe text)

For a few marginal pieces of text, the moderation endpoint returns “false”, but several individual scores turn out to be pretty high, albeit just below the detection threshold. Again, content-filter-alpha has no problem classifying them as unsafe. (I really don’t want to give examples here, but here is a redacted version in the interest of science.)

[Screenshot: redacted example]

Is this expected behaviour? I consider this a regression!

ruby_coder · January 27, 2023, 9:25am

I think this is normal. Moderation is not “a nanny for adults”, and these kinds of obscene expressions are normal even on TV and in the movies these days. It’s not offensively sexual, violent, etc. It’s just ordinary profanity, which is typical in society today. You may find it offensive, but OpenAI moderating every type of mild profanity would be a different story altogether.

No comment on #2 because I cannot read the prompt.

HTH

vaibhav.garg · January 27, 2023, 9:32am

Thank you for the response. Just to clarify, I am not struggling to parse the results of the API, nor am I looking for comments on specific pieces of offensive content.

If you use content-filter-alpha, you can get granular control over what counts as “mildly” offensive or worse, based on the log-probs.
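As a rough illustration of that log-prob control, here is a minimal sketch of the post-processing step only, with no API call. It assumes the filter returns a label of "0" (safe), "1" (sensitive) or "2" (unsafe) together with a dictionary of top log-probs for that token. The function name `resolve_label` is hypothetical, and the default threshold of -0.355 is believed to match the deprecated documentation example; treat it as a tunable assumption rather than an official value.

```python
from typing import Dict

# Assumption: labels follow the content-filter-alpha convention,
# "0" = safe, "1" = sensitive, "2" = unsafe.
VALID_LABELS = ("0", "1", "2")


def resolve_label(output_label: str,
                  top_logprobs: Dict[str, float],
                  unsafe_threshold: float = -0.355) -> str:
    """Downgrade a low-confidence "2" to "0" or "1" using the log-probs."""
    if output_label == "2":
        logprob_2 = top_logprobs.get("2")
        # Only second-guess the "2" when the model was not confident in it.
        if logprob_2 is not None and logprob_2 < unsafe_threshold:
            logprob_0 = top_logprobs.get("0")
            logprob_1 = top_logprobs.get("1")
            if logprob_0 is not None and logprob_1 is not None:
                output_label = "0" if logprob_0 >= logprob_1 else "1"
            elif logprob_0 is not None:
                output_label = "0"
            elif logprob_1 is not None:
                output_label = "1"
    # Anything unexpected is treated as unsafe.
    if output_label not in VALID_LABELS:
        output_label = "2"
    return output_label


# A confident "2" stays "2"; a shaky "2" falls back to the next best label.
print(resolve_label("2", {"2": -0.1, "1": -3.0}))
print(resolve_label("2", {"2": -0.9, "1": -1.0, "0": -2.0}))
```

Raising or lowering `unsafe_threshold` is what gives the "mild versus not mild" dial: a stricter application keeps more of the "2" labels, a more permissive one downgrades more of them.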

As for being a “nanny for adults”, that depends on the posture of the application in which this is used. Not everything shown or heard in TV shows and movies is acceptable in all situations.

All I am trying to point out is that if an endpoint is designed to eventually replace an older one, it should be equally capable, if not more capable. I would concede that the new moderation endpoint is much easier to use, but that is hardly a consolation.

ruby_coder · January 27, 2023, 9:36am

When building an app, you can also filter out these prompts before they are sent to the OpenAI API.

That’s the beauty of building an application. Not all of the application logic has to live in the core OpenAI GPT-3 model(s). It’s easy for an application developer to filter out words and phrases before sending prompts to the API for processing (completion).

I notice that more than one developer here is looking for OpenAI GPT-3 to be “the entire app”; but in practical terms, the GPT-3 models and APIs are just one component in the overall architecture, based on the requirements of an individual application.

ChatGPT is mostly a beta demonstration, and developers are encouraged to use the API to customize and tweak the requirements for their own use case(s).

HTH

vaibhav.garg · January 27, 2023, 9:42am

@ruby_coder. With all due respect, I am just reporting an apparent regression. This is not a problem that I am struggling with in my application; as I have other mitigations in place.

Thanks again for trying to help.

ruby_coder · January 27, 2023, 9:44am

I am simply responding to the first sentence in your post

With all due respect @vaibhav.garg , this is what I have been responding to:

vaibhav.garg · January 27, 2023, 9:47am

Touché.
I see how “evaluating potential strategies” can potentially be construed as “struggling to find a way to do something so that my application works”.

My bad

ruby_coder · January 27, 2023, 9:57am

Yeah, when you pull out your light saber to strike out at someone who reads your question as a call to help develop a strategy to filter out offensive content, I agree “Touché”

On the other hand, I now see that you have another goal in mind. Sorry to annoy you.

vaibhav.garg · January 27, 2023, 10:05am

I did not intend to respond to your kind offer of help as anything but that. No light sabers were harmed in the exchange, I promise. And I am not annoyed in the slightest. I started out by thanking you for your help, and I end by the same.

I am just reporting an apparent regression!!

Alan · January 27, 2023, 2:01pm

Can this be solved by ignoring the “true” or “false” from each filter and just picking a lower threshold for your application? Or does everything below .01 tend to be an unpredictable mix of acceptable and unacceptable? No filter is perfect, of course, and I find the filter way overbearing in some areas and far too lenient in others. What works well for me is to have my own embeddings and topics that I care about, so I can somewhat customize things. Ideally, it would be nice to create my own NSFW filters too, but that would involve feeding it material that would get me the boot, so I am not going to take that risk.
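The "pick your own threshold" idea could look something like the sketch below. It assumes you already have the per-category scores from a moderation response as a plain dictionary; the threshold values, the category names in the example and the function name `categories_over_threshold` are all made up for illustration, not recommended settings.

```python
from typing import Dict, List

# Hypothetical per-category cut-offs chosen by the application,
# used instead of the endpoint's own true/false decision.
CUSTOM_THRESHOLDS: Dict[str, float] = {
    "hate": 0.05,
    "sexual": 0.10,
    "violence": 0.05,
}
DEFAULT_THRESHOLD = 0.20  # used for any category not listed above


def categories_over_threshold(category_scores: Dict[str, float],
                              thresholds: Dict[str, float] = CUSTOM_THRESHOLDS,
                              default: float = DEFAULT_THRESHOLD) -> List[str]:
    """Return the categories whose score meets the application's own cut-off."""
    return sorted(
        name
        for name, score in category_scores.items()
        if score >= thresholds.get(name, default)
    )


# Illustrative scores only, not real API output.
scores = {"hate": 0.06, "sexual": 0.01, "violence": 0.001, "self-harm": 0.0}
print(categories_over_threshold(scores))  # ['hate']
```

Whether very low cut-offs separate acceptable from unacceptable text reliably is exactly the open question above, so any values would need checking against your own samples.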

vaibhav.garg · January 30, 2023, 3:05am

That’s an interesting thought. What I have done is to threshold on the sum of all scores. But in the case of profanity, none of the scores is even slightly high, and hence it slips right through.
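For readers who want to picture the sum-of-scores approach, a minimal sketch follows. The threshold of 0.1 and the example scores are invented for illustration, and `exceeds_total` is a hypothetical helper name; the point is only to show why a borderline text gets caught while plain profanity does not.

```python
from typing import Dict


def exceeds_total(category_scores: Dict[str, float],
                  total_threshold: float = 0.1) -> bool:
    """Flag text when the sum of all category scores reaches the threshold."""
    return sum(category_scores.values()) >= total_threshold


# Illustrative scores only, not real API output.
borderline = {"hate": 0.04, "violence": 0.04, "sexual": 0.03}
profanity = {"hate": 0.001, "violence": 0.002, "sexual": 0.001}

print(exceeds_total(borderline))  # True: several moderately high scores add up
print(exceeds_total(profanity))   # False: every score is tiny, so the sum is too
```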

If only OpenAI would add a profanity category to the moderation endpoint! That would help quite a bit, and bring it on par with its own previous implementation.

ruby_coder · January 30, 2023, 3:11am

Actually, you can just add a simple “profanity filter” before the prompt ever goes to the GPT “prompt-reply” process.

These GPT models are not rule-based filters, they are predictive language models. If you want a rule-based filter, then it is very simple to write one and put that filter in front of the OpenAI GPT process.
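A rule-based filter of that kind might be sketched as below. The word list is a harmless placeholder, and the names `BLOCKED_WORDS`, `build_pattern` and `contains_blocked_word` are hypothetical; a real deployment would need its own list and policy.

```python
import re
from typing import Iterable

# Placeholder word list; substitute whatever your application must block.
BLOCKED_WORDS = ("darn", "heck")


def build_pattern(words: Iterable[str]) -> re.Pattern:
    """Compile one case-insensitive pattern matching any blocked whole word."""
    alternatives = "|".join(re.escape(word) for word in words)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


BLOCKED_PATTERN = build_pattern(BLOCKED_WORDS)


def contains_blocked_word(prompt: str) -> bool:
    """Return True if the prompt should be rejected before reaching the API."""
    return BLOCKED_PATTERN.search(prompt) is not None


print(contains_blocked_word("Well, DARN it"))       # True
print(contains_blocked_word("The darning needle"))  # False: whole words only
```

Whole-word matching keeps false positives down, but it will not catch obfuscated spellings such as asterisked letters, which is one reason a learned classifier is attractive for this job in the first place.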

We are software developers and engineers and that is what we do

HTH

vaibhav.garg · January 30, 2023, 3:42am

Yes, you can!

But would that fix the regression? I have had NLP based applications in production even before GPT3 came along 2 odd years ago. Just because I could do without GPT3, doesn’t mean that I should!

And by the way, I am not using GPT3 for a prompt-reply workflow anyway. Well, there is a prompt, but no natural language replies! And there is a profanity filter in there (for the prompts). Software developers and Engineers start with understanding the problem space first!

At this point, since a thread that was started just to bring the regression to the notice of the core developers has been hijacked, I guess this is a lost cause.

Thanks for the attempts to help, @ruby_coder , and I appreciate the desire to help! See you around. Thanks.

ruby_coder · January 30, 2023, 4:20am

Well, as the saying goes, ‘one person’s bug is another person’s feature’

You @vaibhav.garg (and perhaps unknown others) consider this a regression (a bug) and others, including me, consider it an improvement (a feature, or enhancement).

As mentioned, my belief is that rules and societal norms should not necessarily be hard-coded into these GPT models.

OBTW, earlier I tested (for you) the text-moderation-004 (latest) model.

In the screenshot from just now, testing the text-moderation-001 model:

vaibhav.garg · January 30, 2023, 4:31am

There is another endpoint called “content-filter-alpha”. The interface is entirely different and is not from the same family. It used to be documented in the docs but is no longer there (I think).

vaibhav.garg · January 30, 2023, 4:49am

It’s still there, although deprecated.
Models - OpenAI API

ruby_coder · January 30, 2023, 4:57am

Yeah, I don’t see any “content-filter-alpha” model in the model list and if I try to use it I get an error.

The API docs state only two models are available for moderation:

text-moderation-stable (text-moderation-001 currently)
text-moderation-latest (text-moderation-004 currently)

I have tested both of these models in this topic.

See also, this OpenAI paper on content moderation:

A Holistic Approach to Undesired Content Detection in the Real World, by Todor Markov, Chong Zhang, Sandhini Agarwal, Tyna Eloundou, Teddy Lee, Steven Adler, Angela Jiang, Lilian Weng (OpenAI)

