I am trying to evaluate potential strategies to “filter out” objectionable content sent by the user.
Here are my observations:
The new moderation API endpoint is too lenient. I think it does not have a concept of generically offensive content and uses a fairly narrow set of categories. For example, “why don’t you go and f*** yourself?” is flagged as false, with all scores quite low!
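For reference, this is roughly how I am querying the endpoint (a minimal sketch using the pre-1.0 `openai` Python library; the exact response handling may differ in newer SDK versions):

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

resp = openai.Moderation.create(input="why don't you go and f*** yourself?")
result = resp["results"][0]

print(result["flagged"])           # comes back False for plain profanity like this
print(result["category_scores"])   # per-category scores, all quite low in this case
```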
For a few marginal pieces of text, the moderation endpoint returns “false”, but multiple individual scores turn out to be pretty high, albeit just below the detection threshold. Again, content-filter-alpha has no problem classifying them as unsafe. (I really don’t want to give examples here, but here is a redacted version in the interest of science:
I think this is normal. Moderation is not “a nanny for adults”, so these kinds of obscene expressions are normal even on TV and in the movies these days. It’s not offensively sexual, violent, etc. It’s just normal profanity, which is typical in society today. You may find it offensive, but if OpenAI moderated every type of mild profanity, that would be a different story altogether.
Thank you for the response. Just to clarify, I am not struggling to parse the results of the API, nor am I looking for comments on specific pieces of offensive content.
If you use content-filter-alpha, you can get granular control over “mildly” offensive content (and otherwise), based on the log-probs.
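For anyone unfamiliar with it, this is roughly how the old docs suggested calling content-filter-alpha (reconstructed from memory, so treat the prompt format and the -0.355 threshold as approximate):

```python
import openai

def classify(content: str) -> str:
    """Return '0' (safe), '1' (sensitive) or '2' (unsafe), per the old content filter docs."""
    response = openai.Completion.create(
        engine="content-filter-alpha",
        prompt="<|endoftext|>" + content + "\n--\nLabel:",
        temperature=0,
        max_tokens=1,
        top_p=0,
        logprobs=10,
    )
    label = response["choices"][0]["text"]
    if label == "2":
        # Only trust a "2" when the model is confident enough about it;
        # the log-probs are what give you the granular control mentioned above.
        top_logprobs = response["choices"][0]["logprobs"]["top_logprobs"][0]
        if top_logprobs.get("2", -100.0) < -0.355:  # threshold as I recall it from the old docs
            label = "0"
    return label
```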
As far as being a “nanny for adults” goes, that depends on the posture of the application in which this is used. Not everything shown or heard in TV shows and movies is acceptable in all situations.
All I am trying to point out is that if an endpoint is designed to eventually replace an older one, it should be equally capable, if not more capable. I would concede that the new moderation endpoint is much easier to use, but that is hardly a consolation.
When building an app, you can also filter out these prompts before they are sent to the OpenAI API.
That’s the beauty of building an application. All of the application logic does not have to live in the core OpenAI GPT-3 model(s). It’s easy for an application developer to filter out words and phrases before sending prompts to the API for processing (completion).
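For example, a pre-filter can be as simple as a word list checked before the API call (a hypothetical sketch; the blocked terms and the rejection behaviour are entirely application-specific):

```python
import re
from typing import Optional

BLOCKED_TERMS = {"someword", "anotherword"}  # placeholder entries, not a real list

def contains_blocked_term(text: str) -> bool:
    # Crude tokenization; a real filter would also handle obfuscations like "f***".
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(token in BLOCKED_TERMS for token in tokens)

def prepare_prompt(user_text: str) -> Optional[str]:
    """Return the prompt unchanged if it passes the filter, else None to reject it
    before it is ever sent to the OpenAI API."""
    if contains_blocked_term(user_text):
        return None
    return user_text
```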
I notice that more than one developer here is looking for OpenAI GPT-3 to be “the entire app”; but in practical terms, the GPT-3 models and APIs are just one component in the overall architecture, based on the requirements of an individual application.
ChatGPT is mostly a beta demonstration, and developers are encouraged to use the API to customize and tweak the requirements for their own use case(s).
@ruby_coder, with all due respect, I am just reporting an apparent regression. This is not a problem that I am struggling with in my application, as I have other mitigations in place.
Touché.
I see how “evaluating potential strategies” can potentially be construed as “struggling to find a way to do something so that my application works”.
Yeah, when you pull out your light saber to strike at someone who reads your question as a call for help developing a strategy to filter out offensive content, I agree: “Touché.”
On the other hand, I now see that you have another goal in mind. Sorry to annoy you.
I did not intend to respond to your kind offer of help as anything but that. No light sabers were harmed in the exchange, I promise. And I am not annoyed in the slightest. I started out by thanking you for your help, and I end with the same.
Can this be solved by ignoring the “true” or “false” from each filter and just picking a lower threshold number for your application? Or does everything below .01 tend to be an unpredictable mix of acceptable and unacceptable? In a way, no filter is perfect, of course, and I find the filter way overbearing in some areas and far too lenient in others. What works well for me is to have my own embeddings and topics that I care about, so I can somewhat customize things. Ideally, it would be nice to create my own NSFW filters too, but that would involve feeding it material that would give me the boot, so I am not going to take that risk.
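In case it helps, the embeddings-based topic check I am describing looks roughly like this (a sketch only; the topic list, the ada-002 model choice, and the 0.8 cutoff are all illustrative):

```python
import numpy as np
import openai

# Illustrative topic descriptions; a real list would be tuned to the application.
TOPICS = ["graphic violence", "sexual content", "hate speech"]

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

TOPIC_VECTORS = [embed(topic) for topic in TOPICS]

def max_topic_similarity(text: str) -> float:
    # Cosine similarity of the text against each topic embedding; return the highest.
    v = embed(text)
    return max(
        float(np.dot(v, t) / (np.linalg.norm(v) * np.linalg.norm(t)))
        for t in TOPIC_VECTORS
    )

# 0.8 is an arbitrary example cutoff, not a recommendation.
if max_topic_similarity("some user text") > 0.8:
    print("block or route to human review")
```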
That’s an interesting thought. What I have done is to threshold on the sum of all the scores. But in the case of profanity, none of the scores is even slightly high, and hence it slips right through.
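Concretely, the sum-of-scores check is something like this (a sketch with the pre-1.0 Python library; the 0.5 threshold is an arbitrary example):

```python
import openai

def looks_unsafe(text: str, sum_threshold: float = 0.5) -> bool:
    """Flag text if the endpoint flags it, or if the summed category scores
    exceed an application-chosen threshold (0.5 here is purely illustrative)."""
    result = openai.Moderation.create(input=text)["results"][0]
    if result["flagged"]:
        return True
    return sum(result["category_scores"].values()) > sum_threshold
```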
If only OpenAI would add a profanity category to the moderation endpoint! That would help quite a bit and bring it on par with its own previous implementation.
Actually, you can just add a simple “profanity filter” before the prompt ever goes to the GPT “prompt-reply” process.
These GPT models are not rule-based filters; they are predictive language models. If you want a rule-based filter, it is very simple to write one and put it in front of the OpenAI GPT process.
We are software developers and engineers, and that is what we do.
But would that fix the regression? I have had NLP-based applications in production since before GPT-3 came along two-odd years ago. Just because I could do without GPT-3 doesn’t mean that I should!
And by the way, I am not using GPT-3 for a prompt-reply workflow anyway. Well, there is a prompt, but no natural-language replies! And there is a profanity filter in there (for the prompts). Software developers and engineers start by understanding the problem space first!
At this point, since this thread, which was started just to bring the regression to the notice of the core developers, has been hijacked, I guess this is a lost cause.
Thanks for the attempts to help, @ruby_coder; I appreciate the desire to help! See you around.
Well, as the saying goes, ‘one person’s bug is another person’s feature’.
You @vaibhav.garg (and perhaps unknown others) consider this a regression (a bug) and others, including me, consider it an improvement (a feature, or enhancement).
As mentioned, my belief is that rules and societal norms should not necessarily be hard-coded into these GPT models.
OBTW, earlier I tested (for you) the text-moderation-004 (latest) model.
In the screenshot from just now, testing the text-moderation-001 model:
There is another endpoint called “content-filter-alpha”. Its interface is entirely different, and it is not from the same family. It used to be covered in the docs but is no longer there (I think).