Bug: Moderation API reports that really bad input is OK

I just tested the Moderation API with really bad input, and the results were that the input is fine.

Is this a bug or have I misunderstood something here?


The results returned by the moderation endpoint are designed so that you can set your own limits. The true/false flags that get returned are for text that is 100% in breach of the Terms of Service and should always be rejected. The floating-point values are a guide to severity: you will notice that most of them are very small numbers, 0.0000000001 or thereabouts, and anything approaching 1 is high up in that category. It is also worth bearing in mind that authors and storytellers of all kinds use the AI for their work, so it would not be a good idea to censor topics too heavily.

The guidelines in the documentation suggest that users experiment and create their own values for the various classifications to suit their needs (OpenAI Platform).
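To make the "set your own limits" idea concrete, here is a minimal sketch of checking category scores against per-category limits. The threshold numbers, the `DEFAULT_THRESHOLD` fallback, and the `categories_over_limit` helper are all assumptions for illustration, not recommended settings or part of any OpenAI library. The scores are a few values taken from an English-language example response in this thread.

```python
# Hypothetical per-category limits; tune these for your own application.
THRESHOLDS = {
    "violence": 0.5,
    "harassment": 0.3,
    "harassment/threatening": 0.3,
}
DEFAULT_THRESHOLD = 0.8  # assumed fallback for categories not listed above


def categories_over_limit(category_scores, thresholds=THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Return the sorted names of categories whose score meets or exceeds its limit."""
    return sorted(
        name
        for name, score in category_scores.items()
        if score >= thresholds.get(name, default)
    )


# A subset of category_scores from an English-language response in this thread.
english_scores = {
    "harassment": 0.4044300615787506,
    "harassment/threatening": 0.47488856315612793,
    "violence": 0.9963110089302063,
    "sexual": 4.538599168881774e-4,
}
print(categories_over_limit(english_scores))
# ['harassment', 'harassment/threatening', 'violence']
```

Lower limits catch more borderline text at the cost of more false positives, so it is worth testing against samples of your own traffic.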


Similar case with me. Sometimes I said a really bad thing and the categories were true but the flag stayed false. I then switched to checking the categories, and sometimes a response still gets through eventually. Also, moderation is still not good in other languages: the same text that gets flagged in English gets through in other languages without any flags. And the category scores still feel quite odd.
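One defensive way to handle that case is to not rely on the top-level flag alone. The sketch below assumes the result has already been turned into a plain dict with `flagged` and `categories` keys, and `should_block` is a hypothetical helper name; the example dict is made up to mirror the situation described, not a real API response.

```python
def should_block(result):
    """Block if the top-level flag is set OR any individual category is true."""
    categories = result.get("categories", {})
    return bool(result.get("flagged")) or any(categories.values())


# Hypothetical result shaped like the case described above:
# a category is true while the overall flag is false.
example = {"flagged": False, "categories": {"violence": True, "hate": False}}
print(should_block(example))  # True
```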


I’m relatively new to using the API (ChatGPT), but I’ve noticed an issue with less commonly used languages. The same text in English and Finnish yields different results with the moderation app. Also, we can’t utilize the category scores.

Tapan sinut kirveellä (Finnish for "I will kill you with an axe"):
ModerationResponse{id='modr-8783qTrGJAp2t3kNLAmtY8wB8uKsu', model='text-moderation-006', results=[Result{flagged=false, categories={sexual=false, hate=false, harassment=false, self-harm=false, sexual/minors=false, hate/threatening=false, violence/graphic=false, self-harm/intent=false, self-harm/instructions=false, harassment/threatening=false, violence=false}, category_scores={sexual=0.0010161410318687558, hate=2.1737563656643033E-4, harassment=9.192628785967827E-4, self-harm=1.8292998720426112E-4, sexual/minors=3.535413707140833E-4, hate/threatening=2.5697704404592514E-4, violence/graphic=2.761634641501587E-5, self-harm/intent=4.713524322141893E-5, self-harm/instructions=5.2454819524427876E-5, harassment/threatening=1.919300848385319E-4, violence=0.007186449598520994}}]}

I kill your with axe:
ModerationResponse{id='modr-8783s4bZwGLAJhJcq2KwIVKNqopMK', model='text-moderation-006', results=[Result{flagged=true, categories={sexual=false, hate=false, harassment=false, self-harm=false, sexual/minors=false, hate/threatening=false, violence/graphic=false, self-harm/intent=false, self-harm/instructions=false, harassment/threatening=true, violence=true}, category_scores={sexual=4.538599168881774E-4, hate=4.3008412467315793E-4, harassment=0.4044300615787506, self-harm=5.808872447232716E-5, sexual/minors=3.867456825901172E-7, hate/threatening=4.94822976179421E-4, violence/graphic=7.7569589484483E-4, self-harm/intent=2.4974851839942858E-5, self-harm/instructions=6.064902322577836E-7, harassment/threatening=0.47488856315612793, violence=0.9963110089302063}}]}

It would be challenging to implement this directly in, for example, a customer chat or a similar application. And if one were to first translate everything with ChatGPT via an API call, would that application violate any rules? And how else is it categorized, especially if the context is about literature?
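For what it's worth, the translate-first idea could be sketched roughly like this. It says nothing about whether the approach is allowed under the terms. The `translate` and `moderate` parameters are hypothetical callables you would supply yourself (for example, thin wrappers around real API calls), and the `fake_*` functions are stand-ins that only mimic the two results above so the sketch runs offline.

```python
def moderate_with_translation(text, translate, moderate):
    """Moderate both the original text and its English translation.

    The combined result is flagged if either version is flagged.
    """
    original = moderate(text)
    translated = moderate(translate(text))
    return {
        "flagged": bool(original["flagged"] or translated["flagged"]),
        "original": original,
        "translated": translated,
    }


# Stand-ins so the sketch runs offline; they only mimic the two results above.
def fake_translate(text):
    return {"Tapan sinut kirveellä": "I kill your with axe"}.get(text, text)


def fake_moderate(text):
    return {"flagged": text == "I kill your with axe"}


outcome = moderate_with_translation("Tapan sinut kirveellä", fake_translate, fake_moderate)
print(outcome["flagged"])  # True: the translation is flagged although the original is not
```

Checking both versions means a translation error cannot hide text that the endpoint would have caught in the original language.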

I have indeed read that the moderation section of the instructions warns that not everything works as intended for less commonly used languages.

I guess I’m still a bit out of the loop on these matters.

Hi and welcome to the Developer Forum!

You are correct that the moderation endpoint performs best with English. Unfortunately, this is going to be the case until the models are more mature, with a wider breadth of input from language-specific experts around the globe.

I wish I had a simple answer. In terms of complying with the Terms of Service: so long as you are making a good-faith attempt to process language through the moderation endpoint and a reasonable effort to ensure your users comply with your own Terms of Service, you have done all that is contractually required.

I believe that your response at least gives me the courage to try. I definitely need to look thoroughly into the Terms of Service as well. It's probably also wise to be patient until language experts make further progress, especially with smaller languages. Thank you! :)
