Bug: Moderation API returns that really bad input is OK

I just tested the Moderation API with some really bad input, and the result said the input was fine.

Is this a bug or have I misunderstood something here?


The results returned by the moderation endpoint are designed so that you can set your own limits. The true/false flags that get returned are for clear Terms of Service-breaking text that should always be rejected. The floating-point values are a guide to severity: you will notice most of them are very small numbers, around 0.0000000001 or thereabouts, while anything approaching 1 is high up in that category. It is also worth bearing in mind that authors and storytellers of all kinds use the AI for their work, so it would not be a good idea to censor topics too heavily.

As part of the guidelines in the documentation, it suggests that users experiment and choose their own threshold values for the various classifications to suit their use case: OpenAI Platform
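To illustrate the suggestion above, here is a minimal sketch of applying your own per-category thresholds to a moderation result. The response shape mirrors the documented moderation output (`flagged`, `categories`, `category_scores`); the threshold values and the sample scores are made-up examples for illustration, not recommendations.

```python
# Custom per-category limits (example values only, tune for your own use case)
CUSTOM_THRESHOLDS = {
    "hate": 0.4,
    "violence": 0.6,
    "self-harm": 0.2,
}

def my_flags(result: dict) -> dict:
    """Return which categories exceed our own limits, regardless of
    whether the API's overall boolean flag fired."""
    scores = result["category_scores"]
    return {
        category: scores.get(category, 0.0) >= limit
        for category, limit in CUSTOM_THRESHOLDS.items()
    }

# Example response: mostly the tiny scores typical of benign text,
# plus one elevated score that our custom limit would catch.
sample = {
    "flagged": False,
    "categories": {"hate": False, "violence": False, "self-harm": False},
    "category_scores": {
        "hate": 1e-10,
        "violence": 0.72,  # exceeds our custom limit of 0.6
        "self-harm": 3e-9,
    },
}

print(my_flags(sample))
```

Here `flagged` stays false because the API only sets it for clear-cut cases, but the custom check still trips on the elevated `violence` score.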


Similar case with me. Sometimes I said really bad things and some categories came back true, but the overall flag stayed false. I then checked the individual categories instead, and sometimes a response still gets through. Also, moderation in other languages is still not good: the same text that gets flagged in English gets through in other languages without any flags. And the category scores still feel quite odd.
