Challenges in AI Moderation

Hello everyone.

I have another question regarding moderation. When I have a conversation in a chat program, a user can, without breaking any rules, cleverly prompt the model to provide responses that would be flagged by the moderation endpoint. This sometimes happens with ChatGPT where the model violates OpenAI’s guidelines.

But how does this work in a chat developed with the API? Of course, I can check the model’s responses. However, once they are flagged, the violation has already occurred. One can then ask the user not to continue violating the rules. But where do we draw the line? Sometimes things are flagged that, at least to my German understanding, do not violate any rules.

This might sound like nitpicking. However, when a company decides to implement AI-driven workflows, they really need very clear rules. Hoping that the account won’t be banned or, in case of a ban, having to wait for weeks for a response from support is not a solution.


The moderation policy guidelines and best practices say that you should pass the model's output to the moderation endpoint and then report any flagged content to OpenAI. This demonstrates a good faith and best practice application of the guidelines. So long as you pass those violations to OpenAI on a regular basis, you should have no problems with your users doing this sort of thing. It may also be of value to pass the name parameter to the model so that you can track which of your customers sent the message; this could be some hashed version of the user's account number.
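To make the "hashed version of the user's account number" idea concrete, here is a minimal sketch. It assumes a keyed hash (HMAC-SHA256) rather than a bare hash, which is my own choice and not something the guidelines prescribe; the function name, the secret, and the sample account number are all hypothetical.

```python
import hashlib
import hmac


def pseudonymous_user_id(account_number: str, secret: bytes) -> str:
    """Return a stable, non-reversible identifier for one customer account.

    A keyed hash is used so that short, guessable account numbers cannot
    be recovered by simply hashing every candidate value.
    """
    digest = hmac.new(secret, account_number.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Hypothetical values, for illustration only. The secret should live on
# the server, never in the client application.
SECRET = b"server-side-secret"
user_id = pseudonymous_user_id("customer-1042", SECRET)
```

The same account always maps to the same identifier, so flagged requests can be traced back to a customer internally without the account number itself leaving your system.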


Could you perhaps elaborate further on this statement:

and then report any flagged content to OpenAI; this demonstrates a good faith and best practice application of the guidelines

Does OpenAI provide an API that allows me to report this automatically using the unique id of the moderation request? Or do you suggest manual reporting via email?

I’ve encountered a similar problem where alpha users of my application managed to pass the “input” moderation, but the “output” was flagged. I did not return the flagged output to them, but as far as I understand, this still violated the usage policies. Because of this ambiguity, I have shut down my application until I arrive at an actionable solution that doesn’t get my account in further trouble.
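The flow described here (check the input, check the output, withhold anything flagged, and keep the moderation id for later reporting) could be sketched roughly as follows. This is only an illustration: `moderate` and `generate` are hypothetical wrappers around the moderation endpoint and the model call, and the stand-in functions at the bottom exist solely so the sketch runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ModerationResult:
    flagged: bool
    request_id: str  # id returned with the moderation response


@dataclass
class Infraction:
    user_id: str
    stage: str  # "input" or "output"
    moderation_id: str


@dataclass
class ModeratedChat:
    moderate: Callable[[str], ModerationResult]  # hypothetical endpoint wrapper
    generate: Callable[[str], str]               # hypothetical model call
    infractions: List[Infraction] = field(default_factory=list)

    def reply(self, user_id: str, prompt: str) -> Optional[str]:
        """Return the model's answer, or None if either side was flagged."""
        check = self.moderate(prompt)
        if check.flagged:
            # Flagged input never reaches the model.
            self.infractions.append(Infraction(user_id, "input", check.request_id))
            return None
        answer = self.generate(prompt)
        check = self.moderate(answer)
        if check.flagged:
            # Flagged output is recorded but never shown to the user.
            self.infractions.append(Infraction(user_id, "output", check.request_id))
            return None
        return answer


# Stand-ins for illustration only.
def fake_moderate(text: str) -> ModerationResult:
    return ModerationResult(flagged="forbidden" in text, request_id=f"modr-{len(text)}")


def fake_generate(prompt: str) -> str:
    return "something forbidden" if "trick" in prompt else "a harmless answer"


chat = ModeratedChat(moderate=fake_moderate, generate=fake_generate)
ok = chat.reply("u1", "hello")
blocked = chat.reply("u1", "a clever trick")
```

Keeping the moderation id with each record is what makes later reporting possible even when the full conversation log is only retained briefly.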


You could simply auto-email them when they are detected; that shows a good-faith attempt to inform them.

No need to stop your application if you are capturing bad users as you have done.

It would be possible to send an email. However, I’m not sure how best to do this from a program that runs on an end user’s machine. I’ve only ever developed applications and processes within my own company, where there are centralized services for automated email sending.

Edit: Oops. Duplicate info. :slight_smile:

Hmmm, but doing that automatically seems dangerous. If for some reason a lot of my users end up bypassing the “input” moderation but get flagged by the “output” moderation, this would result in X amount of mails. I could pool those infractions and send them in bulk, but then the question is: how long should I wait before sending them? Maybe sending one mail daily with all infractions?
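One way the "one mail daily with all infractions" idea might look is sketched below: infractions are grouped by calendar day and each day yields a single report body. The record fields, the report wording, and the sample ids are assumptions of mine, not a format OpenAI has asked for.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, datetime
from typing import Dict, List


@dataclass(frozen=True)
class Infraction:
    user_id: str          # pseudonymous id, not the raw account number
    moderation_id: str    # id of the moderation request that flagged it
    detected_at: datetime


def daily_digests(infractions: List[Infraction]) -> Dict[date, str]:
    """Group infractions by day and build one report body per day."""
    by_day: Dict[date, List[Infraction]] = defaultdict(list)
    for item in infractions:
        by_day[item.detected_at.date()].append(item)

    digests: Dict[date, str] = {}
    for day, items in sorted(by_day.items()):
        lines = [f"{len(items)} flagged output(s) on {day.isoformat()}:"]
        lines += [f"- user={i.user_id} moderation_id={i.moderation_id}" for i in items]
        digests[day] = "\n".join(lines)
    return digests


# Hypothetical sample records, for illustration only.
sample = [
    Infraction("u1", "modr-a", datetime(2024, 1, 1, 9, 0)),
    Infraction("u2", "modr-b", datetime(2024, 1, 1, 17, 30)),
    Infraction("u1", "modr-c", datetime(2024, 1, 2, 8, 15)),
]
digests = daily_digests(sample)
```

With this shape, a spike in flagged outputs produces a longer mail rather than more mails, which addresses the flooding concern.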

Should I also report “input” infractions? I guess not because I don’t send those inputs to OpenAI if they don’t pass the moderation?

And how should I handle the infractions that already occurred? I don’t have the details anymore (e.g. the unique id of the moderation request) because of a short log retention. That is also the reason why I shut my application down. I need to refine this moderation stuff and also inform my users that I will log more and longer in the future.

Sure, I think batching them would be a nice courtesy.

@frank_behr you should never have email sending, API keys, or anything of that nature in the client application. Do that from your server. I assume you have an application server somewhere handling the API calls and acting as the secure gateway to the API.
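As a rough sketch of what "do that from your server" could mean for the report mail: the server reads its settings from its own environment and composes the message there, so nothing sensitive ships in the client. The environment variable names and the `example.invalid` addresses are placeholders I made up; the real recipient address is not given in this thread.

```python
import os
from email.message import EmailMessage
from typing import Mapping


def build_report_email(body: str, env: Mapping[str, str] = os.environ) -> EmailMessage:
    """Compose the report on the server, using server-side configuration."""
    msg = EmailMessage()
    msg["From"] = env["REPORT_SENDER"]        # hypothetical variable name
    msg["To"] = env["REPORT_RECIPIENT"]       # hypothetical variable name
    msg["Subject"] = "Flagged moderation results"
    msg.set_content(body)
    return msg


# Placeholder configuration, for illustration only.
fake_env = {
    "REPORT_SENDER": "moderation@example.invalid",
    "REPORT_RECIPIENT": "reports@example.invalid",
}
msg = build_report_email("- user=u1 moderation_id=modr-a", env=fake_env)

# Actually sending would happen on the server as well, for example with
# smtplib.SMTP(host).send_message(msg); it is left out so the sketch
# runs without a mail server.
```

The client application then only talks to your server, which decides what gets forwarded to the API and what gets reported.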
