However, that documentation is wrong for the RESTful API itself.
output → moderation → output is where you’ll find the object
Or “moderation” in event "type": "response.completed"
See that you ran flagged input.
See that the model was flagged for non-refusal.
Concern
This only provides inclusion of a score in a response.
It doesn’t optionally prevent the input from being run.
It only protects you from OpenAI generations you might not want to show. It doesn’t protect you from OpenAI.
Your API organization is still in jeopardy if you use unclassified, unfiltered user content - and in jeopardy anyway from lots of things that moderation does not classify and that stay undocumented until you are banned and your credits taken: distillation, cyber research, biological, disinformation campaigns, etc.
Perhaps a new API method, to instill false confidence and encourage you to get banned?
It’s been in production for months. I don’t see the problem. The intention of the prompts is very clear indeed. It’s used to detect toxic end-user uploads and take them down. Necessarily they have to be stored in the back end for at least a few seconds, either way.
I don’t see why sending it to another endpoint makes things any different.
Thanks for raising this and for sharing the detailed context.
The moderation scores returned with Responses and Chat Completions API calls are intended to help developers enforce their own application policies and decide whether generated content should be shown to users, especially when the output is flagged or exceeds a chosen threshold for the supported moderation categories.
The model still generates normally. Review the moderation results before you show the output to a user or take downstream actions.
In other words, moderation results included with a generation response do not prevent the model from processing the input. They are designed to provide visibility into the input and/or output so your application can take appropriate action before displaying content or triggering downstream workflows.
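As a rough sketch of that review step, the check below gates display on the flag and on a per-category threshold. The dictionary shape (`flagged`, `category_scores`) mirrors the standalone moderation result and is an assumption here, since the exact field layout inside a generation response is not confirmed in this thread. The name `should_display` and the 0.5 default threshold are hypothetical.

```python
from typing import Mapping, Optional


def should_display(moderation: Optional[Mapping], threshold: float = 0.5) -> bool:
    """Return True only if the moderation result looks safe to show.

    Assumed shape: {"flagged": bool, "category_scores": {name: float}}.
    """
    # Fail closed: a missing or empty result is treated as not displayable.
    if not moderation:
        return False
    if moderation.get("flagged", False):
        return False
    scores = moderation.get("category_scores", {})
    # Block when any category score reaches the application's own threshold.
    return all(score < threshold for score in scores.values())


if __name__ == "__main__":
    sample = {"flagged": False, "category_scores": {"harassment": 0.02, "violence": 0.71}}
    print(should_display(sample))       # one score is above the default threshold
    print(should_display(sample, 0.9))  # a looser threshold lets it through
```

Failing closed on a missing result is a design choice, not something the API requires; an application that prefers availability over caution could invert it.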
If your goal is to screen user input before sending it for generation, I’d recommend using the standalone moderation endpoint directly.
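A minimal sketch of that pre-screening order follows, assuming the application supplies two callables. `screen_then_generate`, `is_flagged`, and `generate` are hypothetical names; only the moderation call mentioned in the docstring comes from the official Python SDK.

```python
from typing import Callable, Optional


def screen_then_generate(
    user_text: str,
    is_flagged: Callable[[str], bool],
    generate: Callable[[str], str],
) -> Optional[str]:
    """Run generation only when the moderation check passes.

    `is_flagged` and `generate` are hypothetical callables supplied by the
    application. With the official Python SDK, `is_flagged` could wrap
    client.moderations.create(model="omni-moderation-latest", input=text)
    and read results[0].flagged.
    """
    if is_flagged(user_text):
        # Flagged input never reaches the generation endpoint.
        return None
    return generate(user_text)


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without network access.
    blocked = screen_then_generate("bad", lambda t: t == "bad", lambda t: "reply to " + t)
    allowed = screen_then_generate("hello", lambda t: t == "bad", lambda t: "reply to " + t)
    print(blocked, allowed)
```

Passing the two steps in as callables keeps the ordering logic testable without an API key.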
I also appreciate the note about the response object documentation and the concern that returning moderation scores alongside generation responses can feel redundant or potentially confusing. I’ll pass this feedback along to the team and follow up once there are updates or clarifications to share.
Personally I think it’s a great feature. But then I also think that this will not fit all the workflows we might imagine.
Correct me if I’m wrong, but what may lead to a ban is what the model generates from your prompt, not what you put in. I have had a comment moderator plugin running for several years already, and it receives all the crap possible as input, but it is constructed so that the only output it generates is a digit, which cannot be harmful on its own. So I have never had issues on that side (several million runs already).
But if the model generates a shape which might contain text (potentially harmful, meaning anything that is not one of your predefined constants), then technically you might be exposed to a ban, because you cannot guarantee the output. If on top of that you do nothing about moderating the input, the chances of a ban increase drastically.
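To illustrate the "predefined constants" idea, here is a small sketch that accepts a model reply only when it is exactly one ASCII digit and discards anything else. The names `ALLOWED_VERDICTS` and `parse_verdict` are hypothetical, and this is not the plugin's actual code.

```python
from typing import Optional

# The only outputs the application is willing to accept (assumed: one ASCII digit).
ALLOWED_VERDICTS = frozenset("0123456789")


def parse_verdict(raw_output: str) -> Optional[int]:
    """Return the digit as an int, or None when the reply is anything else."""
    text = raw_output.strip()
    if len(text) == 1 and text in ALLOWED_VERDICTS:
        return int(text)
    # Free text, multiple characters, or an empty reply is never passed on.
    return None


if __name__ == "__main__":
    print(parse_verdict(" 3\n"))
    print(parse_verdict("3, because the comment is rude"))
```

Anything the parser rejects can be dropped or retried, so no free-form model text ever reaches a user.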
So moderation of the input is almost always a must-have in any application where users may submit arbitrary content. I would also recommend that you log the moderation runs, attached to your own request ID and the user who submitted them (no need to store the content of the message itself unless you have a legal reason), and that you use that input ID across your whole pipeline to trace both the original moderation result and the operations you performed on that content after moderation classified the text as safe to use.
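One possible shape for such a log record is sketched below. All field names and the function `moderation_log_record` are hypothetical. The SHA-256 digest is an optional extra assumed here, not something suggested above: it lets a record be matched to a message later without storing the message itself.

```python
import hashlib
import json
import time
import uuid


def moderation_log_record(user_id: str, text: str, flagged: bool, scores: dict) -> dict:
    """Build a log entry for one moderation run without keeping the raw text."""
    return {
        "request_id": str(uuid.uuid4()),  # your own ID, reused in later pipeline steps
        "user_id": user_id,
        "timestamp": time.time(),
        # Assumption: a digest instead of the content (optional, see lead-in).
        "content_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "flagged": flagged,
        "category_scores": scores,
    }


if __name__ == "__main__":
    record = moderation_log_record("user-42", "some comment", False, {"harassment": 0.01})
    print(json.dumps(record, indent=2))
```

Later pipeline steps can write their own entries carrying the same `request_id`, which gives the end-to-end trace described above.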
Why this log? If you get banned, at least you have some sort of proof that you did everything right and that the model generated something bad based on “safe” content submitted by the user (you still need to provide your instructions to clarify your part in it). That doesn’t mean you will get unbanned, but at least if this gets serious you have your backup.
Now, we often use chained operations where the output of the model is the input for the next operation, so the moderation score provided in the same call response saves you a separate API call to the moderation endpoint.
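A sketch of that chained pattern, assuming each step returns its output text together with the flag taken from the moderation data of the same response. `Step` and `run_chain` are hypothetical names, and how the flag is read from the response is left to the caller.

```python
from typing import Callable, List, Tuple

# Each step returns (output_text, output_flagged).
Step = Callable[[str], Tuple[str, bool]]


def run_chain(first_input: str, steps: List[Step]) -> Tuple[str, int]:
    """Feed each step's output into the next, stopping at the first flagged output.

    Returns the last unflagged text and the number of steps that completed cleanly.
    """
    current = first_input
    for done, step in enumerate(steps):
        output, flagged = step(current)
        if flagged:
            # Do not pass flagged output to the next operation.
            return current, done
        current = output
    return current, len(steps)


if __name__ == "__main__":
    demo_steps = [
        lambda t: (t.upper(), False),
        lambda t: (t + "!", True),   # pretend this output was flagged
        lambda t: (t + "?", False),
    ]
    print(run_chain("hi", demo_steps))
```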