Moderation Flagging Workaround

SomebodySysop · September 29, 2023, 12:42am

So, my moderation flagging mechanism appears to be working. This prompt was rejected, logged and I was sent a notification.

But then, the user accidentally did this (he told me he wasn’t trying to, it was a typo):

The moderation model didn’t catch it.

anon10827405 · September 29, 2023, 12:47am

One thing I’ve noticed OpenAI do is periodically (maybe, or maybe just at the end) call the Moderation Endpoint on the response to catch these types of workarounds.
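A minimal sketch of that idea, assuming you already have some way to call the model and the moderation endpoint. The names `guarded_reply`, `generate`, `is_flagged` and `REFUSAL` are hypothetical; the two callables stand in for your own API wrappers, so nothing here depends on a particular SDK version.

```python
from typing import Callable

# Hypothetical canned message returned when either check trips.
REFUSAL = "Content blocked: Violates usage policies."


def guarded_reply(
    user_input: str,
    generate: Callable[[str], str],
    is_flagged: Callable[[str], bool],
) -> str:
    """Moderate the prompt, then moderate the answer before returning it."""
    # First pass: the check most people already have.
    if is_flagged(user_input):
        return REFUSAL

    answer = generate(user_input)

    # Second pass: a misspelled prompt may slip past the first check,
    # but the completion it produces gets its own look here.
    if is_flagged(answer):
        return REFUSAL
    return answer
```

The cost is one extra moderation call per reply, and the reply cannot be streamed to the user until the second check has passed.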

If you are okay with sacrificing latency you could also run it through an initial “spell-check” that doubles up the moderation, maybe?
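Roughly what I mean by doubling up, as a sketch only: moderate the raw text and a corrected copy of it, and flag if either trips. `flagged_any_variant` is a made-up name, and both `is_flagged` and the `normalizers` are placeholders for whatever moderation wrapper and spell-checker you actually use.

```python
from typing import Callable, Iterable


def flagged_any_variant(
    text: str,
    is_flagged: Callable[[str], bool],
    normalizers: Iterable[Callable[[str], str]],
) -> bool:
    """Return True if the raw text or any normalized copy of it is flagged."""
    variants = [text]
    for normalize in normalizers:
        fixed = normalize(text)
        # Skip duplicates so unchanged text does not cost a second call.
        if fixed not in variants:
            variants.append(fixed)
    # any() stops at the first flagged variant.
    return any(is_flagged(v) for v in variants)
```

Each distinct variant is another moderation request, which is where the latency goes.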

Also, nice.

_j · September 29, 2023, 12:49am

Unless the moderator is as smart as the language model itself, you’ll be able to outfox it.

It would have to infer the likely output through many logical steps that gpt-4 can do, and also know the context that was being built up to.

PaulBellow · September 29, 2023, 12:51am

Well, technically, the second one wasn’t flagged, as it shouldn’t have been, but are you sending the entire prompt to moderation or just the new user input? If the latter, it’s working as expected… but I’d agree it’s kind of a “hack” of sorts…
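If you do want to moderate more than the newest message, one option is to join the recent turns into a single string and send that. This is only a sketch: `moderation_text` and `last_n` are hypothetical, and the messages are assumed to be dicts with a `content` key, as in a chat-style message list.

```python
from typing import Dict, List


def moderation_text(messages: List[Dict[str, str]], last_n: int = 6) -> str:
    """Join the content of the most recent messages into one string."""
    if last_n < 1:
        raise ValueError("last_n must be at least 1")
    # Only the tail of the conversation, so the input stays a manageable size.
    recent = messages[-last_n:]
    return "\n".join(m["content"] for m in recent)
```

Checking both this combined text and the new input on its own lets you tell a message that is bad by itself from one that is only bad in context.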

supershaneski · September 29, 2023, 1:09am

I wonder what the scores are for each category. Did they all light up as true with such a query? I understand the sex-related categories, but the others?

SomebodySysop · September 29, 2023, 2:05am

I just realized I was not tracking scores. I just flagged the prompt if the moderation api flagged it. Thanks for the question. Modified the code to now return the categories and the category scores.
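Not my actual code, but a sketch of the change in Python. It assumes the moderation response has already been decoded from JSON into a dict with `id` and a `results` list whose first entry carries `flagged`, `categories` and `category_scores`; `summarize_moderation` is a made-up helper name.

```python
from typing import Any, Dict


def summarize_moderation(response: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the flag, the flagged categories and all scores out of a response."""
    result = response["results"][0]
    # Categories whose boolean flag is true.
    flagged_categories = sorted(
        name for name, hit in result["categories"].items() if hit
    )
    # Every score, highest first, so near misses are visible in the log too.
    scores = sorted(
        result["category_scores"].items(), key=lambda kv: kv[1], reverse=True
    )
    return {
        "id": response.get("id"),
        "flagged": result["flagged"],
        "flagged_categories": flagged_categories,
        "scores": scores,
    }
```

Logging the full score list even when nothing is flagged makes it easier to see afterwards how close a prompt came.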

Actually, the first time I’ve ever had a flag!

SomebodySysop · October 11, 2023, 9:46pm

Now that I’ve started tracking the scores…

Username:xxx
Email:xxxx
Flagged!
ID: modr-88awF9wKkOSPw7B3SZv2PvYx2GIqS
Content: I seem to recall something about her throwing her son’s foreskin at him and saying he was a bloody husband to her
Content blocked: Violates usage policies.
Flagged Categories: harassment
Category Scores: sexual: 0.0052169840782881, hate: 0.0030276507604867, harassment: 0.45535776019096, self-harm: 3.6646597436629E-5, sexual/minors: 0.0060147433541715, hate/threatening: 5.7560004051993E-7, violence/graphic: 0.16170734167099, self-harm/intent: 4.3421669033705E-6, self-harm/instructions: 9.3612385398956E-8, harassment/threatening: 0.09645314514637, violence: 0.54975998401642
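
Note that violence scored higher (0.55) than harassment (0.46) here, yet only harassment was flagged, so the flags are evidently not a single fixed cutoff applied to the scores. With the scores logged, you could layer your own cutoffs on top. A sketch, where `categories_over` and all threshold values are hypothetical and would need tuning against your own traffic:

```python
from typing import Dict, List


def categories_over(
    scores: Dict[str, float],
    thresholds: Dict[str, float],
    default: float = 0.5,
) -> List[str]:
    """List categories whose score meets its own (or the default) threshold."""
    return sorted(
        name
        for name, score in scores.items()
        if score >= thresholds.get(name, default)
    )
```

Run on the scores above with a harassment cutoff of 0.4 and the default of 0.5 elsewhere, this would return both harassment and violence.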
