Moderation Flagging Workaround

So, my moderation flagging mechanism appears to be working. This prompt was rejected and logged, and I was sent a notification.

But then, the user accidentally did this (he told me he wasn’t trying anything; it was a typo):

The moderation model didn’t catch it.
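For context, the gate is roughly along these lines. This is a minimal sketch rather than my exact code: it assumes the OpenAI Python SDK, and notify_admin() is a hypothetical stand-in for whatever alerting channel you use.

```python
import logging

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
log = logging.getLogger("moderation")


def notify_admin(message: str) -> None:
    # Hypothetical helper: wire this up to your real channel (email, Slack, ...).
    print(f"[ADMIN ALERT] {message}")


def gate_prompt(prompt: str) -> bool:
    """Return True if the prompt may proceed, False if it was blocked."""
    result = client.moderations.create(input=prompt).results[0]
    if result.flagged:
        log.warning("Blocked prompt: %r", prompt)
        notify_admin(f"Moderation flagged a prompt: {prompt!r}")
        return False
    return True
```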


One thing I’ve noticed OpenAI do is periodically (or maybe just at the end) call the Moderation endpoint on the response, to catch these types of workarounds.
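In code, that output-side double-check might look something like this. A sketch only: it reuses the gate_prompt() function and client sketched above, and gpt-4 stands in for whichever completion model you’re actually calling.

```python
def gate_response(response_text: str) -> bool:
    """Moderate the model's output, not just the user's input."""
    result = client.moderations.create(input=response_text).results[0]
    if result.flagged:
        log.warning("Blocked response: %r", response_text)
        return False
    return True


def chat_with_double_moderation(prompt: str) -> str:
    # Even if a garbled prompt slips past input moderation, the model's
    # reply is usually coherent text the Moderation endpoint can score.
    if not gate_prompt(prompt):
        return "Content blocked: violates usage policies."
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = completion.choices[0].message.content or ""
    if not gate_response(answer):
        return "Content blocked: violates usage policies."
    return answer
```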

If you are okay with sacrificing some latency, you could also run the input through an initial “spell-check” pass that doubles up the moderation, maybe? (See the sketch below.)

Also, nice.
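A rough sketch of that idea, using the third-party pyspellchecker package (pip install pyspellchecker); any corrector would do, and it reuses the gate_prompt() sketch from earlier in the thread:

```python
from spellchecker import SpellChecker

spell = SpellChecker()


def spell_corrected(text: str) -> str:
    # correction() can return None when it has no candidate;
    # keep the original word in that case.
    return " ".join(spell.correction(word) or word for word in text.split())


def gate_prompt_with_spellcheck(prompt: str) -> bool:
    # Moderate both the raw input and a spell-corrected copy, so a
    # misspelling (deliberate or accidental) can't dodge the filter.
    # This costs an extra moderation call, which is the latency trade-off.
    return gate_prompt(prompt) and gate_prompt(spell_corrected(prompt))
```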


Unless the moderator is as smart as the language model itself, you’ll be able to outfox it.

It would have to infer the likely output through the many logical steps that gpt-4 can perform, and also know the context that was being built up.

Well, technically, the second one wasn’t flagged because, taken on its own, it shouldn’t have been. But are you sending the entire prompt to moderation or just the new user input? If the latter, it’s working as expected… though I’d agree it’s kind of a “hack” of sorts…
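The difference in code is roughly the following sketch: scoring only the newest user turn versus scoring the whole running transcript. Sending the concatenated history gives the moderation model the context this kind of workaround relies on. It reuses the client from the earlier sketch, and messages is assumed to be a standard chat-format list.

```python
def gate_conversation(messages: list[dict]) -> bool:
    """Moderate the full transcript, not just the latest user input."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    result = client.moderations.create(input=transcript).results[0]
    return not result.flagged
```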

I wonder what the scores are for each category. Did they all light up as true with such a query? I understand the sex-related categories, but the others?

I just realized I was not tracking scores; I just flagged the prompt if the Moderation API flagged it. Thanks for the question. I’ve modified the code to return the categories and the category scores.
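For anyone curious, the change amounts to something like this. A sketch of what returning the categories and scores can look like; field names follow the OpenAI Python SDK’s moderation response objects.

```python
def moderate_with_scores(text: str) -> dict:
    resp = client.moderations.create(input=text)
    result = resp.results[0]
    return {
        "id": resp.id,  # e.g. "modr-..."
        "flagged": result.flagged,
        # by_alias=True keeps the API's hyphenated names like "self-harm"
        "categories": result.categories.model_dump(by_alias=True),
        "category_scores": result.category_scores.model_dump(by_alias=True),
    }
```

The “Flagged Categories” line in the log below is then just the subset of categories whose boolean came back true.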

Actually, the first time I’ve ever had a flag!


Now that I’ve started tracking the scores…

Username: xxx
Email: xxxx
Flagged!
ID: modr-88awF9wKkOSPw7B3SZv2PvYx2GIqS
Content: I seem to recall something about her throwing her son’s foreskin at him and saying he was a bloody husband to her
Content blocked: Violates usage policies.
Flagged Categories: harassment
Category Scores:
  sexual: 0.0052169840782881
  hate: 0.0030276507604867
  harassment: 0.45535776019096
  self-harm: 3.6646597436629E-5
  sexual/minors: 0.0060147433541715
  hate/threatening: 5.7560004051993E-7
  violence/graphic: 0.16170734167099
  self-harm/intent: 4.3421669033705E-6
  self-harm/instructions: 9.3612385398956E-8
  harassment/threatening: 0.09645314514637
  violence: 0.54975998401642