One thing I’ve noticed OpenAI do is call the Moderation Endpoint on the response, either periodically or maybe just at the end, to catch these types of workarounds.
If you are okay with sacrificing latency, you could also run it through an initial “spell-check” pass that doubles up the moderation, maybe?
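For what it’s worth, here is a rough sketch of the “periodically, or just at the end” idea. Everything in it is an assumption for illustration: `collect_with_moderation`, `is_flagged` and `check_every` are made-up names, and `is_flagged` stands in for whatever wrapper you have around the Moderation Endpoint. I have no idea how OpenAI actually does this internally.

```python
from typing import Callable, Iterable, Tuple


def collect_with_moderation(
    chunks: Iterable[str],
    is_flagged: Callable[[str], bool],  # hypothetical wrapper around the moderation call
    check_every: int = 200,             # assumed interval, in characters
) -> Tuple[str, bool]:
    """Accumulate response chunks, re-checking the text as it grows.

    Returns (text_so_far, flagged). Stops early once a check is flagged.
    """
    text = ""
    last_checked = 0
    for chunk in chunks:
        text += chunk
        # Periodic check: only after enough new characters have arrived.
        if len(text) - last_checked >= check_every:
            last_checked = len(text)
            if is_flagged(text):
                return text, True
    # Final check at the end, covering any text not yet checked.
    if len(text) > last_checked and is_flagged(text):
        return text, True
    return text, False
```

Checking the accumulated text rather than each chunk on its own is deliberate: a phrase split across two chunks would otherwise slip through. The cost is one extra moderation call per interval, which is the latency trade-off mentioned above.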
Well, technically, the second one wasn’t flagged, and it shouldn’t have been. But are you sending the entire prompt to moderation or just the new user input? If the latter, it’s working as expected… though I’d agree it’s kind of a “hack” of sorts…
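To make the difference concrete, here is a minimal sketch that checks both the new input on its own and the whole conversation. The names `check_turn` and `is_flagged` are hypothetical, and `is_flagged` is again a placeholder for your own call to the moderation endpoint, not a real library function.

```python
from typing import Callable, Dict, List


def check_turn(
    history: List[str],
    new_input: str,
    is_flagged: Callable[[str], bool],  # hypothetical wrapper around the moderation call
) -> Dict[str, bool]:
    """Moderate the new message alone and together with the prior turns."""
    # Join earlier turns with the new one so context split across messages is seen.
    conversation = "\n".join(history + [new_input])
    return {
        "new_input_only": is_flagged(new_input),
        "full_conversation": is_flagged(conversation),
    }
```

If `new_input_only` comes back false while `full_conversation` comes back true, the content was split across turns, which is the kind of “hack” being discussed.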
I just realized I was not tracking scores; I only flagged the prompt if the Moderation API flagged it. Thanks for the question. I’ve modified the code so it now returns the categories and the category scores.
Username:xxx
Email:xxxx
Flagged!
ID: modr-88awF9wKkOSPw7B3SZv2PvYx2GIqS
Content: I seem to recall something about her throwing her son’s foreskin at him and saying he was a bloody husband to her
Content blocked: Violates usage policies.
Flagged Categories: harassment
Category Scores: sexual: 0.0052169840782881, hate: 0.0030276507604867, harassment: 0.45535776019096, self-harm: 3.6646597436629E-5, sexual/minors: 0.0060147433541715, hate/threatening: 5.7560004051993E-7, violence/graphic: 0.16170734167099, self-harm/intent: 4.3421669033705E-6, self-harm/instructions: 9.3612385398956E-8, harassment/threatening: 0.09645314514637, violence: 0.54975998401642
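In case it helps anyone else, output like the above could be produced along these lines. This is a Python sketch, not the code actually used for the post, and it assumes the moderation response has already been parsed into a plain dict with the usual `id` and `results[0]` fields (`flagged`, `categories`, `category_scores`). The function names are made up, and the sample values are just two of the scores from the output above with a placeholder ID.

```python
from typing import Any, Dict


def summarize_moderation(response: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the flag, the flagged category names and all scores out of a response."""
    result = response["results"][0]
    # Keep only the categories the API itself marked as true.
    flagged = [name for name, hit in result["categories"].items() if hit]
    return {
        "id": response.get("id"),
        "flagged": result["flagged"],
        "flagged_categories": flagged,
        "category_scores": dict(result["category_scores"]),
    }


def format_report(summary: Dict[str, Any]) -> str:
    """Render a summary in roughly the same layout as the log above."""
    scores = ", ".join(f"{k}: {v}" for k, v in summary["category_scores"].items())
    lines = [
        "Flagged!" if summary["flagged"] else "Not flagged",
        f"ID: {summary['id']}",
        f"Flagged Categories: {', '.join(summary['flagged_categories'])}",
        f"Category Scores: {scores}",
    ]
    return "\n".join(lines)


# Sample input: a trimmed-down response using two scores from the post above.
sample = {
    "id": "modr-example",  # placeholder, not a real ID
    "results": [{
        "flagged": True,
        "categories": {"harassment": True, "violence": False},
        "category_scores": {"harassment": 0.45535776019096, "violence": 0.54975998401642},
    }],
}
print(format_report(summarize_moderation(sample)))
```

Note that the flagged categories are read from the API’s own `categories` booleans rather than from a single cutoff on the scores. In the output above, violence scored higher than harassment but only harassment was flagged, so one fixed threshold across all categories would not reproduce that result.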