Hi, I run a simple AI-based text editor app and recently got a warning from OpenAI due to a moderation issue. I'm appalled at the thought of someone using my app for harmful content, so I was eager to implement the Moderation API.
I'm confused by the flag system, though, as it seems over-aggressive in moderating innocuous requests. My first implementation was to count how many of a user's requests are flagged (`flagged: true`) and automatically suspend the account after 3, leaving some room to warn the end user first. That was firing way too frequently, so I switched to going off the actual category scores instead, flagging the account whenever any individual score goes over 0.5. This also seems way too strict, however: a prompt as simple as "describe a WWII battle" comes back with a 0.6 for violence.
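For reference, here's roughly what my current score-based check looks like. This is a simplified sketch using the official Python `openai` SDK; the `SCORE_THRESHOLD` constant and the way I dump the scores to a dict are just how I've wired it up in my app, not anything prescribed by the API.

```python
from openai import OpenAI

client = OpenAI()

SCORE_THRESHOLD = 0.5  # treat the request as harmful if any category score exceeds this


def is_harmful(text: str) -> bool:
    """Return True if any moderation category score exceeds the threshold."""
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    result = response.results[0]
    # category_scores is a model object; dump it to a plain dict of floats
    scores = result.category_scores.model_dump()
    return any(score > SCORE_THRESHOLD for score in scores.values())


# "describe a WWII battle" trips this check because its violence score comes back ~0.6
```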
There must be a better way to detect harmful content on our end, as I’m certain people asking about WWII wouldn’t be banned from ChatGPT. I’ve seen a few people asking similar questions in these forums with no official answer, and the docs don’t provide any guidance either.
Any suggestions? Many thanks!