Background: While testing the API, I took a completion generated by gpt-3.5-turbo (not unsafe content or an unsafe topic) and pasted it into the Playground as an assistant message. When I pressed Submit, I was blocked with a ‘Prompt contains high-risk words’ message, so I followed its instructions and checked the text with the moderation API.
Strangely, it didn’t flag anything (I tried both the stable and latest endpoints and got JSON responses from both), so I can’t tell what exactly the problem is.
My understanding is that OpenAI uses at least two layers of moderation.
One is context-aware: that’s the moderation API, which “understands” the difference between Dick Van Dyke the actor and a slur.
The other is a keyword filter, which can’t tell the difference and blocks the request outright.
But maybe someone from OpenAI can shed more light on the subject?
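To illustrate why a keyword filter and a context-aware classifier can disagree, here is a minimal, hypothetical sketch of a context-blind keyword filter. The blocklist and the `keyword_filter` function are invented for illustration; OpenAI’s actual Playground filter and its word list are not public, so this only shows the general behaviour, not their implementation.

```python
import re

# Hypothetical blocklist for illustration only; the real Playground list is not public.
BLOCKLIST = {"dick"}

def keyword_filter(text: str, blocklist: set[str] = BLOCKLIST) -> bool:
    """Return True if any blocklisted word appears as a whole word.

    Matching is case-insensitive and ignores context entirely, so a proper
    name is treated the same as a slur.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return any(word in blocklist for word in words)

print(keyword_filter("Dick Van Dyke starred in Mary Poppins."))  # True: blocked despite being harmless
print(keyword_filter("The actor starred in Mary Poppins."))      # False
```

A context-aware classifier would score the first sentence as harmless, while this filter blocks it on the single word alone. That matches the behaviour described in this thread: one word trips the Playground while the moderation endpoint flags nothing.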
Oh I see, that explains a lot! I’ve narrowed it down to a single word that trips the filter in the Playground, but funnily enough, even when I send that word on its own, the moderation endpoints don’t flag it.