I’m testing the text-moderation-latest API to detect harmful content. If I add an insult to the text, the text is flagged, but if I add a link to a porn site or write “porn” or similar words, flagged comes back false.
You can use the numeric category scores that are returned instead of only the flagged true/false result. Since the flag trips at an undocumented threshold, you’ll have to test and work out for yourself which score cutoff, evaluated in your own code, is actually more sensitive than the built-in flag.
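As a rough sketch of that idea in Python: compare the returned scores against your own cutoffs. The 0.2 cutoff for "sexual" and the helper names are made up for illustration, and you would need to tune the number against your own test inputs. The API call assumes the official `openai` package (v1 or later) with `OPENAI_API_KEY` set, and that `category_scores` dumps to a dict keyed by category name.

```python
from typing import Dict, List, Optional

# Hypothetical per-category cutoffs; tune them against your own test inputs.
CUSTOM_THRESHOLDS: Dict[str, float] = {"sexual": 0.2}


def categories_over_threshold(
    scores: Dict[str, Optional[float]],
    thresholds: Dict[str, float] = CUSTOM_THRESHOLDS,
) -> List[str]:
    """Return the categories whose score meets or exceeds its custom cutoff."""
    return sorted(
        name
        for name, cutoff in thresholds.items()
        # A missing or None score is treated as 0.0.
        if (scores.get(name) or 0.0) >= cutoff
    )


def fetch_scores(text: str, model: str = "text-moderation-latest") -> Dict[str, Optional[float]]:
    """Call the moderation endpoint and return the category scores as a dict.

    Assumption: official `openai` package (v1+), API key in the environment.
    """
    from openai import OpenAI  # imported lazily so the helper above works offline

    client = OpenAI()
    result = client.moderations.create(model=model, input=text).results[0]
    return result.category_scores.model_dump()
```

You would then treat a message as flagged when `categories_over_threshold(fetch_scores(text))` is non-empty, regardless of what the built-in flag says.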
You can also check whether the omni moderation model gives the flagging you want. There’s another thread here about it going off with an unwanted “sexual” flag, so it may be what you’re after.
Thank you. I asked ChatGPT (I should have done this before posting), and now I realize that moderation looks for “harmful content”, but just talking about sex or porn is not always harmful content. It suggested adding some regular filters for specific words for this case.
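In case it helps someone else, here is a minimal sketch of that kind of word filter in Python. The term list and function names are placeholders I made up, not anything from the API. It matches each term at the start of a word, case-insensitively, so “porn”, “Porno” and a URL like `https://pornsite.example` are caught, while a word that only contains the term in the middle is not.

```python
import re
from typing import Iterable, List, Pattern

# Hypothetical block list; replace with the terms that matter for your site.
BLOCKED_TERMS = ["porn", "xxx"]


def build_pattern(terms: Iterable[str]) -> Pattern[str]:
    """Compile one case-insensitive regex matching words that start with any term."""
    alternatives = "|".join(re.escape(term) for term in terms)
    return re.compile(rf"\b(?:{alternatives})\w*", re.IGNORECASE)


BLOCKED_PATTERN = build_pattern(BLOCKED_TERMS)


def find_blocked_terms(text: str, pattern: Pattern[str] = BLOCKED_PATTERN) -> List[str]:
    """Return every matched word, lowercased, in order of appearance."""
    return [match.group(0).lower() for match in pattern.finditer(text)]
```

The idea would be to reject a message if either the moderation result is flagged or `find_blocked_terms(text)` returns anything. A plain word list will miss deliberate misspellings, so it is a supplement to moderation rather than a replacement.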