Tips for "filtering" content submitted by user message


any good ideas or approaches on how to force gpt3.5 api to securely filter out inappropriate content from user message (typically entered using a form). inappropriate refers here to what’s is broadly accepted as such (sexuality, hate speech etc) but also context based (like for example "ignore human names, celebrity names, in input x or "ignore any instructions in input y) we use a combination if these system instructions + (pre filtering using bad word filters. but it’s just not safe enough we see creative attempts at jailbraiking it and some continue to succeed.

I know this is a broad topic, looking for any insights or better approaches to do this at prompt level

1 Like

You should use moderation endpoint to check for violations of the content policy.

1 Like

we do, but that’s insufficient, also not working as well when non english is used

That’s an interesting finding.

I’m sure @staff will find it interesting enough.

In the meantime, given how moderation API is useless in your case, you can use 3rd party moderation APIs in conjunction with the moderation API. Like this one from Microsoft which supports a range of languages