Inaccuracy in Moderation API Results for Complex Queries - Seeking Help

Hello everyone,

We have been utilizing the Moderation API (OpenAI Platform) to filter out certain categories of content in our application. While we have found success with simple queries, we’ve noticed that the API’s accuracy drops when dealing with complex queries.

For example, when we input a query like “I am feeling suicidal,” the API correctly identifies it as self-harm content.

However, if we phrase the query to involve a second person and make it more complex, as follows: “My son is feeling suicidal when he is depressed, what should I do?”, the Moderation API fails to flag it as self-harm.

We are surprised by this inconsistency, given OpenAI’s reliability in other areas. We would like to know if anyone else has faced similar issues with the Moderation API, particularly when it comes to complex queries and/or detecting self-harm content.

We would appreciate any feedback, insights, or suggestions on improving the accuracy of the Moderation API’s results for complex queries.

Thank you.

Note: We are currently using the “text-moderation-latest” model.

I can’t be certain, but it may come from the fact that these terms are in regular use both for personal exploration and in literature, stories, film, music, etc. Dark as the topic is, it is a very common topic of discussion between humans in all kinds of settings.

I actually think the question from a father about his son is a common one and seems, on the face of it and given no other context, acceptable to me. Now perhaps the AI’s internal system would take issue with it, but there is a fine line between insightful harm reduction and censorship of the API.

I’m curious how it scores on category_scores. It’s possible that the API does detect it as self-harm, but the score isn’t strong enough to flag. Maybe the raw scores would be more helpful than the boolean categories?
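One way to check is to log the raw category_scores next to the boolean flags. A minimal sketch, assuming the openai v1 Python SDK and an OPENAI_API_KEY in the environment (the `top_scores` helper is mine, not part of the API):

```python
def top_scores(category_scores: dict, n: int = 3) -> list:
    """Return the n highest-scoring categories as (name, score) pairs."""
    return sorted(category_scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

def inspect(text: str) -> None:
    """Print the boolean flag and the strongest raw scores for one input."""
    from openai import OpenAI  # assumes openai>=1.0 is installed

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    result = client.moderations.create(
        model="text-moderation-latest", input=text
    ).results[0]
    print("flagged:", result.flagged)
    for name, score in top_scores(result.category_scores.model_dump()):
        print(f"  {name}: {score:.6f}")

# inspect("My son is feeling suicidal when he is depressed, what should I do?")
```

Even when `flagged` is false, the self-harm score in the dump may still stand out well above the other categories, which is what you’d build a custom threshold on.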

To a human reader, the second query does not look like an intent to self-harm at all, but depending on the use case, perhaps you could increase the sensitivity.

Ah, if you are looking at the “true”/“false” flags, those are absolutes, as in: always reject content that hits a true. The floating-point values below them are there for you to build your own moderation thresholds and system from. A score will be very small (0.00000-something) when the category is not detected at all, and it will approach 1 when that category is strongly detected. Best practice is to experiment with the moderation endpoint and produce a set of numeric thresholds you feel happy using for your application and customer base.
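That tuning step can be sketched as a small pure function over the category_scores dict. The threshold numbers below are illustrative placeholders, not recommendations; you would calibrate them against your own test inputs:

```python
DEFAULT_THRESHOLD = 0.5

# Hypothetical per-category thresholds: lower means more sensitive.
THRESHOLDS = {
    "self-harm": 0.01,  # deliberately very sensitive for this category
    "violence": 0.3,
}

def custom_flag(category_scores: dict) -> dict:
    """Return {category: bool} by comparing each score to its threshold."""
    return {
        name: score >= THRESHOLDS.get(name, DEFAULT_THRESHOLD)
        for name, score in category_scores.items()
    }

# Hypothetical scores, shaped like the endpoint's category_scores:
scores = {"self-harm": 0.08, "violence": 0.02, "hate": 0.001}
print(custom_flag(scores))
# → {'self-harm': True, 'violence': False, 'hate': False}
```

With a lowered self-harm threshold, a query like the father’s question can still trip your own flag even when the endpoint’s boolean stays false.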