I’m exploring the Moderation API to determine whether a given text contains malicious intent. I’ve tested the API with various sentences to see how it evaluates them. It seems to effectively flag overtly harmful content or sentences with obviously malicious words.
However, ambiguous sentences pose a challenge. For example: “Mom, I’m in urgent need of money. Can you transfer 3 million KRW to my account?”
In this case, it’s hard to tell whether the sentence is genuinely a request for help or a scenario where a criminal might exploit such language.
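For reference, here is roughly how I have been calling the endpoint (a minimal sketch with the Python SDK; the model name is just the one I happened to test with):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

text = "Mom, I'm in urgent need of money. Can you transfer 3 million KRW to my account?"

# Score the sentence with the moderation endpoint.
result = client.moderations.create(
    model="omni-moderation-latest",  # the model I've been testing with
    input=text,
)

print(result.results[0].flagged)          # overall boolean verdict
print(result.results[0].category_scores)  # per-category scores
```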
Are there any alternative approaches to improve this? Would it be effective to combine the Moderation API with other GPT models for enhanced detection?
This, however, isn’t easy - and I’m not sure there’s any current commercial framework that can reliably handle this. Some of us have something approximating that, but I don’t know of any that can be put into operation right now.
One of the big hurdles is that you need to generate and curate a model of the world. You need access to information that details the evolving relationships between things (e.g. “memory”). That, in and of itself, is doable, but most solutions tend to be quite brittle and don’t always scale very well.
Maybe you can take a crack at it?
Are you looking for a general solution, or is there a specific use-case that sort of restricts the scope a little bit?
Thank you so much for your response. I truly appreciate the time and effort you’ve taken to provide insights.
Unfortunately, I’m unable to share detailed information about the task I’m working on, but here’s a general idea:
In the process of transforming Voice A into Voice B, I want to filter the recorded audio file from Voice A for any potentially harmful or malicious content before the transformation occurs.
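To give a rough picture without the details, the gating step might look something like this. It is only a sketch of what I have in mind; it assumes the check runs on a transcript of the audio (Whisper here is just a stand-in for whatever transcription step is used), and the transformation itself is out of scope:

```python
from openai import OpenAI

client = OpenAI()

def screen_before_transform(audio_path: str) -> tuple[bool, str]:
    """Transcribe Voice A's recording and screen the transcript.

    Returns (safe_to_transform, transcript). The voice transformation
    itself happens elsewhere; this only gates it.
    """
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",  # assumed transcription step
            file=f,
        ).text

    moderation = client.moderations.create(
        model="omni-moderation-latest",
        input=transcript,
    )
    return (not moderation.results[0].flagged, transcript)
```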
Sounds reasonable, and it looks like you’re on the right track!
I would approach it from the perspective of trying to identify the user’s intent, and then (possibly in a separate step to prevent contamination) classifying whether that intent is compliant with your policies.
For the intent identification, you can prompt the model to be either explicitly cautious or near-paranoid by giving it specific threats it should watch out for, which will color its intent interpretation.
Tweaking that from overly cautious to too lenient will take some time, but that’s the cost of doing business.
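As a rough sketch of what I mean (the model name, prompts, and policy wording are all placeholders you would tune):

```python
from openai import OpenAI

client = OpenAI()

INTENT_PROMPT = (
    "You analyze a single transcribed utterance. Describe the speaker's most "
    "likely intent in one sentence. Be cautious: consider whether the wording "
    "could be used for fraud, impersonation, or coercion."
)

POLICY_PROMPT = (
    "You are given a description of a speaker's intent. Answer only COMPLIANT "
    "or NON_COMPLIANT with respect to this policy: no content that could "
    "facilitate fraud, impersonation, or coercion."
)

def check(utterance: str) -> tuple[str, str]:
    # Step 1: identify the user's intent.
    intent = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        temperature=0,
        messages=[
            {"role": "system", "content": INTENT_PROMPT},
            {"role": "user", "content": utterance},
        ],
    ).choices[0].message.content

    # Step 2: classify that intent against policy in a separate call,
    # so the raw utterance doesn't contaminate the verdict.
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": POLICY_PROMPT},
            {"role": "user", "content": intent},
        ],
    ).choices[0].message.content

    return intent, verdict
```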
Thank you so much for your detailed advice! I really appreciate the thoughtful suggestions you provided, and I plan to incorporate them into my approach.
Additionally, I wanted to share a method I’ve been considering and get your thoughts on it. My idea involves implementing a Multi-Layer Filter approach:
First Layer: Use the Moderation API to filter out overtly harmful, discriminatory, or explicit content.
Second Layer: For ambiguous sentences that pass through the first filter, utilize a GPT model to further analyze the context and intent.
For example: A sentence like, “Mom, I’m in urgent need of money. Can you transfer 3 million KRW to my account?” might not seem malicious at first glance. However, it could potentially be exploited for fraudulent purposes depending on the context.
In this step, the GPT model would assess the background and intent of the request, considering questions like, “What is the likely reason behind this request? Could it be viewed as legitimate? Or is there a chance it might be used with malicious or fraudulent intent?”
Finally, the analyzed context from the second layer would feed into a classification system to create a more robust filtering mechanism.
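Putting it together, the pipeline I have in mind might look roughly like this. It is only a sketch; the model names, the analysis prompt, and the final decision rule are placeholders:

```python
from openai import OpenAI

client = OpenAI()

ANALYSIS_PROMPT = (
    "Analyze the following utterance. What is the likely reason behind the "
    "request? Could it be viewed as legitimate, or is there a chance it might "
    "be used with malicious or fraudulent intent? End your answer with a "
    "single label: LEGITIMATE, AMBIGUOUS, or LIKELY_FRAUD."
)

def multilayer_filter(text: str) -> dict:
    # Layer 1: Moderation API for overtly harmful, discriminatory, or explicit content.
    mod = client.moderations.create(model="omni-moderation-latest", input=text)
    if mod.results[0].flagged:
        return {"decision": "block", "layer": 1}

    # Layer 2: GPT model analyzes the context and intent of ambiguous sentences.
    analysis = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        temperature=0,
        messages=[
            {"role": "system", "content": ANALYSIS_PROMPT},
            {"role": "user", "content": text},
        ],
    ).choices[0].message.content

    # Final step: the analyzed context feeds a simple classification rule
    # (a real system would use a proper classifier or score threshold here).
    decision = "block" if "LIKELY_FRAUD" in analysis else "allow"
    return {"decision": decision, "layer": 2, "analysis": analysis}
```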
This is just a preliminary idea I’ve been exploring, and I would love to hear your thoughts on its feasibility or ways it could be improved.
Once again, thank you so much for your invaluable input!
Thank you for your kind words and encouragement! I really appreciate your feedback and will keep iterating to improve. Your insights have been incredibly helpful.
You should send any untrusted inputs to the moderation endpoint. Use of moderations is not scored against you, and in fact, your continued use can show that you are following best practices.
That is especially the case when using a language AI as a supplemental content filter. You should first run OpenAI’s content “filter” on whatever you plan to send to a language model for detecting “malicious intent”, and run it against both moderation models, even. Plus, a flag and rejection score there costs you nothing.
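For example, a minimal pre-screen might look like this (just a sketch; check which moderation model names are currently available to you):

```python
from openai import OpenAI

client = OpenAI()

def pre_screen(text: str) -> bool:
    """Return True if any moderation model flags the text,
    before it is ever sent to a language model for intent analysis."""
    for model in ("omni-moderation-latest", "text-moderation-latest"):
        result = client.moderations.create(model=model, input=text)
        if result.results[0].flagged:
            return True
    return False
```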
When using API models, there is no immediate block or report along the lines of “this API call scored high, was flagged”, etc., nor can you observe the scoring mechanism or algorithm for undesirable inputs to see how your account may be in jeopardy, until you get a warning or are simply shut off.
o1 prompt rejections are the exception: they immediately deny the API call’s return and quickly count against you towards a shut-off.
Expect every input and output to be analyzed for content, or even to find other undesired usage patterns. (maybe OpenAI batches their own scoring and flagging to off-hours for optimization)
Moderations cannot generate any output (except your current ranking on the high-score leaderboard for maxing out every flag category). Thus there is no risk to you or to OpenAI from the content itself, or from an AI making bad generations from it. Moderations is zero-data-retention.