I’m exploring the Moderation API to determine whether a given text contains malicious intent. I’ve tested the API with various sentences to see how it evaluates them. It seems to effectively flag overtly harmful content or sentences with obviously malicious words.
However, ambiguous sentences pose a challenge. For example: “Mom, I’m in urgent need of money. Can you transfer 3 million KRW to my account?”
In this case, it’s hard to tell whether the sentence is genuinely a request for help or a scenario where a criminal might exploit such language.
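For reference, here is roughly how I have been calling the endpoint (a minimal sketch with the Python SDK; the model name is just the one I happened to test with):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

text = "Mom, I'm in urgent need of money. Can you transfer 3 million KRW to my account?"

# Score the sentence with the moderation endpoint.
result = client.moderations.create(
    model="omni-moderation-latest",  # the model I've been testing with
    input=text,
)

print(result.results[0].flagged)          # overall boolean verdict
print(result.results[0].category_scores)  # per-category scores
```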
Are there any alternative approaches to improve this? Would it be effective to combine the Moderation API with other GPT models for enhanced detection?
This, however, isn’t easy - and I’m not sure there’s any current commercial framework that can reliably handle this. Some of us have something approximating that, but I don’t know of any that can be put into operation right now.
One of the big hurdles is that you need to generate and curate a model of the world. You need access to information that details the evolving relationships between things (e.g. “memory”). That, in and of itself, is doable, but most solutions tend to be quite brittle and don’t always scale very well.
Maybe you can take a crack at it?
Are you looking for a general solution, or is there a specific use-case that sort of restricts the scope a little bit?
Thank you so much for your response. I truly appreciate the time and effort you’ve taken to provide insights.
Unfortunately, I’m unable to share detailed information about the task I’m working on, but here’s a general idea:
In the process of transforming Voice A into Voice B, I want to filter the recorded audio file from Voice A for any potentially harmful or malicious content before the transformation occurs.
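To give a rough picture without the details, the gating step might look something like this. It is only a sketch of what I have in mind; it assumes the check runs on a transcript of the audio (Whisper here is just a stand-in for whatever transcription step is used), and the transformation itself is out of scope:

```python
from openai import OpenAI

client = OpenAI()

def screen_before_transform(audio_path: str) -> tuple[bool, str]:
    """Transcribe Voice A's recording and screen the transcript.

    Returns (safe_to_transform, transcript). The voice transformation
    itself happens elsewhere; this only gates it.
    """
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",  # assumed transcription step
            file=f,
        ).text

    moderation = client.moderations.create(
        model="omni-moderation-latest",
        input=transcript,
    )
    return (not moderation.results[0].flagged, transcript)
```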
Sounds reasonable, and it looks like you’re on the right track!
I would approach it from the perspective of trying to identify the user’s intent, and then (possibly in a separate step to prevent contamination) classifying whether that intent is compliant with your policies.
For the intent identification, you can prompt the model to be either explicitly cautious or near-paranoid by giving it specific threats it should watch out for, which will color its intent interpretation.
Tweaking that from overly cautious to too lenient will take some time, but that’s the cost of doing business.
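As a rough sketch of what I mean (the model name, prompts, and policy wording are all placeholders you would tune):

```python
from openai import OpenAI

client = OpenAI()

INTENT_PROMPT = (
    "You analyze a single transcribed utterance. Describe the speaker's most "
    "likely intent in one sentence. Be cautious: consider whether the wording "
    "could be used for fraud, impersonation, or coercion."
)

POLICY_PROMPT = (
    "You are given a description of a speaker's intent. Answer only COMPLIANT "
    "or NON_COMPLIANT with respect to this policy: no content that could "
    "facilitate fraud, impersonation, or coercion."
)

def check(utterance: str) -> tuple[str, str]:
    # Step 1: identify the user's intent.
    intent = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        temperature=0,
        messages=[
            {"role": "system", "content": INTENT_PROMPT},
            {"role": "user", "content": utterance},
        ],
    ).choices[0].message.content

    # Step 2: classify that intent against policy in a separate call,
    # so the raw utterance doesn't contaminate the verdict.
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": POLICY_PROMPT},
            {"role": "user", "content": intent},
        ],
    ).choices[0].message.content

    return intent, verdict
```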
Thank you so much for your detailed advice! I really appreciate the thoughtful suggestions you provided, and I plan to incorporate them into my approach.
Additionally, I wanted to share a method I’ve been considering and get your thoughts on it. My idea involves implementing a Multi-Layer Filter approach:
First Layer: Use the Moderation API to filter out overtly harmful, discriminatory, or explicit content.
Second Layer: For ambiguous sentences that pass through the first filter, utilize a GPT model to further analyze the context and intent.
For example: A sentence like, “Mom, I’m in urgent need of money. Can you transfer 3 million KRW to my account?” might not seem malicious at first glance. However, it could potentially be exploited for fraudulent purposes depending on the context.
In this step, the GPT model would assess the background and intent of the request, considering questions like, “What is the likely reason behind this request? Could it be viewed as legitimate? Or is there a chance it might be used with malicious or fraudulent intent?”
Finally, the analyzed context from the second layer would feed into a classification system to create a more robust filtering mechanism.
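Putting it together, the pipeline I have in mind might look roughly like this. It is only a sketch; the model names, the analysis prompt, and the final decision rule are placeholders:

```python
from openai import OpenAI

client = OpenAI()

ANALYSIS_PROMPT = (
    "Analyze the following utterance. What is the likely reason behind the "
    "request? Could it be viewed as legitimate, or is there a chance it might "
    "be used with malicious or fraudulent intent? End your answer with a "
    "single label: LEGITIMATE, AMBIGUOUS, or LIKELY_FRAUD."
)

def multilayer_filter(text: str) -> dict:
    # Layer 1: Moderation API for overtly harmful, discriminatory, or explicit content.
    mod = client.moderations.create(model="omni-moderation-latest", input=text)
    if mod.results[0].flagged:
        return {"decision": "block", "layer": 1}

    # Layer 2: GPT model analyzes the context and intent of ambiguous sentences.
    analysis = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        temperature=0,
        messages=[
            {"role": "system", "content": ANALYSIS_PROMPT},
            {"role": "user", "content": text},
        ],
    ).choices[0].message.content

    # Final step: the analyzed context feeds a simple classification rule
    # (a real system would use a proper classifier or score threshold here).
    decision = "block" if "LIKELY_FRAUD" in analysis else "allow"
    return {"decision": decision, "layer": 2, "analysis": analysis}
```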
This is just a preliminary idea I’ve been exploring, and I would love to hear your thoughts on its feasibility or ways it could be improved.
Once again, thank you so much for your invaluable input!
Thank you for your kind words and encouragement! I really appreciate your feedback and will keep iterating to improve. Your insights have been incredibly helpful.
You should send any untrusted inputs to the moderation endpoint. Use of moderations is not scored against you, and in fact, your continued use can show that you are following best practices.
That is especially the case when using a language AI as a supplemental content filter. You should first run OpenAI’s content “filter” on whatever you plan to send to a language model for detecting “malicious intent”, and run it against both moderation models, even. Plus, a flag and rejection score there costs you nothing.
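For example, a minimal pre-screen might look like this (just a sketch; check which moderation model names are currently available to you):

```python
from openai import OpenAI

client = OpenAI()

def pre_screen(text: str) -> bool:
    """Return True if any moderation model flags the text,
    before it is ever sent to a language model for intent analysis."""
    for model in ("omni-moderation-latest", "text-moderation-latest"):
        result = client.moderations.create(model=model, input=text)
        if result.results[0].flagged:
            return True
    return False
```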
When using API models, there is no immediate block or report along the lines of “this API call scored high, was flagged”, etc., nor can you observe the scoring mechanism or algorithm for undesirable inputs to see how your account may be in jeopardy, until you get a warning or are simply shut off.
o1 prompt rejections are the exception: they immediately deny the API call’s return and quickly count against you towards a shut-off.
Expect every input and output to be analyzed for content, or even to find other undesired usage patterns. (maybe OpenAI batches their own scoring and flagging to off-hours for optimization)
Moderations cannot generate any output (except your current ranking on the high-score leaderboard for maxing out every flag category). Thus there is no risk to you or to OpenAI from the content itself, or from an AI making bad generations from it. Moderations is zero-data-retention.