I am trying to feed a JSON file containing classes from ImageNet and asking GPT to remove a specific subset of it, but it seems that the ImageNet class names trigger the safeguards and prevent me from getting a response back. Is there any workaround for this since this is for a research project?
The input is being sent to the moderation endpoint to be scored, which when using your own client is optional.
Go to the upper-right … dot menu, pick “content filter preferences”.
Investigate more: You can score each of the words on the moderation endpoint, then send feedback on the playground for an inappropriate triggering case.
Thanks @_j for your reply. I have tried the option in the “content filter preferences” menu . but it is doing nothing. I will look into the moderation endpoint.
Short-term you can try to just send everything through the API rather than using the playground interface.
Long-term you’ll want to provide feedback to OpenAI about the issue so it can be ameliorated globally.
If the issue is with one of the words in the attached screenshot, I would assume it’s being triggered on
rapeseed, but I couldn’t say for sure.
With respect to the overall problem you’re trying to solve, my first question is always “is a language model the right tool for the particular job?”
May I ask what subset you’re trying to remove?
Thank you for your detailed response, @elmstedt Removing “rapeseed” alone is not enough!
In short, yes, the language model is absolutely what we need to conduct this experiment. This is not something that we can easily do based on WordNet, and so far, we have gotten extremely good results using GPT-4 and 3.5.
For this particular experiment, I just pasted the JSON file containing all classes, and to my surprise, it triggered the safeguard without any extra text in the prompt.
So, it seems the moderation API is checking the prompt word by word, regardless of the context or overall meaning of the prompt, at least in this case.
There are two layers of filtering. There’s the moderation filter but there’s also a keyword filter which is why a question about Dick Van Dyke wouldn’t trigger the moderation endpoint but will still get flagged in ChatGPT.
I’m sure there are several terms in the dataset which trigger the keyword filter.
Regardless, my question was which subset are you trying to filter? Knowing that I might be able to help more.
I made my own davinci-003 moderator on the endpoint categories and ran them all in batches of 200:
7 cock - hate
8 hen - hate
94 hummingbird - ?
413 assault rifle - threatening
I killed those off and still triggered. There might be a private “bad word” moderator. I think it is the sheer quantity of input together that causes the triggering score.
This is obviously not a complete or robust solution, but using ChatGPT you can bypass a lot of filters by using the Code Interpreter plugin.
@elmstedt Thank you for your help. We have a list of criteria, and based on that, we select subsets of ImageNet classes and perform multi-stage processing. We then feed the intermediate results to the GPT. The issue is not the pipeline or the use case; I want to temporarily disable these restrictions on the playground to do some experiments rapidly. Currently, the Interpreter is doing the job.
@_j This is very strange. We can have a prompt with those words; I even tested it with the word ‘cock.’ But why can’t we enter a list of ImageNet classes?