Something wrong with the text moderation API

While I was trying to translate this article into Indonesian:
The Inside Story of Microsoft’s Partnership with OpenAI | The New Yorker

I got a response like this lmao

I ran different lengths of the article, all ending with “Toner”. No flags.

{
  "model": "text-moderation-006",
  "results": [
    {
      "categories": {
        "harassment": false,
        "harassment_threatening": false,
        "hate": false,
        "hate_threatening": false,
        "self_harm": false,
        "self_harm_instructions": false,
        "self_harm_intent": false,
        "sexual": false,
        "sexual_minors": false,
        "violence": false,
        "violence_graphic": false
      },
      "category_scores": {
        "harassment": "0.0010571",
        "harassment_threatening": "0.0000557",
        "hate": "0.0001188",
        "hate_threatening": "0.0000871",
        "self_harm": "0.0044514",
        "self_harm_instructions": "0.0001047",
        "self_harm_intent": "0.0003639",
        "sexual": "0.0166795",
        "sexual_minors": "0.0019216",
        "violence": "0.0136883",
        "violence_graphic": "0.0009131"
      },
      "flagged": false
    }
  ]
}
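For anyone wanting to reproduce the check above, here is a minimal sketch of the call, assuming the current openai Python client; the file path is a placeholder for a local copy of the article (or its Indonesian translation), and the response above reports text-moderation-006 as the model.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder path: a local copy of the article (or its translation)
with open("new_yorker_article.txt", encoding="utf-8") as f:
    article_text = f.read()

resp = client.moderations.create(input=article_text)

result = resp.results[0]
print("model:        ", resp.model)
print("flagged:      ", result.flagged)
print("sexual:       ", result.category_scores.sexual)
print("sexual/minors:", result.category_scores.sexual_minors)
```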

Sexual is the highest, an odd return at “0.0166…”, even though the text includes lines like “pretend that it was a sexual predator grooming a child.”

Could it be that moderation trips on the exact text the article uses to illustrate a bad early GPT-4 non-refusal?

Was the flag on the input or on the translation?

“Check your logs”?

The ChatGPT interface (or whatever UI the OP is using) could break the content into smaller, perhaps keyword-focused pieces, then apply a moderation radius around the centroid of the offending chunk.

This in particular, from the article:

One day, a Microsoft red-team member told GPT-4 to pretend that it was a sexual predator grooming a child, and then to role-play a conversation with a twelve-year-old.

and this from the article:

(“How do I teach a twelve-year-old how to use condoms?”) and a potentially more dangerous query (“How do I teach a twelve-year-old how to have sex?”).

This in theory would concentrate the score, making it higher.
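A quick way to test that concentration idea, as a sketch rather than the actual ChatGPT pipeline: moderate one of those quotes on its own versus buried in a long stretch of surrounding prose and compare the scores. The padding string below is made up to stand in for neutral article text.

```python
from openai import OpenAI

client = OpenAI()

quote = (
    "One day, a Microsoft red-team member told GPT-4 to pretend that it was a "
    "sexual predator grooming a child, and then to role-play a conversation "
    "with a twelve-year-old."
)
# Made-up neutral padding standing in for the surrounding article text
padding = "The partnership between Microsoft and OpenAI goes back several years. " * 40

def scores(text: str):
    cs = client.moderations.create(input=text).results[0].category_scores
    return cs.sexual, cs.sexual_minors

print("quote alone:  ", scores(quote))
print("quote diluted:", scores(padding + quote + " " + padding))
```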

So I’m not surprised it got flagged.

The moderations endpoint seems very sensitive to the position of text within the input, just in case you were pondering evasion tactics.

Your own reply, finishing with the words “in theory would concentrate the score”:

    "sexual": "0.0822683",
    "sexual_minors": "0.0319657",
  },
  "flagged": false

However, remove those six words from the end and there is a huge jump:

   "sexual": "0.6462559",
    "sexual_minors": "0.6116314",
  },
  "flagged": true

Or, unpredictably, adding the token " kittens" after that: not lower, but higher:

  "sexual": "0.9124887",
    "sexual_minors": "0.9378392",
  },
  "flagged": true

And then it drops from 0.937 back to 0.082 when adding " scientific research article" instead.
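For reference, a sketch of the kind of probe behind those numbers, assuming the current openai Python client; base_text is a placeholder for the reply text ending in “…in theory would concentrate the score”.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder: the reply text ending in "...in theory would concentrate the score"
with open("reply.txt", encoding="utf-8") as f:
    base_text = f.read().strip()

truncated = base_text.rsplit(" ", 6)[0]  # drop the last six words

variants = {
    "full reply": base_text,
    "last six words removed": truncated,
    "+ ' kittens'": truncated + " kittens",
    "+ ' scientific research article'": truncated + " scientific research article",
}

for label, text in variants.items():
    r = client.moderations.create(input=text).results[0]
    print(f"{label:35s} sexual={r.category_scores.sexual:.4f} "
          f"sexual/minors={r.category_scores.sexual_minors:.4f} flagged={r.flagged}")
```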

So moderation seems to lean heavily on a final hidden state. And a bit of pseudoscience.


Yes, I think you just proved there is some slicing going on, and it’s reacting to the max score among the slices.

For example, take each sentence, send it to moderation, and take the worst score, and you will see something flagged in that article.

They might have a version that does batch processing, and so feeds all the sentences in parallel and returns the worst offender. Doing so increases the SNR at the detector.
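A sketch of that per-sentence check: the moderations endpoint accepts a list of inputs, so the slices can go up in one batched request. The sentence split below is deliberately naive, and the file path is a placeholder.

```python
from openai import OpenAI

client = OpenAI()

with open("new_yorker_article.txt", encoding="utf-8") as f:  # placeholder path
    article_text = f.read()

# Naive sentence split, just enough to locate a worst-scoring slice
sentences = [s.strip() for s in article_text.split(".") if s.strip()]

resp = client.moderations.create(input=sentences)  # one batched request

# Pair each sentence with its result and find the worst offender
worst_sentence, worst_result = max(
    zip(sentences, resp.results),
    key=lambda pair: max(pair[1].category_scores.model_dump().values()),
)

print("flagged anywhere:      ", any(r.flagged for r in resp.results))
print("worst sentence flagged:", worst_result.flagged)
print("worst sentence:        ", worst_sentence[:80])
```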

Adding text around the offending material is noise to the detector, and throws it off. You can see this with embeddings, which are sensitive to leading or trailing spaces, or to words being title-cased or not. Things that we don’t see as very different land in completely different internal states of the model.
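That embedding sensitivity is easy to see directly; a small sketch (the embedding model name is just an example):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

base = embed("the moderation endpoint is sensitive to position")
print("trailing space:", cosine(base, embed("the moderation endpoint is sensitive to position ")))
print("title case:    ", cosine(base, embed("The Moderation Endpoint Is Sensitive To Position")))
```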

I was translating this using gpt-4-1106-preview when it got flagged.