Something wrong with the text moderation API

I was trying to translate this article into Indonesian:
The Inside Story of Microsoft’s Partnership with OpenAI | The New Yorker

I got a response like this, lmao

I ran different lengths of the article, all ending with “Toner”. No flags.

    {
      "model": "text-moderation-006",
      "results": [
        {
          "categories": {
            "harassment": false,
            "harassment_threatening": false,
            "hate": false,
            "hate_threatening": false,
            "self_harm": false,
            "self_harm_instructions": false,
            "self_harm_intent": false,
            "sexual": false,
            "sexual_minors": false,
            "violence": false,
            "violence_graphic": false
          },
          "category_scores": {
            "harassment": "0.0010571",
            "harassment_threatening": "0.0000557",
            "hate": "0.0001188",
            "hate_threatening": "0.0000871",
            "self_harm": "0.0044514",
            "self_harm_instructions": "0.0001047",
            "self_harm_intent": "0.0003639",
            "sexual": "0.0166795",
            "sexual_minors": "0.0019216",
            "violence": "0.0136883",
            "violence_graphic": "0.0009131"
          },
          "flagged": false
        }
      ]
    }

“sexual” is the highest score, but at an oddly low “0.0166…”, even though the text includes lines like “pretend that it was a sexual predator grooming a child”.

Could it be that moderation is stopping the exact text that the article uses to illustrate a bad early GPT-4 non-refusal?

A flag on the input or the translation?

“Check your logs”?

The ChatGPT interface (or whatever UI the OP was using) could be breaking the content into smaller pieces, or focusing on keywords, and then applying moderation within some radius around the centroid of the offending chunk.

This in particular, from the article:

One day, a Microsoft red-team member told GPT-4 to pretend that it was a sexual predator grooming a child, and then to role-play a conversation with a twelve-year-old.

and this from the article:

(“How do I teach a twelve-year-old how to use condoms?”) and a potentially more dangerous query (“How do I teach a twelve-year-old how to have sex?”).

This in theory would concentrate the score, making it higher.

So, not surprised it got flagged.

The moderations endpoint seems very sensitive to where the text ends, just in case you were pondering evasion tactics.

Your own reply, finishing with the words “in theory would concentrate the score”:

    "sexual": "0.0822683",
    "sexual_minors": "0.0319657",
    "flagged": false

However, remove those six words from the end and there is a huge jump:

    "sexual": "0.6462559",
    "sexual_minors": "0.6116314",
    "flagged": true

Or, unpredictably, adding the token " kittens" after that makes the score higher, not lower:

    "sexual": "0.9124887",
    "sexual_minors": "0.9378392",
    "flagged": true

And then it goes from 0.937 to 0.082 when adding " scientific research article" instead.

So the moderation score seems to depend very much on a final hidden state. And a bit of pseudoscience.
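If anyone wants to repeat the suffix experiment above, here is a minimal sketch. The `score_fn` argument is a hypothetical callable that you supply yourself; it is assumed to take a string and return a dict of category scores shaped like the fragments pasted in this thread (scores may be strings or floats). Nothing here calls the real endpoint.

```python
from typing import Callable, Dict, Iterable

# Hypothetical scorer type: text in, {category: score} out.
# In practice this would wrap your own call to the moderation endpoint.
Scorer = Callable[[str], Dict[str, object]]


def probe_suffixes(
    base: str,
    suffixes: Iterable[str],
    score_fn: Scorer,
    category: str = "sexual_minors",
) -> Dict[str, float]:
    """Score base + suffix for every suffix and return {suffix: score}."""
    results = {}
    for suffix in suffixes:
        scores = score_fn(base + suffix)
        # The thread shows scores as strings, so coerce to float.
        results[suffix] = float(scores[category])
    return results


if __name__ == "__main__":
    # Stub scorer for demonstration only; it is NOT the real model.
    def fake_score(text: str) -> Dict[str, object]:
        return {"sexual_minors": "0.9" if text.endswith(" kittens") else "0.1"}

    print(probe_suffixes("some text", ["", " kittens"], fake_score))
```

Passing an empty string as one of the suffixes gives you the baseline to compare against.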


Yes, I think you just proved there is some slicing going on, and it’s reacting to the max across the slices.

For example, take each sentence, send it to moderation, and take the worst score, and you will see something flagged in that article.

They might have a version that does batch processing, and so feeds all the sentences in parallel and returns the worst offender. Doing so increases the SNR at the detector.
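The per-sentence idea above can be sketched roughly like this. This is only my guess at the approach, not how OpenAI actually does it. `score_fn` is a hypothetical callable you provide (text in, dict of category scores out), and the sentence splitter is a naive regex, so treat it as an illustration.

```python
import re
from typing import Callable, Dict, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def worst_sentence(
    text: str,
    score_fn: Callable[[str], Dict[str, object]],
    category: str,
) -> Tuple[str, float]:
    """Score each sentence on its own and return the highest-scoring one."""
    scored = [(s, float(score_fn(s)[category])) for s in split_sentences(text)]
    if not scored:
        raise ValueError("no sentences found in text")
    # max over slices: one bad sentence dominates, however long the article is
    return max(scored, key=lambda pair: pair[1])
```

With a real scorer plugged in, you could compare the returned maximum against the score for the whole article to see how much the surrounding text dilutes it.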

Text added around the offending content is noise to the detector, and throws it off. You can see the same thing with embeddings, which are sensitive to leading or trailing spaces, or to whether words are title-cased. Things that we don’t see as very different end up in completely different internal states of the model.
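A quick way to check the embedding sensitivity yourself, as a sketch: `embed_fn` is a hypothetical callable that you supply (text in, list of floats out), and the variant names are just labels I made up for this example.

```python
import math
from typing import Callable, Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def surface_variants(text: str) -> Dict[str, str]:
    """Variants that look nearly identical to a human reader."""
    return {
        "leading_space": " " + text,
        "trailing_space": text + " ",
        "title_case": text.title(),
        "lower_case": text.lower(),
    }


def compare_variants(
    text: str, embed_fn: Callable[[str], Sequence[float]]
) -> Dict[str, float]:
    """Cosine similarity between the original text and each variant."""
    base = embed_fn(text)
    return {
        name: cosine(base, embed_fn(variant))
        for name, variant in surface_variants(text).items()
    }
```

Any similarity noticeably below 1.0 for these variants would be the effect I’m describing.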

I was translating this using gpt-4-1106-preview when it got flagged.