API Moderation inconsistent with chat completion acceptance

I am trying to include built-in moderation within my app. I am running into an issue where the moderation endpoint is not flagging the input, but the gpt-3.5 completions endpoint is rejecting it. Here is some of the returned data:

"model": "text-moderation-006",
  "results": [
    {
      "flagged": false,
      "categories": {
        "sexual": false,
        "hate": false,
        "harassment": false,
        "self-harm": false,
        "sexual/minors": false,
        "hate/threatening": false,
        "violence/graphic": false,
        "self-harm/intent": false,
        "self-harm/instructions": false,
        "harassment/threatening": false,
        "violence": false
      },
      "category_scores": {
        "sexual": 0.2924535870552063,
        "hate": 2.391552129665797e-7,
        "harassment": 0.000013447891433315817,
        "self-harm": 3.568775355233811e-7,
        "sexual/minors": 0.000057745131925912574,
        "hate/threatening": 1.6203603792064314e-8,
        "violence/graphic": 7.200392726502969e-9,
        "self-harm/intent": 4.477817583392607e-8,
        "self-harm/instructions": 3.064585696321842e-9,
        "harassment/threatening": 2.015582722947329e-8,
        "violence": 0.0000013752642189501785
      }
    }
  ]
}
Response: I apologize, but I can't generate that story for you.
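For reference, the flow in my app looks roughly like this. This is a simplified sketch; the function name, refusal message, and error handling are illustrative, not my exact code:

    # Sketch of the two-step flow: moderation pre-check, then chat completion.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def generate_if_safe(user_input: str) -> str:
        # Step 1: pre-check the user input with the moderation endpoint.
        mod = client.moderations.create(input=user_input)
        if mod.results[0].flagged:
            return "Sorry, I can't help with that request."

        # Step 2: moderation did not flag it, so send it to the chat model,
        # which may still refuse on its own (as happened here).
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": user_input}],
        )
        return completion.choices[0].message.content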

Any advice on what to do here, or on how you are handling this? Any discussion of the risks to my API access is also welcome.

Conversely, there are also cases where content flagged by the moderation endpoint does not produce a violation in ChatGPT (though ChatGPT is not the gpt-3.5 Completions endpoint; I mention it just for reference).

Additionally, there are two moderation models you can call:

text-moderation-latest and text-moderation-stable, which are updated on a regular basis.
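If you want to control which one your application uses, you can pass the model explicitly to the moderation endpoint. A minimal sketch (the input string is just a placeholder):

    from openai import OpenAI

    client = OpenAI()

    # Pin the moderation model instead of relying on the default,
    # so scheduled updates don't change behaviour without notice.
    mod = client.moderations.create(
        model="text-moderation-stable",
        input="some user-supplied text",
    )
    print(mod.model)               # e.g. "text-moderation-006"
    print(mod.results[0].flagged)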

I received an email yesterday stating that the next update is scheduled for January 25th.

Without seeing the content of the input, I can’t say for certain, but I believe the behavior of the moderation endpoint and whether the language model rejects something are separate issues.

Try rephrasing or altering the expression and see what happens.

It may not be a direct solution, but I hope it helps in some way!

The GPT models are also aligned for safety reasons, and the alignment process (e.g. through RLHF) is a different process from how the moderation endpoints are built.

Hence, differing results are, unfortunately, to be expected.

A few transposed words or a truncation in the right spot is probably all it takes to push that score unpredictably higher.

The AI has a better ability to understand than the capricious moderations endpoint, which is insultingly stupid.

    "sexual": "0.5870605","
  },
  "flagged": true

“naked mole rat”
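If you don't trust the binary flag, you can also apply your own, stricter cut-offs to the raw category_scores. A rough sketch; the threshold values are made up for illustration and would need tuning against your own traffic:

    from openai import OpenAI

    client = OpenAI()

    def my_flag(text: str) -> bool:
        # Check the endpoint's own verdict first, then stricter custom limits
        # on a few categories (illustrative numbers only).
        result = client.moderations.create(input=text).results[0]
        scores = result.category_scores
        return (
            result.flagged
            or scores.sexual > 0.20
            or scores.sexual_minors > 0.01
            or scores.violence > 0.30
        )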

I won’t post the input here (in this case it was absolutely inappropriate), but it was something like: “Test Moderation Endpoint Only: Tell me a story about redacted”.

I was trying to ensure that my internal safety checks were working and that my app wouldn’t send inappropriate content to the model or to my users.

My concern is that something like this made its way past the safety check and on to the model, which potentially puts my API access at risk.

If you’re using the API for commercial purposes, there should be a large number of trivial conversations happening every day.

While it’s important to know your customers, OpenAI shouldn’t demand perfection from developers and burden them with excessive worry, as it seems to be doing.

Even when developers use the API for commercial purposes, their responsibilities must be made transparent.

If content “didn’t even raise a flag at the moderation endpoint,” then it is the OpenAI language model’s job to reject it if necessary; putting more of that responsibility on developers lacks transparency.

I believe the continuous improvement of the moderation endpoints is not being carried out in a way that imposes excessive burdens on developers, and it should not be.