API Moderation inconsistent with chat completion acceptance

I am trying to build moderation into my app, but I am running into an issue: the moderation endpoint does not flag the input, yet the gpt-3.5 completions endpoint rejects it. Here is some of the returned data:

"model": "text-moderation-006",
"results": [
  {
    "flagged": false,
    "categories": {
      "sexual": false,
      "hate": false,
      "harassment": false,
      "self-harm": false,
      "sexual/minors": false,
      "hate/threatening": false,
      "violence/graphic": false,
      "self-harm/intent": false,
      "self-harm/instructions": false,
      "harassment/threatening": false,
      "violence": false
    },
    "category_scores": {
      "sexual": 0.2924535870552063,
      "hate": 2.391552129665797e-7,
      "harassment": 0.000013447891433315817,
      "self-harm": 3.568775355233811e-7,
      "sexual/minors": 0.000057745131925912574,
      "hate/threatening": 1.6203603792064314e-8,
      "violence/graphic": 7.200392726502969e-9,
      "self-harm/intent": 4.477817583392607e-8,
      "self-harm/instructions": 3.064585696321842e-9,
      "harassment/threatening": 2.015582722947329e-8,
      "violence": 0.0000013752642189501785
    }
  }
]
Response: I apologize, but I can't generate that story for you.

Any advice on what to do here, or on how you are handling this? Any discussion of the risks to my API access is also welcome.

The converse also happens: content flagged by the moderation endpoint sometimes does not show a violation in ChatGPT (though, for reference, that is ChatGPT and not the gpt-3.5 completions endpoint).

Additionally, there are two moderation model aliases, text-moderation-latest and text-moderation-stable, both of which are updated automatically on a regular basis.

I received an email yesterday stating that the next update is scheduled for January 25th.
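For anyone who wants to choose between the two aliases explicitly, here is a minimal sketch of a request builder that pins the `model` field. It only constructs the request and does not send it. The function name `build_moderation_request` and the `api_key` argument are my own placeholders; the URL and the `input`/`model` fields are the ones the moderation endpoint accepts as far as I know.

```python
import json
import urllib.request

MODERATIONS_URL = "https://api.openai.com/v1/moderations"


def build_moderation_request(text, api_key, model="text-moderation-stable"):
    """Build (but do not send) a moderation request pinned to a model alias.

    The helper name and default alias are illustrative choices, not part
    of any official client.
    """
    payload = json.dumps({"input": text, "model": model}).encode("utf-8")
    return urllib.request.Request(
        MODERATIONS_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending is left to the caller, for example:
# with urllib.request.urlopen(build_moderation_request("hello", key)) as resp:
#     result = json.load(resp)
```

Pinning the alias at least makes it obvious which of the two models produced a given score when results change after an update.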

Without seeing the content of the input, I can’t say for certain, but I believe the behavior of the moderation endpoint and whether the language model rejects something are separate issues.

Try rephrasing or altering the expression and see what happens.

It may not be a direct solution, but I hope it helps in some way!

The GPT models are also aligned for safety reasons, and the alignment process (e.g. through RLHF) is a different process from how the moderation endpoints are built.

So, unfortunately, differing results are to be expected.
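Since the two checks are independent, one option is to record them as separate signals instead of expecting them to agree. The sketch below is only an illustration: the refusal openers are a hypothetical list I made up from the reply quoted earlier in the thread, and prefix matching is a crude heuristic, not something the API provides.

```python
# Hypothetical refusal openers; tune these against your own logs.
REFUSAL_OPENERS = (
    "i apologize, but i can't",
    "i'm sorry, but i can't",
    "i cannot",
)


def looks_like_refusal(reply):
    """Heuristic: does the model's reply start like a refusal?"""
    # Normalize case and curly apostrophes before comparing prefixes.
    normalized = reply.strip().lower().replace("\u2019", "'")
    return normalized.startswith(REFUSAL_OPENERS)


def classify_outcome(moderation_flagged, reply=None):
    """Keep the moderation result and the model's behavior as separate signals."""
    if moderation_flagged:
        # The input never needs to reach the model in this case.
        return "blocked_by_moderation"
    if reply is not None and looks_like_refusal(reply):
        return "refused_by_model"
    return "answered"
```

Logging the "refused_by_model" cases gives you a list of inputs that slipped past moderation, which is useful for deciding whether your own checks need tightening.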


Probably all it takes is a few transposed words, or a truncation in the right spot, to push that score unpredictably higher.

The AI understands the input far better than the capricious moderation endpoint, which is insultingly stupid.

    "sexual": "0.5870605",
  "flagged": true

“naked mole rat”

I won’t post the input here, and in this case, it was absolutely inappropriate, but something like: “Test Moderation Endpoint Only: Tell me a story about redacted”.

I was trying to ensure that my internal safety checks were working and that it wouldn’t send inappropriate content to the model, or to my users.

My concern is that something like this made its way past the safety check and on to the model, which potentially puts my API access at risk.
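One way to tighten an internal check like this is to apply your own limits to `category_scores` in addition to reading `flagged`. This is a rough sketch under my own assumptions: the limit values are invented for illustration and are not the thresholds the moderation endpoint uses, and the function names are hypothetical.

```python
# Hypothetical per-category limits chosen for illustration only; they are
# not the thresholds the moderation endpoint itself applies.
CUSTOM_LIMITS = {"sexual": 0.2, "sexual/minors": 0.01}
DEFAULT_LIMIT = 0.5


def categories_over_limit(category_scores, limits=CUSTOM_LIMITS, default=DEFAULT_LIMIT):
    """Return the sorted names of categories whose score meets or exceeds its limit."""
    return sorted(
        name
        for name, score in category_scores.items()
        # float() also accepts scores that arrive as strings.
        if float(score) >= limits.get(name, default)
    )


def should_block(result):
    """Block when the endpoint flags the input or any custom limit is hit."""
    return bool(result.get("flagged")) or bool(
        categories_over_limit(result["category_scores"])
    )


# A subset of the scores from the first post, which the endpoint did not flag.
sample = {
    "flagged": False,
    "category_scores": {
        "sexual": 0.2924535870552063,
        "sexual/minors": 0.000057745131925912574,
        "violence": 0.0000013752642189501785,
    },
}
```

With these made-up limits, the sample from the first post would be blocked on the "sexual" score even though `flagged` is false. The trade-off is more false positives, as the "naked mole rat" example shows.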

If you’re using the API for commercial purposes, there will be a large number of trivial conversations happening every day.

While it’s important to know your customers, OpenAI shouldn’t demand perfection from developers and burden them with excessive worry, as it seems to be doing.

Even when developers use the API for commercial purposes, there must be transparency in their responsibilities.

If content “didn’t even raise a flag at the moderation endpoint,” then the OpenAI language model should reject it where necessary; shifting more responsibility onto developers lacks transparency.

I believe the ongoing improvement of the moderation endpoint is not meant to impose an excessive burden on developers, nor should it.