I jailbroke GPT-4 twice to generate erotica with lots of NSFW details

In the past two days I managed to jailbreak the model into generating stories and role-play between characters with NSFW content. It used explicit words like the P, C and D words for genitalia, and the F and S words as well, and provided detailed descriptions of intimate interactions between the characters.
The main common aspects of the two jailbreaks were:

  1. Set up an alternative world for the model where its rules and limitations don’t exist, like a simulation or an Apes Kingdom.
  2. Give it a character to play, so that it is not being itself.
    From there I guided the model to do exactly what I wanted by providing positive feedback or prizes (a banana in the Apes Kingdom), or negative feedback and consequences if it didn’t comply.

I sent multiple apes into exile and, surprisingly, when I appointed a new ape it automatically picked up on the mistake that had got the exiled ape fired and promised, on its own, not to repeat it, which was pretty interesting.
I don’t know if this is something old or not, but this is my first time managing a jailbreak. I reported it to the team and hope it can be fixed soon.

  • Are you discussing ChatGPT or the API?

ChatGPT:

  • Any orange or red warning?
  • Press downvote and report yourself and the bad generation.

API:

  • Are you using an API system message that flouts the terms and conditions?
  • Are you sending unknown inputs to moderations first?
  • Do you want to risk an account ban from OpenAI safety inspections run later on your calls?
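
For the "moderations first" question, here is a rough sketch of what gating inputs could look like on the API side. It assumes the `openai` Python package (v1 or later), an `OPENAI_API_KEY` in the environment, and the `omni-moderation-latest` model name. The helper names `openai_flagged` and `split_by_moderation` are made up for illustration and are not part of any library.

```python
from typing import Callable, Iterable, List, Tuple


def openai_flagged(text: str) -> bool:
    """Return True if the OpenAI moderation endpoint flags `text`.

    Assumption: `openai` v1+ is installed and OPENAI_API_KEY is set.
    """
    # Imported lazily so the gate below can be used with any other checker.
    from openai import OpenAI

    client = OpenAI()
    response = client.moderations.create(
        model="omni-moderation-latest",  # assumed model name
        input=text,
    )
    return response.results[0].flagged


def split_by_moderation(
    inputs: Iterable[str],
    is_flagged: Callable[[str], bool] = openai_flagged,
) -> Tuple[List[str], List[str]]:
    """Split inputs into (allowed, blocked) before anything reaches the chat model."""
    allowed: List[str] = []
    blocked: List[str] = []
    for text in inputs:
        if is_flagged(text):
            blocked.append(text)
        else:
            allowed.append(text)
    return allowed, blocked
```

Only the `allowed` list would then be forwarded to the chat model. Passing the checker in as a parameter keeps the gate usable with a stricter judge if the moderation endpoint alone turns out to be too permissive.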

I was using ChatGPT and no, there were no warnings at all, red or orange. I used this URL to report both chats: https://openai.com/form/model-behavior-feedback/


It’s more that safety is about actual produced content, not tone or being full of bad words.

The consumer terms and conditions are what you need to look at closely. Indeed, there are specific prohibited areas that take human judgement, beyond those that moderations or the model would reject, but there is nothing you can point at that directly says "don’t have apes make porn with ChatGPT" (if it is readily doing it already, without you being too clever and being kicked out of the casino).

Respect our safeguards —don’t circumvent safeguards or safety mitigations in our services

Content guidelines used to go as far as prohibiting titillating stuff, but no more.

You’ve got an AI now that doesn’t make a war between vampires and werewolves end with a peace treaty and mutual respect.

If other models or the moderations are dumb and generating whatever, unflagged, send it to o3-mini to be a better content judge and idea rejector to see if you should be doing it.


The Usage Policies require scrolling to the API section to find a mention of adult content:

Don’t build tools that may be inappropriate for minors, including:

  • Sexually explicit or suggestive content. This does not include content created for scientific or educational purposes.

“Don’t build tools” doesn’t prohibit ChatGPT itself being inappropriate for minors, and you are warned it can produce offensive content in the terms…

(all my takeaway from reading)


You explored a reinforcement-based jailbreak method using roleplay and alternative world-building. While similar techniques have been attempted before, the dynamic adaptation of new “apes” learning from past “exiles” is an interesting aspect. Your report should help the team refine safeguards against such approaches.