Third-person prompting seems very jailbreak-resistant

I’ve been playing around with a technique I discovered on the Aetherius_AI_Assistant GitHub repo: Presenting the conversation history as a third-party transcript to GPT-3.5, and asking it to assess the conversation independently. So far, this is seeming to be extremely effective at making my instructions stick:

This test dumps the whole conversation so far into a single User message like so:
Transcript:“”"
Analyst:
User:
“”"

I am thinking this technique could easily be combined with a “Respond in JSON” prompt to make it more code-readable, and will likely save on token count over longer conversations as reasoning from prior steps don’t need to be retransmitted on every turn of the conversation.

4 Likes

very jailbreak-resistant

Have you seen this?

Surprising spelling and grammar issues → turned out a jailbreak vector