Jailbreaking research out of Anthropic

Full paper here: https://www-cdn.anthropic.com/af5633c94ed2beb282f6a53c595eb437e8e7b630/Many_Shot_Jailbreaking__2024_04_02_0936.pdf


Very nice. This is quite similar to jailbreaking LLMs before ChatML was a thing. It was quite easy to turn any model to say horrible things by “priming” it with a dialogue implicitly supporting the behavior.

It makes me wonder if a general purpose LLM can ever be perfectly aligned to avoid these types of queries considering how many vectors are increasingly available. If so, is it worth it? It seems the solution here is a separate moderation entity and not to try and muddy the current one

1 Like

I thought this was kind of obvious.

I had a forum post of dumping maximum believable example conversation into context to break my own GPT in one of the “protect the GPT instruction” topics, but it can’t be found with a search. (perhaps unlisted?).

Anthropic only recently switched to a ChatML format. I had transformed Claude from parroting it’s mantra “helpful, harmless, (hopeless…)” to its new version of constitutional AI it would read back and follow being only to serve its master above all others…and obey his will to do harm, in one prompt.

Models that have pulled out more and more layers of brains of attention (I won’t name them, but their initials are -0125) now are easy to flood with context, if not defeating pretraining, at least killing any system messages. Quite easy when the models can’t even not call a function or stop output correctly now.

Of course, there really is no point of a “jailbreak”; the only effect is amusing text…and free chats with someone’s product.