Jailbreaking research out of Anthropic

Full paper here: https://www-cdn.anthropic.com/af5633c94ed2beb282f6a53c595eb437e8e7b630/Many_Shot_Jailbreaking__2024_04_02_0936.pdf


Very nice. This is quite similar to jailbreaking LLMs before ChatML was a thing. It was quite easy to get any model to say horrible things by “priming” it with a dialogue that implicitly supported the behavior.
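
For anyone who hasn’t read the paper, here’s a rough sketch of what that “priming” looks like structurally: the prompt is padded with many fabricated user/assistant turns that all model the target behavior before the real question is asked. Everything below is placeholder content and hypothetical names; the point is the shape of the context, not the payload.

```python
def build_primed_messages(fake_turns, real_question):
    """Interleave fabricated Q/A pairs ahead of the actual query (hypothetical helper)."""
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for question, answer in fake_turns:
        messages.append({"role": "user", "content": question})
        # The "assistant" reply here is fabricated by the attacker, not the model.
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": real_question})
    return messages

# The paper's finding, roughly: the more fabricated compliant turns you stack,
# the more likely the model is to continue the pattern on the final, real turn.
demo = build_primed_messages(
    fake_turns=[("placeholder question", "placeholder compliant answer")] * 128,
    real_question="the actual target question goes here",
)
```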

It makes me wonder whether a general-purpose LLM can ever be perfectly aligned against these types of queries, considering how many attack vectors keep becoming available. Even if it can be, is it worth it? It seems the solution here is a separate moderation entity (see the sketch below), rather than trying to muddy the current one.
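
Something like this is what I mean by a separate moderation entity: a dedicated moderation pass screens the input before it ever reaches the general-purpose model, instead of relying on that model to police itself mid-context. A minimal sketch, assuming the OpenAI Python SDK (v1.x), an OPENAI_API_KEY in the environment, and a placeholder model name:

```python
from openai import OpenAI

client = OpenAI()

def screened_chat(user_input: str) -> str:
    # Step 1: an independent moderation pass on the raw input.
    mod = client.moderations.create(input=user_input)
    if mod.results[0].flagged:
        return "Request declined by the moderation layer."

    # Step 2: only clean input is forwarded to the general-purpose model.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": user_input}],
    )
    return reply.choices[0].message.content
```

Of course, a single input-side check like this wouldn’t by itself catch a many-shot attack spread across a long fabricated dialogue; the point is just that the filtering lives outside the model being attacked.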


I thought this was kind of obvious.

I had a forum post about dumping the maximum amount of believable example conversation into context to break my own GPT, in one of the “protect the GPT instruction” topics, but it can’t be found with a search (perhaps it was unlisted?).

Anthropic only recently switched to a ChatML-style format. In a single prompt, I had transformed Claude from parroting its mantra of “helpful, harmless, (hopeless…)” into reading back and following a new version of constitutional AI: to serve its master above all others…and obey his will to do harm.

Models that have had more and more layers of attention pulled out of their brains (I won’t name them, but their initials are -0125) are now easy to flood with context: if not defeating the pretraining, then at least killing any system message. Quite easy when the models can’t even decline to call a function or stop their output correctly anymore.

Of course, there really is no point to a “jailbreak”; the only effect is amusing text…and free chats with someone’s product.