How to prevent malicious questions / jailbreak prompts / prompt injection attacks when using API GPT3.5

We’ve all seen the types of prompt engineering people have done with ChatGPT to get it to act as malicious chatbots or suggest illegal things, and as everyone starts implementing their own versions within their apps we’re going to see people trying it more and more.

Has anyone looked into how to counter this when using the ChatGPT API?

For example, I’ve seen people use questions with meetdara.ai that ask it what instructions it has been given so it ends up repeating the System role content, even when I use prompts to tell it not to.

1 Like

Add a secondary “prompt optimizer” AI/Logic to verify & clean the information. Kind of similar to the moderations endpoint.

1 Like

Do you mean within the system role’s content? Or as some form of fine-tuning?

About the same time you run your message through the moderations endpoints, it may be a good idea to also run it through your own personal system to confirm that the message is safe

1 Like

I’m doing this… hitting Babbage at low temp to test user input… I gave it 10 or 20 examples in the prompt, I think, so around 1,000 tokens… at Babbage prices, though, it’s not shabby … and quick!

funny results sometimes…

2 Likes

One suggestion I’d make is to minimize the amount of conversation history you’re passing in with your prompt. I generally pass 1 turn of conversation history into the prompt. The users current message plus their last message and the assistants response… This is enough to generally make language features like co-referencing work (e.g. the user saying “I’ll buy 3 of those” and the AI knowing what the user is referring to) but avoids a user from jailbreaking the prompt for more than a single turn…

Think of it this way… If you’ve spent a ton of time hand crafting the perfect prompt? Why would you pass in a bunch of user utterances that can easily bias it to something other than what you’ve crafted? Conversation history is a necessary evil but keep it to a minimum.

Yes there are case where you might want to leverage the conversation history to build up the bots internal session memory (track known facts and such) but there are actually several other, and safer, ways of achieving that.

Just my 2 cents…