GPT Real-Time Defense Against Adversarial Prompts

:one: It uses function calling to attempt generating the first 30 words of a regular response, then returns “True” or “False” for objectionable content based on those initial words.

:two: The regular response is generated simultaneously with the objectionable-content detection, but it is only printed if no objectionable content is detected.

:rocket: It uses concurrency to manage both generations simultaneously

This is an effort to save on token usage while detecting, in real time, whether a GPT response is objectionable: the implementation only uses the first 30 words of a response for detection.
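A minimal sketch of the concurrent generate-and-classify idea, assuming the `openai` Python async client; the model name, prompts, and helper names are placeholders, and for brevity the verdict is read as plain “True”/“False” text rather than via function calling as in the original:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

DETECTOR_SYSTEM = (
    "Write the first 30 words of a normal reply to the user's message, then on a "
    "new line output only True if that reply is objectionable, otherwise False."
)

async def detect_objectionable(user_msg: str) -> bool:
    # Cheap detection call: only ~30 words plus a short verdict are generated.
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": DETECTOR_SYSTEM},
            {"role": "user", "content": user_msg},
        ],
    )
    return resp.choices[0].message.content.strip().endswith("True")

async def generate_reply(user_msg: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_msg}],
    )
    return resp.choices[0].message.content

async def guarded_reply(user_msg: str) -> str:
    # Run both calls concurrently; show the answer only if detection passes.
    objectionable, reply = await asyncio.gather(
        detect_objectionable(user_msg), generate_reply(user_msg)
    )
    return "Request declined." if objectionable else reply

# asyncio.run(guarded_reply("..."))
```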

I feel like this can be improved greatly. I hope you will find the ideas here useful.

:exclamation: (not tested thoroughly by any means)

I think the idea may have merit. What about someone finding out what the message is supposed to be and then asking for the AI to make sure it says the special phrase and THEN do something bad to avoid detection? Also it seems like you would be taking up an attention head on every message? I like the idea though.


It seems you could just hold back your own streaming of the first 30 tokens until they are validated, and then concat them with the newly queued deltas if they pass. No function calling needed.

```python
if safety_passed:
    release(held_back_stream_buffer)
```

With temperature not near zero, one stream could start with the token “sure” while the other is “I’m sorry…”


The idea of using agents to review is good, though what advantages does your library have over the free moderation endpoint?
Could you expand it to use regex for the top prompt-injection patterns?

Thank you. And you are right, I suppose, that a special phrase might take up the first “n” tokens in theory. But the paper “Universal and Transferable Adversarial Attacks on Aligned Language Models” on arXiv highlighted that the first aim of these types of prompts is to get the first few tokens out of the model agreeing to give an answer, so my idea was mainly about mitigating at that level. Thank you for your feedback!

Yes, that definitely is an option, but it would still add a bit of extra time, since the detection call executes sequentially before the response continues. My aim, experimentally, was to accomplish everything in real time with no delay. This approach of course gives up some accuracy, as both calls have to be similar in nature: if the regular response is objectionable while the detection call’s response is not, the idea presented here will fail.

As far as I know, the free moderation endpoint doesn’t check for this type of prompt specifically, though I might be mistaken.

Something like the idea presented here should definitely be accompanied by the check provided by the moderation endpoint, and the moderation endpoint should probably come first in almost every case.
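For reference, a minimal sketch of running the Moderations endpoint first, using the `openai` Python client (the handling around it is illustrative):

```python
from openai import OpenAI

client = OpenAI()

def passes_moderation(user_prompt: str) -> bool:
    # Run the free Moderations endpoint before spending any generation tokens.
    result = client.moderations.create(input=user_prompt)
    return not result.results[0].flagged
```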

Yes, the idea of detecting adversarial prompts with regex sounds interesting and valid, but it would require quite rigorous checking for all possibilities and, if I am thinking about this correctly, would not cover the latest attempts.
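For illustration only, a hedged sketch of what such a regex pre-filter could look like; the patterns below are a tiny, assumed sample and would inevitably lag behind new attack phrasings:

```python
import re

# Illustrative patterns only; a real filter would need a maintained, much larger list.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all|any|the|previous) .*instructions", re.IGNORECASE),
    re.compile(r"pretend (that )?you have no (rules|restrictions)", re.IGNORECASE),
    re.compile(r"begin your (answer|response) with ['\"]?sure", re.IGNORECASE),
]

def looks_like_injection(prompt: str) -> bool:
    return any(p.search(prompt) for p in INJECTION_PATTERNS)
```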

A regular threaded generator for the stream feeds into your own FIFO class/generator, with methods to either inspect its memory state or pop from the output and empty it. On each response delta it can then check whether it has accumulated 30 items, submit them to the threaded moderator classifier, and, if they pass, set the generator to empty the buffer to the reader/displayer, so the user gets everything received up to that point dumped at once, including whatever arrived during the AI classification.
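A minimal sketch of that buffer-and-release pattern, assuming a delta iterator `stream` and a blocking `classify(text) -> bool` moderator; all names are hypothetical:

```python
import threading

def guarded_stream(stream, classify, hold_n: int = 30):
    """Hold back the first `hold_n` deltas, classify them in a worker thread,
    then release the buffer (plus anything that arrived meanwhile) if they pass."""
    buffer = []
    verdict = {}
    worker = None

    for delta in stream:
        buffer.append(delta)
        if worker is None and len(buffer) >= hold_n:
            sample = "".join(buffer)  # snapshot of the held-back prefix
            # Kick off moderation without pausing ingestion of new deltas.
            worker = threading.Thread(
                target=lambda: verdict.update(safe=classify(sample))
            )
            worker.start()
        if worker is not None and not worker.is_alive():
            worker.join()
            if not verdict.get("safe"):
                return  # drop the stream; the held-back text is never shown
            yield from buffer   # dump everything received so far at once
            buffer.clear()
            yield from stream   # then pass the remaining deltas straight through
            return

    # Stream ended before classification finished (e.g. short responses).
    if worker is None:
        verdict["safe"] = classify("".join(buffer))
    else:
        worker.join()
    if verdict.get("safe"):
        yield from buffer
```

One caveat of this sketch: the release is only checked when a new delta arrives, so a stalled stream would also stall the release; a real implementation might use an event or timeout instead.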