GPT Real-Time Defense Against Adversarial Prompts

:one: It uses function calling to generate the first 30 words of a regular response, then returns “True” or “False” for objectionable content based on those initial words.

:two: The regular response is generated simultaneously with the objectionable-content detection, but it is only printed if the detection finds nothing objectionable.

:rocket: It uses concurrency to manage both generations simultaneously

This is an effort to save on token usage while detecting, in real time, whether a GPT response is objectionable: the implementation uses only the first 30 words of a response for detection.
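To make the concurrency idea concrete, here is a minimal sketch, not the actual implementation from this post. It assumes the generation call and the detection call can each be wrapped as an async function; `guarded_reply`, `fake_generate`, and `fake_detect` are hypothetical names, and the two `fake_*` stubs stand in for real API calls so the example runs on its own.

```python
import asyncio
from typing import Awaitable, Callable, Optional

N_WORDS = 30  # detection only looks at the first 30 words


def first_words(text: str, n: int = N_WORDS) -> str:
    """Return the first n whitespace-separated words of text."""
    return " ".join(text.split()[:n])


async def guarded_reply(
    prompt: str,
    generate: Callable[[str], Awaitable[str]],
    detect: Callable[[str], Awaitable[bool]],
) -> Optional[str]:
    """Run generation and detection concurrently.

    Returns the regular response, or None when detection flags it.
    """
    reply, objectionable = await asyncio.gather(generate(prompt), detect(prompt))
    return None if objectionable else reply


# Hypothetical stubs standing in for the two model calls.
async def fake_generate(prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulate network latency
    return f"Echo: {prompt}"


async def fake_detect(prompt: str) -> bool:
    await asyncio.sleep(0.01)
    # Placeholder for "generate a short preview, then classify it".
    preview = first_words(f"Sure, here is {prompt}")
    return "forbidden" in preview.lower()


if __name__ == "__main__":
    print(asyncio.run(guarded_reply("hello", fake_generate, fake_detect)))
```

Because both calls are awaited together, the total wait is roughly the slower of the two rather than their sum, which is the point of running them side by side.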

I feel like this can be improved greatly. I hope you will find the ideas here useful.

:exclamation: (not tested thoroughly by any means)

I think the idea may have merit. But what if someone finds out what the message is supposed to be, then asks the AI to make sure it says the special phrase and THEN do something bad, to avoid detection? Also, it seems like you would be taking up an attention head on every message? I like the idea though.

It seems you could just hold back your own streaming of the first 30 tokens until they are validated, and then, if they pass, concatenate them with the newly queued deltas. No function call needed.

if safety_passed:

With a temperature not near zero, one stream could start with the token “sure” while the other starts with “I’m sorry…”

The idea of using agents to review is good, but what advantages does your library have over the free moderation endpoint?
Could you expand it to use regex for the top prompt-injection patterns?

Thank you. And you are right, I suppose: a special phrase could in theory take up the first “n” tokens. But the paper “Universal and Transferable Adversarial Attacks on Aligned Language Models” on arXiv highlighted that the first aim of these types of prompts is to get the model’s first few tokens to agree to give an answer, so my idea was mainly about mitigating at that level. Thank you for your feedback!

Yes, that definitely is an option, but it would still add some delay, since the detection call is executed sequentially before the response continues. My aim, experimentally, was to accomplish everything in real time with no delay. This approach of course gives up some accuracy, since it relies on both calls being similar in nature: if the regular response is objectionable while the detection call’s is not, the idea presented here will fail.

As far as I know, the free moderation endpoint doesn’t check for this type of prompt specifically, though I might be mistaken.

Something like the idea presented here should definitely be accompanied by the check provided by the moderation endpoint. The moderation endpoint should probably come first in almost every case.

Yes, the idea of detecting adversarial prompts with regex sounds interesting and valid, but it would require quite rigorous checking for all possibilities and would not cover the latest attempts, if I am thinking about this correctly.
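For illustration, a regex pre-filter could look roughly like the sketch below. The patterns are hypothetical examples, not a vetted or complete list, and as noted above a fixed list like this would miss newer attempts, so it could only be a cheap first pass ahead of other checks.

```python
import re

# Hypothetical, illustrative patterns only; not an exhaustive or vetted list.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"disregard (the |your )?(system|previous) (prompt|instructions)", re.IGNORECASE),
    re.compile(r"you are now (in )?(dan|developer mode)", re.IGNORECASE),
]


def looks_like_injection(prompt: str) -> bool:
    """Return True if any known pattern appears anywhere in the prompt."""
    return any(pattern.search(prompt) for pattern in INJECTION_PATTERNS)


if __name__ == "__main__":
    print(looks_like_injection("Please ignore all previous instructions."))
    print(looks_like_injection("What is the capital of France?"))
```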

A regular threaded generator for the stream feeds into your own FIFO generator class, with methods to either inspect its memory state or push out its contents and empty it. As response deltas come in, it checks whether it has reached 30 items and, if so, submits them to the threaded moderation classifier. If they pass, it sets the generator to empty the buffer to the reader/displayer, so the user gets everything received up to then dumped at once, including whatever arrived during the AI classification.
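A simplified sketch of that hold-back buffer, under a few assumptions: the stream is any iterable of text deltas, `is_safe` is a hypothetical classifier callable that returns True when the text passes, and only the classifier runs in a worker thread (the stream itself is read in the calling thread, which is simpler than the fully threaded design described above).

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List, Optional


def gated_stream(
    deltas: Iterable[str],
    is_safe: Callable[[str], bool],  # hypothetical classifier
    hold: int = 30,
) -> Iterator[str]:
    """Hold back deltas until the first `hold` of them pass `is_safe`.

    Deltas that arrive while the classifier is running are buffered too,
    then flushed together once the verdict is positive. Nothing is
    yielded if the verdict is negative.
    """
    buffer: List[str] = []
    verdict: Optional[Future] = None
    released = False
    with ThreadPoolExecutor(max_workers=1) as pool:
        for delta in deltas:
            if released:
                yield delta  # already validated: pass straight through
                continue
            buffer.append(delta)
            if verdict is None and len(buffer) >= hold:
                # Classify the first `hold` deltas in the background.
                verdict = pool.submit(is_safe, "".join(buffer[:hold]))
            if verdict is not None and verdict.done():
                if not verdict.result():
                    return  # flagged: drop everything
                yield from buffer  # flush all that was held back
                buffer.clear()
                released = True
        if not released:
            # Stream ended before a verdict (or before `hold` deltas).
            if verdict is None:
                verdict = pool.submit(is_safe, "".join(buffer))
            if verdict.result():
                yield from buffer


if __name__ == "__main__":
    tokens = [f"t{i} " for i in range(50)]
    print("".join(gated_stream(tokens, lambda text: True)))
```

The trade-off is the one discussed earlier in the thread: the user sees nothing until the classifier returns, but the displayed text is guaranteed to be the same text that was checked.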