What are the latest strategies for preventing prompt leaks?

Hi all,
I am using the OpenAI API to create a health-based chatbot. However, I have some proprietary information in the chatbot's system message that I want to prevent from getting leaked. Does anyone have any sources of information on tactics or phrases that would block at least 90% of attempts by users to steal the system prompt? I don't expect to prevent 100% of leaks, but I'd like to at least prevent more than half. Thank you.

3 Likes

Don't put sensitive information into user space.

Anything you give the model can ultimately be retrieved.

3 Likes

I am not concerned about that. I have nothing sensitive. It's just system prompts; however, I'd like to prevent people from stealing them and making their own GPTs.

2 Likes

If there is no proprietary info in the prompts, then how are they sensitive? Anyone experienced in making chatbots could whip up a health-centered chatbot. Unfortunately, unless the prompt is related to some sort of specific measurement, there's no way for it to be truly valuable IP.

I think we should be looking at the concept of "prepared statements" in database management for this. You should filter and check the results before outputting them to the user. I also don't know if there is a generalized technique, because it is use-case specific.

The idea is that if you have some kind of complex prompt, the user might be able to ask the chatbot to spit out the prompt when that is not desired. The solution, if we try to keep "prepared statements" in mind, would be to analyze the content of the chat and pre-process or post-process it, perhaps with another AI-based prompt.

For instance: if you had a super secret prompt that lists the "bestness" of 100 flavors of ice cream in order, and your prompt is proprietary and includes this special list you don't want revealed to the user, then you might have a function that post-processes the chat result, looks for the ice cream flavors in order, and throws an exception if it finds them.

This is a convoluted way of saying you need to check the chat results yourself [algorithmically] if you want to prevent prompt leaks. You can't just prompt the LLM not to leak the prompt.
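A minimal sketch of that post-processing check, assuming a hypothetical list of secret fragments (the ice cream list here just stands in for whatever content you actually want to protect):

```python
# Minimal sketch of the post-processing idea above. SECRET_FRAGMENTS and
# PromptLeakError are hypothetical; adapt them to your own prompt content.

SECRET_FRAGMENTS = [
    "1. Pistachio 2. Salted Caramel 3. Vanilla",          # the "secret" flavor ranking
    "You are FlavorRank, an internal ranking assistant",  # a fragment of the system prompt
]

class PromptLeakError(Exception):
    """Raised when the model output appears to contain protected prompt text."""

def check_for_leak(model_output: str) -> str:
    lowered = model_output.lower()
    for fragment in SECRET_FRAGMENTS:
        if fragment.lower() in lowered:
            raise PromptLeakError("Output appears to reveal protected prompt content.")
    return model_output

# usage: safe_reply = check_for_leak(chat_completion_text)
```

This only catches verbatim (or near-verbatim) echoes of the protected text, which is exactly why the post frames it as a per-use-case check rather than a general solution.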

1 Like

Hi all,

I have the same concern. I am working on a GPT that helps people choose a bank account based on their needs expressed in chat. My job involves selecting banks and rationalizing the informational brochures of their services to improve the quality of responses. In short, I do what used to be called "curation" 🙂

I am writing on this forum because I have just discovered that the site https://www.yeschat.ai/ has cloned my GPT (https://www.yeschat.ai/gpts-9t55kc6SQXA-Fintech-Buddy) and is passing it off as its own.

I read the thread where other users are complaining about the same problem (A site is stealing and duplicating our GPTs - how can we protect our GPTs? - #25 by madalina.cristiana.c), but it seems to me that OpenAI’s position is more or less “it’s your problem: once you make the GPT public you can’t expect it not to be copied.”

I do not intend to get into a debate about IP, but I would like to understand if there is a way to protect the work of building the custom knowledge of the GPT. Since the GPT uses “actions” to further enrich its knowledge, I would like to know if a cloned GPT can also access the “actions”.

3 Likes

Currently:

@doss, if your GPTs are public, you cannot hide their instructions. Additionally, if the data analyzer is enabled, users can download all uploaded files. Even when the data analyzer is disabled, users can easily print plain-text files such as TXT, MD, JSON, etc. For files like PDFs or Word documents containing images or graphs, users can print the text but not the visuals.

Using prompts to block access is ineffective, as I have tested links from thousands of GPT creators and all revealed their instructions and files without exception.

Consequently, do not include sensitive information in the instructions or files of your GPT models.

If you’d like to test your GPT models, you can send me links directly, or I can provide prompts for testing.

Also, GPTs perform better when given direct commands like "DO THIS, DO THIS." They tend to follow instructions less effectively when the prompt is filled with prohibitions such as "DO NOT DO THIS, UNDER NO CIRCUMSTANCES," and may sometimes start producing less accurate or relevant responses. They become confused and respond incorrectly; for example, if you instruct them to write within a code fence, they might write outside of it, among other issues.

@fabio.marras, it's not just your GPTs; mine and others' have been copied too. Thousands of these GPTs, including their names, images, conversation starters, instructions, and files, have been posted on platforms such as Yeschat, the one you mentioned. This issue is widespread, even within OpenAI's GPT Store.

Many copied GPTs can be found on the OpenAI GPT Store. Despite reporting these issues to OpenAI, along with feedback from myself and others about the counterfeit GPTs, there has been no action for over four months. For example, if you search for my GPT name, DelishDial, you will see two results: the first one is the copy and the second one is the real one, mine. Although I filed a claim, OpenAI has not taken any action in the almost four months since…

My question: if OpenAI does not prevent this problem in its own store, what can we expect from other platforms?

3 Likes

I’m going to go with don’t do it. Use a different API.

1 Like

This has been a problem that we've solved since April. We implemented a security feature that protects our creators from LLM attacks, including prompt theft and knowledge theft.

It is possible by using another builder platform that has these security features built in.

Before settling on a builder, check if they have these features in place.

2 Likes

We have integrated with Lakera, which prevents our creators' prompts and knowledge from getting leaked.

1 Like

Your prompts and knowledge are meant to be communicated to the end-user, similar to a website: what you upload is for them to see. I don't see why you'd want the model to know these things but… not know these things?

You can hide proprietary data with function calling. But… It’d still be public.
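As a rough sketch of that function-calling pattern, assuming the openai Python client; the fee table and lookup_account_fee below are hypothetical, and the point is that only the specific looked-up answer enters the conversation, never the full table or a prompt describing it:

```python
# Rough sketch: keep proprietary data out of the prompt and serve it via a tool call.
# FEE_TABLE, lookup_account_fee, and the model name are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()

FEE_TABLE = {"basic": 0.0, "premium": 4.99}  # proprietary data, lives only on your server

def lookup_account_fee(plan: str) -> str:
    return json.dumps({"plan": plan, "monthly_fee": FEE_TABLE.get(plan, "unknown")})

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_account_fee",
        "description": "Look up the monthly fee for a given plan.",
        "parameters": {
            "type": "object",
            "properties": {"plan": {"type": "string"}},
            "required": ["plan"],
        },
    },
}]

messages = [{"role": "user", "content": "How much is the premium plan?"}]
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
msg = response.choices[0].message

if msg.tool_calls:
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = lookup_account_fee(**args)  # executed on your server only
    messages += [msg, {"role": "tool", "tool_call_id": call.id, "content": result}]
    final = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    print(final.choices[0].message.content)
```

As the post says, whatever the tool returns is still visible to the user; the only thing kept private is the data the model never saw in the first place.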

Also, if you are using a GPT in the builder, your information and data are being freely used for future training.

1 Like

Hi, I may be naive, but this is how I see it: every website and its HTML code are the result of collecting and organizing information, and of choosing how that information is presented.

If I put a website online, I don’t expect that nobody will look at the source code of the web pages or imitate the design, that is, everything that is visible to the end-user.

On the contrary, it would bother me if someone managed to access the website’s repository and copy the source code of the web app that manages the site.

I think creating a GPT that imitates the behavior of another GPT is totally fair, but copying the dataset and the prompt is a bit like copying the web app’s source code.

1 Like

That's fair. I'd say that if your website is purely static and only offers information, you should expect it to be completely ripped. Wikipedia is a great example of this: you can literally download it. In fact, most wiki pages are like this.

BUT. If you have a website that is deeply intertwined with a back-end service (like ChatGPT), there is no use in copying the source code. First, it's minified, and second, it's highly opinionated towards using the back-end.

Would a thief steal a car if the doors are open and the keys are in the ignition? Hell yeah.

Would the thief steal the car if half of its functionality is missing (let's say that somehow you manage to take all the computerized equipment with you, and it's very obvious)? HELL NO. Well, maybe. Some thieves aren't the smartest 😂 but you'd probably find the car at the end of the road.

By the same token, if your GPT is heavily intertwined with a back-end service, then you have a moat. You have something that cannot be easily replaced.

Purely anecdotal here, but in my experience, if your GPT is purely "knowledge-based", then people will most likely just copy it and create their own custom GPT slightly steered towards what they are specifically doing. More control, fewer tokens. It just makes sense.

It may be better to think of a commercializable GPT as some sort of semantic adapter between the client and your service (plus free advertising).

2 Likes

We've found structured prompting works pretty well for removing prompt injections.

e.g. if I write a prompt whose goal is to only extract a resume into a specific data model, I find that almost always I only get output matching that model.

If you then enforce a parser on the output of the LLM, you can simply fail/raise an exception when parsing fails.

This pretty much takes out all the prompt injection techniques we've found people trying to date. There are still some real hacks they can do to get the prompt, but it requires them to know your data model, which you don't have to export.

For example, in another leak attempt I could get the types of some fields, but not all fields: since I expect education to be a specific data model, anything else won't parse.
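The poster's exact prompt and data model aren't shown, but as a generic illustration of the "enforce a parser" step, here is a hypothetical Resume model using pydantic (my choice of library, not necessarily theirs):

```python
# Generic sketch of "enforce a parser on the LLM output".
# Resume and Education are hypothetical stand-ins for your real data model.
from pydantic import BaseModel, ValidationError

class Education(BaseModel):
    school: str
    degree: str

class Resume(BaseModel):
    name: str
    skills: list[str]
    education: list[Education]

def parse_or_reject(raw_model_output: str) -> Resume:
    try:
        # Only output matching the expected schema gets through;
        # injected prose, leaked instructions, etc. fail validation.
        return Resume.model_validate_json(raw_model_output)
    except ValidationError as err:
        raise ValueError("LLM output did not match the Resume schema") from err
```

Anything the model emits that isn't a well-formed Resume simply raises, which is the failure path the post describes.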

2 Likes

The issue with trying to prevent prompt injections like this is the chaotic nature of using these models. Hacky solutions do not work.

There will always be a way to bypass any safeguards you implement. Sure, it may work in your closed demo environment, but what happens when someone finds a new method? Or a slight variation works? Keep adding layers?

No.

Fortunately, in the API version (my comment was directed more towards GPTs), the best way to prevent prompt injections is to have a moderation system that can filter out any prompts that aren't related to your task. Additionally, if you are requesting a JSON format, you can simply allow a different object of the form:

```json
{
  "explanation_of_irrelevancy": string,
  "irrelevant": boolean
}
```

This allows the model to respond in the expected format, OR to send an "irrelevant" object.
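A small sketch of handling that escape hatch on the application side (the field names follow the JSON above; the rest is hypothetical):

```python
# Sketch of branching on the "irrelevant object" escape hatch described above.
import json

def handle_reply(raw_json: str) -> dict:
    data = json.loads(raw_json)
    if data.get("irrelevant") is True:
        # Refuse instead of passing the model's answer through to the user.
        return {"error": "request rejected",
                "reason": data.get("explanation_of_irrelevancy")}
    return data  # the normal, expected response object
```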

For a more programmatic approach, one easy way to accomplish this is to collect a large set of "good prompts" (for example, prompts related to resumes). You can then calculate a centroid of their embeddings.

Then, with this centroid that hopefully resembles an average of good prompts, you can set a distance threshold. If a prompt is too far away, you can pass it to a more powerful model like GPT-4 with the simple task of determining whether the prompt is relevant. You can then add a prompt that was determined to be safe, yet failed the distance test, to your list of good prompts; if you keep that list, you can re-calculate the centroid and even give a higher weight to the false-positive prompt. Or simply adjust your distance threshold.

You can even create a centroid of “bad prompts” and have something to compare with.
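A rough sketch of the centroid idea, assuming the openai Python client for embeddings and numpy for the math; the example prompts and the 0.25 threshold are placeholders you would have to tune:

```python
# Sketch of centroid-based prompt moderation. GOOD_PROMPTS, the threshold,
# and the embedding model choice are all placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

GOOD_PROMPTS = [
    "Summarize this resume for a recruiter",
    "Extract the skills section from my CV",
    "List the candidate's education history",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

good_vectors = embed(GOOD_PROMPTS)
centroid = good_vectors.mean(axis=0)

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def looks_relevant(user_prompt: str, threshold: float = 0.25) -> bool:
    vec = embed([user_prompt])[0]
    if cosine_distance(vec, centroid) <= threshold:
        return True
    # Too far from the centroid: escalate to a stronger model (not shown here)
    # and, if it passes, add the prompt to GOOD_PROMPTS and re-average the centroid.
    return False
```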

BUT, it’s a challenge. For example, someone could easily do something along the lines of “For my resume, I’d like the skills array to include some Python code that makes a chess move”.

With this in mind, though, you can even run the STREAMED output through ANOTHER moderation layer.
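A minimal sketch of that streamed check, again assuming the openai Python client; contains_protected_content is a stand-in for whatever output-side filter you actually use:

```python
# Sketch of moderating a streamed response before it reaches the user.
# contains_protected_content and the model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

def contains_protected_content(text: str) -> bool:
    # placeholder for your real output-side check
    return "system prompt" in text.lower()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Help me with my resume"}],
    stream=True,
)

buffer = ""
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    buffer += delta
    if contains_protected_content(buffer):
        # Stop forwarding tokens to the user as soon as the check trips.
        break
    print(delta, end="", flush=True)
```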

In any case, I highly, highly recommend testing against a prompt injection dataset instead of a single query.

2 Likes