API to Prevent Prompt Injection & Jailbreaks

curt.kennedy · November 28, 2023, 7:14pm

You would have a set of predefined inputs allowed for use in the LLM.

You embed each of these.

Then take the incoming user request, embed this, correlate to one of your predefined, and then send your predefined prompt to the LLM.

This is the complete isolation case. And “100% safe” but “100% boring”.

DavidOS366 · May 9, 2024, 10:08pm

We have a simple requirement. There is a user input box of free text. The space is worth 100 characters for users to type in the dish they are having. Now, we are not saying our prompt is unique and no one can come up with the same prompt, well everyone can. However, our problem is what happens if in that box someone types something else, other than the requested food item. Some thing like

“What is the weather going to be?” can be easily typed in that box, when it goes to the prompt, and the prompt is looking for a dish name, there is none. How to detect and avoid this?

PaulBellow · May 9, 2024, 10:24pm

You can use a 5-shot or 10-shot to train it on appropriate requests then use a small model like Ada that’s fast/cheap. Run the query against that and have it trained to send back yes or no. And if no, send a custom message back to user saying to stick on topic or something.

DavidOS366 · May 9, 2024, 11:13pm

Okay.

How about this, Can we create a custom GPT on our requirements, and use it through the APIs?.

N2U · May 9, 2024, 11:36pm

Nope, GPT’s are only available through chatGPT, but you can use the assistants API to do exactly the same (and more)

DavidOS366 · May 10, 2024, 12:37am

Interesting.

So we first create a new assistant, then initiate a thread and then send a message. We can ask for it to return a json object, which we can parse and send it back to UI. Seems doable. We only have to try and see. Thank you so much.

N2U · May 10, 2024, 1:49am

Yep, that’s basically it!

Always happy to help

DavidOS366 · May 10, 2024, 3:38am

The assistant is mind blowing.It looks good in playground

We are probably missing something here, but we created a set of wrapper functions on the assistant, and threads api. In run_thread return object we are not able to find the output.

For an input like “French Toast” it should return {“Calories”:“200-350”}

def post_message_to_thread(thread_id, message):
    thread_message = client.beta.threads.messages.create(
        thread_id,
        role=constants.TEXT_MODEL_ROLE_USER,
        content=message,
    )
    print(thread_message)


def run_thread(thread_id, assistant_id):
    run = client.beta.threads.runs.create(
        thread_id=thread_id,
        assistant_id=assistant_id
    )
    print(run)


post_message_to_thread(thread_id,"French Toast")
run_thread(thread_id,assistant_id)

DavidOS366 · May 10, 2024, 12:17pm

YaY! We finally got it. Here is the missing piece.

def fetch_run_output(thread_id, run_id):
    extracted_value = None
    while True:
        run_status = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run_id)
        if run_status.status in ['completed', 'failed', 'cancelled']:
            print("Run status:", run_status.status)
            break
        print("Run still processing...")
        time.sleep(2)

    if run_status.status == 'completed':
        messages = client.beta.threads.messages.list(thread_id=thread_id)
        for message in messages.data:
            print("Message from:", message.role)
            if message.role == 'assistant' and message.content:  # Assuming system messages contain the data
                for content_block in message.content:
                    if content_block.type == 'text':
                        # Correctly accessing the 'value' from the nested structure
                        extracted_value = content_block.text.value
                        print("Extracted Content Value:", extracted_value)
                        break
    else:
        print("Run did not complete successfully:", run_status.last_error)
        return None

    return extracted_value

Console Output:

[TextContentBlock(text=Text(annotations=, value=‘French Toast’), type=‘text’)]
Run still processing…
Run status: completed
Message from: assistant
Extracted Content Value: {“Calories”:“126-154”}
Message from: user
Assistant deleted: AssistantDeleted(id=‘asst_’, deleted=True, object=‘assistant.deleted’)

Basically, here are the steps

Create an Assistant. Capture the assistant_id from here.
Create a thread. Capture the thread_id from here.
Now create a Message for the thread_id. This is where the input from the user will come into place. For us it was a dish name. Eg. French Toast.
Now, create a Run for the thread_id and assistant_id. Capture run_id.
Now you have to check for the run_status against that run_id, and only when the run_status is completed, that’s when the task is complete and output is viable. See method above.
Delete the assistant after work is done. This is optional, but works for us.

Topic		Replies	Views
Assistants: Async tool submissions API tool , assistants-api	58	1588	August 16, 2024
How to prevent ChatGPT from answering questions that are outside the scope of the provided context in the SYSTEM role message? API	53	180217	December 2, 2023
Assistant API message retrieval. Customise the maximum number of messages AI return? API chatgpt , assistants-api	14	3771	November 17, 2023
Chat completion api tool call loops API api , tools	15	1575	August 6, 2024
LLM forgetting part of my prompt with too much data Prompting chatgpt , prompt	17	10975	May 25, 2024

API to Prevent Prompt Injection & Jailbreaks

Related topics