API to Prevent Prompt Injection & Jailbreaks

You would have a set of predefined inputs allowed for use in the LLM.

You embed each of these.

Then take the incoming user request, embed this, correlate to one of your predefined, and then send your predefined prompt to the LLM.

This is the complete isolation case. And “100% safe” but “100% boring”.

We have a simple requirement. There is a user input box of free text. The space is worth 100 characters for users to type in the dish they are having. Now, we are not saying our prompt is unique and no one can come up with the same prompt, well everyone can. However, our problem is what happens if in that box someone types something else, other than the requested food item. Some thing like

“What is the weather going to be?” can be easily typed in that box, when it goes to the prompt, and the prompt is looking for a dish name, there is none. How to detect and avoid this?

You can use a 5-shot or 10-shot to train it on appropriate requests then use a small model like Ada that’s fast/cheap. Run the query against that and have it trained to send back yes or no. And if no, send a custom message back to user saying to stick on topic or something.

Okay.

How about this, Can we create a custom GPT on our requirements, and use it through the APIs?.

Nope, GPT’s are only available through chatGPT, but you can use the assistants API to do exactly the same (and more) :laughing:

Interesting.

So we first create a new assistant, then initiate a thread and then send a message. We can ask for it to return a json object, which we can parse and send it back to UI. Seems doable. We only have to try and see. Thank you so much.

1 Like

Yep, that’s basically it!

Always happy to help :laughing:

The assistant is mind blowing.It looks good in playground

We are probably missing something here, but we created a set of wrapper functions on the assistant, and threads api. In run_thread return object we are not able to find the output.

For an input like “French Toast” it should return {“Calories”:“200-350”}

def post_message_to_thread(thread_id, message):
    thread_message = client.beta.threads.messages.create(
        thread_id,
        role=constants.TEXT_MODEL_ROLE_USER,
        content=message,
    )
    print(thread_message)


def run_thread(thread_id, assistant_id):
    run = client.beta.threads.runs.create(
        thread_id=thread_id,
        assistant_id=assistant_id
    )
    print(run)


post_message_to_thread(thread_id,"French Toast")
run_thread(thread_id,assistant_id)

YaY! We finally got it. Here is the missing piece.

def fetch_run_output(thread_id, run_id):
    extracted_value = None
    while True:
        run_status = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run_id)
        if run_status.status in ['completed', 'failed', 'cancelled']:
            print("Run status:", run_status.status)
            break
        print("Run still processing...")
        time.sleep(2)

    if run_status.status == 'completed':
        messages = client.beta.threads.messages.list(thread_id=thread_id)
        for message in messages.data:
            print("Message from:", message.role)
            if message.role == 'assistant' and message.content:  # Assuming system messages contain the data
                for content_block in message.content:
                    if content_block.type == 'text':
                        # Correctly accessing the 'value' from the nested structure
                        extracted_value = content_block.text.value
                        print("Extracted Content Value:", extracted_value)
                        break
    else:
        print("Run did not complete successfully:", run_status.last_error)
        return None

    return extracted_value

Console Output:

[TextContentBlock(text=Text(annotations=, value=‘French Toast’), type=‘text’)]
Run still processing…
Run status: completed
Message from: assistant
Extracted Content Value: {“Calories”:“126-154”}
Message from: user
Assistant deleted: AssistantDeleted(id=‘asst_’, deleted=True, object=‘assistant.deleted’)

Basically, here are the steps

  1. Create an Assistant. Capture the assistant_id from here.
  2. Create a thread. Capture the thread_id from here.
  3. Now create a Message for the thread_id. This is where the input from the user will come into place. For us it was a dish name. Eg. French Toast.
  4. Now, create a Run for the thread_id and assistant_id. Capture run_id.
  5. Now you have to check for the run_status against that run_id, and only when the run_status is completed, that’s when the task is complete and output is viable. See method above.
  6. Delete the assistant after work is done. This is optional, but works for us.
2 Likes