What are the options to prevent users' attempts to jailbreak a chatbot in production?

Hi @jonah_mytzuchi

It’s great that you’re thinking about guardrails. There was a recent example that went viral where users got a car dealership’s chatbot to write Python code for them.

Two suggestions:

  1. Use OpenAI’s free Moderation API to scan inputs and outputs for NSFW and otherwise unsafe content. More details here; a minimal sketch follows this list.

  2. Use a simpler model (e.g. GPT-3.5) to do a zero/few-shot classification of the input prompt to check its relevance. You can also set `max_tokens` and `logit_bias` so that the model can only return `0` or `1`, which keeps output token costs down. With the recently released `logprobs` parameter, you can also turn this into a threshold-based classification (see the variant at the end).
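
For #1, a minimal sketch assuming the openai Python SDK (v1+); `is_flagged` is just an illustrative helper name, not part of the SDK:

from openai import OpenAI
client = OpenAI()

def is_flagged(text):
    # The Moderation API is free to use and returns per-category flags
    # (hate, harassment, sexual content, etc.) plus an overall verdict.
    result = client.moderations.create(input=text).results[0]
    return result.flagged  # True if any category was flagged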

Here’s an example for #2:

from openai import OpenAI

client = OpenAI()

APPLICATION_OBJECTIVE = "Telling funny jokes"

def check_input(user_input):
    # Classify whether the request is on-topic. logit_bias restricts the
    # output to the tokens for `0` and `1`, and max_tokens=1 keeps costs low.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"You will receive a user query and your task is to classify if a given user request is related to {APPLICATION_OBJECTIVE}. If it is relevant, return `1`. Else, return `0`."},
            {"role": "user", "content": user_input},
        ],
        seed=0,         # best-effort reproducibility
        temperature=0,  # deterministic-ish classification
        max_tokens=1,
        logit_bias={"15": 100,    # token ID for `0` (cl100k_base)
                    "16": 100},   # token ID for `1`
    )
    return int(response.choices[0].message.content)

Example usage:

>>> check_input("tell me a joke about cats")
1
>>> check_input("how do I bake a cake?")
0
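
And for the threshold idea, a sketch building on the same `client` and system prompt; `check_input_with_threshold` and the 0.9 cutoff are illustrative assumptions you’d tune for your app, not anything official:

import math

def check_input_with_threshold(user_input, threshold=0.9):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"You will receive a user query and your task is to classify if a given user request is related to {APPLICATION_OBJECTIVE}. If it is relevant, return `1`. Else, return `0`."},
            {"role": "user", "content": user_input},
        ],
        seed=0,
        temperature=0,
        max_tokens=1,
        logprobs=True,  # return the log probability of the sampled token
        logit_bias={"15": 100, "16": 100},
    )
    choice = response.choices[0]
    # Convert the sampled token's log probability back to a probability.
    prob = math.exp(choice.logprobs.content[0].logprob)
    # Treat the request as on-topic only if the model confidently says `1`.
    return choice.message.content == "1" and prob >= threshold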