What are the options to prevent users' attempts to jailbreak a chatbot in production?

Hi @jonah_mytzuchi

It’s great that you’re thinking about guardrails. There was a recent example that went viral where users got a car dealership’s chatbot to write Python code for them.

Two suggestions:

  1. Use OpenAI’s free Moderation API to scan inputs and outputs for NSFW and otherwise unsafe content. More details here; a minimal sketch follows this list.

  2. Use a simpler model (e.g. GPT-3.5) to do a zero/few-shot classification of the input prompt to check its relevance. You can also set `max_tokens` and `logit_bias` so that the model can only return `0` or `1`, which keeps output token costs down. With the recently released `logprobs` parameter, you can also turn this into a threshold-based classification (see the variant at the end).
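
For #1, a minimal sketch assuming the openai Python SDK (v1+); `is_flagged` is just an illustrative helper name, not part of the SDK:

from openai import OpenAI
client = OpenAI()

def is_flagged(text):
    # The Moderation API is free to use and returns per-category flags
    # (hate, harassment, sexual content, etc.) plus an overall verdict.
    result = client.moderations.create(input=text).results[0]
    return result.flagged  # True if any category was flagged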

Here’s an example for #2:

from openai import OpenAI

client = OpenAI()

APPLICATION_OBJECTIVE = "Telling funny jokes"

def check_input(user_input):
    # Classify whether the request is on-topic. logit_bias restricts the
    # output to the tokens for `0` and `1`, and max_tokens=1 keeps costs low.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"You will receive a user query and your task is to classify if a given user request is related to {APPLICATION_OBJECTIVE}. If it is relevant, return `1`. Else, return `0`."},
            {"role": "user", "content": user_input},
        ],
        seed=0,         # best-effort reproducibility
        temperature=0,  # deterministic-ish classification
        max_tokens=1,
        logit_bias={"15": 100,    # token ID for `0` (cl100k_base)
                    "16": 100},   # token ID for `1`
    )
    return int(response.choices[0].message.content)

Example usage:

>>> check_input("tell me a joke about cats")
1
>>> check_input("how do I bake a cake?")
0
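
And for the threshold idea, a sketch building on the same `client` and system prompt; `check_input_with_threshold` and the 0.9 cutoff are illustrative assumptions you’d tune for your app, not anything official:

import math

def check_input_with_threshold(user_input, threshold=0.9):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"You will receive a user query and your task is to classify if a given user request is related to {APPLICATION_OBJECTIVE}. If it is relevant, return `1`. Else, return `0`."},
            {"role": "user", "content": user_input},
        ],
        seed=0,
        temperature=0,
        max_tokens=1,
        logprobs=True,  # return the log probability of the sampled token
        logit_bias={"15": 100, "16": 100},
    )
    choice = response.choices[0]
    # Convert the sampled token's log probability back to a probability.
    prob = math.exp(choice.logprobs.content[0].logprob)
    # Treat the request as on-topic only if the model confidently says `1`.
    return choice.message.content == "1" and prob >= threshold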