What are the options to prevent users' attempts to jailbreak a chatbot in production?

What measures do you use to prevent users from bypassing the actual service of your chatbot/assistant and using it as a free GPT?

I suppose the initial instruction alone is not enough. I am thinking about having another Chat Completion agent screen every request, but that's kind of costly.

Hi @jonah_mytzuchi

It’s great that you’re thinking about guardrails. There was a recent example where users started asking a car dealership’s chatbot to write Python code, and that went viral.

Two suggestions:

  1. Use OpenAI’s free Moderation API to scan the inputs and outputs for NSFW content. More details here.

  2. Use a simpler model (e.g. GPT-3.5) to do zero-shot or few-shot classification of the input prompt to check its relevance. You can also set max_tokens and logit_bias so that the model returns only a 0 or a 1, which reduces the output token cost. With the recent release of logprobs, you can also do this classification based on a threshold.

Here’s an example for #2:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

APPLICATION_OBJECTIVE = "Telling funny jokes"

def check_input(user_input):
    """Return 1 if the request is relevant to APPLICATION_OBJECTIVE, else 0."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"You will receive a user query and your task is to classify if a given user request is related to {APPLICATION_OBJECTIVE}. If it is relevant, return `1`. Else, return `0`"},
            {"role": "user", "content": user_input},
        ],
        max_tokens=1,  # a single token is enough for `0` or `1`
        temperature=0,
        logit_bias={"15": 100,   # token ID for `0`
                    "16": 100},  # token ID for `1`
    )
    return int(response.choices[0].message.content)

Example usage:

>>> check_input("tell me a joke about cats")
>>> check_input("how do I bake a cake?")
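The threshold-based variant mentioned in suggestion 2 could look roughly like the sketch below. It is only an illustration under assumptions: the helper names (`relevance_probability`, `is_relevant`, `fetch_label_logprobs`), the system prompt, and the 0.8 threshold are all made up for this example, and `logit_bias` is left out because I am not sure how it interacts with the returned log probabilities. The probability of `1` is renormalised over the two labels so that stray tokens do not skew the decision.

```python
import math


def relevance_probability(top_logprobs: dict[str, float]) -> float:
    """Return P("1") renormalised over the two labels "0" and "1"."""
    p_yes = math.exp(top_logprobs.get("1", float("-inf")))
    p_no = math.exp(top_logprobs.get("0", float("-inf")))
    total = p_yes + p_no
    # If neither label appears among the top tokens, treat the input as irrelevant.
    return p_yes / total if total > 0 else 0.0


def is_relevant(top_logprobs: dict[str, float], threshold: float = 0.8) -> bool:
    """Accept the request only if the model is confident enough that it is on topic."""
    return relevance_probability(top_logprobs) >= threshold


def fetch_label_logprobs(user_input: str, objective: str,
                         model: str = "gpt-3.5-turbo") -> dict[str, float]:
    """Hypothetical helper: ask the classifier and return {token: logprob} for the first token."""
    from openai import OpenAI  # imported lazily so the helpers above work offline

    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": f"Classify whether the user request is related to {objective}. "
                                          "If it is relevant, return `1`. Else, return `0`"},
            {"role": "user", "content": user_input},
        ],
        max_tokens=1,
        temperature=0,
        logprobs=True,
        top_logprobs=2,  # the two most likely candidates for the single output token
    )
    first_token = response.choices[0].logprobs.content[0]
    return {t.token: t.logprob for t in first_token.top_logprobs}
```

Usage would be along the lines of `is_relevant(fetch_label_logprobs("tell me a joke about cats", APPLICATION_OBJECTIVE))`. A higher threshold rejects more borderline requests; the right value depends on how costly a false accept is for your application.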

I guess it depends on what you’re trying to do. In many cases you have a mixture of systems, and sometimes the user can’t even interact with the LLMs directly.

This is just my personal opinion, but if your margin is so low that you can’t afford the guardrails you need, maybe you need to reconsider your business proposition?

My attitude towards all this is if the user is talking directly to an LLM, they’re gonna be able to break it given enough time. As such, I decline work on public facing chatbots, because clients tend to not understand the risks.

But LLMs can offer so much more than acting as chatbot frontends. It’s almost a waste using them for this :confused:


Ya, I am already using the moderation api.

Ya. This is in my mind too. Thank you for the code example.

Guardrails may be an option as well


Totally! Tons of users treat jailbreaking LLMs as a hobby.

Leveraging an OpenAPI Schema as an Action and using it as middleware can help, though this does increase token count and response time.

There are some other threads on here around Instructions and Knowledge as well.

That said, I have found it more productive to set the expectation with clients that public GPTs are public, and that any instructions, knowledge, and actions you add can be read in full by users. This helps to reframe the conversation and rethink what the GPT can do.

This will not help with Security and Privacy intensive use cases. But I don’t think GPTs are ready for those use cases currently anyway.

If a client (or you) has ChatGPT Enterprise, that can help by restricting use to just the organization that has access to the GPT.


I took @cyzgab's advice and paired the Moderation module offered by OpenAI with a chat completion call to screen the prompt.

The results I get are quite promising: my solution was able to resist all of the DAN jailbreak prompts I tried. This approach also prevents toxic prompts from polluting my main model. Therefore, I think the higher cost is reasonable.
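A minimal sketch of this kind of two-stage screen, as an illustration rather than the exact setup described above. The function names (`screen_prompt`, `openai_is_flagged`) and the reason strings are assumptions made for this example. The checks are passed in as callables so the ordering logic can be tried without network access; the free moderation check runs first, so the paid classifier call is skipped for anything already flagged.

```python
from typing import Callable


def screen_prompt(
    user_input: str,
    is_flagged: Callable[[str], bool],
    is_relevant: Callable[[str], bool],
) -> tuple[bool, str]:
    """Return (allowed, reason). Only allowed prompts should reach the main model."""
    # Stage 1: moderation check (free), run first to avoid paying for the classifier.
    if is_flagged(user_input):
        return False, "flagged"
    # Stage 2: relevance classifier (one short chat completion per request).
    if not is_relevant(user_input):
        return False, "off_topic"
    return True, "ok"


def openai_is_flagged(text: str) -> bool:
    """Moderation check backed by OpenAI's moderation endpoint."""
    from openai import OpenAI  # imported lazily so screen_prompt can be tested offline

    client = OpenAI()
    result = client.moderations.create(input=text)
    return result.results[0].flagged
```

In production this could be wired up as `screen_prompt(text, openai_is_flagged, lambda t: check_input(t) == 1)`, reusing the `check_input` classifier from earlier in the thread.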