Building complex guardrails

ziqizhang · February 2, 2025, 11:13am

I am in the process of building a chat bot that specialises in facilitating chats between two people in a workplace context. For example, imagine that two employees are working on a company project and they may chat about needs various information, some of which reside in the company’s knowledge base (e.g., staff handbook, process of setting up an Azure account for an employee, where to find certain datasets etc). The goal of the chat bot is to

monitor the conversation between the two human users
understand when the users need certain information, performs search in the KG and provide suggestions (e.g., it looks like you are looking for X, here is what I found…)
moderate the chat to prevent unwanted discussion (e.g., breach of company rules like sharing customer information, ‘jailbreaking’ etc)

My question is more on item 3, which sounds like a guardrail. While I know you can build guardrail with the open ai API (How to implement LLM guardrails | OpenAI Cookbook) and third party apis like GuardrailAI, the case I am dealing with seems more complex: it is not catching one single message and stop it, but requires constant monitoring the chat, analysing it, and step in when needed.

I wonder if anyone worked on something similar before and can share your experience? I was thinking along the setup like

a main bot responsible for chatting with the users
another ‘agent’ that acts as a ‘moderator’ and when it detected a case it should step in, informs the chat bot and inject additional prompts dynamically, e.g., ‘your user said X, they should not do Y. Reply accordingly’

Does this work?

Many thanks!

platypus · February 2, 2025, 5:18pm

Hi @ziqizhang !

Great question, and in general this is a very active and developing area. I have managed in the past to bypass various LLMs’ “guardrails” by inserting strange characters in the text, or talking to an LLM in Morse code or hex (Ascii) codes. So there is always a way to get around it all, if using a direct approach with an LLM.

My conclusion is that you essentially need to engineer your own defense/moderation layer, that is not just a pure LLM with a prompt. I.e. one or more of the following:

Language identifier/classifier (can use one of the pre-made models that are fast and easily available online like langdetect) - this so you constrain the conversation only to select languages (like English), and dont allow hex or other craziness
Pre-defined filter of keywords - certain keywords that definitely should not be allowed in the chat
Identification of strange character sequences - sometimes certain strange character sequences when combined in prefix or postfix of words can get the LLM to misbehave or surrender control
Moderation classifier (binary classifier) that you either use off the shelf, or build yourself

On top of these you also want to put some written guardrails in the system prompt itself.

And of course, get people to test it out, try and jailbreak your system!

Topic		Replies	Views
What are the options to prevent user's attempt to jailbreak chatbot in production? API moderation , development	7	4482	January 4, 2024
Building chatbot that needs to respond to user messages that are censored API	7	199	June 10, 2025
How to prevent malicious questions / jailbreak prompts / prompt injection attacks when using API GPT3.5 API	5	4675	March 6, 2023
How to ringfence a chat bot API	3	2055	December 27, 2022
My ai got injected and it looks bad Prompting gpt-4 , chatgpt , api	2	558	April 8, 2024

Building complex guardrails

Related topics