Building complex guardrails

I am in the process of building a chat bot that specialises in facilitating chats between two people in a workplace context. For example, imagine that two employees are working on a company project and they may chat about needs various information, some of which reside in the company’s knowledge base (e.g., staff handbook, process of setting up an Azure account for an employee, where to find certain datasets etc). The goal of the chat bot is to

  • monitor the conversation between the two human users
  • understand when the users need certain information, performs search in the KG and provide suggestions (e.g., it looks like you are looking for X, here is what I found…)
  • moderate the chat to prevent unwanted discussion (e.g., breach of company rules like sharing customer information, ‘jailbreaking’ etc)

My question is more on item 3, which sounds like a guardrail. While I know you can build guardrail with the open ai API (How to implement LLM guardrails | OpenAI Cookbook) and third party apis like GuardrailAI, the case I am dealing with seems more complex: it is not catching one single message and stop it, but requires constant monitoring the chat, analysing it, and step in when needed.

I wonder if anyone worked on something similar before and can share your experience? I was thinking along the setup like

  • a main bot responsible for chatting with the users
  • another ‘agent’ that acts as a ‘moderator’ and when it detected a case it should step in, informs the chat bot and inject additional prompts dynamically, e.g., ‘your user said X, they should not do Y. Reply accordingly’

Does this work?

Many thanks!

2 Likes

Hi @ziqizhang !

Great question, and in general this is a very active and developing area. I have managed in the past to bypass various LLMs’ “guardrails” by inserting strange characters in the text, or talking to an LLM in Morse code or hex (Ascii) codes. So there is always a way to get around it all, if using a direct approach with an LLM.

My conclusion is that you essentially need to engineer your own defense/moderation layer, that is not just a pure LLM with a prompt. I.e. one or more of the following:

  • Language identifier/classifier (can use one of the pre-made models that are fast and easily available online like langdetect) - this so you constrain the conversation only to select languages (like English), and dont allow hex or other craziness
  • Pre-defined filter of keywords - certain keywords that definitely should not be allowed in the chat
  • Identification of strange character sequences - sometimes certain strange character sequences when combined in prefix or postfix of words can get the LLM to misbehave or surrender control
  • Moderation classifier (binary classifier) that you either use off the shelf, or build yourself

On top of these you also want to put some written guardrails in the system prompt itself.

And of course, get people to test it out, try and jailbreak your system!

3 Likes