It’s great that you’re thinking about guardrails. There was a recent example that went viral where users got a car dealership’s chatbot to write Python code.
Two suggestions:

- Use OpenAI’s free moderation API to scan the inputs and outputs for NSFW content. More details here. (A minimal sketch follows this list.)
- Use a simpler model (e.g., GPT-3.5) to do a zero/few-shot classification of the input prompt to check its relevance. You can also set `max_tokens` and `logit_bias` so that the model only returns a `0` or `1`, which also reduces the output token cost. With the latest release of `logprobs`, you can do this classification against a probability threshold instead (sketched at the end).
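Here’s a minimal sketch for #1, assuming the `openai` v1 Python SDK; the helper name `is_flagged` is my own, not from the API:

```python
from openai import OpenAI

client = OpenAI()

def is_flagged(text: str) -> bool:
    """Return True if the free moderation endpoint flags the text."""
    response = client.moderations.create(input=text)  # no extra cost
    return response.results[0].flagged

# Run this on the user input before your main call, and on the model
# output before returning it to the user.
```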
Here’s an example for #2:
```python
from openai import OpenAI

client = OpenAI()

APPLICATION_OBJECTIVE = "Telling funny jokes"

def check_input(user_input):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"You will receive a user query and your task is to classify if a given user request is related to {APPLICATION_OBJECTIVE}. If it is relevant, return `1`. Else, return `0`"},
            {"role": "user", "content": user_input},
        ],
        seed=0,
        temperature=0,
        max_tokens=1,  # one token is enough for the label
        logit_bias={
            "15": 100,  # token ID for `0`
            "16": 100,  # token ID for `1`
        },
    )
    return int(response.choices[0].message.content)
```
Example usage:

```python
>>> check_input("tell me a joke about cats")
1
>>> check_input("how do I bake a cake?")
0
```
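And a sketch of the `logprobs` variant mentioned above. The threshold value and the helper name `check_input_with_threshold` are my own choices, not part of the API:

```python
import math

def check_input_with_threshold(user_input, threshold=0.8):
    """Return 1 only if the model puts at least `threshold`
    probability mass on the `1` label; otherwise return 0."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"You will receive a user query and your task is to classify if a given user request is related to {APPLICATION_OBJECTIVE}. If it is relevant, return `1`. Else, return `0`"},
            {"role": "user", "content": user_input},
        ],
        seed=0,
        temperature=0,
        max_tokens=1,
        logit_bias={"15": 100, "16": 100},
        logprobs=True,    # return log probabilities of the output tokens
        top_logprobs=2,   # plus the 2 most likely alternatives per position
    )
    top = response.choices[0].logprobs.content[0].top_logprobs
    probs = {t.token: math.exp(t.logprob) for t in top}  # logprob -> probability
    return int(probs.get("1", 0.0) >= threshold)
```

This way, borderline prompts get treated as irrelevant instead of trusting a near coin-flip label.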